THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Merrill Aldrich

Random Join, STAT

I've been tinkering with a little puzzle, mainly for fun: suppose you need to generate a large number of "mock" names either for generating test data or for masking data that's restored from a production system into a test environment. For this task I have a set of tables from the US Census web site, that contain, each separately, the most common first and last names of men and women from the 1990 census. So here's the puzzle: how can I generate 1,000,000 random first and last name combinations from those tables, in a reasonable time.

I loaded the lists into SQL Server into simple tables:

CREATE TABLE [dbo].[femalefirstnames](
   [name] [nvarchar](14) NOT NULL,
 CONSTRAINT [pk_femalefirstnames] PRIMARY KEY CLUSTERED (
   [name] ASC
   )
);
GO

CREATE TABLE [dbo].[lastnames](
   [name] [nvarchar](14) NOT NULL,
 CONSTRAINT [pk_lastnames] PRIMARY KEY CLUSTERED (
   [name] ASC
   )
);
GO

CREATE TABLE [dbo].[malefirstnames](
   [name] [nvarchar](14) NOT NULL,
 CONSTRAINT [pk_malefirstnames] PRIMARY KEY CLUSTERED (
   [name] ASC
   )
);
GO

A naive approach -- always a place to start -- would be simply to cross-join the last names against the first names, and pick the top 'x' results. This does work, and luckily the TOP clause will optimize so that SQL Server doesn't materialize the entire cross-join set:

SELECT TOP 1000000 
   [last].name AS lastname,
   [first].name AS firstname
FROM dbo.lastnames AS [last]
CROSS JOIN ( 
   SELECT name FROM dbo.femalefirstnames 
   UNION ALL 
   SELECT name FROM dbo.malefirstnames) AS [first];

The issue is that this is unlikely to produce a useful distribution of names. It tends to repeat the names on one side of the join for each name on the other, so, for example, you are very likely to get every single last name (about 100,000) paired with just one first name, then every last name paired with the second first name, and so on. It also is likely to produce the same set every time. In my test, in 1,000,000 names I had only 11 unique first names:

SELECT COUNT( DISTINCT firstname ) 
FROM
( SELECT TOP 1000000 
   [last].name AS lastname,
   [first].name AS firstname
FROM dbo.lastnames AS [last]
CROSS JOIN ( 
   SELECT name FROM dbo.femalefirstnames 
   UNION ALL 
   SELECT name FROM dbo.malefirstnames) AS [first]
) AS selectednames;

Yields just 11.

So, we need not just to cross join the lists but also to get SQL Server to randomize the pairings of names, even if it's less efficient for the optimizer. A second naive approach: cross the lists, but pick random pairs from the whole result set, rather than the first rows returned. This also works, but it takes a LOT of resources to produce the results, and it's very slow:

SELECT TOP 1000000 
   [last].name AS lastname,
   [first].name AS firstname
FROM dbo.lastnames AS [last]
CROSS JOIN ( 
   SELECT name FROM dbo.femalefirstnames 
   UNION ALL 
   SELECT name FROM dbo.malefirstnames) AS [first]
ORDER BY NEWID(); --< this is a bad idea; the server will have 
               -- to materialize and sort the whole cross-join set

This is slow and demanding because where in the previous example the database engine can produce just 1,000,000 rows in memory, return them and stop working, as soon as I add the Order By clause, the entire cross join has to be materialized and sorted in order to deliver the top 1,000,000 ordered rows. The cross is about 500,000,000 rows. I let it run for five or ten minutes and then cancelled it, knowing there had to be a better way :-).

So, the ideal scenario would be to get a more "random" set of joined pairs, but without materializing too many more than I really need. If sorting the entire cross join in random order is too expensive, sorting the separate base tables isn't so bad. If there were a way to create a join where one random row from the last names table links to just a few random first names, less data would be disarded than with the massive cross.

This is where the row_number() function comes to the rescue: if we create a randomly ordered set of last names and number them on the fly, and do the same with the first names, then it's pretty simple to join them using the row_number() sequences.

Here's a query that produces a randomly-numbered set of last names. The output is in a different order each time you run the query, but always numbered:

SELECT 
   name AS lastname, 
   row_number() OVER ( ORDER BY NEWID() ) randid
FROM dbo.lastnames;

Likewise for the set of first names:

SELECT 
   name AS firstname, 
   row_number() OVER ( ORDER BY NEWID() ) randid
FROM ( SELECT name FROM dbo.femalefirstnames
   UNION ALL
   SELECT name FROM dbo.malefirstnames
   ) AS allfirstnames;

To create a large number of combinations, I can just join the first set to the second, using a between clause to pair each last name with a subset of the first names:

SELECT TOP 1000000 lastname, firstname 
FROM ( 
   SELECT 
       name AS lastname, 
       row_number() OVER ( ORDER BY NEWID() ) randid
   FROM dbo.lastnames 
) AS  lastnames
INNER JOIN (
   SELECT 
       name AS firstname, 
       row_number() OVER ( ORDER BY NEWID() ) randid 
   FROM (
       SELECT name FROM dbo.femalefirstnames
       UNION ALL
       SELECT name FROM dbo.malefirstnames
   ) AS allfirstnames
) AS firstnames 
ON lastnames.randid BETWEEN firstnames.randid AND firstnames.randid + 999;

This makes a set of random pairs in about 10 seconds on my laptop. Not too bad!

 

Published Monday, July 20, 2009 10:27 AM by merrillaldrich

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

Merrill Aldrich said:

I have just given a talk with my group at work on the basics of transaction isolation, dirty reads (nolock)

July 30, 2009 7:33 PM

Leave a Comment

(required) 
(required) 
Submit

This Blog

Syndication

Powered by Community Server (Commercial Edition), by Telligent Systems
  Privacy Statement