SQL Server – Converting First 100 Million Positive Integers to Strings

performancequery-performancesql server

This is a bit of an diversion from the real problem. If providing context helps, generating this data could be useful for performance testing ways of processing strings, for generating strings which need to have some operation applied to them within a cursor, or for generating unique, anonymous name replacements for sensitive data. I'm just interested in efficient ways of generating the data within SQL Servers, please don't ask why I need to generate this data.

I'll try to start with a somewhat formal definition. A string is included in the series if it only consists of capital letters from A – Z. The first term of the series is "A". The series consists of all valid strings sorted by length first and typical alphabetical order second. If the strings were in a table in a column called STRING_COL, the order could be defined in T-SQL as ORDER BY LEN(STRING_COL) ASC, STRING_COL ASC.

To give a less formal definition take a look at alphabetical column headers in excel. The series is the same pattern. Consider how you might convert an integer to a base 26 number:

1 -> A, 2 -> B, 3 -> C, … , 25 -> Y, 26 -> Z, 27 -> AA, 28 -> AB, …

The analogy isn't quite perfect because "A" behaves differently than 0 in base ten. Below is a table of selected values that will hopefully make it more clear:

╔════════════╦════════╗
║ ROW_NUMBER ║ STRING ║
╠════════════╬════════╣
║          1 ║ A      ║
║          2 ║ B      ║
║         25 ║ Y      ║
║         26 ║ Z      ║
║         27 ║ AA     ║
║         28 ║ AB     ║
║         51 ║ AY     ║
║         52 ║ AZ     ║
║         53 ║ BA     ║
║         54 ║ BB     ║
║      18278 ║ ZZZ    ║
║      18279 ║ AAAA   ║
║     475253 ║ ZZZY   ║
║     475254 ║ ZZZZ   ║
║     475255 ║ AAAAA  ║
║  100000000 ║ HJUNYV ║
╚════════════╩════════╝

The goal is to write a SELECT query that returns the first 100000000 strings in the order defined above. I did my testing by running queries in SSMS with the result set discarded as opposed to saving it to a table:

discard result set

Ideally the query will be reasonably efficient. Here I'm defining efficient as cpu time for a serial query and elapsed time for a parallel query. You may use whatever undocumented tricks that you like. Relying on undefined or non-guaranteed behavior is okay as well but it would be appreciated if you call that out in your answer.

What are some methods of efficiently generating the data set described above? Martin Smith pointed out that a CLR stored procedure probably isn't a good approach due to the overhead of processing so many rows.

Best Answer

Your solution runs for 35 seconds on my laptop. The following code takes 26 seconds (including creating and populating the temporary tables):

Temporary tables

DROP TABLE IF EXISTS #T1, #T2, #T3, #T4;

CREATE TABLE #T1 (string varchar(6) NOT NULL PRIMARY KEY);
CREATE TABLE #T2 (string varchar(6) NOT NULL PRIMARY KEY);
CREATE TABLE #T3 (string varchar(6) NOT NULL PRIMARY KEY);
CREATE TABLE #T4 (string varchar(6) NOT NULL PRIMARY KEY);

INSERT #T1 (string)
VALUES
    ('A'), ('B'), ('C'), ('D'), ('E'), ('F'), ('G'),
    ('H'), ('I'), ('J'), ('K'), ('L'), ('M'), ('N'),
    ('O'), ('P'), ('Q'), ('R'), ('S'), ('T'), ('U'),
    ('V'), ('W'), ('X'), ('Y'), ('Z');

INSERT #T2 (string)
SELECT T1a.string + T1b.string
FROM #T1 AS T1a, #T1 AS T1b;

INSERT #T3 (string)
SELECT #T2.string + #T1.string
FROM #T2, #T1;

INSERT #T4 (string)
SELECT #T3.string + #T1.string
FROM #T3, #T1;

The idea there is to pre-populate ordered combinations of up to four characters.

Main code

SELECT TOP (100000000)
    UA.string + UA.string2
FROM
(
    SELECT U.Size, U.string, string2 = '' FROM 
    (
        SELECT Size = 1, string FROM #T1
        UNION ALL
        SELECT Size = 2, string FROM #T2
        UNION ALL
        SELECT Size = 3, string FROM #T3
        UNION ALL
        SELECT Size = 4, string FROM #T4
    ) AS U
    UNION ALL
    SELECT Size = 5, #T1.string, string2 = #T4.string
    FROM #T1, #T4
    UNION ALL
    SELECT Size = 6, #T2.string, #T4.string
    FROM #T2, #T4
) AS UA
ORDER BY 
    UA.Size, 
    UA.string, 
    UA.string2
OPTION (NO_PERFORMANCE_SPOOL, MAXDOP 1);

That is a simple order-preserving union* of the four precalculated tables, with 5-character and 6-character strings derived as needed. Separating the prefix from the suffix avoids sorting.

Execution plan

100 million rows


* There's nothing in the SQL above that specifies an order-preserving union directly. The optimizer chooses physical operators with properties that match the SQL query specification, including top-level order by. Here, it chooses concatenation implemented by the merge join physical operator to avoid sorting.

The guarantee is that the execution plan delivers the query semantic and top-level order by specification. Knowing that merge join concat preserves order allows the query writer to anticipate an execution plan, but the optimizer will only deliver if the expectation is valid.