SQL Server – Converting First 100 Million Positive Integers to Strings

performancequery-performancesql server

This is a bit of an diversion from the real problem. If providing context helps, generating this data could be useful for performance testing ways of processing strings, for generating strings which need to have some operation applied to them within a cursor, or for generating unique, anonymous name replacements for sensitive data. I'm just interested in efficient ways of generating the data within SQL Servers, please don't ask why I need to generate this data.

I'll try to start with a somewhat formal definition. A string is included in the series if it only consists of capital letters from A – Z. The first term of the series is "A". The series consists of all valid strings sorted by length first and typical alphabetical order second. If the strings were in a table in a column called STRING_COL, the order could be defined in T-SQL as ORDER BY LEN(STRING_COL) ASC, STRING_COL ASC.

To give a less formal definition take a look at alphabetical column headers in excel. The series is the same pattern. Consider how you might convert an integer to a base 26 number:

1 -> A, 2 -> B, 3 -> C, … , 25 -> Y, 26 -> Z, 27 -> AA, 28 -> AB, …

The analogy isn't quite perfect because "A" behaves differently than 0 in base ten. Below is a table of selected values that will hopefully make it more clear:

╔════════════╦════════╗
║ ROW_NUMBER ║ STRING ║
╠════════════╬════════╣
║          1 ║ A      ║
║          2 ║ B      ║
║         25 ║ Y      ║
║         26 ║ Z      ║
║         27 ║ AA     ║
║         28 ║ AB     ║
║         51 ║ AY     ║
║         52 ║ AZ     ║
║         53 ║ BA     ║
║         54 ║ BB     ║
║      18278 ║ ZZZ    ║
║      18279 ║ AAAA   ║
║     475253 ║ ZZZY   ║
║     475254 ║ ZZZZ   ║
║     475255 ║ AAAAA  ║
║  100000000 ║ HJUNYV ║
╚════════════╩════════╝

The goal is to write a SELECT query that returns the first 100000000 strings in the order defined above. I did my testing by running queries in SSMS with the result set discarded as opposed to saving it to a table:

Ideally the query will be reasonably efficient. Here I'm defining efficient as cpu time for a serial query and elapsed time for a parallel query. You may use whatever undocumented tricks that you like. Relying on undefined or non-guaranteed behavior is okay as well but it would be appreciated if you call that out in your answer.

What are some methods of efficiently generating the data set described above? Martin Smith pointed out that a CLR stored procedure probably isn't a good approach due to the overhead of processing so many rows.

Best Answer

Your solution runs for 35 seconds on my laptop. The following code takes 26 seconds (including creating and populating the temporary tables):

Temporary tables

DROP TABLE IF EXISTS #T1, #T2, #T3, #T4;

CREATE TABLE #T1 (string varchar(6) NOT NULL PRIMARY KEY);
CREATE TABLE #T2 (string varchar(6) NOT NULL PRIMARY KEY);
CREATE TABLE #T3 (string varchar(6) NOT NULL PRIMARY KEY);
CREATE TABLE #T4 (string varchar(6) NOT NULL PRIMARY KEY);

INSERT #T1 (string)
VALUES
    ('A'), ('B'), ('C'), ('D'), ('E'), ('F'), ('G'),
    ('H'), ('I'), ('J'), ('K'), ('L'), ('M'), ('N'),
    ('O'), ('P'), ('Q'), ('R'), ('S'), ('T'), ('U'),
    ('V'), ('W'), ('X'), ('Y'), ('Z');

INSERT #T2 (string)
SELECT T1a.string + T1b.string
FROM #T1 AS T1a, #T1 AS T1b;

INSERT #T3 (string)
SELECT #T2.string + #T1.string
FROM #T2, #T1;

INSERT #T4 (string)
SELECT #T3.string + #T1.string
FROM #T3, #T1;

The idea there is to pre-populate ordered combinations of up to four characters.

Main code

SELECT TOP (100000000)
    UA.string + UA.string2
FROM
(
    SELECT U.Size, U.string, string2 = '' FROM 
    (
        SELECT Size = 1, string FROM #T1
        UNION ALL
        SELECT Size = 2, string FROM #T2
        UNION ALL
        SELECT Size = 3, string FROM #T3
        UNION ALL
        SELECT Size = 4, string FROM #T4
    ) AS U
    UNION ALL
    SELECT Size = 5, #T1.string, string2 = #T4.string
    FROM #T1, #T4
    UNION ALL
    SELECT Size = 6, #T2.string, #T4.string
    FROM #T2, #T4
) AS UA
ORDER BY 
    UA.Size, 
    UA.string, 
    UA.string2
OPTION (NO_PERFORMANCE_SPOOL, MAXDOP 1);

That is a simple order-preserving union* of the four precalculated tables, with 5-character and 6-character strings derived as needed. Separating the prefix from the suffix avoids sorting.

Execution plan

* There's nothing in the SQL above that specifies an order-preserving union directly. The optimizer chooses physical operators with properties that match the SQL query specification, including top-level order by. Here, it chooses concatenation implemented by the merge join physical operator to avoid sorting.

The guarantee is that the execution plan delivers the query semantic and top-level order by specification. Knowing that merge join concat preserves order allows the query writer to anticipate an execution plan, but the optimizer will only deliver if the expectation is valid.

Related Solutions

Using an RDBMS for querying tens of Terabytes of time-series data

Answering this from one point of view: SQL Server 2012 columnstores could easily handle this. I've seen it working and once segment elimination and batch processing kick in, those TBs are reduced to very few actual IOs an the results get churned in few ms (ie. most data is eliminated upfront and there is simply no need to scan 100s of TBs). The queries you asked about are exactly the kind columnstores are designed for. This is such an efficient storage/processing paradigm that there's simply no need for distributed computing, even at hundreds of TB of data.

Mysql – How to use and optimize subquery on 100 million rows

Since is_keyword_checked can take only two possible values, 0 or 1 (and it's not nullable), you can try the rewriting. There is no OR and the index on (is_keyword_checked) will be used:

SELECT * 
FROM domain
WHERE domain_ID NOT IN 
      ( SELECT domain_ID 
        FROM domain_setting 
        WHERE is_keyword_checked = 1
      ) ;

or with a NOT EXISTS subquery - which should have similar execution plan:

SELECT d.* 
FROM domain AS d
WHERE NOT EXISTS 
      ( SELECT * 
        FROM domain_setting AS ds
        WHERE ds.is_keyword_checked = 1
          AND ds.domain_ID = d.domain_ID
      ) ;