I have a 1 TB string of digits. Given a 12-character sequence of digits, I want to get the start position of this sequence in the original string (a charindex function).
I have tested this with a 1 GB string and a 9-digit substring using SQL Server, storing the string as a varchar(max). Charindex takes 10 seconds. Breaking the 1 GB string into 900-byte overlapping chunks and creating a table (StartPositionOfChunk, Chunkofstring), with chunkofstring in binary collation and indexed, takes under 1 second. For a 10 GB string and a 10-digit substring, the latter method rises to 1.5 minutes. I would like to find a faster storage method.
Example
string of digits: 0123456789 – substring to search: 345
charindex('345','0123456789') gives 4
Method 1: I can now store this in a SQL Server table strtable consisting of one column colstr and perform:
select charindex('345',colstr) from strtable
Method 2: Or I can make up a table strtable2 (pos, colstr1) by splitting up the original string, 1;012 | 2;123 | 3;234 and so on, and then we can have the query
select pos from strtable2 where colstr1='345'
Method 3: I can make up a table strtable3 (pos2, colstr2) by splitting up the original string into larger overlapping chunks, 1;01234 | 4;34567 | 7;6789, and then
select pos2+charindex('345',colstr2)-1 from strtable3 where colstr2 like '%345%'
(the -1 is needed because both pos2 and charindex are 1-based).
The first method is the slowest.
The second method blows up the database storage size!
Method 3: With colstr2 set to 900 bytes in binary collation and an index on that column, the search takes 1 second for a 1 GB string and a 9-digit substring. For a 10 GB string and a 10-digit substring it takes 90 seconds.
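For reference, the method-3 chunk table could look like this (the names and the specific binary collation are illustrative; consecutive chunks must overlap by at least the substring length minus one byte so that no match straddles a chunk boundary):

CREATE TABLE dbo.strtable3 (
    pos2    BIGINT NOT NULL,                                   -- chunk start position
    colstr2 VARCHAR(900) COLLATE Latin1_General_BIN2 NOT NULL  -- 900-byte chunk
);
CREATE INDEX IX_strtable3_colstr2 ON dbo.strtable3 (colstr2);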
Any other ideas on how to make this faster (maybe by exploiting the fact that the string consists only of digits, e.g. using long integers, …)?
The search is always for a 12-digit substring in a 1 TB string of digits. SQL Server 2017 Developer Edition, 16 cores, 16 GB RAM. The primary goal is search speed! For performance testing I use a 10-digit substring in a 10 GB string.
Best Answer
I recommend using a flavor of method 2 and splitting the search range into many target tables. 10000 tables is a good first attempt. For example, if you search for "012345678901" then your query would look at the table associated with data that starts with "0123". You would still have about a trillion rows in total, but splitting the data into many tables keeps each table and its index small, and every search only ever touches the one table that can contain a match.
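As a sketch of the routing (the table name dbo.SEARCH_0123 and the columns SUFFIX_8 and START_POS used here are placeholders matching the code further down), the first four digits select the table and the remaining eight digits become the seek key:

DECLARE @search CHAR(12) = '012345678901';
DECLARE @sql NVARCHAR(200) =
    N'SELECT START_POS FROM dbo.SEARCH_' + LEFT(@search, 4)
  + N' WHERE SUFFIX_8 = @suffix;';
EXEC sys.sp_executesql @sql,
     N'@suffix INT',
     @suffix = CAST(RIGHT(@search, 8) AS INT);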
At this point, the main question to me is whether you use compressed rowstore or columnstore. The code below creates a rowstore table for the "0123" search space and inserts 100 million rows into it. If your string is random enough then you could also expect to see about 100 million rows per table.
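Something along these lines (table and column names are placeholders, and the demo load generates random rows from a tally CTE instead of extracting them from the real string):

-- Page-compressed rowstore table for every 12-digit substring
-- that starts with "0123".
CREATE TABLE dbo.SEARCH_0123 (
    SUFFIX_8  INT    NOT NULL,  -- digits 5-12 of the substring
    START_POS BIGINT NOT NULL   -- 1-based offset in the source string
);

-- Demo load: 100 million random rows.
WITH
N1 AS (SELECT 1 AS n FROM (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) v(n)),
N2 AS (SELECT 1 AS n FROM N1 a CROSS JOIN N1 b),   -- 10^2 rows
N4 AS (SELECT 1 AS n FROM N2 a CROSS JOIN N2 b),   -- 10^4 rows
N8 AS (SELECT TOP (100000000)
              ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS rn
       FROM N4 a CROSS JOIN N4 b)                  -- 10^8 rows
INSERT INTO dbo.SEARCH_0123 WITH (TABLOCK) (SUFFIX_8, START_POS)
SELECT ABS(CHECKSUM(NEWID()) % 100000000), rn
FROM N8;

-- Clustered index on the suffix: each search is one cheap seek.
CREATE CLUSTERED INDEX CI_SEARCH_0123
ON dbo.SEARCH_0123 (SUFFIX_8, START_POS)
WITH (DATA_COMPRESSION = PAGE);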
The bad news is that for the full data set you'd probably need around 15.4 TB. The good news is that queries only take 1 ms for me even when no relevant data is in the buffer cache, which will almost always be the case for a data set as large as yours.
You can probably throw this data on the cheapest storage that you have and still see good response times since the query does so few logical reads.
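You can check the cost of a lookup on the sketch table above with STATISTICS IO; the query is a single index seek doing a handful of logical reads:

SET STATISTICS IO, TIME ON;
SELECT START_POS
FROM dbo.SEARCH_0123
WHERE SUFFIX_8 = 45678901;  -- digits 5-12 of '012345678901'
SET STATISTICS IO, TIME OFF;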
For columnstore, you can't seek to the data that you need and you're still extremely unlikely to fit all of your data into memory, so it's important to read as little compressed data as possible with your queries. I strongly recommend partitioning your tables. One simple approach that works well is to use the first four digits of your search string to find the table name and the next two digits as the partition. Using "012345678901" again, you would go to partition 45 of the table that holds data for "0123". 100 partitions is a good number to avoid problems caused by too many partitions and you'll end up with about 1 million rows for each partition on average. The maximum number of rows that can fit into a single rowgroup is 1048576, so with this approach you'll be doing as little IO as possible. In the code below, I insert 100 M rows into a columnstore table that has 100 partitions.
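A sketch of that setup, again with placeholder names:

-- 99 boundary values -> 100 partitions, one per two-digit value 00-99.
CREATE PARTITION FUNCTION PF_TwoDigits (INT)
AS RANGE LEFT FOR VALUES (
     0, 1, 2, 3, 4, 5, 6, 7, 8, 9,10,11,12,13,14,15,16,17,18,19,
    20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,
    40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,
    60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,
    80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98);

CREATE PARTITION SCHEME PS_TwoDigits
AS PARTITION PF_TwoDigits ALL TO ([PRIMARY]);

-- Columnstore table for the "0123" search space, partitioned on digits 5-6.
CREATE TABLE dbo.CS_0123 (
    PART_2    INT    NOT NULL,  -- digits 5-6 of the substring (partition key)
    SUFFIX_6  INT    NOT NULL,  -- digits 7-12
    START_POS BIGINT NOT NULL,
    INDEX CCI_CS_0123 CLUSTERED COLUMNSTORE
) ON PS_TwoDigits (PART_2);

-- Demo load: 100 million rows, about 1 million per partition.
WITH
N1 AS (SELECT 1 AS n FROM (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) v(n)),
N2 AS (SELECT 1 AS n FROM N1 a CROSS JOIN N1 b),
N4 AS (SELECT 1 AS n FROM N2 a CROSS JOIN N2 b),
N8 AS (SELECT TOP (100000000)
              ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS rn
       FROM N4 a CROSS JOIN N4 b)
INSERT INTO dbo.CS_0123 (PART_2, SUFFIX_6, START_POS)
SELECT rn % 100, ABS(CHECKSUM(NEWID()) % 1000000), rn
FROM N8;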
With this approach the full data set would require about 10.9 TB. It's not clear to me how to make that smaller. The search query is a bit slower in this case. On my machine it takes about 25 ms, but this will mostly depend on IO:
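Again with the placeholder names from the sketch above:

-- Searching for '012345678901': partition elimination on PART_2 = 45
-- limits the scan to roughly one 1M-row rowgroup.
SELECT START_POS
FROM dbo.CS_0123
WHERE PART_2 = 45
  AND SUFFIX_6 = 678901;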
One important note on the columnstore approach is that the 10.9 TB figure is for 100% compressed data. It will be challenging to populate such a table efficiently while avoiding the delta stores. It's likely that you'll end up with uncompressed data in delta stores at some point in the process which could easily require more than the 15.4 TB used for the rowstore approach.
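If you do end up with delta stores after the load, a reorganize (using the index name from the sketch above) forces them into compressed rowgroups:

ALTER INDEX CCI_CS_0123 ON dbo.CS_0123
REORGANIZE WITH (COMPRESS_ALL_ROW_GROUPS = ON);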