SQL Server – Fastest way to split/store a long string for the charindex function

physical-design | sql-server | sql-server-2017 | string-searching | string-splitting

I have a 1 TB string of digits. Given a 12-digit sequence, I want to find the start position of that sequence in the original string (the charindex function).

I have tested this with a 1 GB string and a 9-digit substring using SQL Server, storing the string as a varchar(max). Charindex takes 10 seconds. Breaking the 1 GB string into 900-byte overlapping chunks and storing them in an indexed table (StartPositionOfChunk, Chunkofstring), with Chunkofstring in a binary collation, brings the search under 1 second. For a 10 GB string and a 10-digit substring the latter method still takes 1.5 minutes. I would like to find a faster storage method.

Example

string of digits: 0123456789 – substring to search 345
charindex('345','0123456789') gives 4

Method 1: I can now store this in a SQL Server table strtable consisting of one column colstr and perform:

select charindex('345',colstr) from strtable

Method 2: Alternatively, I can build a table strtable2 (pos, colstr1) by splitting the original string into overlapping pieces: 1;012 | 2;123 | 3;234 and so on, and then run the query

select pos from strtable2 where colstr1='345'
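
For illustration, a minimal sketch of how strtable2 could be populated with a numbers table (the variable @s, the fixed 3-character piece length, and master..spt_values as a row source are assumptions for this small example):

DECLARE @s VARCHAR(MAX) = '0123456789';

-- one row per position, each holding the 3-character piece starting there
;WITH nums AS (
    SELECT TOP (LEN(@s) - 2)
           ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS pos
    FROM master..spt_values t1
    CROSS JOIN master..spt_values t2
)
SELECT pos, SUBSTRING(@s, pos, 3) AS colstr1
FROM nums;
-- INSERT INTO strtable2 (pos, colstr1) SELECT ... would persist these rows.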

Method 3: Or I can build a table strtable3 (pos2, colstr2) by splitting the original string into larger overlapping chunks: 1;01234 | 4;34567 | 7;6789, and then

select pos2 + charindex('345', colstr2) - 1 from strtable3 where colstr2 like '%345%'
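
And a sketch of how the larger overlapping chunks could be generated for the real workload (chunk length 900 with a stride of 889, i.e. an 11-character overlap so that no 12-digit substring can straddle a chunk boundary unseen; the sample data and the table name strtable3 are assumptions):

DECLARE @s VARCHAR(MAX) = REPLICATE(CAST('0123456789' AS VARCHAR(MAX)), 1000);
DECLARE @len BIGINT = LEN(@s);

-- one row per 900-byte chunk; consecutive chunks overlap by 11 characters
;WITH nums AS (
    SELECT TOP ((@len + 888) / 889)
           (ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1) * 889 + 1 AS pos2
    FROM master..spt_values t1
    CROSS JOIN master..spt_values t2
)
SELECT pos2, SUBSTRING(@s, pos2, 900) AS colstr2
FROM nums;
-- Persist into strtable3 (pos2, colstr2) with colstr2 declared as VARCHAR(900)
-- COLLATE Latin1_General_BIN2 and a nonclustered index on colstr2.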

The first method is the slowest.

The second method blows up the database storage size!

Method 3: With colstr2 set to 900 bytes in a binary collation and an index created on that column, the search takes 1 second for a 1 GB string and a 9-digit substring. For a 10 GB string and a 10-digit substring it takes 90 seconds.

Any other ideas for making this faster (perhaps by exploiting the fact that the string consists only of digits, e.g. by using long integers, …)?

The search is always for a 12-digit substring in a 1 TB string of digits. Environment: SQL Server 2017 Developer Edition, 16 cores, 16 GB RAM. The primary goal is search speed! (The 10-digit substring in a 10 GB string is only for performance testing.)

Best Answer

I recommend using a flavor of method 2 and splitting the search range into many target tables. 10000 tables is a good first attempt. For example, if you search for "012345678901" then your query would look at the table associated with data that starts with "0123" (a routing sketch follows the list below). You would still have about a trillion rows in total, but splitting the data into many tables has the following positive qualities:

  1. The part of each 12 digit string that still needs to be stored (the last 8 digits) now fits into an INT.
  2. Building a more search efficient representation of your 1 TB string is likely going to be expensive no matter what. With many tables, you can easily parallelize the job and even temporarily ask for more CPUs to be allocated to your VM.
  3. You can build a single table as a proof of concept to determine query time and total space requirements for the full string.
  4. If you ever need to do any kind of database maintenance you'll be happy that you didn't create one huge table.
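
To make the routing concrete, here is a minimal sketch of how a 12-digit search string could be mapped to a table name and an INT search key (the procedure name is hypothetical; the Txxxx_Q229892_PAGE_COMPRESSION naming follows the demo tables below):

-- Hypothetical router: digits 1-4 pick the table, digits 5-12 become the INT key.
CREATE OR ALTER PROCEDURE dbo.FindDigitString (@search CHAR(12))
AS
BEGIN
    DECLARE @table SYSNAME = N'T' + LEFT(@search, 4) + N'_Q229892_PAGE_COMPRESSION';
    DECLARE @piece INT = CAST(RIGHT(@search, 8) AS INT);
    DECLARE @sql NVARCHAR(MAX) =
        N'SELECT MIN(STR_POS) FROM dbo.' + QUOTENAME(@table) + N' WHERE STRING_PIECE = @piece;';

    EXEC sp_executesql @sql, N'@piece INT', @piece = @piece;
END;
GO

-- EXEC dbo.FindDigitString '012345678901';  -- hits dbo.T0123_Q229892_PAGE_COMPRESSION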

At this point, the main question to me is whether you use compressed rowstore or columnstore. The code below creates a rowstore table for the "0123" search space and inserts 100 million rows into it. If your string is random enough then you could also expect to see about 100 million rows per table.

DROP TABLE IF EXISTS #t;

SELECT TOP (10000) 0 ID INTO #t
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
OPTION (MAXDOP 1);


DROP TABLE IF EXISTS dbo.Q229892_RAW_100M_RANGE;

CREATE TABLE dbo.Q229892_RAW_100M_RANGE (
STRING_PIECE INT NOT NULL,
STR_POS BIGINT NOT NULL
);

INSERT INTO dbo.Q229892_RAW_100M_RANGE WITH (TABLOCK)
SELECT ABS(CHECKSUM(NEWID()) % 100000000),
TRY_CAST(ABS(CHECKSUM(NEWID())) AS BIGINT) * TRY_CAST(ABS(CHECKSUM(NEWID())) AS BIGINT)
FROM #t t1
CROSS JOIN #t t2
OPTION (MAXDOP 4);


DROP TABLE IF EXISTS dbo.T0123_Q229892_PAGE_COMPRESSION;

CREATE TABLE dbo.T0123_Q229892_PAGE_COMPRESSION (
    STRING_PIECE INT NOT NULL,
    STR_POS BIGINT NOT NULL,
    PRIMARY KEY (STRING_PIECE, STR_POS)
) WITH (DATA_COMPRESSION = PAGE);

INSERT INTO dbo.T0123_Q229892_PAGE_COMPRESSION WITH (TABLOCK)
SELECT STRING_PIECE, STR_POS
FROM dbo.Q229892_RAW_100M_RANGE;
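
The demo above uses random data. As a rough sketch of how real rows could be derived from the source string (the staging table dbo.SourceChunks with 1-based absolute offsets is an assumption; consecutive chunks must overlap by 11 characters so that no 12-digit window is lost at a boundary):

-- Hypothetical loader: dbo.SourceChunks(ChunkStart BIGINT, ChunkText VARCHAR(8000))
-- holds the source string in overlapping pieces. For every position, digits 1-4
-- route the row to a table and digits 5-12 become STRING_PIECE.
INSERT INTO dbo.T0123_Q229892_PAGE_COMPRESSION WITH (TABLOCK)
SELECT CAST(SUBSTRING(c.ChunkText, n.pos + 4, 8) AS INT) AS STRING_PIECE,
       c.ChunkStart + n.pos - 1 AS STR_POS
FROM dbo.SourceChunks c
CROSS APPLY (
    SELECT TOP (LEN(c.ChunkText) - 11)
           ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS pos
    FROM master..spt_values t1
    CROSS JOIN master..spt_values t2
) n
WHERE SUBSTRING(c.ChunkText, n.pos, 4) = '0123'; -- only the rows for the '0123' table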

The bad news is that for the full data set you'd probably need around 15.4 TB of storage. The good news is that queries take only 1 ms for me even when no relevant data is in the buffer cache, which will almost always be the case for a data set as large as yours.

-- 1 ms
CHECKPOINT;
DBCC DROPCLEANBUFFERS;

SELECT MIN(STR_POS)
FROM dbo.T0123_Q229892_PAGE_COMPRESSION
WHERE STRING_PIECE = 45678901; -- searching for '012345678901'

You can probably throw this data on the cheapest storage that you have and still see good response times since the query does so few logical reads.

For columnstore, you can't seek to the data that you need, and you're still extremely unlikely to fit all of your data into memory, so it's important to read as little compressed data as possible with your queries. I strongly recommend partitioning your tables. One simple approach that works well is to use the first four digits of your search string to find the table name and the next two digits as the partition. Using "012345678901" again, you would go to partition 45 of the table that holds data for "0123". 100 partitions is a good number to avoid problems caused by too many partitions, and you'll end up with about 1 million rows per partition on average. The maximum number of rows that can fit into a single rowgroup is 1,048,576, so with this approach you'll be doing as little I/O as possible. In the code below, I insert 100 million rows into a columnstore table that has 100 partitions.

DROP TABLE IF EXISTS dbo.Q229892_RAW_1M_RANGE;

CREATE TABLE dbo.Q229892_RAW_1M_RANGE (
STRING_PIECE INT NOT NULL,
STR_POS BIGINT NOT NULL
);

INSERT INTO dbo.Q229892_RAW_1M_RANGE WITH (TABLOCK)
SELECT TOP (1000000) ABS(CHECKSUM(NEWID()) % 1000000),
TRY_CAST(ABS(CHECKSUM(NEWID())) AS BIGINT) * TRY_CAST(ABS(CHECKSUM(NEWID())) AS BIGINT)
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
OPTION (MAXDOP 1);



DECLARE @IntegerPartitionFunction nvarchar(max) = 
    N'CREATE PARTITION FUNCTION partition100 (tinyint) 
    AS RANGE LEFT FOR VALUES (';  
DECLARE @i int = 0;  
WHILE @i < 100
BEGIN  
SET @IntegerPartitionFunction += CAST(@i as nvarchar(10)) + N', ';  
SET @i += 1;  
END  
SET @IntegerPartitionFunction += CAST(@i as nvarchar(10)) + N');';  
EXEC sp_executesql @IntegerPartitionFunction;  
GO  

CREATE PARTITION SCHEME partition100_scheme
AS PARTITION partition100  
ALL TO ([DEFAULT]);

DROP TABLE IF EXISTS dbo.T0123_Q229892_COLUMNSTORE;

-- this table must be partitioned by PART_ID!
CREATE TABLE dbo.T0123_Q229892_COLUMNSTORE (
    PART_ID TINYINT NOT NULL,
    STRING_PIECE INT NOT NULL,
    STR_POS BIGINT NOT NULL,
    INDEX CCS CLUSTERED COLUMNSTORE
) ON partition100_scheme (PART_ID);


GO

DECLARE @part_id TINYINT = 0;
SET NOCOUNT ON;
WHILE @part_id < 100
BEGIN
    INSERT INTO dbo.T0123_Q229892_COLUMNSTORE WITH (TABLOCK)
    SELECT @part_id, STRING_PIECE, STR_POS
    FROM dbo.Q229892_RAW_1M_RANGE
    OPTION (MAXDOP 1);

    SET @part_id = @part_id + 1;
END;

GO

With this approach the full data set would require about 10.9 TB. It's not clear to me how to make that smaller. The search query is a bit slower in this case; on my machine it takes about 25 ms, but this will mostly depend on I/O:

CHECKPOINT;
DBCC DROPCLEANBUFFERS;

SELECT MIN(STR_POS)
FROM dbo.T0123_Q229892_COLUMNSTORE
WHERE PART_ID = 45
AND STRING_PIECE = 678901; -- searching for '012345678901'
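
For completeness, the same decomposition of the search string for the columnstore layout, as a small sketch (digits 1-4 pick the table, digits 5-6 the partition, digits 7-12 the INT key; the variable names are only for illustration):

-- Hypothetical decomposition of '012345678901' for the partitioned columnstore table.
DECLARE @search CHAR(12) = '012345678901';
DECLARE @table SYSNAME = N'T' + LEFT(@search, 4) + N'_Q229892_COLUMNSTORE';  -- dbo.T0123_...
DECLARE @part_id TINYINT = CAST(SUBSTRING(@search, 5, 2) AS TINYINT);        -- 45
DECLARE @piece INT = CAST(RIGHT(@search, 6) AS INT);                         -- 678901

SELECT @table AS TargetTable, @part_id AS PART_ID, @piece AS STRING_PIECE;

The actual search would then be issued against that table with sp_executesql, just like in the rowstore routing sketch earlier.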

One important note on the columnstore approach is that the 10.9 TB figure is for 100% compressed data. It will be challenging to populate such a table efficiently while avoiding the delta stores. It's likely that you'll end up with uncompressed data in delta stores at some point in the process which could easily require more than the 15.4 TB used for the rowstore approach.
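
One way to keep the delta stores under control, as a sketch (assuming the index name CCS from the table definition above): load in batches of at least 102,400 rows per partition with TABLOCK so the rows bypass the delta store, and force-compress whatever still lands there afterwards.

-- After loading, compress any remaining delta-store rows into columnstore rowgroups.
ALTER INDEX CCS ON dbo.T0123_Q229892_COLUMNSTORE
REORGANIZE WITH (COMPRESS_ALL_ROW_GROUPS = ON);

-- Check rowgroup state per partition (ideally no OPEN or CLOSED delta stores remain).
SELECT partition_number, state_desc, COUNT(*) AS rowgroups, SUM(total_rows) AS total_rows
FROM sys.dm_db_column_store_row_group_physical_stats
WHERE object_id = OBJECT_ID('dbo.T0123_Q229892_COLUMNSTORE')
GROUP BY partition_number, state_desc
ORDER BY partition_number;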