SQL Server – Filtering data ordered by rowversion

Tags: index, index-tuning, sql-server

I have a SQL Server table with the following structure:

CREATE TABLE Data(
    Id uniqueidentifier NOT NULL,
    Date datetime NOT NULL,
    Value decimal(20, 10) NULL,
    RV timestamp NOT NULL,
 CONSTRAINT PK_Data PRIMARY KEY CLUSTERED (Id, Date)
)

The number of distinct Ids ranges from 3,000 to 50,000.
The table grows to over a billion rows.
A single Id can account for anywhere from a few rows up to 5% of the table.

The single most executed query on this table is:

SELECT Id, Date, Value, RV
FROM Data
WHERE Id = @Id
AND Date Between @StartDate AND @StopDate

I now have to implement incremental retrieval of data for a subset of Ids, including updates.
I settled on a request scheme in which the caller provides a rowversion cursor, retrieves a block of data, and uses the maximum rowversion value of the returned data as the cursor for the subsequent call.

I have written this procedure:

CREATE TYPE guid_list_tbltype AS TABLE (Id uniqueidentifier NOT NULL PRIMARY KEY);
GO
CREATE PROCEDURE GetData
    @Ids guid_list_tbltype READONLY,
    @Cursor rowversion,
    @MaxRows int
AS
BEGIN
    SELECT A.* 
    FROM (
        SELECT 
            Data.Id,
            Date,
            Value,
            RV,
            ROW_NUMBER() OVER (ORDER BY RV) AS RN
        FROM Data
             inner join (SELECT Id FROM @Ids) Ids ON Ids.Id = Data.Id
        WHERE RV > @Cursor
    ) A 
    WHERE RN <= @MaxRows
END

Where @MaxRows will range between 500,000 and 2,000,000, depending on how the client wants the data chunked.
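In practice the call pattern looks roughly like this (a sketch; the #Block temp table and @ClientIds variable are illustrative names, not part of the actual schema):

    DECLARE @Cursor rowversion = 0x0;   -- start from the beginning of the data
    DECLARE @RowCount int = -1;

    WHILE @RowCount <> 0
    BEGIN
        DELETE FROM #Block;

        -- Fetch the next chunk of at most @MaxRows rows
        INSERT INTO #Block (Id, Date, Value, RV, RN)
        EXEC GetData @Ids = @ClientIds, @Cursor = @Cursor, @MaxRows = 1000000;

        SET @RowCount = @@ROWCOUNT;

        -- Advance the cursor to the highest rowversion returned;
        -- keep the old value if the chunk came back empty
        SELECT @Cursor = ISNULL(MAX(RV), @Cursor) FROM #Block;
    END;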


I have tried several approaches:

  1. Indexing on (Id, RV):
    CREATE NONCLUSTERED INDEX IDX_IDRV ON Data(Id, RV) INCLUDE(Date, Value);

Using this index, the query seeks to RV = @Cursor for each Id in @Ids, reads the following rows, then merges the results and sorts them.
Efficiency then depends on the position of the @Cursor value relative to the data.
If it is close to the end of the data (ordered by RV), the query is near-instantaneous; if not, it can take minutes (I never let it run to completion).

The problem with this approach is that either @Cursor is near the end of the data and the sort isn't painful (not even needed if the query returns fewer rows than @MaxRows), or it is further behind and the query may have to sort up to @MaxRows * LEN(@Ids) rows.

  2. Indexing on RV:
    CREATE NONCLUSTERED INDEX IDX_RV ON Data(RV) INCLUDE(Id, Date, Value);

Using this index, the query seeks to the row where RV = @Cursor, then reads every subsequent row, discarding those for non-requested Ids, until it has accumulated @MaxRows rows.
Efficiency then depends on the percentage of requested Ids (LEN(@Ids) / COUNT(DISTINCT Id)) and on their distribution.
A higher percentage of requested Ids means fewer discarded rows and therefore more efficient reads; a lower percentage means more discarded rows, i.e. more reads for the same number of result rows.

The problem with this approach is that if only a few Ids are requested, the scan might have to read the whole index to find the desired rows.

  3. Using a filtered index or an indexed view:
    CREATE NONCLUSTERED INDEX IDX_RVClient1 ON Data(Id, RV) INCLUDE(Date, Value)
    WHERE Id IN (/* list of Ids for specific client*/);

Or

    CREATE VIEW vDataClient1 WITH SCHEMABINDING
    AS
    SELECT
        Id,
        Date,
        Value,
        RV
    FROM dbo.Data
    WHERE Id IN (/* list of Ids for specific client*/)
    CREATE UNIQUE CLUSTERED INDEX IDX_IDRV ON vDataClient1(Id, Rv);

This method allows for perfectly efficient indexing and query execution plans, but it comes with disadvantages:
1. Practically, I will have to use dynamic SQL to create the indexes or views, and modify the requesting procedure to use the right index or view.
2. I will have to maintain one index or view per client, including its storage.
3. Every time a client changes its list of requested Ids, I will have to drop and recreate the index or view.
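For point 1, the dynamic SQL would look something like this (a sketch; @IdList is an assumed, pre-built comma-separated list of quoted Ids for the client, and the index name would have to be generated per client):

    DECLARE @sql nvarchar(max) = N'
        CREATE NONCLUSTERED INDEX IDX_RVClient1
        ON Data(Id, RV) INCLUDE(Date, Value)
        WHERE Id IN (' + @IdList + N');';

    EXEC sys.sp_executesql @sql;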


I can't seem to find a method that suits my needs.
I'm looking for better ideas to implement incremental data retrieval.
Those ideas could involve reworking the request scheme or the database schema, although I'd prefer a better indexing approach if one exists.

Best Answer

One solution is for the client application to remember the maximum rowversion per ID. The user-defined table type would change to:

CREATE TYPE
    dbo.guid_list_tbltype
AS TABLE 
    (
    Id      uniqueidentifier PRIMARY KEY, 
    LastRV  rowversion NOT NULL
    );

The query in the procedure can then be rewritten to use the APPLY pattern (see my SQLServerCentral articles, part 1 and part 2 - free login required). The key to good performance here is the ORDER BY: it avoids unordered prefetching on the nested loops join. The OPTION (RECOMPILE) is necessary to allow the optimizer to see the cardinality of the table variable at compilation time (which will probably result in a desirable parallel plan).

ALTER PROCEDURE dbo.GetData

    @IDs        guid_list_tbltype READONLY,
    @MaxRows    bigint

AS
BEGIN

    SELECT TOP (@MaxRows)
        d.Id,
        d.[Date],
        d.Value,
        d.RV
    FROM @Ids AS i
    CROSS APPLY
    (
        SELECT
            d.*
        FROM dbo.Data AS d
        WHERE
            d.Id = i.Id
            AND d.RV > i.LastRV
    ) AS d
    ORDER BY
        i.Id,
        d.RV
    OPTION (RECOMPILE);

END;

You should get a post-execution query plan like this (estimated plan will be serial):

[query plan screenshot]
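On the client side, the per-Id watermarks can be advanced after each batch. A sketch, assuming the client keeps its Id list in a permanent ClientIds table (with LastRV stored as binary(8), since rowversion values compare as binary(8)) and the rows just received in a #Batch temp table:

    UPDATE i
    SET    LastRV = b.MaxRV
    FROM   ClientIds AS i
    JOIN  (
            SELECT Id, MAX(RV) AS MaxRV
            FROM #Batch
            GROUP BY Id
          ) AS b
        ON b.Id = i.Id;

The next call then fills the table-valued parameter from ClientIds.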