SQL Server Pagination – Most Cost Efficient Way to Page Through a Poorly Ordered Table

Tags: paging, sql-server

I have a table that has three columns:
HashUID1, HashUID2, Address_Name (which is a textual email address; the previous two hash columns are some crazy creation linking event-participant tables to email addresses. It's ugly, it barely works, and it's out of my control. Focus on the Address_Name index.)

It has 78 million rows and is not properly sorted. Regardless, the index is split across a lot of fast LUNs and performs REALLY fast index seeks.

I need to create a series of queries to extract only 20,000 "rows per page" at a time, but avoid conflicts or dupes. Since there is no identity column, or easily ordered column, is there an easy way to select all, and page through it?

Am I correct in saying that if I do a SELECT * FROM hugetablewithemails into a temp table, then page through it by ROW_NUMBER(), the table remains in memory for the duration of the transaction, which, to me, is an excessive use of memory resources?
This seems to be the preferred method of paging. I'd rather page by statistical percentages. 🙁
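For reference, the ROW_NUMBER() paging pattern I mean looks roughly like this (a sketch only, using the table and column names described above; @PageNumber and @PageSize are illustrative):

```sql
-- Sketch of ROW_NUMBER() paging over the Address_Name index.
-- @PageNumber/@PageSize are hypothetical parameters for illustration.
DECLARE @PageNumber integer = 1,
        @PageSize   integer = 20000;

SELECT HashUID1, HashUID2, Address_Name
FROM
(
    SELECT HashUID1, HashUID2, Address_Name,
           rn = ROW_NUMBER() OVER (ORDER BY Address_Name)
    FROM dbo.hugetablewithemails
) AS T
WHERE rn >  (@PageNumber - 1) * @PageSize
  AND rn <= @PageNumber * @PageSize;
```

Note that each execution of this typically rescans the index from the start, which is exactly the repeated-scan cost I want to avoid.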

There is one index that maintains the Address_Name email addresses in order, and it is well maintained. For the past week I have been meaning to help another developer by spending some time looking into building a proc that spits out ranges based on windowing functions over statistics (which I am not great at, but this query really interested me), providing ranges on the first 1 through (variable) leading LIKE characters of the index that each cover 20,000 rows. But I have not had time to even start the query…

Couple questions:

  1. Any suggestions? Not looking for actual code, just some hints or suggestions based on experience, maybe caveats. I want to avoid additional index scans after the initial scan.

  2. Is this the right approach?

  3. I'm thinking of scanning the whole email-address index, gathering COUNT(*), dividing by 20,000, and using that in a windowing function to group MIN/MAX SUBSTRING(Address_Name, 1, 5) values into ranges based on percentages of the total rowcount. Thoughts?

This is for an ETL process that cannot modify source databases.

I am hoping with one full index scan I can do a:

  • Query to get a histogram based on index usage (alphabetically sorted) and break it out (windowed) using MIN/MAX to create ranges like this, so the needed index range can be seeked easily:

  • A → AAAX (20k rows, for example), AAAX → B (another 20k), B → BAAR, BAAR → CDEFG, CDEFG → FAAH, etc.
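The range-building idea above could be sketched with NTILE over a single ordered pass of the index (names are from the question; the bucket arithmetic is illustrative, not tested against the real table):

```sql
-- Sketch: derive ~20,000-row alphabetical ranges from one ordered scan
-- of the Address_Name index, bucketing with NTILE.
DECLARE @PageSize integer = 20000;
DECLARE @Buckets  integer;

-- Number of ranges needed = ceiling(rowcount / page size)
SELECT @Buckets = CEILING(COUNT_BIG(*) / (1.0 * @PageSize))
FROM dbo.hugetablewithemails;

-- One range (min/max address) per bucket
SELECT bucket,
       range_start   = MIN(Address_Name),
       range_end     = MAX(Address_Name),
       rows_in_range = COUNT_BIG(*)
FROM
(
    SELECT Address_Name,
           bucket = NTILE(@Buckets) OVER (ORDER BY Address_Name)
    FROM dbo.hugetablewithemails
) AS T
GROUP BY bucket
ORDER BY bucket;
```

The resulting (range_start, range_end) pairs could then drive index-seek range queries, at the cost of the one up-front scan and sort.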

We run READ COMMITTED in these databases for this ETL process. We are only attempting to batch it out in chunks of 20k rows because the DBAs say we are using too much network bandwidth by grabbing tables in full. If the data has changed (which is a concern), we update our DW and staging tables on the fly.

I would love to use temp tables, but if I did, I'd spill into tempdb and get lashings via e-mail from the DBAs about the database being too big.

Best Answer

Essentially, you are asking if you can perform a single ordered scan through the data overall, while making no copies of the data, and returning 'x' disjoint sets of rows from the full set on each call. This is exactly the behaviour of an appropriately-configured API cursor.

For example, using the AdventureWorks table Person.EmailAddress to return sets of 1,000 rows:

DECLARE 
    @cur integer,
    -- FAST_FORWARD | AUTO_FETCH | AUTO_CLOSE
    @scrollopt integer = 16 | 8192 | 16384,
    -- READ_ONLY, CHECK_ACCEPTED_OPTS, READ_ONLY_ACCEPTABLE
    @ccopt integer = 1 | 32768 | 65536, 
    @rowcount integer = 1000,
    @rc integer;

-- Open the cursor and return the first 1,000 rows
EXECUTE @rc = sys.sp_cursoropen
    @cur OUTPUT,
    N'
    SELECT *
    FROM AdventureWorks2012.Person.EmailAddress
        WITH (INDEX([IX_EmailAddress_EmailAddress]))
    ORDER BY EmailAddress;
    ',
    @scrollopt OUTPUT,
    @ccopt OUTPUT,
    @rowcount OUTPUT;

IF @rc <> 16 -- FastForward cursor automatically closed
BEGIN
    -- Name the cursor so we can use CURSOR_STATUS
    EXECUTE sys.sp_cursoroption
        @cur, 
        2, 
        'MyCursorName';

    -- Until the cursor auto-closes
    WHILE CURSOR_STATUS('global', 'MyCursorName') = 1
    BEGIN
        EXECUTE sys.sp_cursorfetch
            @cur,
            2,     -- fetchtype: NEXT
            0,     -- rownum (unused for NEXT)
            1000;  -- number of rows to fetch
    END;
END;

Each fetch operation returns a maximum of 1,000 rows, remembering the position of the scan from the previous call.
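If an API cursor cannot be used, a similar effect can be approximated with keyset ("seek") paging: the caller remembers the last key returned and the next query seeks past it, so each batch is one index seek plus a short range scan rather than a fresh full scan. A rough sketch against the same AdventureWorks table (ties on EmailAddress are assumed acceptable to break by EmailAddressID):

```sql
-- Keyset ("seek") paging sketch: seek to just past the last row returned.
-- @LastEmail/@LastID hold the position from the previous batch; start with
-- sentinel values for the first call.
DECLARE @LastEmail nvarchar(50) = N'',
        @LastID    integer      = 0;

SELECT TOP (1000)
       EmailAddressID, EmailAddress
FROM AdventureWorks2012.Person.EmailAddress
WHERE EmailAddress > @LastEmail
   OR (EmailAddress = @LastEmail AND EmailAddressID > @LastID)
ORDER BY EmailAddress, EmailAddressID;

-- Save the last (EmailAddress, EmailAddressID) returned for the next call.
```

Unlike the cursor, this re-seeks the index on each call instead of holding a position open, which trades a little extra seek work for not keeping any server-side state between batches.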