Sql-server – Leveraging an index on another column with the same ordering guarantee

indexsql server

I have a massive table that is only indexed by the auto-increment column, which is also populated with the current timestamp (which is not indexed).

CREATE TABLE MyTable
(
    ID BIGINT NOT NULL IDENTITY(1, 1) PRIMARY KEY
    ,AuditTimestampUtc DATETIME NOT NULL DEFAULT(GETUTCDATE())
    ,...
)

If I need to query by the creation timestamp of the row, how can I do it efficiently? Adding an index is not feasible because of how gargantuan the table is (hundreds of millions to billions of rows), as we cannot afford the downtime, and I am performing a rare debugging task in a readonly environment, which is essentially

SELECT [...] FROM MyTable WHERE AuditTimestampUtc BETWEEN @Start AND @End

I am trying to debug a novel issue, and have not had to do this task before, so I would have difficulty making the argument for creating a new index. And unfortunately there is quite a process for processing a request to create and sanitize a full database dump (especially given its size), or cloning it to another environment. I have an outdated dump to experiment with, but running the final query will be supervised through a read-only account on production.

Writing a custom binary search seems like overkill, especially in a RDBMS, but alas computers are not mind-readers, even though it is apparent to a person that the identity column can be used as a surrogate ordering* to efficiently search the table by creation time.

*Assuming nobody enables IDENTITY_INSERT to violate this temporal ordering guarantee.


P.S. I do not believe the database platform is very relevant to this question for the first time ever (ignoring the specific syntactical differences of declaring the index/default constraint/etc) but I am using SQL Server.

Best Answer

You could create a temporary table containing just the two columns you're interested in, using the key column as a pointer into the "real" table. Something like:

CREATE TABLE #t
(
    ID int NOT NULL
    , AuditTimestampUtc datetime NOT NULL
);

Insert rows from MyTable like this:

INSERT INTO #t WITH (TABLOCKX) (ID, AuditTimeStampUtc)
SELECT mt.ID
    , mt.AuditTimeStampUtc
FROM dbo.MyTable mt

Then create an index on the table like this:

CREATE CLUSTERED INDEX t_AuditTimeStampUtc
ON #t (AuditTimeStampUtc);

Now, you should be able to query the "real" table, making use of the index on the #temp table, as in:

SELECT <columns from mt>
FROM dbo.MyTable mt
    INNER JOIN #t t ON mt.ID = t.ID
WHERE t.AuditTimeStampUtc >= '2019-06-01 00:00:00'
    AND t.AuditTimeStampUtc < '2019-07-01 00:00:00'

The query above will, probably, do an index seek on the non-clustered index t_AuditTimeStampUtc, with a nested loops join into MyTable. This may be faster than just querying the original table. Especially if you need to do multiple queries like this against MyTable.

Copying data from a large table might seem like a bad idea. If the original table had only the two columns, then yes, I'd agree it's a dumb thing to do. However, if MyTable has many columns, the temp table will only occupy a small fraction of the space of the main table, and will be much more efficient.