I have a massive table that is only indexed by the auto-increment column, which is also populated with the current timestamp (which is not indexed).
CREATE TABLE MyTable
(
ID BIGINT NOT NULL IDENTITY(1, 1) PRIMARY KEY
,AuditTimestampUtc DATETIME NOT NULL DEFAULT(GETUTCDATE())
,...
)
If I need to query by the creation timestamp of the row, how can I do it efficiently? Adding an index is not feasible because of how gargantuan the table is (hundreds of millions to billions of rows), as we cannot afford the downtime, and I am performing a rare debugging task in a readonly environment, which is essentially
SELECT [...] FROM MyTable WHERE AuditTimestampUtc BETWEEN @Start AND @End
I am trying to debug a novel issue, and have not had to do this task before, so I would have difficulty making the argument for creating a new index. And unfortunately there is quite a process for processing a request to create and sanitize a full database dump (especially given its size), or cloning it to another environment. I have an outdated dump to experiment with, but running the final query will be supervised through a read-only account on production.
Writing a custom binary search seems like overkill, especially in a RDBMS, but alas computers are not mind-readers, even though it is apparent to a person that the identity column can be used as a surrogate ordering* to efficiently search the table by creation time.
*Assuming nobody enables IDENTITY_INSERT
to violate this temporal ordering guarantee.
P.S. I do not believe the database platform is very relevant to this question for the first time ever (ignoring the specific syntactical differences of declaring the index/default constraint/etc) but I am using SQL Server.
Best Answer
You could create a temporary table containing just the two columns you're interested in, using the key column as a pointer into the "real" table. Something like:
Insert rows from MyTable like this:
Then create an index on the table like this:
Now, you should be able to query the "real" table, making use of the index on the #temp table, as in:
The query above will, probably, do an index seek on the non-clustered index
t_AuditTimeStampUtc
, with a nested loops join into MyTable. This may be faster than just querying the original table. Especially if you need to do multiple queries like this againstMyTable
.Copying data from a large table might seem like a bad idea. If the original table had only the two columns, then yes, I'd agree it's a dumb thing to do. However, if
MyTable
has many columns, the temp table will only occupy a small fraction of the space of the main table, and will be much more efficient.