I have a huge table RECORD
— a few billion rows — which contains data loaded continuously in many batches. I can either assign a primary key RECORD_ID
from a sequence, or use the natural composite key BATCH_NUM + RECORD_NUM
, which gives me a nice, unfragmented table where a batch with thousands of records spans only a handful of physical blocks. That locality is useful for the next, batch-aware step of processing.
But later I need to group together rows from arbitrary batches based on another field, some RECORD_KEY
. There are hundreds of millions of distinct RECORD_KEY
values spread across the rows of the table, so if I want to pick 10 rows with the same RECORD_KEY
, that will almost certainly require retrieving 10 different physical blocks. Due to the table size these blocks cannot easily be cached either, so it means 10 physical reads (plus the overhead of the index traversal). This is obviously very slow.
Example:
Table RECORD:
- RECORD_ID is the primary key, populated from a sequence
- BATCH_NUM and RECORD_NUM have a unique constraint (and could be the primary key)
- RECORD_KEY is an indexed column
Sample data:
| RECORD_ID | BATCH_NUM | RECORD_NUM | RECORD_KEY |
|-----------|-----------|------------|------------|
| 1         | 1         | 1          | 987654321  |
| 2         | 1         | 2          | 876543219  |
| 3         | 1         | 3          | 765432198  |
| 4         | 2         | 1          | 654321987  |
| 5         | 2         | 2          | 543219876  |
| ...       | ...       | ...        | ...        |
| 100000006 | 3000003   | 2          | 432198765  |
| 100000007 | 3000003   | 3          | 876543219  |
| 100000008 | 3000003   | 4          | 321987654  |
| ...       | ...       | ...        | ...        |
| 200000009 | 6000004   | 3          | 219876543  |
| 200000010 | 6000004   | 4          | 876543219  |
| 200000011 | 6000004   | 5          | 198765432  |
| ...       | ...       | ...        | ...        |
This SQL command will be fast because it will only have to retrieve one physical block:
select RECORD_ID from RECORD where BATCH_NUM = 1
This SQL command will be slow because it will have to retrieve three physical blocks — one for each retrieved row:
select RECORD_ID from RECORD where RECORD_KEY = 876543219
EDIT:
The above is just an example. Typically I would have:
- thousands of rows per BATCH_NUM
- tens of rows per RECORD_KEY
- thousands of rows per lookup, by BATCH_NUM or RECORD_KEY
SQLs look like this, both retrieving around 1000 rows:
select RECORD_ID from RECORD where BATCH_NUM = 123456
select RECORD_ID from RECORD where RECORD_KEY in (
select COLUMN_VALUE from TABLE(record_key_tbl) -- 100 values
)
Both execution plans look reasonable and similar, each using the respective index. However, while the BATCH_NUM lookup executes in under 1 second, the RECORD_KEY lookup takes around 20 seconds.
Options which I considered:
- Partitioning: I could hash-partition on RECORD_KEY. That would put the records a bit closer together, increasing the chance of them sharing a block, and would enable some partition-wise joins. It might help a bit but would not fully resolve the problem.
- Index-organized table: Since an IOT requires a primary key, I would have to make the primary key RECORD_KEY + RECORD_ID. The lookup on BATCH_NUM would then become slow.
- Re-inserting the rows: A rather ugly solution, where I would regularly select rows ordered by RECORD_KEY, delete them, and insert-append them, effectively placing them next to each other.
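For reference, the hash-partitioning variant from the first option might be sketched like this (column types and the partition count are assumptions, not recommendations from the question):

```sql
-- Sketch only: hash-partitioned variant of the RECORD table.
-- Types and the partition count (64) are illustrative assumptions.
CREATE TABLE RECORD (
  RECORD_ID   NUMBER NOT NULL,
  BATCH_NUM   NUMBER NOT NULL,
  RECORD_NUM  NUMBER NOT NULL,
  RECORD_KEY  NUMBER NOT NULL,
  CONSTRAINT RECORD_PK PRIMARY KEY (RECORD_ID),
  CONSTRAINT RECORD_UK UNIQUE (BATCH_NUM, RECORD_NUM)
)
PARTITION BY HASH (RECORD_KEY) PARTITIONS 64;
```

Note that this only narrows the search: rows sharing a RECORD_KEY land in the same partition, but within that partition they are still scattered across blocks, which is why partitioning alone does not resolve the problem.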
Please note that I am not an experienced DBA, just a developer stuck with this non-trivial DB problem.
Best Answer
Create an index on
(record_key, record_id)
. This way Oracle will most likely answer your slow-running query from the index alone. You can add more columns to that index so that it also covers the content you need; hopefully you don't need all the columns. Be careful about what impact this index has on your load performance.
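A sketch of that index, using the column names from the question (the index name is an arbitrary assumption):

```sql
-- Covering index: when the query touches only indexed columns,
-- Oracle can answer it from the index without visiting the table.
CREATE INDEX RECORD_KEY_IX ON RECORD (RECORD_KEY, RECORD_ID);
```

With this index in place, a query like `select RECORD_ID from RECORD where RECORD_KEY = :k` can be satisfied by an index range scan alone: the entries for a given RECORD_KEY are stored contiguously in the index leaf blocks, so the scattered table-block reads disappear for this query.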