Partitioning vs Indexes

indexoraclepartitioning

I'm very new to databases, so I apologize if my question is very basic…

Anyways, I'm creating a table with what seems like alot of data (500 million rows right now, and potentially twice as much in the future). Now, I need to have a way to access the data in this table quickly, so I'm researching partitions and indexes. However, I'm getting confused on when I'd want to create a partition vs. an index. I have three columns that seem like reasonable candidates for partitioning or indexing:

  • Time (day or week, data spans a 4 month period)
  • Customer "Token": The last two digits of a user's ID
  • Client ID (typically will have 5-20 clients in the dataset)

When running future selects against this table, it's likely that I'll be filtering on client id as well as wanting to do some sampling (which I would like to do by the "Token" variable). I may occasionally be filtering by the time variable as well.

So, my question is this: how should I organize my table? Should I partition by Client and Token, and then create an index on time? Or just partition on client and create indices on time and token? And, more importantly, what is the logic behind the strategy you recommend?

Also, after I've created the table, will the indexes break if I add more data into it (particularly new clients for the same date/token range)? Is recreating an index relatively simple?

Thanks so much for your help, and please let me know if you need any more information from me.

Best Answer

To put it simply, indexes allow fast access to small proportions of a table. This is because they access data that is scattered throughout many block in the data segment, so unless the rows you are looking for are clustered into a small number of blocks the total cost of accessing all of those single blocks will soon become greater than just scanning a table.

At absolute best you might get benefits from an index if you are accessing 20% of a table's rows, but it's more likely that 1-5% is an effective limit.

If you want to access a greater proportion effectively, say 10% of the rows, then if you can use a partitioning scheme to isolate "queryable" groups into partitions then they can be queried very quickly. Even if you are accessing 1% of a table's rows then if the query can retrieve them from a partition that holds just those rows with a full partition scan then that will be quicker than accessing them through an index -- around 1/100th of the time to perform a full table scan (ignoring parallel query).

So if your queries very often include a predicate on the client id then partition on that -- list partitioning, I would suggest. If you also query on ranges of dates, then consider range partitioning on the time column.

So you could composite partition by range-list, or by list-range. If you want to drop old data easily then range-list could be better, but I'm not sure that there's a lot to choose between them.

The token sounds like it could be a good candidate for an index. Indexes are self-maintaining, and modifying data will not invalidate them. Obviously there's an overhead to maintaining them, but data is generally queried many time more often than it is modified so on balance I tend to "index first, ask questions later".