Partitioning vs Indexes

indexoraclepartitioning

I'm very new to databases, so I apologize if my question is very basic…

Anyways, I'm creating a table with what seems like alot of data (500 million rows right now, and potentially twice as much in the future). Now, I need to have a way to access the data in this table quickly, so I'm researching partitions and indexes. However, I'm getting confused on when I'd want to create a partition vs. an index. I have three columns that seem like reasonable candidates for partitioning or indexing:

Time (day or week, data spans a 4 month period)
Customer "Token": The last two digits of a user's ID
Client ID (typically will have 5-20 clients in the dataset)

When running future selects against this table, it's likely that I'll be filtering on client id as well as wanting to do some sampling (which I would like to do by the "Token" variable). I may occasionally be filtering by the time variable as well.

So, my question is this: how should I organize my table? Should I partition by Client and Token, and then create an index on time? Or just partition on client and create indices on time and token? And, more importantly, what is the logic behind the strategy you recommend?

Also, after I've created the table, will the indexes break if I add more data into it (particularly new clients for the same date/token range)? Is recreating an index relatively simple?

Thanks so much for your help, and please let me know if you need any more information from me.

Best Answer

To put it simply, indexes allow fast access to small proportions of a table. This is because they access data that is scattered throughout many block in the data segment, so unless the rows you are looking for are clustered into a small number of blocks the total cost of accessing all of those single blocks will soon become greater than just scanning a table.

At absolute best you might get benefits from an index if you are accessing 20% of a table's rows, but it's more likely that 1-5% is an effective limit.

If you want to access a greater proportion effectively, say 10% of the rows, then if you can use a partitioning scheme to isolate "queryable" groups into partitions then they can be queried very quickly. Even if you are accessing 1% of a table's rows then if the query can retrieve them from a partition that holds just those rows with a full partition scan then that will be quicker than accessing them through an index -- around 1/100th of the time to perform a full table scan (ignoring parallel query).

So if your queries very often include a predicate on the client id then partition on that -- list partitioning, I would suggest. If you also query on ranges of dates, then consider range partitioning on the time column.

So you could composite partition by range-list, or by list-range. If you want to drop old data easily then range-list could be better, but I'm not sure that there's a lot to choose between them.

The token sounds like it could be a good candidate for an index. Indexes are self-maintaining, and modifying data will not invalidate them. Obviously there's an overhead to maintaining them, but data is generally queried many time more often than it is modified so on balance I tend to "index first, ask questions later".

Related Solutions

SQL Server Partitioning – Partitioning and Clustered Indexes in SQL Server 2008

A partitioned table is really more like a collection of individual tables stitched together. So your in example of clustering by IncidentKey and partition by IncidentDate, say that the partitioning function splits the tables into two partitions so that 1/1/2010 is in partition 1 and 7/1/2010 is partition two. The data will be layed out on disk as:

Partition 1:
IncidentKey    Date
ABC123        1/1/2010
ABC123        1/1/2011
XYZ999        1/1/2010

Partition 2:
IncidentKey    Date
ABC123        7/1/2010
XYZ999        7/1/2010

At a low level there really are two, distinct rowsets. Is the query processor that gives the illusion of a single table by creating plans that seek, scan and update all rowsets together, as one.

Any row in any non-clustered index will have have the clustered index key to which it corresponds, say ABC123,7/1/2010. Since the clustered index key always contains the partitioning key column, the engine will always know in what partition (rowset) of the clustered index to search for this value (in this case, in partition 2).

Now whenever you're dealing with partitioning you must consider if your NC indexes will be aligned (NC index is partitioned exactly the same as the clustered index) or non-aligned (NC index is non-partitioned, or partitioned differently from clustered index). Non-aligned indexes are more flexible, but they have some drawbacks:

non-aligned indexes require large amounts of memory for certain query plans
non-aligned indexes prevent efficient partition switch operations

Using aligned indexes solves these issues, but brings its own set of problems, because this physical, storage design, option ripples into the data model:

aligned indexes mean unique constrains can no longer be created/enforced (except for the partitioning column)
all foreign keys referencing the partitioned table must include the partitioning key in the relation (since the partitioning key is, due to alignment, in every index), and this in turn requires that all tables referencing the partitioned table contain partitioning key column value. Think Orders->OrderDetails, if Orders have OrderID but is partitioned by OrderDate, then OrderDetails must contain not only OrderID, but also OrderDate, in order to properly declare the foreign key constraint.

These effects I found seldom called out at the beginning of a project that deploys partitioning, but they exists and have serious consequences.

If you think aligned indexes are a rare or extreme case, then consider this: in many cases the cornerstone of ETL and partitioning solutions is the fast switch in of staging tables. Switch in operations require aligned indexes.

Oh, one more thing: all my argument about foreign keys and the ripple effect of adding the partitioning column value to other tables applies equally to joins.

Does ‘update global indexes’ not work if the index is already unusable

The update [global] indexes option does not rebuild indexes after whatever operation you ran. It maintains indexes during the operation (see Updating Indexes Automatically), so that they remain available throughout.

After your first truncate, the index needs to be rebuilt - it can't be updated, it's already "broken" and unavailable. Oracle won't rebuild it automatically for the next operation you do on the table, even if you include the update clause.

Best Answer

Related Solutions

SQL Server Partitioning – Partitioning and Clustered Indexes in SQL Server 2008

Does ‘update global indexes’ not work if the index is already unusable

Related Question