Are there significant differences with 500+ million row tables in Oracle

oracle-11g-r2

I am a database designer in a data warehouse environment. I am used to dealing with tables of at most 1 million rows and am now faced with tables of more than half a billion rows. Are there any significant differences with the tools in the "efficiency toolbox"? Can I trust my previous knowledge of indexes, partitions, and the like, or are some of these familiar tools more of a hindrance than a help with data this large? Any other tips for dealing with such tables?

(Already found a great post on Updating 700 million rows to same value)

Best Answer

The fundamentals of indexing etc. all work in exactly the same way, so strictly speaking the only difference is the cost of getting this wrong!

That said, here's a (not necessarily complete) list of things worth bearing in mind:

  • B-tree indexes are likely to have an extra level in them, so the cost of using them is slightly higher. However, in a DW you should generally be using bitmap indexes (assuming you've got Enterprise Edition) - see the bitmap index sketch after this list.
  • It'll take a lot longer to gather stats for the whole table - to the point where it may not be possible in the normal overnight window. This can be mitigated by (see the DBMS_STATS sketch after this list):
    • Using a smaller estimate_percent when gathering stats, so less of the table is sampled.
    • Using incremental stats gathering (only relevant for partitioned tables, where global stats can be derived from per-partition synopses instead of re-scanning the whole table).
  • Histograms are limited to 254 buckets (they are gathered on columns, not indexes). More rows likely means more distinct values, so "nearly popular" values can be a bigger issue with skewed data.
  • The chance that your whole table will fit in the buffer cache approaches nil, meaning you're likely to see many more physical (disk) reads. Your normal working set may also be too large to be cached.
  • Partitioning can be your friend - if you get it right! If your typical statements modify and query data across many partitions, it can cost you more than a plain table (see the partitioning sketch after this list).
  • Materialized views can be very useful for reducing your working set. E.g. if you have 10+ years' worth of data but the vast majority of user queries only touch the past 2 years, an MV limited to just that data can be a big help (see the materialized view sketch after this list).
  • The bigger the database, the less likely it is the business will (be able to) fund a test database that is a full duplicate of the live environment. This makes it harder to reproduce performance issues in test, as slow queries may be due to the scale and/or physical storage of the data. You can't count on being able to extrapolate query performance from a much smaller test DB to the corresponding behaviour in live.
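
A minimal sketch of the bitmap index point above. The table and column names (sales_fact, channel_id) are made up for illustration, and the LOCAL clause assumes the table is partitioned:

    -- Bitmap indexes suit low-cardinality dimension keys in a DW (Enterprise Edition only)
    CREATE BITMAP INDEX sales_fact_channel_bix
        ON sales_fact (channel_id)
        LOCAL;   -- one index segment per table partition; omit LOCAL if the table isn't partitioned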
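
For the stats point, something along these lines (the DW schema and SALES_FACT table are hypothetical; AUTO_SAMPLE_SIZE lets Oracle choose the sample size, and the INCREMENTAL preference only matters for partitioned tables):

    BEGIN
        -- Maintain global stats from per-partition synopses instead of re-scanning everything
        DBMS_STATS.SET_TABLE_PREFS(
            ownname => 'DW',
            tabname => 'SALES_FACT',
            pname   => 'INCREMENTAL',
            pvalue  => 'TRUE');

        DBMS_STATS.GATHER_TABLE_STATS(
            ownname          => 'DW',
            tabname          => 'SALES_FACT',
            estimate_percent => DBMS_STATS.AUTO_SAMPLE_SIZE,
            granularity      => 'AUTO',
            degree           => 8);   -- gather in parallel to fit the batch window
    END;
    /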
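
For partitioning, a range/interval layout keyed on the date most queries filter on is the usual starting point (column names again made up). Queries with a sale_date predicate then only touch the relevant monthly partitions:

    CREATE TABLE sales_fact (
        sale_id    NUMBER,
        sale_date  DATE NOT NULL,
        channel_id NUMBER,
        amount     NUMBER(12,2)
    )
    PARTITION BY RANGE (sale_date)
    INTERVAL (NUMTOYMINTERVAL(1, 'MONTH'))   -- new monthly partitions created automatically
    (
        PARTITION p_initial VALUES LESS THAN (DATE '2010-01-01')
    );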
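
And for the materialized view point, a rough sketch of a summary limited to recent data (the cut-off is a literal here purely for illustration; in practice you'd roll it forward as part of regular maintenance):

    CREATE MATERIALIZED VIEW sales_recent_mv
        BUILD IMMEDIATE
        REFRESH COMPLETE ON DEMAND     -- refresh after each load
        ENABLE QUERY REWRITE           -- lets the optimizer use it transparently
    AS
    SELECT channel_id,
           TRUNC(sale_date, 'MM') AS sale_month,
           SUM(amount)            AS total_amount
    FROM   sales_fact
    WHERE  sale_date >= DATE '2010-01-01'
    GROUP  BY channel_id, TRUNC(sale_date, 'MM');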

If you're not already comfortable reading and understanding execution plans, I'd spend some time learning them: you're bound to run into performance issues at some point, and knowing how to diagnose the problem correctly becomes more important when adding new indexes or making schema changes is expensive at these row counts.
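
As a starting point, EXPLAIN PLAN plus DBMS_XPLAN is enough to check whether partition pruning and the expected indexes are being used (the query below is just an example against the hypothetical sales_fact table):

    EXPLAIN PLAN FOR
    SELECT channel_id, SUM(amount)
    FROM   sales_fact
    WHERE  sale_date >= DATE '2011-01-01'
    GROUP  BY channel_id;

    SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);

    -- For the last statement actually executed, including run-time row counts
    -- (needs the GATHER_PLAN_STATISTICS hint or STATISTICS_LEVEL = ALL):
    SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL, NULL, 'ALLSTATS LAST'));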