100 TB Capacity Database – Resources and Time Estimates

data-warehouse, database-recommendation, oracle

I am working on a 'back of the envelope' calculation for a 100TB reporting database setup and am seeking thoughts from the experts here. Proposed environment:

  1. Storage Capacity ~ 100TB
  2. Tables ~ 200, with sizes ranging from 1GB to 5TB; the mean size could lie between 100GB and 200GB.
  3. ETL – jobs may require joins between tables with tens of millions of rows, with join keys ranging from 10 bytes to 500 bytes. Such joins should finish within 2-5 minutes.
  4. Live Selects – initially, I am only interested in select speeds; the system should support 500 selects/second (see the rough arithmetic sketched after this list). Updates per second will be a much smaller number and can be ignored for this exercise.
  5. Need 24×7 availability. Two independent DB servers should be available to serve select calls (with data replicated).
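
For reference, this is the rough arithmetic I am working from on the I/O side; the rows-per-select, row-width and join-size figures below are my own assumptions, not measurements:

    # Rough 'back of the envelope' I/O sizing. All per-select and per-row
    # figures are assumptions for illustration only.
    SELECTS_PER_SEC = 500
    ROWS_PER_SELECT = 1_000   # assumed: indexed access, not table scans
    BYTES_PER_ROW   = 500     # assumed average row width

    select_mb_per_sec = SELECTS_PER_SEC * ROWS_PER_SELECT * BYTES_PER_ROW / 1e6
    print(f"Select workload ~ {select_mb_per_sec:,.0f} MB/s of logical reads")

    # Assumed ETL join: two inputs of ~50 million rows, ~200 bytes per row,
    # finishing within the 5-minute upper bound.
    JOIN_ROWS, JOIN_ROW_BYTES, JOIN_WINDOW_S = 50_000_000, 200, 5 * 60
    join_mb_per_sec = 2 * JOIN_ROWS * JOIN_ROW_BYTES / JOIN_WINDOW_S / 1e6
    print(f"One ETL join ~ {join_mb_per_sec:,.0f} MB/s sustained scan rate")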

Questions:

  1. At present, I am looking at Oracle. How has your experience been with other commercial or open-source solutions for large databases?
  2. What hardware/OS combination have you seen work best? I am planning on Linux on Dell hardware.
  3. Is network storage, such as NetApp, a must? What issues do you foresee with using commercial off-the-shelf disks?
  4. Once the hardware and OS are ready, how much time would you set aside to set up and configure the DB, storage, etc.?
  5. What team compositions have worked best in the environments you have observed? I mean the various admins (OS admin, Oracle DBA?) required to manage and operate such a setup. How many of them might be needed to achieve 24×7 uptime?
  6. Any approximation/range for DB license and network storage costs?

I know I don't have all the environment details. I am not looking for exact figures; an approximation is sufficient. Though some of these questions might be best answered by managers, I am interested in the admins' perspective. I appreciate your input.

Best Answer

First impressions

  1. Depending on your performance requirements, 100TB is a fairly aggressive data volume. If you want Oracle, you should check out their Exadata systems. Also take a look at the offerings from Netezza or Teradata. With that volume of selects you might want to look at an OLAP-based front end, or at least fairly aggressive use of materialised views and query rewrite (a minimal sketch of one follows this list). You won't get 500 table scans/sec out of anything.

    For stuff with less stringent latency requirements you might want to consider a larger number of data marts to provide the reporting capacity to your user community. In this case, SQL Server and SSAS might be an option for the data marts as the licensing on a larger number of servers will be cheaper than trying to do the same with Oracle.

  2. See (1). Conventional hardware on a shared-disk architecture is likely to be slow on a data set of this size.

  3. NO! If anybody suggests NFS, give them a good kicking. Use either direct-attach storage or a multi-controller SAN with lots of mid-range controllers. Think in terms of maybe a couple of dozen MD3000-series controllers or something similar, if you don't go for a purpose-built 'big data' platform.

  4. Get a storage specialist with experience in PB-range data warehouse platforms. You're probably up for a significant ETL development job, and a lot of testing work if you have to meet a stiff SLA.

  5. 24×7 on a data warehouse is ambitious at the best of times. Is this an operational reporting platform? Perhaps you might elaborate on your requirements a bit.

  6. Sphincter-puckeringly expensive, and dependent on your performance requirements. Last I saw (a couple of years ago), Netezza quoted $20,000/TB for TwinFin systems, making your platform $2m for 100TB, plus the cost of your redundant server and backup hardware.

    Take a look at Netezza, Exadata and Teradata platforms for comparison, and get costings for Ab Initio as an ETL tool.
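
To make the materialised-view suggestion in point 1 slightly more concrete, here is a minimal sketch of one summary materialised view created with query rewrite enabled, issued through the python-oracledb driver. The schema (fact_sales, dim_date and their columns), connection details and refresh options are placeholder assumptions, not a recommendation for your actual design:

    # Minimal sketch only: one summary materialised view with query rewrite,
    # created via python-oracledb. All object names, column names and
    # connection details are placeholders.
    import oracledb

    MV_DDL = """
    CREATE MATERIALIZED VIEW mv_sales_by_day
        BUILD IMMEDIATE
        REFRESH COMPLETE ON DEMAND
        ENABLE QUERY REWRITE
    AS
    SELECT d.calendar_date,
           s.product_id,
           SUM(s.amount) AS total_amount,
           COUNT(*)      AS row_count
    FROM   fact_sales s
    JOIN   dim_date   d ON d.date_key = s.date_key
    GROUP  BY d.calendar_date, s.product_id
    """

    with oracledb.connect(user="report_owner", password="...", dsn="dbhost/reporting") as conn:
        with conn.cursor() as cur:
            # Query rewrite is on by default in recent releases; set it
            # explicitly so it is not silently disabled for this session.
            cur.execute("ALTER SESSION SET QUERY_REWRITE_ENABLED = TRUE")
            cur.execute(MV_DDL)

Whether a handful of such summaries (or an OLAP layer on top of them) can absorb 500 selects/sec depends entirely on how repetitive the query workload turns out to be.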

This is a fairly aggressive set of requirements - 24×7 on a data warehouse isn't normally done, and the data volumes are big enough to put you in the realm of a 'big data' platform. If you have an operational reporting requirement, look carefully at what that is, and keep it separate from your analytics unless you have a specific reason (e.g. a low-latency market data feed) not to. Mixing operational and analytic requirements on the same platform is bad mojo.

I think you really need to bring in specialists to evaluate your requirements. Without a closer look at what you are trying to achieve, all I can offer are some empirical suggestions about what to do or not to do.