We have a tenanted data warehouse that we are doing reporting on. Queries are beginning to take a long time, and we're looking at options to reduce this. There are two thoughts at the moment.
-
Create tenant specific aggregate tables, and query from those.
-
Horizontally partition the data based on tenant.
The first option means that for every new tenant that comes onboard, we'll need to create a new set of tables. This isn't that difficult, as a new tenant signing up comes with a few weeks notice, and a lack of reporting will be revealed early on if we forget.
Partitioning the data to me sounds like a better approach, since we're not duplicating the data. We don't have to rely on a process to transfer new data to the aggregate tables.
I'm wondering which of these options would be better, if anyone has had similar experiences before. Does partitioning the data actually help? Or would it be not so much different from leaving all the data in one 'space'?
And, with Oracle 10g, how does one horizontally partition data? If I had the following table:
TABLE Transaction(id, tenant_id, a, b, c, d)
We'll be migrating to Oracle 11g soon, so any differences in partitioning across versions would be appreciated.
(Note: I tried to use a partitioning tag, but not enough rep, if someone else could add a tag that'd be cool)
Best Answer
To answer your second question first: yes you should partition. Oracle's query optimizer has a feature called partition elimination, which will check the predicate for the partition and only execute the SQL on the appropriate partitions.
Partitioning also leaves all the data in one space. Conceptually, think of it as many tables of identical structure, with an implicit
UNION ALL
between them if you were to do aSELECT
from the entire table. Except "under the hood" Oracle sorts the actual rows into the right "table" based on the criteria you specify. Any rows that come in that don't match any of the criteria, go into what's known as the "default" partition.For what you want to do, a "range partition" might be a good approach (so you can add more tenants later), e.g.:
Then later
This will create something that looks and behaves just like a normal table, but actually rows where tenant_id=1 will be in a partition in tablespace ts_tenant1, and queries will ignore all other partitions. Queries across the entire table can run in parallel on each partition. If tenant_id=4 in this scenario, the row will live in ts_default unless you add the new partition as shown, but the
INSERT
won't be rejected because there's no partition for it!FWIW At my site we use partitioned tables in our 40Tb DW, you don't need to worry about this approach scaling or performing, if you choose your partitioning strategy well (e.g. you could partition on tenant_id then subpartition on month perhaps), create the right indexes, and so on.