Storing reports with a lot of rows in relational database (Oracle)

oracleperformanceschema

I am setting up a relational database (Oracle) to store some reports with the following specifications:

Each report has the same columns (say 20 columns of strings or numbers). I expect the columns to be rarely changed but they might change.
Each report has a lot of rows (~200k).
The report generation frequency is like once every few days.

I have this schema in mind where report_id and row_id is the composite primary key:

+---------------+---------------+-----+
|    Column     |     Type      |     |
+---------------+---------------+-----+
| report_id     | uuid (string) | KEY |
| row_id        | number        | KEY |
| data_column_1 | string        |     |
| data_column_2 | number        |     |
| ...           | ...           |     |
+---------------+---------------+-----+

To retrieve a report, I will do something like

SELECT * FROM report_rows WHERE report_id = 'SOME-UUID';

Is this an appropriate design? Will performance suffer when there are more reports? If so, is there a better schema / other database to store data like this?

Thanks.

Best Answer

Some things to consider are;

How is the data populated, in one operation or over a period of time ?
How you access the data, e.g. how does a user know that report_id 32645 is the one they want ?
Will you need the data in a different order ?
Will one reports data ever be correlated with another reports data ?
Will data for a report_id ever need to be removed ? If so how do you identify it ?
If you use generic column names be prepared to confuse some people.

This sort of solution CAN work but like any solution it has some limitations (some of which can be seen from the above questions). Without knowing what sort of reports are stored its very hard to guess but in my experience reports tend to change quite frequently based on a customers needs.

They will see some data and that will lead them on the path to refining and expanding what they want to see - and that's a good thing.

Ideally you store the data in a way that reflects how it sits in the organisation (as you would with any modelling) and the reporting solution sits above that pulling out the information in whatever way it wants. This is what a data warehouse/reporting database is designed for.

Related Solutions

Sql-server – Efficiently storing sets of key-value pairs with wildly different keys

So what I am looking for is a way to store a large number of activities that have almost no fields in common in a way that makes reporting easier.

Not enough rep to comment first, so here we go!

If the primary purpose is reporting and you have a DW (even if it isn't star schema) I'd recommend attempting to get this into a star schema. The benefits are fast, simple queries. The downside is ETL, but you're already considering moving the data to a new design and ETL to star schema is likely simpler to build and maintain than an XML wrapper solution (and SSIS is included in your SQL Server licensing). Plus it starts the process of a recognized reporting/analytics design.

So how to do that... It sounds like you have what is known as a Factless Fact. This is an intersection of attributes that define an event with no associated measure (such as a sales price). You have dates available for some or all of your activities? Likely you should really have an intersection of an Activity, Site, and Date(s).

DimActivity - I'm guessing there is a pattern, something that can allow you to break these down into at least relatively shared columns. If so, you may have three? five? dimensions for classes of activities. At worst you have a couple consistent columns, such as activity name, you can filter on, and you leave general headings such as "Attribute1" etc. for the remaining random details.

You don't need everything in the dimension - there (likely) shouldn't be any dates in the Activity dimension - they should all be in the fact, as Surrogate Key references to the Date dimension. As an example, a Date that would stay in a person dimension would be a date of birth because it's an attribute of a person. A hospital visit date would reside in a fact, as it is a point in time event associated with a person, among other things, but it is not an attribute of the person visiting the hospital. More date discussion in the fact.

DimSite - seems straight forward, so we'll describe Surrogate Keys here. Essentially this is just an incrementing, unique ID. Integer Identity column is common. This allows separation of DW and source systems and ensures optimal joins in the data warehouse. Your Natural Key or Business Key is usually kept, but for maintenance/design not analysis and joins. Example schema:

CREATE TABLE [DIM].[Site]
(
 SiteSK INT NOT NULL IDENTITY PRIMARY KEY
,SiteNK INT NOT NULL --source system key
,SiteName VARCHAR(500) NOT NULL
)

DimDate - date attributes. Make a "smart key" instead of an Identity. This means you can type a meaningful integer that relates to a date for queries such as WHERE DateSK = 20150708. There are lots of free scripts to load DimDate and most have this smart key included. (one option)

DimEmployee - your XML included this, if it is more general change to DimPerson, and fill with relevant person attributes as they are available and pertinent to reporting.

And your fact is:

FactActivitySite
DimSiteSK - FK to DimSite
DimActivitySK - FK to DimActivity
DimEmployee - FK to DimEmployee
DimDateSK - FK to DimDate

You can Rename these in the Fact, and you can have multiple date keys per event. Facts are typically very large so avoiding updates is typically good... if you have multiple date updates to a single event you may want to try a Delete/Insert design by adding a SK to the fact which allows selection of "update" rows to be deleted then inserting latest data.

Expand your Fact dates to whatever you need: StartDateSK, EndDateSK, ScheduledStartDateSK.

All dimensions should have an Unknown row typically with a hardcoded -1 SK. When you load the fact, and an activity doesn't have any of the included Dates it should simply load a -1.

The fact is a collection of integer references to your attributes stored in the dimensions, join them together and you get all your details, in a very clean join pattern, and the fact, due to it's data types, is exceptionally small and fast. Since you are in SQL Server, add a columnstore index to increase performance further. You can just drop it and rebuild during ETL. Once you get to SQL 2014+ you can write to columnstore indexes.

enter image description here

If you go this route research Dimensional Modelling. I'd recommend Kimball methodology. There are lots of free guides out there too, but if this will be anything other than a one off solution, the investment is likely worth it.

PostgreSQL – Nullable Columns vs JSONB for Storing User Profile

You can use "best of both worlds". Implement every regular column you know of as dedicated column (with exact proper data type, possibly with NOT NULL / CHECK / PK / FK etc. constraints to enforce detailed rules, default values and comment for documentation) and add a "catch-all" jsonb column for very rare cases / later additions / highly volatile attributes that do not need any of the features mentioned above.

To be clear: "highly volatile" does not mean the value is updated a lot, which would clearly speak against any document type like jsonb, making updates more complex and expensive. It means the attribute per se (on the meta level) changes a lot (different names, rare and and pops in and out of existence over time or similar).

Most people underestimate how cheap NULL storage really is:

Best Answer

Related Solutions

Sql-server – Efficiently storing sets of key-value pairs with wildly different keys

PostgreSQL – Nullable Columns vs JSONB for Storing User Profile

Related Question