Database Partitioning – Merging Tables with Partially Overlapping Columns

join;partitioningunionview

I opted for MonetDB for my static data, but this might be less of a MonetDB-specific question about features than about RDBMS (or just SQL) in general.

I have data with a few "cross-sectional" tables (e.g. the parents of each individual) and many longitudinal/panel tables in annual chunks (e.g. tax records for each year separately).

I wonder if it is a good idea to merge/union the annual tables into long panels.

The benefits seems to be a much cleaner structure for the consistent columns, correspondingly easier code to write, fewer loops, tables names, lines of code. Also, fast joins on individual IDs and years across the tables without intermediary tables or views.

However, the downsides are that not all columns are consistent across years, and there are simply some that appear only in certain years. Storing missing values for other years for these variables is a waste of disk space and memory. I would also spend time on how to merge tables when there is a (rare) type change over the years (usually strings vs. TINYINTs). The merged tables could be too big anyway: think of tables ~70 GB in some cases, while my server has 128 GB RAM. Many use cases would focus on a subset of years anyway.

Actually some links I found were about how partitioning would improve performance (at least if I can learn to code using partitioning):

http://msdn.microsoft.com/en-us/library/ms190787.aspx

How Does Table Partitioning Help?

Improve performance by partitioning

Do you have any thoughts about this? Maybe more concrete questions would be:

  1. How cumbersome would it be (for a novice!) to code joining data from
    the annual tables separately, for multiple series of tables? (E.g.
    UNION ALL some columns of tax returns for 1997-2008, then join all residential
    data for these 12 years year-by-year, then join employment records
    year-by-year etc.)
  2. How much slower would this run than simply selecting years from the longitudinal tables?
  3. Shall I be worried
    about the longitudinal tables having the size on the same order of
    magnitude as my physical memory, or MonetDB handles it equally
    efficiently (it would need to access many of the small annual tables
    anyway, that also takes memory). Can at least the missing values for entire years be stored efficiently?

Best Answer

Having all your data in fewer, common tables is obviously going to make your coding simpler. I would suggest you pursue this option.

Partitioning is not that scary. Annualised data like yours is the archetypal usage for partitioning. It will help with your query performance and may help with your data management tasks (load / delete), depending on MonetDB's capabilities.

You say some data is in different formats in different years - tinyint v/s character. This will be a problem for you whatever architecture you choose because you will eventually want to combine these in one resultset. Fix this sooner rather than later. Fixing in the data and table definition will be less work overall than fixing it in each query as you write it.

In summary 1) Fairly, especially if you have to reconcile type differences 2) "A bit" if your queries are simple to "a lot" of you get it badly wrong 3) not a MonetDB expert but most RDBMS handle data many times the size of RAM without problem.