Mysql – DB Design – 1 column for each dataset OR dataset_name AND dataset_value

database-designeavMySQL

I am creating a database table for storing numeric time-series data sets regarding cities that will be accessed via a web app. Primarily annual, quarterly and monthly data will be stored in it. There will be hundreds, potentially a couple thousand data sets, for each city. Some data sets may have data on a monthly, quarterly AND annual basis while other data sets may only be available on a monthly, quarterly OR annual basis.

I am trying to decide between 2 different designs for the table's columns:

Option #1

city_id year subyear dataset_name dataset_value dataset_unit_type dataset_source

Option #2

city id year subyear dataset1-PROPERTY_TAXES dataset2-POPULATION dataset3-ETCdatasetX

With Option #1, a hypothetical row look like:

12345 _ 2012 _ 'Q4' _ 'Property Taxes Collected' _ 950000 _ 'USD' _ 'city govt'

OR with monthly data:

12345 _ 2012 _ 'M8' _ 'Property Taxes Collected' _ 330000 _ 'USD' _ 'city govt'

With Option #2, a hypothetical row look like:

12345 _ 2012 _ 'Q4' _ 950000 _ 1400500 _ 234.53 ... XXXX

OR with monthly data:

12345 _ 2012 _ 'M8' _ 330000 _ 1400500 _ 120.12 ... XXXX

If I do #1, I will have hundreds (or maybe a couple thousand?) of columns and many rows will have a lot of 'blank' spaces because a lot of data sets will not have data for each row (e.g. some data that is only available quarterly will always be 'blank' in columns for rows describing monthly data). Also, I think there will need to be a corresponding table for describing the sources and units (e.g. U.S. Dollars, Square Miles, etc.) for each dataset column. I am concerned that the large number of columns and blank spaces may be problematic.

Alternatively, if I do #2, I will have very few columns but millions (potentially billions?) of rows and will have many queries with a WHERE clause searching to see if the dataset_name column equals a certain string (e.g. SELECT * WHERE dataset_name='Property_Taxes_Collected'). Also, since all types of numeric values are stored in the dataset_value column, the data type here will have to be pretty flexible (Maybe something like DECIMAL 16,4?). I'm primarily concerned this may become slow with a lot of data.

I am using MySQL.

My Question:
Is possible to say which of these 2 general table designs is most likely the better design choice if my primary concern is query speed? Or is it fairly clear that I should be approaching this in a completely different manner?

Best Answer

Upon realization that I had unwittingly reinvented a modified EAV model with Option #1 and would likely be headed into a world of hurt with this option, I decided to go with Option #2.

Option #2 should preserve many of the benefits of RDBMS and I think MySQL/InnoDB can support the row sizes I currently expect. While I will have a quantity of NULL entries that some might find disagreeable, I expect the DB will perform queries quickly, which is my main concern.

At some future point, should I hit constraints/performance issues, I may consider a NoSQL option as it appears the data I am working with doesn't necessarily correspond so well to RDBMS.