Mysql – DB Design – 1 column for each dataset OR dataset_name AND dataset_value

database-design, eav, mysql

I am creating a database table for storing numeric time-series data sets regarding cities that will be accessed via a web app. Primarily annual, quarterly and monthly data will be stored in it. There will be hundreds, potentially a couple thousand data sets, for each city. Some data sets may have data on a monthly, quarterly AND annual basis while other data sets may only be available on a monthly, quarterly OR annual basis.

I am trying to decide between 2 different designs for the table's columns:

Option #1

city_id year subyear dataset_name dataset_value dataset_unit_type dataset_source

Option #2

city_id year subyear dataset1-PROPERTY_TAXES dataset2-POPULATION dataset3-ETC … datasetX

With Option #1, a hypothetical row would look like:

12345 _ 2012 _ 'Q4' _ 'Property Taxes Collected' _ 950000 _ 'USD' _ 'city govt'

OR with monthly data:

12345 _ 2012 _ 'M8' _ 'Property Taxes Collected' _ 330000 _ 'USD' _ 'city govt'

With Option #2, a hypothetical row would look like:

12345 _ 2012 _ 'Q4' _ 950000 _ 1400500 _ 234.53 ... XXXX

OR with monthly data:

12345 _ 2012 _ 'M8' _ 330000 _ 1400500 _ 120.12 ... XXXX
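To make the two layouts concrete, here is a minimal sketch of both tables holding the hypothetical rows above. It uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs anywhere. The table names (city_data_narrow, city_data_wide), the wide column names and the primary keys are assumptions for illustration, not part of the original question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE city_data_narrow (        -- Option #1: one row per dataset value
    city_id           INTEGER NOT NULL,
    year              INTEGER NOT NULL,
    subyear           TEXT    NOT NULL,   -- e.g. 'Q4' or 'M8'
    dataset_name      TEXT    NOT NULL,
    dataset_value     NUMERIC,
    dataset_unit_type TEXT,
    dataset_source    TEXT,
    PRIMARY KEY (city_id, year, subyear, dataset_name)   -- assumed key
);
CREATE TABLE city_data_wide (          -- Option #2: one column per dataset
    city_id        INTEGER NOT NULL,
    year           INTEGER NOT NULL,
    subyear        TEXT    NOT NULL,
    property_taxes NUMERIC,               -- hypothetical column names
    population     NUMERIC,
    PRIMARY KEY (city_id, year, subyear)                 -- assumed key
);
""")

# The example rows from the question.
conn.executemany(
    "INSERT INTO city_data_narrow VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (12345, 2012, "Q4", "Property Taxes Collected", 950000, "USD", "city govt"),
        (12345, 2012, "M8", "Property Taxes Collected", 330000, "USD", "city govt"),
    ],
)
conn.executemany(
    "INSERT INTO city_data_wide VALUES (?, ?, ?, ?, ?)",
    [
        (12345, 2012, "Q4", 950000, 1400500),
        (12345, 2012, "M8", 330000, 1400500),
    ],
)

# Same question asked of each layout: Q4 2012 property taxes for city 12345.
narrow_q4 = conn.execute(
    "SELECT dataset_value FROM city_data_narrow "
    "WHERE city_id = ? AND year = ? AND subyear = ? AND dataset_name = ?",
    (12345, 2012, "Q4", "Property Taxes Collected"),
).fetchone()[0]
wide_q4 = conn.execute(
    "SELECT property_taxes FROM city_data_wide "
    "WHERE city_id = ? AND year = ? AND subyear = ?",
    (12345, 2012, "Q4"),
).fetchone()[0]

print(narrow_q4, wide_q4)
```

The narrow layout filters on dataset_name to pick a dataset, while the wide layout picks it by naming a column.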

If I do #2, I will have hundreds (or maybe a couple thousand?) of columns, and many rows will have a lot of 'blank' spaces because many data sets will not have data for each row (e.g. data that is only available quarterly will always be 'blank' in rows describing monthly data). Also, I think there will need to be a corresponding table describing the sources and units (e.g. U.S. Dollars, Square Miles, etc.) for each dataset column. I am concerned that the large number of columns and blank spaces may be problematic.

Alternatively, if I do #1, I will have very few columns but millions (potentially billions?) of rows, and many queries will have a WHERE clause checking whether the dataset_name column equals a certain string (e.g. SELECT * FROM <table> WHERE dataset_name = 'Property_Taxes_Collected'). Also, since all types of numeric values are stored in the dataset_value column, the data type here will have to be pretty flexible (maybe something like DECIMAL(16,4)?). I'm primarily concerned this may become slow with a lot of data.
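Whether filtering on dataset_name is slow depends largely on indexing. The sketch below creates the narrow dataset_name/dataset_value table with a composite index that leads with dataset_name, then asks the engine for its query plan. It uses Python's sqlite3 module rather than MySQL, and the table name city_data and index name idx_dataset_city are assumptions. MySQL's EXPLAIN output looks different, so treat this only as an illustration of the idea.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE city_data (                  -- hypothetical table name
    city_id       INTEGER NOT NULL,
    year          INTEGER NOT NULL,
    subyear       TEXT    NOT NULL,
    dataset_name  TEXT    NOT NULL,
    dataset_value NUMERIC
);
-- Composite index leading with the column used in the WHERE clause.
CREATE INDEX idx_dataset_city
    ON city_data (dataset_name, city_id, year, subyear);
""")

# Ask SQLite how it would run the lookup. The last field of each plan row
# is a human-readable description of the step.
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT * FROM city_data WHERE dataset_name = ? AND city_id = ?",
    ("Property_Taxes_Collected", 12345),
).fetchall()
plan_detail = plan[0][3]
print(plan_detail)
```

If the plan reports an index search rather than a full table scan, the string comparison is resolved through the index instead of by reading every row.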

I am using MySQL.

My Question:
Is it possible to say which of these 2 general table designs is most likely the better design choice if my primary concern is query speed? Or is it fairly clear that I should be approaching this in a completely different manner?

Best Answer

Upon realization that I had unwittingly reinvented a modified EAV model with Option #1 and would likely be headed into a world of hurt with this option, I decided to go with Option #2.

Option #2 should preserve many of the benefits of RDBMS and I think MySQL/InnoDB can support the row sizes I currently expect. While I will have a quantity of NULL entries that some might find disagreeable, I expect the DB will perform queries quickly, which is my main concern.

At some future point, should I hit constraints/performance issues, I may consider a NoSQL option as it appears the data I am working with doesn't necessarily correspond so well to RDBMS.

Related Solutions

Mysql – effective MySQL table/index design for 35 million rows+ table, with 200+ corresponding columns (double), any combination of which may be queried

Coincidentally, I am also working on a client system where we designed a key-value pair structure for flexibility. That table now holds over 1.5B rows and the ETL is far too slow. There is a lot more going on in my case, but have you considered that design? A row that currently carries values in all 200 columns becomes 200 rows in the key-value design. How much space you save depends on how many rows actually have all 200 values, f1 to f200, present for a given AssetID and Date. If, say, 30% of the column values are NULL, that is your space saving, because in the key-value design a NULL value needs no row at all, whereas in the existing column structure even a NULL takes space. (I am not 100% sure, but I believe that if more than 30 columns in the table are NULL, then a NULL takes 4 bytes.)

If you assume that all 35M rows have values in all 200 columns, your current table becomes 200 * 35M = 7,000M rows right away. The table space will not be much higher than with all columns in a single table, because we are only transposing columns into rows, and the transpose leaves out the rows whose value is NULL. So you can run a query against the current table to count the NULLs and estimate the target table size before you actually implement it.
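The size estimate suggested above can be sketched as a single aggregate query: COUNT(column) counts only non-NULL values, so summing it over the f-columns gives the number of rows the key-value table would need. This is a minimal illustration using Python's sqlite3 module with three f-columns instead of 200. The table name wide and the sample values are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE wide (AssetID INTEGER, Date TEXT, f1 REAL, f2 REAL, f3 REAL)"
)
conn.executemany(
    "INSERT INTO wide VALUES (?, ?, ?, ?, ?)",
    [
        (1, "2012-01-01", 5.5, None, 9.0),
        (1, "2012-01-02", None, None, 8.5),
        (2, "2012-01-01", 6.0, 7.0, None),
    ],
)

field_columns = ["f1", "f2", "f3"]  # would be f1..f200 in the real table

# COUNT(col) skips NULLs, so the sum is the number of key-value rows needed.
count_expr = " + ".join(f"COUNT({col})" for col in field_columns)
kv_rows = conn.execute(f"SELECT {count_expr} FROM wide").fetchone()[0]

# Total cells in the wide layout, for comparison.
total_cells = conn.execute("SELECT COUNT(*) FROM wide").fetchone()[0] * len(field_columns)
null_fraction = 1 - kv_rows / total_cells

print(kv_rows, total_cells, round(null_fraction, 2))
```

Here 5 of the 9 cells hold a value, so the key-value table would contain 5 rows.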

The second advantage is read performance. As you mentioned, the new way of querying the data is any combination of the f1 to f200 columns in the WHERE clause. With the key-value pair design, f1 to f200 are stored in one column, say "FieldName", and their values in a second column, say "FieldValue". You can have a CLUSTERED index on both columns. Your query will be a UNION of those SELECTs:

WHERE (FieldName = 'f1' AND FieldValue BETWEEN 5 AND 6)

UNION

WHERE (FieldName = 'f2' AND FieldValue BETWEEN 8 AND 10)

etc.....
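A runnable sketch of that query pattern follows, again using Python's sqlite3 module as a stand-in. The table name kv and the sample rows are assumptions, and the composite primary key here only approximates the clustered index described above. Note that a UNION returns assets matching any one of the conditions. If every condition must hold, one portable option (an addition here, not something the answer states) is to group by asset and count the matched field names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE kv (                         -- hypothetical key-value table
    AssetID    INTEGER NOT NULL,
    Date       TEXT    NOT NULL,
    FieldName  TEXT    NOT NULL,
    FieldValue REAL    NOT NULL,
    PRIMARY KEY (FieldName, FieldValue, AssetID, Date)
);
""")
day = "2012-01-01"
conn.executemany(
    "INSERT INTO kv VALUES (?, ?, ?, ?)",
    [
        (1, day, "f1", 5.5), (1, day, "f2", 9.0),   # matches both conditions
        (2, day, "f1", 5.2), (2, day, "f2", 7.0),   # matches f1 only
        (3, day, "f1", 4.0), (3, day, "f2", 9.5),   # matches f2 only
    ],
)

# UNION of per-field SELECTs: assets satisfying at least one condition.
any_match = sorted(
    row[0]
    for row in conn.execute("""
        SELECT AssetID, Date FROM kv
        WHERE FieldName = 'f1' AND FieldValue BETWEEN 5 AND 6
        UNION
        SELECT AssetID, Date FROM kv
        WHERE FieldName = 'f2' AND FieldValue BETWEEN 8 AND 10
    """)
)

# Assets satisfying every condition: count distinct matched fields per asset.
all_match = sorted(
    row[0]
    for row in conn.execute("""
        SELECT AssetID, Date FROM kv
        WHERE (FieldName = 'f1' AND FieldValue BETWEEN 5 AND 6)
           OR (FieldName = 'f2' AND FieldValue BETWEEN 8 AND 10)
        GROUP BY AssetID, Date
        HAVING COUNT(DISTINCT FieldName) = 2
    """)
)

print(any_match, all_match)
```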

I will give you some performance numbers from an actual prod server. We have 75 price columns for each security TICKER.

SQL Server Database Design – Preserving Values and Additional Columns

Treat the various processes as sub-types. There will then be an entity process_base, which contains all the common attributes, and process1, process2, process2a etc. for the process-specific attributes.

Implement each of these as a table. A view which combines them all together may simplify usage:

create view process_all as
select <whatever>
from process_base
inner join process1 <etc>

union all

select <whatever>
from process_base
inner join process2 <etc>
...

This way you minimise the NULL columns (if that is desirable to you) but maintain the unity of "process" as a single idea.
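The sub-type layout can be sketched end to end as follows. The example runs on Python's sqlite3 module instead of SQL Server, and every column name (name, temperature, operator, process_type) is invented for illustration. Only the table and view names come from the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE process_base (               -- attributes common to all processes
    process_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL
);
CREATE TABLE process1 (                   -- attributes specific to process 1
    process_id  INTEGER PRIMARY KEY REFERENCES process_base(process_id),
    temperature REAL NOT NULL
);
CREATE TABLE process2 (                   -- attributes specific to process 2
    process_id INTEGER PRIMARY KEY REFERENCES process_base(process_id),
    operator   TEXT NOT NULL
);

-- One view that presents every sub-type with a common column list.
CREATE VIEW process_all AS
SELECT b.process_id, b.name, 'process1' AS process_type,
       p.temperature, NULL AS operator
FROM process_base AS b
INNER JOIN process1 AS p ON p.process_id = b.process_id
UNION ALL
SELECT b.process_id, b.name, 'process2', NULL, p.operator
FROM process_base AS b
INNER JOIN process2 AS p ON p.process_id = b.process_id;

INSERT INTO process_base VALUES (1, 'anneal'), (2, 'inspect');
INSERT INTO process1 VALUES (1, 450.0);
INSERT INTO process2 VALUES (2, 'alice');
""")

rows = conn.execute(
    "SELECT process_id, process_type FROM process_all ORDER BY process_id"
).fetchall()
print(rows)
```

The NULLs now appear only in the view's output. The stored tables hold just the attributes that apply to each sub-type.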
