MySQL – Economic Database Architecture Possibilities with Data of Varying Length

architecture | database-design | mysql

I'm in the process of designing a MySQL database (InnoDB) that holds a large amount of economic data that our application scores. The magnitude of the economic data varies a great deal: some measures, such as percentage changes, are fractions of a percent; others, such as national debt figures, run to 14+ digits. Furthermore, we have a business requirement that some data points must be correct to the 6th decimal place.
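For reference, a single exact DECIMAL column can cover both extremes; a minimal sketch, where the (20,6) width is an assumption (14 integer digits plus the required 6 fractional digits; MySQL's DECIMAL allows up to 65 digits in total):

```sql
-- Sketch only: widths are assumptions, widen as needed.
-- DECIMAL(20,6) stores exact values with 14 digits before
-- the decimal point and 6 after it.
CREATE TABLE precision_demo (
    pct_change    DECIMAL(20,6),  -- fractions of a percent
    national_debt DECIMAL(20,6)   -- 14+ digit figures
);
```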

We currently have over 500,000 rows in the legacy database but anticipate growing that number substantially: we are building the new database with point-in-time considerations, so for the most part we will not be deleting or updating rows, only adding new rows and superseding old ones.

All potential tables containing this economic data will be structured as follows:

id | country_id | period_id | [economic_data] | data_type | date_created | date_superseded
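For illustration, that shape as DDL (a sketch only; column types are assumptions, not a final design):

```sql
CREATE TABLE economic_data (
    id              BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    country_id      INT UNSIGNED      NOT NULL,
    period_id       INT UNSIGNED      NOT NULL,
    economic_data   DECIMAL(20,6)     NOT NULL,  -- exact, 6 decimal places
    data_type       SMALLINT UNSIGNED NOT NULL,  -- which of the 200+ series
    date_created    DATETIME          NOT NULL,
    date_superseded DATETIME          NULL       -- NULL = current row
) ENGINE = InnoDB;
```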

My question is this:

Is it best to:

  1. Break out all of these individual economic data series into their own tables, given that the data comes in varying sizes, or
  2. Combine all this data into one massive table, given the identical structure of all economic data tables and the simplicity it would offer for writing queries?

We collect over 200 data series and are planning on increasing that number every year, so option 1 would require creating and maintaining 200+ tables.

Option 2 seems the easiest to develop and maintain, but I wonder about the implications for query performance and storage.
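For concreteness, the single-table option would lean on a composite index and a point-in-time predicate along these lines (index name and literal values are placeholders):

```sql
-- Assumed index so per-series, per-country lookups avoid
-- scanning the whole table:
CREATE INDEX ix_series_lookup
    ON economic_data (data_type, country_id, period_id, date_created);

-- The row that was current as of a given date:
SELECT economic_data
FROM   economic_data
WHERE  data_type  = 42            -- placeholder series id
  AND  country_id = 1
  AND  period_id  = 200901
  AND  date_created <= '2015-06-30'
  AND (date_superseded IS NULL OR date_superseded > '2015-06-30');
```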

Any thoughts or suggestions?

Best Answer

I would separate the problem into transactions and analytics -- based on the question, it seems you are trying to find a design that is optimal for both.

From a design point of view -- at the logical level -- I would use something like the model below and would not worry about the number of tables. Each attribute also gets a proper data type.

[Diagram: proposed logical data model]

From this you may periodically (say, daily) publish to structures that are more analytics-friendly (flat OLAP tables, data marts, ...). Depending on performance -- and user expectations -- exposing 5NF views may be good enough.
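For example, a daily publish step could pivot normalized rows into a flat table along these lines (a sketch assuming the single-table shape from the question; series ids and column names are illustrative, and the real table would carry one column per series):

```sql
CREATE TABLE flat_economics AS
SELECT country_id,
       period_id,
       MAX(CASE WHEN data_type = 1 THEN economic_data END) AS gdp_growth,
       MAX(CASE WHEN data_type = 2 THEN economic_data END) AS national_debt
FROM   economic_data
WHERE  date_superseded IS NULL   -- current rows only
GROUP BY country_id, period_id;
```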


On the physical level, I am not so sure :(

Structures like this are usually exposed to users as flat views (5NF) and via point-in-time functions. The main problem here is that the question is tagged MySQL: MySQL limits the number of tables that can appear in a join (61), and the query optimizer does not support table elimination; hence, forget the views. You would have to "run around" this at the application level and join tables based on the ID and date; the application may be the ETL code that exports to the analytic tables.
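For example, the application could issue the same query once per series table and stitch the results together by (country_id, period_id) in code, rather than through one giant view (table name is illustrative; `?` marks the as-of date parameter):

```sql
SELECT country_id, period_id, economic_data
FROM   gdp_growth            -- repeat for each series table
WHERE  date_created <= ?     -- as-of date
  AND (date_superseded IS NULL OR date_superseded > ?);
```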

So now it depends on how you expose this to the final users -- if they are supposed to write custom queries, this will not work.

It is a common approach to design a database at the logical level without regard for the target DBMS, but in this case the choice of MySQL limits the design options.