MySQL – Advice on DB design and how to handle null / measurement errors

database-design, MySQL, normalization, performance

I am seeking advice on my DB design. I am about to take a big step and start a large amount of data processing, so what I decide now will have big consequences for my project.

I have this table. It holds data that was acquired online from an external source.

app_data
+--------------+-------+------------+--------+---------+-------+
| surrogateKey | appid |  created   | owners | viewers | price |
+--------------+-------+------------+--------+---------+-------+
|            1 |     2 | 1472428100 |     10 |      25 | 10.00 |
|            2 |     2 | 1472428200 |     11 |      50 | 10.00 |
|            3 |     2 | 1472428300 |     15 |      50 | 10.00 |
|            4 |     2 | 1472428400 |     22 |      51 |  8.00 |
|            5 |     2 | 1472428500 |   null |      50 |  8.00 |
|            6 |     2 | 1472428600 |     20 |      49 |  8.00 |
|            7 |     2 | 1472428700 |     25 |      50 | 10.00 |
|          ... |       |            |        |         |       |
+--------------+-------+------------+--------+---------+-------+

CREATE TABLE app_data(
    surrogateKey BIGINT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
    appID INT UNSIGNED NOT NULL,
    created TIMESTAMP NOT NULL,
    owners INT UNSIGNED,
    owners_variance MEDIUMINT UNSIGNED,
    viewers MEDIUMINT UNSIGNED, -- TINYINT would be better but application layer cannot handle tinyint
    viewers_variance MEDIUMINT UNSIGNED,
    price MEDIUMINT UNSIGNED,
    CONSTRAINT fk_appData_app_appID 
        FOREIGN KEY (appID) 
        REFERENCES app(appID)
        ON UPDATE CASCADE
) ENGINE=INNODB;
  1. Does it make sense to create this surrogateKey? There is no value
    added by it, and it will not be used for joins. appID and created
    would make a fine composite key, and I will use those columns for
    joining tables.

  2. appID and created will be used for joining, comparisons, range
    queries, and so on. How should I create indexes? One index on each?
    (appID, created)? (created, appID)?

  3. This table will be HUGE and take up 95% of the storage space. How can I optimize it further, performance-wise?
  4. How should I handle random null values such as the one in the owners column? They basically mean: the thermometer was broken that day… I
    was contemplating just turning them into zero or leaving them as
    null. null clearly means that there was no measurement that day,
    whereas zero would be ambiguous but easier for analysis.

  5. How should I handle values that don't make sense? For example, the
    upper limit for viewers is 50. There is one row with a value of 51,
    which means: this value is WRONG. Should I turn it into null? Into 50?

  6. Should I normalize price into something like this? Normalization
    would make it more difficult to analyse and result in more
    complicated queries.


CREATE TABLE app_price( -- "app_price" is just a placeholder name
  appID INT NOT NULL PRIMARY KEY,
  created TIMESTAMP,
  start_date TIMESTAMP,
  end_date TIMESTAMP,
  price INT
);

Thanks!

Best Answer

Some of these questions are not so much about database design as about data cleaning and analysis - in those cases the correct answer really depends on the specifics of your analysis task.

  1. If the surrogate key doesn't add value (you won't even use it for joins) and you are certain that (appid, created) will always be unique, then just use (appid, created) as a composite primary key (see the sketch after this list).

  2. If you'll always be joining, querying, etc. on (appid, created) together, then the index resulting from assigning those columns as the multi-column primary key will provide what you need. The order shouldn't matter unless you're also going to be querying one of the two columns independently. (Edit: Rick James makes a good point about the likelihood of querying for equality on appid and some BETWEEN range for created. As he says, in that case go with (appid, created) for your order.)

  3. That depends on the queries you'll be making. Can you provide examples? Indexing where you'll need it and avoiding joins are your easiest options.

  4. This is really a data analysis question - but in general I would strongly recommend keeping missing values as null and transforming or dropping those values as needed for a given operation. Depending on the analysis you're performing the appropriate no-op value for missingness may be different (e.g. 0 for sums/means, but 1 for products, etc) or you may just want to be able to exclude those rows altogether, but still have a record that they were recorded.

  5. Again, this is very much a data cleaning / analysis decision - but in general I would leave observed data intact (even if you think it's invalid / nonphysical / otherwise not possible), and only drop it during your analysis. Assuming you have the storage space, more information is always better than less - what if later you find out some supposedly erroneous value actually was possible?

  6. From a performance perspective I would avoid needing to join tables, but clearly this would be desirable from a consistency / referential integrity perspective.
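To make points 1, 2 and 4 above concrete, here is a minimal sketch of the table with (appID, created) as the composite primary key. It reuses the column names and types from your CREATE TABLE, so treat it as an illustration rather than a tested migration:

-- Sketch only: same columns as the original app_data, minus the surrogate
-- key; the composite primary key doubles as the (appID, created) index.
CREATE TABLE app_data(
    appID INT UNSIGNED NOT NULL,
    created TIMESTAMP NOT NULL,
    owners INT UNSIGNED,              -- NULL = no measurement taken
    owners_variance MEDIUMINT UNSIGNED,
    viewers MEDIUMINT UNSIGNED,
    viewers_variance MEDIUMINT UNSIGNED,
    price MEDIUMINT UNSIGNED,
    PRIMARY KEY (appID, created),
    CONSTRAINT fk_appData_app_appID
        FOREIGN KEY (appID)
        REFERENCES app(appID)
        ON UPDATE CASCADE
) ENGINE=INNODB;

And two example queries (the appID and date range are just placeholders): the first shows the equality-on-appID-plus-range-on-created pattern the primary key serves directly, the second shows that aggregate functions already skip null, so you rarely need to convert missing owners to zero:

-- Equality on appID plus a range on created uses the primary key directly.
SELECT created, owners, viewers
FROM app_data
WHERE appID = 2
  AND created BETWEEN '2016-08-01' AND '2016-09-01';

-- AVG() ignores NULL owners; COUNT(*) - COUNT(owners) counts the rows
-- where the measurement is missing.
SELECT appID,
       AVG(owners)              AS avg_owners,
       COUNT(*) - COUNT(owners) AS missing_owner_rows
FROM app_data
GROUP BY appID;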

If price changes at regular time intervals, it might be possible to break your created column into two columns: one that describes a coarser time period where there's a constant price, and one that describes the more granular timestamp within that broader period. For example:

app_data:

|-------+------------------+----------------+--------+---------+-------|
| appid | coarsetimeperiod | finetimeperiod | owners | viewers | price |
|-------+------------------+----------------+--------+---------+-------|
|     2 |                1 |              1 |     10 |      25 | 10.00 |
|     2 |                1 |              2 |     11 |      50 | 10.00 |
|     2 |                1 |              3 |     15 |      50 | 10.00 |
|     2 |                2 |              1 |     22 |      51 |  8.00 |
|     2 |                2 |              2 |   null |      50 |  8.00 |
|     2 |                2 |              3 |     20 |      49 |  8.00 |
|     2 |                3 |              1 |     25 |      50 | 10.00 |
|-------+------------------+----------------+--------+---------+-------|

You would have a FOREIGN KEY (coarsetimeperiod, price) REFERENCES price_data(coarsetimeperiod, price) defined on app_data, where the price_data table looks like:

price_data:
|------------------+-------|
| coarsetimeperiod | price |
|------------------+-------|
|                1 | 10.00 |
|                2 |  8.00 |
|                3 | 10.00 |
|------------------+-------|

This would get you the referential integrity benefits of normalization but without the performance penalty of needing to continuously perform joins.
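If you go that route, a rough DDL sketch could look like the one below. The column names come from the example tables above, and DECIMAL is used for price purely for readability; whatever type you actually pick just has to match exactly on both sides of the foreign key:

-- Sketch only: price is stored once per coarse time period. The UNIQUE
-- index on (coarsetimeperiod, price) is what the composite foreign key
-- in app_data references.
CREATE TABLE price_data(
    coarsetimeperiod INT UNSIGNED NOT NULL,
    price DECIMAL(8,2) NOT NULL,
    PRIMARY KEY (coarsetimeperiod),
    UNIQUE KEY uq_period_price (coarsetimeperiod, price)
) ENGINE=INNODB;

-- app_data keeps a copy of price, but the composite foreign key forces
-- it to agree with price_data for that coarse period.
CREATE TABLE app_data(
    appID INT UNSIGNED NOT NULL,
    coarsetimeperiod INT UNSIGNED NOT NULL,
    finetimeperiod INT UNSIGNED NOT NULL,
    owners INT UNSIGNED,
    viewers MEDIUMINT UNSIGNED,
    price DECIMAL(8,2) NOT NULL,
    PRIMARY KEY (appID, coarsetimeperiod, finetimeperiod),
    CONSTRAINT fk_appData_price
        FOREIGN KEY (coarsetimeperiod, price)
        REFERENCES price_data(coarsetimeperiod, price)
) ENGINE=INNODB;

Analytical queries can then read price straight from app_data without a join, while price_data guarantees there is exactly one price per coarse period.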