Mysql – Adding TIME data to a DATE type column in time series table, and what sort of foreign key

datedatetimeforeign keyMySQLprimary-key

I've got a v. simple table currently containing millions of data points of price on a given date:

CREATE TABLE data_point (
    id INT PRIMARY KEY,
    symbol_id INT,
    x DATE,
    y DOUBLE
)

This is a timeseries database and for this release I'll keep data with a daily (or less frequent) period but if I need to add a time component, would it be better to change the data_point table and convert x to DATETIME?

If this did happen, initially all of the existing data would remain unchanged, it would not be adjusted – only future data (and not all of it) would be saved with time.

Or would it be better to add a second table with a one-to-one relationship to data_point, which held the time data?

Performance is more critical than space.

This leads on to the associated question about this table. Which is better in terms of performance for huge amounts of data, when all the queries on data_point will be joins from symbol? (Assuming INT for a primary key will get filled up within the app's expected lifetime)

CREATE TABLE symbol (
    id INT PRIMARY KEY,
    name VARCHAR(100)
);

CREATE TABLE data_point (
    id BIGINT PRIMARY KEY,
    symbol_id INT,
    x DATE,
    y DOUBLE,
    CONSTRAINT FOREIGN KEY (symbol_id) 
        REFERENCES symbol(id)
);

or

CREATE TABLE data_point (
    symbol_id INT,
    x DATE,
    y DOUBLE,
    CONSTRAINT PRIMARY KEY (symbol_id, x),
    CONSTRAINT FOREIGN KEY (symbol_id) 
        REFERENCES symbol(id)
);

Best Answer

When in doubt, keep it simple - just use DATETIME. I can think of two reasons to flake TIME off into another table, but neither are commonly applicable:

  1. Your table is very large and in use constantly, so you can't afford to block it while changing a data type. I'd still work around this by adding a new DATETIME column, populating it gradually, and then dropping the old DATE field when possible.
  2. It's very important that you keep your table narrow, and you'll need the time values very rarely.

You've stated that performance is more important than space, but I really wonder if six bytes for DATETIME2(0..2) is really going to make much difference over three bytes for DATE. You need to value your own time, and the time of everyone who tries to understand your system in the future.

Your question about the PK really should be asked separately. However, I concur with Rick James that, in this case, you probably don't need a synthetic key (your id field); just index on symbol_id and date. Assuming you're clustering on the PK, consider putting date first, so you can easily query date ranges. If you primarily filter (not just join) by symbol_id, then put it first in the index instead. If you do both, cluster on { date, symbol_id } and index on symbol_id.