I've got a very simple table currently containing millions of data points, each a price on a given date:
CREATE TABLE data_point (
    id INT PRIMARY KEY,
    symbol_id INT,
    x DATE,
    y DOUBLE
);
This is a time-series database, and for this release I'll keep data with a daily (or less frequent) period. But if I later need to add a time component, would it be better to change the data_point table and convert x to DATETIME?
If this did happen, all of the existing data would initially remain unchanged; only future data (and not all of it) would be saved with a time.
Or would it be better to add a second table, with a one-to-one relationship to data_point, which held the time data?
Performance is more critical than space.
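For what it's worth, the in-place conversion could look like this. This is a MySQL-style sketch using the table and column names from the question; existing rows would simply acquire a midnight time component.

```sql
-- Convert the date column in place (MySQL-style syntax).
-- Existing DATE values become midnight timestamps,
-- e.g. '2023-01-05' is read back as '2023-01-05 00:00:00'.
ALTER TABLE data_point
    MODIFY COLUMN x DATETIME;

-- Future intraday points can then carry a time of day
-- (hypothetical example values):
-- INSERT INTO data_point (id, symbol_id, x, y)
-- VALUES (1000001, 42, '2023-01-05 14:30:00', 101.25);
```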
This leads on to an associated question about this table. Which is better in terms of performance for huge amounts of data, when all the queries on data_point will be joins from symbol? (Assume an INT primary key would get filled up within the app's expected lifetime.)
CREATE TABLE symbol (
    id INT PRIMARY KEY,
    name VARCHAR(100)
);
CREATE TABLE data_point (
    id BIGINT PRIMARY KEY,
    symbol_id INT,
    x DATE,
    y DOUBLE,
    CONSTRAINT FOREIGN KEY (symbol_id)
        REFERENCES symbol(id)
);
or
CREATE TABLE data_point (
    symbol_id INT,
    x DATE,
    y DOUBLE,
    CONSTRAINT PRIMARY KEY (symbol_id, x),
    CONSTRAINT FOREIGN KEY (symbol_id)
        REFERENCES symbol(id)
);
Best Answer
When in doubt, keep it simple: just use DATETIME. I can think of two reasons to split the TIME component off into another table, but neither is commonly applicable here. If you do need to add time later, you can add a DATETIME column, populate it gradually, and then drop the old DATE field when possible.

You've stated that performance is more important than space, but I really wonder whether six bytes for DATETIME2(0..2) is going to make much difference over three bytes for DATE. You also need to value your own time, and the time of everyone who tries to understand your system in the future.

Your question about the PK really should be asked separately. However, I concur with Rick James that, in this case, you probably don't need a synthetic key (your id field); just index on symbol_id and date. Assuming you're clustering on the PK, consider putting date first, so you can easily query date ranges. If you primarily filter (not just join) by symbol_id, then put it first in the index instead. If you do both, cluster on {date, symbol_id} and add a secondary index on symbol_id.
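Putting that last piece of advice together, the schema might look like this. This is my sketch of the suggestion, assuming MySQL/InnoDB, where the primary key is the clustered index:

```sql
CREATE TABLE data_point (
    symbol_id INT NOT NULL,
    x DATE NOT NULL,
    y DOUBLE,
    -- Clustered on (x, symbol_id): date-range scans read contiguous rows.
    PRIMARY KEY (x, symbol_id),
    -- Secondary index for filtering or joining by symbol alone.
    KEY idx_symbol (symbol_id),
    CONSTRAINT fk_symbol FOREIGN KEY (symbol_id)
        REFERENCES symbol(id)
);
```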