Do not use `DOUBLE(10,2)`; it involves two roundings. Use either `DECIMAL(10,2)` or plain `DOUBLE`. For monetary values (`price`), use `DECIMAL(...)` so that there is no rounding. `DOUBLE` is 8 bytes; `DECIMAL(10,2)` is 5 bytes.
Can you have "negative" Clicks? If not, I suggest `INT UNSIGNED`.
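For instance, a minimal sketch of those two columns; the table name is my assumption, not from your schema:

```sql
CREATE TABLE facts (
    -- exact storage, 5 bytes; no rounding of monetary values
    price  DECIMAL(10,2) NOT NULL,
    -- Clicks cannot go negative, so reject such values at the type level
    clicks INT UNSIGNED NOT NULL DEFAULT 0
    -- ... the rest of the columns ...
);
```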
It is usually a bad idea to splay arrays across columns (`L1, ...` and `Attribute1, ...`). Instead, have another table for them, as sketched below. (OK, I don't know what impact that will have on queries that `GROUP BY` attributes.)
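A sketch of such a table, assuming a 5-slot attribute array; all names and the value type are invented:

```sql
CREATE TABLE fact_attributes (
    fact_id BIGINT UNSIGNED NOT NULL,   -- points back to the fact row
    seq     TINYINT UNSIGNED NOT NULL,  -- which slot: 1..5
    attr    VARCHAR(64) NOT NULL,       -- the attribute value itself
    PRIMARY KEY (fact_id, seq)          -- one row per (fact, slot)
);
```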
`PARTITION BY HASH` has no known case of improving performance.
`attribute_changed int(11) DEFAULT NULL` -- if that is just a flag, make it `TINYINT UNSIGNED NOT NULL DEFAULT '0'`; that will save 3 bytes, plus the space for `NULL`.
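Assuming the table is called `facts`, the change would be:

```sql
ALTER TABLE facts
    MODIFY attribute_changed TINYINT UNSIGNED NOT NULL DEFAULT '0';
```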
Will you have a billion different Campaigns? `CampaignName varchar(255)` should be normalized and replaced by, say, a `MEDIUMINT UNSIGNED` (3 bytes) to save a lot of space. Ditto for any other varchars that repeat a lot.
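A sketch of the lookup table (names assumed):

```sql
CREATE TABLE campaigns (
    campaign_id   MEDIUMINT UNSIGNED NOT NULL AUTO_INCREMENT,
    campaign_name VARCHAR(255) NOT NULL,
    PRIMARY KEY (campaign_id),
    UNIQUE KEY (campaign_name)
);
-- The fact table then carries the 3-byte campaign_id
-- instead of repeating the name on every row.
```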
If you are expecting billions of rows, squeezing out a few bytes per row can add up. That, in turn, will decrease the I/O, thereby speeding up queries.
Your covering index on the 5 attributes could consume a huge amount of space. Furthermore, with a billion rows, it may slow down `INSERTs` to one row per disk hit! On traditional drives, that is only 100/sec; you need more than 300/sec.
Since you say the users must include a Date clause, it may be practical to use `PARTITION BY RANGE(TO_DAYS(Date))`. It is unclear, but it sounds like `user_id` is also a requirement in the queries? In that case, I would suggest `INDEX(user_id, Date)` without partitioning. That composite index is much better than an index on `Date` alone. Adding more columns to that index will not help.
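For example (table name assumed), the index and the kind of query it serves:

```sql
ALTER TABLE facts ADD INDEX user_date (user_id, `Date`);

-- The index lets this touch only one user's slice of one date range:
SELECT COUNT(*)
FROM   facts
WHERE  user_id = 12345
  AND  `Date` >= '2016-01-01'
  AND  `Date` <  '2016-02-01';
```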
Your example has a Date range of one month plus one day; is that reasonable, or just a goof?
One table per client does not help.
Do not partition by user -- 250 partitions bring their own performance problems. About 50 is the 'practical' limit for the number of partitions.
Summary tables (plural) are the only way you will make this multi-billion-row table perform adequately. The `PRIMARY KEY` of each summary table would include `user_id` and the date (probably truncated to the day), plus a couple of other "dimensions". Several aggregations (mostly `COUNTs` and `SUMs`) would be the other columns. These tables would be significantly smaller than your 'Fact' table, and they can afford to have multiple indexes. Beware of `AVG`, because the average of averages is not mathematically correct. Instead, store the `SUM` and `COUNT`, then calculate the average as `SUM(sums)/SUM(counts)`.
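A sketch of one such summary table and the correct way to average from it; all names and the choice of dimensions are assumptions:

```sql
CREATE TABLE daily_summary (
    user_id     INT UNSIGNED NOT NULL,
    dy          DATE NOT NULL,               -- the Date, truncated to the day
    campaign_id MEDIUMINT UNSIGNED NOT NULL, -- one extra "dimension"
    ct          INT UNSIGNED NOT NULL,       -- COUNT(*) of the fact rows folded in
    sum_price   DECIMAL(14,2) NOT NULL,      -- SUM(price) of those rows
    PRIMARY KEY (user_id, dy, campaign_id)
);

-- Average over any date range: sum the sums, divide by the summed counts.
-- AVG over per-day averages would be mathematically wrong.
SELECT SUM(sum_price) / SUM(ct) AS avg_price
FROM   daily_summary
WHERE  user_id = 12345
  AND  dy >= '2016-01-01'
  AND  dy <  '2016-02-01';
```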
More on Data Warehousing and Summary Tables.
(Yes, some of my statements do disagree with previous Comments and Answers.)
Best Answer
Basic query to sample data for arbitrary fixed time intervals:
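One way to write it, as a sketch: a hypothetical table `readings(device_id, message_date, value)`, 30-second slots over one hour, picking the first row per slot via `LATERAL`. The table, column names, and constants are assumptions; `'CP2'` is the device from your example query.

```sql
SELECT r.*
FROM   generate_series(timestamptz '2024-01-01 00:00',
                       timestamptz '2024-01-01 01:00',
                       interval '30 seconds') AS g(slot)
CROSS  JOIN LATERAL (
   SELECT device_id, message_date, value
   FROM   readings
   WHERE  device_id = 'CP2'
   AND    message_date >= g.slot
   AND    message_date <  g.slot + interval '30 seconds'
   ORDER  BY message_date              -- earliest row in each slot
   LIMIT  1
   ) r;
```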
Support this with a multicolumn index on `(device_id, message_date)`. For a time interval of 10 seconds like you mentioned, you don't need an index: that would fetch 10 % of all rows, and Postgres will typically default to a sequential scan, as that is faster. (Exceptions with index-only scans apply.)
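Assuming the same hypothetical table, that index would be (index name invented):

```sql
CREATE INDEX readings_device_date_idx ON readings (device_id, message_date);
```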
The added filter `device_id = 'CP2'` can change a lot, but that's only in your query, not in your expressed requirements. An index typically only helps performance for filters retrieving a few percent of all rows or less; so, in your example, intervals of 30 seconds or more. For selective filters, an index can help a lot.