I have a database with a huge table that stores the ranking history of mobile applications. The table is quite big, around 120 GB.
To keep my DB queries from being too slow, I implemented several materialized views.
One of these materialized views calculates the average rankings of apps over the last 30 days. It is refreshed every day.
I now want to be able to tell, at any point in time, what that average was on a particular date, i.e. to have a history of averages.
Would it make sense to append the results of my materialized view to a table every day? Or should I partition that big table and do it another way?
Edit: Main table structure (738,681,765 rows)
+----+--------+---------+------+-------+---------------+------------+-------------+
| id | app_id | ranking | date | price | collection_id | country_id | category_id |
+----+--------+---------+------+-------+---------------+------------+-------------+
| 1 | 1426 | 30 | t1 | 0 | 12451 | 1658 | 2564 |
| 2 | 1427 | 15 | t2 | 0 | 23562 | 1485 | 3256 |
| 3 | 1428 | 22 | t3 | 0 | 14564 | 1320 | 4521 |
| 4 | 1429 | 11 | t4 | 0 | 12468 | 1578 | 5015 |
| 5 | 1430 | 10 | t5 | 0 | 18712 | 1100 | 6012 |
+----+--------+---------+------+-------+---------------+------------+-------------+
Best Answer
Yes, it makes sense to materialize.
The analysis of large datasets (see: OLAP, dimensional modeling) includes the concept of aggregations, which can be implemented as materialized views. You should design which aggregates you will keep; in my opinion you need at least two levels.
You can always calculate higher-level information from a lower-level aggregate. For example, if you had an aggregate on (app_id, day, collection_id), you could use it instead of (app_id, day).
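A minimal sketch of that rollup, with hypothetical table and column names (note that to roll averages up correctly, the finer aggregate must also store a row count so each partial average can be weighted):

```sql
-- Roll the (app_id, day, collection_id) aggregate up to (app_id, day).
-- avg_ranking and row_count are assumed columns of the finer aggregate.
SELECT app_id,
       day,
       SUM(avg_ranking * row_count) / SUM(row_count) AS avg_ranking
FROM   agg_app_day_collection  -- hypothetical aggregate table/view
GROUP BY app_id, day;
```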
You can materialize your aggregates with the
MATERIALIZED VIEW
feature, but that is not the only way. If old data is static, it can be enough to insert new rows daily with something similar to an INSERT ... SELECT. There are a lot of good sources on whether or not to partition.
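As a sketch of the daily-insert approach (the history table and the materialized view name avg_rankings_30d are hypothetical):

```sql
-- Hypothetical history table: one snapshot of the 30-day average per day.
CREATE TABLE ranking_average_history (
    snapshot_date date    NOT NULL,
    app_id        integer NOT NULL,
    avg_ranking   numeric NOT NULL,
    PRIMARY KEY (snapshot_date, app_id)
);

-- Run once a day, right after the materialized view refresh.
INSERT INTO ranking_average_history (snapshot_date, app_id, avg_ranking)
SELECT current_date, app_id, avg_ranking
FROM   avg_rankings_30d;
```

This keeps the history append-only, so old snapshots never need to be recomputed.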
If you are going to store 2 years or more, daily partitions could be optimal. But remember that a very large number of partitions will make query planning longer; the threshold depends on CPU speed and the queries used.
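If you do partition, a hedged sketch assuming PostgreSQL 10+ declarative range partitioning on the main table's date column (table and partition names are illustrative):

```sql
-- Hypothetical: range-partition the rankings table by date.
CREATE TABLE rankings (
    id            bigint,
    app_id        integer,
    ranking       integer,
    date          date NOT NULL,
    price         numeric,
    collection_id integer,
    country_id    integer,
    category_id   integer
) PARTITION BY RANGE (date);

-- One daily partition; in practice these are created by a scheduled job.
CREATE TABLE rankings_2024_01_01 PARTITION OF rankings
    FOR VALUES FROM ('2024-01-01') TO ('2024-01-02');
```

Queries that filter on date then touch only the relevant partitions (partition pruning), which is what makes this layout attractive for a 700M-row table.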
PS. I assume that you have already run out of the normal ways to optimize.