PostgreSQL – Database Architecture and Materialized Views

materialized-view, performance, postgresql, postgresql-performance

I have a database with a huge table that gathers the ranking history of mobile applications. The table is quite big, around 120 GB.
To keep my DB queries from being too slow, I implemented several materialized views.
One of these materialized views calculates the average rankings of apps over the last 30 days. It is refreshed every day.
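
For illustration, a minimal sketch of what that view looks like (simplified; the view name is a placeholder and rankings stands for my main table, whose structure is shown below):

CREATE MATERIALIZED VIEW app_avg_ranking_30d AS
SELECT app_id, avg(ranking) AS avg_ranking
FROM rankings
WHERE date >= current_date - 30
GROUP BY app_id;

-- refreshed once a day, e.g. from cron:
REFRESH MATERIALIZED VIEW app_avg_ranking_30d;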

I now want to be able to tell, at any point in time, what that average was on a particular date, i.e. to have a history of averages.

Would it make sense to add the results of my materialized view to a table every day? Or should I partition that big table and do it another way?

Edit: Main table structure (738,681,765 rows)

+----+--------+---------+------+-------+---------------+------------+-------------+
| id | app_id | ranking | date | price | collection_id | country_id | category_id |
+----+--------+---------+------+-------+---------------+------------+-------------+
|  1 |  1426  |   30    |  t1  |   0   |      12451    |   1658     |   2564      |
|  2 |  1427  |   15    |  t2  |   0   |      23562    |   1485     |   3256      |
|  3 |  1428  |   22    |  t3  |   0   |      14564    |   1320     |   4521      |
|  4 |  1429  |   11    |  t4  |   0   |      12468    |   1578     |   5015      |
|  5 |  1430  |   10    |  t5  |   0   |      18712    |   1100     |   6012      |
+----+--------+---------+------+-------+---------------+------------+-------------+

Best Answer

Would it make sense to add the results of my materialized view to a table every day?

Yes, it makes sense to materialize.

The analysis of large datasets (see: OLAP, dimensional modeling) includes the concept of aggregations, which can be implemented as materialized views. You should decide which aggregates to keep. In my opinion you need at least two (a sketch of both tables follows the list):

  • by (app_id) - the full-time history of a single app: all-time ranking, trends, etc.
  • by (app_id, day) - daily information for every single app. This can include average daily rankings and any other information that is relevant at this level, such as trend analysis and other daily metrics.
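
A minimal sketch of what those two aggregate tables could look like; the names are hypothetical and the columns are only examples, so keep whatever metrics you actually need:

CREATE TABLE data_by_app_id (
    app_id        integer PRIMARY KEY,
    avg_ranking   numeric,
    best_ranking  integer
);

CREATE TABLE data_by_app_id_day (
    app_id        integer,
    day           date,
    avg_ranking   numeric,
    ranking_sum   bigint,   -- storing sums and counts lets you
    ranking_count bigint,   -- roll this table up further without error
    PRIMARY KEY (app_id, day)
);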

You can always calculate higher-level information from a lower-level aggregate. For example, if you had an aggregate on (app_id, day, collection_id), you could use it instead of (app_id, day).
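
One caveat when rolling up (a sketch, with the hypothetical finer-grained table data_by_app_id_day_collection carrying the sums and counts suggested above): derive the coarser average from sums and counts, because an average of averages is wrong whenever group sizes differ.

SELECT app_id, day,
       sum(ranking_sum)::numeric / sum(ranking_count) AS avg_ranking
FROM data_by_app_id_day_collection
GROUP BY app_id, day;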

You can materialize your aggregates with the MATERIALIZED VIEW feature, but it is not the only way. If old data is static, it can be enough to insert new rows daily, with something similar to:

-- tscol stands for your timestamp column (called date in your table)
INSERT INTO data_by_app_id_day (app_id, day, avg_ranking)
SELECT app_id, tscol::date, avg(ranking)
FROM main_table
WHERE tscol::date = current_date - 1   -- yesterday's rows only
GROUP BY 1, 2;
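
If the daily load might ever be re-run for the same day, a variant using the (app_id, day) primary key sketched above makes it idempotent (ON CONFLICT needs PostgreSQL 9.5 or later):

INSERT INTO data_by_app_id_day (app_id, day, avg_ranking)
SELECT app_id, tscol::date, avg(ranking)
FROM main_table
WHERE tscol::date = current_date - 1
GROUP BY 1, 2
ON CONFLICT (app_id, day) DO UPDATE
    SET avg_ranking = EXCLUDED.avg_ranking;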

Or should I partition that big table and do it another way?

There are a lot of good sources on whether or not to partition.

If you are going to store two years of data or more, daily partitions could be optimal. But remember that a very large number of partitions makes query planning slower; the threshold depends on CPU speed and the queries used.
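
A minimal declarative-partitioning sketch (PostgreSQL 10 or later), assuming the main table could be rebuilt as a partitioned table keyed on its date column; the name rankings and the column types are assumptions based on the structure in the question:

CREATE TABLE rankings (
    id            bigint GENERATED ALWAYS AS IDENTITY,
    app_id        integer,
    ranking       integer,
    date          timestamp,
    price         numeric,
    collection_id integer,
    country_id    integer,
    category_id   integer
) PARTITION BY RANGE (date);

-- one partition per day; in practice these would be created by a nightly job
CREATE TABLE rankings_2024_01_01 PARTITION OF rankings
    FOR VALUES FROM ('2024-01-01') TO ('2024-01-02');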

P.S. I assume that you have already exhausted the normal ways to optimize (index sketches follow the list):

  • indexes on all columns used in WHERE clauses
  • partial indexes if needed
  • expression indexes if needed
  • optimal data types
  • avoiding unnecessary data detail that explodes your dataset size
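
For the partial and expression indexes mentioned above, hedged examples, assuming the main table is named rankings and its date column is a plain timestamp (casting timestamptz to date is not immutable, so it could not be indexed directly):

-- partial index: covers only the rows a hot query actually touches,
-- e.g. free apps (price = 0 appears throughout the sample data)
CREATE INDEX rankings_free_apps_idx
    ON rankings (app_id, date)
    WHERE price = 0;

-- expression index: matches predicates written as date::date = ...
CREATE INDEX rankings_day_idx
    ON rankings ((date::date));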