PostgreSQL – Managing and Speeding Up Queries on Large Table

database-design | database-recommendation | index | partitioning | postgresql

I have time-series data which spans over 10 years and has over 3 billion rows and 10 columns.

At the moment the server has a PCIe SSD and 128 GB of RAM, and I am finding that querying still takes a significant amount of time. For example, running the query below takes well over 15 minutes:

SELECT * FROM tbl WHERE column_a = 'value1' AND column_b = 'value2';

The table is mostly used for reads. The only time the table is written to is during weekly updates which insert about 15 million rows.

What are the best ways to manage a table this large? Would you recommend splitting it by year?

The table size is 542 GB and the external size is 109 GB.
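(For reference, the sizes were measured with something like the query below; the "external size" is the total relation size minus the table itself, i.e. indexes plus TOAST.)

SELECT pg_size_pretty(pg_relation_size('tbl'))        AS table_size
     , pg_size_pretty(pg_total_relation_size('tbl')
                    - pg_relation_size('tbl'))         AS external_size;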

EXPLAIN (BUFFERS, ANALYZE) output:

"Seq Scan on table  (cost=0.00..116820941.44 rows=758 width=92) (actual time=0.011..1100643.844 rows=667 loops=1)"
"  Filter: (("COLUMN_A" = 'Value1'::text) AND ("COLUMN_B" = 'Value2'::text))"
"  Rows Removed by Filter: 4121893840"
"  Buffers: shared hit=2 read=56640470 dirtied=476248 written=476216"
"Total runtime: 1100643.967 ms"

The table was created using the following code:

CREATE TABLE tbl (
  DATE     timestamp with time zone,
  COLUMN_A text,
  COLUMN_B text,
  VALUE_1  double precision,
  VALUE_2  double precision,
  VALUE_3  double precision,
  VALUE_4  double precision,
  VALUE_5  double precision,
  VALUE_6  double precision,
  VALUE_7  double precision
);

CREATE INDEX ix_tbl_date ON tbl (DATE);

Best Answer

Your existing index on DATE is useless for this query. The first obvious step for your query:

SELECT * FROM tbl WHERE column_a = 'value1' AND column_b = 'value2';

is an index on column_a or column_b (whichever is more selective), or possibly a multicolumn index on (column_a, column_b), like:

CREATE INDEX tbl_a_b_idx ON tbl(column_a, column_b);
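If you are unsure which column is more selective, the statistics Postgres keeps after ANALYZE give a rough idea. A minimal check, assuming the lower-cased column names from your table definition:

SELECT attname, n_distinct
FROM   pg_stats
WHERE  tablename = 'tbl'
AND    attname IN ('column_a', 'column_b');

The column with more distinct values (or a negative n_distinct closer to -1, which expresses a fraction of the row count) is the more selective one for an equality filter.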


Next, I would consider partial indexes, if you can safely exclude large parts of the table from your queries. Or partition the table into something like 100 partitions and use constraint exclusion.
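A minimal sketch of a partial index, assuming (purely for illustration) that most of your queries only ever touch column_a = 'value1':

CREATE INDEX tbl_b_partial_idx ON tbl (column_b)
WHERE  column_a = 'value1';

Such an index only covers the matching slice of the table, so it stays far smaller and cheaper to maintain; the planner can only use it for queries whose WHERE clause implies the index predicate.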

"Splitting it by year", like you contemplated, would be just as useless (harmful, even) as your index on DATE for the given query. Partitions need to be based on columns in the predicates of the query (column_a and column_b) to be helpful. Predicates on just one column would be much easier in this respect. If you have important queries filtering on various columns, partitioning is probably not the way to go. (Partial indexes still might.)

If partitioning is no good, a less invasive measure would be to CLUSTER the data based on the new index (you cannot use a partial index for this). Or simply create a new table from the sorted output of a query. This is particularly attractive since your table is mostly read-only. It should pay to do this at least once, but that one time will be very expensive: the whole table has to be rewritten one way or the other. You need enough free space, as much RAM as you can get, and an exclusive lock on the table. Or use pg_repack to avoid the exclusive lock.
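A sketch of both variants, reusing the multicolumn index suggested above (tbl_sorted is just a placeholder name):

-- Physically rewrite the table in index order; takes an ACCESS EXCLUSIVE lock:
CLUSTER tbl USING tbl_a_b_idx;

-- Alternative: rewrite into a new, pre-sorted table and swap it in afterwards:
CREATE TABLE tbl_sorted AS
SELECT * FROM tbl
ORDER  BY column_a, column_b;

Either way, rows with equal values of column_a and column_b end up physically close together, so the query touches far fewer blocks.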

Be sure to run the latest version of Postgres. The upcoming Postgres 9.5 might be particularly interesting for you, since it introduces BRIN indexes (block range indexes), which can reduce the index size dramatically for very big tables. Might be exactly what you are looking for.
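A BRIN index would look like this (syntax as of Postgres 9.5); it only pays off if the physical row order correlates well with the indexed column, for instance after clustering the table as described above:

CREATE INDEX tbl_a_brin_idx ON tbl USING brin (column_a);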