Postgresql – Effectively handle a 10-100 million row table of unrelated data

partitioning, performance, performance-tuning, postgresql, sharding

What are the common approaches to boosting the read/write performance of a table with up to 100 million rows?

The table has a column SEGMENT_ID INT NOT NULL, where each segment has about 100,000-1,000,000 rows. Writes: all rows for a given SEGMENT_ID are inserted at once, and that SEGMENT_ID is never updated afterwards. Reads are quite frequent, and I need good performance for SELECT * FROM table WHERE SEGMENT_ID = ?.
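
For concreteness, here is a minimal sketch of the kind of table and access pattern I mean (the table name events and the payload column are placeholders, not my real schema):

CREATE TABLE events (
  segment_id INT NOT NULL,  -- every query filters on this column
  payload    TEXT           -- stand-in for the real data columns
);

-- Write pattern: all rows for one segment arrive in a single batch
INSERT INTO events (segment_id, payload)
SELECT 42, 'some data'
FROM generate_series(1, 1000000);

-- Read pattern: fetch a whole segment at once
SELECT * FROM events WHERE segment_id = 42;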

The most obvious approach is creating a new table per SEGMENT_ID dynamically, but dynamic tables mean hacks in the ORM or even in a native SQL query framework. In other words, you end up with code that smells.
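
To make the smell concrete, the dynamic-table approach would look roughly like this (segment_142 is a made-up naming pattern):

-- Application code creates one physical table per segment on the fly
CREATE TABLE segment_142 (
  payload TEXT  -- stand-in for the real columns
);

-- Every read has to assemble the table name as a string,
-- which is exactly where the ORM hacks start
SELECT * FROM segment_142;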

You can also use sharding, right? Does the database create new tables under the hood?

I could also cluster the table by SEGMENT_ID. But will my inserts stay clustered if I insert all of a segment's data at once?
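
For reference, clustering would look something like the sketch below (again using the placeholder events table); as far as I understand, CLUSTER is a one-time physical rewrite and is not maintained for rows inserted later.

CREATE INDEX events_segment_id_idx ON events (segment_id);

-- Rewrites the whole table in index order and takes an exclusive lock;
-- rows inserted afterwards are not kept in this order automatically
CLUSTER events USING events_segment_id_idx;
ANALYZE events;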

Postgres also suggests partitioning to handle very big tables.
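
A partitioning sketch, assuming PostgreSQL 10+ declarative partitioning and the same placeholder events table, this time declared as partitioned (older versions need inheritance plus triggers):

CREATE TABLE events (
  segment_id INT NOT NULL,
  payload    TEXT
) PARTITION BY LIST (segment_id);

-- A partition still has to exist before a segment can be inserted
CREATE TABLE events_seg_142 PARTITION OF events FOR VALUES IN (142);

-- The planner prunes to the single matching partition
SELECT * FROM events WHERE segment_id = 142;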

Maybe there is some kind of magical index that will help me avoid creating new tables dynamically or configuring sharding?

Any other options?

Best Answer

Using a simple BRIN index

TIAS (try it and see).

Here is a table exactly as you described: worst case, 100 million rows with 1 million rows per SEGMENT_ID.

explain analyze
CREATE TABLE foo AS
  SELECT (x::int%100)::int AS SEGMENT_ID
  FROM generate_series(1,100e6) AS gs(x);

                                                              QUERY PLAN                                                              
--------------------------------------------------------------------------------------------------------------------------------------
 Function Scan on generate_series gs  (cost=0.00..15.00 rows=1000 width=32) (actual time=21740.904..57589.405 rows=100000000 loops=1)
 Planning time: 0.043 ms
 Execution time: 96685.350 ms
(3 rows)

That means we created the table in about 1.5 minutes. Now we add a BRIN index.

CREATE INDEX ON foo
  USING brin (SEGMENT_ID);
VACUUM ANALYZE foo;
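
For scale, the BRIN index stays tiny compared to the heap even over 100 million rows; you can check with the query below (foo_segment_id_idx is the name Postgres generated automatically, as seen in the plans further down).

SELECT pg_size_pretty(pg_relation_size('foo'))                AS table_size,
       pg_size_pretty(pg_relation_size('foo_segment_id_idx')) AS brin_index_size;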

Then we add another million rows, all with SEGMENT_ID = 142.

explain analyze
INSERT INTO foo(SEGMENT_ID)
  SELECT 142
  FROM generate_series(1,1e6) AS gs(x);

                                                             QUERY PLAN                                                              
-------------------------------------------------------------------------------------------------------------------------------------
 Insert on foo  (cost=0.00..10.00 rows=1000 width=0) (actual time=1489.958..1489.958 rows=0 loops=1)
   ->  Function Scan on generate_series gs  (cost=0.00..10.00 rows=1000 width=0) (actual time=174.690..286.331 rows=1000000 loops=1)
 Planning time: 0.043 ms
 Execution time: 1499.529 ms
(4 rows)

Adding a million rows took 1.5 seconds. Now we select:

explain analyze
SELECT *
  FROM foo
  WHERE SEGMENT_ID=142;

                                                           QUERY PLAN                                                           
--------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on foo  (cost=52.00..56.01 rows=1 width=4) (actual time=4.401..140.874 rows=1000000 loops=1)
   Recheck Cond: (segment_id = 142)
   Rows Removed by Index Recheck: 24832
   Heap Blocks: lossy=4535
   ->  Bitmap Index Scan on foo_segment_id_idx  (cost=0.00..52.00 rows=1 width=0) (actual time=1.504..1.504 rows=46080 loops=1)
         Index Cond: (segment_id = 142)
 Planning time: 0.059 ms
 Execution time: 176.902 ms
(8 rows)

Selecting a million rows took 176 ms.
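
The scan above is lossy (it rechecks whole block ranges, as the "Heap Blocks: lossy" line shows). If that recheck ever becomes a bottleneck, one knob to try is a smaller pages_per_range, trading a slightly larger index for finer ranges; whether it pays off depends on how physically clustered each segment is.

DROP INDEX foo_segment_id_idx;
-- Default is pages_per_range = 128; smaller ranges mean fewer lossy rechecks
CREATE INDEX ON foo USING brin (SEGMENT_ID) WITH (pages_per_range = 32);
VACUUM ANALYZE foo;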

That's on a five-year-old x230 with an "Intel(R) Core(TM) i5-3230M CPU @ 2.60GHz" and a single SSD. You can pick one up for a few hundred bucks and install Xubuntu on it. Not exactly hard science either; I was compiling Angular apps in the background.