PostgreSQL – memory overhead using partitions vs. a single table

concurrency, memory, partitioning, postgresql, postgresql-9.6

Background

I am ingesting time-series data totalling about 10M rows per day, where each row has a timestamp, a player id and some other columns.

That data is then queried via an API for a specific player and a time range (e.g. the last 90 days for player xxx); in most cases more than one request is served at a time (normally around 10 concurrent requests are executed).

I am running Postgres 9.6.17; the machine has 500GB of HDD space (with about 15% free at all times), 8 cores and 16GB of RAM. work_mem is set to 2GB, effective_cache_size to 12GB, max_connections to 100, etc.
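
For reference, a minimal sketch of how these settings can be inspected (and, if needed, changed) from psql; the values themselves are just the ones listed above:

SHOW work_mem;
SHOW effective_cache_size;
SHOW max_connections;

-- change a setting cluster-wide, then reload the configuration
ALTER SYSTEM SET work_mem = '2GB';
SELECT pg_reload_conf();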

The API consists of 20 Python gunicorn workers running Flask and SQLAlchemy + psycopg2 on a separate machine. Each worker has a pool of 2 connections to the DB, allowing for an overflow of 5. The pool settings used to be higher, but it turned out there is almost no benefit to larger pools, hence the low numbers.

Naive approach

Initially, I put all the data into a single table with separate indexes on the timestamp and the player id. This approach worked fine until the growing amount of data started to make it slower and slower (for obvious reasons), which led the API workers to time out and return 500s. The cost of a typical query (a single player's data over 6 months) as returned by EXPLAIN would be around 1M; example below:

Bitmap Heap Scan on player_data  (cost=515553.98..585514.47 rows=80037 width=32)
   Recheck Cond: (((player_id)::text = 'xxx'::text) AND (ts >= 1572566400) AND (ts < 1574899199))
   ->  BitmapAnd  (cost=515553.98..515553.98 rows=62819 width=0)
         ->  Bitmap Index Scan on idx_player_id  (cost=0.00..12749.35 rows=962837 width=0)
               Index Cond: ((player_id)::text = 'xxx'::text)
         ->  Bitmap Index Scan on idx_ts  (cost=0.00..502778.48 rows=37691480 width=0)
               Index Cond: ((ts >= 1572566400) AND (ts < 1574899199))
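
For context, a minimal sketch of the single-table layout behind this plan; the table and index names come from the plan above, while the column types are assumptions (ts looks like a Unix epoch integer, player_id like text):

CREATE TABLE player_data (
    ts        bigint NOT NULL,   -- Unix epoch timestamp; exact type is an assumption
    player_id text   NOT NULL
    -- plus the other per-row columns mentioned above
);

CREATE INDEX idx_ts        ON player_data (ts);
CREATE INDEX idx_player_id ON player_data (player_id);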

Better approach with partitions

As an improvement, I started storing the data in partitions, one per day, and creating the player id and timestamp indexes on every partition instead. This reduced query times significantly, and the improvement was also visible in EXPLAIN:

Append  (cost=0.00..85192.02 rows=80037 width=32)
   ->  Seq Scan on player_data  (cost=0.00..0.00 rows=1 width=85)
         Filter: ((ts >= 1572566400) AND (ts < 1574899199) AND ((player_id)::text = 'xxx'::text))
   ->  Index Scan using idx_player_id_20191104 on player_data_20191104  (cost=0.43..317.61 rows=280 width=32)
         Index Cond: ((player_id)::text = 'xxx'::text)
         Filter: ((ts >= 1572566400) AND (ts < 1574899199))
........<continued for a while>................

The cost was almost an order of magnitude lower and everything went well, especially for queries over shorter time ranges.
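
For reference, 9.6 has no declarative partitioning, so a layout like the one in the plan above is typically built with table inheritance. A minimal sketch for a single daily partition (the CHECK boundaries, the idx_ts_* name and the exact DDL are assumptions; the other names follow the plan):

CREATE TABLE player_data_20191104 (
    CHECK (ts >= 1572825600 AND ts < 1572912000)   -- 2019-11-04 UTC, boundaries assumed
) INHERITS (player_data);

CREATE INDEX idx_player_id_20191104 ON player_data_20191104 (player_id);
CREATE INDEX idx_ts_20191104        ON player_data_20191104 (ts);

With constraint_exclusion at its default of partition, the planner can skip child tables whose CHECK constraint falls entirely outside the requested time range, which is why the Append plan only touches partitions inside that range.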

Issues

After a while, users started complaining that queries were not working, i.e. the API started returning 500 errors. After checking the logs I noticed a lot of the following errors:

sqlalchemy.exc.OperationalError: (psycopg2.errors.OutOfMemory) out of memory
DETAIL:  Failed on request of size NNNN.

Note that NNNN can be pretty much any integer between 1 and 10000. Once the error manifested itself, it would hit all queries executed concurrently, rendering the DB unresponsive for a while (a few minutes). Normally a restart of the API service would restore things to normal. While the DB is unresponsive, the same error is returned for any kind of query or access, including psql connection attempts from my machine.

I should also note that the queries running in parallel mostly hit the same indexes and partitions, e.g. different player ids over the same time range; sometimes the queries are identical except for their offset/limit values.
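
For illustration, the general shape of such a query (the column list, ordering and paging values are made up; the predicate matches the plans above):

SELECT *
FROM player_data
WHERE player_id = 'xxx'
  AND ts >= 1572566400
  AND ts <  1574899199
ORDER BY ts
LIMIT 1000 OFFSET 2000;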

Another note – the same errors show up regardless of whether the concurrent queries come through the API or directly via psql.

Debugging and diagnostics

I've checked the resources on the DB machine (since it is supposedly running out of memory) while querying:

  • available RAM never drops below 60%
  • no swap is ever used and nothing gets written to disk (although it should be if Postgres were actually running out of memory)
  • the concurrent connection count to the DB tops out at 100 during testing and idles at 40
  • a single partition index is about 5MB, while the index on the old table was almost 1GB (see the size query below)
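
The sizes above can be checked with something like this (a sketch; the index names are the ones from the plans):

SELECT pg_size_pretty(pg_relation_size('idx_player_id'))          AS single_table_index,
       pg_size_pretty(pg_relation_size('idx_player_id_20191104')) AS partition_index;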

Also, I've tried the following changes to the DB:

  • increase work_mem from 20MB to 2GB, no improvement
  • increase RAM from 8GB to 12GB and finally to 16GB, no improvement

I've also compared the old single-table setup with the new partitioned one, with the following results:

  • more than 10 concurrent queries against the partitioned table caused the error to appear
  • 200 concurrent queries against the old, single table caused no issues whatsoever

Question(s)

So the questions I guess are:

  • Does the use of partitions create some memory overhead that causes the queries to fail with memory allocation errors?
  • Is it the indexes? Is using multiple indexes worse than having one big one?
  • Does this theory even make sense or am I missing something obvious?

Best Answer

(normally around 10 concurrent requests are executed)

... 16GB of RAM. work_mem is set to 2GB,

Your work_mem setting seems nuts to me. With 10 concurrent queries, each of which might use multiple instances of work_mem (especially common with partitioning – so to answer your title question: yes, there is overhead), it would be no surprise to run out of memory.
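
As a rough illustration of the arithmetic (the number of memory-hungry nodes per query is an assumption): 10 concurrent queries × 3 sort/hash nodes each × 2GB work_mem = 60GB of potential allocations on a 16GB machine.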

I've checked the resources on the DB machine (since it's supposed to run out of memory) while querying:

How did you do this? The data from different tools need to be interpreted in different ways.

Also, I've tried the following changes to the DB:

increase work_mem from 20MB to 2GB, no improvement

Right, increasing the amount of memory you use is unlikely to fix an out of memory problem. "The beatings will continue until morale improves."

But are you saying you had the exact same problem back when work_mem was set to only 20MB? Did you look in the database log file to see what it said about the problem directly (as opposed to what Python passed along to you about it)?
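
If it helps, the server log location can be found from psql with something like this (a sketch; whether the logging collector is enabled on this cluster is unknown):

SHOW logging_collector;
SHOW log_directory;    -- relative to the data directory unless set to an absolute path
SHOW log_filename;
SHOW data_directory;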

From your description, it sounds like you might not need partitioning at all. A multicolumn index on (player_id, ts) would likely have fixed your problem without taking on the burden and overhead of partitioning.
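
A sketch of that index on the original single table (the index name is made up; CONCURRENTLY avoids blocking writes while it builds):

CREATE INDEX CONCURRENTLY idx_player_id_ts ON player_data (player_id, ts);

With player_id as the leading column, one player's rows over a contiguous time range become a single index range scan instead of a BitmapAnd of two large bitmaps.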

Do you ever plan to delete/archive old data, or will it accumulate indefinitely?