PostgreSQL Performance – Speedup and Slowdown with Concurrent Queries

Tags: concurrency, performance, postgresql, query-performance

I approach you all humbly as one who is NOT a DBA, and I'm sure that my question is fraught with conceptual shortcomings and "it depends" land mines. I'm also pretty sure that everyone who chooses to answer will want a lot more specifics than I can currently deliver.

That said, I'm curious about the following scenario in general:

  • Say that I have two non-trivial queries.
  • Query 1 requires 2 minutes to complete on average.
  • Query 2 requires 5 minutes to complete on average.

If I run them serially, one right after the other, I'm expecting it will require 7 minutes to complete on average. Is this reasonable?

More than that, however, what if I run the two queries concurrently? Two separate connections at the same time.

  • Under what conditions would I expect to see a speedup? (Total time < 7 minutes)
  • Under what conditions would I expect to see a slowdown? (Total time > 7 minutes)
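To make the "speedup" case concrete, here is a toy sketch in Python (nothing PostgreSQL-specific; the sleeps are hypothetical stand-ins for I/O waits). If both queries spend essentially all their time waiting on I/O and don't contend for the same resource, running them concurrently takes roughly as long as the slower one alone, not the sum:

```python
import threading
import time

# Stand-ins for the two queries: pure waits, scaled down from
# 2 and 5 minutes to 0.2 s and 0.5 s.
def query_1():
    time.sleep(0.2)

def query_2():
    time.sleep(0.5)

# Serial: total time is roughly the sum of the two durations.
start = time.perf_counter()
query_1()
query_2()
serial = time.perf_counter() - start

# Concurrent: if the waits overlap perfectly, total time is roughly
# the duration of the slower "query" alone.
start = time.perf_counter()
t1 = threading.Thread(target=query_1)
t2 = threading.Thread(target=query_2)
t1.start(); t2.start()
t1.join(); t2.join()
concurrent = time.perf_counter() - start

print(f"serial: {serial:.2f}s, concurrent: {concurrent:.2f}s")
```

The slowdown case is the mirror image: if the two tasks contend for one saturated resource (a single disk, a single CPU core), the overlap disappears and the concurrent total creeps back toward, or past, the serial total.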

Now, if I had 1,000 non-trivial queries running concurrently, I have a hunch that it would result in an overall slowdown. In that case, where would the bottleneck likely be? Processor? RAM? Drives?

Again, I know it's probably impossible to answer the question precisely without knowing specifics (which I don't have). I'm looking for some general guidelines to think about when asking the following questions:

  • Under what circumstances do concurrent queries result in an overall speedup?
  • Under what circumstances do concurrent queries result in an overall slowdown?

Best Answer

If I run them serially, one right after the other, I'm expecting it will require 7 minutes to complete on average. Is this reasonable?

If they use unrelated data sets, then yes.

If they share a data set, the cache is cold for the first query, and that query is mostly I/O bound, then the second one might complete in moments. Caching effects need to be considered in any performance analysis or query timing work.
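That cold-versus-warm effect is easy to demonstrate outside the database. A sketch in Python, using an in-process memo as a hypothetical stand-in for the buffer cache (the sleep simulates a slow disk read):

```python
import functools
import time

# Hypothetical stand-in for the buffer cache: once a block has been
# "read from disk" (simulated by a sleep), it is served from memory.
@functools.lru_cache(maxsize=None)
def read_block(block_id):
    time.sleep(0.02)  # simulate a slow disk read
    return block_id

start = time.perf_counter()
cold_result = [read_block(b) for b in range(20)]  # every read goes to "disk"
cold = time.perf_counter() - start

start = time.perf_counter()
warm_result = [read_block(b) for b in range(20)]  # every read is a cache hit
warm = time.perf_counter() - start

print(f"cold scan: {cold:.2f}s, warm scan: {warm:.4f}s")
```

The same shape shows up when timing real queries: the first run pays for the disk reads, and a repeat run over the same data can be orders of magnitude faster.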

More than that, however, what if I run the two queries concurrently? Two separate connections at the same time.

"It depends".

If they were both using sequential scans of the same table, then in PostgreSQL it'd be a huge performance win because of its support for synchronized sequential scans (the synchronize_seqscans setting, on by default), which lets the second scan piggyback on the first one's reads.

If they shared the same indexes, then they'd likely benefit from each other's reads into cache.

If they're independent and touch different data then they might compete for I/O bandwidth, in which case they might take the same amount of time as running sequentially. If the I/O subsystem benefits from concurrency (higher net throughput with more clients) then the total time might be less. If the I/O subsystem handles concurrency poorly then they might take longer than running them sequentially. Or they might not be I/O bound at all, in which case if there's a free CPU for each they could well execute as if the other wasn't running at all.

It depends a great deal on the hardware and system configuration, the data set, and on the queries themselves.

Now, if I had 1,000 non-trivial queries running concurrently, I have a hunch that it would result in an overall slowdown. In that case, where would the bottleneck likely be? Processor? RAM? Drives?

Yes, that'd very likely slow things down for a number of reasons.

  • PostgreSQL's own overheads in inter-process coordination, transaction and lock management, buffer management, etc. This can be quite a big cost, and PostgreSQL isn't really designed for high client counts - it works better if you queue work through a smaller number of connections.

  • Competition for working memory, cache, etc.

  • OS scheduling overhead as it juggles 1,000 competing processes all wanting time slices. That's pretty minor these days; modern OSes have fast schedulers.

  • I/O thrashing. Most I/O systems have a client count at which performance peaks. Sometimes it's 1, i.e. the system is best with only one client, but it's often higher. Above that threshold, performance sometimes decreases again, and sometimes it just reaches a plateau.
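The "queue work" advice in the first bullet can be sketched in Python: instead of 1,000 concurrent clients, push 1,000 jobs through a small worker pool. The pool size of 8 here is an arbitrary illustration; in a real deployment a connection pooler such as PgBouncer typically plays this role in front of PostgreSQL:

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

lock = threading.Lock()
in_flight = 0
peak = 0

# Hypothetical stand-in for a non-trivial query.
def run_query(i):
    global in_flight, peak
    with lock:
        in_flight += 1
        peak = max(peak, in_flight)
    time.sleep(0.001)  # simulate the query doing work
    with lock:
        in_flight -= 1
    return i

# 1,000 jobs queued through a pool of 8 workers: every job completes,
# but no more than 8 ever run at once, so the server never sees the
# coordination and scheduling overhead of 1,000 simultaneous clients.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(run_query, range(1000)))

print(f"completed {len(results)} queries, peak concurrency {peak}")
```

Total throughput is usually higher this way too, because the pool size can be tuned to sit near the I/O subsystem's peak-performance client count rather than far past it.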