Postgresql – Postgres performance testing, caching and flushing

performancepostgresqlpostgresql-performance

I been doing some testing on various postgres SQL queries.

Testing often involves changing the syntax of queries, altering table joins, or occasional completely re-writing the query.

I've noticed that I sometimes get an 'apparent' big performance increase. I'll run a query, it'll take (say) 60 seconds to run, I'll make a minor change and it'll then take (say) 5 seconds to run.

At first I though that this was because my minor tweak had improved the performance. I've since realised that actually there must be some caching going on, (to see this, try running a 60 second query, and then running it again a few seconds later – it will always run quicker the 2nd time), I assume this is because the data has been cached locally somewhere so when the data needs reading a 2nd time it's already to hand.

I'm sure this a useful performance feature, but it does make it very hard to spot genuine performance improvements when tweaking a query. Is it possible to flush the cache before each execution to ensure that each test starts from the same position?

Thanks

Best Answer

PostgreSQL relies heavily on the OS cache as well as its own cache. So you would have to clear the PostgreSQL cache (restart the postgres service) and clear the OS cache (either restart the machine, or follow the method in the comments to clear without restarting if you are Linux).

But, why do you want to do it this way? If the query is parameterized, you could just change the parameters each time so they refer to a different part of the data. If the query is not parameterized, then why is the data it needs getting driven out of the cache between runs on the production machine?

When the different parameterizations have different performance due to different result sizes, the most definitive solution is to write a driver program that repeats the query several times with random (but realistic) parameterizations, and hopes that the parameter-induced variation averages out between your tuning settings. This is a pain, but when I have had to resort to it I generally came to wish I had to done it sooner. This is better than doing each query from a cold cache, because in the real production situation you are unlikely to have a fully cold cache. Certain parts of the data will be in common to all parameterizations, and that part will be "hot" enough to always be in the cache.