PostgreSQL – Efficient Ways to Count Number of Rows in a Table

count performance postgresql query-performance

I am using Postgres with the following query:

select count(*) from image;

The primary key on this table is non-incrementing; it's a unique serial number for the images stored in the table. Our app often attempts to ingest images that have already been recorded in the database, so the primary key/serial number ensures they are only recorded once.

Now we are wondering if we should have gone with an incrementing primary key instead. We have 1,259,369 images in the database and it takes about 7 minutes for the count query to run.

Our app will never delete images from this table – so an incrementing primary key would allow us to check the value of the last ID which would equal the number of rows in the table.

Best Answer

Generally, if you don't need an exact count, there is a much faster way:

SELECT reltuples::bigint AS estimate
FROM   pg_class
WHERE  oid = 'image'::regclass;

See: Fast way to discover the row count of a table

As a matter of fact, in a DB with concurrent write access every count is an estimate, because the number may be outdated the instant you get it.
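A note on freshness: reltuples is only updated by VACUUM, ANALYZE and a few DDL commands such as CREATE INDEX, so the estimate is only as current as the last such run. A minimal sketch, assuming the table is named image as in the question and that running a manual ANALYZE is acceptable on your system:

```sql
-- Refresh planner statistics so the estimate reflects recent inserts.
ANALYZE image;

-- Read the estimate. On PostgreSQL 14 and later, -1 means the table
-- has never been vacuumed or analyzed.
SELECT reltuples::bigint AS estimate
FROM   pg_class
WHERE  oid = 'image'::regclass;
```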

But, as @a_horse commented, something is off in your DB. Counting a million rows should not take more than a few seconds, even in the worst case.
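One way to see where the time goes, sketched under the assumption that you can afford to run the slow query once more (EXPLAIN ANALYZE really executes the statement):

```sql
-- Shows the chosen plan, actual timing, and how many pages were
-- read from shared buffers versus from disk.
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*) FROM image;
```

If the scan touches far more pages than a table of this size should occupy, that would point toward bloat or an I/O problem rather than toward the choice of primary key.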

That your app never deletes images from this table makes this even more suspicious, because there shouldn't be many dead rows then. (Or are you updating a lot?) A large number of dead tuples could slow you down and call for VACUUM. Normally, autovacuum takes care of this. Did you enable it? (It's on by default in modern Postgres.)

See: Are regular VACUUM ANALYZE still recommended under 9.1?
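A small sketch of how one might check the setting and clean up by hand, assuming the table name image from the question. VACUUM cannot run inside a transaction block, so issue it on its own:

```sql
-- Reports "on" when the autovacuum launcher is enabled.
SHOW autovacuum;

-- Reclaim dead tuples and refresh statistics for this one table;
-- VERBOSE prints a per-table activity report.
VACUUM (VERBOSE, ANALYZE) image;
```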

Check for dead tuples:

See: Measure the size of a PostgreSQL table row
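As a quick first look, the statistics view pg_stat_user_tables reports estimated live and dead tuple counts per table. A minimal sketch, again assuming the table is called image; the numbers are estimates maintained by the statistics system, not exact counts:

```sql
SELECT relname,
       n_live_tup,                 -- estimated live rows
       n_dead_tup,                 -- estimated dead rows awaiting VACUUM
       last_vacuum,
       last_autovacuum,
       pg_size_pretty(pg_total_relation_size(relid)) AS total_size
FROM   pg_stat_user_tables
WHERE  relname = 'image';
```

A dead tuple count that is large relative to the live count, or a total size far beyond what about 1.26 million rows should need, would support the bloat theory.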

All the usual advice for performance optimization applies.

Related Solutions

Mysql – use a foreign key index as a shortcut to getting a row count in an INNODB table

You cannot trust the EXPLAIN plan's row count. Why?

Each time query optimization involves an InnoDB table, the InnoDB storage engine approximates the number of BTREE entries for the keys. These approximations change from query to query.

Simply run SHOW INDEXES FROM guage_data; multiple times and you will get a different set of row counts each time. You could set innodb_stats_on_metadata to 0:

SET GLOBAL innodb_stats_on_metadata = 0;

This will cause SHOW INDEXES FROM guage_data; to give the same number over and over again. This is still self-defeating because the row count would still be approximated anyway. You would have to disable the metadata stats and run ANALYZE TABLE. Again, this is self-defeating.

With innodb_stats_on_metadata on, approximations are always done. I wrote about this back on June 21, 2011: From where does the MySQL Query Optimizer read index statistics?

In light of all this, you are better off doing SELECT id FROM guage_data if id is the PRIMARY KEY. However, notice in your EXPLAIN plan that MySQL chose to traverse the gauge_data_FI_1 index. This makes all the sense in the world if this table is InnoDB. Why? Every non-unique index stores the PRIMARY KEY value of the row anyway.
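To make the exact versus approximate distinction concrete, here is a hedged sketch in MySQL syntax. It reuses the table name guage_data from the text above and assumes the table lives in the currently selected schema:

```sql
-- Exact count: InnoDB has to scan an index to produce it.
SELECT COUNT(*) FROM guage_data;

-- Approximate count from the data dictionary; for InnoDB tables
-- TABLE_ROWS is only an estimate.
SELECT TABLE_ROWS
FROM   information_schema.TABLES
WHERE  TABLE_SCHEMA = DATABASE()
  AND  TABLE_NAME   = 'guage_data';
```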

Postgresql – Most efficient way of bulk deleting rows from postgres

Your second option is far cleaner and will perform well enough to make that worth it. Your alternative is to build gigantic queries, which will be quite a pain to plan and execute. You are generally better off letting PostgreSQL do the work here. I have found updates on tens of thousands of rows in the manner you are describing to perform adequately, but there is one important thing to avoid doing.

The way to do it is to use a select and a join in your delete.

DELETE FROM foo WHERE id IN (select id from rows_to_delete);
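An equivalent way to write this delete as an explicit join is the PostgreSQL-specific USING clause. This is a sketch with the same table names as above, not a claim that it plans differently in every case:

```sql
-- Join the target table to the list of ids to remove.
DELETE FROM foo f
USING  rows_to_delete d
WHERE  f.id = d.id;
```

USING is a PostgreSQL extension to the SQL standard, so the IN form is the more portable of the two.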

Under no circumstances should you do the following with a large table:

DELETE FROM foo WHERE id NOT IN (select id from rows_to_keep);

This will usually cause a nested loop antijoin which will make performance rather problematic. If you end up having to go that route, do this instead:

DELETE FROM foo
WHERE id IN (select f.id from foo f
          LEFT JOIN rows_to_keep d on f.id = d.id
              WHERE d.id IS NULL);

PostgreSQL is usually pretty good at avoiding bad plans but there are still cases involving outer joins which can make a big difference between good and bad plans.

This is wandering a little far afield, but I figure it is worth mentioning because of how easy it is to go from IN to NOT IN and watch query performance tank.
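Another common way to express "delete everything not in the keep list" is NOT EXISTS. It is offered here as an alternative sketch with the same table names, not as part of the original advice:

```sql
-- Delete rows of foo that have no matching id in rows_to_keep.
DELETE FROM foo f
WHERE  NOT EXISTS (
    SELECT 1
    FROM   rows_to_keep k
    WHERE  k.id = f.id
);
```

Unlike NOT IN, NOT EXISTS is not tripped up by NULL values in rows_to_keep.id, and the planner can usually execute it as an anti join. Comparing the plans with EXPLAIN on your own data is still the safest check.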
