PostgreSQL – Schema Design – would you split this table?

database-design | optimization | partitioning | postgresql

Table A

- auto-incrementing integer primary key
- 3 dates
- 17 booleans
- 7 varchars (under 40 bytes combined in 99% of rows)
- 5 foreign keys

Table B

- auto-incrementing integer primary key
- 1 date
- 2 boolean fields
- 3 varchars (one holds a unique UUID string in every row; the other two are empty 99% of the time)
- 3 foreign keys

1. Table A has 25 million rows – SELECT COUNT(*) takes approx. 70 seconds
2. Table B has 20 million rows – SELECT COUNT(*) takes approx. 6-7 seconds

I only included Table B to illustrate my assumption: the reason counting Table A takes roughly ten times (900% more) as long, despite it having only 25% more rows than Table B, is that Table A has 33 columns compared with Table B's 10. I realize 33 columns isn't a "big" table, but it may be bigger than we need, since many operations only use a few of those columns.

We aren't experiencing any serious issues, aside from counting taking too long to be useful within the span of a web request (e.g., for pagination in a Django admin list view).
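
For the pagination case specifically, one common workaround (a sketch; it relies on planner statistics kept fresh by autovacuum/ANALYZE, so the number is approximate) is to read the estimated row count from the catalog instead of counting:

-- Approximate row count from planner statistics; runs in milliseconds
-- regardless of table size, but can lag until the next ANALYZE.
SELECT reltuples::bigint AS estimated_rows
FROM pg_class
WHERE oid = 'message'::regclass;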

However, because of a new feature, Table A is about to grow from 25 million rows by roughly 1-2 million rows per month, which adds to our general fears about performance.

We're considering the following options:

Split Table A

We'd move most of Table A's columns into a new child table (Table A Meta) with a 1-to-1 relation to Table A. Only the 4-6 most commonly used columns would remain in Table A. Presumably this would make read queries and counting quicker, at the expense of needing a separate query (or JOIN) to fetch the remaining details for a given row, plus a small amount of additional write overhead. A rough sketch follows.
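
Here is roughly what we have in mind (table and column names are illustrative, not our actual schema; it assumes Table A is message with an integer primary key id):

-- Hypothetical 1-to-1 side table holding the rarely used columns;
-- it shares the parent's primary key, so no extra sequence is needed.
CREATE TABLE message_meta (
    message_id integer PRIMARY KEY REFERENCES message (id),
    some_flag  boolean NOT NULL DEFAULT false,
    some_date  date,
    some_note  varchar(40)
    -- ...the rest of the rarely used booleans, dates, and varchars...
);

-- Reads that need the full detail then pay for a join:
SELECT m.*, mm.*
FROM message m
JOIN message_meta mm ON mm.message_id = m.id
WHERE m.id = 12345;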

Wait for trouble, then partition

The other option is to ignore the fact that SELECT COUNT(*) is basically unusable, since everything else (querying via indexed fields, etc.) is functioning properly, and to keep using Table A even past 50 or 75 million rows, eventually partitioning it at some point in the future.
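
For reference, declarative partitioning (available since PostgreSQL 10; primary keys on partitioned tables need 11+) would look roughly like this, with a hypothetical date column as the partition key:

-- Hypothetical: range-partition by date so old rows can later be
-- detached or dropped as whole partitions instead of bulk-deleted.
CREATE TABLE message_part (
    id         bigserial,
    created_on date NOT NULL,
    -- ...remaining columns...
    PRIMARY KEY (id, created_on)  -- the partition key must be part of the PK
) PARTITION BY RANGE (created_on);

CREATE TABLE message_part_2019 PARTITION OF message_part
    FOR VALUES FROM ('2019-01-01') TO ('2020-01-01');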

I realize these options aren't mutually exclusive, but I'd like some general feedback on them, and in particular on whether you would bother with the split at all.

Execution plan

Here is the execution plan for the slower COUNT mentioned above:

db=# EXPLAIN ANALYZE SELECT COUNT(*) FROM "message";
                                                            QUERY PLAN                                                             
-----------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1236008.12..1236008.14 rows=1 width=8) (actual time=79357.273..79357.293 rows=1 loops=1)
   ->  Seq Scan on message  (cost=0.00..1173727.10 rows=24912410 width=0) (actual time=1.460..65375.321 rows=24926666 loops=1)
 Planning time: 0.543 ms
 Execution time: 79357.350 ms
(4 rows)

Best Answer

count(*) is always slow, so don't do it unless you have to.

The most important factor is usually the physical size of the table, so check that for an explanation of what you observe.
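
A way to check that, using the table name from the question (pg_relation_size is the heap alone; pg_total_relation_size includes indexes and TOAST):

-- Compare on-disk sizes; a large difference here, not the column count
-- per se, is what makes one sequential scan so much slower than the other.
SELECT pg_size_pretty(pg_relation_size('message'))       AS heap_size,
       pg_size_pretty(pg_total_relation_size('message')) AS total_size;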

Splitting the table along a one-to-one relationship might speed up counting, but will slow down other queries, which seems like a bad deal.

Partitioning won't help, but done correctly, it can make it easy to get rid of old data. Using a recent PostgreSQL version and parallel query with a lot of processes is more promising for fast counting.
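
A quick experiment along those lines (session-local settings; parallel sequential scans need PostgreSQL 9.6+, max_parallel_workers exists from version 10, and suitable values depend on your CPU count):

-- Allow more parallel workers for this session and retry the count.
SET max_parallel_workers_per_gather = 8;  -- default is 2
SET max_parallel_workers = 8;             -- capped by max_worker_processes
SELECT count(*) FROM message;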