Do fixed-width rows improve PostgreSQL read performance?

datatypes, performance, postgresql, postgresql-9.4, query-performance

I have a table articles:

                                                       Table "articles"
     Column     |            Type             |                     Modifiers                      | Storage  | Stats target | Description
----------------+-----------------------------+----------------------------------------------------+----------+--------------+-------------
 id             | integer                     | not null default nextval('articles_id_seq'::regclass) | plain    |              |
 user_id        | integer                     |                                                    | plain    |              |
 title          | character varying(255)      |                                                    | extended |              |
 author         | character varying(255)      |                                                    | extended |              |
 body           | text                        | default '--- []                                   +| extended |              |
                |                             | '::text                                            |          |              |
 created_at     | timestamp without time zone |                                                    | plain    |              |
 updated_at     | timestamp without time zone |                                                    | plain    |              |
 published_date | timestamp without time zone |                                                    | plain    |              |

Indexes:
    "articles_pkey" PRIMARY KEY, btree (id)
    "index_articles_on_published_date" btree (published_date)
    "index_rents_on_user_id" btree (user_id)
    "index_articles_on_user_id_and_published_date" btree (user_id, published_date)

We're on Postgres 9.4.4. The machine has 3.5 GB of memory and 150 GB of disk space on an SSD.

Note: The published_date is always rounded by the application to the nearest date; the hours/minutes/seconds are always 00. Legacy behavior that needs fixing.

This table has hundreds of millions of articles. It receives a heavy read load from as many as 16 concurrent processes, each issuing the following queries as quickly as our system will respond:

  • a count of the total number of articles

    SELECT COUNT(*) FROM articles;
    
  • a select of all articles published for a given user

    SELECT * FROM articles WHERE user_id = $1;
    
  • a select of the most recently published article for a given user

    SELECT * FROM articles WHERE user_id = $1 ORDER BY published_date DESC LIMIT 1;
    

I am finding that, with a large number of workers, these queries are quite slow. (At peak load, the first takes minutes to complete; the other two are on the order of 10 seconds.) In particular, it seems that queries are being enqueued.

The question

In the abstract, do tables with only fixed width values perform read queries better than those with varying widths? (Pretend disk space isn't an issue.) In my case, I'm wondering if I would see a performance improvement if I were to extract the 'body' text field to a separate table and transform the character varying fields into fixed width character fields.

I admit the question is a bit cargo cult-y. I simply don't know enough about the internals of the Postgres DB engine to construct an informed hypothesis. I do intend to perform real experiments with different schemas and configurations but I'd like to have a solid mental model of how Postgres actually works before I go much further.

Related question

Where can I learn more about the internals of the Postgres DB engine? I've Googled variations of the above question with little success. What are the correct terms to use for this search? Does this level of documentation exist only in source and the minds of Postgres DBAs? I also humbly invite the suggestion of good books on the topic.

Best Answer

Do tables with only fixed width values perform read queries better than those with varying widths?

Basically, no. There are very minor costs when accessing individual columns (Postgres can locate fixed-width columns at cached offsets, while any column after the first variable-width one has to be found by walking the preceding columns), but you won't be able to measure any difference in practice.

In particular:

The use of varchar(255) in a table definition typically indicates a lack of understanding of the Postgres type system. The architect behind it is most probably not a native speaker - or the layout has been carried over from another RDBMS like SQL Server where this used to matter.
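
To check which of your columns are actually fixed width, you can query the system catalogs: typlen is positive for fixed-width types and -1 for variable-length types such as text and character varying:

    SELECT a.attname, t.typname, t.typlen, t.typalign
    FROM   pg_attribute a
    JOIN   pg_type t ON t.oid = a.atttypid
    WHERE  a.attrelid = 'articles'::regclass
    AND    a.attnum > 0
    AND    NOT a.attisdropped
    ORDER  BY a.attnum;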

  • Your most expensive query, SELECT COUNT(*) FROM articles, does not read the row data at all; only the total table size matters, indirectly. Counting all rows is costly in Postgres due to its MVCC model. Maybe an estimate, which can be had very cheaply, is good enough? See the sketch after this list, and:
  • Fast way to discover the row count of a table
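
For example, a near-instant estimate can be read from the statistics the planner maintains. It is only as fresh as the last VACUUM or ANALYZE run, so treat it as approximate:

    SELECT reltuples::bigint AS estimated_rows
    FROM   pg_class
    WHERE  oid = 'articles'::regclass;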

(Pretend disk space isn't an issue.)

Disk space is always an issue, even if you have plenty. The size on disk (number of data pages that have to be read / processed / written) is one of the most important factors for performance.
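
For instance, the number of 8 kB data pages a relation occupies is visible in pg_class (relpages is maintained by VACUUM and ANALYZE, so it can lag slightly behind reality):

    SELECT relname, relpages,
           pg_size_pretty(pg_relation_size(oid)) AS on_disk_size
    FROM   pg_class
    WHERE  relname = 'articles';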

Where can I learn more about the internals of the Postgres DB engine?

The info page for the postgresql tag has the most important links to more information, including books, the Postgres Wiki, and the excellent manual. The latter is my personal favorite.

Your third query has issues

SELECT * FROM articles WHERE user_id = $1 ORDER BY published_date DESC LIMIT 1;

You sort with ORDER BY published_date DESC, but published_date can be NULL (there is no NOT NULL constraint). Postgres sorts NULL values first in descending order by default, so that's a loaded foot-gun: if NULL values exist, you get one of those rows back instead of the row with the latest actual published_date.

Either add a NOT NULL constraint (always do that for columns that can't be NULL), or make it ORDER BY published_date DESC NULLS LAST and adapt the index accordingly:

"articles_user_id_published_date_idx" btree (user_id, published_date DESC NULLS LAST)


Convert published_date to an actual date

Since published_date is always rounded, it's effectively just a date, which occupies 4 bytes instead of the 8 bytes of a timestamp. Best also move it up in the table definition to come before the two timestamp columns, so you don't lose 4 bytes to alignment padding:

...
body           | text
published_date | date   --     <---- here
created_at     | timestamp without time zone
updated_at     | timestamp without time zone
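
A minimal sketch of the type change itself; the USING clause assumes every value really is at midnight. (Postgres cannot move a column to a different position in place, so the reordering part requires recreating the table.)

    ALTER TABLE articles
        ALTER COLUMN published_date TYPE date
        USING published_date::date;

On a table this size the ALTER rewrites every row, so plan for a maintenance window.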

Smaller on-disk storage does make a difference for performance.

More importantly, your index on (user_id, published_date) would then occupy just 32 bytes per index entry instead of 40, because two 4-byte values pack together without alignment padding. And that would make a noticeable difference for performance.
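
To verify the effect, compare the index size before and after rebuilding it; pg_relation_size works on indexes as well as tables:

    SELECT pg_size_pretty(
             pg_relation_size('index_articles_on_user_id_and_published_date')
           ) AS index_size;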

Aside: this index is not relevant to the queries shown. Delete it unless it's used elsewhere:

"index_articles_on_published_date" btree (published_date)