I have a table articles:
Table "articles"
     Column     |            Type             |                       Modifiers                        | Storage  | Stats target | Description
----------------+-----------------------------+--------------------------------------------------------+----------+--------------+-------------
 id             | integer                     | not null default nextval('articles_id_seq'::regclass)  | plain    |              |
 user_id        | integer                     |                                                        | plain    |              |
 title          | character varying(255)      |                                                        | extended |              |
 author         | character varying(255)      |                                                        | extended |              |
 body           | text                        | default '--- []                                       +| extended |              |
                |                             | '::text                                                |          |              |
 created_at     | timestamp without time zone |                                                        | plain    |              |
 updated_at     | timestamp without time zone |                                                        | plain    |              |
 published_date | timestamp without time zone |                                                        | plain    |              |
Indexes:
"articles_pkey" PRIMARY KEY, btree (id)
"index_articles_on_published_date" btree (published_date)
"index_rents_on_user_id" btree (user_id)
"index_articles_on_user_id_and_published_date" btree (user_id, published_date)
We're on Postgres 9.4.4. The machine has 3.5 GB of memory and 150 GB of disk space on an SSD.
Note: The 'published_date' is always rounded, by the application, to the nearest date. All hours/minutes/seconds are always 00. Legacy. Needs fixing. Etc.
This table has hundreds of millions of articles. The table receives a great deal of read queries from (as many as 16) concurrent processes performing the following queries as quickly as our system will respond:
- a count of the total number of articles:

  SELECT COUNT(*) FROM articles;

- a select of all articles published for a given user:

  SELECT * FROM articles WHERE user_id = $1;

- a select of the most recently published article for a given user:

  SELECT * FROM articles WHERE user_id = $1 ORDER BY published_date DESC LIMIT 1;
I am finding that, with a large number of workers, these queries are quite slow. (At peak load, the first takes minutes to complete; the other two are on the order of 10 seconds.) In particular, it seems that queries are being enqueued.
The question
In the abstract, do tables with only fixed width values perform read queries better than those with varying widths? (Pretend disk space isn't an issue.) In my case, I'm wondering if I would see a performance improvement if I were to extract the 'body' text field to a separate table and transform the character varying fields into fixed width character fields.
I admit the question is a bit cargo cult-y. I simply don't know enough about the internals of the Postgres DB engine to construct an informed hypothesis. I do intend to perform real experiments with different schemas and configurations but I'd like to have a solid mental model of how Postgres actually works before I go much further.
Related question
Where can I learn more about the internals of the Postgres DB engine? I've Googled variations of the above question with little success. What are the correct terms to use for this search? Does this level of documentation exist only in source and the minds of Postgres DBAs? I also humbly invite the suggestion of good books on the topic.
Best Answer
Basically, no. There are very minor costs when accessing columns, but you won't be able to measure any difference. Details:

In particular:

- There is no difference between character varying(255) and text at all. You seem to be under the impression that varchar(255) (unlike text) might be a "fixed-width" type, but that is not so. Both are variable-length types; varchar(255) just adds a maximum length check. The use of varchar(255) in a table definition typically indicates a lack of understanding of the Postgres type system. The architect behind it is most probably not a native speaker - or the layout has been carried over from another RDBMS like SQL Server, where this used to matter.

- SELECT COUNT(*) FROM articles does not consider row data at all; only the total size matters, and only indirectly. Counting all rows is costly in Postgres due to its MVCC model. Maybe an estimate is good enough, which can be had very cheaply?

- Disk space is always an issue, even if you have plenty. The size on disk (the number of data pages that have to be read / processed / written) is one of the most important factors for performance.
The info page for the tag postgres has the most important links to more information, including books, the Postgres Wiki and the excellent manual. The latter is my personal favorite.
Your third query has issues

It sorts with ORDER BY published_date DESC, but published_date can be NULL (there is no NOT NULL constraint). That's a loaded foot-gun if there can be NULL values, unless you prefer NULL values over the latest actual published_date.

Either add a NOT NULL constraint (always do that for columns that can't be NULL), or make it ORDER BY published_date DESC NULLS LAST and adapt the index accordingly. Details in this recent, related answer.
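The NULLS LAST variant with a matching index could look like this (a sketch; the index name is hypothetical, and the index sort order must match the query's ORDER BY for Postgres to read the top row straight from the index):

```sql
-- Index sorted exactly like the query, so the newest article per user
-- can be fetched with a single index lookup. (Hypothetical name.)
CREATE INDEX articles_user_id_published_date_idx
    ON articles (user_id, published_date DESC NULLS LAST);

SELECT *
FROM   articles
WHERE  user_id = $1
ORDER  BY published_date DESC NULLS LAST
LIMIT  1;
```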
Convert published_date to an actual date

While 'published_date' is always rounded, it's effectively just a date, which occupies 4 bytes instead of 8 for the timestamp. Best move it up in the table definition to come before the two timestamp columns, so you don't lose the 4 bytes to alignment padding. Smaller on-disk storage does make a difference for performance.
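The type change itself is a one-statement sketch (note: Postgres cannot reorder existing columns in place, so actually moving published_date ahead of the timestamp columns would mean recreating the table):

```sql
-- Shrink the column from 8-byte timestamp to 4-byte date.
-- This takes an exclusive lock and rewrites the whole table,
-- so schedule it for a maintenance window on a table this size.
ALTER TABLE articles
    ALTER COLUMN published_date TYPE date
    USING published_date::date;
```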
More importantly, your index on (user_id, published_date) would then occupy just 32 bytes per index entry instead of 40, because 2 x 4 bytes incur no extra alignment padding. And that would make a noticeable difference for performance.

Aside: the following index is not relevant to the demonstrated queries. Delete it unless it is used elsewhere:
"index_articles_on_published_date" btree (published_date)
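Dropping it is a one-liner (a sketch; first verify, e.g. via the idx_scan counter in pg_stat_user_indexes, that nothing else actually uses it):

```sql
-- Remove the unused single-column index: saves disk space and
-- eliminates its maintenance cost on every INSERT / UPDATE.
DROP INDEX index_articles_on_published_date;
```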