Postgresql – Is C collation suitable for the Postgres database

postgresqlpostgresql-9.4

I am working with 500m rows (120GB) in Postgres 9.4, and I am trying to make SELECT statements as fast as possible. I am less concerned about the speed of INSERT and UPDATE, because my data only changes once a quarter and I'll probably use a backup database and switch over when that happens.

I'm working on a Debian Wheezy server, with 32GB of RAM and 320GB of solid-state. I have set maintenance_work_mem to 16GB while setting up the database, and shared_buffers to 12GB.

I have found that creating indexes is slow, and even with indexes, SELECT queries are still quite slow. To help this, I've been recommended to use the C collation for the database.

I've been reading through the collation documentation, but it doesn't answer all my questions. So here's what I want to do with my database:

  • Support fast SELECT * FROM mytable WHERE col1='020406AAA' queries.
  • Support fast SELECT * FROM mytable WHERE col1 LIKE '02%' queries.
  • Allow non-ASCII characters in some columns (e.g. Ménière's disease). This isn't absolutely critical, though – I could convert these to ASCII if it would really help for speed.
  • Add the PostGIS extensions to the database.

Are any of these contra-indications for using the C collation?

I'm expecting all the columns where I want to use LIKE queries to consist of digits and uppercase letters only.

The data is a UK dataset, in English, for UK-only users. So I'm not expecting to have to add internationalization.

Best Answer

From what you're describing, it doesn't sound like anything is contraindicated related to the C collation order. In fact, Robert Haas, has a great discussion on the subject here, which seems to come to the conclusion that if you don't need locale aware sorting, just don't use it: Perils of Collation Aware Comparisons

I did notice you are on Debian Wheezy. The stock kernel for Debian Wheezy is 3.2, which if you're using that, has abysmal performance on database workloads. I would investigate that along with the collation testing, if you are not happy with the performance of your machine.

The wheezy-backports kernel is 3.16, which does not have these issues. Discussion and performance tests are here: Benchmarking PostgreSQL with Different Kernel Versions

Hope that helps. =)