To check what CLUSTER does, I took a table of mine from an earlier experiment which basically contained the first 10 million positive integers. I had already deleted some rows and there is another column as well, but these only affect the actual table size, so they are not that interesting.
First, having run VACUUM FULL on the table fka, I took its size:
\dt+ fka
List of relations
Schema | Name | Type | Owner | Size | Description
--------+------+-------+----------+--------+-------------
public | fka | table | test | 338 MB |
Then let's see the physical order of the data from the very beginning of the table:
SELECT *, ctid FROM fka ORDER BY ctid LIMIT 5;
id | col1 | ctid
-----+------+---------
2 | 2 | (0,1)
3 | 3 | (0,2)
4 | 4 | (0,3)
5 | 5 | (0,4)
6 | 6 | (0,5)
Now let's delete some rows:
DELETE FROM fka WHERE id % 10 = 5;
--DELETE 1000000
After this, the reported table size did not change. So let's see now what CLUSTER does:
CLUSTER fka USING fka_pkey;
SELECT *, ctid FROM fka ORDER BY ctid LIMIT 5;
id | col1 | ctid
-----+------+---------
2 | 2 | (0,1)
3 | 3 | (0,2)
4 | 4 | (0,3)
6 | 6 | (0,4)
7 | 7 | (0,5)
After the operation, the table size changed from 338 MB to 296 MB. From the ctid column, which describes the physical position of the tuple within its page, you can also see that there is no gap where the row matching id = 5 used to be.
As the tuples were reordered, the indexes had to be rebuilt so that they point to the correct locations.

So the difference appears to be that VACUUM FULL does not reorder the rows. As far as I know there is some difference in the mechanisms the two commands use, but from a practical point of view this seems to be the main (only?) difference.
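To see the contrast directly, one could repeat the experiment with VACUUM FULL instead of CLUSTER (session sketch; the exact sizes and row layout will depend on your data):

VACUUM FULL fka;
SELECT *, ctid FROM fka ORDER BY ctid LIMIT 5;

VACUUM FULL also rewrites the table and reclaims the space of deleted rows, but it keeps the tuples in their existing physical order instead of sorting them by an index.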
The encoding defines the very basic rules for how characters are represented in binary format (like @a_horse explains in his comment). It should be mentioned that the server encoding does not have to match the client encoding: Postgres can translate between them when necessary, and there is a dedicated setting, client_encoding, for this.
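A quick way to inspect and change this in a psql session (standard commands; LATIN1 is just an example target):

SHOW server_encoding;
SHOW client_encoding;
SET client_encoding TO 'LATIN1';

After the SET, the server converts outgoing text to LATIN1 and raises an error for any character that has no representation in it.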
The locale is a collection of language- and culture-specific settings, which PostgreSQL splits up into:
LC_COLLATE
LC_CTYPE
LC_MESSAGES
LC_MONETARY
LC_NUMERIC
LC_TIME
The settings of particular interest for you are LC_COLLATE (defines how strings are sorted) and LC_CTYPE (defines character classification, i.e. which characters count as letters, digits, and so on).
In older versions, these two settings could not be changed after a database had been initialized. Since Postgres 9.1 you can at least override the collation setting when needed.
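For example, a per-expression override with the COLLATE clause (table and column names here are hypothetical):

SELECT name FROM customers ORDER BY name COLLATE "C";

This sorts by byte value for this one query, regardless of the database's default LC_COLLATE.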
Best Answer
The PostgreSQL documentation leaves a lot to be desired (just sayin').
To start with, there is only one encoding for a particular database, so C and C.UTF-8 in your UTF-8 database are both using the UTF-8 encoding.

For libc collations: typically collation names, by convention, are truly two-part names of the following structure:

{locale_name}.{encoding_name}
A "locale" (i.e. "culture") is the set of language-specific rules for sorting (LC_COLLATE) and capitalization (LC_CTYPE). Even though there is sometimes overlap, this really doesn't have anything to do with how the data is stored.

An "encoding" is how the data is stored (i.e. what byte sequence equates to which character). Even though there is sometimes overlap, this really doesn't have anything to do with the sorting and capitalization rules of any particular language that uses the encoding (some encodings can be used by multiple languages that have quite different rules in one or both of those areas).
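Locale and encoding come together when a database is created; a sketch (the database name and the platform-specific locale spellings are illustrative):

CREATE DATABASE korean_db
    ENCODING 'UTF8'
    LC_COLLATE 'ko_KR.utf8'
    LC_CTYPE 'ko_KR.utf8'
    TEMPLATE template0;

TEMPLATE template0 is required here because a locale or encoding differing from the template's cannot be combined with the default template1.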
To illustrate, consider storing Korean data: ko_KR is the locale, while the encoding could be any of:

EUC_KR (Extended UNIX Code-KR)
JOHAB
UHC (Unified Hangul Code / Windows949)
UTF8 (Unicode's 8-bit encoding)

Also consider the "Collation Support: libc collations" documentation, which explains that a collation name without an encoding suffix is taken to imply the encoding of the current database.
Meaning, in a database that uses the UTF-8 encoding, en_US and en_US.UTF8 are equivalent. BUT, between that database and a database that uses the LATIN1 encoding, the en_US collations are not equivalent.

So, does this mean that C and C.UTF-8 are the same?

NO, that would be too easy!!! The C collation is an exception to the above-stated behavior. The C collation is a simple set of rules that is available regardless of the database's encoding, and its behavior should be consistent across encodings (which is made possible by only recognizing the US English alphabet, "a-z" and "A-Z", as "letters", and by sorting by byte value, which should be the same for the encodings available to you).

The C.UTF-8 collation is actually a slightly enhanced set of rules, as compared to the base C rules. This difference can actually be seen in pg_collation, since the values of the collcollate and collctype columns differ between the rows for C and C.UTF-8.

I put together a set of test queries to illustrate some of the similarities and differences between these two collations, as well as compared to en_GB (and implicitly en_GB.utf8). I started with the queries provided in Daniel Vérité's answer, enhanced them to hopefully be clearer about what is and is not being shown, and added a few queries. The results show us that:

C and C.UTF-8 are actually different sets of rules, even if only slightly different, based on their respective values in the collcollate and collctype columns in pg_collation (final query)
C.UTF-8 expands the set of characters that are considered "letters"
C.UTF-8, unlike C (but like en_GB), recognizes invalid Unicode code points (i.e. U+0378) and sorts them towards the top
C.UTF-8, like C (but unlike en_GB), sorts non-US-English-letter characters by code point
ucs_basic appears to be equivalent to C (which is stated in the documentation)

You can find, and execute, the queries on: db<>fiddle
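The pg_collation comparison mentioned above can be reproduced with a catalog query along these lines (the exact set of collation names available depends on the platform):

SELECT collname, collcollate, collctype
FROM pg_collation
WHERE collname IN ('C', 'C.UTF-8', 'ucs_basic');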