PostgreSQL – Why ENCODE on BYTEA Has No Performance Impact

byteadatabase-designdatatypesperformancepostgresqlpostgresql-performance

I am working on a scenario to store strings in hex format of fixed length 64 to database. Obviously the choices are BYTEA and CHAR(64).

Initial thought was to enforce a valid hex string storing it as BYTEA will be good idea, but impact I evaluated was the use of ENCODE on select queries.

I did some performance benchmarks, considering both tables have few million rows;

# A file query_with_char.sql
SELECT "key" FROM table_varchar;;

# A file query_with_binary.sql
SELECT ENCODE("key", 'hex') FROM table_binary;

pgbench -c 30 -T 120 -n -f ./query_with_binary.sql -f ./query_with_char.sql -P 5 -S my_db

SQL script 1: ./query_with_binary.sql
 - weight: 1 (targets 33.3% of total)
 - 236 transactions (34.6% of total, tps = 1.876072)
 - latency average = 8896.888 ms
 - latency stddev = 2548.701 ms
SQL script 2: ./query_with_varchar.sql
 - weight: 1 (targets 33.3% of total)
 - 225 transactions (33.0% of total, tps = 1.788628)
 - latency average = 7164.604 ms
 - latency stddev = 2209.866 ms

I am unable to understand why the performance of query ENCODE is faster compared to normal string. How is PostgreSQL able to do encoding on million rows faster than just fetching the string columns?

Can someone explain what might be wrong in the above test?

Best Answer

encode() is a very cheap function. I would not expect any measurable impact at all in your test.

The difference is almost certainly due to the much smaller storage size of bytea as compared to char(64). Consider:

SELECT pg_column_size('90b7525e84f64850c2efb407fae3f27190b7525e84f64850c2efb407fae3f271'::char(64)) AS size_char64
     , pg_column_size(decode(text '90b7525e84f64850c2efb407fae3f27190b7525e84f64850c2efb407fae3f271', 'hex')) AS size_bytea;

 size_char64 | size_bytea
-------------+------------
          68 |         36

The dominant factor for performance of simple SELECT queries is the number of data pages that have be read.

"Best" data type?

Obviously the choices are BYTEA and CHAR(64).

If your goal is to optimize performance, consider a third option:
2 uuid columns

To understand, first read:

And:

Then consider this demo (executed on pg 11, but true for all modern versions):

RAM size:

SELECT pg_column_size(t64)                   AS c_text
     , pg_column_size(t64::char(64))         AS c_char64
     , pg_column_size(decode(t64, 'hex'))    AS c_bytea
     , pg_column_size( left(t64, 32)::uuid)
     + pg_column_size(right(t64, 32)::uuid)  AS c_2x_uuid
FROM  (SELECT text '90b7525e84f64850c2efb407fae3f27190b7525e84f64850c2efb407fae3f271') t(t64);

 c_text | c_char64 | c_bytea | c_2x_uuid 
--------+----------+---------+-----------
     68 |       68 |      36 |        32

Disk size (compressed format):

CREATE TEMP TABLE c64 AS 
SELECT t64                    AS c_text
     , t64::char(64)          AS c_char64
     , decode(t64, 'hex')     AS c_bytea
     , left (t64, 32)::uuid   AS c_uuid1
     , right(t64, 32)::uuid   AS c_uuid2
FROM  (SELECT text '90b7525e84f64850c2efb407fae3f27190b7525e84f64850c2efb407fae3f271') t(t64);

SELECT pg_column_size(c_text)    AS c_text
     , pg_column_size(c_char64)  AS c_char64
     , pg_column_size(c_bytea)   AS c_bytea
     , pg_column_size(c_uuid1)
     + pg_column_size(c_uuid2)   AS c_2x_uuid
FROM   c64;

 c_text | c_char64 | c_bytea | c_2x_uuid 
--------+----------+---------+-----------
     65 |       65 |      33 |        32

db<>fiddle here

The seemingly minor difference between 33 and 32 bits can actually make a difference of 8 bytes, because several storage mechanisms require padding at multiples of 8 bytes.

Repeat your test with 2 UUIDs. I am confident it will come out on top.