Postgresql – Truncate text encoding safe

postgresql

What would be the most efficient way, within Postgres, to truncate a text field to a maximum number of bytes, in such a way that the encoding is preserved?
I.e. how to keep at most N bytes while keeping a valid UTF8 string representation.

E.g. assuming UTF8, if a field contains abc€, that's 4 characters for 5 bytes, if I want to truncate this field to at most 4 bytes, but keep a valid UTF8 string, I'd need to actually keep only the first 3 bytes.

I feel like I need the following, but instead of 'escape', I'd like to ignore the invalid bytes to only keep abc:

select encode(substring('abc€'::bytea, 0, 5), 'escape');
 encode  
---------
 abc\342

What's the best way to achieve this?

Best Answer

Use this function:

CREATE OR REPLACE FUNCTION get_prefix (
   string text,
   max_bytes bigint
) RETURNS text
   LANGUAGE sql STRICT AS
$$SELECT p
FROM (SELECT p.p, octet_length(p.p) AS len
      FROM generate_series(0, length($1)) AS len
         CROSS JOIN LATERAL substr($1, 1, len.len) AS p) AS q
WHERE len <= $2
ORDER BY len DESC
LIMIT 1$$;

It works like this:

SELECT get_prefix('abc€', 4);
 get_prefix 
------------
 abc
(1 row)