Postgresql – Truncate text encoding safe

postgresql

What would be the most efficient way, within Postgres, to truncate a text field to a maximum number of bytes, in such a way that the encoding is preserved?
I.e. how to keep at most N bytes while keeping a valid UTF8 string representation.

E.g. assuming UTF8, if a field contains abc€, that's 4 characters in 6 bytes (€ takes 3 bytes). If I want to truncate this field to at most 4 bytes but keep a valid UTF8 string, I actually need to keep only the first 3 bytes.

I feel like I need the following, but instead of 'escape', I'd like to ignore the invalid bytes to only keep abc:

select encode(substring('abc€'::bytea, 0, 5), 'escape');
 encode  
---------
 abc\342

What's the best way to achieve this?

Best Answer

Use this function:

CREATE OR REPLACE FUNCTION get_prefix (
   string text,
   max_bytes bigint
) RETURNS text
   LANGUAGE sql STRICT AS
$$
-- Build every character prefix of the input (0 .. length($1) characters),
-- measure each one in bytes, and return the longest one that fits.
SELECT p
FROM (SELECT p.p, octet_length(p.p) AS len
      FROM generate_series(0, length($1)) AS len
         CROSS JOIN LATERAL substr($1, 1, len.len) AS p) AS q
WHERE len <= $2
ORDER BY len DESC
LIMIT 1
$$;

It works like this:

SELECT get_prefix('abc€', 4);
 get_prefix 
------------
 abc
(1 row)
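To cross-check results from client code, the same rule (keep the longest prefix that fits in N bytes and is still valid UTF-8) can be sketched in Python. This is only an illustration, not part of the answer above: the name truncate_utf8 is hypothetical, and it relies on the fact that cutting valid UTF-8 at an arbitrary byte offset can leave at most one incomplete sequence at the end.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Return the longest prefix of text whose UTF-8 encoding fits in max_bytes."""
    encoded = text.encode("utf-8")
    # Clamp negative limits to 0 so slicing never counts from the end.
    cut = encoded[:max(max_bytes, 0)]
    # A cut through a multi-byte character leaves an incomplete trailing
    # sequence; errors="ignore" drops those bytes instead of raising.
    return cut.decode("utf-8", errors="ignore")


print(truncate_utf8("abc€", 4))  # abc
```

Unlike get_prefix, this works on the encoded bytes directly, so it does not need to build every prefix of the string.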

Related Solutions

Postgresql – Is it ever a good idea to denormalize for integrity

You are getting at something that is missing from standard normalization discussions, namely the constraint dependencies. In general wider tables provide greater possibilities here than narrower tables. So my view is that the sorts of questions you are asking in fact highlight good reasons to denormalize. I would go with your first solution (the one you are leaning towards right now).

In my view good database design generally normalizes as far as possible but ensures that all columns necessary for proper data constraints are included. You can do some of this with composite foreign keys if you don't mind adding additional unique constraints on the parent tables. Leveraging data constraints is an important part of database design, and one should not sacrifice that for the sake of normalization that looks good in theory.
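As a rough illustration of the composite foreign key idea, here is a minimal sketch. The original question's schema is not shown, so the account and business_detail tables and the try_insert helper are hypothetical. It uses Python's built-in sqlite3 module rather than PostgreSQL so that it runs on its own. The constraint pattern itself is ordinary SQL, though the column types and the PRAGMA are SQLite-specific.

```python
import sqlite3

# Hypothetical schema: every table and column name here is made up.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
CREATE TABLE account (
    id   INTEGER PRIMARY KEY,
    kind TEXT NOT NULL,
    UNIQUE (id, kind)   -- extra unique constraint so (id, kind) can be referenced
);
CREATE TABLE business_detail (
    account_id   INTEGER PRIMARY KEY,
    -- the duplicated column exists only so the constraint can see it
    account_kind TEXT NOT NULL CHECK (account_kind = 'business'),
    FOREIGN KEY (account_id, account_kind) REFERENCES account (id, kind)
);
""")
conn.executemany("INSERT INTO account VALUES (?, ?)",
                 [(1, "business"), (2, "personal")])


def try_insert(account_id, kind):
    """Attempt to attach business details to an account; report the outcome."""
    try:
        conn.execute("INSERT INTO business_detail VALUES (?, ?)",
                     (account_id, kind))
        return "accepted"
    except sqlite3.IntegrityError:
        return "rejected"


print(try_insert(1, "business"))  # accepted
print(try_insert(2, "business"))  # rejected
```

The second insert fails because no account row has the pair (2, 'business'). Carrying the kind column into the child table is the denormalization, and it is what lets the database enforce the rule.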

Postgresql – Restoring SQL_ASCII dumps to a UTF8-encoded database

I suspect your database content may be in ISO 8859 or CP1252. If it were pure ASCII, you would not run into problems importing it. You may be able to determine the encoding by opening your dump with Python. The following Python 3 code tries encodings until one succeeds, which can help determine the file encoding.

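def read_with_detected_encoding(file_name, encodings=('utf8', 'cp1252')):
    # Try the strict encoding first: cp1252 accepts almost any byte sequence,
    # so putting it first would hide a file that is really UTF-8.
    for enc in encodings:
        print('Encoding', enc)
        try:
            with open(file_name, 'r', encoding=enc) as file:
                return file.read()
        except UnicodeDecodeError as e:
            print(e)  # this encoding failed, try the next one
    return None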

Python can also be used to access the database and audit the data. If necessary, it could also transcode the data to UTF-8 while copying it.
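A minimal sketch of the transcoding step, assuming the dump's real encoding has already been determined (cp1252 is used as the default here purely as an example). The function name transcode_to_utf8 and the file names in the usage comment are hypothetical.

```python
def transcode_to_utf8(src_path, dst_path, src_encoding="cp1252"):
    """Copy a text file, re-encoding it from src_encoding to UTF-8."""
    # newline="" keeps the original line endings untouched on both sides.
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        # Stream line by line so a large dump is never held in memory at once.
        for line in src:
            dst.write(line)


# Example call (hypothetical file names):
# transcode_to_utf8("dump.sql", "dump_utf8.sql", "cp1252")
```

A UnicodeDecodeError part-way through means the assumed source encoding is wrong for at least some of the data, which is worth knowing before restoring the dump.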
