PostgreSQL – How to Shorten a Multi-Word String to Few Words?

arraypostgresqlstring

I'm using PostgreSQL 9.5.6. I have space separated multi-word strings that I need to shorten to say 3 words?

I've looked at the documentation and in order to use the substring() functions I need to know the index position of the char where I want to start extraction, but the strings I'm working with are varying length.

I've also looked at split_part() and that only returns one word.

How can I get something like:

hello everyone out there somewhere

shortened to

hello everyone out

Best Answer

Turn the sentence into an array, then pick the first three elements and concatenate them back to a single string:

select array_to_string ((regexp_split_to_array(title, '\s+'))[1:3], ' ')
from book;

alternatively, just use split_part() three times:

select split_part(title, ' ', 1)||
       split_part(title, ' ', 2)||  
       split_part(title, ' ', 3)
from book;

split_part() has the disadvantage, that you can't properly deal with multiple spaces between the words.

Optimizing storage

If you have more than a hand full variables but less than 33, an integer column may serve you best. (Or a bigint for up to 64 variables.)

Occupies 4 bytes on disk.
Very fast indexing for exact matches (= operator).
Handling individual values may be slower / less convenient than with bit string or boolean.

With even more variables, or if you want to manipulate the values a lot, or if you don't have huge tables and disk space / RAM is no issue, or if you are not sure what to pick, I would consider bit(n) or bit varying(n).

Occupies at least 5 bytes (or 8 for very long strings) plus 1 byte for each group of 8 bits (rounded up).
You can use bit string functions and operators directly.

Examples

For just 3 bits of information, individual boolean columns get by with 3 bytes, an integer needs 4 bytes and a bit string 6 bytes (5 + 1).

For 32 bits of information, an integer still needs 4 bytes, a bit string occupies 9 bytes for the same (5 + 4) and boolean columns occupy 32 bytes.

The file names are 24 hex digits long, broken into 3 fields of 8 digits each. The first is the timeline, which you probably don't need to worry about here. The other two are called the "segment" and the "file". They are really a single counter, the "file" rolls over by incrementing the "segment" by one and going back to zero.

Oddly, only the last two digits of the "file" chunk are ever non-zero. (This is apparently so that the offset into a logical 'segment' fits in a 32 bit uint, which was important at some point in the distant past)

In 9.1 and 9.2, the "file" skips FF. So from FE, it rolls back over to 00 while incrementing the segment number. In 9.3, it switches so that FF is used. So the sequence for your version would go:

0000000100000000000000FD
0000000100000000000000FE
000000010000000100000000
000000010000000100000001
...
00000001000000BE000000F9
00000001000000BE000000FA
00000001000000BE000000FB
00000001000000BE000000FC
00000001000000BE000000FD
00000001000000BE000000FE
00000001000000BF00000000
00000001000000BF00000001
00000001000000BF00000002
00000001000000BF00000003
00000001000000BF00000004

Getting the files from pg_xlog is tricky, because those files get recycled (i.e. renamed to have the name of some future file-name, then overwritten once it gets around to needing that file). This means that the timestamps on the files are all screwy. A recycled log file will have the name of some future file, but the contents and timestamp of some past file.

If a file was recycled before you had a chance to copy it, then your backup has failed and you need to try again with a higher wal_keep_segments.

And of course if you copy the "current" file before it was finalized, it will be missing the changes that occurred after the point you copied it.

It is much easier to get them from the wal archive, because that doesn't have the recycling problem and because they don't show up at all until they are finalized.

Best Answer

Related Solutions

Postgresql – Should I use the PostgreSQL bit string

Optimizing storage

Examples

Further reading

Postgresql – How to identify all WALs created during start_backup and stop_backup

Related Question