PostgreSQL Longest Record – How to Find the Longest Record for Each Column in PostgreSQL

postgispostgresqltable

I couldn't find the answer to this on here so sorry if repeated. I'm looking to query a table to find the longest record in characters for each column in a specific table.

I have a table which is built of fields which are type: varchar and length:254

In certain cases, I know for a fact that the longest record is a single character so the problem is when exporting to shapefiles the .dbf file is over 1gb when not needed.

If someone can also help with a script to reduce the lengths of each column as well that would be great!

Best Answer

The below worked for what I needed:

SELECT
        MAX(char_length(column1)) as column1,
        MAX(char_length(column2)) as column2
FROM
        schema.table

"Simple" Solution

SELECT DISTINCT ON (1)
       n.number, p.code
FROM   num n
JOIN   prefix p ON right(n.number, -1) LIKE (p.code || '%')
ORDER  BY n.number, p.code DESC;

Key elements:

DISTINCT ON is a Postgres extension of the SQL standard DISTINCT. Find a detailed explanation for the used query technique in this related answer on SO.
ORDER BY p.code DESC picks the longest match, because '1234' sorts after '123' (in ascending order).

Simple SQL Fiddle.

Without index, the query would run for a very long time (didn't wait to see it finish). To make this fast, you need index support. The trigram indexes you mentioned, supplied by the additional module pg_trgm are a good candidate. You have to choose between GIN and GiST index. The first character of the numbers is just noise and can be excluded from the index, making it a functional index in addition.
In my tests, a functional trigram GIN index won the race over a trigram GiST index (as expected):

CREATE INDEX num_trgm_gin_idx ON num USING gin (right(number, -1) gin_trgm_ops);

Advanced dbfiddle here.

All test results are from a local Postgres 9.1 test installation with a reduced setup: 17k numbers and 2k codes:

Total runtime: 1719.552 ms (trigram GiST)
Total runtime: 912.329 ms (trigram GIN)

Much faster yet

Failed attempt with `text_pattern_ops`

Once we ignore the distracting first noise character, it comes down to basic left anchored pattern match. Therefore I tried a functional B-tree index with the operator class text_pattern_ops (assuming column type text).

CREATE INDEX num_text_pattern_idx ON num(right(number, -1) text_pattern_ops);

This works excellently for direct queries with a single search term and makes the trigram index look bad in comparison:

SELECT * FROM num WHERE right(number, -1) LIKE '2345%'

Total runtime: 3.816 ms (trgm_gin_idx)
Total runtime: 0.147 ms (text_pattern_idx)

However, the query planner will not consider this index for joining two tables. I have seen this limitation before. I don't have a meaningful explanation for this, yet.

Partial / functional B-tree indexes

The alternative it to use equality checks on partial strings with partial indexes. This can be used in a JOIN.

Since we typically only have a limited number of different lengths for prefixes, we can build a solution similar to the one presented here with partial indexes.

Say, we have prefixes ranging from 1 to 5 characters. Create a number of partial functional indexes, one for every distinct prefix length:

CREATE INDEX prefix_code_idx5 ON prefix(code) WHERE length(code) = 5;
CREATE INDEX prefix_code_idx4 ON prefix(code) WHERE length(code) = 4;
CREATE INDEX prefix_code_idx3 ON prefix(code) WHERE length(code) = 3;
CREATE INDEX prefix_code_idx2 ON prefix(code) WHERE length(code) = 2;
CREATE INDEX prefix_code_idx1 ON prefix(code) WHERE length(code) = 1;

Since these are partial indexes, all of them together are barely larger than a single complete index.

Add matching indexes for numbers (taking the leading noise character into account):

CREATE INDEX num_number_idx5 ON num(substring(number, 2, 5)) WHERE length(number) >= 6;
CREATE INDEX num_number_idx4 ON num(substring(number, 2, 4)) WHERE length(number) >= 5;
CREATE INDEX num_number_idx3 ON num(substring(number, 2, 3)) WHERE length(number) >= 4;
CREATE INDEX num_number_idx2 ON num(substring(number, 2, 2)) WHERE length(number) >= 3;
CREATE INDEX num_number_idx1 ON num(substring(number, 2, 1)) WHERE length(number) >= 2;

While these indexes only hold a substring each and are partial, each covers most or all of the table. So they are much larger together than a single total index - except for long numbers. And they impose more work for write operations. That's the cost for amazing speed.

If that cost is too high for you (write performance is important / too many write operations / disk space an issue), you can skip these indexes. The rest is still faster, if not quite as fast as it could be ...

If numbers are never shorter then n characters, drop redundant WHERE clauses from some or all, and also drop the corresponding WHERE clause from all following queries.

Recursive CTE

With all the setup so far I was hoping for very elegant solution with a recursive CTE:

WITH RECURSIVE cte AS (
   SELECT n.number, p.code, 4 AS len
   FROM   num n
   LEFT    JOIN prefix p
            ON  substring(number, 2, 5) = p.code
            AND length(n.number) >= 6  -- incl. noise character
            AND length(p.code) = 5

   UNION ALL 
   SELECT c.number, p.code, len - 1
   FROM    cte c
   LEFT   JOIN prefix p
            ON  substring(number, 2, c.len) = p.code
            AND length(c.number) >= c.len+1  -- incl. noise character
            AND length(p.code) = c.len
   WHERE    c.len > 0
   AND    c.code IS NULL
   )
SELECT number, code
FROM   cte
WHERE  code IS NOT NULL;

Total runtime: 1045.115 ms

However, while this query isn't bad - it performs about as good as the simple version with a trigram GIN index - it doesn't deliver what I was aiming for. The recursive term is planned once only, so it can't use the best indexes. Only the non-recursive term can.

UNION ALL

Since we are dealing with a small number of recursions we can just spell them out iteratively. This allows optimized plans for each of them. (We lose the recursive exclusion of already successful numbers, though. So there is still some room for improvement, especially for a wider range of prefix lengths)):

SELECT DISTINCT ON (1) number, code
FROM  (
   SELECT n.number, p.code
   FROM   num n
   JOIN   prefix p
            ON  substring(number, 2, 5) = p.code
            AND length(n.number) >= 6  -- incl. noise character
            AND length(p.code) = 5
   UNION ALL 
   SELECT n.number, p.code
   FROM   num n
   JOIN   prefix p
            ON  substring(number, 2, 4) = p.code
            AND length(n.number) >= 5
            AND length(p.code) = 4
   UNION ALL 
   SELECT n.number, p.code
   FROM   num n
   JOIN   prefix p
            ON  substring(number, 2, 3) = p.code
            AND length(n.number) >= 4
            AND length(p.code) = 3
   UNION ALL 
   SELECT n.number, p.code
   FROM   num n
   JOIN   prefix p
            ON  substring(number, 2, 2) = p.code
            AND length(n.number) >= 3
            AND length(p.code) = 2
   UNION ALL 
   SELECT n.number, p.code
   FROM   num n
   JOIN   prefix p
            ON  substring(number, 2, 1) = p.code
            AND length(n.number) >= 2
            AND length(p.code) = 1
   ) x
ORDER BY number, code DESC;

Total runtime: 57.578 ms (!!)

A breakthrough, finally!

SQL function

Wrapping this into an SQL function removes the query planning overhead for repeated use:

CREATE OR REPLACE FUNCTION f_longest_prefix()
  RETURNS TABLE (number text, code text) LANGUAGE sql AS
$func$
SELECT DISTINCT ON (1) number, code
FROM  (
   SELECT n.number, p.code
   FROM   num n
   JOIN   prefix p
            ON  substring(number, 2, 5) = p.code
            AND length(n.number) >= 6  -- incl. noise character
            AND length(p.code) = 5
   UNION ALL 
   SELECT n.number, p.code
   FROM   num n
   JOIN   prefix p
            ON  substring(number, 2, 4) = p.code
            AND length(n.number) >= 5
            AND length(p.code) = 4
   UNION ALL 
   SELECT n.number, p.code
   FROM   num n
   JOIN   prefix p
            ON  substring(number, 2, 3) = p.code
            AND length(n.number) >= 4
            AND length(p.code) = 3
   UNION ALL 
   SELECT n.number, p.code
   FROM   num n
   JOIN   prefix p
            ON  substring(number, 2, 2) = p.code
            AND length(n.number) >= 3
            AND length(p.code) = 2
   UNION ALL 
   SELECT n.number, p.code
   FROM   num n
   JOIN   prefix p
            ON  substring(number, 2, 1) = p.code
            AND length(n.number) >= 2
            AND length(p.code) = 1
   ) x
ORDER BY number, code DESC
$func$;

Call:

SELECT * FROM f_longest_prefix_sql();

Total runtime: 17.138 ms (!!!)

PL/pgSQL function with dynamic SQL

This plpgsql function is much like the recursive CTE above, but the dynamic SQL with EXECUTE forces the query to be re-planned for every iteration. Now it makes use of all the tailored indexes.

Additionally this works for any range of prefix lengths. The function takes two parameters for the range, but I prepared it with DEFAULT values, so it works without explicit parameters, too:

CREATE OR REPLACE FUNCTION f_longest_prefix2(_min int = 1, _max int = 5)
  RETURNS TABLE (number text, code text) LANGUAGE plpgsql AS
$func$
BEGIN
FOR i IN REVERSE _max .. _min LOOP  -- longer matches first
   RETURN QUERY EXECUTE '
   SELECT n.number, p.code
   FROM   num n
   JOIN   prefix p
            ON  substring(n.number, 2, $1) = p.code
            AND length(n.number) >= $1+1  -- incl. noise character
            AND length(p.code) = $1'
   USING i;
END LOOP;
END
$func$;

The final step cannot be wrapped into the one function easily. Either just call it like this:

SELECT DISTINCT ON (1)
       number, code
FROM   f_longest_prefix_prefix2() x
ORDER  BY number, code DESC;

Total runtime: 27.413 ms

Or use another SQL function as wrapper:

CREATE OR REPLACE FUNCTION f_longest_prefix3(_min int = 1, _max int = 5)
  RETURNS TABLE (number text, code text) LANGUAGE sql AS
$func$
SELECT DISTINCT ON (1)
       number, code
FROM   f_longest_prefix_prefix2($1, $2) x
ORDER  BY number, code DESC
$func$;

Call:

SELECT * FROM f_longest_prefix3();

Total runtime: 37.622 ms

A bit slower due to required planning overhead. But more versatile than SQL and shorter for longer prefixes.

Indexing a Large Static Dataset in PostgreSQL

Query cleanup

First, lets rewrite your query to be readable, by using table aliases, qualifying field names, and using an ANSI join:

select t.userID, t.date, t.time, t.servID, t.timestamp, 
       l.servID_HEX, l.SERV_LOCY, l.SERV_LOCX 
from test t
     inner join locations l on (t.servID=l.servID_HEX)
where t.userID='<someusers>' 
  and extract(dow from t.timestamp)=2;

What the query needs

To start breaking this problem down you'll want to do is look at the fields used and where they're used:

Tables: locations, test

Fields output: test.userID, test.date, test.time, test.servID, test.timestamp, locations.servID_HEX, locations.SERV_LOCY, locations.SERV_LOCX

Terms used in filter predicates: test.servID, locations.servID_HEX, test.userID, extract(dow from test.timestamp)=2

Indexes

Now, the most important thing is to make sure that you have indexes that target high-selectivity columns used in predicates.

"high selectivity" just means that there aren't tons of the same value - there's no point indexing a column if 50% of the time it's 'a' and 50% of the time it's 'b', because you just don't gain much benefit from the index.

Assuming that all these columns have lots of well distributed distinct values, so they're highly selective and don't have any values that're massively common, you'll want to create an ordinary b-tree index. That's enough for locations because there's just one value used in a predicate:

CREATE INDEX locations_servid_idx ON locations(servID_HEX);

For test it's more complex.

You could create indexes separately for each column, but it's more efficient to create a single composite index that has all the columns you're using in this query's predicates for a given table. So you want servId, userID and timestamp. However, you don't use timestamp directly - you use it to get the day of week, and an index on timestamp can't be used to look up extract(dow from timestamp).

This is where expression indexes come in.

Indexes don't just cover columns. They can also cover arbitrary expressions. In this case, we'll create an index that includes extract(dow from timestamp).

So the index for test will look like:

CREATE INDEX test_custom_idx ON test (
    servID,
    extract(dow from timestamp),
    userID
);

As you'll be doing different kinds of timestamp based queries I'd consider creating multiple indexes for different timestamp expressions.

Which order to put the columns in depends on how selective each column is. Even though I'm sure the day-of-week will be more selective I've put the server ID first because it's used in a join condition that lets us completely disregard rows of the other table.

Since you didn't provide sample tables or sample data I can't really test it and I'm not keen on dummying some up. Adjustments may be required.

Partial indexes

Partial indexes can also be a big win, where the index only contains data for a subset of the data, e.g.:

CREATE INDEX test_custom_idx ON test (
    servID, userID
) WHERE (extract(dow from timestamp) = 2);

contains only data for that day, so it's a lot faster to search for queries involving just that day, but cannot be used at all for queries that might involve different days, or for which the day is determined at runtime from the output of another part of the query.

Index-only scans and covering indexes

Another thing you can consider is that all those unused columns still have to be read from disk to fetch your rows. In PostgreSQL 9.2 and newer there's a way to avoid that using an index-only scan if all columns you are fetching are in an index.

So you could create a covering index, where you include columns that aren't really required for the actual search but are there so they can be output without reading from the main table. The first thing to do for that is to get rid of useless values in your SELECT list, so we'll delete the redundant t.date and t.time from your query. Then create two new indexes:

CREATE INDEX locations_covering_idx 
ON locations(
    servID_HEX,
    SERV_LOCY,
    SERV_LOCX
);

CREATE INDEX test_covering_idx ON test (
    servID,
    extract(dow from timestamp),
    userID,
    timestamp
);

Reducing repetition

Since your dataset is static, having tons and tons of indexes isn't a big problem.

Each index has a cost for insert/update/delete. So for non-static data you want to minimize that by getting rid of indexes that don't provide enough improvement to be worth the cost.

One way to do that can be to split up composite indexes. In your case for example I might take the timestamp stuff out of the main index on location and put it in a bunch of new indexes instead. That way all the different timestamp expression indexes can be updated without rewriting a whole lot of unrelated data for userid and serverid as well. PostgreSQL can still often use the indexes together, combining them with a bitmap index scan.

Not a concern for your current dataset, but worth learning for later.

Data cleanup

This is a static data set. So if you don't need a column, drop it. Or better yet, create a new table with just the subset of data you need, pre-joined, using something like:

CREATE TABLE reporting AS
select t.userID, t.servID, t.timestamp, l.SERV_LOCY, l.SERV_LOCX 
from test t
     inner join locations l on (t.servID=l.servID_HEX)
ORDER BY t.userID, t.servID, t.timestamp;

which creates a new table that has the data pre-joined and sorted for the most useful grouping of access.

You can then write much simpler queries that'll be much, much faster, against the new table, and your indexes will be a lot simpler too.