PostgreSQL – Select Unique IDs Where Second Column is Unique

greatest-n-per-grouppostgresqlselect

Look at the following example starting from the top row (id=9) and work your way down, selecting a limit of 4 rows that have sec's that we have not yet seen. We "select" id=9 because we don't yet have sec=1. We continue to work our way down like this, but when we get to id=7 we skip it because we already have sec=5 (from row with id=8). We continue in the same manner, and we finally stop at id=3 because we have accumulated 4 rows (our desired limit).

 id | sec
----+-----
  9 |   1  <- 1
  8 |   5  <- 2
  7 |   5  # skip, already have sec=5
  6 |   4  <- 3
  5 |   1  # skip, already have sec=1
  4 |   1  # skip, already have sec=1
  3 |   3  <- 4
  2 |   2
  1 |   1

Of course the SQL algorithm can (will!) be different than I described.

Desired result:

 id
----
  9
  8
  6
  3
(4 rows)

If I wanted to increase the limit to 5 rows, then the row with id=2 would be included in the results. However, if I increased the limit to 6 rows, the row with id=1 would not be added because sec=1 has already been seen.

Note: Though it shouldn't matter, I am on PostgreSQL 9.3.1.

In case you want to quickly build the table to test this out:

CREATE TABLE my_table (id serial primary key, sec integer DEFAULT 0 NOT NULL);
INSERT INTO my_table (sec) VALUES
  (1)
, (2)
, (3)
, (1)
, (1)
, (4)
, (5)
, (5)
, (1);
CREATE INDEX index_my_table_on_sec ON my_table (sec);

Best Answer

In Postgres, this is simpler with DISTINCT ON:

SELECT *
FROM (
   SELECT DISTINCT ON (sec)
          id, sec
   FROM   tbl
   ORDER  BY sec, id DESC
   ) sub
ORDER  BY id DESC
LIMIT  4;

Detailed explanation in this related answer on SO:

Select first row in each GROUP BY group?

For a big table and small LIMIT, neither this nor @a_horse's solution are very efficient. The subquery will plough through the whole table, wasting a lot of time ...

Recursive CTE

I have tried and failed to solve similar problems with a recursive CTE in the past and resorted to a procedural solution with PL/pgSQL. Example:

Self-join unique pairs

Finally, here is a working rCTE:

WITH RECURSIVE cte AS (
   (  -- parentheses required
   SELECT id, '{}'::int[] AS last_arr, ARRAY[sec] AS arr
   FROM   tbl
   ORDER  BY id DESC
   LIMIT  1
   )
   UNION ALL
   (
   SELECT b.id, c.arr
        , CASE WHEN b.sec = ANY (c.arr) THEN c.arr ELSE b.sec  || c.arr END
   FROM   cte c
   JOIN   tbl b ON b.id < c.id
   WHERE  array_length(c.arr, 1) < 4
   ORDER  BY id DESC
   LIMIT  1
   )
   )
SELECT id, arr[1] AS sec
FROM   cte
WHERE  last_arr <> arr;

It's not as fast or elegant as I had hoped for and not nearly as fast as the function below, but faster than the query above in my tests.

PL/pgSQL function

Fastest by far:

CREATE OR REPLACE FUNCTION f_first_uniq(_rows int)
   RETURNS TABLE (id int, sec int) AS
$func$
DECLARE
   _arr int[];
BEGIN
   FOR id, sec IN
      SELECT t.id, t.sec FROM tbl t ORDER BY t.id DESC
   LOOP
      IF sec = ANY (_arr) THEN
         -- do nothing
      ELSE
         RETURN NEXT;
         _arr := _arr || sec;
         EXIT WHEN array_length(_arr, 1) >= _rows;
      END IF;
   END LOOP;
END
$func$  LANGUAGE plpgsql;

Call:

SELECT * FROM f_first_uniq(4);

SQL Fiddle demonstrating all three.

Could be made out to work for any table with table and column names as parameters and dynamic SQL with EXECUTE ...

Why bother?

In a test table with only 30k rows the function ran 2000x faster than the above query (which already ran ~ 30% faster than a_horse's version). This difference grows with the size of the table. Performance of the function is about constant, while the query's performance gets progressively worse, since it tries to find distinct values in all of the table first. Try this in a table with a million rows ...

Related Solutions

PostgreSQL – Return Unique Column Combinations Based on Where Clause

The execution plan shown does not seem to match the big SELECT DISTINCT query because the Sort and Unique steps are missing. Anyway you are correct than when retrieving ~50% of a table, index don't help. The best strategy is a big sequential scan of the main table and only fast hardware helps with that.

For the 2nd part of the question:

How would I go about selecting only the unique combinations of adjacent columns? Is this too complicated a task to perform through a database query? Would it speed up the query?

To remove duplicate combinations of adjacent columns, the structure of the resultset should be changed so that each output row has only one couple of adjacent columns along with their corresponding dimensions in the parallel coordinates graph. Well, except that the dimension for the 2nd column is not necessary since it's always the dimension for the other column plus one.

In one single query, this could be written like this:

WITH logs as (
  SELECT log_time_mapped, syslog_priority_mapped, 
     operation_mapped, message_code_mapped, protocol_mapped, 
     source_ip_mapped, destination_ip_mapped, 
     source_port_mapped, destination_port_mapped, 
     destination_service_mapped, direction_mapped, 
     connections_built_mapped, connections_torn_down_mapped, 
     hourofday_mapped, meridiem_mapped 
  FROM firewall_logs_mapped 
  WHERE operation = 'Built')
SELECT DISTINCT 1, log_time_mapped, syslog_priority_mapped FROM logs
UNION ALL
SELECT DISTINCT 2, syslog_priority_mapped, operation_mapped FROM logs
UNION ALL
SELECT DISTINCT 3, operation_mapped, message_code_mapped FROM logs
UNION ALL
...etc...
SELECT DISTINCT 14,  hourofday_mapped, meridiem_mapped FROM logs
;

The first SELECT DISTINCT subquery extracts the lines to draw between dimensions 1 and 2, the next subquery between dimensions 2 and 3, and so on. DISTINCT eliminates duplicates, so the client side doesn't have to do it. The UNION ALL concatenates the results without any further processing.

However it's a heavy query and it's dubious that it would be any faster than what you're already doing.

The WITH subquery is likely to gets slowly materialized on disk, so it might be interesting to compare the execution time with this other form repeating the same condition:

SELECT DISTINCT 1, log_time_mapped, syslog_priority_mapped
   FROM firewall_logs_mapped WHERE operation = 'Built'
UNION ALL
SELECT DISTINCT 2, syslog_priority_mapped, operation_mapped
   FROM firewall_logs_mapped WHERE operation = 'Built'
UNION ALL
SELECT DISTINCT 3, operation_mapped, message_code_mapped
   FROM firewall_logs_mapped WHERE operation = 'Built'
...etc...
;

Postgresql – How to maintain high INSERT-performance on PostgreSQL

There are a few things that might be causing this issue, but I can't be sure any of them are the real problem. The troubleshooting all involves turning on extra logging in the database, then seeing if the slow parts line up with messages there. Make sure you put a timestamp in the log_line_prefix setting to have useful logs to look at. See my tuning intro to get started here: https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server

Postgres does all of its writes to the operating system cache, then later they head to disk. You can watch that by turning on log_checkpoints and reading the messages. When things slow down, it may simply be that all the caches are now full, and all writes are stuck waiting for the slowest part of I/O. You might improve this by changing the Postgres checkpoint settings.

There's an internal issue with the database people hit sometimes where heavy inserts get stuck waiting for a lock in the database. Turn on log_lock_waits to see if you're hitting that one.

Sometimes the rate you can do burst inserts at is higher than you can sustain once the system autovacuum process kicks in. Turn on log_autovacuum to see if the problems are concurrent with when it's happening.

We know that large amount of memory in the database's private shared_buffers cache doesn't work as well on Windows as it does on other operating systems. There isn't as much visibility into what goes wrong when it happens either. I would not try to host something that's doing 1000+ inserts a second to a Windows PostgreSQL database. It's just not a good platform for really heavy writes yet.