Postgresql – Are ORDER BY clause & ROW_NUMBER duplicating functionality when I need top n queries

greatest-n-per-grouppostgresqlwindow functions

I have a table that tracks violations for a student. I want to count the number of violations and select top 2 violators from each class. The query would look like this

SELECT *
FROM
  ( SELECT "people"."id", "name", "class",
    ROW_NUMBER() OVER (PARTITION BY "class" ORDER BY COUNT("violation") DESC) AS "v"
    FROM "people"
    INNER JOIN "discipline" on ("discipline"."people_id" = "people"."id")
    GROUP BY "people"."id", "name", "class"
  ) AS "v_table"
WHERE v < 3

The query seems inefficient because it sorts the count first with ORDER BY and then assigns a ROW_NUMBER. If I already have it sorted over a partition, how can I skip the ROW_NUMBER assignment & get the top 2.

Update :
Adding data tables & SQL fiddle (actually violation column is redundant, simply an entry in discipline table means that there's been a violation. If I remove it, should the Count be done on "discipline"."people_id"? like this )

Discipline                     People       
-------------------------      --------------
id  people_id  violation       id  name  class                    
1   1          True            1   Rob   A                 
2   1          True            2   Jen   B                 
3   2          True            3   Tom   C                
4   3          True            4   Ted   A                 
5   4          True            5   Tim   A            
6   1          True            ...                               
7   4          True         
...

Using PostgreSQL 9.3

Best Answer

First, to answer the questions implied in the comments, that the assignment of row numbers with the ROW_NUMBER() aggregate seems inefficient because we already have the COUNT(violation) numbers:

This is needed because the COUNT numbers can be different for each partition (class). Since we want the 2 top numbers (for each class), we can't find a useful condition for that. With the row numbers, we can use the WHERE v < 3 which gives us the top 2.

In version 9.3, the LATERAL joins were added in Postgres, which are similar to the CROSS and OUTER APPLY of SQL-Server. With this new kind of join, you can write a query that uses the COUNT numbers and a TOP 2 for each partition. Whether it is more or less efficient, you can test:

WITH classes AS
  ( SELECT DISTINCT class
    FROM people
  ) 
                             -- if you have a "classes" table, skip the above lines
SELECT 
    v.id, v.name, c.class,
    v.violations
FROM
    classes AS c,
  LATERAL 
    ( SELECT p.id, p.name,
             COUNT(d.violation) AS violations
      FROM people AS p
        INNER JOIN discipline AS d
          ON d.people_id = p.id
      WHERE p.class = c.class 
      GROUP BY p.id, p.name
      ORDER BY violations DESC
      LIMIT 2
  ) AS v
ORDER BY
    c.class, v.violations DESC ;

Tested at SQL-Fiddle

Selectivity of predicates and combinations thereof

Your most important problem seems to be here:

(cost=0.42..84,314.74 rows=2,088 width=58) (actual time=2,363.900..2,363.904 rows=10 loops=1)

Postgres expect 209 times as many rows as returned. explain.depesz.com can help auditing a query plan: http://explain.depesz.com/s/53E

You may be able to get more accurate estimations by increasing the statistics target for the indexed columns. Postgres not only gathers statistics for table columns, it does the same for indexed expressions (not for plain index columns for which statistics are available already). Details:

Index that is not used, yet influences query

You can check with:

SELECT * FROM pg_stats WHERE tablename = 'users_first_name_gin';

Basic row counts and settings are in pg_class and pg_attribute:

SELECT *
FROM   pg_attribute
WHERE  attrelid = 'users_last_name_gin'::regclass;

SELECT *
FROM   pg_class
WHERE  oid = 'users_last_name_gin'::regclass;

Indexes are treated as special tables internally. This should make it less surprising that you can do use ALTER TABLE on an index:

ALTER TABLE users_last_name_gin ALTER COLUMN f_unaccent SET STATISTICS 1000;

Use this to increase the sample size for calculating statistics for the index column. Then run ANALYZE users to update statistics.

You can't provide explicit column names for index entries, names are chosen automatically. You can look it up with the query on pg_attribute above. The column name f_unaccent is derived from the used function name.

Default statistics target is 100, the range of 0 - 10000 is allowed. For very big tables, 100 is often not enough to get reasonable estimates. Set it to 1000 (example) for the index expression to get better estimates.

Workaround

Like dezso commented you can get the alternative query plan by encapsulating the original form in a CTE (which acts as optimization barrier in Postgres) - before ORDER BY and LIMIT in the outer SELECT:

WITH cte AS (
   SELECT *, count(*) OVER() AS filtered_count
   FROM   users 
   WHERE (f_unaccent("users"."first_name")       ILIKE f_unaccent('%foo%') OR 
          f_unaccent("users"."last_name")        ILIKE f_unaccent('%foo%') OR 
          f_unaccent("users"."club_or_hometown") ILIKE f_unaccent('%foo%'))
   )
SELECT *
FROM   cte
ORDER  BY first_name
LIMIT  10;

Alternative index

Since you commented:

I'm always searching all three columns

A single index for all three would be cheaper. GIN indexes can be multicolumn indexes. The documentation:

Currently, only the B-tree, GiST and GIN index types support multicolumn indexes.

And:

A multicolumn GIN index can be used with query conditions that involve any subset of the index's columns. Unlike B-tree or GiST, index search effectiveness is the same regardless of which index column(s) the query conditions use.

So:

CREATE INDEX big_unaccent_big_gin_idx ON users USING gin (
     f_unaccent(first_name)       gin_trgm_ops
   , f_unaccent(last_name)        gin_trgm_ops
   , f_unaccent(club_or_hometown) gin_trgm_ops);

This would reduce the triple overhead per index entry to just one. Should be faster overall. Or, faster yet, concatenate all three columns into a single string. I am adding a space as separator to avoid false positives. Use any character as separator that is not going to show up in the search expression:

CREATE INDEX big_unaccent_big_gin_idx ON users USING gin (
   f_unaccent(concat_ws(' ', first_name, last_name, club_or_hometown)) gin_trgm_ops);

If all columns are NOT NULL, you can use plain concatenation instead:

first_name || ' ' || last_name || ' ' || club_or_hometown

Be sure to use the same expression in your query:

WHERE f_unaccent(concat_ws(' ', first_name, last_name, club_or_hometown)) ILIKE '%foo%'

Set STATISTICS to 1000 or more like demonstrated above and ANALYZE before you test again. Be sure to run the query multiple times to compare warm cache to warm cache.

Besides the smaller index and faster computation, the main benefit for your case could be that a single predicate is less susceptible to errors in the cost estimation. Combining multiple predicates adds errors to the calculation.

Force query plan

If all else fails, you can force your preferred query plan like I suggested in the comment by disabling index scans temporarily. Remember, this is an evil hack that may fire back if underlying data distribution changes or if you upgrade to the next Postgres version:

Use SET LOCAL to confine the effect to the transaction and wrap the whole thing in an explicit transaction.

BEGIN;
SET LOCAL enable_indexscan = off;
SELECT  ...  -- your query here
COMMIT;

Postgresql – ny way in Postgres to parameterise a procedure for sort order column, when the columns are of different types

Be careful with conditional ordering, it can create bad query plans sometimes forcing table scanning. If the filtering and joining clauses, or just the size of the actual data, mean that you have a small number of rows to sort at the end then this is not an issue and something like this will work:

ORDER BY CASE WHEN ordering_column = 'id'        THEN id        ELSE NULL END
       , CASE WHEN ordering_column = 'timestamp' THEN timestamp ELSE NULL END

In fact it will work anyway, it might just be inefficient for a large amount of data.

For larger outputs your workaround may be more efficient as it may be able to make better use of indexes for the sorting. Another alternative is to have two procedures, one for each sort, and either call each as needed or have your main procedure call the others depending on the sort order it is passed in the parameter. Depending on how postgres handles cached query plans for procedures this may^[†] avoid issues of a cached plan for one case being used for another where it is vastly less efficient.

^{[†] I'm no expert at all on pg's internals, but single "kitchen sink" procedures and queries with conditional sorts etc. can be a performance killer in SQL Server for this sort of reason.}