Postgresql – Retrieve title from rows with a maximum count in other columns

aggregatecountpostgresqlwindow functions

Given the following table archived:

CREATE TABLE archived (
   sid       serial -- row id
   showurl   text   -- can be: btg, tmwwt, gcaf, emda
   title     text
   reference text   -- title id number
);

Sample Input:

sid | showurl | title       | reference
----+---------+-------------+-----------
1   | btg     | 01-Preface  | 01
2   | btg     | 01-Preface  | 01
3   | emda    | 26-Friction | 26
4   | tmwwt   | 13-Helper   | 13
5   | btg     | 05-Finale   | 05
6   | gcaf    | 07-Help     | 07

SQLfiddle to work with.

Desired Output:

1st selection: showurl: 'btg'

2nd selection: reference: '01'

3rd and final selection: title: '01-Preface'

I am trying to determine from table archived which showurl has the greatest number of rows.

Then, for that 'showurl', I need to identify which reference has the greatest number of rows. Obviously, that 'reference' will be duplicated, but the 'sid' associated with each duplicated 'reference' will be unique.
Once that is determined, I wish to select/display its title.

My initial attempt (using php, postgresql 9.2: queried each showurl, then using max to identify the showurl with the greatest number of hits.):

$query = "SELECT * FROM archived WHERE showurl = 'btg' order by sid ASC";
$query = "SELECT * FROM archived WHERE showurl = 'tmwwt' order by sid ASC";
$query = "SELECT * FROM archived WHERE showurl = 'gcaf' order by sid ASC";
$query = "SELECT * FROM archived WHERE showurl = 'emda' order by sid ASC";

$max = max($numrows1,$numrows2,$numrows3,$numrows4);

Using the sample data above, $max would be 3, but showurl, title, reference, and sid are unavailable for further processing.

I have tried using pg_numrows() but I cannot find a way to refer back either to 'title' or 'reference' for the max rows.

I gather some form of UNION would be in order, but have no idea how to implement this.

Best Answer

Depending on your exact requirements, something along these lines should work:

Subqueries to determine most common attributes

SELECT DISTINCT title
FROM   archived
WHERE  showurl = (
   SELECT showurl
   FROM   archived
   GROUP  BY 1
   ORDER  BY count(*) DESC, 1
   LIMIT  1
   )
AND    reference = (
   SELECT reference
   FROM   archived
   GROUP  BY 1
   ORDER  BY count(*) DESC, 1
   LIMIT  1
   );

I added DISTINCT to the outer SELECT to fold duplicate titles. It's undefined what you need exactly.

Shorter with window functions

A more modern / sophisticated / elegant approach would be with two window functions in the ORDER BY clause, allowing for a very terse query:

SELECT title
FROM   archived
ORDER  BY count(*) OVER (PARTITION BY showurl)   DESC
         ,count(*) OVER (PARTITION BY reference) DESC
LIMIT 1;

Not necessarily faster than the first solution, though. Again, guessing what you might want exactly, I only retrieve a single winning title.

The trick here is to partition the rows by showurl and reference, run a count over each partition and order descending by that count. This way, you can determine a winner in a single query level.

This will not work, if there is no row with the most common showurl and the most common reference at the same time. Again, exact requirements would be needed ...

-> SQLfiddle demo.

Related Solutions

PostgreSQL – Using CASE in Window Functions to Partition by Age Group

You could use a LATERAL join for this. And no need for joining the users table twice:

FROM 
    competition_users AS cu
  LEFT JOIN users AS u
    ON cu.user_id = u.user_id
  LEFT JOIN LATERAL
    ( SELECT CASE WHEN u.dob > '2000-01-01' AND u.dob < now() 
                 THEN '0-15' 
                 ELSE '15+' 
             END AS ageGroup
    ) AS age
    ON TRUE
  LEFT JOIN current_competition_sessions AS ccs 
    ON cu.competition_user_id = ccs.competition_user_id 
WHERE cu.left_competition = FALSE 
  AND cu.competition_id = :compId

and then the WINDOW would be left (almost) unchanged:

WINDOW
    ageGroupTimes AS (PARTITION BY age.ageGroup ORDER BY cus.time_in_seconds ASC)

Alternatively, a simple CTE or derived table with a normal LEFT JOIN could be used, without the need for the (9.3+ available only) LATERAL. With the following you'll only need to slightly modify your upper-level SELECT and WINDOW (removing table aliases mainly):

FROM
  ( SELECT 
        cu.*, 
        cus.time_in_seconds,
        u.gender,
        CASE WHEN u.dob > '2000-01-01' AND u.dob < now() 
            THEN '0-15' 
            ELSE '15+' 
        END AS ageGroup
    FROM 
        competition_users AS cu
      LEFT JOIN users AS u
        ON cu.user_id = u.user_id
      LEFT JOIN current_competition_sessions AS ccs 
        ON cu.competition_user_id = ccs.competition_user_id 
    WHERE cu.left_competition = FALSE 
      AND cu.competition_id = :compId
  ) AS d

MySQL – Is It Possible to Reliably Number Rows in Some Order Conforming to SQL Standard?

There is a way to do that using only SQL 89 feature, but it will be horribly slow on any real world sized table:

select t1.id, 
       ... other columns ...,
       (select count(*) 
        from the_table t2 
        where t2.id <= t1.id) as row_num
from the_table t1
order by t1.id;

This assumes id is the column to sort the numbers by.