PostgreSQL – How to Limit an Inner Query by Outer Query Attribute

greatest-n-per-groupjoin;postgresqlsubquery

I have a requirement to sort a table by multiple criteria, select the first and associate with another table. The DB is Postgres.

Let's say we have players and teams where player has played from several teams. This is not the real composition of the model but should suffice for our example.

The requirement is to get the most recent record of the team the player played for, ordered by teams.end_date if it's present, otherwise by teams.start_date.

The algorithm would seem to be group by players.id, and order by the criteria mentioned above. However all of my attempts to limit the results to 1 gives me the first team record that is ordered by that criteria.

Here's what I think is the closest I got:

SELECT 
  latest_teams.* 
FROM players
  JOIN (
    SELECT 
      player_id,
      team_name,
      start_date,
      end_date
    FROM teams
    ORDER BY .... --this part is fine
    limit 1
  ) AS latest_teams
  ON players.id = teams.player_id

How can I get the scope of the inner query to be that of the player id?

Best Answer

First, if you SELECT only from the teams table and there is a FOREIGN KEY constraint that REFERENCES players, you don't need the join to players at all.

Now, there are various ways to write this type of queries, there is even a tag at SO and DBA.SE, . We need the greatest 1 in this case. The most simple code is to use the DISTINCT ON construct (which is a Postgres, non-standard addition to SQL).
If you need columns from players, you can simply join and add the columns in the select list.:

SELECT DISTINCT ON (t.player_id)
  t.player_id,
  t.team_name,
  t.start_date,
  t.end_date
FROM teams AS t
  -- JOIN players AS p            
  --   ON p.id = t.player_id
ORDER BY 
    t.player_id,  -- this needs to match the DISTINCT ON ()
    ....      -- pick what order you want, so the 1st is chosen for each player
 ;            -- no LIMIT

Another way that works in most other DBMS is to use window functions. The criteria for which row to be chosen goes inside the OVER clause:

SELECT 
    g.player_id,
    g.team_name,
    g.start_date,
    g.end_date
FROM
  ( SELECT 
      t.player_id,
      t.team_name,
      t.start_date,
      t.end_date,
      ROW_NUMBER() OVER (PARTITION BY t.player_id 
                         ORDER BY ...) AS rn
    FROM teams AS t
      -- JOIN players AS p            
      --   ON p.id = t.player_id
  ) AS g
WHERE
    g.rn = 1 ;

If you are not in some ancient version of Postgres, there is also the LATERAL syntax, that is often the most efficient. Especially when there is a "driving" table with the distinct values we want to base the grouping (the players here) and if these values are relatively small in number and have many possible options (i.e. in this case, the biggest the teams table is, in comparison to the players, the better for this query).
Also notice how this query resembles very much your initial idea. It does, literally, exactly what you want: allow the players columns to be referenced inside the lt subquery:

SELECT
    lt.player_id,
    lt.team_name,
    lt.start_date,
    lt.end_date
FROM players AS p
  JOIN LATERAL
    ( SELECT t.*
      FROM teams AS t 
      WHERE p.id = t.player_id
      ORDER BY ....     -- pick the order
      LIMIT 1 
   ) AS lt ON TRUE ;