PostgreSQL – How to Limit an Inner Query by Outer Query Attribute

greatest-n-per-groupjoin;postgresqlsubquery

I have a requirement to sort a table by multiple criteria, select the first and associate with another table. The DB is Postgres.

Let's say we have players and teams where player has played from several teams. This is not the real composition of the model but should suffice for our example.

The requirement is to get the most recent record of the team the player played for, ordered by teams.end_date if it's present, otherwise by teams.start_date.

The algorithm would seem to be group by players.id, and order by the criteria mentioned above. However all of my attempts to limit the results to 1 gives me the first team record that is ordered by that criteria.

Here's what I think is the closest I got:

SELECT 
  latest_teams.* 
FROM players
  JOIN (
    SELECT 
      player_id,
      team_name,
      start_date,
      end_date
    FROM teams
    ORDER BY .... --this part is fine
    limit 1
  ) AS latest_teams
  ON players.id = teams.player_id

How can I get the scope of the inner query to be that of the player id?

Best Answer

First, if you SELECT only from the teams table and there is a FOREIGN KEY constraint that REFERENCES players, you don't need the join to players at all.

Now, there are various ways to write this type of queries, there is even a tag at SO and DBA.SE, greatest-n-per-group. We need the greatest 1 in this case. The most simple code is to use the DISTINCT ON construct (which is a Postgres, non-standard addition to SQL).
If you need columns from players, you can simply join and add the columns in the select list.:

SELECT DISTINCT ON (t.player_id)
  t.player_id,
  t.team_name,
  t.start_date,
  t.end_date
FROM teams AS t
  -- JOIN players AS p            
  --   ON p.id = t.player_id
ORDER BY 
    t.player_id,  -- this needs to match the DISTINCT ON ()
    ....      -- pick what order you want, so the 1st is chosen for each player
 ;            -- no LIMIT

Another way that works in most other DBMS is to use window functions. The criteria for which row to be chosen goes inside the OVER clause:

SELECT 
    g.player_id,
    g.team_name,
    g.start_date,
    g.end_date
FROM
  ( SELECT 
      t.player_id,
      t.team_name,
      t.start_date,
      t.end_date,
      ROW_NUMBER() OVER (PARTITION BY t.player_id 
                         ORDER BY ...) AS rn
    FROM teams AS t
      -- JOIN players AS p            
      --   ON p.id = t.player_id
  ) AS g
WHERE
    g.rn = 1 ;

If you are not in some ancient version of Postgres, there is also the LATERAL syntax, that is often the most efficient. Especially when there is a "driving" table with the distinct values we want to base the grouping (the players here) and if these values are relatively small in number and have many possible options (i.e. in this case, the biggest the teams table is, in comparison to the players, the better for this query).
Also notice how this query resembles very much your initial idea. It does, literally, exactly what you want: allow the players columns to be referenced inside the lt subquery:

SELECT
    lt.player_id,
    lt.team_name,
    lt.start_date,
    lt.end_date
FROM players AS p
  JOIN LATERAL
    ( SELECT t.*
      FROM teams AS t 
      WHERE p.id = t.player_id
      ORDER BY ....     -- pick the order
      LIMIT 1 
   ) AS lt ON TRUE ;

Related Solutions

Postgresql – Retrieving data from inner join using LIMIT and OFFSET

Use a subquery (as displayed) or CTE for that purpose:

SELECT *
FROM  (
   SELECT qid, gid
   FROM   table1
   ORDER  BY date DESC
   LIMIT  10
   OFFSET ?
   ) q
JOIN   table2 a USING (qid, gid)

USING (qid, gid) is just a shortcut for ON q.qid = a.qid AND q.gid = a.gid with the side effect that the two columns are only included once in the result.

SQL Server – Syntax of INNER JOIN Nested Inside OUTER JOIN

If you look at the 2 execution plans, is there an easy answer to which is better? I purposefully did NOT create indexes so it's easier to see what's happening.

The second plan has a lower estimated cost, so in that limited sense it is 'better'.

The data sets are so small that the optimizer did not spend much time looking at alternatives. The first form of the query happens to find a plan using hash join and a table spool early on. The estimated cost of that plan is so low that the optimizer does not bother looking for anything better.

The second form of the query happens to find a plan using only nested loops outer joins early in the search process, and again the optimizer decides that plan is good enough. It so happens that this plan is estimated to be cheaper.

That said (as mentioned in the question comments) the two queries are not semantically identical. This may not be important to you if you can guarantee that the results will always be the same for all possible future states of your database, but the optimizer cannot make that assumption. It only ever produces plans that are guaranteed to produce the same results specified by the SQL, in all circumstances.

I have realized that the nested syntax also modifies the behaviour of the query.

The 'nested syntax' is just one aspect of the whole ANSI join syntax specification. To enable a full logical specification for more complex join patterns, the specification allows (optional) parentheses, and FROM clause subqueries.

The query can be written using the same ANSI syntax using parentheses:

SELECT
    A.*,
    M.*,
    N.* 
FROM dbo.Autos AS A
LEFT JOIN
(
    dbo.Manufacturers AS N
    JOIN dbo.Models AS M
        ON M.ManufacturerID = N.ManufacturerID
) ON M.ModelID = A.ModelID;

This form clearly shows that the logical requirement is to left join from Autos to the result of inner joining Manufacturers to Models. Omitting the optional parentheses gives the form you call 'nested':

SELECT
    A.*,
    M.*,
    N.* 
FROM dbo.Autos AS A
LEFT JOIN dbo.Manufacturers AS N
JOIN dbo.Models AS M
    ON M.ManufacturerID = N.ManufacturerID
    ON M.ModelID = A.ModelID;

This is not a different syntax - it is just omitting optional parentheses and reformatting a bit.

As Martin mentioned, it is also possible in this case to express the logical requirement using inner joins followed by a right outer join:

SELECT
    A.*,
    M.*,
    N.* 
FROM dbo.Manufacturers AS N
JOIN dbo.Models AS M
    ON M.ManufacturerID = N.ManufacturerID
RIGHT JOIN dbo.Autos AS A
    ON A.ModelID = M.ModelID;

All three query forms above use the same ANSI join syntax. All three also happen to produce the same physical execution plan with the data set provided:

Common execution plan

As I mentioned in my answer to your previous question, queries that express exactly the same logical requirement will not always produce the same execution plan. Which logical query form you prefer to use is largely a question of style. There is no correlation between one particular style and 'better' query plans in general. I would generally advise against rewriting a query to get a particular plan if the new query is not genuinely logically identical to the original.

The SQL standard also allows FROM clause subqueries, so yet another way to write the same query specification is:

SELECT * 
FROM dbo.Autos AS A
LEFT JOIN
(
    SELECT
        N.ManufacturerID,
        ManufacturerName = N.Name,
        M.ModelID,
        ModelName = M.Name
    FROM dbo.Manufacturers AS N
    JOIN dbo.Models AS M
        ON M.ManufacturerID = N.ManufacturerID
) AS R1
    ON R1.ModelID = A.ModelID;

Using the traditional syntax, we have to change the join to `Manufacturers to an outer join, like so... but this changes the query plan.

This probably changes the meaning of the query, in which case it is technically not a valid alternative (but see ypercube's comment on your question).

The (optional) parentheses in the ANSI join syntax are there precisely for more complex join requirements like this, so you should not be afraid to use them where necessary.

Best Answer

Related Solutions

Postgresql – Retrieving data from inner join using LIMIT and OFFSET

SQL Server – Syntax of INNER JOIN Nested Inside OUTER JOIN

Related Question