PostgreSQL Query Error – How to Fix Column Cannot Be Referenced in This Part of Query

duplicationpostgresqlsubquery

I've been learning Postgres (coming from SQL Server), and this error really confuses me.

Here is the code with some sample data:

create table T (
ID serial primary key,
A varchar(1),
B varchar(1),
C varchar(1)
)

↑ Testing table.

insert into T (A, B, C)
values('A', 'B', 'C'), ('A', 'B', 'C')

↑ Insert duplicates

delete from T
where ID in (
    select t.ID
    from (  select ID, row_number() over (partition by A,B,C order by A,B,C) as rn
            from T) as t
    where t.rn < (select max(t.rn) from t)
    )

↑ Delete duplicate keeping the last entry.

The problem is in the (select max(t.rn) from t) I am assuming this is a noob mistake related to not knowing postgres syntax when it comes to referencing columns with aliases?

Best Answer

First off, partition by A,B,C order by A,B,C makes no sense. Since you intent to keep the "last" row (meaning the one with the greatest ID), you probably meant:

partition by A,B,C order by ID

Still, the syntax is invalid on top of this. This subquery expression contains a reference to a column of the outer query: (select max(t.rn) from t). The scope of the subquery does not include columns from the outer query so rn is not visible in there. Only columns of the table t are.

You could use a CTE to allow the reference and make the syntax valid:

WITH cte AS (SELECT id, row_number() OVER (PARTITION by a,b,c ORDER BY id) AS rn FROM t)
DELETE FROM t
WHERE  id IN (
    SELECT id
    FROM   cte
    WHERE  rn < (SELECT max(rn) FROM cte)
    )

Still, the query is dangerous nonsense. Do not use this!
Comparing to the greatest row number is logical nonsense as each group of peers might have a different number of dupes. Would delete a lot more than it should.

Simpler, and correct:

DELETE FROM t
WHERE  id IN (
    SELECT t1.id
    FROM  (SELECT id, row_number() OVER (PARTITION by a,b,c ORDER BY id DESC) AS rn FROM t) t1
    WHERE  t1.rn > 1  --  all but the latest
    );

Which, in turn, can be had more cheaply as (assuming all columns NOT NULL!):

DELETE FROM t
WHERE  EXISTS (
   SELECT FROM t AS t1
   WHERE t1.a = t.a
   AND   t1.b = t.b
   AND   t1.c = t.c
   AND   t1.id > t.id
   );

"... where a dupe exists with a greater ID".

Related: