PostgreSQL Query Error – How to Fix Column Cannot Be Referenced in This Part of Query

duplicationpostgresqlsubquery

I've been learning Postgres (coming from SQL Server), and this error really confuses me.

Here is the code with some sample data:

create table T (
ID serial primary key,
A varchar(1),
B varchar(1),
C varchar(1)
)

↑ Testing table.

insert into T (A, B, C)
values('A', 'B', 'C'), ('A', 'B', 'C')

↑ Insert duplicates

delete from T
where ID in (
    select t.ID
    from (  select ID, row_number() over (partition by A,B,C order by A,B,C) as rn
            from T) as t
    where t.rn < (select max(t.rn) from t)
    )

↑ Delete duplicate keeping the last entry.

The problem is in the (select max(t.rn) from t) I am assuming this is a noob mistake related to not knowing postgres syntax when it comes to referencing columns with aliases?

Best Answer

First off, partition by A,B,C order by A,B,C makes no sense. Since you intent to keep the "last" row (meaning the one with the greatest ID), you probably meant:

partition by A,B,C order by ID

Still, the syntax is invalid on top of this. This subquery expression contains a reference to a column of the outer query: (select max(t.rn) from t). The scope of the subquery does not include columns from the outer query so rn is not visible in there. Only columns of the table t are.

You could use a CTE to allow the reference and make the syntax valid:

WITH cte AS (SELECT id, row_number() OVER (PARTITION by a,b,c ORDER BY id) AS rn FROM t)
DELETE FROM t
WHERE  id IN (
    SELECT id
    FROM   cte
    WHERE  rn < (SELECT max(rn) FROM cte)
    )

Still, the query is dangerous nonsense. Do not use this!
Comparing to the greatest row number is logical nonsense as each group of peers might have a different number of dupes. Would delete a lot more than it should.

Simpler, and correct:

DELETE FROM t
WHERE  id IN (
    SELECT t1.id
    FROM  (SELECT id, row_number() OVER (PARTITION by a,b,c ORDER BY id DESC) AS rn FROM t) t1
    WHERE  t1.rn > 1  --  all but the latest
    );

Which, in turn, can be had more cheaply as (assuming all columns NOT NULL!):

DELETE FROM t
WHERE  EXISTS (
   SELECT FROM t AS t1
   WHERE t1.a = t.a
   AND   t1.b = t.b
   AND   t1.c = t.c
   AND   t1.id > t.id
   );

"... where a dupe exists with a greater ID".

How do I remove duplicate records in a join table in PostgreSQL?

Related Solutions

PostgreSQL Database Design – Need Help Designing a Database

If you have an index on LastLoggedIn, deleting a few tens of thousands of records should be fast without needing to partition. I ran a small test on MS SQL:

CREATE TABLE Temp.Temp
(
UserID  INT NOT NULL PRIMARY KEY,
LastLoggedIn    DATETIME2(0) NOT NULL INDEX,
Dummy1  VARCHAR(30),
Dummy2  INT
)

I insert ~57 M dummy records, then selected a LastLoggedIn date such that 35,000 records were at least that old. Deleting these records took less than a second. This is on a four-core machine with 128 GB of memory, under light load; your mileage may vary.

Of course, if you run this purge only every few weeks, you may have millions of records to delete rather than thousands (300 K records took seven seconds for me; 1.3 M records took 30 seconds). But in that scenario, it's part of your monthly maintenance window rather than daily upkeep, and a brief delay should be acceptable.

If you get poor performance with just an index, then sure, consider partitioning. As others have pointed out, there will be a significant cost to UPDATE operations, a greater overhead than maintaining an index. You will have many more updates than purges, and the latter can happen during relatively idle periods, so I would worry about the updates more.

If you need always-on uptime, and the delete locks the table for an unacceptable length of time, that might justify using partitioning, but even there I'd look for alternatives. You could SELECT a list of old users and then delete them one at a time (confirming their LastLoggedIn dates first of course) with a cursor, for example.

I'm an MS SQL guy, so my apologies if any of this is incorrect for PostgreSQL.

SQL Server – How to Delete Duplicates Using Specific Columns as Identifier

--=======================================================
-- delete the duplicate records from table @t
-- keeping a single unit of each
-- marcelo miorelli 24-nov-2014
--=======================================================


--=======================================================
--create a table variable and insert records in it
-- just for this example
--=======================================================
declare @t table ([num] int, com varbinary(512) )

insert into @t select 1, convert(varbinary(512), '6778981' )
insert into @t select 1, convert(varbinary(512), '6778981' )
insert into @t select 1, convert(varbinary(512), '6778982' )
insert into @t select 2, convert(varbinary(512), '6778982' )
insert into @t select 2, convert(varbinary(512), '6778982' )
insert into @t select 3, convert(varbinary(512), '6778982' )
insert into @t select 4, convert(varbinary(512), '6778982' )
insert into @t select 4, convert(varbinary(512), '6778982' )
insert into @t select 4, convert(varbinary(512), '6778982' )
insert into @t select 4, convert(varbinary(512), '6778982' )
insert into @t select 4, convert(varbinary(512), '6778983' )
insert into @t select 5, convert(varbinary(512), '6778983' )
insert into @t select 5, convert(varbinary(512), '6778983' )
insert into @t select 5, convert(varbinary(512), '6778983' )
insert into @t select 5, convert(varbinary(512), '6778983' )
insert into @t select 5, convert(varbinary(512), '6778984' )
insert into @t select 6, convert(varbinary(512), '6778984' )
insert into @t select 6, convert(varbinary(512), '6778984' )
insert into @t select 6, convert(varbinary(512), '6778985' )
insert into @t select 7, convert(varbinary(512), '6778985' )
insert into @t select 7, convert(varbinary(512), '6778985' )
insert into @t select 8, convert(varbinary(512), '6778985' )
insert into @t select 8, convert(varbinary(512), '6778985' )
insert into @t select 9, convert(varbinary(512), '6778985' )
insert into @t select 9, convert(varbinary(512), '6778986' )
insert into @t select 9, convert(varbinary(512), '6778986' )
insert into @t select 9, convert(varbinary(512), '6778986' )
insert into @t select 9, convert(varbinary(512), '6778986' )

--=======================================================
-- this is going to show the table before the deletes
--=======================================================
select * from @t


--=======================================================
-- this deletes the duplicates - leaving only 1 of each
--=======================================================
; with cte as (
    select *
        , row_number() over (partition by [num],com  order by [num],com ) as Picker
    from @t
    )
delete cte 
where Picker > 1


--=======================================================
-- this shows the table after the deletion - without the deletes
--=======================================================
select * from @t

Best Answer

Related Solutions

PostgreSQL Database Design – Need Help Designing a Database

SQL Server – How to Delete Duplicates Using Specific Columns as Identifier

Related Question