Postgresql – using lag() on multiple columns (gaps and islands issue)

postgresqlwindow functions

I have a problem I am trying to solve where I need to eliminate duplicate consecutive rows in a table using the first 7 columns as the basis to determine if it is a duplicate. The data looks like:

textColA|textColB|textColC|textColD|textColE|numCol1|numCol2|eventNum
--------+--------+--------+--------+--------+-------+-------+--------
    x   |    y   |    z   |  foo   |  bar   |  1    | 2     | 1
    x   |    y   |    z   |  foo   |  bar   |  1    | 2     | 2 
    x   |    y   |    z   |  foo   |  bar   |  1.1  | 2     | 3
    x   |    y   |    z   |  foo   |  bar   |  1.1  | 2     | 4
    x   |    y   |    z   |  foo   |  bar   |  1    | 2     | 5
    x   |    y   |    z   |  foo   |  bar   |  1    | 2     | 6

With these data in the order displayed, I want to retain eventNum 1,3 and 5 – event 5 because even though it is the same as 1 & 2 but follows 4, which is not the same.

I have the code to do it using a cursor looping through each row and doing the comparisons but while this will work for a table with 1, 2 or even 10 million rows, it will likely be problematic on the 300-400m row tables I need to process.

I've looked at different solutions using the lag() function but the examples I see only deal with one column and I can't seem to the get the syntax right for using multiple columns.

Any help is appreciated.

Best Answer

lag() only works over a single column, so you need to define one new "column" for each lag value. You can use a common Window clause so that it's clear to the query planner that it only needs to sort the rows once -- although I think Postgres can determine this for itself too -- but it also saves typing.

After that it's just a matter of comparing the "lagged" values to see if they are in any way different. is distinct from is handy here as the first "lagged" values will all be null, so they are distinct from the first row's values and so the first row will be included in the output.

Putting it all together, it works like this:

select textColA, textColB, textColC, textColD, textColE, numCol1, numCol2, eventNum
from (
  select
    l.*,
    lag(textColA) over w as lagTextColA,
    lag(textColB) over w as lagTextColB,
    lag(textColC) over w as lagTextColC,
    lag(textColD) over w as lagTextColD,
    lag(textColE) over w as lagTextColE,
    lag(numCol1) over w as lagNumCol1,
    lag(numCol2) over w as lagNumCol2
  from lagger l
  window w as (order by textColA, textColB, textColC, textColD, textColE, eventNum)
) x
where
   textColA is distinct from lagTextColA
or textColB is distinct from lagTextColB
or textColC is distinct from lagTextColC
or textColD is distinct from lagTextColD
or textColE is distinct from lagTextColE
or numCol1 is distinct from lagNumCol1
or numCol2 is distinct from lagNumCol2
;

Complete fiddle is here: https://dbfiddle.uk/?rdbms=postgres_11&fiddle=2e11af872a47bd75a20c64f87e7a6cd6 (NB In the fiddle I forgot the other columns in the Window clause)