SQL Server – Handling Duplicate Rows Based on Columns

sql server

Here is what I am trying to do:

1. Get duplicates based on 2 columns (let's say this returns 500 rows).
2. Get duplicates based on the same 2 columns plus another column (let's say this returns 100 rows).

Now I want to get the remaining 400 rows. In other words, I want all rows that are duplicates on ColumnA and ColumnB but are no longer duplicates once ColumnC is taken into account.

-- get duplicates based on ColumnA, ColumnB
SELECT '-'
    ,ColumnA
    ,ColumnB
    ,ColumnC
    ,COUNT(*)
FROM MainTable
     ...SOME joins(INNER AND left)
WHERE ColumnA IS NOT NULL
GROUP BY ColumnA
    ,ColumnB
    ,ColumnC
HAVING COUNT(*) > 1

EXCEPT

-- get duplicates based on ColumnA, ColumnB, ColumnC
SELECT '-'
    ,ColumnA
    ,ColumnB
    ,ColumnC
    ,COUNT(*)
FROM MainTable
     ...SOME joins(INNER AND left)
WHERE ColumnA IS NOT NULL
GROUP BY ColumnA
    ,ColumnB
    ,ColumnC
HAVING COUNT(*) > 1
ORDER BY COUNT(*) DESC

I am just not able to complete this query 🙁

Best Answer

Using window functions will likely be simpler here:

WITH cte AS
( 
    SELECT 
        ColumnA,
        ColumnB,
        ColumnC,
        COUNT(*) OVER (PARTITION BY ColumnA, ColumnB)
          AS count_ab,
        COUNT(*) OVER (PARTITION BY ColumnA, ColumnB, ColumnC)
          AS count_abc
    FROM MainTable
         ...SOME joins(INNER AND left)
    WHERE ColumnA IS NOT NULL
)
SELECT
    ColumnA,
    ColumnB,
    ColumnC,
    count_ab
FROM 
    cte
WHERE
    count_ab > 1
  AND
    count_abc = 1 ;
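
To see how the two window counts interact, here is a minimal sketch run against made-up data. The table variable @MainTable and its rows are assumptions for illustration only; they stand in for MainTable and its joins, which the question does not show.

```sql
-- Hypothetical sample data; names and values are illustrative only.
DECLARE @MainTable TABLE (ColumnA int, ColumnB int, ColumnC varchar(10));

INSERT INTO @MainTable (ColumnA, ColumnB, ColumnC)
VALUES (1, 1, 'x'),   -- (1,1) occurs three times; 'x' occurs twice within it
       (1, 1, 'x'),
       (1, 1, 'y'),   -- unique within (1,1) once ColumnC is considered
       (2, 2, 'p'),   -- (2,2) occurs twice, each time with a different ColumnC
       (2, 2, 'q'),
       (3, 3, 'z');   -- not a duplicate on (ColumnA, ColumnB) at all

WITH cte AS
(
    SELECT
        ColumnA,
        ColumnB,
        ColumnC,
        -- rows sharing the same (ColumnA, ColumnB)
        COUNT(*) OVER (PARTITION BY ColumnA, ColumnB) AS count_ab,
        -- rows sharing the same (ColumnA, ColumnB, ColumnC)
        COUNT(*) OVER (PARTITION BY ColumnA, ColumnB, ColumnC) AS count_abc
    FROM @MainTable
    WHERE ColumnA IS NOT NULL
)
SELECT ColumnA, ColumnB, ColumnC, count_ab
FROM cte
WHERE count_ab > 1      -- duplicated on (ColumnA, ColumnB)
  AND count_abc = 1;    -- but not duplicated once ColumnC is included
```

With this data the query returns three rows: (1, 1, 'y') with count_ab = 3, and (2, 2, 'p') and (2, 2, 'q') with count_ab = 2. The two (1, 1, 'x') rows are filtered out because they are still duplicates when ColumnC is included, and (3, 3, 'z') is filtered out because it was never a duplicate. Row order is not guaranteed without an ORDER BY.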

Related Solutions

Does Detach/Attach or Offline/Online Clear Buffer Cache for Database?

I initially thought you were on to something here. My working assumption was that the buffer pool perhaps isn't flushed immediately, since flushing takes some work and there is little reason to bother until the memory is needed. But...

Your test is flawed.

What you're seeing in the buffer pool is the pages read as a result of re-attaching the database, not the remains of the previous instance of the database.

"And we can see that the buffer pool was not totally blown away by the detach/attach. Seems like my buddy was wrong. Does anyone disagree or have a better argument?"

Yes. You're interpreting "physical reads 0" as meaning that there were no physical reads at all.

Table 'DatabaseLog'. Scan count 1, logical reads 782, physical reads 0, read-ahead reads 768, lob logical reads 94, lob physical reads 4, lob read-ahead reads 24.

As described on Craig Freedman's blog, the sequential read-ahead mechanism tries to ensure that pages are in memory before the query processor requests them, which is why the reported physical read count is zero or lower than expected:

"When SQL Server performs a sequential scan of a large table, the storage engine initiates the read ahead mechanism to ensure that pages are in memory and ready to scan before they are needed by the query processor. The read ahead mechanism tries to stay 500 pages ahead of the scan."

None of the pages required to satisfy your query were in memory until read-ahead put them there.
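
One way to check this yourself is to look at the buffer pool directly instead of relying on the physical reads figure alone. The following is a rough sketch, not the original poster's test: the database name AdventureWorks is an assumption (the source only shows the table name DatabaseLog), and querying sys.dm_os_buffer_descriptors requires VIEW SERVER STATE permission.

```sql
-- Assumption: the database under test is called AdventureWorks (hypothetical).

-- 1. Count the pages currently cached for that database.
--    Run this right after the detach/attach and again after the test query.
SELECT COUNT(*)            AS cached_pages,
       COUNT(*) * 8 / 1024 AS cached_mb      -- each page is 8 KB
FROM sys.dm_os_buffer_descriptors
WHERE database_id = DB_ID(N'AdventureWorks');

-- 2. Run the test query with I/O statistics switched on.
SET STATISTICS IO ON;

SELECT * FROM AdventureWorks.dbo.DatabaseLog;

SET STATISTICS IO OFF;
```

When reading the STATISTICS IO output, treat physical reads and read-ahead reads together as pages fetched from disk. A result of "physical reads 0, read-ahead reads 768" still means that hundreds of pages were read from disk during the scan.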

Why online/offline results in a different buffer pool profile warrants a little more idle investigation. @MarkSRasmussen might be able to help us out with that next time he visits.

Sql-server – Insert query with a subquery

If the (subquery) has SELECT (whatever expression) AS col FROM ..., then you can do:

INSERT INTO mytable 
  (col1, col2, col3, col4) 
SELECT 
  val1, s.col, val2, val3
FROM 
  (subquery) AS s ;

or:

WITH s (col) AS
  (subquery)
INSERT INTO mytable 
  (col1, col2, col3, col4) 
SELECT 
  val1, s.col, val2, val3
FROM 
  s ;
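
As a concrete illustration of the derived-table form, here is a small sketch. Everything in it is hypothetical: the table variables @mytable and @source, their columns, and the constant values are placeholders for the val1, val2, val3 and (subquery) of the template above.

```sql
-- Hypothetical target and source tables, for illustration only.
DECLARE @mytable TABLE (col1 int, col2 int, col3 varchar(10), col4 date);
DECLARE @source  TABLE (amount int);

INSERT INTO @source (amount)
VALUES (10), (20), (30);

-- The subquery exposes a single column aliased as col;
-- the other three values are plain constants.
INSERT INTO @mytable (col1, col2, col3, col4)
SELECT
    1, s.col, 'fixed', CAST('2020-01-01' AS date)
FROM
    (SELECT MAX(amount) AS col FROM @source) AS s;

SELECT col1, col2, col3, col4
FROM @mytable;
```

The final SELECT returns a single row whose col2 is 30, the value produced by the subquery. If the subquery returned several rows, one row would be inserted per subquery row, each carrying the same constants.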
