Looking for help reporting duplicates and null marks

duplicationoracleself-join

Someone over at StackOverflow suggested this might be the place to ask this. I hope this will question is proper for this forum. I am looking for suggestions / criticisms of alternative methods of finding errors in an Oracle table. I am trying to report both duplicates and missing values something like this:

ID      MAINFIELD   LOCATION                            COUNTER
------- ---------   ---------------------------------   -------
16626   206000650   9A OLIVER ST CENTRAL STATION        2
18805   206000650   3 SWIFT CT CENTRAL STATION          2
22409   940000170   2 MARKET ST NEWARK DE               2
22003   940000170   1 MARKET ST NEWARK NJ               2
29533   970000030   95 MILL RD ANDOVER                  2
20256   970000030   12 RAILROAD AVE                     2
29018   978900050   44 BROAD STREET                     2
28432   978900050   WASHINGTON ST AND HAMILTON AVE      2
21831   980700050   BROADWAY NEWTOWN                    2
24147   980700050   MAIN STREET LEVITTOWN               2
26418                                                   3
26738               TEST DATA                           3
26755                                                   3

The ID is the primary key, is a number and not nullable. The other columns are both NVARCHAR2 and nullable. None of those are indexed. I am not permitted to alter the tables or otherwise modify the database. Some rows that I am reporting are outright errors. Some are obvious typos. Some appear to be test data. The last three rows have a null MAINFIELD and there are three such rows (two of which have the location null also).

This was my original solution and produced what is shown above.
This appears to achieve the desired result:

SELECT a.id, 
       a.mainfield, 
       a.location, 
       b.counter 
FROM   maintable a 
       INNER JOIN (
                    SELECT mainfield, 
                            Count(*) counter 
                    FROM   maintable 
                    GROUP  BY mainfield 
                    HAVING Count(mainfield) > 1 OR mainfield IS NULL
                  ) b ON a.mainfield = b.mainfield OR
                  ( a.mainfield IS NULL AND b.mainfield IS NULL ) 
ORDER  BY a.mainfield; 

I found this to run more quickly:

SELECT a.id, 
       a.mainfield, 
       a.location, 
       b.counter 
FROM   maintable a 
       INNER JOIN (
                    SELECT mainfield, 
                            Count(*) counter 
                    FROM   maintable 
                    GROUP  BY mainfield 
                    HAVING Count(mainfield) > 1 OR mainfield IS NULL
                  ) b ON NVL(a.mainfield,'***NULL***') = NVL(b.mainfield.'***NULL***') 
ORDER  BY a.mainfield; 

I liked this alternative, but it runs more slowly:

SELECT  id, 
        mainfield, 
        location,
        COUNT (id) OVER (PARTITION BY mainfield) counter
FROM    maintable a
WHERE   mainfield IS NULL
OR      EXISTS(SELECT 1 from maintable b
            WHERE mainfield = a.mainfield AND ROWID <> a.ROWID)
ORDER BY a.mainfield;

Is there a way to reference the counter [the COUNT (id) over PARTITION BY mainfield] in the WHERE clause which would allow me to speed this up?

I apologize in advance if this question is trivial and if there is some rule against cross-posting. I am not particularly strong in this area.


Putting the following code in a response does not format it properly and this isn't so much an answer as a follow-up to the answer posted, so I am adding it here rather than in an "answer."

Is there any reason why I might want to structure the query like this instead:

WITH TempTable AS
    (
    SELECT  id, 
        mainfield, 
        location,
        COUNT(id) OVER (PARTITION BY mainfield) counter
    FROM    maintable 
    ORDER BY mainfield
    )
SELECT  * 
FROM    TempTable 
WHERE   counter > 1 or mainfield is null
;

I like the syntax in the answer a bit better since it is easier to remember. What I have above is something I was experimenting with after googling around the web.

Thanks.

Best Answer

Of course.

SELECT * FROM (
SELECT  id, 
        mainfield, 
        location,
        COUNT (id) OVER (PARTITION BY mainfield) counter
FROM    maintable a
) where counter > 1 or mainfield IS NULL
ORDER BY mainfield;