Postgresql – Looking for a query that eliminates near-repeats

duplication postgresql

I have a table with four columns: county_a, county_b, flow_a_to_b, and flow_b_to_a. Essentially, the data repeats in other rows, but with the values switched around. Here's an example:

Entry 1: Baltimore County, Baltimore City, 10, 1

Entry 2: Baltimore City, Baltimore County, 1, 10

These entries are technically different, but they give me the same information. What kind of query could I write to eliminate the second entry while keeping the first?

Best Answer

Just from the problem you're describing, my sense is that you're doing something wrong and need better tooling like PostGIS and pgRouting. That said, perhaps something like this will work:

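-- name1, name2 stand for the two county columns; x1, x2 for the two flow columns.
-- Caveat: the flows are ordered by size, not by direction, so the result
-- does not record which county each flow started from.
SELECT DISTINCT
  greatest(name1, name2),
  least(name1, name2),
  greatest(x1, x2),
  least(x1, x2)
FROM tbl;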
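
The query above uses placeholder column names. As a minimal sketch with the column names from the question, assuming the table is called tbl (a hypothetical name), each flow can be kept tied to its direction by swapping the flows whenever the counties are swapped. The output column names are illustrative only.

```sql
-- Sketch only: tbl and the output aliases are assumptions, not from the question.
-- Each row is normalized so that the "smaller" county name comes first,
-- and the two flows are swapped together with the counties.
SELECT DISTINCT
  least(county_a, county_b)    AS county_1,
  greatest(county_a, county_b) AS county_2,
  CASE WHEN county_a <= county_b THEN flow_a_to_b ELSE flow_b_to_a END AS flow_1_to_2,
  CASE WHEN county_a <= county_b THEN flow_b_to_a ELSE flow_a_to_b END AS flow_2_to_1
FROM tbl;
```

With the two example entries, both rows normalize to the same output row, so DISTINCT keeps only one of them. Rows with NULL county names are not handled by this sketch.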

If you're using PostGIS, then you can use spatial equality:

SELECT max(coalesce(t1.name,t2.name))
FROM tbl AS t1   -- placeholder table name; "table" itself is a reserved word
JOIN tbl AS t2
  ON ST_Equals(t1.line,t2.line)
GROUP BY ST_Equals(t1.line,t2.line);

If you're using pgRouting, none of this matters because the edges become the same anyway. Edges are bidirectional.

Related Solutions

PostgreSQL – How to Remove Duplicates and Keep the Most Informative Record

Your question is a bit light on definitions. Assuming:

You define the "least amount of information" with how many of the relevant columns are NULL.
Primary key is adr_id.
Duplicates are marked with a column dupe_id to indicate groups of duplicates.

Since it's also vague what to do exactly, I create a separate table with the dupe ranking:

CREATE TABLE adr_dupe_rank AS
SELECT adr_id, dupe_id
     , rank() OVER (PARTITION BY dupe_id
                    ORDER BY (nr_of_beds  IS NULL)::int
                           + (nr_of_baths IS NULL)::int
                        -- + (...)::int  -- more terms
                   ) AS rnk
FROM   address;

false translates to 0, true translates to 1, so rows with the fewest NULL columns are ranked first. Master rows end up with rnk = 1. Dupes get higher rnk numbers.

The window function rank() assigns 1 to all rows sharing the lowest score per dupe_id. Add enough columns or other terms to break ties and get one winner per dupe_id. Or deal with multiple winners separately.
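
One possible way to force a single winner per dupe_id, sketched here under the assumption that any row among the tied ones is acceptable as the master, is to use row_number() instead of rank() and add the primary key adr_id as a final tie-breaker:

```sql
-- Sketch only: row_number() never produces ties, and adr_id is used here
-- as an arbitrary (assumed acceptable) tie-breaker among equally complete rows.
SELECT adr_id, dupe_id
     , row_number() OVER (PARTITION BY dupe_id
                          ORDER BY (nr_of_beds  IS NULL)::int
                                 + (nr_of_baths IS NULL)::int
                                 , adr_id
                         ) AS rn
FROM   address;
```

Exactly one row per dupe_id then gets rn = 1. Whether the lowest adr_id is a sensible choice of winner depends on the data.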

You can then do as you please. To just delete dupes:

DELETE FROM address a
USING  adr_dupe_rank d
WHERE  a.adr_id = d.adr_id
AND    d.rnk > 1;

Alternatively, you can use the above query as a derived table in the DELETE directly.
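
A minimal sketch of that variant, reusing the same ranking expression and assuming the same two example columns, might look like this:

```sql
-- Sketch only: the ranking query is inlined as a derived table,
-- so no separate adr_dupe_rank table is needed.
DELETE FROM address a
USING (
   SELECT adr_id
        , rank() OVER (PARTITION BY dupe_id
                       ORDER BY (nr_of_beds  IS NULL)::int
                              + (nr_of_baths IS NULL)::int
                      ) AS rnk
   FROM   address
   ) d
WHERE  a.adr_id = d.adr_id
AND    d.rnk > 1;
```

As with the separate table, rows that tie for first place within a dupe_id all keep rnk = 1 and are not deleted.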

PostgreSQL – How to Optimize Slow Query Performance

It looks like (I'm by no means a Postgres expert) you have about two million rows in the LEAD_DETAIL table that satisfy the state condition. Those two million rows are retrieved and hash joined to leads, and the result is sorted to return the top 5000 rows. Could you move state to the leads table, so that it becomes a column there? Create an index on (upper(state), created_date_time) and try again (after rewriting your query to use the new state column in the leads table, of course).
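
As a rough sketch, and assuming state has already been added to leads as a column, the suggested expression index could be created like this. The index name is hypothetical.

```sql
-- Sketch only: assumes leads now has a state column; the index name is made up.
-- An expression index on upper(state) can only be used by queries that
-- filter on the same expression, e.g. WHERE upper(state) = 'MD'.
CREATE INDEX leads_upper_state_created_idx
    ON leads (upper(state), created_date_time);
```

Whether the planner actually picks this index depends on the query and the data distribution, so it is worth checking with EXPLAIN ANALYZE.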

And by the way, most of your indexes on the leads table are useless. Why have both lead_created_date_time_idx and lead_created_date_time_idx1? They are the same: an ascending index can also be used for DESC sorts. Furthermore, why include lead_id in indexes (except the one supporting the primary key constraint)?
