Postgresql – Looking for a query that eliminates near-repeats

duplicationpostgresql

I have a database with four rows: county_a, county_b, flow_a_to_b, and flow_b_to_a. Essentially, the data is repeating in other rows, but the values are switched around. Here's an example:

Entry 1: Baltimore County, Baltimore City, 10, 1

Entry 2: Baltimore City, Baltimore County, 1, 10

These entries are technically different but they give me the same information. What kind of query could I write to eliminate the second value while keeping the first?

Best Answer

My sense is that you're doing something wrong and need better tooling like PostGIS and PgRouting just from the problem you're describing but perhaps something like this will work,

SELECT DISTINCT
  greatest(name1, name2),
  least(name1, name2),
  greatest(x1,x2),
  least(x1,x2)
FROM tbl;

If you're using PostGIS then you can use spatial equality,

SELECT max(coalesce(t1.name,t2.name))
FROM table AS t1
JOIN table AS t2
  ON ST_Equals(t1.line,t2.line)
GROUP BY ST_Equals(t1.line,t2.line);

If you're using PgRouting none of this matters because the edges become the same anyway. Edges are bidirectional.