PostgreSQL – Summing the count of two columns in Postgres that contain the same value

count, postgresql

I want to sum the total number of occurrences of one value across two columns in the same table.
So, an example would be:

id | node1 | node2
1  | 111   | 123
2  | 122   | 124
3  | 111   | 125
4  | 122   | 111
5  | 124   | 111
6  | 126   | 111

So in this case I want to get the following result:

node   | node_count
111    |     5
122    |     2
123    |     1
124    |     2
125    |     1
126    |     1

Additionally, I want to only include the nodes that have a count > 1 so my final result would be:

node   | node_count
111    |     5
122    |     2
124    |     2

I didn't think this would work but I've tried the following on the table:

SELECT count(node1+node2), node1 as node 
FROM table1 
WHERE node1 = node2 
GROUP BY node1 
HAVING count(node1+node2) > 1;

I then tried creating a temporary table so that I could use the WHERE clause as follows:

SELECT count(table1.node1+tableTemp.node2), table1.node1 
FROM table1, tableTemp 
WHERE table1.node1 = tableTemp.node2 
GROUP BY table1.node1 
HAVING count(table1.node1+tableTemp.node2) > 1;

But this only seems to return the count of node1. I have also tried the variation of count(table1.node1) + count(tableTemp.node2) but this doesn't work. I've also tried using a combination of SUM and COUNT-sub-queries to no avail.
Can anyone point me in the correct direction? Cheers.

Best Answer

This should be substantially cheaper than what we had so far:

SELECT x.node, count(*) AS node_count
FROM   tbl t, LATERAL (VALUES (t.node1), (t.node2)) AS x(node)
GROUP  BY 1
HAVING count(*) > 1;


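The LATERAL VALUES expression unpivots node1 and node2 into a single node column, so the query needs only a single pass over the underlying table.
Add ORDER BY if you want sorted output; the question did not request a sort order.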
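
To try the query against the sample data from the question, a minimal sketch follows. It assumes the table is named tbl as in the answer (the question calls it table1) and that both node columns are integers. The ORDER BY is added here only to make the output order predictable.

```sql
-- Assumed setup: the six sample rows from the question in a temp table.
CREATE TEMP TABLE tbl (id int PRIMARY KEY, node1 int, node2 int);
INSERT INTO tbl (id, node1, node2) VALUES
  (1, 111, 123),
  (2, 122, 124),
  (3, 111, 125),
  (4, 122, 111),
  (5, 124, 111),
  (6, 126, 111);

-- Unpivot node1 and node2 into one column, then count per value.
SELECT x.node, count(*) AS node_count
FROM   tbl t, LATERAL (VALUES (t.node1), (t.node2)) AS x(node)
GROUP  BY x.node
HAVING count(*) > 1
ORDER  BY x.node;
-- Returns (111, 5), (122, 2), (124, 2)

-- Alternative without LATERAL (which needs PostgreSQL 9.3 or later).
-- Same result, but the table is read twice.
SELECT node, count(*) AS node_count
FROM  (
   SELECT node1 AS node FROM tbl
   UNION ALL
   SELECT node2 FROM tbl
   ) sub
GROUP  BY node
HAVING count(*) > 1
ORDER  BY node;
```

UNION ALL (rather than UNION) matters in the second form: UNION would remove duplicate values before they can be counted.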

Related:

SELECT DISTINCT on multiple columns

Related Solutions

MySQL – Is SELECT COUNT GROUP BY more efficient than counting a result set

The answer depends a great deal on how well organized your data is and the query itself.

For example, look at the query you have in the question:

SELECT rank, COUNT(id) FROM tablename GROUP BY rank

The first thing I think about with this query is whether the table is properly indexed.

OBSERVATION #1

If tablename had no indexes, a full table scan would be required.

OBSERVATION #2

If tablename had an index on rank, you would still get a full table scan: the MySQL Query Optimizer rules out the index because of factors such as key distribution and the possibility of having to look up each id for every rank during a full index scan.

OBSERVATION #3

If the table had a compound index on (rank,id), then you can get a full index scan. In most cases, a full index scan that never references the table for non-indexed columns is faster than a full index scan that does (see OBSERVATION #2).

OBSERVATION #4

If the query were written slightly differently:

SELECT rank, COUNT(1) FROM tablename GROUP BY rank

then an index on just the rank column would suffice and produce a full index scan.
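
One way to check these observations on your own data is to create the indexes and compare the EXPLAIN output. This is only a sketch: it assumes tablename has the columns id and rank used in the query above, and the index names idx_rank_id and idx_rank are hypothetical. Because RANK is a reserved word in MySQL 8.0 and later, the column name is quoted with backticks.

```sql
-- OBSERVATION #3: compound index that covers both columns of the query.
CREATE INDEX idx_rank_id ON tablename (`rank`, id);
EXPLAIN SELECT `rank`, COUNT(id) FROM tablename GROUP BY `rank`;

-- OBSERVATION #4: index on rank alone, with a query that no longer
-- references id.
CREATE INDEX idx_rank ON tablename (`rank`);
EXPLAIN SELECT `rank`, COUNT(1) FROM tablename GROUP BY `rank`;
```

In the EXPLAIN output, "Using index" in the Extra column indicates that the query is answered from the index alone, without reading table rows. With both indexes present, which one the optimizer picks may vary, so it can be clearer to test them one at a time.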

CONCLUSION

In light of these observations, it is definitely a thing of beauty to present to the MySQL Query Optimizer two things:

a good query
proper indexes for all tables in the query

In retrospect, it is also good to give the MySQL Query Optimizer as much of an advantage upfront as possible.

PostgreSQL – Finding earliest connected value over two columns

It's a recursive problem by nature. SQL introduced recursive CTEs for that purpose quite some time ago. Since Redshift does not seem to support them, your only remaining option to solve this within the database is a user-defined function (UDF). It seems that the only server-side language supported by Redshift is plpythonu.

Here is an old answer with a PL/pgSQL function and an rCTE implementing the same solution side by side. However, it uses the default procedural language of Postgres, PL/pgSQL:

Loop in function does not work as expected

No idea why Redshift chose not to support PL/pgSQL.
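
For reference, a recursive CTE for this kind of problem in PostgreSQL itself can look like the following minimal sketch. The table link and its columns val and prev_val are hypothetical, since the original table is not shown here, and the data is assumed to contain no cycles (with UNION ALL, a cycle would make the recursion run forever).

```sql
-- Hypothetical data: each row connects a value to its predecessor.
CREATE TEMP TABLE link (val int, prev_val int);
INSERT INTO link (val, prev_val) VALUES
  (2, 1), (3, 2), (4, 3), (11, 10);

WITH RECURSIVE chain AS (
   -- Anchor: every row starts its own chain.
   SELECT val AS start_val, prev_val AS cur
   FROM   link
   UNION ALL
   -- Recursive step: follow the connection one hop further back.
   SELECT c.start_val, l.prev_val
   FROM   chain c
   JOIN   link  l ON l.val = c.cur
   )
SELECT start_val, min(cur) AS earliest
FROM   chain
GROUP  BY start_val
ORDER  BY start_val;
-- Returns (2, 1), (3, 1), (4, 1), (11, 10)
```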
