PostgreSQL – How to write a join query using a subquery

postgresql

This is the question:
What is the most common stackoverflow tag_type? What companies have a tag of that type?

My Solution:

SELECT type, count(tag) AS count
  FROM tag_type
group by type
order by count desc;

Result of this query is:

type       count
cloud       31
database     6

And so on. I then filtered on type = 'cloud' by hand, as shown below, and got the correct result.

SELECT tag_company.tag, company.name, tag_type.type
  FROM company
       -- Join to the tag_company table
      INNER JOIN tag_company 
      ON company.id = tag_company.company_id
      -- Join to the tag_type table
      INNER JOIN tag_type
      ON tag_company.tag = tag_type.tag
  -- Filter to most common type
  WHERE type='cloud';

My Question:
I get the desired result, but I want to avoid typing type = 'cloud' by hand and am looking for a dynamic way to write this query. How can I do that?

And how can this query be written with a subquery?

Or how could I combine these steps in a single query by using a subquery in the WHERE clause instead of the value 'cloud'?

Best Answer

I think this question will be fine here, though it could as well have been asked at SO. One nice property of the relational model is that it is closed under relational operations, that is, the result of a query is a new relation. I'll use Common Table Expressions (CTE), but you may as well use a subquery:

WITH type_counts (type, cnt) AS -- type is a bad identifier,
                                -- but I'll ignore that
( SELECT type, count(tag) AS count
  FROM tag_type
  group by type
), company_tag_info AS 
( SELECT tag_company.tag, company.name, tag_type.type
  FROM company
  JOIN tag_company 
      ON company.id = tag_company.company_id
  JOIN tag_type
      ON tag_company.tag = tag_type.tag
)
SELECT cti.*
FROM type_counts tc
JOIN company_tag_info cti
    ON cti.type = tc.type

If you just want information related to the top type, order by the count and limit to one row inside the type_counts CTE. (Applying LIMIT 1 to the final join instead would return a single company row rather than every company tagged with the top type.)

WITH type_counts (type, cnt) AS -- type is a bad identifier,
                                -- but I'll ignore that
( SELECT type, count(tag) AS count
  FROM tag_type
  GROUP BY type
  ORDER BY count(tag) DESC      -- most common type first
  LIMIT 1                       -- keep only the top type
), company_tag_info AS 
( SELECT tag_company.tag, company.name, tag_type.type
  FROM company
  JOIN tag_company 
      ON company.id = tag_company.company_id
  JOIN tag_type
      ON tag_company.tag = tag_type.tag
)
SELECT cti.*
FROM type_counts tc
JOIN company_tag_info cti
    ON cti.type = tc.type;

This can be further simplified, but I believe that the above is the most relevant part for your question.
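Since the question asks specifically for a subquery in the WHERE clause, one possible simplification is sketched below. It reuses the table and column names from the question and assumes that, when two types tie for the highest count, it does not matter which one LIMIT 1 picks.

```sql
-- Sketch: the hard-coded 'cloud' is replaced by a scalar subquery
-- that returns the most common type.
SELECT tag_company.tag, company.name, tag_type.type
FROM company
JOIN tag_company
    ON company.id = tag_company.company_id
JOIN tag_type
    ON tag_company.tag = tag_type.tag
WHERE tag_type.type = (
        SELECT type
        FROM tag_type
        GROUP BY type
        ORDER BY count(tag) DESC  -- most common type first
        LIMIT 1                   -- ties are broken arbitrarily
      );
```

The subquery is evaluated independently of the outer query, so it can be run on its own first to check which type it returns.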

Related Solutions

PostgreSQL Self-Join – How to Create Unique Pairs

Procedural solution with PL/pgSQL

CREATE OR REPLACE FUNCTION f_next_round()
  RETURNS TABLE (player_id1 int, player_id2 int) AS
$func$
DECLARE
   rows int := (SELECT count(*)/2 FROM tbl);  -- expected number of resulting rows
   ct   int := 0;                             -- running count
BEGIN

CREATE TEMP TABLE t ON COMMIT DROP AS         -- possible combinations
SELECT t1.player_id AS p1, t2.player_id AS p2
     , COALESCE(array_length(t1.opp_log,1), 0) AS len1
     , COALESCE(array_length(t2.opp_log,1), 0) AS len2
FROM   tbl t1, tbl t2 
WHERE  t2.player_id <> t1.player_id
AND    t2.player_id <> ALL (t1.opp_log)
AND    t1.player_id <> ALL (t2.opp_log)
ORDER  BY len1 DESC, len2 DESC;               -- opportune sort order

LOOP
   SELECT INTO player_id1, player_id2  p1, p2 FROM t LIMIT 1;

   EXIT WHEN NOT FOUND;
   RETURN NEXT;
   ct := ct + 1;                              -- running count

   DELETE FROM t                              -- remove obsolete pairs
   WHERE  p1 IN (player_id1, player_id2) OR 
          p2 IN (player_id1, player_id2);
END LOOP;

IF ct < rows THEN
   RAISE EXCEPTION 'Could not find a solution';
ELSIF ct > rows THEN
   RAISE EXCEPTION 'Impossible result!';
END IF;

END
$func$  LANGUAGE plpgsql VOLATILE;

How?

Build a temporary table with remaining possible pairs. This kind of cross join produces a lot of rows with big tables, but since we seem to be talking about tournaments, numbers should be reasonably low.

Players with the longest list of opponents are sorted first. This way, players that would be hard to match come first, increasing the chance for a solution.

Pick the first row and delete the related pairings, which are now obsolete. No need to sort again. Logically any row is good; in practice we get the player with the longest list of opponents first because of the initial sort (which is not reliable without ORDER BY, but good enough for this case).

Repeat until no match is left.
Keep count and raise an exception if the count is not as expected. PL/pgSQL conveniently lets you raise an exception after the fact, which cancels any previously returned rows. Details in the manual.
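If a deterministic pick matters more than avoiding a sort on every iteration, the SELECT inside the loop could carry its own ORDER BY. This is only a sketch of that variation, not part of the function above; it is a PL/pgSQL fragment meant to replace the existing SELECT INTO statement, and it reuses the temporary table t and the columns len1 and len2 defined there.

```sql
-- Variation of the pick inside the LOOP: sort explicitly each time
-- instead of relying on the physical order of the temp table.
SELECT INTO player_id1, player_id2  p1, p2
FROM   t
ORDER  BY len1 DESC, len2 DESC
LIMIT  1;
```

For the small row counts expected in a tournament the extra sort should be cheap, but that is an assumption worth checking against real data.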

Call:

SELECT * FROM f_next_round();

Result:

player_id1 | player_id2
-----------+-----------
1          | 7
2          | 3
4          | 8
5          | 6

SQL Fiddle.

Note

This is not guaranteed to calculate the perfect solution. It just returns a possible solution and uses some limited smarts to improve the chances of finding one. The problem is a bit like solving a Sudoku, really, and is not trivially solved perfectly.

PostgreSQL Performance – Use Nested Loop with Indices Over Hash Join

This closely related answer on SO should provide answers to your primary question:
Setting enable_seqscan = off in a single SELECT query

In similar fashion, you could disable hash joins for the current transaction:

SET LOCAL enable_hashjoin=off;

But that's not my advice. Read the answer over there.
And this one about statistics and cost settings, too.
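For reference, SET LOCAL only has an effect inside a transaction block and is reverted when the transaction ends. A minimal sketch of scoping the setting to a single statement might look like the following; the query is a placeholder that assumes the users and posts tables referenced in this answer, with posts.owner_user_id pointing at users.id.

```sql
BEGIN;

-- Applies only until COMMIT or ROLLBACK
SET LOCAL enable_hashjoin = off;

-- Placeholder query: inspect which join method the planner picks now
EXPLAIN
SELECT p.id
FROM   users u
JOIN   posts p ON p.owner_user_id = u.id
WHERE  u.reputation > 100000;

COMMIT;
```

This is mainly useful for experiments with EXPLAIN; it does not fix whatever led the planner to prefer a hash join in the first place.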

More importantly, untangle your query first:

SELECT creation_epoch, user_screen_name, chunk
FROM  (
   SELECT id AS owner_user_id
   FROM   users
   WHERE  reputation > 100000
   ORDER  BY reputation 
   LIMIT  500
   ) u
JOIN   posts p USING (owner_user_id)
JOIN   post_tokenized t USING (id)
WHERE  type = 'tag'
AND    user_screen_name IS NOT NULL;

This should be considerably faster and should also make it easier for the query planner to choose the best plan (given sane cost settings and table statistics).
