How to Count DISTINCT Values Outside GROUP BY in PostgreSQL

group bypostgresqlpostgresql-9.6

I am refactoring some business logic that we normally do as nightly processes in Java, and am now attempting to move into PostgreSQL as a set of materialized views. I've reduced the problem to its essentials, which results in the three tables below (category_source ~10 rows, category ~100k rows, category_tags ~10M rows).

The logic is fairly simple: for each Source, sum the values of each Tag in the set of Categories in that Source, and divide each of those sums by the total number of Categories in the Source.

--DROP TABLE category_tags;
--DROP TABLE category;
--DROP TABLE category_source;

CREATE TABLE category_source
(
  id integer NOT NULL,
  PRIMARY KEY (id)
);

CREATE TABLE category
(
  label text NOT NULL,
  source integer NOT NULL REFERENCES category_source(id),
  PRIMARY KEY (label)
);

CREATE TABLE category_tags
(
  category text NOT NULL REFERENCES category(label),
  tag text NOT NULL,
  rank real NOT NULL,
  PRIMARY KEY (category,tag)
);

-- Load sample data.
INSERT INTO category_source VALUES (1);
INSERT INTO category_source VALUES (2);

INSERT INTO category VALUES ('C3035240',1);
INSERT INTO category VALUES ('C3035245',1);
INSERT INTO category VALUES ('C3035250',2);

INSERT INTO category_tags VALUES ('C3035240','test',24.00);
INSERT INTO category_tags VALUES ('C3035240','sample',24.00);
INSERT INTO category_tags VALUES ('C3035240','method',20.00);
INSERT INTO category_tags VALUES ('C3035240','variety',18.00);
INSERT INTO category_tags VALUES ('C3035240','explanation',15.00);
INSERT INTO category_tags VALUES ('C3035245','test',20.00);
INSERT INTO category_tags VALUES ('C3035245','extra',21.00);
INSERT INTO category_tags VALUES ('C3035245','method',20.00);
INSERT INTO category_tags VALUES ('C3035245','sample',18.00);
INSERT INTO category_tags VALUES ('C3035245','question',15.00);
INSERT INTO category_tags VALUES ('C3035250','method',10.00);
INSERT INTO category_tags VALUES ('C3035250','explanation',8.00);
INSERT INTO category_tags VALUES ('C3035250','test',6.00);
INSERT INTO category_tags VALUES ('C3035250','question',5.00);
INSERT INTO category_tags VALUES ('C3035250','sample',2.00);
INSERT INTO category_tags VALUES ('C3035250','variety',4.00);

Query 1 here gets most of the way there, but the source_category_count column contains the number of Categories that the Tag is in for the Source (varies), whereas what I really want is the total number of Categories in the Source.

SELECT category_source.id,category_tags.tag,
       SUM(category_tags.rank) AS tag_total,
       COUNT(category.label) AS source_category_count,
       SUM(category_tags.rank)/COUNT(category.label) AS source_tag_rank
 FROM
  category_source,category,category_tags
 WHERE
  category_source.id=category.source
  AND category.label=category_tags.category
 GROUP BY category_source.id,category_tags.tag ORDER BY category_source.id,tag_total DESC;

 id |     tag     | tag_total | source_category_count | source_tag_rank
----+-------------+-----------+-----------------------+-----------------
  1 | test        |        44 |                     2 |              22
  1 | sample      |        42 |                     2 |              21
  1 | method      |        40 |                     2 |              20
  1 | extra       |        21 |                     1 |              21
  1 | variety     |        18 |                     1 |              18
  1 | explanation |        15 |                     1 |              15
  1 | question    |        15 |                     1 |              15
  2 | method      |        10 |                     1 |              10
  2 | explanation |         8 |                     1 |               8
  2 | test        |         6 |                     1 |               6
  2 | question    |         5 |                     1 |               5
  2 | variety     |         4 |                     1 |               4
  2 | sample      |         2 |                     1 |               2
(13 rows)

Query 2 below produces the results I am really looking for:

SELECT q1.*,q2.source_category_count,q1.tag_total/q2.source_category_count AS tag_source_rank FROM
(
  SELECT category_source.id AS source,category_tags.tag,SUM(category_tags.rank) AS tag_total
   FROM category_source
   INNER JOIN category ON (category_source.id=category.source)
   INNER JOIN category_tags ON (category.label=category_tags.category)
   GROUP BY category_source.id,category_tags.tag
) q1,
(
  SELECT source,COUNT(category) AS source_category_count FROM category GROUP BY source
) q2
 WHERE q1.source=q2.source
 ORDER BY source,tag_source_rank DESC
;

     source |     tag     | tag_total | source_category_count | tag_source_rank
    --------+-------------+-----------+-----------------------+-----------------
          1 | test        |        44 |                     2 |              22
          1 | sample      |        42 |                     2 |              21
          1 | method      |        40 |                     2 |              20
          1 | extra       |        21 |                     2 |            10.5
          1 | variety     |        18 |                     2 |               9
          1 | explanation |        15 |                     2 |             7.5
          1 | question    |        15 |                     2 |             7.5
          2 | method      |        10 |                     1 |              10
          2 | explanation |         8 |                     1 |               8
          2 | test        |         6 |                     1 |               6
          2 | question    |         5 |                     1 |               5
          2 | variety     |         4 |                     1 |               4
          2 | sample      |         2 |                     1 |               2
    (13 rows)

Query 3 produces equivalent results using WITH x () SELECT ...:

WITH category_counts AS
(
  SELECT source,COUNT(category) AS source_category_count FROM category GROUP BY source
) 
SELECT category_counts.source,category_tags.tag,
       SUM(category_tags.rank) AS tag_total,
       COUNT(category.label) AS source_category_freq,
       category_counts.source_category_count,
       SUM(category_tags.rank)/category_counts.source_category_count AS source_tag_rank
 FROM category_counts
 INNER JOIN category ON (category_counts.source=category.source)
 INNER JOIN category_tags ON (category.label=category_tags.category)
 GROUP BY category_counts.source,category_counts.source_category_count,category_tags.tag ORDER BY category_counts.source,tag_total DESC;

 source |     tag     | tag_total | source_category_freq | source_category_count | source_tag_rank
--------+-------------+-----------+----------------------+-----------------------+-----------------
      1 | test        |        44 |                    2 |                     2 |              22
      1 | sample      |        42 |                    2 |                     2 |              21
      1 | method      |        40 |                    2 |                     2 |              20
      1 | extra       |        21 |                    1 |                     2 |            10.5
      1 | variety     |        18 |                    1 |                     2 |               9
      1 | explanation |        15 |                    1 |                     2 |             7.5
      1 | question    |        15 |                    1 |                     2 |             7.5
      2 | method      |        10 |                    1 |                     1 |              10
      2 | explanation |         8 |                    1 |                     1 |               8
      2 | test        |         6 |                    1 |                     1 |               6
      2 | question    |         5 |                    1 |                     1 |               5
      2 | variety     |         4 |                    1 |                     1 |               4
      2 | sample      |         2 |                    1 |                     1 |               2
(13 rows)

Although I have two working queries that both produce the results I am looking for, I am unsatisfied with using two subqueries over tables of this size (I have not yet loaded in all of my data or done any performance testing, I am simply working on this test case at the moment).

I feel that the value for source_category_count that I am looking for is buried somewhere in Query 1 and I just don't know how to access it.

Another alternative I am investigating is COUNT() OVER (PARTITION BY category_source), but I do not have a working query for that method at the moment.

Is there a simpler query that will produce the same results as Query 2 or Query 3 (i.e., a modification of Query 1)?

Update: Added a second working query.

Best Answer

Instead of joining straight to the category table you can join instead to a subquery with a window function. You only query the table once and also get your end result.

SELECT category_source.id AS source,
    category_tags.tag,
    SUM(category_tags.rank) AS tag_total,
    MAX(category_source_count) AS source_category_count,
    SUM(category_tags.rank)/MAX(category_source_count) AS source_tag_rank
FROM category_source
   INNER JOIN 
        (
        SELECT label, source, 
            COUNT(*) OVER (PARTITION BY source) category_source_count 
            FROM category
        ) AS category ON category_source.id=category.source
   INNER JOIN category_tags ON category.label=category_tags.category
GROUP BY category_source.id,category_tags.tag 
ORDER BY category_source.id,tag_total DESC;