PostgreSQL 9.2/9.3 – Get Distinct Results Based on Column and Condition

postgresqlpostgresql-9.2postgresql-9.3

I have the below select which is selecting from a table with the columns:

id, project_id, keyword_id, position,
Data example:

i, p_id, k_id, p 
1,  1,    1,   4
2,  1,    1,   5
3,  1,    1,   6
5,  1,    2,   7
6,  1,    2,   8
7,  1,    3,   5
8,  1,    3,   7

QUERY

SELECT
  pr.project_id,
  COUNT(DISTINCT pr.keyword_id) as total_keywords,
  COUNT(CASE WHEN position BETWEEN 4 AND 10 THEN position ELSE NULL END ) AS pos4,
  date(pr.created_at) AS created_at
FROM
  project_reports pr
GROUP BY pr.project_id, date(created_at)

I only want to get the number of distinct keyword_id in pos 4-10. But using my SELECT I get all of them.

Expected Data example:

p_id, total_keywords, pos4
1,      3,            3

My current Data output

p_id, total_keywords, pos4
1,      3,             7

Best Answer

If you want the distinct keywords then you need to count the keyword_id, not the position column.

So your expression:

COUNT(CASE WHEN position BETWEEN 4 AND 10 THEN position ELSE NULL END )

becomes:

COUNT(distinct CASE WHEN position BETWEEN 4 AND 10 THEN keyword_id ELSE NULL END )

So the final statement is:

SELECT
  pr.project_id,
  COUNT(DISTINCT pr.keyword_id) as total_keywords,
  count(distinct CASE WHEN position BETWEEN 4 AND 10 THEN keyword_id ELSE NULL END ) AS pos4
FROM
  project_reports pr
GROUP BY pr.project_id;

SQLFiddle example: http://sqlfiddle.com/#!15/b1015/1

Related Solutions

PostgreSQL Query – Select Data Based on Creation Time

If I guessed the missing pieces correctly, this query using a crosstab() function should do the job (and much faster than the original - except for small cardinalities where not much can be gained to begin with):

It requires that you install the additional module tablefunc first (once per database. Detailed instructions in this related answer on SO:
PostgreSQL Crosstab Query

SELECT name, project_id, positions, created_day
      ,yesterday, last_week, last_month, three_month, six_month, last_year
FROM  (
   SELECT k.name, pr.keyword_id, pr.project_id, pr.created_at::date AS created_day
        , array_agg(DISTINCT pr.position) AS positions
   FROM   project_reports  pr
   JOIN   keywords         k  ON k.id = pr.keyword_id
   GROUP  BY k.id, pr.project_id, created_day -- k.id being the pk
   ) sub
LEFT JOIN crosstab(
    $$SELECT ARRAY[keyword_id, project_id], created_at::date, array_agg(DISTINCT position)
      FROM   project_reports
      WHERE  created_at::date IN (
                    current_date - 1
                  , current_date - 7
                  , (now() - interval '1 mon')::date
                  , (now() - interval '3 mon')::date
                  , (now() - interval '6 mon')::date
                  , (now() - interval '12 mon')::date
                  )
      GROUP  BY 1,2
      ORDER  BY 1,2$$
   ,$$VALUES 
       (current_date - 1)
      ,(current_date - 7)
      ,((now() - interval '1 mon')::date)
      ,((now() - interval '3 mon')::date)
      ,((now() - interval '6 mon')::date)
      ,((now() - interval '12 mon')::date)$$
   ) AS t (
        kp int[]
      , yesterday   int[]
      , last_week   int[]
      , last_month  int[]
      , three_month int[]
      , six_month   int[]
      , last_year   int[]
      ) ON t.kp[1] = keyword_id
       AND t.kp[2] = project_id;

To break it down, a simple demo of what I am doing in the crosstab:

SELECT * FROM crosstab(
 $$SELECT *
   FROM  (
      VALUES
        ('{1,2}'::int[], current_date - 1, '{1,2}'::int[])
      , ('{1,3}'::int[], current_date - 1, '{1,2}'::int[])
      , ('{1,2}', current_date - 7, '{5,6,7}'::int[])
      , ('{1,3}', current_date - 7, '{5,6,7}'::int[])
      , ('{1,2}', (now() - interval '1 mon')::date, '{9,10}'::int[])
       -- 3 mon missing
      , ('{1,2}', (now() - interval '6 mon')::date, '{6,6,6}'::int[])
       -- 12 mon missing
      ) sub
    ORDER BY 1,2$$
   
,$$VALUES 
    (current_date - 1)
   ,(current_date - 7)
   ,((now() - interval '1 mon')::date)
   ,((now() - interval '3 mon')::date)
   ,((now() - interval '6 mon')::date)
   ,((now() - interval '12 mon')::date)$$
) AS t (
     kp int[]
   , yesterday   int[]
   , last_week   int[]
   , last_month  int[]
   , three_month int[]
   , six_month   int[]
   , last_year   int[]
   );

Result:

  kp   | yesterday | last_week | last_month | three_month | six_month | last_year
-------+-----------+-----------+------------+-------------+-----------+-----------
 {1,2} | {1,2}     | {5,6,7}   | {9,10}     |             | {6,6,6}   |
 {1,3} | {1,2}     | {5,6,7}   |            |             |           |

The rest should be obvious.

Comments

I de-coupled the two parts and fused the plethora of correlated subqueries into a single subquery, which should be much faster. Then I use cross tabulation pull individual columns from this query. The syntax for crosstab() is the tricky part.
I am not joining to project_keywords, it seems to be just noise. I am suspecting problems in your relational model.
I seems you should include the project_id in the result to make sense. Else you should not GROUP BY it ...
You can just subtract integer from date.

PostgreSQL – Select Data Based on Records Created_at Column

Note the slightly modified schema of my test in the fiddle.
Using actual primary keys and proper column names instead of id.

Also, you seem to be operating with dates exclusively. So I suggest to convert your timestamp columns to date.

Items 1 and 2

SELECT k.keyword_id
     , k.name
     , pr.project_id
     , COALESCE(min(pr.position), 0) AS pos
     , COALESCE(pr.created_at, now()::date) AS created_at
FROM   keyword             k 
LEFT   JOIN project_report pr USING (keyword_id)
GROUP  BY k.keyword_id, pr.project_id, pr.created_at
ORDER  BY keyword_id, created_at
;

In Postgres 9.1 or later the pk column covers the whole table in GROUP BY.
Use COALESCE to replace possible NULL values.

A guess at item 3

WITH cte AS (
   SELECT k.keyword_id
        , k.name
        , pr.project_id
        , COALESCE(min(pr.position), 0) AS pos
        , COALESCE(pr.created_at, now()::date) AS created_at
   FROM   keyword   k 
   LEFT   JOIN project_report pr USING (keyword_id)
   GROUP  BY k.keyword_id, pr.project_id, pr.created_at
   )
, x AS (
   SELECT DISTINCT ON (keyword_id, project_id) *
   FROM   cte
   ORDER  BY keyword_id, project_id, created_at DESC
   )
SELECT x.*
     , COALESCE(y.pos, 0) AS yesterday_pos 
     , COALESCE(w.pos, 0) AS week_pos 
     , COALESCE(m.pos, 0) AS month_pos 
FROM   x
LEFT   JOIN  cte y ON y.keyword_id = x.keyword_id
                  AND y.project_id = x.project_id
                  AND y.created_at = x.created_at - interval '1 day'
LEFT   JOIN  cte w ON w.keyword_id = x.keyword_id
                  AND w.project_id = x.project_id
                  AND w.created_at = x.created_at - interval '1 week'
LEFT   JOIN  cte m ON m.keyword_id = x.keyword_id
                  AND m.project_id = x.project_id
                  AND m.created_at = x.created_at - interval '1 month'
;

Explain

In CTE cte produce daily aggregates per (k.keyword_id, pr.project_id).
In CTE x pick the latest day per (k.keyword_id, pr.project_id).
In the outer query LEFT JOIN the latest day x to cte multiple times to retrieve past values from the same (k.keyword_id, pr.project_id) for 1 day / week / month earlier.

SQL Fiddle.