PostgreSQL – Select query for returning different data based on creation time

aggregate pivot postgresql subquery

I have a view that selects data from the table reports and filters it on several dates.

reports table: id, key_id, position, created_at

My select:

pr.key_id,
COALESCE(CASE WHEN date(pr.created_at) = date(current_date - interval '1' day) THEN array_agg(DISTINCT pr.position) ELSE '{0}' END ) AS yesterday,
COALESCE(CASE WHEN date(pr.created_at) = date(current_date - interval '2' day) THEN array_agg(DISTINCT pr.position) ELSE '{0}' END ) AS last_week,
COALESCE(CASE WHEN date(pr.created_at) = date(current_date - interval '3' month) THEN array_agg(DISTINCT pr.position) ELSE '{0}' END ) AS last_month,

Even when I group by key_id and date, the result is not collapsed into one row per key_id.

Example of result I get now:

key_id: 1 yesterday: {1} last_week: {0} last_month: {0} created_at: '2014-04-11'
key_id: 1 yesterday: {0} last_week: {1} last_month: {0} created_at: '2014-04-10'
key_id: 1 yesterday: {0} last_week: {0} last_month: {1} created_at: '2014-04-09'

What I expect:

key_id: 1 yesterday: {1} last_week: {1} last_month: {1} created_at: '2014-04-11'

With this query I get what I want. Is there a better way of doing it?

SELECT
    k.name,
    array_agg(DISTINCT position),
    (SELECT array_agg(DISTINCT position) FROM project_reports prk WHERE prk.project_id = pr.project_id AND prk.keyword_id = pr.keyword_id AND date(prk.created_at) = date(current_date - interval '1' day)) as y_position,
    (SELECT array_agg(DISTINCT position) FROM project_reports prk WHERE prk.project_id = pr.project_id AND prk.keyword_id = pr.keyword_id AND date(prk.created_at) = date(current_date - interval '7' day)) as last_week,
    (SELECT array_agg(DISTINCT position) FROM project_reports prk WHERE prk.project_id = pr.project_id AND prk.keyword_id = pr.keyword_id AND date(prk.created_at) = date(current_date - interval '1' month)) as last_month,
    (SELECT array_agg(DISTINCT position) FROM project_reports prk WHERE prk.project_id = pr.project_id AND prk.keyword_id = pr.keyword_id AND date(prk.created_at) = date(current_date - interval '3' month)) as three_month,
    (SELECT array_agg(DISTINCT position) FROM project_reports prk WHERE prk.project_id = pr.project_id AND prk.keyword_id = pr.keyword_id AND date(prk.created_at) = date(current_date - interval '6' month)) as six_month,
    (SELECT array_agg(DISTINCT position) FROM project_reports prk WHERE prk.project_id = pr.project_id AND prk.keyword_id = pr.keyword_id AND date(prk.created_at) = date(current_date - interval '12' month)) as last_year,
    date(pr.created_at)
FROM project_keywords pk
INNER JOIN keywords k ON pk.keyword_id = k.id
INNER JOIN project_reports pr USING (project_id, keyword_id)
GROUP BY k.name, date(pr.created_at), pr.project_id, pr.keyword_id

Best Answer

If I guessed the missing pieces correctly, the query below, which uses the crosstab() function, should do the job, and much faster than the original (except for small cardinalities, where not much can be gained to begin with).

It requires that you first install the additional module tablefunc (once per database). Detailed instructions are in this related answer on SO:
PostgreSQL Crosstab Query

SELECT name, project_id, positions, created_day
      ,yesterday, last_week, last_month, three_month, six_month, last_year
FROM  (
   SELECT k.name, k.id AS keyword_id, pr.project_id, pr.created_at::date AS created_day
        , array_agg(DISTINCT pr.position) AS positions
   FROM   project_reports  pr
   JOIN   keywords         k  ON k.id = pr.keyword_id
   GROUP  BY k.id, pr.project_id, created_day -- k.id being the pk
   ) sub
LEFT JOIN crosstab(
    $$SELECT ARRAY[keyword_id, project_id], created_at::date, array_agg(DISTINCT position)
      FROM   project_reports
      WHERE  created_at::date IN (
                    current_date - 1
                  , current_date - 7
                  , (now() - interval '1 mon')::date
                  , (now() - interval '3 mon')::date
                  , (now() - interval '6 mon')::date
                  , (now() - interval '12 mon')::date
                  )
      GROUP  BY 1,2
      ORDER  BY 1,2$$
   ,$$VALUES 
       (current_date - 1)
      ,(current_date - 7)
      ,((now() - interval '1 mon')::date)
      ,((now() - interval '3 mon')::date)
      ,((now() - interval '6 mon')::date)
      ,((now() - interval '12 mon')::date)$$
   ) AS t (
        kp int[]
      , yesterday   int[]
      , last_week   int[]
      , last_month  int[]
      , three_month int[]
      , six_month   int[]
      , last_year   int[]
      ) ON t.kp[1] = keyword_id
       AND t.kp[2] = project_id;

To break it down, a simple demo of what I am doing in the crosstab:

SELECT * FROM crosstab(
 $$SELECT *
   FROM  (
      VALUES
        ('{1,2}'::int[], current_date - 1, '{1,2}'::int[])
      , ('{1,3}'::int[], current_date - 1, '{1,2}'::int[])
      , ('{1,2}', current_date - 7, '{5,6,7}'::int[])
      , ('{1,3}', current_date - 7, '{5,6,7}'::int[])
      , ('{1,2}', (now() - interval '1 mon')::date, '{9,10}'::int[])
       -- 3 mon missing
      , ('{1,2}', (now() - interval '6 mon')::date, '{6,6,6}'::int[])
       -- 12 mon missing
      ) sub
    ORDER BY 1,2$$
   
,$$VALUES 
    (current_date - 1)
   ,(current_date - 7)
   ,((now() - interval '1 mon')::date)
   ,((now() - interval '3 mon')::date)
   ,((now() - interval '6 mon')::date)
   ,((now() - interval '12 mon')::date)$$
) AS t (
     kp int[]
   , yesterday   int[]
   , last_week   int[]
   , last_month  int[]
   , three_month int[]
   , six_month   int[]
   , last_year   int[]
   );

Result:

  kp   | yesterday | last_week | last_month | three_month | six_month | last_year
-------+-----------+-----------+------------+-------------+-----------+-----------
 {1,2} | {1,2}     | {5,6,7}   | {9,10}     |             | {6,6,6}   |
 {1,3} | {1,2}     | {5,6,7}   |            |             |           |

The rest should be obvious.
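For readers who find the crosstab() call opaque, here is a minimal Python sketch of what the two-parameter form does with its input, under some assumptions: the function name `crosstab` and the symbolic category labels are made up for illustration (the real query uses dates as categories), and a dictionary groups the rows, whereas the real function relies on the `ORDER BY 1` in the source query. It is not the tablefunc implementation, only the pivoting idea.

```python
def crosstab(rows, categories):
    """Pivot (row_name, category, value) triples into one tuple per row_name.

    Output columns follow `categories` in order; a category with no
    matching input row stays None, like the empty cells in the result above.
    """
    out = {}
    for row_name, category, value in rows:
        if category not in categories:
            continue  # unlisted categories are skipped in this sketch
        cells = out.setdefault(row_name, dict.fromkeys(categories))
        cells[category] = value
    return [(name, *(cells[c] for c in categories)) for name, cells in out.items()]


# Same shape as the demo data: (keyword_id, project_id) is the row name.
rows = [
    ((1, 2), "yesterday", [1, 2]),
    ((1, 2), "last_week", [5, 6, 7]),
    ((1, 2), "last_month", [9, 10]),
    ((1, 2), "six_month", [6, 6, 6]),
    ((1, 3), "yesterday", [1, 2]),
    ((1, 3), "last_week", [5, 6, 7]),
]
categories = ["yesterday", "last_week", "last_month",
              "three_month", "six_month", "last_year"]

result = crosstab(rows, categories)
for line in result:
    print(line)
```

The second argument matters: because the category list is fixed up front, a missing "three_month" value leaves a gap in its own column instead of shifting later values to the left.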

Comments

I de-coupled the two parts and fused the plethora of correlated subqueries into a single subquery, which should be much faster. Then I use cross tabulation to pull individual columns from this query. The syntax for crosstab() is the tricky part.
I am not joining to project_keywords; it seems to be just noise. I suspect problems in your relational model.
It seems you should include the project_id in the result for it to make sense. Otherwise you should not GROUP BY it ...
You can just subtract an integer from a date.

Items 1 and 2

SELECT k.keyword_id
     , k.name
     , pr.project_id
     , COALESCE(min(pr.position), 0) AS pos
     , COALESCE(pr.created_at, now()::date) AS created_at
FROM   keyword             k 
LEFT   JOIN project_report pr USING (keyword_id)
GROUP  BY k.keyword_id, pr.project_id, pr.created_at
ORDER  BY keyword_id, created_at
;

In Postgres 9.1 or later the pk column covers the whole table in GROUP BY.
Use COALESCE to replace possible NULL values.

A guess at item 3

WITH cte AS (
   SELECT k.keyword_id
        , k.name
        , pr.project_id
        , COALESCE(min(pr.position), 0) AS pos
        , COALESCE(pr.created_at, now()::date) AS created_at
   FROM   keyword   k 
   LEFT   JOIN project_report pr USING (keyword_id)
   GROUP  BY k.keyword_id, pr.project_id, pr.created_at
   )
, x AS (
   SELECT DISTINCT ON (keyword_id, project_id) *
   FROM   cte
   ORDER  BY keyword_id, project_id, created_at DESC
   )
SELECT x.*
     , COALESCE(y.pos, 0) AS yesterday_pos 
     , COALESCE(w.pos, 0) AS week_pos 
     , COALESCE(m.pos, 0) AS month_pos 
FROM   x
LEFT   JOIN  cte y ON y.keyword_id = x.keyword_id
                  AND y.project_id = x.project_id
                  AND y.created_at = x.created_at - interval '1 day'
LEFT   JOIN  cte w ON w.keyword_id = x.keyword_id
                  AND w.project_id = x.project_id
                  AND w.created_at = x.created_at - interval '1 week'
LEFT   JOIN  cte m ON m.keyword_id = x.keyword_id
                  AND m.project_id = x.project_id
                  AND m.created_at = x.created_at - interval '1 month'
;

Explanation

The CTE cte produces daily aggregates per (k.keyword_id, pr.project_id).
The CTE x picks the latest day per (k.keyword_id, pr.project_id).
The outer query LEFT JOINs the latest day x to cte multiple times to retrieve past values from the same (k.keyword_id, pr.project_id) for 1 day / week / month earlier.

SQL Fiddle.
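The three steps can be tried out without a Postgres server. The following is a rough sketch using Python's built-in sqlite3 module, with several assumptions that are not in the answer: the sample rows are invented, the keyword table is left out, created_at is stored as ISO text, SQLite's `date(x, '-1 day')` replaces the interval subtraction, and a correlated `max()` replaces `DISTINCT ON`, which SQLite does not have.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE project_report (keyword_id INT, project_id INT, position INT, created_at TEXT);
INSERT INTO project_report VALUES
  (1, 1, 5, '2014-04-11'),
  (1, 1, 7, '2014-04-11'),
  (1, 1, 4, '2014-04-10'),
  (1, 1, 9, '2014-04-04'),
  (2, 1, 3, '2014-04-11');
""")

query = """
WITH cte AS (            -- step 1: daily aggregate per (keyword_id, project_id)
   SELECT keyword_id, project_id, created_at, min(position) AS pos
   FROM   project_report
   GROUP  BY keyword_id, project_id, created_at
   )
, x AS (                 -- step 2: latest day per (keyword_id, project_id)
   SELECT c.*
   FROM   cte c
   WHERE  c.created_at = (SELECT max(created_at) FROM cte
                          WHERE  keyword_id = c.keyword_id
                          AND    project_id = c.project_id)
   )
SELECT x.keyword_id, x.project_id, x.created_at, x.pos
     , COALESCE(y.pos, 0) AS yesterday_pos
     , COALESCE(w.pos, 0) AS week_pos
FROM   x                 -- step 3: join the latest day back to earlier days
LEFT   JOIN cte y ON y.keyword_id = x.keyword_id
                 AND y.project_id = x.project_id
                 AND y.created_at = date(x.created_at, '-1 day')
LEFT   JOIN cte w ON w.keyword_id = x.keyword_id
                 AND w.project_id = x.project_id
                 AND w.created_at = date(x.created_at, '-7 days')
ORDER  BY x.keyword_id, x.project_id
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Keyword 2 has no earlier reports, so both LEFT JOINs find nothing and COALESCE turns the missing values into 0.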

MySQL – Query monthly having group_concat and group by need help

Terminology and Methodology

This kind of transformation – rows to columns – is called pivoting. It is typical to pivot data and aggregate it at the same time, as seems to be the requirement in your case too. In SQL you can do both operations as a single logical step. Some SQL products even offer special syntactic extensions for pivoting, but there is a way to do it using more generic syntax, one that is supported by at least every major RDBMS, including MySQL.

The method is called conditional aggregation, and you almost nailed it. The conditional, implemented as a CASE expression in your query, is supposed to go inside the aggregate function, and the criterion that the conditional is checking on (MONTH(e.date_start) in your case) needs to be excluded from the GROUP BY.

So, instead of

SELECT
  CASE WHEN MONTH(e.date_start) = 1 THEN GROUP_CONCAT(...) END,
  ...
FROM
  ...
GROUP BY
  MONTH(e.date_start),
  client

it should be

SELECT
  GROUP_CONCAT(CASE WHEN MONTH(e.date_start) = 1 THEN ... END),
  ...
FROM
  ...
GROUP BY
  client

The exclusion part may seem counter-intuitive – after all, you are intending to get monthly data. However, you should keep in mind that in SQL you are grouping rows. In your case, one row is one client – therefore, grouping should be by client only. You could say that monthly grouping is implicit, as it is implemented through conditional aggregation only.
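To see the effect of the GROUP BY choice on a toy data set, here is a small sketch using Python's built-in sqlite3 module, whose `group_concat` behaves much like MySQL's for this purpose. The table layout (a plain `month` column instead of `MONTH(e.date_start)`, the client name stored directly in `event`) and the sample rows are assumptions made to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE event (id INTEGER PRIMARY KEY, client TEXT, month INT);
INSERT INTO event (id, client, month) VALUES
  (1, 'acme', 1), (2, 'acme', 1), (3, 'acme', 2), (4, 'zenith', 2);
""")

# Grouping by client AND month: one row per (client, month), so nothing is pivoted.
per_month = conn.execute("""
SELECT client, month, group_concat(id)
FROM   event
GROUP  BY client, month
""").fetchall()

# Conditional aggregation: the CASE sits inside the aggregate, GROUP BY is client only.
pivot = conn.execute("""
SELECT client
     , group_concat(CASE WHEN month = 1 THEN id END) AS jan
     , group_concat(CASE WHEN month = 2 THEN id END) AS feb
FROM   event
GROUP  BY client
ORDER  BY client
""").fetchall()

print(len(per_month), "rows when grouping by client and month")
for row in pivot:
    print(row)
```

A CASE without ELSE yields NULL for non-matching rows, and the aggregate skips NULLs, which is why each column only collects the ids of its own month. A client with no events in a month gets NULL in that column.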

Anyway, what about the last row? The last row is special, and not just because it is a rollup row, thus representing aggregated data over the entire set. In my view, it is special more because it contains entirely different data: counts instead of concatenated strings.

Based on that fact, it seems natural to me to consider a distinct logical step – a separate SELECT – for getting the last row's results. The two result sets would then be combined into one with the help of a UNION ALL operator. This approach would make the logic clear, in my opinion: different kinds of data in the output would be accounted for by different legs of the query. And clear logic ultimately means ease of maintenance.

Solution

So, taking into account all of the above, the complete query could look like this:

SELECT
  c.name AS client,
  GROUP_CONCAT(CASE MONTH(e.date_start) WHEN  1 THEN CONCAT(e.id, '-', RIGHT(DATE(e.date_start), 5)) END SEPARATOR ',') AS jan,
  GROUP_CONCAT(CASE MONTH(e.date_start) WHEN  2 THEN CONCAT(e.id, '-', RIGHT(DATE(e.date_start), 5)) END SEPARATOR ',') AS feb,
  GROUP_CONCAT(CASE MONTH(e.date_start) WHEN  3 THEN CONCAT(e.id, '-', RIGHT(DATE(e.date_start), 5)) END SEPARATOR ',') AS mar,
  GROUP_CONCAT(CASE MONTH(e.date_start) WHEN  4 THEN CONCAT(e.id, '-', RIGHT(DATE(e.date_start), 5)) END SEPARATOR ',') AS apr,
  GROUP_CONCAT(CASE MONTH(e.date_start) WHEN  5 THEN CONCAT(e.id, '-', RIGHT(DATE(e.date_start), 5)) END SEPARATOR ',') AS may,
  GROUP_CONCAT(CASE MONTH(e.date_start) WHEN  6 THEN CONCAT(e.id, '-', RIGHT(DATE(e.date_start), 5)) END SEPARATOR ',') AS jun,
  GROUP_CONCAT(CASE MONTH(e.date_start) WHEN  7 THEN CONCAT(e.id, '-', RIGHT(DATE(e.date_start), 5)) END SEPARATOR ',') AS jul,
  GROUP_CONCAT(CASE MONTH(e.date_start) WHEN  8 THEN CONCAT(e.id, '-', RIGHT(DATE(e.date_start), 5)) END SEPARATOR ',') AS aug,
  GROUP_CONCAT(CASE MONTH(e.date_start) WHEN  9 THEN CONCAT(e.id, '-', RIGHT(DATE(e.date_start), 5)) END SEPARATOR ',') AS sep,
  GROUP_CONCAT(CASE MONTH(e.date_start) WHEN 10 THEN CONCAT(e.id, '-', RIGHT(DATE(e.date_start), 5)) END SEPARATOR ',') AS oct,
  GROUP_CONCAT(CASE MONTH(e.date_start) WHEN 11 THEN CONCAT(e.id, '-', RIGHT(DATE(e.date_start), 5)) END SEPARATOR ',') AS nov,
  GROUP_CONCAT(CASE MONTH(e.date_start) WHEN 12 THEN CONCAT(e.id, '-', RIGHT(DATE(e.date_start), 5)) END SEPARATOR ',') AS `dec`
FROM
  event AS e
  INNER JOIN client AS c ON e.client_id = c.id
GROUP BY
  c.name

UNION ALL

SELECT
  NULL,
  COUNT(MONTH(e.date_start) =  1 OR NULL),
  COUNT(MONTH(e.date_start) =  2 OR NULL),
  COUNT(MONTH(e.date_start) =  3 OR NULL),
  COUNT(MONTH(e.date_start) =  4 OR NULL),
  COUNT(MONTH(e.date_start) =  5 OR NULL),
  COUNT(MONTH(e.date_start) =  6 OR NULL),
  COUNT(MONTH(e.date_start) =  7 OR NULL),
  COUNT(MONTH(e.date_start) =  8 OR NULL),
  COUNT(MONTH(e.date_start) =  9 OR NULL),
  COUNT(MONTH(e.date_start) = 10 OR NULL),
  COUNT(MONTH(e.date_start) = 11 OR NULL),
  COUNT(MONTH(e.date_start) = 12 OR NULL)
FROM
  event AS e

;

Or, perhaps, like this, if we wanted to make it look slightly less cumbersome by eliminating repetition of some code:

SELECT
  client,
  GROUP_CONCAT(CASE month WHEN  1 THEN item END SEPARATOR ',') AS jan,
  GROUP_CONCAT(CASE month WHEN  2 THEN item END SEPARATOR ',') AS feb,
  GROUP_CONCAT(CASE month WHEN  3 THEN item END SEPARATOR ',') AS mar,
  GROUP_CONCAT(CASE month WHEN  4 THEN item END SEPARATOR ',') AS apr,
  GROUP_CONCAT(CASE month WHEN  5 THEN item END SEPARATOR ',') AS may,
  GROUP_CONCAT(CASE month WHEN  6 THEN item END SEPARATOR ',') AS jun,
  GROUP_CONCAT(CASE month WHEN  7 THEN item END SEPARATOR ',') AS jul,
  GROUP_CONCAT(CASE month WHEN  8 THEN item END SEPARATOR ',') AS aug,
  GROUP_CONCAT(CASE month WHEN  9 THEN item END SEPARATOR ',') AS sep,
  GROUP_CONCAT(CASE month WHEN 10 THEN item END SEPARATOR ',') AS oct,
  GROUP_CONCAT(CASE month WHEN 11 THEN item END SEPARATOR ',') AS nov,
  GROUP_CONCAT(CASE month WHEN 12 THEN item END SEPARATOR ',') AS `dec`
FROM
  (
    SELECT
      c.name AS client,
      MONTH(e.date_start) AS month,
      CONCAT(e.id, '-', RIGHT(DATE(e.date_start), 5)) AS item
    FROM
      event AS e
      INNER JOIN client AS c ON e.client_id = c.id
  ) AS derived
GROUP BY
  client

UNION ALL

SELECT
  NULL,
  COUNT(month =  1 OR NULL),
  COUNT(month =  2 OR NULL),
  COUNT(month =  3 OR NULL),
  COUNT(month =  4 OR NULL),
  COUNT(month =  5 OR NULL),
  COUNT(month =  6 OR NULL),
  COUNT(month =  7 OR NULL),
  COUNT(month =  8 OR NULL),
  COUNT(month =  9 OR NULL),
  COUNT(month = 10 OR NULL),
  COUNT(month = 11 OR NULL),
  COUNT(month = 12 OR NULL)
FROM
  (
    SELECT
      MONTH(e.date_start) AS month
    FROM
      event AS e
  ) AS derived

;

If you are not very familiar with the A = B OR NULL formula, think of it as an alternative to CASE WHEN A = B THEN 1 ELSE NULL END. For details on how it really works I refer you to this Stack Overflow question:

Why do I need “OR NULL” in MySQL when counting rows with a condition
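A quick way to convince yourself is to compare the variants side by side. The sketch below uses Python's built-in sqlite3 module rather than MySQL; SQLite evaluates `=` and `OR` with the same three-valued logic, but the table `t` and its rows are invented for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (month INT);
INSERT INTO t VALUES (1), (1), (2), (NULL);
""")

row = conn.execute("""
SELECT COUNT(month = 1)                       -- counts every non-NULL result, true or false
     , COUNT(month = 1 OR NULL)               -- false OR NULL is NULL, so only matches count
     , COUNT(CASE WHEN month = 1 THEN 1 END)  -- the more explicit equivalent
FROM   t
""").fetchone()

print(row)
```

Without `OR NULL`, the row with month 2 produces a false (0), which is still a non-NULL value and therefore gets counted. Turning false into NULL is what makes COUNT ignore it.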

Alternative

All that nice talk about clear logic and maintainability notwithstanding, you might still like to be able to implement the query as a single SELECT. Even though we have managed to simplify the initial version by reducing code repetition, the MONTH(date_start) expression still has to be specified twice across the query, because each SELECT leg needs it, so why not try eliminating that repetition too? And if that is not enough of a reason, the alternative solution might turn out faster, maybe even noticeably. And maybe the resulting query would not look too ugly. In the end, it would just be nice to have a choice, plain and simple.

So, how would it be possible to rewrite the query using WITH ROLLUP, so that both the client details and the rollup row would be produced by the same SELECT statement (without any UNION ALL kind of cheating)?

Well, you can use the former solution as a prototype. One part of that query performs group concatenation over clients. The other part counts rows over the entire set. Now, if you want to have a single-part query, the single part must do both operations at both levels.

Which kind of information to show at which level should then be determined by another set of conditionals.

Taking the above-mentioned points into consideration, here is my attempt at a single-step query:

SELECT
  client,
  IF(client IS NULL, COUNT(month =  1 OR NULL), GROUP_CONCAT(CASE month WHEN  1 THEN item END SEPARATOR ',')) AS jan,
  IF(client IS NULL, COUNT(month =  2 OR NULL), GROUP_CONCAT(CASE month WHEN  2 THEN item END SEPARATOR ',')) AS feb,
  IF(client IS NULL, COUNT(month =  3 OR NULL), GROUP_CONCAT(CASE month WHEN  3 THEN item END SEPARATOR ',')) AS mar,
  IF(client IS NULL, COUNT(month =  4 OR NULL), GROUP_CONCAT(CASE month WHEN  4 THEN item END SEPARATOR ',')) AS apr,
  IF(client IS NULL, COUNT(month =  5 OR NULL), GROUP_CONCAT(CASE month WHEN  5 THEN item END SEPARATOR ',')) AS may,
  IF(client IS NULL, COUNT(month =  6 OR NULL), GROUP_CONCAT(CASE month WHEN  6 THEN item END SEPARATOR ',')) AS jun,
  IF(client IS NULL, COUNT(month =  7 OR NULL), GROUP_CONCAT(CASE month WHEN  7 THEN item END SEPARATOR ',')) AS jul,
  IF(client IS NULL, COUNT(month =  8 OR NULL), GROUP_CONCAT(CASE month WHEN  8 THEN item END SEPARATOR ',')) AS aug,
  IF(client IS NULL, COUNT(month =  9 OR NULL), GROUP_CONCAT(CASE month WHEN  9 THEN item END SEPARATOR ',')) AS sep,
  IF(client IS NULL, COUNT(month = 10 OR NULL), GROUP_CONCAT(CASE month WHEN 10 THEN item END SEPARATOR ',')) AS oct,
  IF(client IS NULL, COUNT(month = 11 OR NULL), GROUP_CONCAT(CASE month WHEN 11 THEN item END SEPARATOR ',')) AS nov,
  IF(client IS NULL, COUNT(month = 12 OR NULL), GROUP_CONCAT(CASE month WHEN 12 THEN item END SEPARATOR ',')) AS `dec`
FROM
  (
    SELECT
      c.name AS client,
      MONTH(e.date_start) AS month,
      CONCAT(e.id, '-', RIGHT(DATE(e.date_start), 5)) AS item
    FROM
      event AS e
      INNER JOIN client AS c ON e.client_id = c.id
  ) AS derived
GROUP BY
  client
  WITH ROLLUP
;

As you can see, the query calculates the COUNTs and the GROUP_CONCATs at both the client level and the whole-set level. But each pair of results is put inside an IF function, so that ultimately only one or the other result is returned in each column.

The condition to check is client IS NULL. If client is null, the current group represents the entire set, and in that case each IF function chooses the COUNT result. When client is not null, we are at the client level and each group of rows represents a specific client. In that case the GROUP_CONCAT result is returned, as required, because for clients we must show concatenated strings.

Both solutions can be found at dbfiddle.uk.

Additional remarks

In my explanation above I tried to focus on the solutions and how they work. To avoid distractions, I allowed certain anti-patterns in my code that are worth mentioning.

Reliance on precedence rules when implicitly converting data.

The results of the functions COUNT() and GROUP_CONCAT() are different. One returns an integer, the other a string. When you are trying to put values of such different types into a single column, the server must decide which type to convert to which other type. It is good to know these rules but you should never rely on them in production code. That is just bad practice.

In the queries above, a COUNT and a GROUP_CONCAT are either in the same column in different legs of the same query or in the same conditional that chooses between the two. In each case MySQL needs to apply its rules of type precedence. To avoid that, you can just explicitly cast each COUNT as a string:

```
CAST(COUNT(...) AS char)
```

Missing ORDER BY in GROUP_CONCAT.

If you omit an ORDER BY, you are simply saying that you do not care if one time the query returns the string as A,B,C and the other as B,A,C and later as C,B,A. If you want your results to be predictable, always specify an ORDER BY and always use enough criteria to avoid ties.

In the above queries the lines were simply very long already and I purposefully omitted the ORDER BYs for presentability's sake. The issue can be easily fixed with an ORDER BY like this:

```
ORDER BY item ASC
```

More specifically, in GROUP_CONCAT it would be used like this:

```
GROUP_CONCAT(CASE month WHEN 12 THEN item END ORDER BY item ASC SEPARATOR ',')
```