Note the slightly modified schema of my test in the fiddle: actual primary keys and proper column names instead of id.
Also, you seem to be operating with dates exclusively, so I suggest converting your timestamp columns to date.
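A minimal sketch of such a conversion (assuming the column is created_at on project_report, as used in the queries below; the explicit USING clause keeps the cast unambiguous):

ALTER TABLE project_report
  ALTER COLUMN created_at TYPE date
  USING created_at::date;  -- each timestamp is truncated to its calendar day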
Items 1 and 2
SELECT k.keyword_id
     , k.name
     , pr.project_id
     , COALESCE(min(pr.position), 0) AS pos
     , COALESCE(pr.created_at, now()::date) AS created_at
FROM   keyword k
LEFT   JOIN project_report pr USING (keyword_id)
GROUP  BY k.keyword_id, pr.project_id, pr.created_at
ORDER  BY keyword_id, created_at;
- In Postgres 9.1 or later the primary key column covers the whole table in GROUP BY.
- Use COALESCE to replace possible NULL values.
A guess at item 3
WITH cte AS (
   SELECT k.keyword_id
        , k.name
        , pr.project_id
        , COALESCE(min(pr.position), 0) AS pos
        , COALESCE(pr.created_at, now()::date) AS created_at
   FROM   keyword k
   LEFT   JOIN project_report pr USING (keyword_id)
   GROUP  BY k.keyword_id, pr.project_id, pr.created_at
   )
, x AS (
   SELECT DISTINCT ON (keyword_id, project_id) *
   FROM   cte
   ORDER  BY keyword_id, project_id, created_at DESC
   )
SELECT x.*
     , COALESCE(y.pos, 0) AS yesterday_pos
     , COALESCE(w.pos, 0) AS week_pos
     , COALESCE(m.pos, 0) AS month_pos
FROM   x
LEFT   JOIN cte y ON  y.keyword_id = x.keyword_id
                  AND y.project_id = x.project_id
                  AND y.created_at = x.created_at - interval '1 day'
LEFT   JOIN cte w ON  w.keyword_id = x.keyword_id
                  AND w.project_id = x.project_id
                  AND w.created_at = x.created_at - interval '1 week'
LEFT   JOIN cte m ON  m.keyword_id = x.keyword_id
                  AND m.project_id = x.project_id
                  AND m.created_at = x.created_at - interval '1 month';
Explain
- In CTE cte, produce daily aggregates per (k.keyword_id, pr.project_id).
- In CTE x, pick the latest day per (k.keyword_id, pr.project_id).
- In the outer query, LEFT JOIN the latest day x to cte multiple times to retrieve past values from the same (k.keyword_id, pr.project_id) for 1 day / week / month earlier.
SQL Fiddle.
A few thoughts. Consider range partitioning on "timestamp"; it will reduce the amount of work your query has to do. A further optimization might be to calculate and store the aggregates for "closed" partitions. You will of course have to recalculate these when you modify historical information.
As a bonus, it will be much easier to roll out historical data that is no longer needed.
As mentioned, these are just some thoughts and reflections; they may or may not fit your situation. One pitfall may be the number of different intervals you use for reporting: if the larger intervals are not multiples of the partitioning interval, this won't work well.
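A minimal sketch of what that could look like with declarative range partitioning (this assumes Postgres 10 or later and monthly partitions; only id and "timestamp" are taken from your query, the rest of the column list is omitted):

CREATE TABLE market_trades_part (
    id          bigint,
    "timestamp" timestamp NOT NULL
    -- ... remaining columns of market_trades
) PARTITION BY RANGE ("timestamp");

-- one partition per month; aggregates for "closed" months can be precomputed and stored
CREATE TABLE market_trades_2013_08 PARTITION OF market_trades_part
    FOR VALUES FROM ('2013-08-01') TO ('2013-09-01');

-- rolling out data that is no longer needed is then just:
-- DROP TABLE market_trades_2013_08;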
Edit: How is the plan affected by creating a temp table and an index on "timestamp"?
create index X on market_trades ( timestamp );
create temp table T ( time_range tsrange );
insert into T ( time_range )
WITH vals AS (
    SELECT '2013-08-19 0:00'::timestamp AS frame_start,
           '2013-08-26 0:00'::timestamp AS frame_end,
           '17h'::interval AS interval_length
), intervals AS (
    SELECT tsrange(start_time,
                   lead(start_time, 1, frame_end) OVER (ORDER BY start_time NULLS FIRST)) AS time_range
    FROM (
        SELECT generate_series(frame_start, frame_end, interval_length) AS start_time,
               frame_end
        FROM vals
    ) _
    WHERE start_time < frame_end
)
SELECT time_range
FROM intervals i;
create index x on T ( time_range );
analyze t;
SELECT time_range, count(td.id) AS agg
FROM T
LEFT JOIN market_trades td
ON td.timestamp <@ T.time_range
GROUP BY T.time_range
ORDER BY T.time_range;
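To see how the plan changes, you can run the final query under EXPLAIN, e.g.:

EXPLAIN (ANALYZE, BUFFERS)
SELECT time_range, count(td.id) AS agg
FROM T
LEFT JOIN market_trades td
ON td.timestamp <@ T.time_range
GROUP BY T.time_range
ORDER BY T.time_range;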
Best Answer
You can use a CASE expression inside the COUNT aggregate function. COUNT does not count NULL values, which is why this works.
Here is a SQL Fiddle that demonstrates the concept.
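A minimal sketch of the pattern (hypothetical orders table and column names, since the original schema is only in the fiddle):

SELECT status,
       count(*) AS total_rows,
       count(CASE WHEN amount > 100 THEN 1 END) AS big_orders
FROM orders
GROUP BY status;
-- CASE yields NULL when the condition is not met, and COUNT skips NULLs,
-- so big_orders only counts rows with amount > 100.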