Postgresql – Generate a table that shows a monthly series and the number of tasks done per month

postgresql

I'm trying to generate a table that will show the number of tasks that were completed in each month.

The table has the following columns:

  • id
  • job_name
  • job_status
  • date_completed

My goal is to query the number of completed jobs per month between two values (e.g. 2021-02-01 [Feb 2021] to 2022-02-28 [Feb. 2021]), with the condition that it will only count those with status value '1' (1 = complete, 0 = ongoing) with the query producing this table:

Months   | Jobs Completed
---------+-----------------------
Feb 2020 | 0
Mar 2020 | 2
Apr 2020 | 0
...
Jan 2022 | 1
---------+------------------------

I managed to generate the series and have values to it, but the said table has the following problems:

  1. The order of the months are not arranged in ascending order (From Feb 2021 – Jan 2022, the generated series was random)
  2. The value of the months (the "jobs_completed") copies the highest value, rather than show their own. (e.g. if there are 3 jobs done in March 2021, and 0 on other months, all of the months will register their "jobs_completed" value as 3)

I have made a db-fiddle to show my problem: https://www.db-fiddle.com/f/gH7C7R6udMucjSATpw8Erh/0

SELECT to_char(generate_series('2021-02-01'::DATE, '2022-01-31'::DATE, '1 month'), 'Mon YYYY') AS month,
            COUNT(tbl.job_status) FILTER(WHERE job_status = 1) as jobs_completed
FROM monthly_tasks tbl
WHERE tbl.date_completed BETWEEN '2021-01-01' AND '2021-12-01'
GROUP BY month;

Thank you in advance!

Best Answer

The order of the months are not arranged in ascending order

To sort by date, the ORDER BY should be applied on the date or timestamp column. If the ORDER BY is applied on the result of to_char(), which is text, then the rows will be sorted alphabetically... and since August starts with an A, it will be first.

So, using the test tables from nbk's answer:

SELECT 
  to_char(year_month,'MON YYYY') as ym_text,
  coalesce(jobs_completed,0) AS jobs_completed
 FROM 
  (SELECT date_trunc('month',generate_series('2021-02-01'::DATE, '2022-01-31'::DATE, '1 month')) AS year_month) m
LEFT JOIN
  (SELECT
    date_trunc('month',date_completed) year_month,
    COUNT(*) as jobs_completed
  FROM monthly_tasks
  WHERE job_status = 1
  AND date_completed >='2021-02-01' AND date_completed < '2022-01-31'
  GROUP BY year_month) counts USING (year_month)
  ORDER BY year_month;

 ym_text  | jobs_completed
----------+----------------
 FEB 2021 |              0
 MAR 2021 |              2
 APR 2021 |              0
 MAY 2021 |              0
 JUN 2021 |              0
 JUL 2021 |              0
 AUG 2021 |              0
 SEP 2021 |              0
 OCT 2021 |              0
 NOV 2021 |              0
 DEC 2021 |              0
 JAN 2022 |              1

Another, muuuch simpler version using a subquery:

SELECT 
  to_char(year_month,'MON YYYY') as ym_text,
  (SELECT count(*) FROM monthly_tasks WHERE job_status=1
   AND date_completed >= year_month 
   AND date_completed < year_month + '1 month'::INTERVAL)
  FROM (SELECT date_trunc('month',generate_series('2021-02-01'::DATE, '2022-01-31'::DATE, '1 month')) AS year_month) m
  ORDER BY year_month;

You can fill the table with test data:

INSERT INTO monthly_tasks (date_completed, job_status) SELECT '2021-01-01'::DATE + '1 hour'::INTERVAL*generate_series(1,24*3650), (random()>0.5)::integer;
CREATE INDEX monthly_tasks_date_status ON monthly_tasks ( date_completed , job_status );
VACUUM ANALYZE monthly_tasks;

...and test query performance with EXPLAIN ANALYZE

First query: 5ms

First query with "date_completed >='2021-02-01' AND date_completed < '2022-01-31'" removed from the JOIN: 45ms

Second query: 1.8ms

Note the query planner isn't smart enough to figure that this returns 12 rows:

generate_series('2021-02-01'::DATE, '2022-01-31'::DATE, '1 month')

but it is smart enough to figure out that this one does:

'2021-02-01'::DATE + '1 MONTH'::INTERVAL*generate_series(0, 11))

...probably because the arguments are integers. So the latter may result in faster queries.