MySQL, PostgreSQL, BigQuery – Count Missing Values Across Columns and Join Back to Original Table

google-bigquery, mysql, postgresql

Here is the table:

I want to count the missing values across each row for t1, t2, t3,…, and create another column in the same table with the results as shown in the picture.

I can easily do this in something like Python pandas. But in SQL (specifically BigQuery, PostgreSQL, or MySQL), I can't seem to figure out the syntax. Here is my attempt:

select
  array_agg(id),
  array_agg(date),
  count(t)
from
(
  select id, date, t1 as t from testdata
  union all
  select id, date, t2 as t from testdata 
  union all
  select id, date, t3 as t from testdata   
) as data
group by id, date, t

Any ideas where I am going wrong? I think union all unpivots the wide to long table, but how do I count the missing values within a specific range and join all the results back to the original table?

Best Answer

   SELECT id, date, t1, t2, t3,
     CASE WHEN t1 IS NULL THEN 1 ELSE 0 END 
     + CASE WHEN t2 IS NULL THEN 1 ELSE 0 END 
     + CASE WHEN t3 IS NULL THEN 1 ELSE 0 END AS missing_across_all_cols
   FROM testdata
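
To try the query without a database server, here is a minimal sketch that runs it through Python's built-in sqlite3 module. SQLite is only a stand-in here (the same CASE expression is standard SQL and should carry over to MySQL, PostgreSQL, and BigQuery), and the sample rows and column types are made up for illustration. It also shows one way the UNION ALL unpivot from the question could be made to work: count NULLs per (id, date) with COUNT(*) - COUNT(t) and join that back, assuming (id, date) identifies a row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sample data; the real testdata table is not shown in the question.
conn.execute("CREATE TABLE testdata (id INTEGER, date TEXT, t1 REAL, t2 REAL, t3 REAL)")
conn.executemany(
    "INSERT INTO testdata VALUES (?, ?, ?, ?, ?)",
    [
        (1, "2020-01-01", 1.0, None, 3.0),
        (2, "2020-01-02", None, None, 5.0),
        (3, "2020-01-03", 7.0, 8.0, 9.0),
    ],
)

# Approach from the answer: one CASE per column, summed per row.
case_sql = """
SELECT id, date, t1, t2, t3,
       CASE WHEN t1 IS NULL THEN 1 ELSE 0 END
     + CASE WHEN t2 IS NULL THEN 1 ELSE 0 END
     + CASE WHEN t3 IS NULL THEN 1 ELSE 0 END AS missing_across_all_cols
FROM testdata
ORDER BY id
"""
case_rows = conn.execute(case_sql).fetchall()

# Unpivot variant: COUNT(*) counts every unpivoted row, COUNT(t) skips NULLs,
# so their difference is the number of missing values per (id, date).
unpivot_sql = """
SELECT d.id, d.date, d.t1, d.t2, d.t3, m.missing_across_all_cols
FROM testdata AS d
JOIN (
    SELECT id, date, COUNT(*) - COUNT(t) AS missing_across_all_cols
    FROM (
        SELECT id, date, t1 AS t FROM testdata
        UNION ALL
        SELECT id, date, t2 AS t FROM testdata
        UNION ALL
        SELECT id, date, t3 AS t FROM testdata
    ) AS long_form
    GROUP BY id, date
) AS m ON m.id = d.id AND m.date = d.date
ORDER BY d.id
"""
unpivot_rows = conn.execute(unpivot_sql).fetchall()

missing_counts = [row[-1] for row in case_rows]
print(missing_counts)  # [1, 2, 0]
```

The attempt in the question goes wrong mainly because it groups by t as well, so each distinct value becomes its own group, and because count(t) counts the non-NULL values rather than the missing ones.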

Related Solutions

Postgresql – Postgres: “Pivot” unioned table based on one column

Whether it is faster is something you have to measure yourself. However, doing the pivot on the database side sends less data across the wire, so I would expect it to be faster.

The pivot itself is fairly simple. I put your query's result in a table to make the example simpler.

SQL Fiddle

PostgreSQL 9.1.9 Schema Setup:

CREATE TABLE your_query
    ("rel_id" int, "timestamp" timestamp, "y" varchar(1))
;

INSERT INTO your_query
    ("rel_id", "timestamp", "y")
VALUES
    (1, '2013-01-01 00:00:00', 'a'),
    (1, '2013-01-02 00:00:00', 'b'),
    (1, '2013-01-03 00:00:00', 'c'),
    (1, '2013-01-04 00:00:00', 'd'),
    (2, '2013-01-01 00:00:00', 'e'),
    (2, '2013-01-04 00:00:00', 'f'),
    (2, '2013-01-06 00:00:00', 'g')
;

The first step is to return only one row per date. That is simply done with a GROUP BY:

Query 1:

SELECT timestamp
  FROM your_query
 GROUP BY timestamp
 ORDER BY timestamp

Results:

|                      TIMESTAMP |
|--------------------------------|
| January, 01 2013 00:00:00+0000 |
| January, 02 2013 00:00:00+0000 |
| January, 03 2013 00:00:00+0000 |
| January, 04 2013 00:00:00+0000 |
| January, 06 2013 00:00:00+0000 |

Now we need to pull the "correct" value into each column. For that we combine an aggregate with a CASE. The CASE returns NULL for all rows for which the condition is not met, and the aggregate ignores NULLs. That leaves the one value we are looking for:

Query 2:

SELECT timestamp,
       MAX(CASE WHEN rel_id = 1 THEN y END ) AS "1",
       MAX(CASE WHEN rel_id = 2 THEN y END ) AS "2"
  FROM your_query
 GROUP BY timestamp
 ORDER BY timestamp

Results:

|                      TIMESTAMP |      1 |      2 |
|--------------------------------|--------|--------|
| January, 01 2013 00:00:00+0000 |      a |      e |
| January, 02 2013 00:00:00+0000 |      b | (null) |
| January, 03 2013 00:00:00+0000 |      c | (null) |
| January, 04 2013 00:00:00+0000 |      d |      f |
| January, 06 2013 00:00:00+0000 | (null) |      g |
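
The two queries above can be reproduced locally with a short sketch using Python's built-in sqlite3 module. This is an assumption-laden stand-in: SQLite replaces PostgreSQL, and the timestamps are stored as ISO-8601 text rather than a timestamp type, so the output formatting differs from the fiddle. The MAX(CASE ...) pattern itself is the same.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Same rows as the schema setup above; timestamps kept as ISO-8601 text.
conn.execute('CREATE TABLE your_query ("rel_id" INTEGER, "timestamp" TEXT, "y" TEXT)')
conn.executemany(
    "INSERT INTO your_query VALUES (?, ?, ?)",
    [
        (1, "2013-01-01 00:00:00", "a"),
        (1, "2013-01-02 00:00:00", "b"),
        (1, "2013-01-03 00:00:00", "c"),
        (1, "2013-01-04 00:00:00", "d"),
        (2, "2013-01-01 00:00:00", "e"),
        (2, "2013-01-04 00:00:00", "f"),
        (2, "2013-01-06 00:00:00", "g"),
    ],
)

# Each CASE yields y only for its own rel_id and NULL otherwise;
# MAX ignores NULLs, so one value (or NULL) survives per group.
pivot_sql = """
SELECT "timestamp",
       MAX(CASE WHEN rel_id = 1 THEN y END) AS "1",
       MAX(CASE WHEN rel_id = 2 THEN y END) AS "2"
  FROM your_query
 GROUP BY "timestamp"
 ORDER BY "timestamp"
"""
pivot_rows = conn.execute(pivot_sql).fetchall()
for row in pivot_rows:
    print(row)
```

Python shows the SQL NULLs as None, so the row for 2013-01-06 prints as ('2013-01-06 00:00:00', None, 'g').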

To make this work with your original query just replace your_query in my example with

(
(SELECT rel_id, timestamp, y FROM table_1 AS full_
WHERE full_.timestamp BETWEEN %s AND %s
ORDER BY full_.timestamp)

UNION ALL

(SELECT rel_id, timestamp, y FROM table_2 AS full_
WHERE full_.timestamp BETWEEN %s AND %s
ORDER BY full_.timestamp)

UNION ALL

...
) AS your_query
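
As a rough illustration of that substitution, the sketch below wraps a two-branch UNION ALL in a derived table and pivots it, again using Python's sqlite3 as a stand-in. Several things here are assumptions: the contents of table_1 and table_2 are invented, the date window is arbitrary, sqlite3 uses ? placeholders where the original uses %s, and the inner ORDER BY clauses are dropped because only the outer ORDER BY determines the final order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical contents for table_1 and table_2; only their names come from the answer.
for name in ("table_1", "table_2"):
    conn.execute(f'CREATE TABLE {name} (rel_id INTEGER, "timestamp" TEXT, y TEXT)')
conn.executemany(
    "INSERT INTO table_1 VALUES (?, ?, ?)",
    [
        (1, "2013-01-01 00:00:00", "a"),
        (1, "2013-01-02 00:00:00", "b"),
        (1, "2013-02-01 00:00:00", "z"),  # outside the window, filtered out
    ],
)
conn.executemany(
    "INSERT INTO table_2 VALUES (?, ?, ?)",
    [
        (2, "2013-01-01 00:00:00", "e"),
        (2, "2013-01-06 00:00:00", "g"),
    ],
)

# The UNION ALL takes the place of your_query as a derived table.
sql = """
SELECT "timestamp",
       MAX(CASE WHEN rel_id = 1 THEN y END) AS "1",
       MAX(CASE WHEN rel_id = 2 THEN y END) AS "2"
  FROM (
        SELECT rel_id, "timestamp", y FROM table_1
         WHERE "timestamp" BETWEEN ? AND ?
        UNION ALL
        SELECT rel_id, "timestamp", y FROM table_2
         WHERE "timestamp" BETWEEN ? AND ?
       ) AS your_query
 GROUP BY "timestamp"
 ORDER BY "timestamp"
"""
start, end = "2013-01-01 00:00:00", "2013-01-31 23:59:59"
# One (start, end) pair per branch of the UNION ALL.
rows = conn.execute(sql, (start, end, start, end)).fetchall()
for row in rows:
    print(row)
```

Passing the bounds as parameters rather than formatting them into the SQL string keeps the query safe from injection; each branch needs its own pair of values.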

Postgresql – Join query result of 2 different tables based on common column

You can do it with the following query, passing (DAYS_AGO_7_CONFIG, YESTERDAY) as the two %s parameters:

SELECT   u.id, u.given_name, u.family_name, u.email, COUNT(1) as num
FROM     users u JOIN tasks t
          ON (u.id=t.user_id)
WHERE    t.created_at BETWEEN %s and %s
GROUP BY u.id, u.given_name, u.family_name, u.email

To address the question in the comments, you can add a HAVING clause after GROUP BY to limit the results to those where num is greater than 5:

HAVING   COUNT(1)>5
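
A small runnable sketch of the join, the grouping, and the HAVING filter follows, using Python's sqlite3 in place of PostgreSQL. The users and tasks rows, the column types, and the date window are all invented for illustration, and ? placeholders stand in for %s.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema and rows; only the column names come from the query above.
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, given_name TEXT, family_name TEXT, email TEXT)"
)
conn.execute(
    "CREATE TABLE tasks (id INTEGER PRIMARY KEY, user_id INTEGER, created_at TEXT)"
)
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?, ?)",
    [
        (1, "Ada", "Lovelace", "ada@example.com"),
        (2, "Alan", "Turing", "alan@example.com"),
    ],
)
tasks = [(1, f"2024-03-0{day}") for day in range(1, 7)]  # six tasks for user 1
tasks += [(2, "2024-03-02"), (2, "2024-03-03")]          # two tasks for user 2
conn.executemany("INSERT INTO tasks (user_id, created_at) VALUES (?, ?)", tasks)

base_sql = """
SELECT   u.id, u.given_name, u.family_name, u.email, COUNT(1) AS num
FROM     users u JOIN tasks t
          ON (u.id = t.user_id)
WHERE    t.created_at BETWEEN ? AND ?
GROUP BY u.id, u.given_name, u.family_name, u.email
"""
window = ("2024-03-01", "2024-03-07")

# Without HAVING: one row per user with the number of tasks in the window.
all_rows = conn.execute(base_sql + " ORDER BY u.id", window).fetchall()

# With HAVING: the filter is applied after grouping, on the aggregate itself.
busy_rows = conn.execute(
    base_sql + " HAVING COUNT(1) > 5 ORDER BY u.id", window
).fetchall()

print(all_rows)
print(busy_rows)
```

HAVING is needed here because WHERE is evaluated before the rows are grouped, so it cannot see the per-user count.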
