PostgreSQL – Optimizing a query that runs through all hours in a day

optimization, performance, postgresql, query-performance

I need help optimizing a query. I'm using PostgreSQL 9.3.4 currently but can upgrade to 9.4 if needed.

I have a table with 60+ million records which looks like this:

                                                          Table "public.snapshots"
   Column   |           Type           |                         Modifiers                          | Storage  | Stats target | Description 
------------+--------------------------+------------------------------------------------------------+----------+--------------+-------------
 id         | integer                  | not null default nextval('snapshots_new_id_seq'::regclass) | plain    |              | 
 camera_id  | integer                  | not null                                                   | plain    |              | 
 created_at | timestamp with time zone | not null                                                   | plain    |              | 
 notes      | text                     |                                                            | extended |              | 
 data       | bytea                    | not null                                                   | extended |              | 
 is_public  | boolean                  | not null default false                                     | plain    |              | 
Indexes:
    "snapshots_new_created_at_camera_id_index" UNIQUE, btree (created_at, camera_id)
Foreign-key constraints:
    "snapshots_new_camera_id_fkey" FOREIGN KEY (camera_id) REFERENCES cameras(id) ON DELETE CASCADE
Has OIDs: no

There can be between 0 and 3600 snapshot records in an hour. For this query I'm interested only in knowing which hours in a given day for a given camera_id have 1 or more snapshot records (the actual count is unimportant).

Currently the application is implemented to execute one query for each hour in the day, like this:

SELECT count(*) AS "count" FROM "snapshots" WHERE (("snapshots"."camera_id" = 4809) AND ("created_at" >= '2015-05-24 23:00:00 UTC') AND ("created_at" <= '2015-05-24 23:59:59 UTC'));
SELECT count(*) AS "count" FROM "snapshots" WHERE (("snapshots"."camera_id" = 4809) AND ("created_at" >= '2015-05-25 00:00:00 UTC') AND ("created_at" <= '2015-05-25 00:59:59 UTC'));
...
SELECT count(*) AS "count" FROM "snapshots" WHERE (("snapshots"."camera_id" = 4809) AND ("created_at" >= '2015-05-25 22:00:00 UTC') AND ("created_at" <= '2015-05-25 22:59:59 UTC'));

Explain analyze for one query: http://explain.depesz.com/s/9tbP

I tried optimizing this and ended up with one query that seemed like what I needed:

SELECT count(*) AS "count" FROM "snapshots" WHERE (("created_at" > '2015-05-06 23:00:00.000000+0000') AND ("created_at" < '2015-05-08 23:00:00.000000+0000')) GROUP BY date_trunc('hour', created_at);

Explain analyze for this query: http://explain.depesz.com/s/cVUK

But this is actually 10 times slower than the above 24 queries combined.

What am I doing wrong? Should I forget COUNT since I'm not interested in the actual count? What would the query look like then?

Edit: Thanks for all the comments and answers, I learned multiple things from you! I wish I could accept all three answers; I chose ypercube's as it seems the most efficient and flexible.

Best Answer

A variation on @Akash's answer that uses the LATERAL syntax and results in a better execution plan (note the Index Only Scan using idx2_snapshots on snapshots in the plan below):

SELECT 
      hour AS start_hour,
      hour + interval '1 hour' AS end_hour
FROM 
      generate_series('2015-01-01'::timestamp, 
                      '2015-01-02 23:00:00'::timestamp, 
                      '1 hour') AS hour 
    , LATERAL
        ( SELECT 1 
          FROM snapshots 
          WHERE camera_id = 3 
            AND created_at >= hour 
            AND created_at < hour + interval '1 hour' 
          LIMIT 1 
        ) AS x ;

Tested in SQLfiddle (query 2). The plan:

Nested Loop  (cost=0.43..4083.77 rows=1000 width=8)
  ->  Function Scan on generate_series hour  (cost=0.00..10.00 rows=1000 width=8)
  ->  Limit  (cost=0.43..4.05 rows=1 width=0)
        ->  Index Only Scan using idx2_snapshots on snapshots  (cost=0.43..7740.98 rows=2139 width=0)
              Index Cond: ((camera_id = 3) AND (created_at >= hour.hour) AND (created_at < (hour.hour + '01:00:00'::interval)))
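
For the Index Only Scan above to be possible, the fiddle's idx2_snapshots needs camera_id as the leading column; the question's existing unique index is on (created_at, camera_id), which cannot serve a per-camera range scan as efficiently. The exact definition is an assumption inferred from the plan, but it would look something like:

    -- Assumed definition of idx2_snapshots (name and column order inferred
    -- from the plan above, not stated in the question). With camera_id first,
    -- Postgres can jump to one camera and range-scan created_at within it;
    -- since the query needs no other columns, an Index Only Scan suffices.
    CREATE INDEX idx2_snapshots ON snapshots (camera_id, created_at);

With only the original (created_at, camera_id) index, the planner would have to scan the full created_at range for each hour and filter on camera_id, touching far more index entries.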