The query has a lot of bloat:
- Two GROUP BYs; the second is completely unnecessary.
- Three levels of nesting; the last is also completely unnecessary.
- It uses GROUP_CONCAT() to find all the product_id values that have rows with 'southern' and rows with 'proper'. Not the best way, in my opinion.
- The 1189 AS query_id seems to return redundant information (the same value in all the rows of the result).
- Using UNION ALL instead of OR may be OK for efficiency, but the GROUP BY in the next step is probably the performance killer (grouping 400K rows without the use of any index).
Things that are not a problem:
Using where; Using temporary; Using filesort
This looks bad, but sorting 46 rows is not going to be an issue. Using filesort in the MySQL EXPLAIN output does not mean that a file is used (not a very good naming choice indeed).
One thing you can do without changing the query:
- Add an index on (store_id, entity_id, data_index, weight).
This will basically create a copy of the entire table (but a bit narrower, as the primary index in InnoDB tables uses some extra hidden space for internal purposes). The query will be able to use the index: it will do a partial index scan (only the rows with store_id = 1) and then get all the info needed from the index (no table scan at all). Unfortunately, as the query is written now, it will still need 2 partial index scans (and then all the unnecessary bloat that follows).
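For reference, that suggestion as DDL; a sketch only: the table name is taken from the query below, the index name is made up, and it assumes data_index is a VARCHAR short enough to be indexed in full:

ALTER TABLE activo_advancedsearch_weighted_search
    -- covering index: the query can then be answered from the index alone
    ADD INDEX ix_store_entity_data_weight (store_id, entity_id, data_index, weight);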
If you can change the query though, there are various options to try:
First, we could replace UNION ALL with OR, so we only do one table scan (or one index scan, if we have the index suggested above) and not two. Also remove the extra levels of nesting so there is only one level, and only one GROUP BY, by altering the query logic (removing the GROUP_CONCAT() and using HAVING with counts and sums):
SELECT 1189 AS query_id,
i.entity_id AS product_id,
SUM(CASE WHEN i.data_index LIKE '%southern%' THEN i.weight ELSE 0 END)
+ SUM(CASE WHEN i.data_index LIKE '%proper%' THEN i.weight ELSE 0 END)
AS weight
FROM activo_advancedsearch_weighted_search AS i
WHERE i.store_id = 1
AND ( i.data_index LIKE '%southern%'
OR i.data_index LIKE '%proper%'
)
GROUP BY i.entity_id
HAVING SUM(i.data_index LIKE '%southern%') > 0
AND SUM(i.data_index LIKE '%proper%') > 0
ORDER BY weight DESC;
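This works because in MySQL a boolean expression such as i.data_index LIKE '%southern%' evaluates to 1 or 0, so SUM(...) > 0 in the HAVING clause means "at least one row for this entity_id matched that term".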
The whole issue revolves around CASTing and quirky PostgreSQL syntax.
CASTing strings as DATEs:
It's very simple - your first line of code:
ntest=# SELECT EXTRACT(DOW FROM '2011-07-07');
ERROR: function pg_catalog.date_part(unknown, unknown) is not unique
LINE 1: SELECT EXTRACT(DOW FROM '2011-07-07');
^
HINT: Could not choose a best candidate function. You might need to add explicit type casts.
ntest=#
does not work. This is because, as far as PostgreSQL is concerned, '2011-07-07' is just a string literal, not a DATE. This does work:
ntest=# SELECT EXTRACT(DOW FROM DATE '2011-07-07');
date_part
-----------
4
(1 row)
Note the extra DATE that CASTs the string as such. You can also CAST it this way:
ntest=# SELECT EXTRACT(DOW FROM '2011-07-07'::DATE);
date_part
-----------
4
This double colon (::) is very much a PostgreSQL thing. It's really quite "obvious", once you grasp the PostgreSQL syntax (it had me stumped for a couple of minutes!)
A date field:
ntest=# \d orders;
Table "public.orders"
Column | Type | Modifiers
-------------+---------+-----------
ldate | date |
customer_id | integer |
amount | integer |
So, we have a date field - with records as below:
ntest=# SELECT * FROM orders;
ldate | customer_id | amount
------------+-------------+--------
2009-10-08 | 3 | 3000
2009-10-08 | 3 | 1500
2009-11-20 | 2 | 1560
2008-05-20 | 4 | 2060
(4 rows)
So, to get the "day of week" from a date (note the two FROMs; it doesn't work without the FROM between the brackets):
ntest=# SELECT EXTRACT(DOW FROM ldate) FROM orders;
date_part
-----------
4
4
5
2
(4 rows)
Now, for a timestamp field:
ntest=# \d timestamptest;
Table "public.timestamptest"
Column | Type | Modifiers
--------+-----------------------------+-----------
x | timestamp without time zone |
And:
ntest=# SELECT * FROM timestamptest;
x
----------------------------
2016-04-18 22:52:30.020979
(1 row)
And DOW (again, note the two FROMs):
ntest=# SELECT EXTRACT(DOW FROM x) FROM timestamptest;
date_part
-----------
1
Quirky syntax (the double colon is also OK, if you want to be really explicit), but it works fine once you've mastered it.
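For completeness, the standard SQL CAST() spelling does the same job as the double colon (same date as above, so the same result is expected):
ntest=# SELECT EXTRACT(DOW FROM CAST('2011-07-07' AS DATE));
date_part
-----------
4
(1 row)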
Best Answer
Per Verace's suggestion I created a test table with 10M records. The TL;DR: indexes on datetime functions need to be explicit; any conversion of that datetime column for joining/filtering will not be served by an index.
Table size is 1.7 GB. All of the indexes are 214 MB. From my tests the index on obs_ts is never used (probably because these comparisons involve changing the data type of the timestamp column...). The function-specific indexes are used, so there is definitely a performance/index-size tradeoff.
[Charts: Day of Week, Hour of Day]
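As a sketch of the kind of function-specific (expression) indexes being compared here, assuming PostgreSQL: the table name obs is invented, only the obs_ts column comes from the answer, and obs_ts is assumed to be timestamp without time zone (so that EXTRACT is immutable and therefore indexable):

-- Plain index on the raw timestamp:
CREATE INDEX obs_ts_idx ON obs (obs_ts);

-- Expression indexes matching the functions used for filtering
-- (the extra parentheses mark them as expressions):
CREATE INDEX obs_dow_idx  ON obs ((EXTRACT(DOW  FROM obs_ts)));
CREATE INDEX obs_hour_idx ON obs ((EXTRACT(HOUR FROM obs_ts)));

-- Served by obs_dow_idx, because the predicate repeats the indexed
-- expression exactly:
SELECT count(*) FROM obs WHERE EXTRACT(DOW FROM obs_ts) = 1;

-- NOT served by obs_ts_idx: the column is transformed before the
-- comparison, which is why the plain timestamp index is never used:
SELECT count(*) FROM obs WHERE EXTRACT(HOUR FROM obs_ts) = 9;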