The order you put the conditions in the WHERE
clause doesn't matter. But whether the conditions are =
or <>
or IS NULL
does matter in combination with the order of columns in composite indexes. To clarify:
--------- condition ----------   ------ index can be used ------   -- index search --
column  = @some_constant_value   yes                               index seek
column IS NULL                   yes                               index seek
column  > @some_constant_value   yes, but no further columns       index scan
column  < @some_constant_value   yes, but no further columns       index scan
column >= @some_constant_value   yes, but no further columns       index scan
column <= @some_constant_value   yes, but no further columns       index scan
column <> @some_constant_value   yes, but no further columns       full index scan
column IS NOT NULL               yes, but no further columns       full index scan
columnA = columnB                yes, if both columns are in it    full index scan
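You can see this behaviour with EXPLAIN. The table and index below are hypothetical, just to illustrate how far a composite index can be used once a range condition appears:

```sql
-- Illustrative table with a 3-column composite index
CREATE TABLE t (a INT, b INT, c INT, INDEX a__b__c (a, b, c));

-- a = ? is a seek; b > ? can still use the index,
-- but c = ? can no longer narrow the search:
EXPLAIN SELECT * FROM t WHERE a = 1 AND b > 2 AND c = 3;
-- only the (a, b) prefix narrows the search; c = 3 has to be
-- checked on every entry that matches a = 1 AND b > 2
```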
So, for your first query which has a condition:
WHERE category IS NOT NULL
AND per_nr IS NOT NULL
AND nid IS NOT NULL
whatever order you put the columns in the composite index, a full index scan will be needed. The optimizer, seeing that the selectivity of this operation will be very low (meaning that even after the index scan a very high percentage of the rows will pass the condition, and all those rows will then have to be retrieved from the table), decides to do a full scan of the table instead. And that is probably the best way to execute the query.
For the second query which has a condition:
WHERE per_nr > 47
AND source_category = 'Cereal Grains and Pasta'
AND food.nid <> 205014
AND food.nid IS NOT NULL
we'll get rid of the redundant food.nid IS NOT NULL
(if nid is different from 205014
, it cannot be NULL
anyway) and we'll order the conditions according to the chart above:
WHERE source_category = 'Cereal Grains and Pasta'
AND per_nr > 47
AND nid <> 205014
This gives us a very useful hint about how to create the composite index:
(source_category)                        : good index; it will be used to efficiently
                                           (index seek) narrow the search to rows
                                           with 'Cereal Grains and Pasta' only

(source_category, per_nr)                : even better; it will be used to efficiently
                                           narrow the search to rows with
                                           'Cereal Grains and Pasta' and per_nr > 47.
                                           Then these rows will be read from the table
                                           and the rows that have
                                           (nid = 205014 OR nid IS NULL) will be excluded

(source_category, per_nr) INCLUDE (nid)  : even better than the above index;
                                           it will also exclude the rows that have
                                           (nid = 205014 OR nid IS NULL) using the index

-- !! -- Unfortunately, MySQL has not (yet) implemented `INCLUDE` indexes,
-- !! -- so the only improvement can be

(source_category, per_nr, nid)           : slightly better than the 2nd index;
                                           it will also exclude the rows that have
                                           (nid = 205014 OR nid IS NULL) using the index
So, add the 4th (or the 2nd) index:
ALTER TABLE food
ADD INDEX source_category__per_nr__nid__IX -- choose a name for the index
(source_category, per_nr, nid) ;
if you want the second query to be executed efficiently:
SELECT name, nid, brand, cuisine, type, subcuisine,
course, per_nr, source, source_category
FROM food
WHERE per_nr > 47
AND source_category = 'Cereal Grains and Pasta'
AND nid <> 205014 ;
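To check that the new index is actually chosen, you can prefix the query with EXPLAIN (the exact output columns vary by MySQL version; the index name below assumes the ALTER TABLE above was run as written):

```sql
EXPLAIN
SELECT name, nid, brand, cuisine, type, subcuisine,
       course, per_nr, source, source_category
FROM food
WHERE per_nr > 47
  AND source_category = 'Cereal Grains and Pasta'
  AND nid <> 205014 ;
-- you would expect something like:
--   key  = source_category__per_nr__nid__IX
--   type = range (seek on source_category, range on per_nr)
```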
If it's an option, you could pre-sort the data externally, before database import.
In my test using PostgreSQL 9.6.1, a table with two varchar(50) columns and one real column and 10M records with random values, creating an index on the same data, but pre-sorted, shaved 70% off the index creation time:
db=# create table indexing_test_data (a varchar(50), b varchar(50), c real);
CREATE TABLE
Time: 3,586 ms
db=# insert into indexing_test_data select random()::text, random()::text, random() from generate_series(1, 10000000);
INSERT 0 10000000
Time: 25590,475 ms
db=# select a, b, c into indexing_test_sorted from indexing_test_data order by a, b, c;
SELECT 10000000
Time: 77389,665 ms
db=# create index test_data_idx on indexing_test_data (a, b, c);
CREATE INDEX
Time: 57399,140 ms
db=# create index test_sorted_idx on indexing_test_sorted (a, b, c);
CREATE INDEX
Time: 16219,639 ms
An extra speedup can be had if you can afford to use a database with the C locale and collation, which can then use a feature called abbreviated keys, available since PostgreSQL 9.5. This speeds up sorts by up to 20x, but only works with the C locale, due to buggy locale support in older libraries:
db=# create database testdb lc_collate "C" lc_ctype "C" template template0;
CREATE DATABASE
Time: 429,710 ms
db=# \c testdb
You are now connected to database "testdb" as user "user".
testdb=# create table indexing_test_data (a varchar(50), b varchar(50), c real);
CREATE TABLE
Time: 2,794 ms
testdb=# insert into indexing_test_data select random()::text, random()::text, random() from generate_series(1, 10000000);
INSERT 0 10000000
Time: 25977,964 ms
testdb=# select a, b, c into indexing_test_sorted from indexing_test_data order by a, b, c;
SELECT 10000000
Time: 20794,850 ms
testdb=# create index test_data_idx on indexing_test_data (a, b, c);
CREATE INDEX
Time: 16371,426 ms
testdb=# create index test_sorted_idx on indexing_test_sorted (a, b, c);
CREATE INDEX
Time: 8046,787 ms
Here, indexing the sorted data takes around 50% of the time of the unsorted data; and compared to the first version, which took 57 seconds, you're down to 14% of the time.
Other things, as mentioned in the comments, can help too: setting maintenance_work_mem
appropriately and not running create index concurrently
(which is significantly slower).
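A sketch of those two points in a psql session (the memory value is illustrative; tune it to your available RAM):

```sql
-- Give the index build more sort memory, for this session only
SET maintenance_work_mem = '1GB';

-- Plain CREATE INDEX takes a stronger lock on the table,
-- but is significantly faster than CREATE INDEX CONCURRENTLY
CREATE INDEX test_data_idx ON indexing_test_data (a, b, c);
```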
Best Answer
No, an index has no need in itself to visit the table on queries restricted to IS NOT NULL
, or on queries which can be determined to be not-null. In fact, on queries restricted to the column where the value IS NOT NULL
, you could potentially be doing an INDEX ONLY SCAN
, which I demonstrate here.
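A minimal sketch of the idea (PostgreSQL syntax; the table and column names are made up for illustration): a partial index over only the non-NULL rows, queried so that the table itself never needs to be read:

```sql
-- Partial index covering only the non-NULL rows
CREATE INDEX t_col_not_null_idx ON t (col) WHERE col IS NOT NULL;

-- Selecting only the indexed column can produce an Index Only Scan,
-- provided the visibility map is reasonably current (e.g. after VACUUM)
EXPLAIN SELECT col FROM t WHERE col IS NOT NULL;
```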