The order you put the conditions in the WHERE
clause doesn't matter. But whether the conditions are =
or <>
or IS NULL
does matter in combination with the order of columns in composite indexes. To clarify:
--------- condition ----------   ------ index can be used ------   -- index search --
column  = @some_constant_value   yes                               index seek
column IS NULL                   yes                               index seek
column  > @some_constant_value   yes, but no further columns       index scan
column  < @some_constant_value   yes, but no further columns       index scan
column >= @some_constant_value   yes, but no further columns       index scan
column <= @some_constant_value   yes, but no further columns       index scan
column <> @some_constant_value   yes, but no further columns       full index scan
column IS NOT NULL               yes, but no further columns       full index scan
columnA = columnB                yes, if both columns are in it    full index scan
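You can see this behaviour with EXPLAIN. The table and index below are hypothetical, just to illustrate how far a composite index can be used once a range condition appears:

```sql
-- Illustrative table with a 3-column composite index
CREATE TABLE t (a INT, b INT, c INT, INDEX a__b__c (a, b, c));

-- a = ? is a seek; b > ? can still use the index,
-- but c = ? can no longer narrow the search:
EXPLAIN SELECT * FROM t WHERE a = 1 AND b > 2 AND c = 3;
-- only the (a, b) prefix narrows the search; c = 3 has to be
-- checked on every entry that matches a = 1 AND b > 2
```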
So, for your first query which has a condition:
WHERE category IS NOT NULL
AND per_nr IS NOT NULL
AND nid IS NOT NULL
whatever order you put the columns in the composite index, a full index scan will be needed. The optimizer, seeing that the selectivity of this operation will be very low (meaning that even after the index scan a very high percentage of the rows will pass the condition, and all those rows will then have to be retrieved from the table), decides to do a full scan of the table instead. And that is probably the best way to execute the query.
For the second query which has a condition:
WHERE per_nr > 47
AND source_category = 'Cereal Grains and Pasta'
AND food.nid <> 205014
AND food.nid IS NOT NULL
we'll get rid of the redundant food.nid IS NOT NULL
(if nid is different from 205014
, it cannot be NULL
anyway) and we'll order the conditions according to the chart above:
WHERE source_category = 'Cereal Grains and Pasta'
AND per_nr > 47
AND nid <> 205014
This gives us a very useful hint about how to create the composite index:
(source_category)                        : good index; it will be used to efficiently
                                           (index seek) narrow the search to rows
                                           with 'Cereal Grains and Pasta' only

(source_category, per_nr)                : even better; it will be used to efficiently
                                           narrow the search to rows with
                                           'Cereal Grains and Pasta' and per_nr > 47.
                                           Then these rows will be read from the table
                                           and the rows that have
                                           (nid = 205014 OR nid IS NULL) will be excluded

(source_category, per_nr) INCLUDE (nid)  : even better than the above index;
                                           it will also exclude the rows that have
                                           (nid = 205014 OR nid IS NULL) using the index

-- !! -- Unfortunately, MySQL has not (yet) implemented `INCLUDE` indexes,
-- !! -- so the only improvement can be

(source_category, per_nr, nid)           : slightly better than the 2nd index;
                                           it will also exclude the rows that have
                                           (nid = 205014 OR nid IS NULL) using the index
So, add the 4th (or the 2nd) index:
ALTER TABLE food
ADD INDEX source_category__per_nr__nid__IX -- choose a name for the index
(source_category, per_nr, nid) ;
if you want the second query to be executed efficiently:
SELECT name, nid, brand, cuisine, type, subcuisine,
course, per_nr, source, source_category
FROM food
WHERE per_nr > 47
AND source_category = 'Cereal Grains and Pasta'
AND nid <> 205014 ;
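To check that the new index is actually chosen, you can prefix the query with EXPLAIN (the exact output columns vary by MySQL version; the index name below assumes the ALTER TABLE above was run as written):

```sql
EXPLAIN
SELECT name, nid, brand, cuisine, type, subcuisine,
       course, per_nr, source, source_category
FROM food
WHERE per_nr > 47
  AND source_category = 'Cereal Grains and Pasta'
  AND nid <> 205014 ;
-- you would expect something like:
--   key  = source_category__per_nr__nid__IX
--   type = range (seek on source_category, range on per_nr)
```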
If it's an option, you could pre-sort the data externally, before database import.
In my test using PostgreSQL 9.6.1, a table with two varchar(50) columns and one real column and 10M records with random values, creating an index on the same data, but pre-sorted, shaved 70% off the index creation time:
db=# create table indexing_test_data (a varchar(50), b varchar(50), c real);
CREATE TABLE
Time: 3,586 ms
db=# insert into indexing_test_data select random()::text, random()::text, random() from generate_series(1, 10000000);
INSERT 0 10000000
Time: 25590,475 ms
db=# select a, b, c into indexing_test_sorted from indexing_test_data order by a, b, c;
SELECT 10000000
Time: 77389,665 ms
db=# create index test_data_idx on indexing_test_data (a, b, c);
CREATE INDEX
Time: 57399,140 ms
db=# create index test_sorted_idx on indexing_test_sorted (a, b, c);
CREATE INDEX
Time: 16219,639 ms
An extra speedup can be had if you can afford to use a database with the C locale and collation, which can then use a feature called abbreviated keys, available since PostgreSQL 9.5. This speeds up sorts by up to 20x, but only works with the C locale, due to buggy locale support in older libraries:
db=# create database testdb lc_collate "C" lc_ctype "C" template template0;
CREATE DATABASE
Time: 429,710 ms
db=# \c testdb
You are now connected to database "testdb" as user "user".
testdb=# create table indexing_test_data (a varchar(50), b varchar(50), c real);
CREATE TABLE
Time: 2,794 ms
testdb=# insert into indexing_test_data select random()::text, random()::text, random() from generate_series(1, 10000000);
INSERT 0 10000000
Time: 25977,964 ms
testdb=# select a, b, c into indexing_test_sorted from indexing_test_data order by a, b, c;
SELECT 10000000
Time: 20794,850 ms
testdb=# create index test_data_idx on indexing_test_data (a, b, c);
CREATE INDEX
Time: 16371,426 ms
testdb=# create index test_sorted_idx on indexing_test_sorted (a, b, c);
CREATE INDEX
Time: 8046,787 ms
Here, indexing the sorted data takes around 50% of the time of the unsorted data; and compared to the first version, which took 57 seconds, you're down to 14% of the time.
Other things, as mentioned in the comments, can help too: setting maintenance_work_mem
appropriately and not running create index concurrently
(which is significantly slower).
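A sketch of those two points in a psql session (the memory value is illustrative; tune it to your available RAM):

```sql
-- Give the index build more sort memory, for this session only
SET maintenance_work_mem = '1GB';

-- Plain CREATE INDEX takes a stronger lock on the table,
-- but is significantly faster than CREATE INDEX CONCURRENTLY
CREATE INDEX test_data_idx ON indexing_test_data (a, b, c);
```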
Best Answer
No, an index has no need in itself to visit the table on queries restricted to IS NOT NULL
, or on queries which can be determined to be not-null. In fact, on queries restricted to the column where the value IS NOT NULL
, you could potentially be doing an INDEX ONLY SCAN
, which I demonstrate here.
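A minimal sketch of the idea (PostgreSQL syntax; the table and column names are made up for illustration): a partial index over only the non-NULL rows, queried so that the table itself never needs to be read:

```sql
-- Partial index covering only the non-NULL rows
CREATE INDEX t_col_not_null_idx ON t (col) WHERE col IS NOT NULL;

-- Selecting only the indexed column can produce an Index Only Scan,
-- provided the visibility map is reasonably current (e.g. after VACUUM)
EXPLAIN SELECT col FROM t WHERE col IS NOT NULL;
```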