Can PostgreSQL Use Nulls in Indexes?

indexindex-tuningpostgresqlpostgresql-9.1

I've been reading this book which says that

The database assumes that Indexed_Col IS NOT NULL covers too large a range to be useful, so the database will not drive to an index from this condition.

I recognize that the book is more than 10 years old, but it has already proven quite useful–Using instructions gleaned from its pages, I've sped a query up by a factor of ten.

Further, in running EXPLAIN ANALYZE on a SELECT query, I've found that none of my indexes are being used, even when by all rights, they ought to be.

Thus, my question is:

Supposing there is a table which has a column, whose column definition includes "NOT NULL", and that an index exists which covers this column, would this index be used in a query of that table where the columns is part of the query?

Like:

CREATE TABLE my_table(
a varchar NOT NULL
);

CREATE INDEX ix_my_table ON my_table(a);

SELECT a from my_table;

Best Answer

PostgreSQL certainly can use an index for IS NOT NULL. I don't see any query planner assumptions about that condition, either.

If the null fraction for the column (pg_statistic.stanullfrac) is low enough to suggest that the index is usefully selective for the query, PostgreSQL will use an index.

I can't figure out what you're trying to say with:

If this is correct, is my understanding that an index on a column defined as "NOT NULL" not be used in a query which uses that column?

Certainly an index won't get used for an IS NOT NULL condition on a NOT NULL column. It'd always match 100% of rows, so a seqscan will almost always be much faster.

PostgreSQL won't use an index if the index doesn't filter out a large proportion of rows for a query. The only likely exception is when you're asking for a set of columns covered by a single index, in an order matching that of the index. PostgreSQL might do an index-only scan then. E.g. if there's an index on t(a, b, c) and you:

select a, b FROM t ORDER BY a, b, c;

PostgreSQL might use your index, even though no rows are filtered out, because it only has to read the index and can skip reading the heap, avoid doing a sort, etc.

Query

This does what your current query currently does, just simpler and faster:

SELECT p.id, p.note, p.photo, p."createdAt",
  u.id           AS "user.id",
  u.name         AS "user.name",
  h.id           AS "hashtags.id",
  h.count        AS "hashtags.count",
  h.name         AS "hashtags.name",
  h."createdAt"  AS "hashtags.createdAt",
  h."updatedAt"  AS "hashtags.updatedAt",
  h."objectId"   AS "hashtags.objectId",
  h._etl         AS "hashtags._etl",
  ph.id          AS "hashtags.PostHashtag.id",
  ph."createdAt" AS "hashtags.PostHashtag.createdAt",
  ph."updatedAt" AS "hashtags.PostHashtag.updatedAt",
  ph."postId"    AS "hashtags.PostHashtag.postId",
  ph."hashtagId" AS "hashtags.PostHashtag.hashtagId",
  ph."objectId"  AS "hashtags.PostHashtag.objectId",
  ph._etl        AS "hashtags.PostHashtag._etl"
FROM (
    SELECT id, note, photo, "createdAt", "userId"
    FROM   "Posts" p
    WHERE  photo IS NOT NULL
    AND    EXISTS (
        SELECT 1
        FROM   "PostHashtags" ph
        WHERE  ph."postId" = p.id
        )
    ORDER  BY p."createdAt" DESC
    LIMIT  10
   ) p
JOIN   "PostHashtags" ph ON ph."postId" = p.id
JOIN   "Hashtags"     h  ON h.id = ph."hashtagId"
JOIN   "Users"        u  ON u.id = p."userId"
WHERE  h.name = 'dead'
ORDER  BY p."createdAt" DESC;

The EXISTS semi-join should be faster than your subquery construct. I am assuming that the column "PostHashtags".id is the PK and cannot be NULL by itself. Also, if referential integrity is enforced by a FK constraint, no need to join to "Hashtags" in this test.

Indexes

Partial index on `Posts`

CREATE INDEX posts_foo_idx ON "Posts" ("createdAt", id)
WHERE photo IS NOT NULL;

Note the columns: ("createdAt", id). Postgres will the latest posts, I expect an index scan on posts_foo_idx from the top, followed by a test for matching entries in PostHashtags using id with the next index.

UNIQUE index on `PostHashtags`

This time we need the index with "postId" first.

The rest is mostly like in the first answer.

MySQL – Read-Only Engine Type and Indexes

If you solve it by adding a secondary index then imho MyISAM or InnoDB won't matter much, as both use B-Tree indexes.

Append both additional columns to the index: (col3, col1, col2) or (col3, col2, col1) should behave the same for your query.

The reason to add all the columns is that such index covers your query - all data needed to resolve the query are contained inside the index, so it is not needed to read the actual rows from the table. That means the data you want to fetch are grouped together on disk (the index is ordered/clustered by col3) and no random reads to the "main" table are needed.

If you had index on only (col3) then each row may be stored in a different part of the table (depending on your primary key or order of inserts) and it would randomly jump all over the place, incurring IO penalty.

If you have no primary key, then MyISAM would be probably better, as InnoDB would create internal primary key and the table would take more space than needed.

But there is a possibility to solve it without using any secondary key. Use InnoDB table like

create table t (
   col1 varchar(5),
   col2 varchar(5),
   col3 varchar(32) not null,
   col4 smallint unsigned not null auto_increment,
   primary key(col3, col4),
   key(col4)
) engine = innodb;

That way the table will use InnoDB feature of clustering by primary key - effectively forcing the optimal order and grouping of rows described previously. And it will take just a small bit more space for the additional column (2 bytes only, if you are positive there will always be less than 60k rows) and much smaller secondary key to accomodate the auto_increment behavior instead of effectively having 2 copies of the table as with the other version.

Example: http://sqlfiddle.com/#!9/c36cfc/2

Best Answer

Related Solutions

PostgreSQL – Best Indexes for Optimizing Query with JOIN Depth of 2

Query

Indexes

Partial index on Posts

UNIQUE index on PostHashtags

MySQL – Read-Only Engine Type and Indexes

Related Question

Partial index on `Posts`

UNIQUE index on `PostHashtags`