Postgresql – Postgres: Efficient schema for querying with one exact string match and two range matches

postgresql

The table I need to query:

CREATE TABLE regions (
    state text NOT NULL,
    zip_begin text NOT NULL,  -- 9-digit zip code
    zip_end text NOT NULL,
    date_begin date NOT NULL,
    date_end date,

    data ...,
)

There are ~50 states and between 0-5M rows per state. The zip and date ranges might overlap.

The following will be one of the most common queries in my OLTP application:

SELECT data
FROM regions
WHERE state = {state}
  AND {date} BETWEEN date_begin AND date_end
  AND {zip} BETWEEN zip_begin AND zip_end

This query usually yields one row, but may sometimes yield more.

From the Postgres docs, it sounds like a GiST index might do what I need, but I don't really understand how those work.

Best Answer

Don't dismiss the simple btree index until you try it. It surely won't scale well to a trillion rows, but you surely don't have a trillion rows. It might be good enough.

To use an gist index, you would need to treat the ranges explicitly as ranges, not as end points. It would probably be best to reformat your table in that format, but you can instead use expression indexes to reformat on the fly. But then your query will have to be written to match.

Some other complexities are that simple scalars don't have built in gist operators, so to put "state" into a gist index requires you to use the btree_gist extension; and there is not a built-in text range. Rather than creating a text-range, you can just cast the zip to int (which would probably be better done in the table in the first place), so something like this:

create index on regions using gist(
  state,
  daterange(date_begin,date_end,'[]'),
  int4range(zip_begin::int,zip_end::int,'[]')
);

And then the query would look like:

SELECT data
FROM regions
WHERE state = {state}
  AND {date}::date <@ daterange(date_begin,date_end,'[]')
  AND {zip}::int <@ int4range(zip_begin::int,zip_end::int,'[]')

Now I haven't tested this query on the gist index, because gist indexes are so ridiculously slow to build that I don't have one available yet to try it on. It has finished building and I have tested it, and it runs and gives the correct answer with '0' left-padded 5 digit zip codes. With the range sizes I used, it is sometimes faster and some times slower than the btree index on a 10e6 row table, but was never markedly different.