You are right in that your example query would not use that index.
The query planner will consider using an index if:
- all the fields contained in it are referenced in the query, or
- the fields referenced form a leading prefix of the index (some of its fields, counting from the first)
It will not be able to make use of an index that starts with a field the query does not use.
So for your example:
SELECT [id], [name], [customerId], [dateCreated]
FROM Representatives WHERE customerId=1
ORDER BY dateCreated
it would consider indexes such as:
[customerId]
[customerId], [dateCreated]
[customerId], [dateCreated], [name]
but not:
[name], [customerId], [dateCreated]
If it found both [customerId] and [customerId], [dateCreated], [name], its decision to prefer one over the other would depend on the index statistics, which are built from estimates of the distribution of data in the fields. If [customerId], [dateCreated] were defined, it should prefer that over the other two unless you give a specific index hint to the contrary.
In my experience it is also not uncommon to see one index defined for every field, though this is rarely optimal: the extra work needed to update the indexes on insert/update, and the extra space needed to store them, is wasted when half of them may never get used. Unless your DB sees write-heavy loads, though, performance is unlikely to suffer badly even with the excess indexes.
Specific indexes for frequent queries that would otherwise be slow due to table or index scanning are generally a good idea, though don't overdo it, as you could be exchanging one performance issue for another. If you do define [customerId], [dateCreated] as an index, for example, remember that the query planner can use it for queries that would otherwise use an index on just [customerId]. While an index on just [customerId] would be slightly more efficient for those queries than the compound index, that may be outweighed by having two indexes competing for space in RAM instead of one (though if your entire normal working set fits easily into RAM, this extra memory competition may not be an issue).
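The leftmost-prefix rule above can be demonstrated directly. Here is a small sketch using SQLite's EXPLAIN QUERY PLAN (the same rule applies in SQL Server and Postgres); the index names are made up for the demo:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Representatives "
            "(id INTEGER, name TEXT, customerId INTEGER, dateCreated TEXT)")
# an index starting with a field the query does not filter on
con.execute("CREATE INDEX idx_name_first ON Representatives (name, customerId, dateCreated)")
# an index whose leading field matches the WHERE clause
con.execute("CREATE INDEX idx_cust_date ON Representatives (customerId, dateCreated)")

def plan(sql):
    # column 3 of EXPLAIN QUERY PLAN output is the human-readable detail
    return " | ".join(row[3] for row in con.execute("EXPLAIN QUERY PLAN " + sql))

p = plan("SELECT id, name, customerId, dateCreated "
         "FROM Representatives WHERE customerId = 1 ORDER BY dateCreated")
print(p)  # the planner searches via idx_cust_date; idx_name_first is unusable here
```

The planner picks idx_cust_date because its leading column matches the equality predicate (and its second column even satisfies the ORDER BY), while idx_name_first starts with a field the query never references.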
Why?
The expressed requirement that artist and title must match in the same element of the JSON array is not reflected in your query, which finds all rows where at least one element matches the artist and another (possibly a different one) matches the title.
The example data for your first case was inconclusive, since the query cannot fail this way for a single-element array. Your second example demonstrates the case well, though.
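The mismatch can be restated in plain Python, using the two example rows from the jsonb section below (the function names are only for illustration):

```python
import json

rows = {
    1: json.loads('[{"artist": "Simple Plan", "title": "Welcome to My Life"}]'),
    2: json.loads('[{"artist": "Another Artist", "title": "Welcome to My Life"},'
                  ' {"artist": "Simple Plan", "title": "Perfect"}]'),
}

def independent_match(arr, artist, title):
    # what the original query checks: each value may match a *different* element
    return (any(e["artist"] == artist for e in arr)
            and any(e["title"] == title for e in arr))

def same_element_match(arr, artist, title):
    # the intended semantics: both values must match in one element
    return any(e["artist"] == artist and e["title"] == title for e in arr)

a, t = "Simple Plan", "Welcome to My Life"
print([i for i, arr in rows.items() if independent_match(arr, a, t)])   # [1, 2]
print([i for i, arr in rows.items() if same_element_match(arr, a, t)])  # [1]
```

Row 2 is the false positive: its artist and title matches come from two different array elements.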
Solution in Postgres 9.3
There are various ways to fix this. One would be to translate each JSON array element to an SQL array of a composite type consisting of artist and title, and match the whole type as one.
Another would be to keep the indexes you have now (or even a single composite index spanning both expressions: (json2arr(data, 'artist'), json2arr(data, 'title'))). Your current query identifies possible matches; unnest the JSON array data for all identified candidates and check whether both artist and title match on the same element. This may or may not be efficient enough.
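The unnest-and-recheck step can be sketched as follows, with SQLite's json_each standing in for Postgres's json_array_elements (the table and data reuse the example below; this is purely illustrative, and it omits the index-backed pre-filter):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tracks (id INTEGER, data TEXT)")
con.executemany("INSERT INTO tracks VALUES (?, ?)", [
    (1, '[{"artist": "Simple Plan", "title": "Welcome to My Life"}]'),
    (2, '[{"artist": "Another Artist", "title": "Welcome to My Life"},'
        ' {"artist": "Simple Plan", "title": "Perfect"}]'),
])

# unnest each candidate's array and demand that both values hit the SAME element
hits = con.execute("""
    SELECT t.id FROM tracks t
    WHERE EXISTS (
        SELECT 1 FROM json_each(t.data) e
        WHERE json_extract(e.value, '$.artist') = 'Simple Plan'
          AND json_extract(e.value, '$.title')  = 'Welcome to My Life')
""").fetchall()
print(hits)  # [(1,)] -- track 2 no longer slips through
```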
Better use jsonb in Postgres 9.4
I am not going into detail for json / pg 9.3, because the release of jsonb in Postgres 9.4, with its advanced indexing capabilities, has mostly obsoleted the problem. This can be implemented much more simply and efficiently with a native GIN index on a jsonb column. There are various options. To optimize for the presented case:
CREATE TABLE tracks (id serial, data jsonb);
INSERT INTO tracks (id, data) VALUES
(1, '[{"artist": "Simple Plan", "title": "Welcome to My Life"}]')
, (2, '[{"artist": "Another Artist", "title": "Welcome to My Life"},
{"artist": "Simple Plan", "title": "Perfect"}]');
Index:
CREATE INDEX tracks_data_gin_idx ON tracks USING gin (data jsonb_path_ops);
Query:
SELECT * FROM tracks
WHERE data @> '[{"artist": "Simple Plan", "title": "Welcome to My Life"}]';
The order of attributes in the JSON value and insignificant whitespace don't matter for jsonb. I added detailed information on jsonb to the referenced answer on SO.
Best Answer
Depends on the data distribution. I can easily construct examples for both cases (index vs table scan).
Just think about the case when the columns are unique or nearly unique (index), or when all the rows are the same (table scan).
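The two extremes can be reproduced with SQLite's planner once statistics are gathered via ANALYZE (names invented for the demo; the same cost reasoning applies to other engines):

```python
import sqlite3

con = sqlite3.connect(":memory:")

# nearly unique values: an equality predicate is very selective
con.execute("CREATE TABLE uniq (id INTEGER, val INTEGER)")
con.execute("CREATE INDEX uniq_val_idx ON uniq (val)")
con.executemany("INSERT INTO uniq VALUES (?, ?)", [(i, i) for i in range(1000)])

# all rows share one value: the index narrows nothing down
con.execute("CREATE TABLE skew (id INTEGER, val INTEGER)")
con.execute("CREATE INDEX skew_val_idx ON skew (val)")
con.executemany("INSERT INTO skew VALUES (?, ?)", [(i, 1) for i in range(1000)])

con.execute("ANALYZE")  # collect the statistics the planner needs

def plan(sql):
    return " | ".join(row[3] for row in con.execute("EXPLAIN QUERY PLAN " + sql))

p1 = plan("SELECT * FROM uniq WHERE val = 42")
p2 = plan("SELECT * FROM skew WHERE val = 1")
print(p1)  # index search: the predicate matches ~1 row
print(p2)  # full table scan: the predicate matches every row anyway
```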
Technically, your index can be used for the above query. Given the general nature of a column named ID, I would say the index will be chosen, but it is not guaranteed. The predicate DATE_TIME is not null will be processed at the table level, not the index level, even if you make sure that the NULL rows are indexed as well: