Postgresql – Index for finding combinations of attributes in JSON array

arrayindexjsonpostgresql

Based on this related question on SO.

This works fine if you only want simple matches. Suppose tracks have both Artist and Title field in the JSON data. So we have something like:

INSERT INTO tracks (id, data)
VALUES (1, '[{"artist": "Simple Plan", "title": "Welcome to My Life"}]');

We create the index like in the original question:

CREATE INDEX tracks_artists_gin_idx ON tracks
USING GIN (json2arr(data, 'artist'));

CREATE INDEX tracks_title_gin_idx ON tracks
USING GIN (json2arr(data, 'title'));

So now we have two fields to match. As you can see, if we perform the original query (with very naive modifications) of:

SELECT *
FROM   tracks
WHERE  '{"ARTIST NAME"}'::text[] <@ (json2arr(data, 'artist'))
AND    '{"TITLE"}'::text[]       <@ (json2arr(data, 'title'))

This will give the wrong answer because the indices of the array of artist and title in the JSON array do not have to match for this query to match something in the JSON. What is the proper way of doing this query so we can get the exact match we need? Does json2arr() need to be changed?

Edit: Why this is wrong

Suppose our table has records like:

INSERT INTO tracks (id, data)  VALUES
  (1, '[{"artist": "Simple Plan",    "title": "Welcome to My Life"}]')
, (2, '[{"artist": "Another Artist", "title": "Welcome to My Life"},
        {"artist": "Simple Plan",    "title": "Perfect"}]');

If you query like:

SELECT *
FROM   tracks
WHERE  '{"Simple Plan"}'::text[]        <@ (json2arr(data, 'artist'))
AND    '{"Welcome to my Life"}'::text[] <@ (json2arr(data, 'title'))

Both records will be matched (both record 1 and 2), even though you really only wanted the first record.

Best Answer

Why?

The expressed requirement that artist and title must match in the same element of the JSON array is not reflected in your query, which finds all rows where at least one element matches the artist and another (possibly a different one) matches the title.

The example data for your first case was inconclusive, since the query cannot fail this way for a single array element. Your second example demonstrates the case well, though.

Solution in Postgres 9.3

There are various ways to fix this. One way would be to translate each json array element to an SQL array of composite type consisting of artist and title and match the whole type as one.

Another way would be to keep indexes like you have now (or even a single composite spanning both expressions: (json2arr(data, 'artist'), json2arr(data, 'title')). Your current query identifies possible matches. Unnest the json array data for all identified candidates and check whether both artist and title match on the same element. May or may not be efficient enough.

Better use jsonb in n Postgres 9.4

I am not going into detail for json / pg 9.3, because the release of jsonb in Postgres 9.4 with advanced indexing capabilities mostly obsoleted the problem. This can be implemented much simpler and more efficient with a native GIN index on a jsonb column. There are various options. To optimize for the presented case:

CREATE TABLE tracks (id serial, data jsonb);
INSERT INTO tracks (id, data)  VALUES
  (1, '[{"artist": "Simple Plan",    "title": "Welcome to My Life"}]')
, (2, '[{"artist": "Another Artist", "title": "Welcome to My Life"},
        {"artist": "Simple Plan",    "title": "Perfect"}]');

Index:

CREATE INDEX tracks_data_gin_idx ON tracks USING gin (data jsonb_path_ops);

Query:

SELECT * FROM tracks
WHERE  data @> '[{"artist": "Simple Plan", "title": "Welcome to My Life"}]';

Sequence of attributes in the JSON value and insignificant whitespace don't matter for jsonb. I added detailed information for jsonb to the referenced answer on SO: