PostgreSQL – How to Define a GIN Index Across Two Columns/Arrays

postgresql

Let's say we have a Postgres table contacts, each record having a bunch of labeled email addresses (pairs of labels and emails)– one of which is the "primary".

This is stored like:

id primary key
email text
email_label text
metadata jsonb
- emails array
- email text
- label text

For example, a record might look something like:

{
  id: 1,
  email: 'a@a.com',
  email_label: 'a',
  metadata: {
    emails: [
      {
        email: 'b@b.com',
        label: 'b'
      },
      {
        email: 'c@c.com',
        label: 'c'
      }
    ]
  }
}

Given this storage pattern, we want to be able to find a record by any of its email addresses.

The naive query would look like:

SELECT id
FROM contacts
WHERE
  email = 'my@email.com' OR
  metadata -> 'emails' @> '[{"email": "my@email.com"}]'

Is there any way to create an index that speeds up this operation significantly? It would need to update automatically in response to changes to the records, and ideally index across both the text column and the nested JSONB column.

The specific use case here would be able to do lookups by email address efficiently and quickly, without overhauling this structure or creating a new relational table.

I believe the solution involves using a GIN index and this question mentions jsonb_path_ops but I'm not sure how to wrangle all the pieces together.

Best Answer

Since you want to run the query on metadata->'emails', you need to build the index on that expression:

create index on contacts using gin ((metadata->'emails'));

Alternatively, you could make the index on metadata itself, and re-formulate the query condition like this:

... OR metadata @> '{"emails":[{"email": "my@email.com"}]}'

In either case, you also need an index on the column "email", to satisfy the other branch of the OR condition. Then the two indexes will be combined with a BitmapOr operation.

It is possible you could make one index on both columns using some complicated expression index, but that doesn't really seem worthwhile.

Related Solutions

Postgresql – Unique across tables

To enforce unique email addresses, I would remove all competing email columns and store them in one central email table for all active emails. And another table for deleted emails:

CREATE TABLE users (
  user_id  serial PRIMARY KEY
, username text UNIQUE NOT NULL
, email    text UNIQUE -- FK added below  -- can also be NOT NULL
);

CREATE TABLE email (
  email    text PRIMARY KEY
, user_id  int NOT NULL REFERENCES users ON DELETE CASCADE
, UNIQUE (user_id, email)  -- seems redundant, but required for FK
);

ALTER TABLE users ADD CONSTRAINT users_primary_email_fkey
FOREIGN KEY (user_id, email) REFERENCES email (user_id, email);

CREATE TABLE email_deleted (
  email_id serial PRIMARY KEY
, email    text NOT NULL  -- not necessarily unique
, user_id  int NOT NULL REFERENCES users ON DELETE CASCADE
);

This way:

Active emails are unique, enforced by the PK constraint of email.
Each user can have any number of active and deleted emails, but ...
Each user can only have one primary email.
Every email is always owned by one user and is deleted with the user.
To soft-delete an email (without losing it and its affiliation to its user, move the row from email to email_deleted.
- The primary email of a user cannot be deleted this way, because the primary email must not be deleted.
I designed the FK constraint users_primary_email_fkey to span (user_id, email), which seems redundant at first. But this way the primary email can only be an email that is actually owned by the same user.
Due to the default MATCH SIMPLE behavior of FK constraints, you can still enter a user without primary email, because the FK constraint is not enforced if any of the columns is null.
Details:
- Two-column foreign key constraint only when third column is NOT NULL

The UNIQUE constraint on users.email is redundant for this solution, but it may be useful for other reasons. The automatically created index should come in handy (for instance for the last query in this answer).

The only thing that's not enforced this way is that every user has a primary email. You can do this, too. Add NOT NULL constraint to users.email

UNIQUE (user_id, email) is required for the FK constraint:

How does PostgreSQL enforce the UNIQUE constraint / what type of index does it use?

You have doubtless spotted the circular reference in the above model. Contrary to what one might expect, this just works.

As long as users.email can be NULL, it's trivial:

INSERT user without email.
INSERT email referencing the owning user_id.
UPDATE user to set it's primary email if applicable.

It even works with users.email set to NOT NULL. You have to insert user and email at the same time though:

WITH u AS (
   INSERT INTO users(username, email)
   VALUES ('user_foo', 'foo@mail.com')
   RETURNING email, user_id
   )
INSERT INTO email (email, user_id)
SELECT email, user_id
FROM   u;

IMMEDIATE FK constraints (the default) are checked at the end of each statement. The above is one statement. That's why it works where two separate statements would fail. Detailed explanation:

How to deal with mutually recursive inserts

To get all emails of a user as array, with the primary email first:

SELECT u.*, e.emails
FROM   users u
     , LATERAL (
      SELECT ARRAY (
      SELECT email
      FROM   email
      WHERE  user_id = u.user_id
      ORDER  BY (email <> u.email)  -- sort primary email first
      ) AS emails
   ) e
WHERE  user_id = 1;

You could create a VIEW with this for ease of use.
LATERAL requires Postgres 9.3. use a correlated subquery in pg 9.2:

SELECT *, ARRAY (
             SELECT email
             FROM   email
             WHERE  user_id = u.user_id
             ORDER  BY (email <> u.email)  -- sort primary email first
             ) AS emails
FROM   users u
WHERE  user_id = 1;

To soft-delete an email:

WITH del AS (
   DELETE FROM email
   WHERE  email = 'spam@mail.com'
   RETURNING email, user_id
   )
INSERT INTO email_deleted (email, user_id)
SELECT email, user_id FROM del;

To soft-delete the primary email of a given user:

WITH upd AS (
   UPDATE users u
   SET    email = NULL
   FROM   (SELECT user_id, email FROM users WHERE user_id = 123 FOR UPDATE) old
   WHERE  old.user_id = u.user_id
   AND    u.user_id = 1
   RETURNING old.*
   )
,    del AS (
   DELETE FROM email
   USING  upd
   WHERE  email.email = upd.email
   )
INSERT INTO email_deleted (email, user_id)
SELECT email, user_id FROM upd;

Details:

Return pre-UPDATE Column Values Using SQL Only - PostgreSQL Version

Quick test for all of the above: SQL Fiddle.

PostgreSQL JSONB Array vs Split Rows – Performance Comparison

The optimal DB design always depends on the complete picture.

Generally, there is hardly anything faster than a plain btree index for your simple query. Introducing json or jsonb or even a plain array type in combination with a GIN index will most likely make it slower.

With your original table this multicolumn index with the right sort order should be a game changer for your common query:

CREATE INDEX game_changer ON actions (receiver, time DESC);

This way, Postgres can just pick the top 100 rows from the index directly. Super fast.

Optimizing queries on a range of timestamps (two columns)

Your current indexes receiver_idx and actions_time_idx may lose their purpose.

Next to the perfect index, storage size is an important factor for big tables, so avoiding duplication may be the right idea. But that can be achieved in various ways. Have you considered good old normalization, yet?

CREATE TABLE receiver (
   receiver_id serial PRIMARY KEY
 , receiver    text NOT NULL -- UNIQUE?
);

CREATE TABLE action (  -- I shortened the name to "action"
   action_id   bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
   -- global_sequence bigint NOT NULL DEFAULT nextval('actions_global_sequence_seq'::regclass),  -- ??
   time       timestamptz NOT NULL DEFAULT now(),
   block_num  int NOT NULL,
   tx_id      text NOT NULL,
   contract   text NOT NULL,
   action     text NOT NULL,
   data       jsonb NOT NULL
)

CREATE TABLE receiver_action (
   receiver_id int    REFERENCES receiver
 , action_id   bigint REFERENCES action
 , PRIMARY KEY (receiver_id, action_id)
);

Also note the changed order of columns in table action, saves a couple of bytes per row, which makes a couple of GB for billions of rows.

See:

Your common query changes slightly to:

SELECT a.*
FROM   receiver_action ra
JOIN   action a USING (action_id)
WHERE  ra. receiver_id = (SELECT receiver_id FROM receiver WHERE receiver = 'Alpha')
ORDER  BY a.time DESC
LIMIT  100;

Drawback: it's much harder to make your common query fast now. Related:

Can spatial index help a "range - order by - limit" query

The quick (and slightly dirty) fix: include the time column in table receiver_action redundantly (or move it there).

CREATE TABLE receiver_action (
   receiver_id int    REFERENCES receiver
 , action_id   bigint REFERENCES action
 , time        timestamptz NOT NULL DEFAULT now()  -- !
 , PRIMARY KEY (receiver_id, action_id)
);

Create an index:

CREATE INDEX game_changer ON receiver_action (receiver_id, time DESC) INCLUDE (action_id);

INCLUDE requires Postgres 11 or later. See:

Does a query with a primary key and foreign keys run faster than a query with just primary keys?

And use this query:

SELECT a.*
FROM  (
   SELECT action_id
   FROM   receiver_action
   WHERE  receiver_id = (SELECT receiver_id FROM receiver WHERE receiver = 'Alpha')
   ORDER  BY time DESC
   LIMIT  100
   )
JOIN   action a USING (action_id);

Depending on the exact story behind one set of data may create 3 separate rows more may be possible - even 3 separate columns in table action instead of the n:m implementation and a expression GIN index ... But that's going in too deep. I tap out here.

Best Answer

Related Solutions

Postgresql – Unique across tables

PostgreSQL JSONB Array vs Split Rows – Performance Comparison

Related Question