Postgresql – JSONB Array of Strings (with GIN index) versus Split Rows (B-Tree Index)

database-designindexperformancepostgresqlpostgresql-11postgresql-performance

I have a database which stores receiver to indicate which account the data relates to. This has led to tons of duplication of data, as one set of data may create 3 separate rows, where all column data is the same with the exception of the receiver column. While redesigning the database, I have considered using an array with a GIN index instead of the current B-Tree index on receiver.

Current table definition:

CREATE TABLE public.actions (
    global_sequence bigint NOT NULL DEFAULT nextval('actions_global_sequence_seq'::regclass),
    time timestamp with time zone NOT NULL DEFAULT CURRENT_TIMESTAMP,
    receiver text NOT NULL,
    tx_id text NOT NULL,
    block_num integer NOT NULL,
    contract text NOT NULL,
    action text NOT NULL,
    data jsonb NOT NULL
);

Indexes:

  • "actions_pkey" PRIMARY KEY, btree (global_sequence, time)
  • "actions_time_idx" btree (time DESC)
  • "receiver_idx" btree (receiver)

Field details:

  • Global sequence is a serially incrementing ID
  • Block number and time are not unique, but also incrementing
  • Global sequence and time are primary key, as the data is internally partitioned by time
    • There are some receivers that have over 1 billion associated actions (each with a unique global_sequence).
  • Average text lengths:
    • Receiver: 12
    • tx_id: 52
    • contract: 12
    • action: 6
    • data: small-medium sized JSONB with action metadata

Cardinality of 3 schema options:

  • Current: sitting at 4.2 billion rows in this table
  • Receiver as array: Would be at approximately 1.8 billion rows
  • Normalized: There would be 3 tables:
    • Actions: 1.8 billion rows
    • Actions_Accounts: 4.2 billion rows
    • Accounts: 500 000 rows

Common Query:

  • SELECT * FROM actions WHERE receiver = 'Alpha' ORDER BY time DESC LIMIT 100

All columns are required in the query. NULL values are not seen. I believe joins in the normalized schema would slow down & query speed is #1 priority)

Best Answer

The optimal DB design always depends on the complete picture.

Generally, there is hardly anything faster than a plain btree index for your simple query. Introducing json or jsonb or even a plain array type in combination with a GIN index will most likely make it slower.

With your original table this multicolumn index with the right sort order should be a game changer for your common query:

CREATE INDEX game_changer ON actions (receiver, time DESC);

This way, Postgres can just pick the top 100 rows from the index directly. Super fast.

Related:

Your current indexes receiver_idx and actions_time_idx may lose their purpose.

Next to the perfect index, storage size is an important factor for big tables, so avoiding duplication may be the right idea. But that can be achieved in various ways. Have you considered good old normalization, yet?

CREATE TABLE receiver (
   receiver_id serial PRIMARY KEY
 , receiver    text NOT NULL -- UNIQUE?
);

CREATE TABLE action (  -- I shortened the name to "action"
   action_id   bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
   -- global_sequence bigint NOT NULL DEFAULT nextval('actions_global_sequence_seq'::regclass),  -- ??
   time       timestamptz NOT NULL DEFAULT now(),
   block_num  int NOT NULL,
   tx_id      text NOT NULL,
   contract   text NOT NULL,
   action     text NOT NULL,
   data       jsonb NOT NULL
)

CREATE TABLE receiver_action (
   receiver_id int    REFERENCES receiver
 , action_id   bigint REFERENCES action
 , PRIMARY KEY (receiver_id, action_id)
);

Also note the changed order of columns in table action, saves a couple of bytes per row, which makes a couple of GB for billions of rows.

See:

Your common query changes slightly to:

SELECT a.*
FROM   receiver_action ra
JOIN   action a USING (action_id)
WHERE  ra. receiver_id = (SELECT receiver_id FROM receiver WHERE receiver = 'Alpha')
ORDER  BY a.time DESC
LIMIT  100;

Drawback: it's much harder to make your common query fast now. Related:

The quick (and slightly dirty) fix: include the time column in table receiver_action redundantly (or move it there).

CREATE TABLE receiver_action (
   receiver_id int    REFERENCES receiver
 , action_id   bigint REFERENCES action
 , time        timestamptz NOT NULL DEFAULT now()  -- !
 , PRIMARY KEY (receiver_id, action_id)
);

Create an index:

CREATE INDEX game_changer ON receiver_action (receiver_id, time DESC) INCLUDE (action_id);

INCLUDE requires Postgres 11 or later. See:

And use this query:

SELECT a.*
FROM  (
   SELECT action_id
   FROM   receiver_action
   WHERE  receiver_id = (SELECT receiver_id FROM receiver WHERE receiver = 'Alpha')
   ORDER  BY time DESC
   LIMIT  100
   )
JOIN   action a USING (action_id);

Depending on the exact story behind one set of data may create 3 separate rows more may be possible - even 3 separate columns in table action instead of the n:m implementation and a expression GIN index ... But that's going in too deep. I tap out here.