Postgresql – JSONB Array of Strings (with GIN index) versus Split Rows (B-Tree Index)

Tags: database-design, index, performance, postgresql, postgresql-11, postgresql-performance

I have a table that stores a receiver column to indicate which account the data relates to. This has led to a lot of duplicated data: one set of data may create 3 separate rows in which every column is identical except for receiver. While redesigning the database, I have considered storing the receivers as an array with a GIN index instead of the current B-Tree index on receiver.

Current table definition:

CREATE TABLE public.actions (
    global_sequence bigint NOT NULL DEFAULT nextval('actions_global_sequence_seq'::regclass),
    time timestamp with time zone NOT NULL DEFAULT CURRENT_TIMESTAMP,
    receiver text NOT NULL,
    tx_id text NOT NULL,
    block_num integer NOT NULL,
    contract text NOT NULL,
    action text NOT NULL,
    data jsonb NOT NULL
);

Indexes:

"actions_pkey" PRIMARY KEY, btree (global_sequence, time)
"actions_time_idx" btree (time DESC)
"receiver_idx" btree (receiver)

Field details:

- Global sequence is a serially incrementing ID.
- Block number and time are not unique, but also incrementing.
- Global sequence and time form the primary key, as the data is internally partitioned by time.
- Some receivers have over 1 billion associated actions (each with a unique global_sequence).

Average text lengths:
- receiver: 12
- tx_id: 52
- contract: 12
- action: 6
- data: small-medium sized JSONB with action metadata

Cardinality of 3 schema options:

Current: sitting at 4.2 billion rows in this table
Receiver as array: Would be at approximately 1.8 billion rows
Normalized: There would be 3 tables:
- Actions: 1.8 billion rows
- Actions_Accounts: 4.2 billion rows
- Accounts: 500 000 rows

Common Query:

SELECT * FROM actions WHERE receiver = 'Alpha' ORDER BY time DESC LIMIT 100

All columns are required in the query. NULL values are not seen. I believe joins in the normalized schema would slow the query down, and query speed is the #1 priority.

Best Answer

The optimal DB design always depends on the complete picture.

Generally, there is hardly anything faster than a plain btree index for your simple query. Introducing json or jsonb or even a plain array type in combination with a GIN index will most likely make it slower.

With your original table this multicolumn index with the right sort order should be a game changer for your common query:

CREATE INDEX game_changer ON actions (receiver, time DESC);

This way, Postgres can just pick the top 100 rows from the index directly. Super fast.
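To confirm that the planner really uses the new index, you can inspect the plan. A minimal sketch, assuming the index above has been created and table statistics are current. On a plain table, the plan to hope for is an index scan on game_changer without a separate Sort node; with partitioning the plan shape will differ.

```sql
-- Show the actual plan, timings and buffer usage for the common query.
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM actions WHERE receiver = 'Alpha' ORDER BY time DESC LIMIT 100;
```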

Related: Optimizing queries on a range of timestamps (two columns)

Your current indexes receiver_idx and actions_time_idx may lose their purpose.

Next to the perfect index, storage size is an important factor for big tables, so avoiding duplication may be the right idea. But that can be achieved in various ways. Have you considered good old normalization, yet?

CREATE TABLE receiver (
   receiver_id serial PRIMARY KEY
 , receiver    text NOT NULL -- UNIQUE?
);

CREATE TABLE action (  -- I shortened the name to "action"
   action_id   bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
   -- global_sequence bigint NOT NULL DEFAULT nextval('actions_global_sequence_seq'::regclass),  -- ??
   time       timestamptz NOT NULL DEFAULT now(),
   block_num  int NOT NULL,
   tx_id      text NOT NULL,
   contract   text NOT NULL,
   action     text NOT NULL,
   data       jsonb NOT NULL
);

CREATE TABLE receiver_action (
   receiver_id int    REFERENCES receiver
 , action_id   bigint REFERENCES action
 , PRIMARY KEY (receiver_id, action_id)
);

Also note the changed order of columns in table action: it saves a couple of bytes per row, which adds up to a couple of GB for billions of rows.
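The effect of column order on row size can be checked with pg_column_size(). A rough sketch with made-up sample values (a 12-character text, like the average receiver or contract above); exact numbers depend on your data and platform alignment, so none are given here.

```sql
-- Hypothetical sample values; compare the size of two column layouts.
SELECT pg_column_size(ROW(now(), 'abcdefghijkl'::text, 1::int)) AS int_after_text
     , pg_column_size(ROW(now(), 1::int, 'abcdefghijkl'::text)) AS int_before_text;
```

If the second number comes out smaller, the int column needed alignment padding in the first layout.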

Your common query changes slightly to:

SELECT a.*
FROM   receiver_action ra
JOIN   action a USING (action_id)
WHERE  ra.receiver_id = (SELECT receiver_id FROM receiver WHERE receiver = 'Alpha')
ORDER  BY a.time DESC
LIMIT  100;

Drawback: it's much harder to make your common query fast now. Related:

Can spatial index help a "range - order by - limit" query

The quick (and slightly dirty) fix: include the time column in table receiver_action redundantly (or move it there).

CREATE TABLE receiver_action (
   receiver_id int    REFERENCES receiver
 , action_id   bigint REFERENCES action
 , time        timestamptz NOT NULL DEFAULT now()  -- !
 , PRIMARY KEY (receiver_id, action_id)
);

Create an index:

CREATE INDEX game_changer ON receiver_action (receiver_id, time DESC) INCLUDE (action_id);

INCLUDE requires Postgres 11 or later. See:

Does a query with a primary key and foreign keys run faster than a query with just primary keys?

And use this query:

SELECT a.*
FROM  (
   SELECT action_id
   FROM   receiver_action
   WHERE  receiver_id = (SELECT receiver_id FROM receiver WHERE receiver = 'Alpha')
   ORDER  BY time DESC
   LIMIT  100
   ) ra  -- a subquery in FROM needs an alias in Postgres 11
JOIN   action a USING (action_id);
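Strictly speaking, the join does not guarantee that the sort order of the subquery survives in the final result. A variant with an explicit outer ORDER BY, sketched against the same tables (receiver_action with the redundant time column):

```sql
SELECT a.*
FROM  (
   SELECT action_id, time
   FROM   receiver_action
   WHERE  receiver_id = (SELECT receiver_id FROM receiver WHERE receiver = 'Alpha')
   ORDER  BY time DESC
   LIMIT  100
   ) ra
JOIN   action a USING (action_id)
ORDER  BY ra.time DESC;  -- sorting 100 rows is cheap
```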

Depending on the exact story behind "one set of data may create 3 separate rows", more may be possible - even 3 separate columns in table action instead of the n:m implementation, and an expression GIN index ... But that's going in too deep. I tap out here.

pgAdmin timing

When you execute a query from the query tool, the message pane shows something like:

Total query runtime: 62 ms.

And the status line shows the same time. I quote pgAdmin help about that:

The status line will show how long the last query took to complete. If a dataset was returned, not only the elapsed time for server execution is displayed, but also the time to retrieve the data from the server to the Data Output page.

If you want to see the time on the server, you need to use SQL EXPLAIN ANALYZE, the built-in Shift + F7 keyboard shortcut, or Query -> Explain analyze. Then, at the bottom of the explain output, you get something like this:

Total runtime: 0.269 ms
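A minimal sketch of the SQL route, using a throwaway query over generate_series() so it runs on any database. Note that Postgres 9.4 and later label the last line "Execution time" (with a separate "Planning time" line) instead of "Total runtime".

```sql
-- Executes the query and reports server-side timing, without client transfer time.
EXPLAIN ANALYZE
SELECT count(*) FROM generate_series(1, 100000) AS g(n);
```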

Postgresql – Use GIN to index bit strings

In a search, I would like to get all the rows that exactly match the bit string.

Use a B-Tree index, the default type. I don't see a case for a GIN index here.

Up to 1000 bits result in up to 133 bytes (or slightly more) storage size on disk for a bit varying type.

SELECT pg_column_size(repeat('1', 1000)::varbit)  -- 133

Not that much. A plain B-Tree index should do. But maybe the column is big enough that the following tricks improve performance.

If a small part of the bitstring column is distinctive enough to narrow your search down to a few hits, an index on an expression might give you better performance, because the smaller index can fit into RAM and is faster to process all around. Don't bother for small tables, the overhead would eat the benefit. But it could make a big difference for big tables.

Example

Given table:

CREATE TABLE tbl(id serial PRIMARY KEY, b_col varbit);

If the first 10 bits are enough to narrow down a search to a few hits, you could create an index on the expression b_col::bit(10). Casting to bit(n) truncates the bitstring to n bits.

CREATE INDEX tbl_b_col10_idx ON tbl ((b_col::bit(10)));

Extra parentheses are required for the cast operator in an index definition. See:

How to create an index on an integer json property in postgres

Then, instead of the query

SELECT * FROM tbl WHERE b_col = '1111011110111101'::varbit; -- 16 bit

You would use:

SELECT *
FROM   tbl
WHERE  b_col::bit(10) = '1111011110111101'::bit(10) -- utilize index
AND    b_col = '1111011110111101'::varbit;  -- filter to exact match

Be aware that shorter values are padded with 0's to the right (least significant bits) when cast to bit(n).
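Both behaviors of the explicit cast are easy to see with literals. A small sketch; the input values are just examples:

```sql
SELECT '1111011110111101'::varbit::bit(10) AS truncated  -- 1111011110
     , '101'::varbit::bit(10)              AS padded;    -- 1010000000
```

So a stored value of '101' and a stored value of '1010000000' land on the same index entry, which is why the second predicate on b_col is still needed for an exact match.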

In a real-world application this starts to make sense with several hundred bits. Test for the turning point.

Optimize further

Since most installations operate with a MAXALIGN of 8 bytes (64-bit OS), your index size is the same for any data not exceeding 8 bytes. Effectively, per row:

 4 bytes item identifier
 8 bytes for the index tuple header (or 23 + 1 byte for heap tuples)
 ? actual space for data
 ? padding to the nearest multiple of 8 bytes

Plus some minor overhead per page and index / table. Details in the manual or in this related answer on SO.

Therefore, you should be able to further optimize the above approach. Take the first 64 bit (or last or whatever is most distinctive and works for you), cast it to bigint and build an index on this expression.

CREATE INDEX tbl_b_col64_idx ON tbl ((b_col::bit(64)::bigint));

I cast twice (b_col::bit(64)::bigint) because there is no cast defined between varbit and bigint. Details in this related answer on SO:

Convert hex in text representation to decimal number

Effectively, this is just a very fast and simple hash function, where the hash value also lets you look up ranges of values. Depending on exact requirements you could go one step further and use any IMMUTABLE hash function - like md5(). Details in the answer linked above.

The query to go along with that:

SELECT *
FROM   tbl
WHERE  b_col::bit(64)::bigint = '1111011110111101'::bit(64)::bigint -- utilize index
AND    b_col = '1111011110111101'::varbit;  -- narrow down to exact match

The resulting index should be just as big as the one in the first example, but queries should be considerably faster for three reasons:

- The index typically returns far fewer hits (64 bits of information vs. 10 bits).
- Postgres can work with integer arithmetic, which should be faster, even for a plain = operation. (Didn't test to verify that.)
- An integer type like bigint has no overhead like varbit - 5 or 8 bytes. (In my installation 5 bytes for up to 960 bits, 8 bytes for more.) Effectively, to keep the index at its minimum size, you can only pack 24 bits into a varbit index - compared to 64 bits of information for a bigint index.

CLUSTER

In such a case CLUSTER should improve performance:

CLUSTER tbl USING tbl_b_col10_idx;

It's a one-time operation: the physical order is not maintained for later writes, so it has to be repeated at intervals of your choosing. Be sure to read the manual on CLUSTER if you want to use that. Or consider the alternative pg_repack. Details:

Configuring PostgreSQL for read performance

If the first 64 bit of your values are unique most of the time, CLUSTER will barely help, since the index scan will return a single row in most cases. If not, CLUSTER will help a lot. Consequently, the effect will be far greater for the first example with the less optimized index.
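A sketch of what a maintenance run could look like, assuming the table and index from the example above. CLUSTER takes an exclusive lock on the table while it runs, so schedule it accordingly.

```sql
-- First run: physically reorder the table by the chosen index.
CLUSTER tbl USING tbl_b_col10_idx;
ANALYZE tbl;  -- refresh planner statistics after the rewrite

-- Later runs: Postgres remembers the index used last time.
CLUSTER tbl;
ANALYZE tbl;
```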
