The advantage of JSON is versatility: you can add any keys without changing the table definition. And maybe convenience, if your application can read and write JSON directly.
Separate columns beat a combined json or jsonb column in every performance aspect, and in several other aspects, too: a more sophisticated type system, the full range of functionality (CHECK, UNIQUE, and foreign key constraints, default values, etc.), a smaller table, smaller indexes, faster queries.
For prefix matching on text columns you might use a text_pattern_ops index:
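A minimal sketch of such an index; the index name is made up here, and tx_id stands in for whichever text column you match on:

```sql
-- b-tree index with text_pattern_ops: supports left-anchored
-- LIKE patterns independently of the database collation
CREATE INDEX actions_tx_id_pattern_idx ON actions (tx_id text_pattern_ops);

-- Can serve queries like:
-- SELECT * FROM actions WHERE tx_id LIKE 'abc%';
```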
Or, more generally, a trigram index supporting any LIKE pattern:
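A sketch, again using the hypothetical tx_id column; trigram indexes require the pg_trgm extension:

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- GIN trigram index: supports arbitrary (also non-anchored) LIKE/ILIKE patterns
CREATE INDEX actions_tx_id_trgm_idx ON actions USING gin (tx_id gin_trgm_ops);

-- Can serve queries like:
-- SELECT * FROM actions WHERE tx_id LIKE '%abc%';
```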
While you stick with JSON (jsonb in particular), there are also different indexing strategies. GIN or B-tree is not the only decision to make: partial indexes, expression indexes, different operator classes (in particular: jsonb_path_ops).
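For example, a GIN index with the jsonb_path_ops operator class on the data column (sketch only; whether it fits depends on your queries):

```sql
-- jsonb_path_ops indexes are smaller and faster than the default
-- jsonb_ops, but only support the containment operator @> (and
-- jsonpath operators), not key-existence operators like ?
CREATE INDEX actions_data_path_idx ON actions USING gin (data jsonb_path_ops);

-- Can serve containment queries like:
-- SELECT * FROM actions WHERE data @> '{"status": "done"}';
```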
The optimal DB design always depends on the complete picture.
Generally, there is hardly anything faster than a plain B-tree index for your simple query. Introducing json or jsonb, or even a plain array type in combination with a GIN index, will most likely make it slower.
With your original table this multicolumn index with the right sort order should be a game changer for your common query:
CREATE INDEX game_changer ON actions (receiver, time DESC);
This way, Postgres can just pick the top 100 rows from the index directly. Super fast.
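For reference, the common query assumed here would be along these lines (inferred from the rewritten query further down, so treat the exact shape as an assumption):

```sql
-- Top 100 most recent actions for one receiver; with the
-- (receiver, time DESC) index this becomes an index scan that
-- stops after 100 rows
SELECT *
FROM actions
WHERE receiver = 'Alpha'
ORDER BY time DESC
LIMIT 100;
```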
Your current indexes receiver_idx and actions_time_idx may lose their purpose.
Next to the perfect index, storage size is an important factor for big tables, so avoiding duplication may be the right idea. But that can be achieved in various ways. Have you considered good old normalization, yet?
CREATE TABLE receiver (
receiver_id serial PRIMARY KEY
, receiver text NOT NULL -- UNIQUE?
);
CREATE TABLE action ( -- I shortened the name to "action"
action_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
-- global_sequence bigint NOT NULL DEFAULT nextval('actions_global_sequence_seq'::regclass), -- ??
time timestamptz NOT NULL DEFAULT now(),
block_num int NOT NULL,
tx_id text NOT NULL,
contract text NOT NULL,
action text NOT NULL,
data jsonb NOT NULL
);
CREATE TABLE receiver_action (
receiver_id int REFERENCES receiver
, action_id bigint REFERENCES action
, PRIMARY KEY (receiver_id, action_id)
);
Also note the changed order of columns in table action: it saves a couple of bytes per row, which adds up to a couple of GB for billions of rows.
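The saving comes from alignment padding; a hypothetical illustration of the rule of thumb (order columns from largest to smallest alignment requirement):

```sql
-- Postgres aligns 8-byte types (bigint, timestamptz) on 8-byte
-- boundaries, so an int directly before a bigint wastes 4 bytes:
CREATE TABLE bad_order (
  a int,     -- 4 bytes + 4 bytes of padding before b
  b bigint,  -- 8 bytes, must start on an 8-byte boundary
  c int      -- 4 bytes
);

-- Ordering by descending alignment avoids the padding:
CREATE TABLE good_order (
  b bigint,  -- 8 bytes
  a int,     -- 4 bytes
  c int      -- 4 bytes, packed right after a
);
```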
Your common query changes slightly to:
SELECT a.*
FROM receiver_action ra
JOIN action a USING (action_id)
WHERE ra.receiver_id = (SELECT receiver_id FROM receiver WHERE receiver = 'Alpha')
ORDER BY a.time DESC
LIMIT 100;
Drawback: it's much harder to make your common query fast now.
The quick (and slightly dirty) fix: include the time column in table receiver_action redundantly (or move it there).
CREATE TABLE receiver_action (
receiver_id int REFERENCES receiver
, action_id bigint REFERENCES action
, time timestamptz NOT NULL DEFAULT now() -- !
, PRIMARY KEY (receiver_id, action_id)
);
Create an index:
CREATE INDEX game_changer ON receiver_action (receiver_id, time DESC) INCLUDE (action_id);
INCLUDE requires Postgres 11 or later.
And use this query:
SELECT a.*
FROM (
SELECT action_id
FROM receiver_action
WHERE receiver_id = (SELECT receiver_id FROM receiver WHERE receiver = 'Alpha')
ORDER BY time DESC
LIMIT 100
) ra
JOIN action a USING (action_id);
Depending on the exact story behind "one set of data may create 3 separate rows", more may be possible - even 3 separate columns in table action instead of the n:m implementation, and an expression GIN index ...
But that's going in too deep. I tap out here.
Best Answer
Relational databases are designed around joins, and optimized to do them well.
Unless you have a good reason not to use a normalized design, use a normalized design.
jsonb and things like hstore are good for when you can't use a normalized data model, such as when the data model changes rapidly and is user-defined. If you can model it relationally, model it relationally. If you can't, consider json etc. If you're choosing between json/jsonb/hstore, generally choose jsonb unless you have a reason not to.
That's what I said in my blog post, which addresses just this topic. Please read the whole post. The paragraph you quoted points out that if you're choosing a dynamic structure you should choose jsonb over hstore, but the rest of the blog post is about why you should usually prefer to model relationally if you can.
So. Model the main structured part relationally. If the tables are really wide with lots of columns, this might be a sign that further normalization is required. Do not be afraid of joins. Learn to love joins. Joining many small tables will often be faster than querying and maintaining big denormalized tables. Denormalize only if you need to for specific cases, and preferably via materialized views ... but don't do it until you know you need to and have an actual concrete problem to solve.
For user-contributed data that's freeform and unstructured, use jsonb. It should perform as well as hstore, but it's more flexible and easier to work with.
One relevant thing to understand: GiST and GIN indexes like those used on jsonb are generally much less efficient than a plain b-tree index. They're more flexible, but a b-tree index on a normal column will almost always be much, much faster.
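To make the contrast concrete, a sketch with made-up table and column names:

```sql
-- b-tree on a plain column: compact, fast equality and range lookups
CREATE INDEX articles_author_idx ON articles (author_id);
-- serves: SELECT * FROM articles WHERE author_id = 42;

-- GIN on a jsonb column: flexible containment queries over arbitrary
-- keys, but a bigger index and more expensive lookups and updates
CREATE INDEX articles_attrs_idx ON articles USING gin (attrs);
-- serves: SELECT * FROM articles WHERE attrs @> '{"color": "red"}';
```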