The advantage of JSON is versatility: you can add any keys without changing the table definition. And maybe convenience, if your application can read and write JSON directly.
Separate columns beat a combined json or jsonb column in every performance aspect, and in several other aspects, too: a more sophisticated type system, the full range of functionality (CHECK, UNIQUE, and foreign key constraints, default values, etc.), a smaller table, smaller indexes, faster queries.
For prefix matching on text columns you might use a text_pattern_ops index:
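A minimal sketch of such an index; the index name is made up here, and tx_id stands in for whichever text column you match on:

```sql
-- b-tree index with text_pattern_ops: supports left-anchored
-- LIKE patterns independently of the database collation
CREATE INDEX actions_tx_id_pattern_idx ON actions (tx_id text_pattern_ops);

-- Can serve queries like:
-- SELECT * FROM actions WHERE tx_id LIKE 'abc%';
```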
Or, more generally, a trigram index supporting any LIKE pattern:
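A sketch, again using the hypothetical tx_id column; trigram indexes require the pg_trgm extension:

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- GIN trigram index: supports arbitrary (also non-anchored) LIKE/ILIKE patterns
CREATE INDEX actions_tx_id_trgm_idx ON actions USING gin (tx_id gin_trgm_ops);

-- Can serve queries like:
-- SELECT * FROM actions WHERE tx_id LIKE '%abc%';
```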
While you stick with JSON (jsonb in particular), there are also different indexing strategies. GIN or B-tree is not the only decision to make: partial indexes, expression indexes, different operator classes (in particular: jsonb_path_ops).
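For example, a GIN index with the jsonb_path_ops operator class on the data column (sketch only; whether it fits depends on your queries):

```sql
-- jsonb_path_ops indexes are smaller and faster than the default
-- jsonb_ops, but only support the containment operator @> (and
-- jsonpath operators), not key-existence operators like ?
CREATE INDEX actions_data_path_idx ON actions USING gin (data jsonb_path_ops);

-- Can serve containment queries like:
-- SELECT * FROM actions WHERE data @> '{"status": "done"}';
```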
The optimal DB design always depends on the complete picture.
Generally, there is hardly anything faster than a plain B-tree index for your simple query. Introducing json or jsonb, or even a plain array type in combination with a GIN index, will most likely make it slower.
With your original table this multicolumn index with the right sort order should be a game changer for your common query:
CREATE INDEX game_changer ON actions (receiver, time DESC);
This way, Postgres can just pick the top 100 rows from the index directly. Super fast.
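For reference, the common query assumed here would be along these lines (inferred from the rewritten query further down, so treat the exact shape as an assumption):

```sql
-- Top 100 most recent actions for one receiver; with the
-- (receiver, time DESC) index this becomes an index scan that
-- stops after 100 rows
SELECT *
FROM actions
WHERE receiver = 'Alpha'
ORDER BY time DESC
LIMIT 100;
```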
Your current indexes receiver_idx and actions_time_idx may lose their purpose.
Next to the perfect index, storage size is an important factor for big tables, so avoiding duplication may be the right idea. But that can be achieved in various ways. Have you considered good old normalization, yet?
CREATE TABLE receiver (
receiver_id serial PRIMARY KEY
, receiver text NOT NULL -- UNIQUE?
);
CREATE TABLE action ( -- I shortened the name to "action"
action_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
-- global_sequence bigint NOT NULL DEFAULT nextval('actions_global_sequence_seq'::regclass), -- ??
time timestamptz NOT NULL DEFAULT now(),
block_num int NOT NULL,
tx_id text NOT NULL,
contract text NOT NULL,
action text NOT NULL,
data jsonb NOT NULL
);
CREATE TABLE receiver_action (
receiver_id int REFERENCES receiver
, action_id bigint REFERENCES action
, PRIMARY KEY (receiver_id, action_id)
);
Also note the changed order of columns in table action: it saves a couple of bytes per row, which adds up to a couple of GB for billions of rows.
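The saving comes from alignment padding; a hypothetical illustration of the rule of thumb (order columns from largest to smallest alignment requirement):

```sql
-- Postgres aligns 8-byte types (bigint, timestamptz) on 8-byte
-- boundaries, so an int directly before a bigint wastes 4 bytes:
CREATE TABLE bad_order (
  a int,     -- 4 bytes + 4 bytes of padding before b
  b bigint,  -- 8 bytes, must start on an 8-byte boundary
  c int      -- 4 bytes
);

-- Ordering by descending alignment avoids the padding:
CREATE TABLE good_order (
  b bigint,  -- 8 bytes
  a int,     -- 4 bytes
  c int      -- 4 bytes, packed right after a
);
```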
Your common query changes slightly to:
SELECT a.*
FROM receiver_action ra
JOIN action a USING (action_id)
WHERE ra.receiver_id = (SELECT receiver_id FROM receiver WHERE receiver = 'Alpha')
ORDER BY a.time DESC
LIMIT 100;
Drawback: it's much harder to make your common query fast now.
The quick (and slightly dirty) fix: include the time column in table receiver_action redundantly (or move it there).
CREATE TABLE receiver_action (
receiver_id int REFERENCES receiver
, action_id bigint REFERENCES action
, time timestamptz NOT NULL DEFAULT now() -- !
, PRIMARY KEY (receiver_id, action_id)
);
Create an index:
CREATE INDEX game_changer ON receiver_action (receiver_id, time DESC) INCLUDE (action_id);
INCLUDE requires Postgres 11 or later.
And use this query:
SELECT a.*
FROM (
SELECT action_id
FROM receiver_action
WHERE receiver_id = (SELECT receiver_id FROM receiver WHERE receiver = 'Alpha')
ORDER BY time DESC
LIMIT 100
) ra
JOIN action a USING (action_id);
Depending on the exact story behind "one set of data may create 3 separate rows", more may be possible - even 3 separate columns in table action instead of the n:m implementation, and an expression GIN index ...
But that's going in too deep. I tap out here.
Best Answer
Relational databases are designed around joins, and optimized to do them well.
Unless you have a good reason not to use a normalized design, use a normalized design.
jsonb and things like hstore are good for when you can't use a normalized data model, such as when the data model changes rapidly and is user-defined. If you can model it relationally, model it relationally. If you can't, consider json etc. If you're choosing between json/jsonb/hstore, generally choose jsonb unless you have a reason not to.
That's what I said in my blog post, which addresses just this topic. Please read the whole post. The paragraph you quoted points out that if you're choosing a dynamic structure you should choose jsonb over hstore, but the rest of the blog post is about why you should usually prefer to model relationally if you can.
So. Model the main structured part relationally. If the tables are really wide with lots of columns, this might be a sign that further normalization is required. Do not be afraid of joins. Learn to love joins. Joining many small tables will often be faster than querying and maintaining big denormalized tables. Denormalize only if you need to for specific cases, and preferably via materialized views ... but don't do it until you know you need to and have an actual concrete problem to solve.
For user-contributed data that's freeform and unstructured, use jsonb. It should perform as well as hstore, but it's more flexible and easier to work with.
One relevant thing to understand: GiST and GIN indexes like those used on jsonb are generally much less efficient than a plain b-tree index. They're more flexible, but a b-tree index on a normal column will almost always be much, much faster.
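To make the contrast concrete, a sketch with made-up table and column names:

```sql
-- b-tree on a plain column: compact, fast equality and range lookups
CREATE INDEX articles_author_idx ON articles (author_id);
-- serves: SELECT * FROM articles WHERE author_id = 42;

-- GIN on a jsonb column: flexible containment queries over arbitrary
-- keys, but a bigger index and more expensive lookups and updates
CREATE INDEX articles_attrs_idx ON articles USING gin (attrs);
-- serves: SELECT * FROM articles WHERE attrs @> '{"color": "red"}';
```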