Postgresql – Better jsonb schema for querying in postgres

indexpostgresqlschema

I have a JSON with the following structure:

uid | item
1   |[{"id":1, "m":"123", "s":1},{"id":2, "msg":"on","s":2},{"id":3, "msg":"of","s":1}]
2   |[{"id":1, "m":"yes", "s":2},{"id":3, "msg":"gh","s":0}]
3   |[{"id":1, "m":"qa", "s":1},{"id":4, "msg":"ks"},{"id":5, "m":"test"}]

I want to query this table based on id and get rows based on matched object values. For e.g. for id=3 I want:

uid | id | m  | s
 1  | 3  | of | 1
 2  | 3  | gh | 0

This table has over 500M rows so I would need an index. Using GIN(item jsonb_path_ops) with item @> '[{"id": 3}]' works, but I don't know how to get the exact json object from array that matched.

I can turn the JSON structure to:

{
    "1": { "m":"123", "s": 1 },
    "2": { "m":"on", "s":2 },
    "3": { "m":"of", "s": 1 }
}

and use GIN(item) and check for key exists with ? and then try to fetch the key value. But the index size might increase which might be an overkill given that I only want to search based on id.

Maybe I can use a B-Tree index but not sure how. What should be my JSON structure for most efficient querying? and what index should be used so that I get only the matched object from the array?

Some facts:

  1. Cardinality of id <<< uid. (id = ~100,000, uid = ~500,000,000)
  2. Normalizing this turns this table to 500M*10,000 rows which is extremely slow.
  3. Each JSON array will have upto 50 objects only.

Best Answer

There are 2 questions here;


First one is the index structure.

  • You can use gin(item) here (without jsonb_path_ops). It will be just fine. You are querying for key (id) only not the full path.
  • You can create an expression index (or an gin index on a generated column) for extracting id field values to an integer[] value. Most probably this index will be smaller than jsonb index (~%10 smaller). But I am not sure whether it's faster for your case.

Here is an example for second usage.

Create an immutable function.

CREATE FUNCTION to_my_int_array (jsonb) RETURNS integer[] AS 
$$
  SELECT
      ARRAY(
        SELECT (jsonb_array_elements($1) ->> 'id')::integer
      );

$$
LANGUAGE sql
IMMUTABLE
RETURNS NULL ON NULL INPUT;

Create an expression index with the function above:

CREATE INDEX ON my_table USING gin (to_my_int_array (item));

Filter the query with function;

SELECT
    mt.uid,
    my_item.*
  FROM my_table mt,
  LATERAL jsonb_array_elements(mt.item) ae,
  LATERAL jsonb_to_record(ae) AS my_item (id int, m text, s int)
  WHERE
    to_my_int_array(mt.item) @> '{1}'::integer[];

Second question is the output. The query above also shows output for version 11. After version 12, you can simply use the query below;

SELECT
    uid,
    my_item.*
  FROM my_table mt,
  LATERAL jsonb_to_record(jsonb_path_query_first(mt.item, '$[*] ? (@.id == 1) ')) 
    AS my_item(id int, m text, s int)
  WHERE
    mt.item @> '[{"id": 1}]';

| uid | id  | m   | s   |
| --- | --- | --- | --- |
| 1   | 1   | 123 | 1   |
| 2   | 1   | yes | 2   |
| 3   | 1   | qa  | 1   |