PostgreSQL – Structure for Storing Data, Lists, Text, Pictures in JSON Field

json, postgresql

We are extracting content sections from Word documents and have been going back and forth on how to store this arbitrary data. We are using PostgreSQL, and our plan is to save the arbitrary sections in JSON fields.

Ideally we would like to map everything out into proper relations, but there is a lot of diversity in the types of data (tables, text, images, lists, etc.). The columns are not the same across documents, and a section that contains a table in DocumentA may consist of text or a picture in DocumentB, hence the reason to store it as JSON.
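
As a rough sketch, the table we have in mind would look something like this (we are assuming a jsonb column here rather than json, since jsonb is easier to index and query):

CREATE TABLE documentData (
    id           int PRIMARY KEY,
    documentName text,
    JSONField    jsonb  -- flat array of content sections, one object per section
);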

This is an example:

DocumentA: has a table with the following columns

ColA|ColB|ColC
val1|val2|val3
val4|val5|val6

DocumentB: has a table with the following columns

ColB|ColC|ColD
val1|val2|val3
val4|val5|val6

DocumentC: has a paragraph of text and a table

Note: more than one type of content can come out of each document (e.g. DocumentA can have two tables plus a block of text).

Our first thought is to store it in a single PostgreSQL field as a flat JSON array.

This would be a single row in PostgreSQL for DocumentA, e.g. INSERT INTO documentData(id, documentName, JSONField) VALUES (1, 'DocumentA', '<JSONDATA>');

[
 {
  "Content Type":"Table"
  "ColA": "val1",
  "ColB": "val2",
  "ColC": "val3"
 },
 {
  "Content Type":"Table"
  "ColA": "val4",
  "ColB": "val5",
  "ColC": "val6"
 }
]

Another record: INSERT INTO documentData(id, documentName, JSONField) VALUES (2, 'DocumentB', '<JSONDATA>');

[
 {
  "Content Type":"Table"
  "ColB": "val1",
  "ColC": "val2",
  "ColD": "val3"
 },
 {
  "Content Type":"Table"
  "ColB": "val4",
  "ColC": "val5",
  "ColD": "val6"
 }
]

Another record: INSERT INTO documentData(id, documentName, JSONField) VALUES (3, 'DocumentC', '<JSONDATA>');

[
 {
  "Content Type":"Text"
  "Data": "Random paragraph of text",
 },
 {
  "Content Type": Table",
  "ColA": "val1",
  "ColE": "val2",
  "ColF": "val3"
 }
]

The main issue I see with doing it this way is that it could make querying the data difficult (see the sketch below). Are there any other suggestions?
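
Just to illustrate: with a jsonb column, pulling every table section that has a ColB value would look roughly like this (a sketch only):

SELECT d.id, d.documentName, elem->>'ColB' AS colb
FROM documentData d
CROSS JOIN LATERAL jsonb_array_elements(d.JSONField) AS elem  -- one row per content section
WHERE elem->>'Content Type' = 'Table'
  AND elem ? 'ColB';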

The ultimate goal is to extract the contents of certain control documents. The control documents have tables whose columns and cells contain the names of parts of other documents that we have to reference. Storing it as RTF was one thought, but we're not sure how we would search and reference the material reliably that way.

Best Answer

If you're looking for alternatives to JSON, and you just need collections of text name/value pairs, you could simply use PostgreSQL arrays of a composite type:

CREATE TYPE mydoc AS (content_type text, names text[], "values" text[]);  -- "values" must be quoted: VALUES is a reserved word
CREATE TABLE documentData(id int primary key, documentName text, docs mydoc[]);
INSERT INTO documentData(id, documentName, docs)
VALUES (1, 'DocumentA', '{"(Table,\"{ColA,ColB,ColC}\",\"{val1,val2,val3}\")","(Table,\"{ColA,ColB,ColC}\",\"{val4,val5,val6}\")"}'::mydoc[]);
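
If the quoted array-literal syntax gets hard to read, the same kind of value can be built with ARRAY and ROW constructors instead (equivalent syntax, sketched here for DocumentB):

INSERT INTO documentData(id, documentName, docs)
VALUES (2, 'DocumentB', ARRAY[
    ROW('Table', ARRAY['ColB','ColC','ColD'], ARRAY['val1','val2','val3'])::mydoc,
    ROW('Table', ARRAY['ColB','ColC','ColD'], ARRAY['val4','val5','val6'])::mydoc
]);
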

SELECT d.id, d.documentName, dd.*
FROM documentData d
LEFT JOIN LATERAL unnest(d.docs) dd ON (true);

This won't give you a significant advantage over JSON, other than portability to older PostgreSQL versions, but unnest and = ANY style operations are often much simpler than the JSON functions, so it may be easier to query and manipulate, e.g.:

SELECT d.id, d.documentName, dd.content_type, pos, name, dd."values"[pos] AS value
FROM documentData d
LEFT JOIN LATERAL unnest(d.docs) dd ON (true)
LEFT JOIN LATERAL unnest(dd.names) WITH ORDINALITY AS y (name, pos) ON (true)
WHERE dd.content_type = 'Table' AND name ~ '^Col[AB]';
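
And a containment check with = ANY stays very short (again just a sketch):

-- documents that contain a table mentioning ColB anywhere
SELECT DISTINCT d.id, d.documentName
FROM documentData d
LEFT JOIN LATERAL unnest(d.docs) dd ON (true)
WHERE dd.content_type = 'Table' AND 'ColB' = ANY(dd.names);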