PostgreSQL – Creating View or Table from JSON Data with Different Shapes

database-designindexpostgresqlpostgresql-10view

I'm pulling in JSON data from two different sources that have roughly the same information, but in different shapes. Let's say I'm pulling in data about music albums, and one of the API responses looks like this:

{
  "id": "6289d24d-1a49-499b-90a7-8f9e43b4d6af",
  "artist-credit": [{
    "name": "The Hot Sardines"
  }],
  "title": "The Hot Sardines",
  "date": "2014",
  "media": [{
    "tracks": [{
      "position": 1,
      "id": "5928bc3b-a5b9-4149-8a1e-b224707dab5a",
      "title": "Bei Mir Bist Du Schoen",
      "artist-credit": [{
        "name": "The Hot Sardines"
      }],
      "length": 246000,
    },  {
      "length": 246000,
      "number": "11",
      "artist-credit": [{
        "name": "The Hot Sardines"
      }],
      "title": "(I Don't Stand) A Ghost of a Chance With You",
      "id": "be97c530-d2d4-4a68-a777-637a2c23852e",
      "position": 11
    }],
    "track-count": 11
  }],
  "packaging-id": "ec27701a-4a22-37f4-bfac-6616e0f9750a"
}

While the other looks like:

{
  "album_type" : "album",
  "artists" : [ {
    "id" : "2BTZIqw0ntH9MvilQ3ewNY",
    "name" : "Cyndi Lauper"
  } ],
  "id" : "0sNOF9WDwhWunNAHPD3Baj",
  "name" : "She's So Unusual",
  "release_date" : "1983",
  "tracks" : {
    "items" : [ {
      "artists" : [ "some artists"],
      "disc_number" : 1,
      "duration_ms" : 305560,
      "id" : "3f9zqUnrnIq0LANhmnaF0V",
      "name" : "Money Changes Everything",
      "track_number" : 1,
      "type" : "track",
    }, {
      ...
    } ],
    "total" : 13
  },
  "type" : "album",
  "uri" : "spotify:album:0sNOF9WDwhWunNAHPD3Baj"
}

I'd like to be able to query this data via SQL in PostgreSQL, with the table structure looking like:

album               (id, title,    release_date, spotify_id, musicbrainz_id)
track               (id, title,    album_id,     duration,   spotify_id, musicbrainz_id, position)
artist              (id, name,     spotify_id,   musicbrainz_id)
artist_track_credit (id, track_id, artist_id)
artist_album_credit (id, album_id, artist_id)

From what I've gathered there are a couple of ways to do this.

One would be to take each API response, convert them into SQL INSERT queries in software and then send those queries to the database like so:

INSERT INTO album (title, release_date, etc) VALUES (?, ?, ? ...);

-- Get the album ID back and use it in the following queries

INSERT INTO track (title, album_id, etc) VALUES (?, ?, ? ...), (?, ?, ? ...), ...;

INSERT INTO artist (name, etc) VALUES (?, ...), ... ON CONFLICT DO NOTHING

-- Get the new artist ids back (if created) and use them in the following queries

INSERT INTO artist_track_credit VALUES ...
INSERT INTO artist_album_credit VALUES ...

The problem here is that it is pretty messy, and requires several round trips to the DB in order to get the album ID to insert with the tracks, then the artist IDs to insert with the artist and album credits.

One alternative is to insert the API response directly into the database, and then use database functions to manipulate the data and insert it into the relational structure. The downside here is that I'm not as familiar with writing functions for postgresql, but if this is the best solution I'd go for it.

Another alternative is to take advantage of postgresql's JSON handling abilities and insert the API responses directly into the database, and then develop indices and views on that JSON data to simulate ordinary relational data. This seems promising, but I'm not sure how to handle the data having different shapes. Is it possible to write a set of views or indices for each schema, and then some kind of union that will take, say SELECT * FROM album and convert that into SELECT * FROM musicbrainz_album_view UNION SELECT * FROM spotify_album_view without having to write it out every time?

What is the best way to handle this scenario?

Best Answer

I recommend not to store the data as JSON in the database, particularly if you want to query them with SQL. Storing the data will be easy that way, but your queries will be much more complicated and slower.

If you want to use several SQL statements in client code or a function in the database should mostly depend on the question if you want to keep your business logic in the application or in the database.

There shouldn't be too many round trips necessary if you use RETURNING as in

INSERT INTO artist (name)
VALUES ('Jethro Tull')
RETURNING id;

You can even insert several rows with a construct like

WITH a AS (
   INSERT INTO artist (name)
   VALUES ('Jethro Tull')
   RETURNING id
)
INSERT INTO album (name, artist_id)
SELECT ('Catfish Rising', a.id)
FROM a;

Related Solutions

Concept, ER model, “employee tracking”

You'll probably end up with something looking a little like this: Schema This may be a little more involved than you first expected, but I'll take you through the whys and hows of it all ( doing this is generally much easier than explaining it, by the way ), starting with identifying objects. Using "Noun-Verb" analysis, which is a fairly common OOAD technique, we start picking apart your requirements:

Right away, it becomes evident that at least Employee, Position / Job and Skill are needed. Further along, a need for Department and Skill Group emerges, then finally Skill Level. From here, things like Employee Surname are noted as well.

Some liberties are then taken here ( which is not abnormal in any way, considering the quality of the list of requirements you'd normally be extracting from a client ), but the noun list to be used for the general structure, in no particular order, is as follows:

Department ( D )

Department Code ( DC )

Department Name ( DN )

Position ( P )

Position Code ( PC )

Position Title ( PT )

Skill Group ( G )

Skill Group Code ( GC )

Skill Group Name ( GN )

Skill ( S )

Skill Code ( SC )

Skill Name ( SN )

Employee ( E )

Employee Code ( EC )

Employee First Name ( EF )

Employee Surname ( EL )

Level ( L )

Level Code ( LC )

Level Name ( LN )

Using these nouns, on a relatively simple level, relationships between them can be established using "is-a / has-a," that is, "Noun A is a Noun B" or "Noun A has a Noun B." This is particularly useful for classifying our nouns into two categories, being an "attribute", an atomic entity, later called a column, or a "container," which is a collection, later called a table. During the first round of this "is-a / has-a," the goal is generally to simplify the list, effectively removing the attributes from the picture, grouping them to their containers.

The naming convention used above goes a fairly long way to making this process simple. For instance, an Employee ( E ) has an Employee First Name ( EF ) and an Employee Last Name ( EL ), forming one-to-one relationships between Employee ( E ) and the Employee First Name ( EF ) and Employee Last Name ( EL ) entities. In this way, Employee ( E ) can be classified as a container, whereas Employee First Name ( EF ) and Employee Last Name ( EL ) become attributes. Focusing on this reduction of nouns via one-to-one relationships, we can reduce the list above to the following, where the notation ( A: B, C ) means that entity A is a "container" with "attributes" B and C. It follows that:

Department ( D: DC, DN )

Position ( P: PC, PT )

Skill Group ( G: GC, GN )

Skill ( S: SC, SN )

Employee ( E: EC, EF, EL )

Level ( L: LC, LN )

With this reduction, the one-to-many relationships can now be considered. A Department ( D ) may have any ( a pluralized has a ) number of Positions ( P ) available, implicitly indicating a one-to-many relationship. Extending the above notation, in addition to the attribute list, we can now add "compound" attributes. More simply, a compound attribute is basically a container treated as an attribute. This can be written as ( P: PC, PT, ( D: DC, DN ) ), which begins to highlight the places where foreign keys will materialize. Similarly, Skill Group ( G ) may refer to a number of differing Skills ( S ).

( P: PC, PT, ( D: DC, DN ) )

( S: SC, SN, ( G: GC, GN ) )

The many-to-many relationships can then be addressed. Position ( P ) may require any number of Skills ( S ), each with it's own Skill Level ( L ). Conversely, a specific Skill ( S ) can be required for any number of Positions ( P ), at any particular Skill Level ( L ). These associations manifest as a ( is a ) Position Skill ( PS ), effectively a "surrogate" container, having a compound attribute of Skill Level ( L ). To use the phrasing above, each of these newly established Position Skills ( PS ) has a required Skill Level ( L ). To maintain this many-to-one relationship between Position Skill ( PS ) and Level ( L ), a unique key is to be placed on the foreign keys Position ( P ) and Skill ( S ).

( PS: ( { U₁ }( P: PC, PT, ( D: DC, DN ) ), { U₁ }( S: SC, SN, ( G: GC, GN ) ), ( L: LC, LN ) )

A quick note: The { U_n } is used to indicate both attributes ( themselves containers ) are part of a single unique key. If the attributes were independently unique, we could use { U₁ } and { U₂ }.

Next, an Employee ( E ) can occupy any number of Positions ( P ), indicating that there is a Employee Position noun to be added to our list. This particular relationship manifests over some period of time, thus having ( has a ) Start Date ( T1 ) and End Date ( T2 ). However, there is no way to assert that a normally part-time Employee ( E ) could not be serving under two separate Positions ( P ) in a full-time manner simultaneously in the interim before a new hire, nor that a previous Employee ( E ) could be re-hired or otherwise return to a Position ( P ) previously held by that Employee ( E ). This dilemma results in a true many-to-many relationship, with Start Date ( T1 ) and End Date ( T2 ) as attributes. As Start Date ( T1 ) and End Date ( T2 ) are not functionally dependent on anything other than the relationship between Employee ( E ) and Position ( P ), this structure is not in violation of 3NF.

( EP: ( E: EC, EF, EL ), ( P: PC, PT, ( D: DC, DN ) ), T1, T2 )

The final piece comes in the form of the relationship between an Employee's ( E ) specific term of employment, Employee Position ( EP ), the relevant Position Skills ( PS ) and the quality at which those specific Position Skills ( PS ) are realized, or simply, the achieved Level ( L ). Much like the relationship established between Position ( P ) and Skill ( S ), there is a Employee Position Position Skill ( EPPS ) which has a compound attribute of Level ( L ) in a many-to-one relationship. This again requires a unique key, leaving our noun list ( now expanded ) in the following state:

( EPPS:

    { U₂ }( EP:

        ( E: EC, EF, EL ),

        ( P: PC, PT, ( D:

            DC, DN ) ),

        T1, T2 ),

    { U₂ }( PS:

        ( { U₁ }( P:

            PC, PT, ( D:

                DC, DN ) ),

        { U₁ }( S:

            SC, SN, ( G:

                GC, GN ) ),

        ( L:

            LC, LN ) ),

    ( L:

        LC, LN ) )

That, of course, is a bit on the crazy side to read, so we extract the definitions we've reduced the problem to in an actionable way. Here, I've added additional unique keys simply for the purpose of completeness. Note that there are no soon-to-be tables which contain more than one independently unique key, so { U₁ } is used throughout, per table:

( D: { U₁ }DC, DN )

( P: { U₁ }D, { U₁ }PC, PN )

( G: { U₁ }GC, GN )

( S: { U₁ }G, { U₁ }SC, SN )

( L: { U₁ }LC, LN )

( PS: { U₁ }P, { U₁ }S, L )

( E: { U₁ }EC, EF, EL )

( EP: { U₁ }E, { U₁ }P, T1, T2 }

( EPPS: { U₁ }EP, { U₁ }PS, L }

PostgreSQL – Indexing for Attribute Combinations in JSON Array

Why?

The expressed requirement that artist and title must match in the same element of the JSON array is not reflected in your query, which finds all rows where at least one element matches the artist and another (possibly a different one) matches the title.

The example data for your first case was inconclusive, since the query cannot fail this way for a single array element. Your second example demonstrates the case well, though.

Solution in Postgres 9.3

There are various ways to fix this. One way would be to translate each json array element to an SQL array of composite type consisting of artist and title and match the whole type as one.

Another way would be to keep indexes like you have now (or even a single composite spanning both expressions: (json2arr(data, 'artist'), json2arr(data, 'title')). Your current query identifies possible matches. Unnest the json array data for all identified candidates and check whether both artist and title match on the same element. May or may not be efficient enough.

Better use `jsonb` in n Postgres 9.4

I am not going into detail for json / pg 9.3, because the release of jsonb in Postgres 9.4 with advanced indexing capabilities mostly obsoleted the problem. This can be implemented much simpler and more efficient with a native GIN index on a jsonb column. There are various options. To optimize for the presented case:

CREATE TABLE tracks (id serial, data jsonb);

INSERT INTO tracks (id, data)  VALUES
  (1, '[{"artist": "Simple Plan",    "title": "Welcome to My Life"}]')
, (2, '[{"artist": "Another Artist", "title": "Welcome to My Life"},
        {"artist": "Simple Plan",    "title": "Perfect"}]');

Index:

CREATE INDEX tracks_data_gin_idx ON tracks USING gin (data jsonb_path_ops);

Query:

SELECT * FROM tracks
WHERE  data @> '[{"artist": "Simple Plan", "title": "Welcome to My Life"}]';

Sequence of attributes in the JSON value and insignificant whitespace don't matter for jsonb. I added detailed information for jsonb to the referenced answer on SO:

Index for finding an element in a JSON array