Postgresql – How to do this binary appearance analysis

jsonpostgresql

I have a table named results:

integer: id
json: websites

An example of websites column is:

[{"url": "http://google.com"}, {"url": "http://example.com"}]

What I want to do is generate a binary appearance chart, where it shows a 1 or a 0 depending on if ANY url in the JSON blob is contained within another set.

All the possibilities are on columns, and all the results are as rows.

result_id | wikipedia.org | google.com | example.com
10         1               1            1
11         0               1            0
12         0               0            0
13         1               0            0
  • For it to return 1, ANY url in the JSON array object (with key ->>{'url'}) must match the domain
  • The URLs are full urls so ideally I would like to use regex to check if its a match (i can't do a full string match because I only care about the HOST of the URL). I'm thinking something like /https?:\/\/(example\.com)(/|?|$)/

Best Answer

Assuming data type jsonb and current Postgres 10.

Exact matches are simpler, since you can use the built-in jsonb contains operator @>:

SELECT id
     , (websites @> '[{"url": "http://wikipedia.org"}]')::int AS "wikipedia.org"
     , (websites @> '[{"url": "http://google.com"}]'   )::int AS "google.com"
     , (websites @> '[{"url": "http://example.com"}]'  )::int AS "example.com"
FROM   results
ORDER  BY id;

The cast to integer only to meet your desired form of 1/0. Or omit the cast and just use the boolean t/f instead.

Pattern matching is more sophisticated, since there is no direct jsonb operator:

SELECT id
     , count(*) FILTER (WHERE url ~ '^https?://wikipedia\.org\M') AS "wikipedia.org"
     , count(*) FILTER (WHERE url ~ '^https?://google\.com\M')    AS "google.com"
     , count(*) FILTER (WHERE url ~ '^https?://example\.com\M')   AS "example.com"
FROM   results r
LEFT   JOIN LATERAL (
   SELECT elem->>'url' AS url
   FROM   jsonb_array_elements(r.websites) w(elem)
   ) w ON true
GROUP  BY id
ORDER  BY id;

db<>fiddle here

Replace count(*) FILTER ... with bool_or(url ~ '^https?://wikipedia\.org\M') etc. if URLs are never contained more than once and/or you are not interested in the actual count.

Related:

Related Question