MySQL – Best Way to Store Correlation Data for Searches

MySQL

This is a long post where I outline a problem I faced recently and the steps I went to solving it. I ask for feedback on the solution I came up with and for recommendations for improvements.

Using an FP-growth algorithm in pyspark I created a dataset of correlated keywords for cities around the US, this would look something like

java&javascript, fort collins, python, .85

java, los angeles, c++, .65

python&php&mysql, New York, java, .4

This means in job postings in fort collins when both java and javascript appeared, python was likely to also appear with a correlation value of .85, and for jobs in los angeles that had java, c++ was likely to appear.

The problem I had with storing this data was how to search it in a table, The following table schema
id, keyword_id, city_id, correlated_keyword_id, value

Does not work as the number of keyword ids can range from 1 to N (where N is the total number of keywords) Having a table like

id, keyword1_id, keyword2_id … keywordn_id

(where anything after keyword1 would be null by default) seems like an improper solution and not very scalable.

Another idea I had was to split it into multiple tables

id, cityid, correlated_keyword, value

id, correlation_id, keyword_id

(where correlation_id is a foreign key of the id column in the above table)
While this would work if I wanted to do a search on what was correlated with java&javascript&c++
I feel like I would need to do 3 seperate queries on the second table then do a join to find which coorrelation_ids appeared in all 3 searches.
While this seems slow it also fails to take into account if the user wanted to find the correlated keywords to the above 3 only in a certain city, as what is correlated to java&javascript&c++ varies in each city.

The last solution is to store the keywords I would search on in varchar, which would allow me to search on just java or also java&javascript&…
Again this seems like a poor solution

The idea I settled on solved all of the problems of the previous two examples but only worked because the number of keywords we had to search on was less than 64.
I gave each keyword an id of a power of 2

1, 2, 4, 8, 16 …
then the table I used to store the correlation data was

id, keyword_sum, city_id, coorelated_keyword_id, value

Then if a user wanted to search on
java&javscript&c++ I could sum those 3 keywords as that would be a unique value that no other combination of keywords would ever equal.
Also adding a multi column index on keyword_sum and city_id allowed me to do fast searches if I wanted to filter on city as well.

What are your thoughts on this solution, is this the only optimized way to allow users to search on a variable number of keywords and city? Or is there another solution I did not consider, perhaps a relational database is a poor DB to use, in that case what would be better? NOSQL, Mongo?

Best Answer

There are (at least) two ways of doing this, but only one will be shown here (see edits for old version of query):

N correlated keywords (where N is arbitrary).

I constructed a table as follows (complete DDL and DML are at the bottom of this post and also on the fiddles - PostgreSQL & MySQL):

CREATE TABLE city_lang_jp_combo
(

  c_l_combo_set INTEGER NOT NULL,

  c_id      INTEGER NOT NULL,
  l_id      INTEGER NOT NULL,
  
  c_l_combo_jp DOUBlE PRECISION NOT NULL,
  
  CONSTRAINT c_l_combo_id_pk PRIMARY KEY (c_l_combo_set, c_id, l_id),
  
  CONSTRAINT c_l_combo__jp_exists_fk FOREIGN KEY (c_id, l_id)
    REFERENCES city_language (c_id, l_id)

);

and it's populated as follows:

INSERT INTO city_lang_jp_combo 
VALUES
(1, 101, 1111, 0.61),  
(1, 101, 2222, 0.61),  
(1, 101, 3333, 0.61),

(2, 102, 1111, 0.62),
(2, 102, 2222, 0.62),
(2, 102, 3333, 0.62),

...
... snipped ~ 35 rows for brevity
...

The first set of three records means that for a set 1 (1 is arbitrarily assigned, could be any INT which doesn't occur elsewhere in the table), for city 101 (Dublin) and languages 1111, 2222 & 3333 (Java, C & C++) that the probability of the three of them occurring together is 0.61 in job postings.

I also have a 4 language combination and a 2 language combination at the bottom as follows:

(13, 101, 1111, 0.25),
(13, 101, 2222, 0.25),
(13, 101, 3333, 0.25),
(13, 101, 4444, 0.25),  <=== Python

and also a two language combination:

(14, 101, 5555, 0.05),  <=== Nim
(14, 101, 6666, 0.05);  <=== Clojure

So, first I run this SQL:

SELECT 
  c.city_name AS cname, 
  cljc.c_l_combo_set AS combo, 
  cljc.c_id AS cid, 
  l.lang_name AS ln,
  cljc.c_l_combo_jp AS prob
FROM  city_lang_jp_combo cljc
JOIN language l 
  ON l.lang_id = cljc.l_id
JOIN city c
  ON c.city_id = cljc.c_id
ORDER BY cname, combo;

Result:

cname   combo   cid     ln         prob
Dublin      1   101     C++        0.61
Dublin      1   101     Java       0.61
Dublin      1   101     C          0.61
Dublin      5   101     Python     0.51
Dublin      5   101     C          0.51
Dublin      5   101     Java       0.51
Dublin      9   101     Python     0.41
Dublin      9   101     C++        0.41
Dublin      9   101     C          0.41

Dublin     13   101     Java       0.25  <<-- 4 language combo, Java,
Dublin     13   101     C          0.25  <<-- C, C++ & Python - low p-Value
Dublin     13   101     C++        0.25
Dublin     13   101     Python     0.25

Dublin     14   101     Nim        0.05  <<-- two language combo - Nim  - v. low p-Value
Dublin     14   101     Clojure    0.05  <<-- and Clojure

Madrid      4   104     C          0.64
...
... snipped
...

And, using this query, I perform a STRING_AGG (GROUP_CONCAT in MySQL) using the above as a subquery as follows:

SELECT 
  t.cname AS "City name", 
  STRING_AGG(t.ln, ', ' ORDER BY t.cname, t.combo, ln) AS "Languages", 
  t.prob AS "Probability"
FROM
(
  SELECT 
    c.city_name AS cname, 
    cljc.c_l_combo_set AS combo, 
    cljc.c_id AS cid, 
    l.lang_name AS ln,
    cljc.c_l_combo_jp AS prob
  FROM  city_lang_jp_combo cljc
  JOIN language l 
    ON l.lang_id = cljc.l_id
  JOIN city c
    ON c.city_id = cljc.c_id
  -- ORDER BY cname, combo
) AS t
GROUP BY t.cname, t.combo, t.prob
ORDER BY "City name", "Languages";

Result:

City name   Languages           Probability
Dublin  C, C++, Java                   0.61
Dublin  C, C++, Java, Python           0.25
Dublin  C, C++, Python                 0.41
Dublin  C, Java, Python                0.51
Dublin  Clojure, Nim                   0.05
Madrid  C, C++, Java                   0.64
Madrid  C, C++, Python                 0.44
Madrid  C, Java, Python                0.54
Paris   C, C++, Java                   0.62
Paris   C, C++, Python                 0.42
Paris   C, Java, Python                0.52
Rome    C, C++, Java                   0.63
Rome    C, C++, Python                 0.43
Rome    C, Java, Python                0.53

I did a performance analysis on both the PostgreSQL and the MySQL queries - they are pretty similar and the new query takes (on average) 20% - 25% as long as my first query (see the fiddle).

There's no need to use an MD5 hash or anything else to ensure uniqueness - the language names are unique by virtue of the UNIQUE index on the language table so a combination of these in lang_id or lang_name order is guaranteed to be unique - and the lang_name strings are human-readable - never underestimate the value of this factor!

Your main problem is going to be enforcing UNIQUEness - there's nothing stopping the records

Dublin     999  101     Nim        0.23  <<-- two language combo - Nim  - v. low p-Value
Dublin     999  101     Clojure    0.23  <<-- and Clojure

being inserted - and obviously only 1 of the two sets of records can be correct, but we can't enforce uniqueness over sets of records. You could create a TEMPORARY table with fields the same as the resultset above and a UNIQUE constraint on "City name" and "Languages". Then try to INSERT the query result into that table and see if it fails on the constraint... Of course, there are all sorts of ways of doing this programmatically and/or through the use of TRIGGERs.

Another problem is that there is no way of ensuring that the p-Values are realistic - i.e. if p(C & Python) is 0.6 (say), then p(C & Python & C++) can be at most 0.6 (every job posting with the first two also has C++). This will have to be enforced in code...

p.s. welcome to the forum!


Full DDL and DML...

CREATE TABLE city
(
  city_id   INTEGER NOT NULL GENERATED BY DEFAULT AS IDENTITY,
  city_name TEXT NOT NULL,
  city_pop  INTEGER NOT NULL,
  --
  --  more fields
  --
  CONSTRAINT city_pk PRIMARY KEY (city_id),
  CONSTRAINT city_name_uq UNIQUE (city_name) -- might not always be the case,
    -- just have 'Dublin (Ire.)' and ('Dublin (Cal.)') for example
);

Populate:

INSERT INTO city (city_name, city_pop) VALUES
('Dublin', 500000), ('Paris', 2100000), ('Rome', 3000000), ('Madrid', 3250000);

The population might be useful for calculating confidence intervals and the like...

CREATE TABLE language
(
  lang_id    INTEGER GENERATED BY DEFAULT AS IDENTITY,
  lang_name  TEXT NOT NULL,
  
  CONSTRAINT language_pk PRIMARY KEY (lang_id),
  CONSTRAINT lang_name_uq UNIQUE (lang_name)
);

Populate:

INSERT INTO language (lang_id, lang_name) VALUES
(1111, 'Java'), (2222, 'C'), (3333, 'C++'), (4444, 'Python'), (5555, 'Nim'), (6666, 'Clojure');

The language_name's UNIQUE constraint means that the result of query will be unique for any combination of languages when ordered by l.lang_name (or lang_id separated by a comma - must have the comma for numbers because 1 || 11 || 111 is identical to 11 || 11 || 11).

CREATE TABLE city_language
(
  c_id     INTEGER NOT NULL GENERATED BY DEFAULT AS IDENTITY,
  l_id     INTEGER NOT NULL,
  c_l_prob DOUBLE PRECISION NOT NULL,
  
  CONSTRAINT city_language_pk PRIMARY KEY (c_id, l_id)
  
);

This table is a sort of "insurance policy" - the c_id, l_id combination act as a FOREIGN KEY for the table below (combinations of langauges) - how can you have combinations of languages if you don't have at least have 1 of them?

However, as discussed above, we have no way of knowing if the p-Value for 1 language (per city) - say C - is <= p (2 languages) - (say C & C++) for that same city.

Populate city_language:

INSERT INTO city_language (c_id, l_id, c_l_prob) VALUES 

(101, 1111, 0.91),  (101, 2222, 0.92),  (101, 3333, 0.93),  (101, 4444, 0.94),
(102, 1111, 0.81),  (102, 2222, 0.82),  (102, 3333, 0.83),  (102, 4444, 0.84),
(103, 1111, 0.751), (103, 2222, 0.752), (103, 3333, 0.753), (103, 4444, 0.754),
(104, 1111, 0.701), (104, 2222, 0.702), (104, 3333, 0.703), (104, 4444, 0.704),
(101, 5555, 0.2), (101, 6666, 0.10);

Final table (this is the main table):

CREATE TABLE city_lang_jp_combo
(

  c_l_combo_set INTEGER NOT NULL,

  c_id      INTEGER NOT NULL,
  l_id      INTEGER NOT NULL,
  
  c_l_combo_jp DOUBlE PRECISION NOT NULL,
  
  CONSTRAINT c_l_combo_id_pk PRIMARY KEY (c_l_combo_set, c_id, l_id),
  
  CONSTRAINT c_l_combo__jp_exists_fk FOREIGN KEY (c_id, l_id)
    REFERENCES city_language (c_id, l_id)
    
    -- if this doesn't exist, then how can you have a triple combination?
);

Populate it:

INSERT INTO city_lang_jp_combo VALUES

 (1, 101, 1111, 0.61),  -- check back to city_language p !> in there...
 (1, 101, 2222, 0.61),  -- ensure that there are at least 3 records?
 (1, 101, 3333, 0.61),

 (2, 102, 1111, 0.62),
 (2, 102, 2222, 0.62),
 (2, 102, 3333, 0.62),

 (3, 103, 1111, 0.63),
 (3, 103, 2222, 0.63),
 (3, 103, 3333, 0.63),

 (4, 104, 1111, 0.64),
 (4, 104, 2222, 0.64),
 (4, 104, 3333, 0.64),
 
 (5, 101, 1111, 0.51),
 (5, 101, 2222, 0.51),
 (5, 101, 4444, 0.51),

 (6, 102, 1111, 0.52),
 (6, 102, 2222, 0.52),
 (6, 102, 4444, 0.52),


 (7, 103, 1111, 0.53),
 (7, 103, 2222, 0.53),
 (7, 103, 4444, 0.53),

 (8, 104, 1111, 0.54),
 (8, 104, 2222, 0.54),
 (8, 104, 4444, 0.54),

 (9, 101, 2222, 0.41),
 (9, 101, 3333, 0.41),
 (9, 101, 4444, 0.41),

(10, 102, 2222, 0.42),
(10, 102, 3333, 0.42),
(10, 102, 4444, 0.42),

(11, 103, 2222, 0.43),
(11, 103, 3333, 0.43),
(11, 103, 4444, 0.43),

(12, 104, 2222, 0.44),
(12, 104, 3333, 0.44),
(12, 104, 4444, 0.44),


(13, 101, 1111, 0.25),
(13, 101, 2222, 0.25),
(13, 101, 3333, 0.25),
(13, 101, 4444, 0.25),

(14, 101, 5555, 0.05),
(14, 101, 6666, 0.05);

Discussion (also in fiddle):

--
-- Now, consider cl_combo_set = 777 below - it's 
-- essentially a duplicate of the cl_combo_set = 1
--
-- But the problem is, how does one enforce uniqueness over a set of
-- records?
--


-- (777, 101, 1111, 0.33),
-- (777, 101, 2222, 0.33),
-- (777, 101, 3333, 0.33),

-- 
-- BUT, if I add a 4th
-- record to the set, the UNIQUE constraint no longer applies because
-- we've added a 4th language---
--
-- say, like this:
--
-- (888, 101, 1111, 0.15),
-- (888, 101, 2222, 0.15),
-- (888, 101, 3333, 0.15),
-- (888, 101, 5555, 0.15)
--
--
-- The p-Value of a combination of languages must be **AT MOST** <= the 
-- p-Value of the lowest p-Value for any of the languages - how to enforce?
--
--
-- For example, SQL (appears to me anyway) to be "tacked on" to virtually
-- every job ad - so, you might have (for Dublin - city code 101) -
-- SQL code say 
--
-- Now, this is a valid set of records for the 3 languages Java, C & C++
-- (999, 101, 1111, 0.33),
-- (999, 101, 2222, 0.33),
-- (999, 101, 3333, 0.33);
--
-- and say, SQL was language code 7777, one could have:
--
-- (1000, 101, 1111, 0.33),
-- (1000, 101, 2222, 0.33),
-- (1000, 101, 3333, 0.33),
-- (1000, 101, 7777, 0.33);
--
-- this would **_ALSO_** be a valid set of records for Dublin - difficlt
-- to enforce uniqueness constraints for this and also the CHECK that the
-- p-Value is less than or equal to (at most) the p-Value for any of them.