This is a long post in which I outline a problem I faced recently and the steps I took to solve it. I am asking for feedback on the solution I came up with and for recommendations for improvements.
Using the FP-growth algorithm in PySpark, I created a dataset of correlated keywords for cities around the US. It looks something like:
java&javascript, fort collins, python, .85
java, los angeles, c++, .65
python&php&mysql, New York, java, .4
This means that in job postings in Fort Collins, when both java and javascript appeared, python was likely to also appear, with a correlation value of .85; and in jobs in Los Angeles that had java, c++ was likely to appear.
The problem I had with storing this data was how to search it in a table. The following table schema
id, keyword_id, city_id, correlated_keyword_id, value
does not work, as the number of keyword ids can range from 1 to N (where N is the total number of keywords). Having a table like
id, keyword1_id, keyword2_id … keywordn_id
(where anything after keyword1 would be null by default) seems like an improper solution and is not very scalable.
Another idea I had was to split it into multiple tables
id, cityid, correlated_keyword, value
id, correlation_id, keyword_id
(where correlation_id is a foreign key of the id column in the above table)
While this would work if I wanted to search on what was correlated with java&javascript&c++,
I feel like I would need to run 3 separate queries on the second table and then do a join to find which correlation_ids appeared in all 3 result sets.
Besides seeming slow, this also fails to take into account the case where the user wants the keywords correlated to the above 3 only in a certain city, as what is correlated to java&javascript&c++ varies from city to city.
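For what it's worth, the multi-table design can be searched without running three separate queries: the standard "relational division" pattern (GROUP BY ... HAVING COUNT) matches all searched keywords in one query, and the city filter is just another WHERE condition. Here is a minimal sketch using SQLite via Python; all table and column names are my assumptions about the schema described above, and the extra subquery enforces an exact-set match rather than a superset match:

```python
# Sketch (assumed schema/names): one-query search of the two-table design
# using the GROUP BY ... HAVING COUNT "relational division" pattern.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE correlation (
    id INTEGER PRIMARY KEY,
    city_id INTEGER,
    correlated_keyword_id INTEGER,
    value REAL
);
CREATE TABLE correlation_keyword (
    correlation_id INTEGER REFERENCES correlation(id),
    keyword_id INTEGER
);
-- java=1, javascript=2, c++=3, python=4; city 10 = Fort Collins (made up)
INSERT INTO correlation VALUES (100, 10, 4, 0.85);
INSERT INTO correlation_keyword VALUES (100, 1), (100, 2), (100, 3);
-- a second correlation keyed on java & javascript only
INSERT INTO correlation VALUES (200, 10, 4, 0.40);
INSERT INTO correlation_keyword VALUES (200, 1), (200, 2);
""")

search = [1, 2, 3]  # java & javascript & c++
placeholders = ",".join("?" * len(search))
rows = conn.execute(f"""
    SELECT c.id, c.correlated_keyword_id, c.value
    FROM correlation c
    JOIN correlation_keyword ck ON ck.correlation_id = c.id
    WHERE ck.keyword_id IN ({placeholders})
      AND c.city_id = ?                          -- optional city filter
    GROUP BY c.id, c.correlated_keyword_id, c.value
    HAVING COUNT(DISTINCT ck.keyword_id) = ?     -- matched every searched keyword
       AND (SELECT COUNT(*) FROM correlation_keyword k2
            WHERE k2.correlation_id = c.id) = ?  -- exact set, not a superset
""", (*search, 10, len(search), len(search))).fetchall()
print(rows)  # -> [(100, 4, 0.85)]
```

With an index on correlation_keyword (keyword_id, correlation_id) this is a single indexed query regardless of how many keywords are searched.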
The last solution was to store the keywords I would search on in a varchar column, which would allow me to search on just java or on java&javascript&…
Again, this seems like a poor solution.
The idea I settled on solved all of the problems of the previous two examples, but it only worked because the number of keywords we had to search on was less than 64.
I gave each keyword an id of a power of 2
1, 2, 4, 8, 16 …
then the table I used to store the correlation data was
id, keyword_sum, city_id, correlated_keyword_id, value
Then if a user wanted to search on
java&javascript&c++, I could sum those 3 keyword ids, as the sum is a unique value that no other combination of keywords would ever equal.
Adding a multi-column index on keyword_sum and city_id also allowed me to do fast searches when filtering on city as well.
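The scheme above can be sketched in a few lines. Since each keyword id is a distinct power of 2, the sum of any set of ids is just the bitwise-OR of the set, so every combination maps to a different integer (the keyword names and table/column names below are illustrative):

```python
# Sketch of the power-of-two keyword-id scheme. Each keyword gets a
# distinct power of 2, so the sum over any set of keywords is a unique
# integer (effectively the bitwise OR / set's binary representation).
from itertools import combinations

KEYWORDS = ["java", "javascript", "c++", "python", "php", "mysql"]
KEYWORD_ID = {kw: 1 << i for i, kw in enumerate(KEYWORDS)}  # 1, 2, 4, 8, ...

def keyword_sum(keywords):
    """Combine a set of keywords into the unique lookup key."""
    return sum(KEYWORD_ID[kw] for kw in set(keywords))

# java=1, javascript=2, c++=4  ->  7; no other combination sums to 7
assert keyword_sum(["java", "javascript", "c++"]) == 7

# The stored row would then be looked up with something like:
#   SELECT correlated_keyword_id, value
#   FROM correlations
#   WHERE keyword_sum = 7 AND city_id = :city_id
# backed by a composite index on (keyword_sum, city_id).

# Uniqueness check across every non-empty subset of the vocabulary:
sums = [keyword_sum(c) for r in range(1, 7) for c in combinations(KEYWORDS, r)]
assert len(sums) == len(set(sums))
```

The 64-keyword limit in the post comes from a signed 64-bit integer column: the largest keyword id would be 2^62 (or 2^63 unsigned), after which the sums overflow.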
What are your thoughts on this solution? Is this the only optimized way to allow users to search on a variable number of keywords plus a city? Or is there another solution I did not consider? Perhaps a relational database is a poor fit here; in that case, what would be better? NoSQL, Mongo?
Best Answer
There are (at least) two ways of doing this, but only one will be shown here (see edits for old version of query):
N correlated keywords (where N is arbitrary).
I constructed a table as follows (complete DDL and DML are at the bottom of this post and also on the fiddles - PostgreSQL & MySQL):
and it's populated as follows:
The first set of three records means that, for set 1 (1 is arbitrarily assigned; it could be any INT which doesn't occur elsewhere in the table), for city 101 (Dublin) and languages 1111, 2222 & 3333 (Java, C & C++), the probability of the three of them occurring together in job postings is 0.61.
I also have a 4 language combination and a 2 language combination at the bottom as follows:
and also a two language combination:
So, first I run this SQL:
Result:
And, using this query, I perform a STRING_AGG (GROUP_CONCAT in MySQL) using the above as a subquery as follows:
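The original fiddle SQL isn't reproduced in this extract, so here is a hedged sketch of the approach just described, with assumed table and column names: one row per (set, city, language), then the language names are collapsed into a single human-readable key per combination. SQLite's GROUP_CONCAT stands in for PostgreSQL's STRING_AGG (in PostgreSQL you would write STRING_AGG(lang_name, ', ' ORDER BY lang_name)):

```python
# Hedged reconstruction (names assumed) of the set-based schema and the
# GROUP_CONCAT/STRING_AGG aggregation described in the answer.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE language (lang_id INTEGER PRIMARY KEY, lang_name TEXT UNIQUE);
CREATE TABLE city     (city_id INTEGER PRIMARY KEY, city_name TEXT);
CREATE TABLE lang_combo (          -- main table: one row per set member
    set_id  INTEGER,
    city_id INTEGER REFERENCES city(city_id),
    lang_id INTEGER REFERENCES language(lang_id),
    p       REAL
);
INSERT INTO language VALUES (1111, 'Java'), (2222, 'C'), (3333, 'C++');
INSERT INTO city VALUES (101, 'Dublin');
-- set 1: Java & C & C++ occur together in Dublin postings with p = 0.61
INSERT INTO lang_combo VALUES
    (1, 101, 1111, 0.61), (1, 101, 2222, 0.61), (1, 101, 3333, 0.61);
""")

row = conn.execute("""
    SELECT ci.city_name                     AS "City name",
           GROUP_CONCAT(l.lang_name, ', ')  AS "Languages",
           MIN(lc.p)                        AS p
    FROM lang_combo lc
    JOIN language l  ON l.lang_id  = lc.lang_id
    JOIN city     ci ON ci.city_id = lc.city_id
    GROUP BY lc.set_id, ci.city_name
""").fetchone()
print(row)  # e.g. ('Dublin', 'Java, C, C++', 0.61)
```

Note that plain SQLite GROUP_CONCAT does not guarantee element order, which is why the answer stresses ordering by lang_name (or lang_id) when you need the concatenated string to act as a unique key.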
Result:
I did a performance analysis on both the PostgreSQL and MySQL queries - they are pretty similar, and the new query takes (on average) 20% - 25% as long as my first query (see the fiddle).
There's no need to use an MD5 hash or anything else to ensure uniqueness - the language names are unique by virtue of the UNIQUE index on the language table, so a combination of these in lang_id or lang_name order is guaranteed to be unique - and the lang_name strings are human-readable - never underestimate the value of this factor!
Your main problem is going to be enforcing UNIQUEness - there's nothing stopping duplicate sets of records being inserted - and obviously only 1 of the two sets of records can be correct, but we can't enforce uniqueness over sets of records. You could create a TEMPORARY table with fields the same as the resultset above and a UNIQUE constraint on "City name" and "Languages", then try to INSERT the query result into that table and see if it fails on the constraint... Of course, there are all sorts of ways of doing this programmatically and/or through the use of TRIGGERs.
Another problem is that there is no way of ensuring that the p-Values are realistic - i.e. if p(C & Python) is 0.6 (say), then p(C & Python & C++) can be at most 0.6 (every job posting with the first two also has C++). This will have to be enforced in code...
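The TEMPORARY-table check described above can be sketched as follows (all names are assumptions, and SQLite stands in for PostgreSQL/MySQL): insert the aggregated (city, languages) rows into a table with a UNIQUE constraint, and a duplicate combination surfaces as a constraint violation.

```python
# Sketch of the TEMPORARY-table uniqueness check: a UNIQUE constraint on
# (city_name, languages) makes the INSERT of a duplicate combination fail.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TEMPORARY TABLE combo_check (
        city_name TEXT,
        languages TEXT,
        p REAL,
        UNIQUE (city_name, languages)
    )
""")
conn.execute("INSERT INTO combo_check VALUES ('Dublin', 'C, C++, Java', 0.61)")
try:
    # a second set that aggregates to the same city/language combination
    conn.execute("INSERT INTO combo_check VALUES ('Dublin', 'C, C++, Java', 0.70)")
    duplicate_detected = False
except sqlite3.IntegrityError:
    duplicate_detected = True
print(duplicate_detected)  # -> True
```

In application code you would roll the whole load back (or report the offending sets) when the constraint fires.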
p.s. welcome to the forum!
Full DDL and DML...
Populate:
The population might be useful for calculating confidence intervals and the like...
Populate:
The language_name's UNIQUE constraint means that the result of the query will be unique for any combination of languages when ordered by l.lang_name (or lang_id, separated by a comma - you must have the comma for numbers because 1 || 11 || 111 is identical to 11 || 11 || 11).
This table is a sort of "insurance policy" - the c_id, l_id combination acts as a FOREIGN KEY for the table below (combinations of languages) - how can you have combinations of languages if you don't have at least 1 of them?
However, as discussed above, we have no way of knowing if the p-Value for 2 languages (per city) - say C & C++ - is <= the p-Value for 1 language - say C - for that same city.
Populate city_language:
Final table (this is the main table):
Populate it:
Discussion (also in fiddle):