Storing multiple tags on analytics database

indexindex-tuningperformanceredshift

I would like to store user purchase custom tags on each transaction, example if user bought shoes then tags are "SPORTS", "NIKE", SHOES, COLOUR_BLACK, SIZE_12,..

These tags are that seller interested in querying back to understand the sales.

My idea is when ever new tag comes in create new code(something like hashcode but sequential) for that tag, and code starts from "a-z" 26 letters then "aa, ab, ac...zz" goes on. Now keep all the tags given for in one transaction in the one column called tag (varchar) by separating with "|".

Let us assume mapping is (at application level)

"SPORTS" = a
"TENNIS" = b
"CRICKET" = c
...
...
"NIKE"  = z        //Brands company
"ADIDAS" = aa
"WOODLAND" = ab
...
...
SHOES   = ay
...
...
COLOUR_BLACK = bc
COLOUR_RED = bd
COLOUR_BLUE = be
...
SIZE_12 = cq
...

So storing the above purchase transaction, tag will be like tag="|a|z|ay|bc|cq|" And now allowing seller to search number of SHOES sold by adding WHERE condition tag LIKE %|ay|%. Now the problem is i cannot use index (sort key in redshift db) for "LIKE starts with %". So how to solve this issue, since i might have 100 millions of records? dont want full table scan..

any solution to fix this?

Update_1:
I have not followed bridge table concept (cross-reference table) since I want to perform group by on the results after searching the specified tags. My solution will give only one row when two tags matched in a single transaction, but bridge table will give me two rows? then my sum() will be doubled.

I got suggestion like below

EXISTS (SELECT 1 FROM transaction_tag WHERE tag_id = 'zz' and trans_id
= tr.trans_id) in the WHERE clause once for each tag (note: assumes tr is an alias to the transaction table in the surrounding query)

I have not followed this; since i have to perform AND and OR condition on the tags, example ("SPORTS" AND "ADIDAS") —- "SHOE" AND ("NIKE" OR "ADIDAS")

Update_2:
I have not followed bitfield, since dont know redshift has this support also I assuming if my system will be going to have minimum of 3500 tags, and allocating one bit for each; which results in 437 bytes for each transaction, though there will be only max of 5 tags can be given for a transaction. Any optimisation here?

Solution_1:

I have thought of adding min (SMALL_INT) and max value (SMALL_INT) along with tags column, and apply index on that.

so something like this

"SPORTS" = a = 1
"TENNIS" = b = 2
"CRICKET" = c = 3
...
...
"NIKE"  = z  = 26
"ADIDAS" = aa = 27

So my column values are

`tag="|a|z|ay|bc|cq|"` //sorted?
`minTag=1`
`maxTag=95` //for cq

And query for searching shoe(ay=51) is

maxTag <= 51 AND tag LIKE %|ay|%

And query for searching shoe(ay=51) AND SIZE_12 (cq=95) is

minTag >= 51 AND maxTag <= 95 AND tag LIKE %|ay|%|cq|%

Will this give any benefit? Kindly suggest any alternatives.

Best Answer

I'm still convinced that using a many-to-many lookup table (a bridge table) is still your best option here. Your concern about matching multiple rows can be fixed by proper query design. Let's say your tables are:

CREATE TABLE purchases(PurchaseID,CustomerID,PurchaseDate,...)
CREATE TABLE tags(TagID,TagType,TagName)
CREATE TABLE purchasetags(PurchaseID,TagID)

So each purchase can have multiple tags set (no limit), and just for fun, I added the ability to categorized the tags by TagType, maybe that contains things like "ProductType", "Brand", "Color", "Sport", so you have a way to tell that "shoes" is a "ProductType" tag, "Nike" is a brand tag, and "soccer" is a sport tag.

Then if you want to query (and return just single rows), just do:

SELECT *
FROM purchases 
WHERE PurchaseID IN (SELECT pt.PurchaseID 
                     FROM purchasetag pt
                     INNER JOIN tags t ON pt.TagID=t.TagID
                     WHERE t.TagName IN ('Adidas','Nike'))
GROUP BY whatever...

If you need to do fancier combo-searches (find purchases of Nike shoes or Adidas shoes, your query will have to also be fancier:

SELECT *
FROM purchases 
WHERE PurchaseID IN (SELECT pt.PurchaseID 
                     FROM purchasetag pt
                     INNER JOIN tags t ON pt.TagID=t.TagID
                     WHERE t.TagName = 'Shoes')
AND   PurchaseID IN (SELECT pt.PurchaseID 
                     FROM purchasetag pt
                     INNER JOIN tags t ON pt.TagID=t.TagID
                     WHERE t.TagName IN ('Adidas','Nike'))

Again, that's still returning a single row for each purchase that matches your desired tag combination.