We have a set of large tables (millions of records) on a MySQL DB, having schema like so (simplified):
T1: id, uid, rid, text1, text2, int1, int2, ...
T2: id, rid, tag_id, created_at
T3: id, owner_id, tag_name
Indices:
T1: Primary(id), unique(uid,rid), index(rid), index(uid,int1)
T2: Primary(id), unique(tag_id,rid), index(tag_id), index(created_at)
T3: Primary(id), unique(owner_id,tag_name)
And a requirement to do a select which returns 'rids' having tag_name = XX but not YY:
SELECT t1.rid
FROM t1
LEFT JOIN t2 ON t1.rid = t2.rid
LEFT JOIN t3 ON t2.tag_id = t3.id
WHERE t1.uid = 123
AND t1.int1 = 3
AND t3.tag_name eq 'XX'
AND t3.tag_name != 'YY'
LIMIT 100
This naturally does not work, since the WHERE
does not eliminate an rid
having more than one tag. How can we achieve this with performance in mind for the large tables?
More about data:
A user represented by uid will have about 100,000 records in T1, out of which about 10% have T2 records 10,000 (rids which are tagged), and less than 10 tags in T3.
There are 1000s of users (uids) in T1.
A given rid can be one of:
- ) Has a single tag –> a single T2 record
- ) Has multiple tags –> Multiple t2 records
- ) Has no tags –> 0 T2 records
We can also alter the table structure and indices for T2 and T3 to accommodate for that, as long as we maintain ability to filter 'tags' and 'T2' creation time.
Best Answer
You can use
NOT EXISTS
as:Your index T2:index(tag_id) is already covered by T2:unique(tag_id,rid) som you can get rid of that
I don't work much with MySQL, but I get the impression that JOINs are often preferred over EXISTS/NOT EXISTS. Translating the query (Note the DISTINCT):