Mysql – WHERE on SQL with JOIN on large tables

join; mysql; where

We have a set of large tables (millions of records) on a MySQL DB, having schema like so (simplified):

T1: id, uid, rid, text1, text2, int1, int2, ...
T2: id, rid, tag_id, created_at
T3: id, owner_id, tag_name

Indices:

T1: Primary(id), unique(uid,rid), index(rid), index(uid,int1)
T2: Primary(id), unique(tag_id,rid), index(tag_id), index(created_at)
T3: Primary(id), unique(owner_id,tag_name)

And a requirement to do a select which returns 'rids' having tag_name = XX but not YY:

SELECT t1.rid 
FROM t1 
LEFT JOIN t2 ON t1.rid = t2.rid 
LEFT JOIN t3 ON t2.tag_id = t3.id 
WHERE t1.uid = 123
AND t1.int1 = 3
AND t3.tag_name = 'XX'
AND t3.tag_name != 'YY'
LIMIT 100

This naturally does not work, since the WHERE does not eliminate an rid having more than one tag. How can we achieve this with performance in mind for the large tables?

More about data:

A user (identified by uid) has about 100,000 records in T1, of which about 10% (roughly 10,000 rids) are tagged and therefore have T2 records, and fewer than 10 tags in T3.
There are 1000s of users (uids) in T1.

A given rid can be one of:

1) Has a single tag -> a single T2 record
2) Has multiple tags -> multiple T2 records
3) Has no tags -> 0 T2 records

We can also alter the table structure and indices for T2 and T3 to accommodate for that, as long as we maintain ability to filter 'tags' and 'T2' creation time.

Best Answer

You can use NOT EXISTS as follows:

SELECT t2.rid
FROM t2
WHERE NOT EXISTS (
    -- exclude rows whose tag is 'YY'
    SELECT 1 
    FROM t3
    WHERE t2.tag_id = t3.id
      AND t3.tag_name = 'YY'
)
AND EXISTS (
    -- keep rows whose tag is 'XX'
    SELECT 1 
    FROM t3
    WHERE t2.tag_id = t3.id
      AND t3.tag_name = 'XX'
);

Your index T2: index(tag_id) is already covered by T2: unique(tag_id,rid), because tag_id is the leading column of the unique index, so you can drop it.

I don't work much with MySQL, but I get the impression that JOINs are often preferred over EXISTS/NOT EXISTS. Translating the query (Note the DISTINCT):

SELECT DISTINCT t2.rid
FROM t2
JOIN t3 AS t31
    ON t2.tag_id = t31.id
   AND t31.tag_name = 'XX'
LEFT JOIN t3 AS t32
    ON t2.tag_id = t32.id
   AND t32.tag_name = 'YY'
WHERE t32.id IS NULL;
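
One caveat worth checking: the conditions above are evaluated per T2 row, and a single T2 row carries only one tag, so an rid tagged with both XX and YY can still come through via its XX row. A minimal sketch of a per-rid variant follows, correlating the subqueries on rid instead of tag_id. It uses Python's sqlite3 purely as a stand-in to verify the logic (not the MySQL performance), the sample rows are made up, the index on t2(rid, tag_id) is an assumption (the question allows changing T2's indices), and owner_id is ignored just as in the question's query.

```python
import sqlite3

# SQLite stands in for MySQL here only to check the query logic;
# the tables follow the simplified schema from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (id INTEGER PRIMARY KEY, uid INT, rid INT, int1 INT,
                     UNIQUE (uid, rid));
    CREATE TABLE t2 (id INTEGER PRIMARY KEY, rid INT, tag_id INT,
                     UNIQUE (tag_id, rid));
    -- Assumed extra index so the correlated lookups by rid are cheap.
    CREATE INDEX t2_rid_tag ON t2 (rid, tag_id);
    CREATE TABLE t3 (id INTEGER PRIMARY KEY, owner_id INT, tag_name TEXT,
                     UNIQUE (owner_id, tag_name));

    INSERT INTO t3 (id, owner_id, tag_name) VALUES (1, 123, 'XX'), (2, 123, 'YY');
    -- rid 10: XX only; rid 20: XX and YY; rid 30: YY only; rid 40: untagged
    INSERT INTO t1 (uid, rid, int1) VALUES (123, 10, 3), (123, 20, 3),
                                           (123, 30, 3), (123, 40, 3);
    INSERT INTO t2 (rid, tag_id) VALUES (10, 1), (20, 1), (20, 2), (30, 2);
""")

# Row-level filter: each joined row is tested on its own, so rid 20 survives
# through its XX row even though it also has a YY row.
ROW_LEVEL_QUERY = """
    SELECT DISTINCT t2.rid
    FROM t2
    JOIN t3 ON t3.id = t2.tag_id
    WHERE t3.tag_name = 'XX' AND t3.tag_name <> 'YY'
    ORDER BY t2.rid
"""

# Per-rid filter: the subqueries look at all tags of the rid.
RID_LEVEL_QUERY = """
    SELECT t1.rid
    FROM t1
    WHERE t1.uid = ? AND t1.int1 = ?
      AND EXISTS (
            SELECT 1 FROM t2 JOIN t3 ON t3.id = t2.tag_id
            WHERE t2.rid = t1.rid AND t3.tag_name = 'XX')
      AND NOT EXISTS (
            SELECT 1 FROM t2 JOIN t3 ON t3.id = t2.tag_id
            WHERE t2.rid = t1.rid AND t3.tag_name = 'YY')
    ORDER BY t1.rid
    LIMIT 100
"""

row_level_rids = [row[0] for row in conn.execute(ROW_LEVEL_QUERY)]
rids = [row[0] for row in conn.execute(RID_LEVEL_QUERY, (123, 3))]
print(row_level_rids)  # [10, 20]
print(rids)            # [10]
```

Whether MySQL executes this efficiently on the real tables depends on the actual indices and data distribution, so it is worth running it through EXPLAIN before relying on it.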

Related Solutions

Mysql – How to restructure this slow query containing subquery

Is it safe to assume that for every relevant post the tag 88 occurs exactly once and the tag 5 also occurs exactly once?

Is there an index on post_id in table post_tags?

If the answer is yes to both, something like this might work:

select
      t1.id,
      t1.name,
      count(*) - 2 as kpl
from
              post_tags pt1
   inner join post_tags pt2 on (pt2.post_id = pt1.post_id)
   inner join post_tags pt3 on (pt3.post_id = pt1.post_id)
   inner join tags t1 on (t1.id = pt3.tag_id)
where
         pt1.tag_id = 88
    and  pt2.tag_id = 5
group by
   t1.id,
   t1.name
having
   count(*) > 12;

MySQL performance issues when saving Bitcoin data

I am not aware of any Bitcoin-specific best database practices. I also see no need for them, as just general good database design will help you.

The key to achieving decent database performance is to define indices in a way that every access you ever need can take advantage of them. Where you do not succeed in this, just about any query will revert to iterating over every record in the table, which obviously becomes slow when you reach gigabytes of stored data.
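
To illustrate the difference between a full scan and an index lookup, here is a small sketch. It uses Python's sqlite3 module as a stand-in, since it runs without a server; the table tx_outputs, its columns, and the index name are hypothetical. In MySQL the analogous check would be the EXPLAIN statement, whose output looks different.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table; the real schema is not given in the question.
conn.execute(
    "CREATE TABLE tx_outputs (id INTEGER PRIMARY KEY, address TEXT, amount INTEGER)"
)


def plan(sql):
    """Return SQLite's query plan for sql as a single string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The last column of each plan row holds the human-readable detail.
    return " | ".join(row[-1] for row in rows)


query = "SELECT * FROM tx_outputs WHERE address = 'abc'"

# Without an index on address, every record has to be visited.
before = plan(query)

# With an index on the filtered column, the lookup can use it.
conn.execute("CREATE INDEX idx_outputs_address ON tx_outputs (address)")
after = plan(query)

print(before)  # a SCAN of tx_outputs
print(after)   # a SEARCH that names idx_outputs_address
```

The exact wording of the plan varies between SQLite versions, so the comments only describe its general shape.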

A good place to start is to define a unique integer key (ideally BIGINT) as the primary key. If you do not do that yourself, MySQL will just invent one anyway, and if you do, you automatically have a fast way to refer to transactions (in database parlance: records in the vin and vout tables). If you add a new field for it, as I would suggest, you may want to call it id and use AUTO_INCREMENT to have MySQL take care of generating unique values for new records.

You may also want to have a close look at your VARCHAR entries. These are essentially just (very big) integers, and if you define them as such (check out the DECIMAL and NUMERIC types), handling indices on them becomes computationally much cheaper because complications in string sorting such as collation do not arise. If you prefer to work with the hexadecimal representations, then that can be achieved with suitable VIEWs into your tables, but be warned that using these is a frequent source of new performance problems since optimization to index-usage after querying a view is a complicated subject. To get this right, it may be safer to do all conversions between hex and large-integer representation outside the database.
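
As a rough sketch of doing those conversions outside the database, the helpers below turn a hex string into an integer and back. The function names and the sample value are made up for illustration, and the 64-character width assumes 256-bit values such as hashes.

```python
def hex_to_int(hex_str: str) -> int:
    """Parse a hexadecimal string into a Python integer (arbitrary precision)."""
    return int(hex_str, 16)


def int_to_hex(value: int, width: int = 64) -> str:
    """Format an integer as zero-padded lowercase hex of the given width."""
    return format(value, "0{}x".format(width))


def int_to_bytes(value: int, length: int = 32) -> bytes:
    """Pack an integer into a fixed-length big-endian byte string."""
    return value.to_bytes(length, "big")


# Made-up 256-bit value, not real blockchain data.
sample_hex = "00" * 30 + "beef"

as_int = hex_to_int(sample_hex)
as_bytes = int_to_bytes(as_int)

print(as_int)                            # 48879
print(int_to_hex(as_int) == sample_hex)  # True
print(len(as_bytes))                     # 32
```

One thing to check before choosing a column type: MySQL's DECIMAL holds at most 65 digits, while a 256-bit value can need up to 78 decimal digits, so a fixed-length binary column fed from something like int_to_bytes may be the more practical fit for full-size hashes.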

To increase your chances of getting better and more detailed answers than I have just given, I very strongly suggest you:

1) Look to a developer or database community, because none of the relevant issues are very Bitcoin-specific; at best they are specific to database applications involving very large integers.
2) Provide relevant information such as what you've already done. Or do you expect someone to completely re-invent an indexing scheme knowing neither what queries you like to run nor what you have already managed to get right yourself?

Good luck!
