Assuming that you range-partition your table by the (auto-incrementing) sequence and limit your query by another value (day number), you might see that a lot of partitions are touched when you use:
EXPLAIN PARTITIONS SELECT * FROM table_name WHERE day_number=X ORDER BY...
The reason is that there is no linkage between the partitioning attribute and the filter column. Since MySQL maintains a separate index per partition, it needs to open every partition to find all rows matching day_number.
If you know the partition name (perhaps by querying INFORMATION_SCHEMA.PARTITIONS before the real select) and you use MySQL 5.6+, you can use explicit partition selection like:
SELECT * FROM table_name PARTITION(partition_name) WHERE day_number=X ORDER BY...
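For example, you could look up the partition names first with something like this (schema and table names are placeholders to adjust):

```sql
-- Placeholder names: substitute your own schema and table
SELECT PARTITION_NAME, TABLE_ROWS
FROM INFORMATION_SCHEMA.PARTITIONS
WHERE TABLE_SCHEMA = 'your_db'
  AND TABLE_NAME   = 'table_name';
```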
But this is just chitchat based on a lot of assumptions. To give you a real answer, you need to post the query plan and more information about the table structure - like Phil said before.
Full-text isn't going to help without refactoring to use the full-text functions (CONTAINS, FREETEXT, or their table equivalents). It also doesn't really work with a leading wildcard. Hacks are available, but basically you're going to struggle to write a semantically equivalent query for full-text. For the future, consider redesigning for full-text search, which has stemming (run, runner, running) and thesaurus support (jogger) that could serve your searches much better than two wildcards.
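For illustration, a full-text version of a wildcard search might look like this. The column name jobTitle is a guess, and a full-text index on it would have to exist already; note that CONTAINS only supports trailing (prefix) wildcards:

```sql
-- Hypothetical: assumes a full-text index exists on dJobs(jobTitle)
SELECT TOP 20 *
FROM dJobs
WHERE CONTAINS(jobTitle, '"plumb*"');  -- prefix search only; no leading wildcard
```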
SSD is unlikely to help you unless you are memory-bound. Your tables (at only 500k records) are probably in memory most of the time. Can you confirm the size of the dJobs table, and the server RAM?
Enterprise Edition could help where the Standard Edition limitation of 64GB RAM and the lesser of 4 sockets or 16 cores is lifted, but you're going to need a really powerful box to notice a difference. For example, the 4-socket limit really means you could have something like 4 quad-core processors totalling 16 cores; with HT enabled, you're already at 32 logical processors. The generally recommended server MAXDOP for this type of OLTP machine would be 8 anyway. I think this is unlikely to benefit you because your query has more fundamental problems, but you never know.
Non-clustered indexes (particularly on dJobs) are unlikely to help because the query selects so many columns from this table and has many criteria in the WHERE clause. A non-clustered index would have to be so wide to cover the query that it would be practically a duplicate of the clustered index, and therefore overly expensive to maintain. As the query sorts by jobID DESC, I considered a descending index but haven't trialled this.
Partitioning (Enterprise only) is really a great feature, but again it's unlikely to help you. I did a quick investigation of partitioning on the dbo.dJobs.jobJobStatus column; I imagine you only have a small percentage of jobs 'active' at any one time - a few hundred, or even a few thousand, out of the 500,000 records. Partition elimination would probably be cancelled out by the OR OR OR approach. Parallel scans of multiple partitions are also an Enterprise-only feature.
This would probably work:
SELECT TOP 20 *
FROM dJobs
LEFT JOIN dClients on cltClientID = jobClientId
LEFT JOIN dUsers on regUserId = jobCoordinator
LEFT JOIN dJobStatus ON jbsID = jobJobStatus
WHERE
(
jobjobStatus IN ( SELECT jbsid FROM djobstatusgroupmapping WHERE jsgid = 0 )
)
ORDER BY jobID DESC
This probably won't work:
SELECT TOP 20 *
FROM dJobs
LEFT JOIN dClients on cltClientID = jobClientId
LEFT JOIN dUsers on regUserId = jobCoordinator
LEFT JOIN dJobStatus ON jbsID = jobJobStatus
WHERE
(
jobjobStatus IN ( SELECT jbsid FROM djobstatusgroupmapping WHERE jsgid = 0 )
OR ( 0=0 ) OR ( 0=0 ) --<< this 'OR always true' means 'get the whole table'
)
ORDER BY jobID DESC
This leads me to the query. The OR OR OR approach basically means 'always get the whole table'. The TOP 20 masks this design problem. The TOP also probably pushed the plan towards the Nested Loops that Jon suggested were suspect. What also stands out to me about this nightmarish "search all columns" constructed query is that you basically have two copies of the same query (and therefore of the table reads): one to count, one for the result set. This might be more efficient if the data went into an intermediate table and the count was done from there, for example.
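A sketch of that intermediate-table idea, with column names assumed from the query above and the temp table name invented for illustration:

```sql
-- Materialise the matching keys once, then count and page from them
SELECT jobID
INTO #matchingJobs
FROM dJobs
WHERE jobJobStatus IN ( SELECT jbsid FROM djobstatusgroupmapping WHERE jsgid = 0 );

SELECT COUNT(*) FROM #matchingJobs;   -- the count, without a second scan of dJobs

SELECT TOP 20 j.*
FROM #matchingJobs m
JOIN dJobs j ON j.jobID = m.jobID
ORDER BY j.jobID DESC;                -- the result page
```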
Finally, this brings me to the only thing that would actually help you (without a large-scale refactor of the code): data deletion or archiving. As mentioned, I imagine you only have a small percentage of jobs 'active' at any one time. Carve off the 'inactive' ones into a different table. Create a view over the top of the two tables for reporting. Set up a nightly job to copy out the old records.
Having only a few thousand active jobs in your main table will most likely transform your query performance.
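A minimal sketch of that archive setup; the archive table and view names are invented for illustration:

```sql
-- Hypothetical archive table with the same structure as dJobs
SELECT * INTO dJobsArchive FROM dJobs WHERE 1 = 0;
GO
-- Reporting view over both tables
CREATE VIEW vJobsAll AS
    SELECT * FROM dJobs
    UNION ALL
    SELECT * FROM dJobsArchive;
```

The nightly job would then move rows whose status is 'inactive' from dJobs into dJobsArchive.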
Some recommended reading:
Erland Sommarskog's article on these "search all columns" queries
Dynamic Search Conditions in T-SQL
http://www.sommarskog.se/dyn-search-2008.html
Querying Multiple Columns (Full-Text Search)
http://technet.microsoft.com/en-us/library/ms142488(v=sql.105).aspx
I hope that helps!
Simple: create proper indices (order(ip) and ip(ip_from, ip_to)).
Complex: create a field of LINESTRING type from ip_from to ip_to, create a spatial index (R-tree) on it, and use MBRContains for filtering.
From the other side - the ranges (ip_from, ip_to) must be NON-overlapping. If that is true, then you'll obtain the result with a simple query. The indices ip(ip_from) and order(ip) (or even the covering index ip(ip_from, country_code)) are enough for that query.
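The simple query for non-overlapping ranges is not shown above; a common sketch (table and column names assumed from the index definitions) looks like this:

```sql
-- Hypothetical sketch: find the range containing @ip, assuming the
-- (ip_from, ip_to) ranges never overlap. With the covering index
-- ip(ip_from, country_code), this reads a single index row.
SELECT country_code
FROM ip
WHERE ip_from <= @ip
ORDER BY ip_from DESC
LIMIT 1;
```

Because the ranges don't overlap, the row with the largest ip_from not exceeding @ip is the only candidate, so no ip_to comparison (and no range scan) is needed to locate it.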