The query optimizer is free to rearrange the join order of tables in a query into any logically consistent sequence, based on its estimates of the costs of the query... unless you use STRAIGHT_JOIN, which forces the optimizer to read the left table before the right table in that particular join. (In MySQL, you can also SELECT STRAIGHT_JOIN ..., which forces all the tables to be handled in the order specified in the FROM clause.)
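As an illustration, the two forms look like this (the tables t1, t2, t3 and their join columns are hypothetical, not from your query):

```sql
-- Per-join form: forces t1 to be read before t2 in this one join.
SELECT t1.*
FROM t1
STRAIGHT_JOIN t2 ON t2.t1_id = t1.id
JOIN t3 ON t3.t2_id = t2.id;

-- Query-wide form: forces every table to be read in FROM-clause order.
SELECT STRAIGHT_JOIN t1.*
FROM t1
JOIN t2 ON t2.t1_id = t1.id
JOIN t3 ON t3.t2_id = t2.id;
```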
The reason for doing this is to force the optimizer to choose a plan that you know to be better than the one it's choosing on its own. In your case, sometimes that's a better plan, and sometimes it isn't.
You only posted one EXPLAIN, but I strongly suspect you'll find the EXPLAIN to be different for the query without the STRAIGHT_JOIN, which will probably make the performance discrepancy more readily apparent. It's almost inconceivable that the plan is the same, since the performance is so different.
There's another problem with the design of your query, which might be contributing to the poor performance when the query plan changes:
WHERE ...
DATE(`Mention`.`indexed`) BETWEEN "2012-11-04" AND "2012-12-04"
This is syntactically valid, but bad practice, because you're telling the server: "for each row we haven't eliminated with other attributes in the WHERE clause or joins, evaluate Mention.indexed using the DATE() function, and eliminate the rows where the resulting answer is not between '2012-11-04' and '2012-12-04'."
Change to this:
WHERE ...
`Mention`.`indexed` BETWEEN '2012-11-04'
AND DATE_SUB(DATE_ADD('2012-12-04',INTERVAL 1 DAY),INTERVAL 1 SECOND)
The optimizer will evaluate the two expressions only once, and the second expression evaluates to '2012-12-04 23:59:59'. So now you have two constants, which can be used to match rows against the index on Mention.indexed using a range scan, if the optimizer thinks that's a good idea. As your query is written, that index can't be used for filtering rows.
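For the curious, the boundary arithmetic in that second expression can be checked outside MySQL. This sketch mirrors the DATE_ADD/DATE_SUB steps with Python's datetime, purely for illustration:

```python
from datetime import datetime, timedelta

end_date = datetime(2012, 12, 4)
# DATE_ADD('2012-12-04', INTERVAL 1 DAY) -> midnight of the next day
next_day = end_date + timedelta(days=1)
# DATE_SUB(..., INTERVAL 1 SECOND) -> the last whole second of Dec 4
upper_bound = next_day - timedelta(seconds=1)
print(upper_bound)  # 2012-12-04 23:59:59
```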
"But wait," someone says, "the EXPLAIN
says it's using that index." Yes, it's using it to sort the results, but it's not using it for eliminating non-matching rows, because putting a formula on the left side of the where clause almost always eliminates the possibility of an index being used on the columns being passed as arguments into the function.
When you see Using where in the Extra column, that is the optimizer saying "With the query plan I've selected, I'm going to have to ask the underlying storage engine for more rows from this table than we actually want, and filter them at the MySQL layer using something from the WHERE clause to find what we actually need."
This should work if you want to join two tables by day and do counts on different conditions of the columns.
SELECT *
FROM (
    SELECT
        SUM(CASE WHEN crit1 = "AAA" THEN 1 ELSE 0 END) AS TheAs,
        SUM(CASE WHEN crit1 = "BBB" THEN 1 ELSE 0 END) AS TheBs,
        SUM(CASE WHEN crit3 = "CCC" THEN 1 ELSE 0 END) AS TheCs,
        DATE(time) AS someDay
    FROM recent_items
    GROUP BY DATE(time)
) AS recitems
JOIN (
    SELECT
        SUM(CASE WHEN crit1 = "AAA" THEN 1 ELSE 0 END) AS TheAs,
        SUM(CASE WHEN crit1 = "BBB" THEN 1 ELSE 0 END) AS TheBs,
        SUM(CASE WHEN crit3 = "CCC" THEN 1 ELSE 0 END) AS TheCs,
        DATE(time) AS someDay
    FROM insider_trades
    GROUP BY DATE(time)
) AS InInfo ON recitems.someDay = InInfo.someDay
This may be slow depending on the size of the data that is being queried. Limiting the number of days you are trying to get data for by adding where clauses to the inner queries will help.
Best Answer
You can use the ADDTIME() function to combine the separate date and time columns into a single DATETIME for the comparison. This might use an index on tableA (datetime_column) but not an index on tableB. The reverse approach (splitting tableA's column apart instead) might use an index on tableB (date_column, time_column) but not one on tableA. It won't hurt to test both versions. If one table is much larger than the other, then prefer to have the larger table's columns exposed (not cast) so their index might be used.
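A sketch of the two versions, assuming tableA has a single DATETIME column and tableB stores the date and time separately (all column names here are assumptions):

```sql
-- Version 1: functions applied to tableB's side, so the index on
-- tableA (datetime_column) can be probed; tableB's index cannot.
SELECT ...
FROM tableA
JOIN tableB
  ON tableA.datetime_column
   = ADDTIME(tableB.date_column, tableB.time_column);

-- Version 2: functions applied to tableA's side, so the composite
-- index on tableB (date_column, time_column) can be probed instead.
SELECT ...
FROM tableA
JOIN tableB
  ON tableB.date_column = DATE(tableA.datetime_column)
 AND tableB.time_column = TIME(tableA.datetime_column);
```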
If you move to MariaDB (any version 5.3+) or MySQL 5.7 (when it's released), you can define a VIRTUAL column (or two) in one of the two tables to hold this conversion/calculation, which can be persisted and indexed. In 5.5, if efficiency is not good, which is expected with large tables, you could add a computed column yourself, but it would have to be populated during inserts and kept in sync during updates by you (e.g. using triggers).
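A sketch of the computed-column idea in MariaDB syntax (MySQL 5.7 spells the keyword STORED rather than PERSISTENT; the table and column names are assumptions carried over from above):

```sql
-- Materialize the combined value once per row, then index it.
ALTER TABLE tableB
  ADD COLUMN combined_dt DATETIME
      AS (ADDTIME(date_column, time_column)) PERSISTENT,
  ADD INDEX (combined_dt);

-- The join can then compare plain indexed columns on both sides:
--   ... ON tableA.datetime_column = tableB.combined_dt
```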