I use mysql under Linux. My table is about 8000000 rows. The table has index(advertiser_id,activity_date),(activity_date,advertiser_id,activity_type).
The table is looks as following:
activity_date advertiser_id activity_type message
2014-06-01 00:00:16 80149 11 Raised max bid
2014-06-01 00:00:16 80149 12 Set Bid Management
2014-06-01 00:01:49 80149 11 Raised max bid
2014-06-01 00:01:49 80149 12 Set Bid Management
2014-06-01 00:03:17 37 12 Lowered daily budget
2014-06-01 00:03:17 59 11 Lowered max bid
2014-06-01 00:03:17 59 14 Raised daily budget
Each row is an activity. I try to write SQL for each time one aid of activity_type:11(UPDATE_AD_BID), count the all other actions of this aid within one day. if one aid have a activity_type:11(UPDATE_AD_BID) at 2014-03-14 16:21:02, I need all the activity of this aid from 2014-03-13 16:21:02 to 2014-03-14 16:21:02. The SQL query I wrote is as following
SELECT bid_change.advertiser_id,bid_change.activity_date,
CASE bid_change.activity_type WHEN 11 THEN IF(bid_change.message Like '%Raised%', 1, -1) ELSE 0 END as bid,
sum(CASE change_type.activity_type WHEN 11 THEN 1 ELSE 0 END) as UPDATE_AD_BID,
sum(CASE change_type.activity_type WHEN 12 THEN 1 ELSE 0 END) as UPDATE_AD_BUDGET,
sum(CASE change_type.activity_type WHEN 14 THEN 1 ELSE 0 END) as CREATIVE,
FROM (select * from post_funnel) as bid_change
LEFT JOIN (select * from post_funnel) as change_type
ON bid_change.advertiser_id = change_type.advertiser_id
AND change_type.activity_date BETWEEN bid_change.activity_date - INTERVAL 1 DAY AND bid_change.activity_date -INTERVAL 1 SECOND
WHERE bid_change.activity_type = 11
GROUP BY bid_change.advertiser_id, bid_change.activity_date;
The SQL work fine with small test data set. But when I use it with large data set. It took forever to run. Is there anyone I can make the SQL more efficient? or I need to look for a big data solution with Hadoop or so?
update:
output should be as following
activity_date advertiser_id bid UPDATE_AD_BID UPDATE_AD_BUDGET CREATIVE
2014-06-01 00:00:16 80149 1 0 0 0
2014-06-01 00:01:49 80149 1 1 1 0
2014-06-01 00:03:17 59 -1 0 0 0
Best Answer
Some notes:
No need for the derived tables. Most, if not all MySQL versions will try to materialize the subresults of the derived tables and the subsequent join operations will not be able to use any indexes. Check the
EXPLAIN
output to verify this. Moreover, if the resulting tables do not fit in memory, they will be saved temporarily into disk, causing more delay.I think that two indexes, on
(activity_type, advertiser_id, activity_date)
and on(advertiser_id, activity_date, activity_type)
will be more useful than those you have. The first will be used for thebid_change
and the second for joining thepost_funnel
.If the
message
column is not some huge varchar or text datatype, including it in the 1st (only) above index will further improve efficiency.I strongly suggest you don't use
BETWEEN
for date and datetime comparisons but inclusive-exclusive ranges (qt >= start_datetime AND qt < end_datetime
).(detail) No need to use
CASE bid_change.activity_type WHEN 11
in theSELECT
when you haveWHERE bid_change.activity_type = 11
in the query.(detail) You can rewrite the other
CASE
using eitherCOUNT()
instead ofSUM()
or using theSUM(condition)
mysql trick which makes them a bit simpler:In all, the query can be rewritten: