MySQL – How to ensure that date-range queries involving multiple time zones are sargable

MySQL, performance

I am building a web analytics application for search engine traffic only. You can see some screenshots here: http://myhappyanalytics.com/

It works similarly to Google Analytics, but it only saves and shows data from search traffic: visitors, keywords, pages and page views.

Since it's an application that will store a large number of rows, I want to make sure it won't overload the server in the first month after launch.

I am currently using MySQL with the InnoDB engine, and this is the structure of the 4 main tables:

CREATE TABLE IF NOT EXISTS `keyword` (
  `id_keyword` int(11) NOT NULL AUTO_INCREMENT,
  `id_website` int(11) NOT NULL,
  `keyword` varchar(255) NOT NULL,
  `position` int(11) DEFAULT NULL,
  `date_add` datetime NOT NULL,
  `date_upd` datetime NOT NULL,
  PRIMARY KEY (`id_keyword`),
  KEY `fk_keyword_website1_idx` (`id_website`)
) ENGINE=InnoDB  DEFAULT CHARSET=utf8 AUTO_INCREMENT=33290 ;

CREATE TABLE IF NOT EXISTS `page` (
  `id_page` int(11) NOT NULL AUTO_INCREMENT,
  `id_website` int(11) NOT NULL,
  `url` varchar(1000) NOT NULL,
  `path` varchar(1000) DEFAULT NULL,
  `date_add` datetime NOT NULL,
  `date_upd` datetime NOT NULL,
  PRIMARY KEY (`id_page`),
  KEY `fk_page_website1_idx` (`id_website`)
) ENGINE=InnoDB  DEFAULT CHARSET=utf8 AUTO_INCREMENT=65167 ;

CREATE TABLE IF NOT EXISTS `page_view` (
  `id_page_view` bigint(20) NOT NULL AUTO_INCREMENT,
  `id_visit` int(11) NOT NULL,
  `id_page` int(11) NOT NULL,
  `id_website` int(11) NOT NULL,
  `date_add` datetime DEFAULT NULL,
  PRIMARY KEY (`id_page_view`),
  KEY `fk_page_view_visit1_idx` (`id_visit`),
  KEY `fk_page_view_page1_idx` (`id_page`),
  KEY `id_website` (`id_website`)
) ENGINE=InnoDB  DEFAULT CHARSET=utf8 AUTO_INCREMENT=180240 ;

CREATE TABLE IF NOT EXISTS `visit` (
  `id_visit` int(11) NOT NULL AUTO_INCREMENT,
  `id_keyword` int(11) NOT NULL,
  `id_page` int(11) NOT NULL,
  `id_website` int(11) NOT NULL,
  `id_search_engine` int(11) DEFAULT NULL,
  `id_guest` int(11) DEFAULT NULL,
  `position` int(11) DEFAULT NULL,
  `ip` int(11) NOT NULL,
  `date_add` datetime NOT NULL,
  PRIMARY KEY (`id_visit`),
  KEY `fk_visit_keyword1_idx` (`id_keyword`),
  KEY `fk_visit_page1_idx` (`id_page`),
  KEY `fk_visit_website1_idx` (`id_website`),
  KEY `id_search_engine` (`id_search_engine`),
  KEY `id_guest` (`id_guest`,`timestamp`)
) ENGINE=InnoDB  DEFAULT CHARSET=utf8 AUTO_INCREMENT=47335 ;

Right now, with a website that has 30,000 monthly visits, some queries are already slow. They all select rows between two dates, and for the charts I then group the data by day.

For the date I currently use the field "date_add", a DATETIME column. I store the value in UTC and convert it to the timezone of the website when querying.

I think the main problem is that I am doing too many conversions on the date_add field: for selecting, for comparing and for grouping. In each case I add or subtract the timezone offset.

I also don't know whether I should index the date field.

Example query that I use to get the data for the visits chart:

  SELECT DATE_FORMAT(DATE_ADD(t.date_add, INTERVAL 7200 second), "%Y-%m-%d") AS chartDay,
  count(t.id_visit) AS chartVisitCount, `t`.`id_visit` AS `t0_c0`, 
  `keyword`.`id_keyword` AS `t1_c0`, `keyword`.`id_website` AS `t1_c1`,  
  `keyword`.`keyword` AS `t1_c2`, `keyword`.`position` AS `t1_c3`,  
  `keyword`.`date_add` AS `t1_c4`, `keyword`.`date_upd` AS `t1_c5`,
   `page`.`id_page` AS `t2_c0`, `page`.`id_website` AS `t2_c1`, `page`.`url` AS `t2_c2`, 
  `page`.`path` AS `t2_c3`, `page`.`date_add` AS `t2_c4`, `page`.`date_upd` AS `t2_c5`, 
  `engine`.`id_search_engine` AS `t3_c0`, `engine`.`name` AS `t3_c1`,
   `engine`.`code` AS `t3_c2`, `engine`.`host` AS `t3_c3`, 
   `engine`.`r_keyword` AS `t3_c4`, `engine`.`r_position` AS `t3_c5`, 
   `engine`.`date_add` AS `t3_c6`, `engine`.`date_upd` AS `t3_c7` 
    FROM `visit` `t` LEFT OUTER JOIN `keyword` `keyword` 
   ON (`t`.`id_keyword`=`keyword`.`id_keyword`) 
  LEFT OUTER JOIN `page` `page` ON (`t`.`id_page`=`page`.`id_page`) 
  LEFT OUTER JOIN `search_engine` `engine` ON  
   (`t`.`id_search_engine`=`engine`.`id_search_engine`) 
   WHERE ((t.id_website=21) AND ((t.date_add >= '2013-04-10 22:00:00' 
   AND t.date_add <= '2013-05-11 21:59:59'))) 
   GROUP BY DATE_FORMAT(DATE_ADD(t.date_add, INTERVAL 7200 second), "%Y-%m-%d")
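
For comparison, here is a minimal sketch of a slimmer version of the chart query. It assumes the chart only needs the day and the visit count, so the joins and the extra columns are dropped. The WHERE clause above already compares the bare date_add column against constants, so it is sargable; what it lacks is an index that starts with id_website and continues with date_add. The index name below is hypothetical, and the half-open upper bound is a substitute for the original `<= '2013-05-11 21:59:59'`.

```sql
-- Hypothetical index name. Equality column first, range column second,
-- so MySQL can do one range scan per website.
ALTER TABLE `visit`
  ADD INDEX `idx_visit_website_date` (`id_website`, `date_add`);

-- Only the SELECT and GROUP BY expressions apply the 7200-second offset;
-- the WHERE clause leaves date_add untouched so the index stays usable.
SELECT DATE(t.date_add + INTERVAL 7200 SECOND) AS chartDay,
       COUNT(*) AS chartVisitCount
FROM `visit` t
WHERE t.id_website = 21
  AND t.date_add >= '2013-04-10 22:00:00'
  AND t.date_add <  '2013-05-11 22:00:00'  -- half-open upper bound
GROUP BY chartDay;
```

Running EXPLAIN on both versions should show whether the new index is actually chosen for this data.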

Here is what I had in mind:

  • change date_add to a TIMESTAMP or INT and index that column
  • add another column that stores just the DATE without the time, use it whenever I need grouping, and index that column as well
  • finally, stop saving data in UTC, which needs converting, and save it directly in the timezone of the website
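
As a point of comparison for the third idea, the data can stay in UTC if the conversion is applied to the range boundaries instead of to the column. The sketch below assumes the fixed +02:00 offset implied by the 7200 seconds in the example query and a local range of 2013-04-11 through 2013-05-11. CONVERT_TZ also accepts named zones, but only if the MySQL time zone tables have been loaded; otherwise it returns NULL.

```sql
-- The conversion runs on two constants, not on every row,
-- so an index on (id_website, date_add) can still serve the range.
SELECT COUNT(*) AS visitCount
FROM `visit` t
WHERE t.id_website = 21
  -- local midnight 2013-04-11 at +02:00 is 2013-04-10 22:00:00 UTC
  AND t.date_add >= CONVERT_TZ('2013-04-11 00:00:00', '+02:00', '+00:00')
  -- the day after the last local day, used as an exclusive upper bound
  AND t.date_add <  CONVERT_TZ('2013-05-12 00:00:00', '+02:00', '+00:00');
```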

So do you think these changes will improve performance? Or are there better ways to do it?

PS: For the production server I was thinking of starting with a dedicated server with 16-32GB of RAM, because I know that giving more memory to the MySQL buffers is also very important.

Best Answer

Based on the query you provided, I would say:

  1. For the date_add field, I would definitely recommend separating the date part from the time part, since that lets you group by a plain column instead of a function result.
  2. Assuming you always filter by id_website, I would recommend a composite index on these columns, in this order: id_website, date_add_date, date_add_time.
  3. After that, do not call DATE_FORMAT in the GROUP BY; simply group by date_add_date.
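
The three points above could be sketched roughly as follows. This is one possible reading, not a definitive migration: it assumes the new columns hold the website's local date and time (UTC plus the 7200 seconds from the question), because a date part kept in UTC would group by UTC days rather than local days. The index name is hypothetical.

```sql
-- Point 1: split the date and time parts into their own columns.
ALTER TABLE `visit`
  ADD COLUMN `date_add_date` DATE NULL,
  ADD COLUMN `date_add_time` TIME NULL;

-- Backfill existing rows for one website; new rows would need the same
-- values written at insert time.
UPDATE `visit`
SET `date_add_date` = DATE(`date_add` + INTERVAL 7200 SECOND),
    `date_add_time` = TIME(`date_add` + INTERVAL 7200 SECOND)
WHERE `id_website` = 21;

-- Point 2: composite index in the recommended column order.
ALTER TABLE `visit`
  ADD INDEX `idx_visit_website_date_time`
    (`id_website`, `date_add_date`, `date_add_time`);

-- Point 3: filter and group on the plain column, with no function calls.
SELECT `date_add_date` AS chartDay, COUNT(*) AS chartVisitCount
FROM `visit`
WHERE `id_website` = 21
  AND `date_add_date` BETWEEN '2013-04-11' AND '2013-05-11'
GROUP BY `date_add_date`;
```

One trade-off of storing the local date: if a website's timezone setting ever changes, the derived columns would need to be recomputed.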

It might also be worth partitioning your main tables, such as visit, either by date_add_date or by id_website, depending on your needs. Read up on the pitfalls of table partitioning first:

http://www.mysqlperformanceblog.com/2010/12/11/mysql-partitioning-can-save-you-or-kill-you/
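
For illustration only, a monthly range partitioning of visit by date_add_date might look like the sketch below. It assumes the date_add_date column already exists and is populated for every row. The partition names and boundaries are made up. MySQL requires the partitioning column to be part of every unique key, including the primary key, and partitioned InnoDB tables cannot have foreign key constraints, so this is a larger schema change than it first appears.

```sql
-- The partitioning column must be part of the primary key,
-- and primary key columns must be NOT NULL.
ALTER TABLE `visit`
  MODIFY `date_add_date` DATE NOT NULL,
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (`id_visit`, `date_add_date`);

-- Hypothetical monthly partitions; a catch-all partition takes later dates.
ALTER TABLE `visit`
  PARTITION BY RANGE COLUMNS (`date_add_date`) (
    PARTITION p2013_04 VALUES LESS THAN ('2013-05-01'),
    PARTITION p2013_05 VALUES LESS THAN ('2013-06-01'),
    PARTITION p_max    VALUES LESS THAN (MAXVALUE)
  );
```

Queries that filter on date_add_date can then be pruned to the matching partitions, while queries that filter only on id_website would have to touch all of them.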