Mysql – Time series data for ad platform

database-designMySQL

I am trying to figure out how to store time series data for an ad platform I am working on.

Basically I want to know some strategies/solutions for storing billions of rows of data so that I can easily search it (about 6-8 indexes on the table) and get fast counts based on queries.

I tried mySQL with the tokuDB engine and this seems to be very fast but is extremely slow when I try to do a COUNT query when the rows reached about 5-8 million.

I was looking at some noSQL alternatives but since I want to be able to search this data this is probably not the best solution. I was using dynamoDB. I would have had to store the data is many places in order to account for all the searching on the data.

What I am storing is a row in the database for each click on an AD that occurs. This table will grow very fast, especially when this site gets large.

Another solution would be to separate this data per advertiser. This means each advertiser will have their own table where all their data goes into. This means it will be much smaller and the COUNT queries will be much faster. I can even split it up by advertiser and month.

My goal is to give an advertiser the ability to search and display in a paginated way all their clicks. They should be able to get data between a time period and filter by about 5-8 other indexes if they want to.

Best Answer

Did you try sphinxsearch (http://sphinxsearch.com/)? It is search engine, but not only. It can query very fast with group by, order, filters. Mysql should be good for fast writes with per-table and per-host sharding.

You can split your sphinx index to several parts and use distributed search:

http://sphinxsearch.com/docs/2.2.2/distributed.html

Partitioning is done manually. You should

  • setup several instances of Sphinx programs (indexer and searchd) on different servers;
  • make the instances index (and search) different parts of data;
  • configure a special distributed index on some of the searchd instances;
  • and query this index.

When searchd receives a query against distributed index, it does the following:

  • connects to configured remote agents;
  • issues the query;
  • sequentially searches configured local indexes (while the remote agents are searching);
  • retrieves remote agents' search results;
  • merges all the results together, removing the duplicates;
  • sends the merged results to client.
Related Question