MySQL – Database for building a realtime analytics system

Tags: columnstore, database-design, mysql, oracle, postgresql

I want to build a system similar to Google Analytics (for internal use only, with less traffic and fewer features), focusing mainly on

  1. Real-time counting of unique URI visits/page views, broken down by dimensions of user demographic information, e.g. user agent, OS, country, etc.

  2. Real-time calculation of average user session length (two requests from the same IP belong to the same session if the gap between them is less than 1 minute); see the sketch below for what I mean.
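
To make that session definition concrete, here is a minimal sketch (illustrative only; the data layout is invented, the 1-minute gap is as defined above) that groups timestamped requests per IP and computes the average session length:

    from collections import defaultdict
    from datetime import datetime, timedelta

    GAP = timedelta(minutes=1)  # requests closer than this belong to one session

    def average_session_length(requests):
        """requests: iterable of (ip, datetime) pairs, in any order."""
        by_ip = defaultdict(list)
        for ip, ts in requests:
            by_ip[ip].append(ts)

        session_lengths = []
        for timestamps in by_ip.values():
            timestamps.sort()
            start = prev = timestamps[0]
            for ts in timestamps[1:]:
                if ts - prev >= GAP:        # gap too large: close the session
                    session_lengths.append((prev - start).total_seconds())
                    start = ts
                prev = ts
            session_lengths.append((prev - start).total_seconds())

        return sum(session_lengths) / len(session_lengths) if session_lengths else 0.0

    reqs = [
        ("1.2.3.4", datetime(2012, 1, 1, 10, 0, 0)),
        ("1.2.3.4", datetime(2012, 1, 1, 10, 0, 30)),
        ("1.2.3.4", datetime(2012, 1, 1, 10, 5, 0)),   # > 1 min later: new session
    ]
    print(average_session_length(reqs))  # 15.0 (one 30 s session, one 0 s session)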

Is there a good database or data store that enables this kind of query in real time?

P.S. I am currently testing InfiniDB.

Best Answer

There is a trick to building fast realtime analytics, regardless of the platform. I've done this with Microsoft Analysis Services, but you can use similar techniques with other platforms as well.

The trick is to have a leading partition that can be populated with near-realtime data, plus a historical partition (or partitions) optimised for fast queries. If you keep the leading partition small enough, it will be quick to query as well.
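
As a rough illustration of the layout (my own sketch; SQLite is used only so it runs anywhere, where a real system would use the partitioning features of your OLAP tool or warehouse), the query side sees one logical view spanning a small raw leading table and a pre-aggregated historical table:

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
    -- Leading partition: raw, near-realtime rows; kept small.
    CREATE TABLE hits_leading (
        uri TEXT, country TEXT, ts TEXT
    );
    -- Historical partition(s): pre-aggregated, optimised for queries.
    CREATE TABLE hits_history (
        day TEXT, uri TEXT, country TEXT, pv INTEGER
    );
    -- One logical view over both, so queries do not care about the split.
    CREATE VIEW hits_by_day AS
        SELECT day, uri, country, pv FROM hits_history
        UNION ALL
        SELECT date(ts) AS day, uri, country, COUNT(*) AS pv
        FROM hits_leading
        GROUP BY date(ts), uri, country;
    """)

    db.execute("INSERT INTO hits_leading VALUES ('/home', 'US', '2012-01-01 10:00:00')")
    print(db.execute("SELECT * FROM hits_by_day").fetchall())
    # [('2012-01-01', '/home', 'US', 1)]

Queries against hits_by_day stay fast because the expensive aggregation over history is already materialised; only the small leading table is aggregated on the fly.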

To manage this, your ETL process populates the leading partition, and a supplementary process periodically converts older data into the query-optimised historical format; a sketch of such a rollover job follows. The exact nature of this process will vary with your platform.
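
Continuing the SQLite sketch above (again purely illustrative), the rollover job might aggregate everything older than a cutoff into the historical table and trim the leading one:

    from datetime import datetime

    def roll_over(db, cutoff):
        """Move leading rows older than `cutoff` into the aggregated history."""
        with db:  # single transaction: queries never see a half-moved state
            db.execute("""
                INSERT INTO hits_history (day, uri, country, pv)
                SELECT date(ts), uri, country, COUNT(*)
                FROM hits_leading WHERE ts < ?
                GROUP BY date(ts), uri, country
            """, (cutoff,))
            db.execute("DELETE FROM hits_leading WHERE ts < ?", (cutoff,))

    # Run periodically (e.g. from cron); roll over only completed days so each
    # day's aggregate is written exactly once and nothing is double-counted.
    cutoff = datetime.utcnow().strftime("%Y-%m-%d 00:00:00")
    roll_over(db, cutoff)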

On MS Analysis Services the leading partition is done as a ROLAP partition that reads directly off the table. Trailing partitions are converted to MOLAP with aggregates. Other OLAP systems will work similarly. On Oracle you can create bitmap indexes and materialised view partitions on your trailing partitions to speed up queries. Some other systems have this type of feature as well, although I'm not aware of MySQL supporting it.
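
For instance, on Oracle the trailing-partition optimisation might look roughly like this (a hedged sketch: the table, column and object names are invented, the cx_Oracle driver and connection details are assumptions, and the exact DDL should be checked against your Oracle version):

    import cx_Oracle  # assumes the Oracle client driver is installed

    conn = cx_Oracle.connect("analytics", "secret", "dbhost/orcl")  # invented credentials
    cur = conn.cursor()

    # Bitmap index on a low-cardinality dimension column of the partitioned
    # fact table; LOCAL makes it per-partition, so only trailing partitions
    # ever need (re)building.
    cur.execute("""
        CREATE BITMAP INDEX hits_country_bix
        ON hits (country) LOCAL
    """)

    # Materialised view with pre-computed aggregates that the optimiser can
    # rewrite queries against (requires the appropriate QUERY REWRITE settings).
    cur.execute("""
        CREATE MATERIALIZED VIEW hits_daily_mv
        BUILD IMMEDIATE REFRESH ON DEMAND ENABLE QUERY REWRITE AS
        SELECT TRUNC(ts) AS day, uri, country, COUNT(*) AS pv
        FROM hits GROUP BY TRUNC(ts), uri, country
    """)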

At a guess, I'd say the cheapest mainstream platform that would do this is MS Analysis Services, which is only available bundled with SQL Server and cannot be purchased separately. For the partitioning with 2008 R2 you will need Enterprise Edition of SQL Server, which runs to about £22,000 per CPU socket in the UK and a bit less on the other side of the pond. Microsoft are shipping a new 'Business Intelligence' edition of SQL Server with 2012; once it hits RTM, the BI edition will support partitioned cubes and is substantially cheaper than Enterprise Edition. Depending on your budget and time constraints you may be able to use that instead.

Another aspect of the problem you will have to tackle is changed data capture (CDC): efficiently identifying and pushing new or changed rows into the ETL process. Most DBMS vendors' CDC features only work with their own databases, so you may have to use a third-party application or put triggers on the source.

  • Various third parties punt CDC applications that can migrate changes across database platforms; a list of CDC products can be found on the Wikipedia entry on the subject. Note that you may still have integration issues. For example, IBM InfoSphere CDC can only trigger external processes on a per-row basis rather than per batch, which could cause efficiency problems on large data volumes.

  • You can create a set of triggers on the source tables that push changes out into a staging area (see the trigger sketch after this list). This requires sufficient access to the source database, so it may not be an option on vendor-supported databases.

  • If the data comes from a file (for example a web server log), you will have to write a client-side process that monitors the tail of the file for new records (see the tail-follower sketch after this list).
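
A minimal illustration of the trigger approach (SQLite again, only so the sketch is runnable anywhere; a real deployment would use the source DBMS's trigger dialect, and `orders` is an invented example table):

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL);

    -- Staging area that the ETL process drains in batches.
    CREATE TABLE orders_changes (
        id INTEGER, amount REAL, op TEXT,
        captured_at TEXT DEFAULT CURRENT_TIMESTAMP
    );

    -- Every insert on the source table leaves a row in staging.
    -- UPDATE and DELETE triggers would be analogous.
    CREATE TRIGGER orders_cdc_ins AFTER INSERT ON orders
    BEGIN
        INSERT INTO orders_changes (id, amount, op)
        VALUES (NEW.id, NEW.amount, 'I');
    END;
    """)

    db.execute("INSERT INTO orders VALUES (1, 9.99)")
    print(db.execute("SELECT id, op FROM orders_changes").fetchall())  # [(1, 'I')]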
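
And a bare-bones tail-follower for the log-file case (my own sketch; production code would also need to handle log rotation and partially written lines):

    import time

    def follow(path):
        """Yield new lines appended to `path`, like `tail -f`."""
        with open(path, "r") as f:
            f.seek(0, 2)                 # start at end of file: only new records
            while True:
                line = f.readline()
                if line:
                    yield line.rstrip("\n")
                else:
                    time.sleep(0.5)      # nothing new yet; poll again shortly

    # for record in follow("/var/log/nginx/access.log"):
    #     push_to_etl(record)            # hypothetical downstream hook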

It is quite likely you will end up having to implement a pull process that polls the data sources. In that case, work out the tolerable latency and write the process so it detects changes efficiently enough to be run that frequently; a high-water-mark polling sketch follows. There is an old saying, sometimes heard in embedded systems circles, to the effect of: 'You know they're getting serious about reliability when they start polling.'
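
One cheap way to structure such a poller (my own sketch; the `hits` source table, its monotonically increasing `id`, and the downstream `load_into_leading_partition` hook are all invented for illustration) is high-water-mark polling: remember the largest key seen so far and fetch only rows beyond it:

    import time

    def poll_forever(db, interval_s=5):
        """Fetch only rows newer than the highest id seen on the last pass."""
        high_water = 0
        while True:
            rows = db.execute(
                "SELECT id, uri, ts FROM hits WHERE id > ? ORDER BY id",
                (high_water,),
            ).fetchall()
            if rows:
                high_water = rows[-1][0]            # advance the watermark
                load_into_leading_partition(rows)   # hypothetical ETL hook
            time.sleep(interval_s)                  # set by tolerable latency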