MySQL – Which database system can break me free?

cassandra, MySQL, nosql, replication

I am currently running a MySQL database for logging, and for analyzing those logs.

My current table schema looks like this:

CREATE TABLE `mylogs` (
  `transfer_id`    BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  `client_id`      INT UNSIGNED NOT NULL,
  `client_ip`      INT UNSIGNED NOT NULL,
  `server_1_ip`    INT UNSIGNED NOT NULL,
  `server_2_ip`    INT UNSIGNED NOT NULL,
  `service`        ENUM('service1', 'service2', 'service3', '...', 'service500') NOT NULL,
  `mb_transferred` FLOAT UNSIGNED NOT NULL,
  `time`           TIMESTAMP NOT NULL
) ENGINE = MYISAM;

I run a service where I deliver a lot of downloads that pass over one extra host between the origin and the destination host.
The hosts are represented by the integer interpretation of their 32-bit IPv4 addresses.
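
MySQL's INET_ATON()/INET_NTOA() map between the dotted form and that integer; the address below is just an example:

-- Convert between dotted-quad notation and the 32-bit integer form
SELECT INET_ATON('192.168.0.1') AS ip_as_int,  -- 3232235521
       INET_NTOA(3232235521)    AS int_as_ip;  -- '192.168.0.1'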

My system currently handles about 500 inserts/second during peak hours.
I run a master-slave system. The master runs an Apache webserver with a PHP file that gets called from remote hosts and inserts a row into the log table.
Then the changes get replicated to the slaves where queries happen.

My queries are primarily aggregations of the mb_transferred field over a range of the time field, filtered by client_id.

SELECT SUM(mb_transferred) FROM mylogs WHERE client_id = 123 AND time > '2012-01-01 00:00:00'

My server is now almost at its limit; I have already upgraded to bigger hardware.

I thought about using a GUID as the primary key and running master-master replication. That would certainly relieve some pressure, but I think it's short-sighted, because it does not decrease the number of inserts each server has to handle.

I am expecting higher throughput in the future, and I am also worried about database size.

Also, in the future I plan to have a second table that defines "weights" for certain services.

Something like:

CREATE TABLE `mylogs`.`service_weight` (
  `plan`         TINYINT NOT NULL,
  `service_name` ENUM('service_1', 'service_2') NOT NULL,
  `weight`       FLOAT NOT NULL,
  PRIMARY KEY (`plan`, `service_name`)
) ENGINE = MYISAM;

Then I want to run join queries against this table and multiply the mb_transferred field by a weight factor, something along the lines of the sketch below.
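
A rough sketch of such a query (the plan = 1 literal is made up, and it assumes the service names in both ENUMs line up):

-- Weighted traffic per client: each log row scaled by its service weight
SELECT l.client_id,
       SUM(l.mb_transferred * w.weight) AS weighted_mb
FROM mylogs l
JOIN service_weight w
  ON w.service_name = l.service
 AND w.plan = 1
WHERE l.`time` > '2012-01-01 00:00:00'
GROUP BY l.client_id;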

I also want to add fields like "transfer_duration" to the logs table to calculate the speed of downloads, and run queries to get statistical data on how well or how badly the connections between certain networks, certain servers, or certain hosters perform.
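
Roughly like this, assuming transfer_duration would hold seconds (the column does not exist yet, so this is hypothetical):

-- Hypothetical extension: store the duration, derive per-server speeds
ALTER TABLE mylogs ADD COLUMN `transfer_duration` FLOAT UNSIGNED NOT NULL;

SELECT server_1_ip,
       AVG(mb_transferred / transfer_duration) AS avg_mb_per_second
FROM mylogs
WHERE transfer_duration > 0
GROUP BY server_1_ip;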

The point is: the data structure is simple; it's just a huge number of rows.

I use a lot of aggregation functions, which makes a light bulb flash in the "map reduce" section of my brain.

I thought about doing horizontal shards, using client_id as the breaking point: for example, with 10 servers, send every user to the server numbered user_id mod 10 (illustrated below).
This would be easy and would relieve the load, but scaling further will probably be awkward.
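
To make the routing rule concrete (the client IDs here are made up):

-- client_id MOD 10 selects one of the 10 shard servers
SELECT 123  MOD 10 AS shard_for_client_123,   -- server 3
       4711 MOD 10 AS shard_for_client_4711;  -- server 1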

So I think, given the size the project will soon reach at the current growth rate, I cannot do anything but turn towards a distributed database system.

I have already tried to examine Cassandra, Project Voldemort, Amazon DynamoDB and HBase, but no matter how much I read, I seem to run into walls.

I think the long years of relational thinking are somehow blocking my mind.

Can someone point me into the right direction on that?
What database system would be suited for my use case and why?

  • I don't need absolute up-to-the-second data; a replication delay is acceptable.
  • I will not have updates, only inserts.
  • I would like to have a fault-tolerant system.
  • The data volume is getting huge, so a sharded system would be nice.
  • I have heavy I/O load.

Best Answer

There's a lot to consider here. I'm going to narrow this down to a single problem and offer one solution, although there are many possible approaches.

Problem: You need AD-HOC queries on lots of rows.

First, set up a server to handle writes and a replicated slave (running some flavor of MySQL 5.5.x) to handle reads. My personal preference is Percona Server; your mileage may vary.

Once replication has caught up, ALTER your slave table: ENGINE=InnoDB, ROW_FORMAT=COMPRESSED, partitioned by date, and indexed to match your queries' WHERE clauses. Each of these is an optimization for your case, and each requires further tuning; a sketch follows.
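
A minimal sketch of that conversion, assuming 5.5 with innodb_file_per_table=1 and innodb_file_format=Barracuda; the KEY_BLOCK_SIZE and the partition boundaries are illustrative:

-- Widen the PK so the partitioning column is part of every unique key,
-- then convert to compressed InnoDB.
ALTER TABLE mylogs
    DROP PRIMARY KEY,
    ADD PRIMARY KEY (transfer_id, `time`),
    ENGINE = InnoDB ROW_FORMAT = COMPRESSED KEY_BLOCK_SIZE = 8;

-- Monthly range partitions; 5.5 only allows UNIX_TIMESTAMP() on TIMESTAMP columns.
ALTER TABLE mylogs
    PARTITION BY RANGE (UNIX_TIMESTAMP(`time`)) (
        PARTITION p2012_01 VALUES LESS THAN (UNIX_TIMESTAMP('2012-02-01 00:00:00')),
        PARTITION p2012_02 VALUES LESS THAN (UNIX_TIMESTAMP('2012-03-01 00:00:00')),
        PARTITION pmax     VALUES LESS THAN MAXVALUE
    );

-- Composite index matching the reporting query's WHERE clause
ALTER TABLE mylogs ADD KEY idx_client_time (client_id, `time`);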

Finally, set up a schedule to archive/shard out the data when it falls out of scope, say after 2-3 years; that depends on your SLA for historical reporting. You can always UNION the archive back in if you have to.
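
With date-based partitions the rotation itself is cheap (dump the partition first if it has to remain available elsewhere; the partition name matches the sketch above):

-- Drop the oldest partition once it has been archived
ALTER TABLE mylogs DROP PARTITION p2012_01;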

MySQL 5.6 will have Memcached built in, which will allow you to store pre-processed results. Map/Reduce is great as long as it's not the first run of the reduce function; NoSQL is great as long as the data is already processed. Neither is really your problem. One way to pre-process is sketched below.
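
For example, a daily rollup table (the table and its name are my invention, purely illustrative):

-- Pre-aggregate per client and day so reports read a few rows, not millions
CREATE TABLE mylogs_daily (
    client_id INT UNSIGNED NOT NULL,
    day       DATE NOT NULL,
    mb_total  FLOAT UNSIGNED NOT NULL,
    PRIMARY KEY (client_id, day)
) ENGINE = InnoDB;

-- Run once per day for the previous day
INSERT INTO mylogs_daily (client_id, day, mb_total)
SELECT client_id, DATE(`time`), SUM(mb_transferred)
FROM mylogs
WHERE `time` >= CURDATE() - INTERVAL 1 DAY
  AND `time` <  CURDATE()
GROUP BY client_id, DATE(`time`);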

Additions 2012-06-13: Since it appears my recommendations have stirred another response, I'll submit my reasons:

  1. Why InnoDB in this case? With the constant insertion of many rows, InnoDB's row-level locking avoids the implicit table-level locks that MyISAM takes on every write.
  2. Why COMPRESSED + Barracuda? "The compression means less data is transferred between disk and memory, and takes up less space in memory. The benefits are amplified for tables with secondary indexes, because index data is compressed also."
  3. Why partition by date? Reduce scan breadth: partition pruning lets a time-bounded query touch only the relevant partitions.
  4. Why index properly? Decrease search times by eliminating full table scans and, with them, the CPU bottleneck. Both effects can be verified with the EXPLAIN sketch below.
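
A quick check that both optimizations kick in for the reporting query from the question (EXPLAIN PARTITIONS is the 5.1-5.6 syntax):

-- Shows which partitions are scanned and which index is chosen
EXPLAIN PARTITIONS
SELECT SUM(mb_transferred)
FROM mylogs
WHERE client_id = 123
  AND `time` > '2012-01-01 00:00:00';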