AWS Database Services for Logging Records like log4j or log4perl

amazon-rdsawsdatabase-recommendationdynamodb

I have been using the file system for log4j and log4perl records and want to start using a database to store the information, especially to lookup records organized by pid. We are using MySQL RDS for most of our databases, so my initial reaction is to continue using so (or experimenting with Aurora RDS), but RDS requires InnoDB for snapshots and I would have preferred MyISAM for speed.

I prefer to keep 30 days of records, at 135M records a day. This could increase by 10x. The most important thing here is speed of inserts, although I would like to be able to retrieve by date or pid within 5 minutes.

RDS services are relational databases, so the retrieval method I require is a good match, but it isn't designed for the huge amount of data being inserted without constraints.

Are SimpleDB or DynamoDB designed for logging records more so than MySQL or Aurora?

Best Answer

Since you're looking at storing and querying 135,000,000 * 30 * 10 records that likely wouldn't benefit from traditional RDBMS features, I think Hadoop would be the way to go.

My experience is exclusively with Microsoft Azure - if you're not attached to Amazon you might check it out. Either way, Hadoop is open source so the majority of operations and activity should be the same regardless of your platform provider. I'd go with an option that allows you to test the performance of different cluster sizes so you're only paying for what you really need (especially with the 10x increase as an unknown right now).

The following link, despite its Azure focus, gives you a good tutorial of how to query log4j records with Hive. Amazon EMR appears to have great documentation.

Key points for your situation: Create a table structure over your files:

    CREATE EXTERNAL TABLE log4jLogs
    (t1 string, t2 string, t3 string, t4 string, t5 string, t6 string, t7 string)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ';'
    STORED AS TEXTFILE LOCATION 'yourdrive:///logs/';

With the table defined you can then query similar to SQL:

    SELECT t4 AS sev, COUNT(*) AS ErrorCount
    FROM log4jLogs
    WHERE t4 = '[ERROR]'
    AND INPUT__FILE__NAME LIKE '%.log' GROUP BY t4;

With the EXTERNAL definition you can have multiple files and the table will include them all, so just drop them in the appropriate directory. If you're keeping 30 days worth you may want to get fancier and separate days into different folder structures/partition your Hive table.

Related Solutions

Mysql – Amazon SimpleDB or Amazon RDS

Well it's really a matter of you deciding on the architecture you want to go with.

SimpleDB is a non-relational "no-sql" data storage system built around the ideas of collections of items accessible via key/value pairing.

Where as RDS is really just a custom version of MySQL which a fully relational SQL database.

both systems will work for tracking profiles but the implementation patterns are very different. Neither is right or wrong, so it's your call on what you want to do.

Mysql – moving a 13 gb table to amazon rds

I realize you've already worked around the issue, but here is the explanation and the fix for what you encountered. It's nice to be able to do it with the pipe, so you don't have to write that big file somewhere.

As you know, when you type a query into the mysql command line client, while that query is executing, the client doesn't accept any more input from you. That same thing essentially happens when you're stringing the these tools together with a pipe -- the target mysql client waits for the target server to return from the query before it can read more data from the pipe from mysqldump.

Meanwhile, as mysqldump is being blocked on writing to the pipe, because we're waiting for the target server to finish a query, it stops reading data from its socket connection. The OS will only buffer so much data before it stops accepting data from the origin server.

If we do something like ENABLE KEYS which takes a while, we hit a timeout... on the origin MySQL server. But the timeout value can be changed.

MySQL Server has two timers, net_write_timeout and net_read_timeout, which default to 60 and 30 seconds respectively, and which will cause the serve to tear down a client connection when the timer is exceeded while blocking on a write to or a read from the network.

Most likely, it's net_write_timeout you're hitting, so the origin server is giving up on the connection from mysqldump, which isn't accepting data fast enough (because it's blocked on its output). Or, more precisely, it isn't accepting data often enough. If any one statement takes too long to execute, the game is over, so, on the origin server:

SET GLOBAL net_write_timeout = 3600; # one hour

This is setting approximately how much time we can wait for any single query to finish executing on the destination server without the origin server timing out. Normally, you wouldn't want MySQL to sit and wait for an hour for a blocking client, so set the timer back after you're done.

Best Answer

Related Solutions

Mysql – Amazon SimpleDB or Amazon RDS

Mysql – moving a 13 gb table to amazon rds

Related Question