Well, it's really a matter of deciding on the architecture you want to go with.
SimpleDB is a non-relational "NoSQL" data storage system built around the idea of collections of items accessible via key/value pairs.
Whereas RDS is really just a managed version of MySQL, which is a fully relational SQL database.
Both systems will work for tracking profiles, but the implementation patterns are very different. Neither is right or wrong, so it's your call on what you want to do.
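For a rough illustration of the difference (table, item, and attribute names here are made up): in RDS you define a schema up front and query it with SQL, while in SimpleDB each profile is just an item holding whatever attribute/value pairs you give it.

-- RDS/MySQL: fixed schema, queried with standard SQL. Names are hypothetical.
CREATE TABLE profiles (
    user_id    INT PRIMARY KEY,
    email      VARCHAR(255) NOT NULL,
    created_at DATETIME NOT NULL
);

-- SimpleDB has no DDL like this; conceptually, the same record is an item
-- in a domain, fetched by its item name:
--   Item "user123" -> { email = "a@example.com", created_at = "2011-01-01" }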
I realize you've already worked around the issue, but here is the explanation and the fix for what you encountered. It's nice to be able to do it with the pipe, so you don't have to write that big file somewhere.
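For reference, a minimal sketch of the kind of pipeline we're talking about (host and database names are placeholders):

mysqldump -h origin-host some_database | mysql -h target-host some_database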
As you know, when you type a query into the mysql command-line client, the client doesn't accept any more input from you while that query is executing. Essentially the same thing happens when you string these tools together with a pipe -- the target mysql client waits for the target server to finish the query before it can read more data from the pipe from mysqldump.
Meanwhile, because mysqldump is blocked writing to the pipe while we wait for the target server to finish that query, it stops reading data from its socket connection to the origin server. The OS will only buffer so much data before it stops accepting more from the origin server.
If the dump then hits a statement that takes a while to execute, such as ENABLE KEYS, we hit a timeout... on the origin MySQL server. But the timeout value can be changed.
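For context, mysqldump's default output wraps each table's data in key-toggling statements like these (table name is a placeholder); the ENABLE KEYS step is where the indexes get rebuilt, which is what can run for a long time on a big table:

/*!40000 ALTER TABLE `big_table` DISABLE KEYS */;
-- ...bulk INSERT statements for the table's data...
/*!40000 ALTER TABLE `big_table` ENABLE KEYS */;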
MySQL Server has two timers, net_write_timeout and net_read_timeout, which default to 60 and 30 seconds respectively, and which will cause the server to tear down a client connection when the timer is exceeded while blocking on a write to or a read from the network.
Most likely, it's net_write_timeout you're hitting, so the origin server is giving up on the connection from mysqldump, which isn't accepting data fast enough (because it's blocked on its output). Or, more precisely, it isn't accepting data often enough. If any one statement takes too long to execute, the game is over, so, on the origin server:
SET GLOBAL net_write_timeout = 3600; # one hour
This is setting approximately how much time we can wait for any single query to finish executing on the destination server without the origin server timing out. Normally, you wouldn't want MySQL to sit and wait for an hour for a blocking client, so set the timer back after you're done.
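And once the copy is done, put it back to the default mentioned above:

SET GLOBAL net_write_timeout = 60; # back to the 60-second default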
Best Answer
Since you're looking at storing and querying 135,000,000 * 30 * 10 records that likely wouldn't benefit from traditional RDBMS features, I think Hadoop would be the way to go.
My experience is exclusively with Microsoft Azure - if you're not attached to Amazon you might check it out. Either way, Hadoop is open source so the majority of operations and activity should be the same regardless of your platform provider. I'd go with an option that allows you to test the performance of different cluster sizes so you're only paying for what you really need (especially with the 10x increase as an unknown right now).
The following link, despite its Azure focus, gives you a good tutorial on how to query log4j records with Hive. Amazon EMR appears to have great documentation.
Key points for your situation: Create a table structure over your files:
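The original answer's snippet isn't reproduced here, but a minimal HiveQL sketch might look like this, assuming space-delimited log4j lines sitting in a directory of files (column names, types, delimiter, and path are all assumptions):

-- Columns, delimiter, and location are placeholders; adjust to your log format.
CREATE EXTERNAL TABLE log_entries (
    log_date STRING,
    log_time STRING,
    level    STRING,
    class    STRING,
    message  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
LOCATION '/logs/current';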
With the table defined you can then query similar to SQL:
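For example, a sketch against the hypothetical table above:

-- Count ERROR lines per class across every file under the table's location.
SELECT class, COUNT(*) AS error_count
FROM log_entries
WHERE level = 'ERROR'
GROUP BY class
ORDER BY error_count DESC;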
With the EXTERNAL definition you can have multiple files and the table will include them all, so just drop them in the appropriate directory. If you're keeping 30 days' worth you may want to get fancier and separate days into different folder structures/partition your Hive table, as in the sketch below.
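A minimal sketch of that partitioned layout, assuming one folder per day (all names and paths are placeholders):

-- The day becomes a partition column instead of a data column, so old days
-- can be dropped cheaply and queries can prune to just the days they need.
CREATE EXTERNAL TABLE log_entries_by_day (
    log_time STRING,
    level    STRING,
    class    STRING,
    message  STRING
)
PARTITIONED BY (log_date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ';

-- Register each day's folder as it arrives:
ALTER TABLE log_entries_by_day
ADD PARTITION (log_date = '2012-01-15') LOCATION '/logs/2012-01-15';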