I don't have experience with DynamoDB so the best I can do is offer some pointers on optimizations to look for if you stay in the hosted MySQL world.
- Memory allocation. If you're using InnoDB, make sure you have an adequate buffer pool size, with two log files each sized at the lesser of 25% of the buffer pool or 1 GB. In a perfect world your buffer pool would be as large as all the .ibd files (i.e. sum(data_length+index_length) from information_schema.tables where engine='innodb'). Understandably this is not always possible, in which case try to allocate 60-70% of your system memory to the buffer pool while leaving the rest available for disk cache. (This of course assumes a dedicated MySQL server that isn't hosting Apache or running your PHP scripts.)
If you're running MyISAM tables then at least make sure your key_buffer_size can accommodate the index_length (or size of your .MYI files).
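Both sizing checks can be run straight from SQL. A minimal sketch (the 60-70% figure above is a guideline, not a hard rule):

    -- Total InnoDB data + index size: the ideal lower bound for innodb_buffer_pool_size
    SELECT ROUND(SUM(data_length + index_length) / 1024 / 1024) AS innodb_mb
    FROM information_schema.tables
    WHERE engine = 'InnoDB';

    -- Total MyISAM index size: key_buffer_size should accommodate at least this
    SELECT ROUND(SUM(index_length) / 1024 / 1024) AS myisam_index_mb
    FROM information_schema.tables
    WHERE engine = 'MyISAM';

    -- Compare against the current settings
    SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
    SHOW VARIABLES LIKE 'key_buffer_size';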
- Disks. If you need to fall back to disk, you want those reads to be as fast as possible. In a perfect world you'd have PCI-E based SSD cards. I imagine this is out of your budget if you're worried about charges after switching to Amazon, so see if your hosting provider has any options above what you currently have: SSD-based SAS drives > 15k RPM SAS HDDs > NL-SAS.
- Service layer between MySQL and clients. There is memory overhead to each connection to the database that could otherwise go toward the buffer pool or disk cache. Try to have a service layer that maintains a bounded connection pool and provides application-layer caching where possible.
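To get a feel for what each connection can cost, inspect the per-connection buffers (a rough sketch; the actual overhead depends on what each connection does):

    -- Per-connection buffers that each open connection may allocate on demand
    SHOW VARIABLES WHERE Variable_name IN
        ('read_buffer_size', 'read_rnd_buffer_size',
         'sort_buffer_size', 'join_buffer_size');

    -- The connection ceiling, and how close you've actually come to it
    SHOW VARIABLES LIKE 'max_connections';
    SHOW GLOBAL STATUS LIKE 'Max_used_connections';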
- Distribute the workload. This is tantamount to going to the "cloud", except that you're managing all the machines yourself. If your app is mostly read-heavy and doesn't always need to be up to the second, you can set up replication between many physical machines and load balance the reads across them.
Look toward getting slaves set up at different physical datacenters across the country/world/whatever's applicable to your demographic, and load balance connections to the nearest one. (This assumes traffic is first routed to your front end in those locations; otherwise the problem might just be worsened.)
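The replica side of that setup is mostly boilerplate. A minimal sketch (hostname, credentials, and log coordinates are placeholders you'd take from SHOW MASTER STATUS on the primary):

    -- On each replica: point it at the primary (placeholder values)
    CHANGE MASTER TO
        MASTER_HOST = 'primary.example.com',
        MASTER_USER = 'repl',
        MASTER_PASSWORD = 'secret',
        MASTER_LOG_FILE = 'mysql-bin.000001',  -- from SHOW MASTER STATUS
        MASTER_LOG_POS = 4;                    -- from SHOW MASTER STATUS
    START SLAVE;

    -- Confirm both threads are running and watch Seconds_Behind_Master
    SHOW SLAVE STATUS\G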
- Partitioning. This is especially true if you're using MyISAM and dealing with table-level locks. It's also particularly useful if you have a table with a pattern of periodically pruning old rows (you can just drop partitions instead of running individual delete queries, which fragment the data).
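For the pruning pattern, a range-partitioned table turns the purge into a metadata operation. A sketch with a hypothetical events table (names and ranges are made up for illustration):

    -- Hypothetical table partitioned by month
    CREATE TABLE events (
        id BIGINT NOT NULL,
        created_at DATETIME NOT NULL,
        payload VARCHAR(255),
        PRIMARY KEY (id, created_at)   -- partition key must be part of every unique key
    ) ENGINE=InnoDB
    PARTITION BY RANGE (TO_DAYS(created_at)) (
        PARTITION p2023_01 VALUES LESS THAN (TO_DAYS('2023-02-01')),
        PARTITION p2023_02 VALUES LESS THAN (TO_DAYS('2023-03-01')),
        PARTITION pmax     VALUES LESS THAN MAXVALUE
    );

    -- Pruning a month of old rows drops files instead of deleting row by row
    ALTER TABLE events DROP PARTITION p2023_01;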
Managed to solve this; these are the steps I followed:
Firstly, I contacted the Amazon RDS team by posting on their discussion forum. They confirmed it was the mysqld process taking up all this CPU, which eliminated a configuration fault with something else running on the physical server.
Secondly, I tracked down the source of the queries that were running:
SELECT `mytable`.* FROM `mytable` WHERE `mytable`.`foreign_key` = 231273 LIMIT 1
I originally overlooked this as the cause, because none of these queries seemed to be taking particularly long when I monitored the show processlist output. After exhausting other avenues, I decided it might be worth following up... and I'm glad I did.
As you can see in the show processlist output, these queries were coming from a utility server, which runs some tactical utility jobs that exist outside of our main application code. This is why they were not showing up as slow or causing issues in our New Relic monitoring: the New Relic agent is only installed on our main app server.
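If you're in a similar spot, grouping the processlist by host is a quick way to see where connections originate (a sketch; assumes MySQL 5.1+ with the information_schema PROCESSLIST table):

    -- Which client hosts hold connections, and what state they're in
    SELECT SUBSTRING_INDEX(host, ':', 1) AS client_host,
           COUNT(*)                      AS connections,
           GROUP_CONCAT(DISTINCT state)  AS states
    FROM information_schema.processlist
    GROUP BY client_host
    ORDER BY connections DESC;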
Loosely following this guide:
http://www.mysqlperformanceblog.com/2007/02/08/debugging-sleeping-connections-with-mysql/
I was able to trace these queries to a specific running process on our utility server box. This was a bit of Ruby code that was very inefficiently iterating through around 70,000 records, checking some field values and using those to decide whether it needed to create a new record in 'mytable'. After doing some analysis I was able to determine the process was no longer needed, so it could be killed.
Something that was making matters worse: there seemed to be 6 instances of this same process running at one time, due to the way the cron job was configured and how long each one took! I killed off these processes, and incredibly our CPU usage fell from around 100% to around 5%!
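For anyone needing to do the same, the offending connections can also be killed from the MySQL side while the cron job is disabled (a sketch; the host pattern and ids are placeholders, and on RDS you use the mysql.rds_kill procedure since plain KILL is restricted):

    -- Find the connection ids coming from the utility box (placeholder host)
    SELECT id FROM information_schema.processlist
    WHERE host LIKE 'utility-server%';

    -- Stock MySQL:
    KILL 12345;                  -- placeholder connection id

    -- Amazon RDS:
    CALL mysql.rds_kill(12345);  -- placeholder connection id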
In order to create a live connection from MySQL to Redshift you need to choose an appropriate database replication method. Among the existing methods, Change Data Capture (CDC) is the most suitable for your case.
In this method, after a dump of the initial state of the database, the log of each database operation is continuously streamed, loaded into the data warehouse, and consolidated on a configurable schedule. The biggest advantage is that it enables near real-time replication.
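CDC tools for MySQL generally read the binary log, so the prerequisite on the source side is row-based binary logging (a sketch of the settings to check; exact requirements depend on the tool you pick):

    -- Most CDC readers require these on the MySQL source
    SHOW VARIABLES LIKE 'log_bin';        -- should be ON
    SHOW VARIABLES LIKE 'binlog_format';  -- usually ROW
    SHOW MASTER STATUS;                   -- log file/position for the initial sync point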
A tool that I had a good experience with is Alooma; it can support a near real-time connection between a MySQL database and Amazon Redshift.