I'm running an ETL process and streaming data into a MySQL table.
The data is written over a web connection (a fairly fast one), so that could be a bottleneck.
Anyway, it's a basic insert/update function. The table has a list of IDs as the primary key/index, plus a few attributes.
If a new ID is found, insert; otherwise, update. You get the idea.
Currently, an "update, else insert" function keyed on the (indexed) ID runs at 13 rows/second, which seems pretty abysmal, right? For context, this is comparing 1,000 incoming rows against a table of 250k records.
For comparison, a "pure" insert-everything approach already speeds the process up to 26 rows/second.
The thing with the pure insert approach is that I can have 20 parallel connections inserting at once (20 is the maximum allowed by my web host), whereas the update function cannot run in parallel at all.
Thus 26 x 20 = 520 rows/second. That is far greater than 13 rows/second, especially if I can rig something up that pushes even more data through in parallel.
My question is: given the massive benefit of inserting vs. updating, is there a way to duplicate the "update" functionality (I only want the most recent insert of a given ID to survive) by doing a massive insert and then running a delete afterwards that removes duplicate IDs that aren't the newest?
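To make the idea concrete, here is a minimal sketch of "insert everything, then delete the non-newest duplicates". It is an illustration under assumptions, not a tested MySQL solution: it uses Python's built-in `sqlite3` purely so it runs on its own, and the table name `staging`, the columns `seq` and `attr`, and the function `load_and_dedupe` are all hypothetical. It also assumes the ID column is not unique in this table (otherwise duplicates could not be appended), and that "newest" means "inserted last" according to an auto-increment column.

```python
import sqlite3


def load_and_dedupe(rows):
    """Append every (id, attr) row, then keep only the newest row per id."""
    conn = sqlite3.connect(":memory:")
    # seq is a hypothetical auto-increment column that records insert order.
    # id is deliberately NOT unique here, so duplicates can be appended freely.
    conn.execute(
        "CREATE TABLE staging ("
        "seq INTEGER PRIMARY KEY AUTOINCREMENT, "
        "id INTEGER NOT NULL, "
        "attr TEXT)"
    )
    conn.execute("CREATE INDEX idx_staging_id ON staging (id)")

    # Step 1: pure inserts, with no per-row existence check.
    conn.executemany("INSERT INTO staging (id, attr) VALUES (?, ?)", rows)

    # Step 2: delete every row that is not the highest seq for its id.
    conn.execute(
        "DELETE FROM staging WHERE seq NOT IN "
        "(SELECT MAX(seq) FROM staging GROUP BY id)"
    )
    conn.commit()

    result = conn.execute(
        "SELECT id, attr FROM staging ORDER BY id"
    ).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    print(load_and_dedupe([(1, "old"), (2, "only"), (1, "new")]))
```

Two caveats with this sketch. MySQL rejects a `DELETE` whose subquery selects directly from the table being deleted from (error 1093), so the delete step would need rewriting there, for example as a multi-table `DELETE` with a self-join. And with 20 parallel connections, insert order across connections may not match the order of the source data, so a timestamp or version carried in the data itself may be a safer definition of "newest" than an auto-increment value.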
Is this something easy to implement, or something that comes up often?
What else can I do to make this update process faster? I know getting rid of the web connection between the ETL tool and the DB is a start, but what else? This seems like it would be a fairly common problem.
Ultimately there are 20 columns, each probably varchar(50) at most. Should I be getting a lot more than 13 rows processed per second?
Best Answer
INSERTs only