An efficient filter for .sql files with just “INSERT INTO”

csvfilterscriptingsql

Several times I've found myself working on data that has been exported to .sql files that I want to process rather than import into a database.

I'm only talking about "pure data" sql files. Ones that have nothing other than INSERT INTO statements. (They may have some prolog at the top that I can ignore.)

Generally the files will have many lines beginning with INSERT INTO and ending with ; between will be a variable number of "records" beginning with (, ending with ), and separated with ,. Each "record" will be made up of a fixed number of "fields" which are also separated by ,.

The filter should strip out all the SQL statements, separators, terminators, etc. It should output just one "record" per line with each "field" separated by say a comma or tab.

In my mind it should be totally trivial and common enough that it must've been implemented many times. But I never seem to be able to find existing tools or scripts.

I usually try to tackle it in Perl by treating ),( as the record separator instead of \n but I always get stuck in annoying details.

Has anyone got (or can make) a really simple script that does this? I don't care whether it's in bash, sed, awk, Perl, Python, etc.

Here's some real example data. I have altered it to have only a few records per line. The actual files have about a million characters per line /statement before they reach the ; and begin the next one.

INSERT INTO `categorylinks` VALUES
(0,'Wikisaurus','RunJobs.php','2008-01-27
03:10:30','','','page'),(8,'Wiktionary','TEXT OF THE GNU FREE
DOCUMENTATION LICENSE','2011-01-26
23:50:34','','uppercase','page'),(12,'Wiktionary','WHAT WIKTIONARY IS
NOT\nWHAT WIKTIONARY IS NOT','2011-04-10 18:22:23','What Wiktionary is
not','uppercase','page'),(12,'Wiktionary:Help','WHAT WIKTIONARY IS
NOT\nWHAT WIKTIONARY IS NOT','2011-04-10 18:22:23','What Wiktionary is
not','uppercase','page'); INSERT INTO `categorylinks` VALUES
(2184,'Swedish_nouns','RELIGION','2011-01-06
00:04:41','','uppercase','page'),(2184,'Translations_to_be_checked_(Albanian)','RELIGION','2007-01-27
08:16:34','','uppercase','page'),(2184,'Translations_to_be_checked_(Bulgarian)','RELIGION','2007-01-27
08:16:34','','uppercase','page');

Output would be something like this. Whether fields are quoted or not probably doesn't matter. Comma or tab separated doesn't matter. Correct escaping of commas would be needed though.

 0,'Wikisaurus','RunJobs.php','2008-01-27 03:10:30','','','page'
 8,'Wiktionary','TEXT OF THE GNU FREE DOCUMENTATION LICENSE','2011-01-26 23:50:34','','uppercase','page'
 12,'Wiktionary','WHAT WIKTIONARY IS NOT\nWHAT WIKTIONARY IS NOT','2011-04-10 18:22:23','What Wiktionary is not','uppercase','page'
 12,'Wiktionary:Help','WHAT WIKTIONARY IS NOT\nWHAT WIKTIONARY IS NOT','2011-04-10 18:22:23','What Wiktionary is not','uppercase','page'
 2184,'Swedish_nouns','RELIGION','2011-01-06 00:04:41','','uppercase','page'
 2184,'Translations_to_be_checked_(Albanian)','RELIGION','2007-01-27 08:16:34','','uppercase','page'
 2184,'Translations_to_be_checked_(Bulgarian)','RELIGION','2007-01-27 08:16:34','','uppercase','page'

The data I'm usually working with is part of the dumps of Wikipedia and Wiktionary.

Best Answer

This would be trivial:

1) Import the SQL data into an SQL database

2) Output the data in the format you want with any of the SQL tools for doing this that already exist. E.g. SELECT INTO OUTFILE

And that is totally scriptable. If there are speed issues, get faster hardware, especially drives. If you absolutely want to parse this in some other language, you will probably find yourself writing a full SQL parser, just to deal with all the quotation / braces / reserved words etc. While this is definitely doable, I wouldn't consider it trivial.

Related Question