MySQL – Testing data integrity when porting from an old database to a new one

I've created a Web/ASP.NET/MySQL database front end and back end that is populated from the XML logs of a previous Java-based XML client-server system. That system is not serving the company very well: the XML database bloats too quickly and the server begins to time out, which halts the production floor.

Any future data will come through the web front end or API calls to the back end, and we'll eventually (after testing) shut down the old client-server database.

We're about to go live for testing with the new database, and I'm worried about the integrity of the imported data. We ended up with a lot of duplicate data (about 18%; see the SO question about this here), which I've cleared up (or at least I've come up with a repeatable method to do so). I've also tested that each primary-key row ID actually has data associated with it in its foreign-key child tables, and that those children have the same PK-FK association down to the lowest child table (hopefully that makes sense). However, I've only tested that the rows contain data, not that they contain the right data.

My thought is to choose random log files from the old XML database and make sure that their data is where it should be in the new MySQL version. Ideally I'd be able to automate this. Breaking the old logs down into their constituent parts for insertion into the database is complex, but simply searching the database for the data they contain may be less so.
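The spot check described above could be automated roughly as follows. This is only a sketch in Python: the standard library's sqlite3 stands in for MySQL so the example runs anywhere, and the XML layout (`<event>` elements with `pcbid`, `datetime`, and `user` attributes), the `events` table, and the helper names are all assumptions made for illustration, since the real log format isn't shown here.

```python
import random
import sqlite3
import xml.etree.ElementTree as ET


def find_missing_events(conn, xml_text):
    """Return (pcbid, datetime, user) tuples present in a log but absent from the DB."""
    missing = []
    for node in ET.fromstring(xml_text).iter("event"):
        key = (int(node.get("pcbid")), node.get("datetime"), node.get("user"))
        row = conn.execute(
            "SELECT 1 FROM events WHERE PCBID = ? AND DateTime = ? AND User = ?",
            key,
        ).fetchone()
        if row is None:
            missing.append(key)
    return missing


def spot_check(conn, logs, sample_size, seed=None):
    """Check a random sample of logs; map each failing log name to its missing events."""
    rng = random.Random(seed)
    names = rng.sample(sorted(logs), min(sample_size, len(logs)))
    report = {}
    for name in names:
        missing = find_missing_events(conn, logs[name])
        if missing:
            report[name] = missing
    return report


# Tiny in-memory demonstration with made-up data (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (PCBID INTEGER, DateTime TEXT, User TEXT)")
conn.execute("INSERT INTO events VALUES (1, '2013-01-01 08:00:00', 'alice')")

logs = {
    "a.xml": '<log><event pcbid="1" datetime="2013-01-01 08:00:00" user="alice"/></log>',
    "b.xml": '<log><event pcbid="2" datetime="2013-01-02 09:30:00" user="bob"/></log>',
}
report = spot_check(conn, logs, sample_size=2, seed=0)
print(report)
```

Here only `b.xml` is reported, because its event was never imported. The same idea extends to child tables by comparing further values from each log, and a fixed `seed` makes a failing sample reproducible.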

I'm relatively stupid when it comes to DBA stuff, and I realize what I'm asking is kind of vague, but what I'm looking for is this: what would a DBA want to know to be sure they were working with reliable data? If this can be figured out and tested for (testing is my forte), I'm going to feel a lot better about the data. The entire company will be relying on this data once it goes live, so I need to get this right.

Best Answer

This answer may not be what I was originally looking for, but it's the sequence of steps I went through to ensure that only unique data was inserted into the database, rather than trying to purge duplicates after the fact (and never being certain I'd caught them all). As I indicated in my comment above, I investigated using INSERT IGNORE INTO to prevent duplicate data. To do so, I created a composite unique key made up of a primary key from another table, the timestamp, and the user name. If that combination is not unique, the row is not inserted. The table definition looked like this; the unique key is the last line shown:

CREATE TABLE IF NOT EXISTS `uniqueTest`.`events` (
  `Event_ID` INT(11) NOT NULL AUTO_INCREMENT,
  `PCBID` INT(11) NOT NULL,
  `EventType_ID` INT(11) NOT NULL,
  `DateTime` DATETIME NOT NULL,
  `User` VARCHAR(45) NOT NULL,
  PRIMARY KEY (`Event_ID`),
  UNIQUE KEY `PDU_Index` (`PCBID`, `DateTime`, `User`),...

And then used string.Format to build the insert string like this:

-- {2} is the string.Format placeholder for the timestamp
INSERT IGNORE INTO events (PCBID, EVENTTYPE_ID, DATETIME, USER)
VALUES (@eventPCBID, @EVENTTYPE_ID, '{2}', @user);
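To see the unique-key-plus-ignore behaviour in isolation, here is a minimal runnable sketch in Python. It is not the C#/MySQL code used above: sqlite3 stands in for MySQL, so the statement is spelled `INSERT OR IGNORE` (SQLite's counterpart to MySQL's `INSERT IGNORE`), and the helper name `insert_event` is made up for illustration. All values are bound as parameters rather than formatted into the string.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE events (
        Event_ID     INTEGER PRIMARY KEY AUTOINCREMENT,
        PCBID        INTEGER NOT NULL,
        EventType_ID INTEGER NOT NULL,
        DateTime     TEXT    NOT NULL,
        User         TEXT    NOT NULL,
        UNIQUE (PCBID, DateTime, User)  -- composite key that defines a duplicate
    )
    """
)


def insert_event(conn, pcbid, event_type_id, when, user):
    """Insert one event; return True if a row was added, False if it was a duplicate."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO events (PCBID, EventType_ID, DateTime, User) "
        "VALUES (?, ?, ?, ?)",
        (pcbid, event_type_id, when, user),
    )
    # rowcount is 0 when the unique key rejected the row.
    return cur.rowcount == 1


first = insert_event(conn, 7, 1, "2013-01-01 08:00:00", "alice")
again = insert_event(conn, 7, 1, "2013-01-01 08:00:00", "alice")  # same key
other = insert_event(conn, 7, 1, "2013-01-01 08:00:05", "alice")  # new timestamp
total = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(first, again, other, total)
```

Re-running an import against a table like this is repeatable: rows that are already present are skipped rather than duplicated.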

Along with this I came up with (with lots of SO help) a way to also skip any child inserts: the LAST_INSERT_ID (stored in a variable such as @INSERT_ID) used by the follow-on statements is set to NULL when the parent insert was ignored, and if it is NULL the child insert is simply not done. You can read about that here. Here's an example of how I used a variable to determine the value to use for LAST_INSERT_ID:

SET @EVENT_ID = IF(ROW_COUNT() > 0, LAST_INSERT_ID(), null);
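The same guard can be sketched in Python, again with sqlite3 standing in for MySQL: `cursor.rowcount` plays the role of ROW_COUNT() and `cursor.lastrowid` the role of LAST_INSERT_ID(). The `teststeps` table, its columns, and the `import_event` helper are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE events (
        Event_ID INTEGER PRIMARY KEY AUTOINCREMENT,
        PCBID    INTEGER NOT NULL,
        DateTime TEXT    NOT NULL,
        User     TEXT    NOT NULL,
        UNIQUE (PCBID, DateTime, User)
    );
    CREATE TABLE teststeps (
        TestSteps_ID INTEGER PRIMARY KEY AUTOINCREMENT,
        Event_ID     INTEGER NOT NULL REFERENCES events (Event_ID),
        StepName     TEXT    NOT NULL
    );
    """
)


def import_event(conn, pcbid, when, user, step_names):
    """Insert a parent event and its children; skip the children for a duplicate parent."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO events (PCBID, DateTime, User) VALUES (?, ?, ?)",
        (pcbid, when, user),
    )
    # Same idea as IF(ROW_COUNT() > 0, LAST_INSERT_ID(), NULL): trust the new ID
    # only if this statement really inserted a row. After an ignored insert,
    # lastrowid may not refer to this row at all.
    event_id = cur.lastrowid if cur.rowcount > 0 else None
    if event_id is not None:
        conn.executemany(
            "INSERT INTO teststeps (Event_ID, StepName) VALUES (?, ?)",
            [(event_id, name) for name in step_names],
        )
    return event_id


first_id = import_event(conn, 7, "2013-01-01 08:00:00", "alice", ["power-on", "scan"])
dup_id = import_event(conn, 7, "2013-01-01 08:00:00", "alice", ["power-on", "scan"])
step_count = conn.execute("SELECT COUNT(*) FROM teststeps").fetchone()[0]
print(first_id, dup_id, step_count)
```

Without the row-count check, the children of a duplicate event could be attached to whichever parent row happened to be inserted last.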

I soon realized that I had to do more work, because I frequently had child data that was inserted with multi-row VALUES lists, and I needed to insert those rows if the ID wasn't NULL or skip them as before. I did this by building the rows from SELECT statements joined with UNION ALL; you can read about it here.

The last thing that became a problem was that I'd frequently have NULL data, and MySQL didn't like having that in the sub-selects, so it became necessary to give every value an alias, something like this (keep in mind that all of these statements were being generated by a set of C# statements in LINQPad):

// The formatted values are assumed to already be valid SQL literals (quoted and escaped).
var m = stringMeasurements.First();
measurementInsertString += String.Format(
    "SELECT @TestSteps_TestSteps_ID AS e, {0} AS MeasurementName, NULL AS MeasuredValue, NULL AS Min, NULL AS Max, {1} AS MeasuredString, {2} AS ExpectedString UNION ALL\n",
    m.measurementName,
    m.measuredString,
    m.expectedString);
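For readers who want to see the whole shape of such a statement, here is one way the guarded multi-row insert might be assembled, sketched in Python with sqlite3 standing in for MySQL. The `measurements` table, the reduced column list, the outer `WHERE ... IS NOT NULL` filter, and the helper names are assumptions; the exact statement used above isn't shown. Every value, including NULLs, is bound as a parameter and given an alias in each SELECT.

```python
import sqlite3

COLUMNS = ("TestSteps_ID", "MeasurementName", "MeasuredValue", "MeasuredString")


def build_guarded_insert(row_count):
    """Build INSERT ... SELECT with one aliased SELECT per row, joined by UNION ALL."""
    select_one = "SELECT " + ", ".join(f"? AS {c}" for c in COLUMNS)
    rows = " UNION ALL ".join([select_one] * row_count)
    return (
        f"INSERT INTO measurements ({', '.join(COLUMNS)}) "
        f"SELECT * FROM ({rows}) AS t WHERE t.TestSteps_ID IS NOT NULL"
    )


def insert_measurements(conn, test_step_id, rows):
    """Insert (name, value, text) rows under test_step_id; nothing if the ID is None."""
    if not rows:
        return 0
    params = [v for (name, value, text) in rows
              for v in (test_step_id, name, value, text)]
    return conn.execute(build_guarded_insert(len(rows)), params).rowcount


conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE measurements (
        TestSteps_ID    INTEGER NOT NULL,
        MeasurementName TEXT    NOT NULL,
        MeasuredValue   REAL,
        MeasuredString  TEXT
    )
    """
)

kept = insert_measurements(conn, 5, [("Voltage", 3.3, None), ("Label", None, "PASS")])
skipped = insert_measurements(conn, None, [("Voltage", 3.3, None)])
stored = conn.execute(
    "SELECT MeasurementName, MeasuredValue, MeasuredString "
    "FROM measurements ORDER BY MeasurementName"
).fetchall()
print(kept, skipped, stored)
```

When the parent ID is None, the filter removes every row before it reaches the table, so the NOT NULL constraint is never hit and the statement simply inserts nothing.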

Hopefully this will help someone else who's going down a similar path.