Sorting XML files so that differences can then be found

Tags: diff, sorting, xml

I need to compare two XML files, each of which is about 13,000 lines long.

Sadly the code that generates these files doesn't generate the data in the same order each time (the data comes from a database).

Therefore, I get false positives when using a standard line-by-line diff utility (WinMerge), even after canonicalising the XML file.

As an example of my problem:

file1:

<a>
  <b key="fruit.preferred">banana</b>
  <b key="fruit.available">pineapple</b>
  <b key="fruit.available">apple</b>
  <b key="fruit.available">orange</b>
</a>

file2:

<a>
  <b key="fruit.available">pineapple</b>
  <b key="fruit.preferred">banana</b>
  <b key="fruit.available">apple</b>
  <b key="fruit.available">orange</b>
</a>

These files have the same content, but the position of the banana line means that a traditional diff considers them different. Are there any tools that can sort the files so that they are considered the same?

By the way, the XML file structures are more complicated than the examples above!

Best Answer

I think you can use a tool such as diffxml for this purpose.

http://diffxml.sourceforge.net/

The tool's webpage states:

The standard Unix tools diff and patch are used to find the differences between text files and to apply the differences. These tools operate on a line by line basis using well-studied methods for computing the longest common subsequence (LCS).

Using these tools on hierarchically structured data (XML etc) leads to sub-optimal results, as they are incapable of recognizing the tree-based structure of these files.
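
If a dedicated XML diff tool is not an option, the sorting approach asked about in the question can be sketched in a few lines of Python. This is only a minimal sketch under some assumptions: sibling order carries no meaning in your files, there is no mixed content (text interleaved with child elements), and Python 3.9 or later is available for ET.indent. The function names sort_key, sort_children and normalized are made up for illustration and do not come from any tool mentioned here.

```python
import xml.etree.ElementTree as ET


def sort_key(elem):
    # Hypothetical ordering: tag, then attributes, then stripped text,
    # then the keys of the (already sorted) children as a tie-breaker.
    return (
        elem.tag,
        sorted(elem.attrib.items()),
        (elem.text or "").strip(),
        [sort_key(child) for child in elem],
    )


def sort_children(elem):
    # Sort the deepest levels first so that a parent's key is built
    # from children that are already in their final order.
    for child in elem:
        sort_children(child)
    elem[:] = sorted(elem, key=sort_key)


def normalized(xml_text):
    # Parse, sort every sibling group, and re-indent so that
    # whitespace differences do not show up in the diff.
    root = ET.fromstring(xml_text)
    sort_children(root)
    ET.indent(root)  # requires Python 3.9+
    return ET.tostring(root, encoding="unicode")
```

Writing normalized(...) of each file to a new file and then comparing the two results in WinMerge (or any line-based diff) should leave only the genuine content differences, provided the assumptions above hold for your data.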
