Using a GA to report differences in XML

A few years ago, we had to implement a simple difference tool for XML.

We had a very specific need but did not want to include anything too sophisticated in our project. We only needed to report the differences between two XML documents in various ways, and in XML format, so that we could highlight them or do something about them.

We had a look at a few commercial libraries such as Delta XML, but at the time none suited our needs, either because of the licensing terms or because the tool did not report what we were after. So we developed our own and decided to open source it, just in case someone else might be interested.

This led to the development of DiffX, a Java API for comparing XML documents. Instead of using complex tree algorithms, we decided to tackle the problem differently by viewing an XML document as a sequence of events and analysing the differences between two such sequences.
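To make the "sequence of events" idea concrete, here is a minimal sketch that flattens a document into a flat list of tokens using the standard StAX parser (javax.xml.stream, available from Java 6). It is not DiffX's actual API: the class name XmlEventSequence, the method flatten and the token format are all made up for illustration, and attributes, namespaces and comments are simply ignored.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper for illustration only; not part of DiffX.
final class XmlEventSequence {

    /** Flattens an XML document into a list of simple event tokens. */
    static List<String> flatten(String xml) {
        List<String> events = new ArrayList<>();
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // Merge adjacent character data so each text node yields one event.
            factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                switch (reader.next()) {
                    case XMLStreamConstants.START_ELEMENT:
                        events.add("<" + reader.getLocalName() + ">");
                        break;
                    case XMLStreamConstants.END_ELEMENT:
                        events.add("</" + reader.getLocalName() + ">");
                        break;
                    case XMLStreamConstants.CHARACTERS:
                        // Whitespace-only text is dropped (an assumption of this sketch).
                        String text = reader.getText().trim();
                        if (!text.isEmpty()) {
                            events.add("text:" + text);
                        }
                        break;
                    default:
                        // Attributes, comments, etc. are ignored in this sketch.
                        break;
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Not well-formed XML", e);
        }
        return events;
    }
}
```

Once both documents are reduced to lists like this, comparing them becomes a sequence comparison problem rather than a tree comparison problem.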

The idea worked great for our purpose, and DiffX has been used in production in several software packages since then, including PageSeeder.

The project is currently hosted on Topologi’s website, can be downloaded from SourceForge, and is distributed under the very lenient Artistic Licence.

But the algorithm we use is memory hungry: it builds a matrix whose dimensions depend on the size of each document, which makes DiffX unsuitable for large documents.
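To give a feel for why such a matrix gets expensive, here is a rough sketch of a classic longest-common-subsequence table built over two event sequences. This is an assumption about the general family of algorithm, not a copy of the DiffX code; the class LcsMatrix and its methods are hypothetical names used only for illustration.

```java
import java.util.List;

// Hypothetical illustration of a matrix-based comparison; not the DiffX code.
final class LcsMatrix {

    /** Builds the full LCS table: one cell per pair of positions, plus a border. */
    static int[][] build(List<String> a, List<String> b) {
        int[][] table = new int[a.size() + 1][b.size() + 1];
        // Fill from the bottom-right corner; the last row and column stay at 0.
        for (int i = a.size() - 1; i >= 0; i--) {
            for (int j = b.size() - 1; j >= 0; j--) {
                if (a.get(i).equals(b.get(j))) {
                    table[i][j] = table[i + 1][j + 1] + 1;
                } else {
                    table[i][j] = Math.max(table[i + 1][j], table[i][j + 1]);
                }
            }
        }
        return table;
    }

    /** Number of events the two sequences have in common, in order. */
    static int lcsLength(List<String> a, List<String> b) {
        return build(a, b)[0][0];
    }

    /** Number of cells the table needs for sequences of n and m events. */
    static long cellCount(int n, int m) {
        return (long) (n + 1) * (m + 1);
    }
}
```

With a table like this, two documents of 10,000 events each need about 100 million cells, which is roughly 400 MB if each cell is a 4-byte int. The cost grows with the product of the two document sizes, not their sum.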

It solved our problem, so we didn’t give it much thought afterwards, and the project fell into neglect. But recently I stumbled upon JGAP, a good Genetic Algorithm library, and realised that a GA could be an interesting way of solving the problem.

So I have decided to resurrect DiffX. Rick Jelliffe’s first reaction was ‘How fun!’. Hopefully, it will be. In the meantime, I will brush up the code to use Java 5, remove deprecated classes, etc. Then, if a GA is indeed suitable for our problem space (and if time permits!), I will try to work on some new algorithms.

More to be posted on this…
