Tuesday, October 31, 2006

Comparing and Merging two XML files

The problem:
I have an XML file conversion program. It converts first.xml to target.xml. I have also written a program to convert target.xml back to first.xml.
The problem is to preserve the changes made to target.xml. The user is allowed to edit target.xml manually; for example, he can add an element, modify attributes, etc.
The conversion from first.xml to target.xml is not a one-to-one conversion. There are scenarios where one element in first.xml is equivalent to 25 elements in the target. Furthermore, first.xml does not capture all the attributes and data needed to generate target.xml, so a lot of values are defaulted during the initial conversion. Hence we need a round-trip solution: the system should be able to import target.xml and export target.xml while preserving all changes made in target.xml.

The current system
XSLT is used to generate target.xml from a given first.xml. Conversion of target.xml to first.xml is again handled by XSLT. It does not support round trips, i.e. the current system loses most of the changes made manually. Other technologies used: XMLBeans, JSF.

The approach
The following features are needed:
a) Export first.xml to target.xml
b) Import target.xml to first.xml

At a conceptual level, the import solution can be modeled as follows:
Retain the current conversion logic (i.e. extract as much data as is needed for first.xml).
Generate a delta of the changes.
Use the delta and merge it with the target.xml generated in the export process.
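
To make the delta idea concrete, here is a minimal sketch using only the JDK's DOM and JAXP classes. It is not the code of the system described in this post. It assumes, purely for illustration, that every element carries a unique "id" attribute that can be used to match elements across documents, and it only tracks added or changed attributes (added or removed elements are ignored). The class and method names (DeltaMergeSketch, computeDelta, applyDelta) are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class DeltaMergeSketch {

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
    }

    /** Indexes every element that carries the (hypothetical) "id" attribute. */
    static Map<String, Element> byId(Document doc) {
        Map<String, Element> index = new LinkedHashMap<>();
        NodeList all = doc.getElementsByTagName("*");
        for (int i = 0; i < all.getLength(); i++) {
            Element e = (Element) all.item(i);
            if (e.hasAttribute("id")) index.put(e.getAttribute("id"), e);
        }
        return index;
    }

    /** Delta = attributes the user added or changed, relative to the baseline target.xml. */
    static Map<String, Map<String, String>> computeDelta(String baselineXml, String editedXml) {
        Map<String, Element> baseline = byId(parse(baselineXml));
        Map<String, Map<String, String>> delta = new LinkedHashMap<>();
        for (Map.Entry<String, Element> entry : byId(parse(editedXml)).entrySet()) {
            Element before = baseline.get(entry.getKey());
            NamedNodeMap attrs = entry.getValue().getAttributes();
            for (int i = 0; i < attrs.getLength(); i++) {
                Node a = attrs.item(i);
                boolean changed = before == null
                        || !before.hasAttribute(a.getNodeName())
                        || !before.getAttribute(a.getNodeName()).equals(a.getNodeValue());
                if (changed) {
                    delta.computeIfAbsent(entry.getKey(), k -> new LinkedHashMap<>())
                         .put(a.getNodeName(), a.getNodeValue());
                }
            }
        }
        return delta;
    }

    /** Re-applies the recorded attribute changes to a freshly generated target.xml. */
    static String applyDelta(String generatedXml, Map<String, Map<String, String>> delta) {
        Document doc = parse(generatedXml);
        Map<String, Element> index = byId(doc);
        delta.forEach((id, attrs) -> {
            Element target = index.get(id);
            if (target != null) attrs.forEach(target::setAttribute);
        });
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not serialize merged document", e);
        }
    }

    public static void main(String[] args) {
        String baseline = "<t><e id='1' color='blue' size='1'/></t>";          // what export produced originally
        String edited = "<t><e id='1' color='red' size='1' note='x'/></t>";    // after manual edits
        String generated = "<t><e id='1' color='blue' size='2'/></t>";         // fresh export
        System.out.println(applyDelta(generated, computeDelta(baseline, edited)));
    }
}
```

The real problem is harder than this, because one element in first.xml can fan out to many target elements and there is no guaranteed key to match them on.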

Activities planned
1) Identify appropriate Java based solutions
2) Evaluate each solution
3) Design the final solution
4) Implement the solution

Solutions evaluated
1) XSLT
2) XMLBeans
3) JAXB
4) XMLUnit
5) The "3DM" XML 3-way Merging and Differencing Tool
6) DOM
7) X-Diff
8) XML Diff and Merge Tool (alphaWorks)

My initial thought process was to generate target.xml and compare it with the one used during import (instead of storing the delta, store the whole document), then copy, overwrite, or ignore each difference in the newly generated target.xml. Hence I started evaluating the various options available for deeply comparing XML trees. The features needed are:
a) Compare the whole document
b) Compare the elements selectively
c) Comparison has to be smart enough to ignore spaces, return characters, alteration of sequences, changes in the position of attributes, etc.
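
As an illustration of what requirements a) to c) amount to, here is a minimal hand-rolled sketch using only the JDK's DOM classes. It is not one of the tools evaluated below. It builds a normalized "signature" string for an element (sorted attributes, trimmed text, sorted child signatures) and compares signatures, so it can be applied to a whole document or to a single element. The class and method names are hypothetical, and the string signature is a shortcut that could in principle collide for unusual content.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class LooseXmlCompare {

    static Element root(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
    }

    /** Normalized form of an element: independent of attribute order, sibling order and surrounding whitespace. */
    static String signature(Element e) {
        List<String> attrs = new ArrayList<>();
        NamedNodeMap map = e.getAttributes();
        for (int i = 0; i < map.getLength(); i++) {
            Node a = map.item(i);
            attrs.add(a.getNodeName() + "=" + a.getNodeValue());
        }
        Collections.sort(attrs); // attribute position no longer matters

        List<String> children = new ArrayList<>();
        StringBuilder text = new StringBuilder();
        NodeList kids = e.getChildNodes();
        for (int i = 0; i < kids.getLength(); i++) {
            Node k = kids.item(i);
            if (k.getNodeType() == Node.ELEMENT_NODE) {
                children.add(signature((Element) k));
            } else if (k.getNodeType() == Node.TEXT_NODE
                    || k.getNodeType() == Node.CDATA_SECTION_NODE) {
                text.append(k.getNodeValue().trim()); // drops indentation and line breaks
            }
            // comments and processing instructions are ignored
        }
        Collections.sort(children); // sibling order no longer matters

        return "<" + e.getNodeName() + attrs + ">" + text + children
                + "</" + e.getNodeName() + ">";
    }

    /** Selective comparison: works on any pair of elements, not only document roots. */
    static boolean looselyEqual(Element a, Element b) {
        return signature(a).equals(signature(b));
    }

    static boolean looselyEqual(String xmlA, String xmlB) {
        return looselyEqual(root(xmlA), root(xmlB));
    }

    public static void main(String[] args) {
        System.out.println(looselyEqual(
                "<a x='1' y='2'><b/><c>hi</c></a>",
                "<a y='2' x='1'>\n  <c> hi </c>\n  <b/>\n</a>"));
    }
}
```

Note that ignoring sibling order is only safe when the target schema does not give meaning to element sequence.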

1) XSLT: very complex. There are a couple of code fragments available out there, but I felt they were too complex to understand and maintain. Also, XSL tends to be too lengthy.
2) XMLBeans: There are two methods defined in XmlObject, compareTo(Object) and compareValue(XmlObject).
public int compareTo(Object obj)
Implements the Comparable interface by comparing two simple XML values based on their standard XML schema ordering. Throws a ClassCastException if no standard ordering applies, or if the two values are incomparable within a partial order.

public int compareValue(XmlObject obj)
This comparison method is similar to compareTo, but rather than throwing a ClassCastException when two values are incomparable, it returns the number 2. The result codes are -1 if this object is less than obj, 1 if this object is greater than obj, zero if the objects are equal, and 2 if the objects are incomparable.

There is also an implementation available in XmlObjectBase.java. This method does not work. I downloaded the source code and tried to fix it, but I was not getting anywhere. One of the responses from BEA says that these methods can handle only simple types (whatever that means) and that developers need to write their own code to handle schema-based comparison.

3) JAXB: The story with JAXB is almost the same as with XMLBeans. The interfaces have compareTo(Object), but there is no implementation.
4) XMLUnit
Comparing XML files is very easy in XMLUnit:
Diff myDiff = new Diff(controlDocument, testDocument);
XMLUnit can tell whether two documents are similar or identical, and can also print all the differences found, with element names and attribute names. The limitation is that the API is built around DOM, i.e. it works with org.w3c.dom.Document, which represents the whole XML document. XMLUnit does not support our requirement b).
5) The "3DM" XML 3-way Merging and Differencing Tool
This project is still in alpha
6) DOM
DOM Level 3 defines the behaviour of isEqualNode(). I tried the default parser in JDK 1.5 and also Xerces-J-bin.2.8.1. In either case it throws an exception saying that a DOM 3 implementation is needed. Even the Oracle parser has DOM 3, but I wonder if they have implemented isEqualNode(). The conclusion is that DOM 3 isEqualNode() is not there yet.
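
For reference, this is roughly what the call looks like. It is a sketch that assumes a JAXP parser which actually implements DOM Level 3 Core (the parsers I tried did not). The class and helper names are hypothetical. Under the DOM 3 definition, attribute order is not significant, but whitespace-only text nodes count as children and child order matters, so isEqualNode() alone would not cover requirement c) anyway.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

final class IsEqualNodeSketch {

    static Element root(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
    }

    /** Deep structural comparison as defined by DOM Level 3 Node.isEqualNode(). */
    static boolean sameTree(String xmlA, String xmlB) {
        return root(xmlA).isEqualNode(root(xmlB));
    }

    public static void main(String[] args) {
        // Only the attribute order differs.
        System.out.println(sameTree("<a x='1' y='2'><b/></a>", "<a y='2' x='1'><b/></a>"));
        // Only the indentation differs, which adds whitespace text nodes.
        System.out.println(sameTree("<a><b/></a>", "<a>\n  <b/>\n</a>"));
    }
}
```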
7) X-Diff
No files released
8) XML Diff and Merge Tool (alphaWorks)
This seems to be targeted at GUI applications. Not suitable for our requirements.

It seems comparing XML is still an underdeveloped technology. This raises a few more questions.
What happened to the do-anything, do-it-anywhere promise of XML? Or did I get this impression because my exploration is limited to the Java world?
How do mature technologies like RDBMSs handle this kind of requirement?
After all, RDBMSs do handle these problems. The solutions in the RDBMS world are much simpler. An RDBMS does not have deeply nested hierarchies, so these types of requirements are handled by manual coding, and we probably do not need tools as sophisticated as in the XML world.

After this exercise I changed my approach. The new approach is to use manual coding. I used XMLBeans as the binding framework and manipulate each element as a Java object, as appropriate, to handle the export and import functions.
