Feeding a large XML document through JAXP in J2SE 5 (1.5) for XPath processing has some unexpected performance ramifications: the processing time is proportional to the size of the owning Document rather than the size of the sub tree being queried.
Using java.xml.xpath.XPath.evaluate() (or XPathExpression) has a noticeable overhead cost proportional to the size of the context Node's owning Document. This is true even if the XPath expression is a simple immediate-child element reference. i.e. Evaluating an XPath expression for a given Node seems to “visit” the parent, siblings, cousins, etc. in the Document.
My XML document looks something like this:
<request>
<operation1>
<param1>value</param1>
...
<param2>value</param2>
</operation1>
<operation2>
...
</operation2>
...
</request>
There are about a thousand operation elements, and each has about twenty param elements. There are a few additional descendants that I’m not listing.
The Java code grabs all the operation Nodes with a /request/child::* expression. Iterating over each operation, it consults a Map to find which Command object to pass the Node to. Each Command object extracts the param values by evaluating XPath expressions relative to the operation element.
Some black-box sleuthing by someone else revealed that sending the operations as a thousand separate XML documents was about ten times faster than sending the operations in one Document. After some time with OptimizeIt and pondering it over, I have a hypothesis.
As the XPath language is expressive enough to reference parents and siblings relative to the context node, the XPath implementation has to do some preliminary processing of the entire document tree for every evaluation. Thus each XPath evaluation to obtain the param element from the operation element still needs to examine the entire Document in some fashion. Getting M*N param elements, each time incurring the cost of processing an N+M node tree, would explain the observed behaviour.
I don’t know how I’d go about verifying that hypothesis, short of diving in the source code.
BTW, caching the XPath expression compilation using XPathExpression doesn’t change anything.
I put in a one line hack to detach each child node from the Document before passing it along for processing. I never use any parent or sibling references, nor do I forsee it. It brought the performance of the single large Document back into line with that of many small Documents.
It’s rather unusual behaviour. I wonder if the JAXP API, in its generality, hides Xalan’s API (J2SE5 uses Xalan internally) for managing and caching these Documents for XPath purposes. I remember working with Xalan in the Java 1.3/1.4 days, and there was something called a DTM for faster processing of XML via XPath. IIRC, the DTM created an integer representation of the Document tree and its Nodes. One worked directly with the DTMManager to cache and reuse this numerical representation.
May 1, 2009 at 19:39
Hi GJ, I was wondering how you went about detaching a node into a new Document object without having to reparse everything using the DocumentBuilder?
Thanks for the help!
May 2, 2009 at 05:04
IIRC,
Node.removeNode(Node)and/orDocument.adoptNode(Node).June 1, 2009 at 14:35
This strategy has worked for us as well. A 10 mb file with 12K nodes was taking 90 minutes to process. With this change, the time is now less than 1 minute.
XPathExpression nodesXpath = xpathprocessor.compile( “/Response/Nodes”);
NodeList nodes = (NodeList)nodesXpath.evaluate(document.getDocumentElement(),XPathConstants.NODESET);
int index = nodes.getLength();
for(int i = 0; i < index; i++)
{
Node singleNode = nodes.item(i);
singleNode.getParentNode().removeChild(singleNode);
…
// do other XPath.evaluate calls…
December 7, 2009 at 06:00
Thank you very much!!!
This line does wonders:
singleNode.getParentNode().removeChild(singleNode);
Gunnar
January 20, 2010 at 17:21
Hi GJ. I’ve been experiencing the same acking issue. I have an xml document in memory with just 1200 nodes.
Even I iterate through the nodes and then evaluate a local expression, i get humongous times for each call to xpath.evaluate(expr, localNode).
Let me put an example:
1</Field
2</Field
3</Field
…
Suppose I get a Node variable which contains record 1. Evaluating the expression
xpath.evaluate(“Field”,myRecordNode) takes more than 20 minutes when the xml has only 1200 records.
Is it a problem of XPATH specification or it was just developed awfully? Can I change anything (xerces, jaxp, xalan) to avoid this BIG BIG performance problem?
I must say I’ve used xpath extensively with microsoft parser and this doesn’t happen….
January 20, 2010 at 17:49
I don’t think it’s really a matter of something being implemented poorly, as it is an abstract specification / standard not being able to make use of non-standard implementation-specific optimizations.
It’s been quite awhile since I looked into this, but as I recall, the basic issue is that Xerces implemented something called “DTM” which represented the XML DOM more efficiently in memory as a fast table. However, the expectation was that you’d convert the XML document into a DTM object once, and then use the DTM from then on to evaluate XPath expressions.
In the JAXP standard, however, there’s no such notion as a “DTM”, you can only pass DOMs around. So the way that JAXP was implemented in the Sun JVM internally, was to parse each DOM into a DTM, and then make the appropriate XPath query on the DTM as per typical use of the Xerces API.
However, since there’s no way to carry a permanent association of the DOM to DTM, as the caller isn’t even aware of the DTM’s creation, the *next* call to parse an XPath reparsed the DOM into a DTM, repeating the entire process.
For small documents, this isn’t a big deal; for large documents… every time one evaluates an XPath expression, it reparses the DOM (all nodes) into a DTM… which is why the XPath evaluation time appears to be proportional to the number of nodes in the DOM. It’s the parsing of the DOM into a DTM that’s proportional to the number of nodes.
There’s no clean solution. The only “clever” solution I could think of was to somehow maintain a soft hashmap to cache DOM to DTM associations.. but if we used object equality, what happens if one of the nodes in the DOM changed? And if we use a proper value based test, you’re traversing all the nodes anyway to test for equality. And should the core JVM really try to be so clever with soft references, which would appear to bloat memory use for reasons that can’t be easily discerned?
“Impedance mismatch”, I think is the usual term. :p
To answer your question, you could try using the Xerces API directly (i.e. download the library and reference directly) and see if that helps. Or change the underlying JAXP provider to something else (e.g. Saxon). I’m a bit surprised that your “1200″ record XML takes so long, but “1200 records” doesn’t say how many elements in each record.
January 20, 2010 at 17:53
Oops… I was replying by email, and it was longer than I thought… and repeated some of the stuff in the post. I always get Xalan/Xerces mixed up.
January 21, 2010 at 17:45
Correct me if I’m wrong. JAXP is just an abstract definition, and saxon, xerces and xalan provide implementations for xml parsers, and xslt and xpath respectively.
Isn’t there any other implementation out there? Cause this small document cannot be so hard to process… I see it’s a big problem for the library as a whole…
You say that I should try to use xerces library directly. But as far as I know, xerces is not an xpath evaluator, that implementation resides on xalan. Did you mean I should try using XalanApi class directly? what would be the advantage?
The strange thing is that I’ve tried to use the same expression (let’s say /Data/Record[x]/Field[@name='Y']/@value ) but not on different nodes variables. Instead, I give up using relative expressions and use always absolute expression (from the root node I mean). The performacen decreases dramatically with a 504kb file, for instance…
January 21, 2010 at 18:22
Yes, I meant Xalan. Yes, your understanding of JAXP, Xalan, and Saxon is correct.
If you use the Xalan API directly, you can use impelementation specific optimizations that can’t be expose via the JAXP API layer. Or tell the jvm to use a different JAXP impl. Eg Saxon, which might implement Xpath eval differently.
Sorry, it’s been 4 years since I looked into this, I’m going on memory. The best solution I could come up with given my constraints at the time was the one that outlined.
If you discover a better solution, please let me know, I’d be very interested!
January 5, 2011 at 12:47
FYI – We replaced the XPath Processor with the saxon impl, and now large documents that used to take 10+ minutes take about 2-4 seconds. We previously tried the remove/replace node method as well. While that helped (took down to a couple minutes), the impact was still not good enough.
December 8, 2011 at 14:43
How did you this Paul ? Can you post an example of how you changed the Xpath processor with saxon implementation without changing YOUR code ? because for me SAXON API looks a little different, so I can’t just replace these API calls with the new ones. I already have coded everything and now I am facing issues with performance. Can you suggest me on how to do this by minimally changing the code. ?
December 8, 2011 at 17:17
Look at the simple example posted on the IBM site:
http://www.ibm.com/developerworks/library/x-javaxpathapi/index.html
Saxon has implementations for the of xpath, and you gain access to these by changing the XPathFactory. You should do it in your app server settings if you’re running in it, otherwise use this guy just to test it out. You’ll want to set this at the beginning because once you use the factor you stick with it. Of course you’ll need the saxon jar in your classpath and all, and it doesn’t hurt to log the class of your factor and xpath just to make sure you’re getting what you expect.
System.setProperty(“javax.xml.xpath.XPathFactory”,”net.sf.saxon.xpath.XPathFactoryImpl”);
December 9, 2011 at 11:18
Paul .. Thanks for replying. but I am not using javax.xml.xpath.XPathFactory. I am using org.apache.xpath.XPathAPI.
The reason I am using that is, I am restricted to use java142. I cannot use anything other than because platform limitations. The apache XPathAPI I am using, does not have an XpathFactory. Is there something, I can do to make it better. please help
December 9, 2011 at 13:56
Paul: Thanks for helping. I figured out my issue. I was not modifying the DOM anyways. I am just using XML as the layout. So, what I did was, just used cachedXpathApi class to process the Xpath queries and that got down my processing time from 10+ minutes for 1200 nodes to 30 seconds. I thinkk, it is better to go for CachedXpathApi when your DOM does not change because it caches the DOM and doesnt recreate DOM everytime and convet it into DTM. Thanks for the help.
December 5, 2011 at 06:45
Thanks for the tip about removing the child node, really saved my ass!