Slow XPath evaluation for large XML documents in Java 1.5

Feeding a large XML document through JAXP in J2SE 5 (1.5) for XPath processing has some unexpected performance ramifications: the processing time is proportional to the size of the owning Document rather than the size of the sub tree being queried.

Using java.xml.xpath.XPath.evaluate() (or XPathExpression) has a noticeable overhead cost proportional to the size of the context Node's owning Document.  This is true even if the XPath expression is a simple immediate-child element reference.  i.e. Evaluating an XPath expression for a given Node seems to “visit” the  parent, siblings, cousins, etc. in the Document.

My XML document looks something like this:
<request>
  <operation1>
    <param1>value</param1>
    ...
    <param2>value</param2>
  </operation1>
  <operation2>
  ...
  </operation2>
  ...
</request>

There are about a thousand operation elements, and each has about twenty param elements.  There are a few additional descendants that I’m not listing.

The Java code grabs all the operation Nodes with a /request/child::* expression.  Iterating over each operation, it consults a Map to find which Command object to pass the Node to.  Each Command object extracts the param values by evaluating XPath expressions relative to the operation element.

Some black-box sleuthing by someone else revealed that sending the operations as a thousand separate XML documents was about ten times faster than sending the operations in one Document.  After some time with OptimizeIt and pondering it over, I have a hypothesis.

As the XPath language is expressive enough to reference parents and siblings relative to the context node, the XPath implementation has to do some preliminary processing of the entire document tree for every evaluation.  Thus each XPath evaluation to obtain the param element from the operation element still needs to examine the entire Document in some fashion.  Getting M*N param elements, each time incurring the cost of processing an N+M node tree, would explain the observed behaviour.

I don’t know how I’d go about verifying that hypothesis, short of diving in the source code.

BTW, caching the XPath expression compilation using XPathExpression doesn’t change anything.

I put in a one line hack to detach each child node from the Document before passing it along for processing.  I never use any parent or sibling references, nor do I forsee it.  It brought the performance of the single large Document back into line with that of many small Documents.

It’s rather unusual behaviour.  I wonder if the JAXP API, in its generality, hides Xalan’s API (J2SE5 uses Xalan internally) for managing and caching these Documents for XPath purposes.  I remember working with Xalan in the Java 1.3/1.4 days, and there was something called a DTM for faster processing of XML via XPath.  IIRC, the DTM created an integer representation of the Document tree and its Nodes.  One worked directly with the DTMManager to cache and reuse this numerical representation.

Follow

Get every new post delivered to your Inbox.