Experiments with XML XPath libraries on JVM

28-06-2014

These days I mostly program in Scala. A few weeks ago I ran into the problem of searching for data within fairly large XML documents. XPath and XQuery are the standard technologies for querying XML, and JVM programmers have several XPath libraries to choose from. One constraint in my problem was that the program crunching this XML was a long-running one. So, apart from making the search fast, I had to make sure that the CPU and memory requirements stayed sane. If a library answered an XPath search by forking hundreds of threads and breaking the XML into hundreds of stubs, consuming every ounce of CPU and RAM on my machine, it was simply a no-go, even if it turned out to be an order of magnitude faster than the rest.

A look at the XML/XPath library landscape on the JVM made me shortlist the following for a quick investigation:

  • scala.xml
  • javax.xml.xpath
  • net.sf.saxon.xpath (Saxon)
  • vtd.xml (VTD)

This post is a work in progress, so I will refrain from drawing firm conclusions. I shall add more as and when I find it. A passing reader may find the numbers helpful for some other cause in the wild.

Now, the environment details:

  • The XML I used was roughly 70MB in size. That does not make it very large, but the complexity of the structure can be the hidden variable in XML processing. Even a 5MB XML with small elements, recursive lookups etc. (the kind people refer to as an XML database) can be much harder to search than a 500MB one with a straight, simple flow (say, Log4J XML logs). The XML I used was neither as complex as a database nor as simple as a log. It was more like a configuration file (similar to, but more complex than, a Tomcat web.xml) with fairly deep nesting
  • All numbers are means over a run of 30 iterations. They should be treated as ballpark figures
  • Tests were run on my 4-core, 8GB Mac running OS X Mavericks
  • Java version "1.7.0_51". Scala version "2.11.0"
  • No CPU- or memory-hungry process was running on the system during the tests: just a text editor, a console, the test application and operating system services after a fresh reboot
  • Tests were tried with 4 big buckets of the Xmx setting: 512M, 1024M, 2048M, 4096M
  • All numbers and screen captures are from jvisualvm. I wanted to use jstat but got a little lazy
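
The GC counts and times in this post were read off jvisualvm. As a rough alternative, the same kind of figures can be collected from inside the test program through the standard `java.lang.management` beans. The sketch below is only an illustration: `GcStats`, `snapshot` and `delta` are names made up here, and it uses `scala.collection.JavaConverters` on the assumption of a Scala 2.11-era toolchain.

```scala
import java.lang.management.ManagementFactory
import scala.collection.JavaConverters._

// Hypothetical helper, not part of the original test programs.
object GcStats {
  // Cumulative GC activity so far: collector name -> (collection count, total time in ms).
  // The JVM reports -1 for a value it cannot provide.
  def snapshot(): Map[String, (Long, Long)] =
    ManagementFactory.getGarbageCollectorMXBeans.asScala
      .map(gc => gc.getName -> ((gc.getCollectionCount, gc.getCollectionTime)))
      .toMap

  // GC work done between two snapshots.
  def delta(before: Map[String, (Long, Long)],
            after: Map[String, (Long, Long)]): Map[String, (Long, Long)] =
    after.map { case (name, (count, millis)) =>
      val (count0, millis0) = before.getOrElse(name, (0L, 0L))
      name -> ((count - count0, millis - millis0))
    }

  def main(args: Array[String]): Unit = {
    // Confirms which Xmx bucket the JVM was actually started with.
    println("max heap = " + Runtime.getRuntime.maxMemory / (1024 * 1024) + " MB")
    val before = snapshot()
    // Placeholder workload; the XPath loop would go here instead.
    val junk = (1 to 200000).map(_ => new Array[Byte](64))
    val after = snapshot()
    delta(before, after).foreach { case (name, (count, millis)) =>
      println(name + ": " + count + " collections, " + millis + " ms")
    }
    println("allocated " + junk.length + " arrays")
  }
}
```

The collector names depend on the JVM and the GC in use, so the young and old generation collectors have to be identified by name on each setup.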

One important consideration while choosing an XML library is the API. But that is project specific, so I leave it out of this comparison.

Results Tabulated

| Xmx | Library | Time Taken | App CPU Usage | GC CPU Usage | App Heap Size | Heap Used | Eden collection count/time spent | Old Gen collection count/time spent | Eden pattern | Survivor pattern | Old Gen pattern |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 512m | scala.xml | 240s | 70-80% | 20% | 512M | 250-300M | 359/15.2s | 303/3m18s | either 0M or 170M | not much usage | between 170-340M |
| 512m | javax.xml.xpath | does not complete | | | | | | | | | |
| 512m | net.sf.saxon.xpath | 67s | 60-80% | 20% | 512M | 250-300M | 162/6.2s | 123/39.3s | 0-170M tall spikes | consistent use of 57M * 2 | stepwise between 0-340M |
| 512m | vtd.xml | 11s | 26% | 0.10% | 500M | 150-250M | 13/138ms | 9/262ms | between 100-170M | very little and infrequent | between 80-240M |
| 1024m | scala.xml | 85s | 70-80% | 20% | 1G | 250-500M | 299/36s | 38/14s | 0-340M tall spikes | 100M consistent | 80-600M neat triangles |
| 1024m | javax.xml.xpath | 57s | 50-70% | 10-20% | 1G | 250-500M | 197/14s | 34/15s | 0-340M tall spikes | 100M consistent | 200-600M neat triangles |
| 1024m | net.sf.saxon.xpath | 49s | 50-70% | 10-20% | 1G | 250-500M | 110/12s | 34/15s | 0-340M tall spikes | 100M consistent | 200-600M neat triangles |
| 1024m | vtd.xml | 11s | 30% | 1-2% | 300-800M | 200-700M | 11/66ms | 6/204ms | 200-300M | 10M | 400-600M |
| 2048m | scala.xml | 70s | 70-80% | 10-20% | 2G | 0.5-1G | 154/27s | 26/21s | 0-680M tall spikes | 100M consistent | 200M-1G neat triangles |
| 2048m | javax.xml.xpath | 59s | 40-70% | 10-20% | 2G | 0.5-1G | 105/14s | 23/17s | 0-680M tall spikes | 100M consistent | 0.3-1.1G |
| 2048m | net.sf.saxon.xpath | 39s | 40-70% | 10-20% | 2G | 0.5-1G | 69/10s | 18/8s | 0-680M tall spikes | 200M consistent | 300-600M |
| 2048m | vtd.xml | 11s | 26% | 0% | 0.5-1.25G | 0.5-1.25G | 14/190ms | 6/272ms | 600M consistent | 200M | 1.3G no pattern |

JVisualVM Graphs

javax.xpath CPU and Memory
javax.xpath GC
Saxon CPU and Memory
Saxon GC
VTD CPU and Memory
VTD GC
Scala XML Xpath CPU and Memory
Scala XML GC

Code

javax.xpath

import java.io.FileInputStream
import java.util.Date
import javax.xml.parsers.{DocumentBuilder, DocumentBuilderFactory}
import javax.xml.xpath.{XPath, XPathConstants, XPathFactory}
import org.w3c.dom.{Document, NodeList}

object Main extends App {
	try {
		val builderFactory: DocumentBuilderFactory = DocumentBuilderFactory.newInstance()
		val builder: DocumentBuilder = builderFactory.newDocumentBuilder()
		val xPath: XPath = XPathFactory.newInstance().newXPath()
		println(new Date().toString)

		// Compile the expression once; only parsing and evaluation are repeated.
		val compexp = xPath.compile("/mycompany/MyResourceSet/MyResource/MyResourceList/MyResource[@displayName='Dummy']")

		def evalXml(): Unit = {
			val in = new FileInputStream("sample.xml")
			try {
				// Build a fresh DOM tree on every iteration, then evaluate against it.
				val document: Document = builder.parse(in)
				compexp.evaluate(document, XPathConstants.NODESET) match {
					case n: NodeList => println(n + " at " + new Date().toString + " len = " + n.getLength())
					case _ => println("typecast to NodeList failed")
				}
			} finally in.close() // release the file handle on every iteration
		}

		val t1 = System.currentTimeMillis
		val i = 30

		// `0 until i` runs exactly i iterations (`0 to i` would run i + 1).
		for (j <- 0 until i)
			evalXml()
		val t2 = System.currentTimeMillis
		println(new Date().toString)
		println("avg time = " + (t2 - t1) / i)

	} catch {
		case e: Exception => e.printStackTrace()
	}
}

Saxon

import java.io.FileInputStream
import java.util.Date
import javax.xml.parsers.{DocumentBuilder, DocumentBuilderFactory}
import javax.xml.xpath.{XPathConstants, XPathExpression}
import net.sf.saxon.xpath.{XPathEvaluator, XPathFactoryImpl}
import org.w3c.dom.{Document, NodeList}

object SaxonEx extends App {

	val builderFactory: DocumentBuilderFactory = DocumentBuilderFactory.newInstance()
	val builder: DocumentBuilder = builderFactory.newDocumentBuilder()

	// Instantiate Saxon's factory directly rather than via XPathFactory.newInstance().
	val factory = new XPathFactoryImpl()
	val xpathCompiler: XPathEvaluator = factory.newXPath().asInstanceOf[XPathEvaluator]

	// Note: this expression starts with "//" (descendant search), unlike the javax and VTD versions.
	val xstring = "//mycompany/MyResourceSet/MyResource/MyResourceList/MyResource[@displayName='dummy']"
	val expr: XPathExpression = xpathCompiler.compile(xstring)

	println("running SaxonEx:" + new Date().toString)

	def evalXml(): Unit = {
		val in = new FileInputStream("sample.xml")
		try {
			// The document is still parsed into a standard DOM tree; Saxon only evaluates the XPath.
			val document: Document = builder.parse(in)
			expr.evaluate(document, XPathConstants.NODESET) match {
				case n: NodeList => println(n + " at " + new Date().toString + " len = " + n.getLength())
				case _ => println("typecast to NodeList failed")
			}
		} finally in.close() // release the file handle on every iteration
	}

	val t1 = System.currentTimeMillis
	val i = 30

	// `0 until i` runs exactly i iterations (`0 to i` would run i + 1).
	for (j <- 0 until i)
		evalXml()
	val t2 = System.currentTimeMillis
	println("avg time = " + (t2 - t1) / i)
	println(new Date().toString)
}
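
VTD

import com.ximpleware._
import java.util.Date

object vtd extends App {

	val vg: VTDGen = new VTDGen()

	def loopvtd(): Unit = {
		// parseFile returns false when the file cannot be read or parsed.
		if (!vg.parseFile("sample.xml", false)) println("parse failed")
		else {
			val vn: VTDNav = vg.getNav()
			val ap: AutoPilot = new AutoPilot(vn)
			ap.selectXPath("/mycompany/MyResourceSet/MyResource/MyResourceList/MyResource[@displayName='dummy']")
			// evalXPath returns the index of the next matching node, or -1 when there is none.
			// Only the first match is fetched here, not the whole node set.
			val x = ap.evalXPath()
			if (x != -1) println("eval returned " + x)
			else println("no match found")

			// getText returns the index of the current element's text, or -1 if it has none.
			val value: Int = vn.getText()
			if (value != -1) {
				val title: String = vn.toNormalizedString(value)
				println(title)
			}
		}
	}

	val t1 = System.currentTimeMillis
	val i = 30

	// `0 until i` runs exactly i iterations (`0 to i` would run i + 1).
	for (j <- 0 until i)
		loopvtd()

	val t2 = System.currentTimeMillis
	println(new Date().toString)
	println("avg time = " + (t2 - t1) / i)

}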

Scala

#!/bin/sh
exec scala "$0" "$@"
!#

import java.util.Date
import scala.xml.{NodeSeq, Text, XML}

// Not real XPath: `\\` selects every MyResource descendant at any depth, and the
// filter keeps those whose displayName attribute equals "Dummy".
def findout(filename: String): NodeSeq = {
	val xf = XML.loadFile(filename)
	(xf \\ "MyResource").filter(n => (n \ "@displayName") contains Text("Dummy"))
}

println(new Date().toString)
val t1 = System.currentTimeMillis
val i = 30
// `0 until i` runs exactly i iterations (`0 to i` would run i + 1).
for (j <- 0 until i) {
	findout("sample.xml")
	println(s"iteration $j")
}
val t2 = System.currentTimeMillis
println(new Date().toString)
println("avg time = " + (t2 - t1) / i)
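
The script above searches with `\\`, which looks at every descendant, whereas the other libraries were given an absolute path. A closer scala.xml analogue of that path chains `\` (child) steps. The sketch below shows the difference on a tiny made-up document; the `sample` layout and the names `ScalaXmlPath`, `byPath` and `anywhere` are assumptions for illustration only, not the real test data.

```scala
import scala.xml.{Elem, NodeSeq, XML}

// Hypothetical illustration, not part of the original benchmark.
object ScalaXmlPath {
  // Made-up miniature of the layout implied by the XPath expressions in this post.
  val sample: Elem = XML.loadString(
    """<mycompany>
      |  <MyResourceSet>
      |    <MyResource displayName="outer">
      |      <MyResourceList>
      |        <MyResource displayName="Dummy"/>
      |        <MyResource displayName="other"/>
      |      </MyResourceList>
      |    </MyResource>
      |  </MyResourceSet>
      |  <Elsewhere><MyResource displayName="Dummy"/></Elsewhere>
      |</mycompany>""".stripMargin)

  // Keeps the nodes whose displayName attribute equals `name`.
  private def named(nodes: NodeSeq, name: String): NodeSeq =
    nodes.filter(n => (n \ "@displayName").text == name)

  // Child steps only, mirroring
  // /mycompany/MyResourceSet/MyResource/MyResourceList/MyResource[@displayName=...]
  def byPath(root: Elem, name: String): NodeSeq =
    named(root \ "MyResourceSet" \ "MyResource" \ "MyResourceList" \ "MyResource", name)

  // Descendant search: any MyResource element at any depth.
  def anywhere(root: Elem, name: String): NodeSeq =
    named(root \\ "MyResource", name)

  def main(args: Array[String]): Unit = {
    println("byPath:   " + byPath(sample, "Dummy").length)
    println("anywhere: " + anywhere(sample, "Dummy").length)
  }
}
```

On this sample the two searches disagree, because the descendant search also picks up the `MyResource` under `Elsewhere`. Whether they differ on a real file depends on its structure.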

Epilogue

VTD comes across as the fastest XPath implementation of all, with Saxon next. The standard library implementations of XPath for Java and Scala are much slower. The Scala implementation is not XPath at all and can only be called XPath-like. The code is too simplistic to infer much from the CPU/memory graphs. I have tweaked the code to get slightly better inference and intuition; an interested programmer might do the same to get a better idea.
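
One possible tweak, for anyone repeating the exercise: the timing loops in this post report a single average that includes JVM warm-up. The sketch below separates warm-up runs from measured runs and reports the spread as well. It is only an assumption about what a more careful harness could look like; `Bench`, `Stats` and `measure` are hypothetical names, and it was not used for the numbers above.

```scala
// Hypothetical timing harness, not the code used for the results in this post.
object Bench {
  final case class Stats(runs: Int, meanMs: Double, minMs: Double, maxMs: Double)

  // Runs `body` `warmup` times untimed, then `runs` times timed.
  def measure(warmup: Int, runs: Int)(body: => Unit): Stats = {
    require(runs > 0, "runs must be positive")
    (0 until warmup).foreach(_ => body)
    val samples = (0 until runs).map { _ =>
      val start = System.nanoTime()
      body
      (System.nanoTime() - start) / 1e6 // nanoseconds to milliseconds
    }
    Stats(runs, samples.sum / runs, samples.min, samples.max)
  }

  def main(args: Array[String]): Unit = {
    // Placeholder workload; an XPath evaluation such as evalXml() would go here.
    val stats = measure(warmup = 3, runs = 30) {
      (1 to 100000).map(_.toString).length
    }
    println(stats)
  }
}
```

Reporting the minimum and maximum next to the mean makes it easier to spot runs distorted by a GC pause.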