Experiments with XML XPath libraries on JVM
These days I mostly program in Scala. A few weeks ago I needed to search for data within fairly large XML files. XPath and XQuery are the standard technologies for querying XML, and JVM programmers have several XPath libraries to choose from. One constraint in my case was that the program crunching these XML files was a long-running one. So, apart from making the search fast, I had to make sure that the CPU and memory requirements stayed sane. If a library, on being handed an XPath search, forked hundreds of threads and broke the XML into hundreds of stubs, consuming every ounce of CPU and RAM on my machine, it was simply a no-go, even if it turned out to be an order of magnitude faster than the rest.
A look at the XML/XPath library landscape on the JVM made me shortlist the following for a quick investigation:
- scala.xml - Scala’s built-in parser
- javax.xml.xpath
- net.sf.saxon
- vtd-xml
This post is a work in progress and I will refrain from drawing conclusions. I shall add more as and when I find it. A passing reader may find the numbers helpful for some other purpose.
Now, the environment details:
- The XML files I used were approximately 70MB each. That is not very large, but the complexity of the structure can be the hidden variable in XML processing. Even a 5MB XML with small elements, recursive lookups, etc. (the kind people refer to as an XML database) can be much harder to search than a 500MB one with a simple, flat flow (say, Log4J XML logs). The XML I used was neither as complex as a database nor as simple as a log. It was more like a configuration file (similar to, but more complex than, a Tomcat web.xml) with fairly deep nesting
- All numbers are means over a run of 30 iterations. They should be treated as ballpark figures
- Tests were run on my 4-core, 8GB Mac running OS X Mavericks
- Java version “1.7.0_51”. Scala version “2.11.0”
- No CPU- or memory-hungry process was running on the system during the tests. It was just a text editor, a console, the test application and operating system services after a fresh reboot
- Tests were tried with 4 big buckets of Xmx settings: 512M, 1024M, 2048M, 4096M
- All numbers and screen captures are from jvisualvm. I wanted to use jstat but got a little lazy
One important consideration when choosing an XML library is the API. But that is project-specific, so I leave it out of this comparison.
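The measurement approach in the list above (a mean over a fixed number of iterations, with GC counts and times read off a monitoring tool) can also be approximated in code. The following is a minimal sketch, not the harness used for the numbers in this post: `MeasureSketch`, `measure` and `gcSnapshot` are hypothetical names, the workload is a stand-in, and GC figures come from the JDK's `GarbageCollectorMXBean`s rather than from jvisualvm, so they are summed over all collectors instead of being split into Eden and Old Gen.

```scala
import java.lang.management.ManagementFactory

object MeasureSketch {
  // Cumulative GC activity summed over all collectors (young and old together).
  final case class GcSnapshot(count: Long, timeMs: Long)

  def gcSnapshot(): GcSnapshot = {
    val it = ManagementFactory.getGarbageCollectorMXBeans.iterator()
    var count = 0L
    var timeMs = 0L
    while (it.hasNext) {
      val bean = it.next()
      // Both getters return -1 when the value is undefined; treat that as 0.
      count += math.max(0L, bean.getCollectionCount)
      timeMs += math.max(0L, bean.getCollectionTime)
    }
    GcSnapshot(count, timeMs)
  }

  // Mean wall-clock time per iteration plus the GC deltas over the whole run.
  final case class Result(meanMs: Double, gcCount: Long, gcTimeMs: Long)

  def measure(iterations: Int)(body: => Unit): Result = {
    require(iterations > 0, "iterations must be positive")
    val before = gcSnapshot()
    val start = System.nanoTime()
    var n = 0
    while (n < iterations) { body; n += 1 }
    val elapsedMs = (System.nanoTime() - start) / 1e6
    val after = gcSnapshot()
    Result(elapsedMs / iterations, after.count - before.count, after.timeMs - before.timeMs)
  }

  def main(args: Array[String]): Unit = {
    // maxMemory reflects the -Xmx setting the JVM was started with.
    val maxHeapMb = Runtime.getRuntime.maxMemory / (1024 * 1024)
    println(s"max heap = ${maxHeapMb}M")
    // Stand-in workload (an assumption); replace with a parse-and-query call.
    val r = measure(30) { val a = new Array[Byte](1 << 20); a(0) = 1 }
    println(f"mean = ${r.meanMs}%.2f ms, gc count = ${r.gcCount}, gc time = ${r.gcTimeMs} ms")
  }
}
```

Dividing by the same `iterations` value that drives the loop keeps the reported mean consistent with the number of runs actually executed.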
Results Tabulated
| Library | Time Taken | App CPU Usage | GC CPU Usage | App Heap Size | Heap Used | Eden collection count/time spent | Old Gen collection count/time spent | Eden pattern | Survivor pattern | Old Gen pattern |
|---|---|---|---|---|---|---|---|---|---|---|
| Xmx512m | | | | | | | | | | |
| scala.xml | 240s | 70-80% | 20% | 512M | 250-300M | 359/15.2s | 303/3m18s | either 0M or 170M | not much usage | between 170-340M |
| javax.xml.xpath | does not complete | | | | | | | | | |
| net.sf.saxon.xpath | 67s | 60-80% | 20% | 512M | 250-300M | 162/6.2s | 123/39.3s | 0-170M tall spikes | consistent use of 57M * 2 | stepwise between 0-340M |
| vtd.xml | 11s | 26% | 0.10% | 500M | 150-250M | 13/138ms | 9/262ms | between 100-170M | very less and infrequent | between 80-240M |
| Xmx1024m | | | | | | | | | | |
| scala.xml | 85s | 70-80% | 20% | 1G | 250-500M | 299/36s | 38/14s | 0-340M tall spikes | 100M consistent | 80-600M neat triangles |
| javax.xml.xpath | 57s | 50-70% | 10-20% | 1G | 250-500M | 197/14s | 34/15s | 0-340M tall spikes | 100M consistent | 200-600M neat triangles |
| net.sf.saxon.xpath | 49s | 50-70% | 10-20% | 1G | 250-500M | 110/12s | 34/15s | 0-340M tall spikes | 100M consistent | 200-600M neat triangles |
| vtd.xml | 11s | 30% | 1-2% | 300-800M | 200-700M | 11/66ms | 6/204ms | 200-300M | 10M | 400-600M |
| Xmx2048m | | | | | | | | | | |
| scala.xml | 70s | 70-80% | 10-20% | 2G | 0.5-1G | 154/27s | 26/21s | 0-680M tall spikes | 100M consistent | 200M-1G neat triangles |
| javax.xml.xpath | 59s | 40-70% | 10-20% | 2G | 0.5-1G | 105/14s | 23/17s | 0-680M tall spikes | 100M consistent | 0.3-1.1G |
| net.sf.saxon.xpath | 39s | 40-70% | 10-20% | 2G | 0.5-1G | 69/10s | 18/8s | 0-680M tall spikes | 200M consistent | 300-600M |
| vtd.xml | 11s | 26% | 0% | 0.5-1.25G | 0.5-1.25G | 14/190ms | 6/272ms | 600M consistent | 200M | 1.3G no pattern |
JVisualVM Graphs
- javax.xpath CPU and Memory
- javax.xpath GC
- Saxon CPU and Memory
- Saxon GC
- VTD CPU and Memory
- VTD GC
- Scala XML XPath CPU and Memory
- Scala XML GC
Code
javax.xpath
    import java.io.FileInputStream
    import java.util.Date
    import javax.xml.parsers.{DocumentBuilder, DocumentBuilderFactory}
    import javax.xml.xpath.{XPath, XPathConstants, XPathFactory}
    import org.w3c.dom.{Document, NodeList}

    object Main extends App {
      try {
        val builderFactory: DocumentBuilderFactory = DocumentBuilderFactory.newInstance()
        val builder: DocumentBuilder = builderFactory.newDocumentBuilder()
        val xPath: XPath = XPathFactory.newInstance().newXPath()
        println(new Date().toString)
        // Compile the expression once; only parsing and evaluation are repeated.
        val compexp = xPath.compile("/mycompany/MyResourceSet/MyResource/MyResourceList/MyResource[@displayName='Dummy']")

        // Parse the file into a DOM tree and evaluate the compiled expression against it.
        def evalXml(): Unit = {
          val in = new FileInputStream("sample.xml")
          try {
            val document: Document = builder.parse(in)
            compexp.evaluate(document, XPathConstants.NODESET) match {
              case n: NodeList => println(n + " at " + new Date().toString + " len = " + n.getLength)
              case _ => println("typecast to NodeList failed")
            }
          } finally in.close() // do not leak a file handle per iteration
        }

        val t1 = System.currentTimeMillis
        val i = 30
        // Run exactly i iterations so that the average below divides by the right count.
        for (j <- 1 to i) evalXml()
        println(new Date().toString)
        val t2 = System.currentTimeMillis
        println("avg time = " + (t2 - t1) / i)
      } catch {
        case e: Exception => e.printStackTrace()
      }
    }
Saxon
    import java.io.FileInputStream
    import java.util.Date
    import javax.xml.parsers.{DocumentBuilder, DocumentBuilderFactory}
    import javax.xml.xpath.{XPathConstants, XPathExpression}
    import net.sf.saxon.xpath.{XPathEvaluator, XPathFactoryImpl}
    import org.w3c.dom.{Document, NodeList}

    object SaxonEx extends App {
      val builderFactory: DocumentBuilderFactory = DocumentBuilderFactory.newInstance()
      val builder: DocumentBuilder = builderFactory.newDocumentBuilder()
      // Instantiate Saxon's XPath factory directly instead of going through the JAXP lookup.
      val factory = new XPathFactoryImpl()
      val xpathCompiler: XPathEvaluator = factory.newXPath().asInstanceOf[XPathEvaluator]
      val xstring = "//mycompany/MyResourceSet/MyResource/MyResourceList/MyResource[@displayName='dummy']"
      // Compile the expression once; only parsing and evaluation are repeated.
      val expr: XPathExpression = xpathCompiler.compile(xstring)
      println("running SaxonEx:" + new Date().toString)

      // Parse the file into a DOM tree and evaluate the compiled expression against it.
      def evalXml(): Unit = {
        val in = new FileInputStream("sample.xml")
        try {
          val document: Document = builder.parse(in)
          expr.evaluate(document, XPathConstants.NODESET) match {
            case n: NodeList => println(n + " at " + new Date().toString + " len = " + n.getLength)
            case _ => println("typecast to NodeList failed")
          }
        } finally in.close() // do not leak a file handle per iteration
      }

      val t1 = System.currentTimeMillis
      val i = 30
      // Run exactly i iterations so that the average below divides by the right count.
      for (j <- 1 to i) evalXml()
      val t2 = System.currentTimeMillis
      println("avg time = " + (t2 - t1) / i)
      println(new Date().toString)
    }
VTD
    import java.util.Date
    import com.ximpleware.{AutoPilot, VTDGen, VTDNav}

    object vtd extends App {
      val vg: VTDGen = new VTDGen()

      def loopvtd(): Unit = {
        // Second argument false: parse without namespace awareness.
        vg.parseFile("sample.xml", false)
        val vn: VTDNav = vg.getNav()
        val ap: AutoPilot = new AutoPilot(vn)
        ap.selectXPath("/mycompany/MyResourceSet/MyResource/MyResourceList/MyResource[@displayName='dummy']")
        // evalXPath moves the cursor to the next match and returns its index, or -1 if there is none.
        // Only the first match is examined here.
        val x = ap.evalXPath()
        if (x != -1) println("eval returned " + x)
        else println("eval failed")
        // getText returns the index of the text node at the cursor, or -1 if there is none.
        val value: Int = vn.getText()
        if (value != -1) {
          val title: String = vn.toNormalizedString(value)
          println(title)
        }
      }

      val t1 = System.currentTimeMillis
      val i = 30
      // Run exactly i iterations so that the average below divides by the right count.
      for (j <- 1 to i) loopvtd()
      println(new Date().toString)
      val t2 = System.currentTimeMillis
      println("avg time = " + (t2 - t1) / i)
    }
Scala
    #!/bin/sh
    exec scala "$0" "$@"
    !#
    import scala.xml._
    import java.util.Date

    // XPath-like search: every MyResource element, at any depth,
    // whose displayName attribute equals "Dummy".
    def findout(filename: String): NodeSeq = {
      val xf = XML.loadFile(filename)
      (xf \\ "MyResource") filter (_ \ "@displayName" contains Text("Dummy"))
    }

    println(new Date().toString)
    val t1 = System.currentTimeMillis
    val i = 30
    // Run exactly i iterations so that the average below divides by the right count.
    for (j <- 1 to i) {
      findout("sample.xml")
      println(s"iteration $j")
    }
    println(new Date().toString)
    val t2 = System.currentTimeMillis
    println("avg time = " + (t2 - t1) / i)
Epilogue
VTD comes across as the fastest XPath implementation of all, with Saxon next. The standard-library XML querying in Java and Scala is much slower. The Scala implementation is not XPath at all and can at best be called XPath-like. The code is too simplistic to infer much from the CPU/memory graphs. I have tweaked the code to get slightly better inference and intuition; an interested programmer might do the same to get a better idea.
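To make the "XPath-like" remark concrete, here is a small sketch using only the JDK's `javax.xml.xpath` on a made-up inline document (the XML content and the `PathVsDescendant` object are assumptions for illustration, not the benchmark data). It contrasts the full path used in the javax and VTD tests with a descendant search, which is roughly what the Scala `\\ "MyResource"` query does, and shows that attribute values are matched case-sensitively, so `'Dummy'` and `'dummy'` are different queries.

```scala
import java.io.StringReader
import javax.xml.parsers.DocumentBuilderFactory
import javax.xml.xpath.{XPathConstants, XPathFactory}
import org.w3c.dom.NodeList
import org.xml.sax.InputSource

object PathVsDescendant {
  // Tiny hypothetical document mirroring the nesting used in the queries above.
  val sample: String =
    """<mycompany>
      |  <MyResourceSet>
      |    <MyResource displayName="Dummy">
      |      <MyResourceList>
      |        <MyResource displayName="Dummy"/>
      |        <MyResource displayName="Other"/>
      |      </MyResourceList>
      |    </MyResource>
      |  </MyResourceSet>
      |</mycompany>""".stripMargin

  private val doc = DocumentBuilderFactory.newInstance()
    .newDocumentBuilder()
    .parse(new InputSource(new StringReader(sample)))
  private val xpath = XPathFactory.newInstance().newXPath()

  // Number of nodes selected by an XPath expression on the sample document.
  def count(expr: String): Int =
    xpath.evaluate(expr, doc, XPathConstants.NODESET).asInstanceOf[NodeList].getLength

  def main(args: Array[String]): Unit = {
    val full = "/mycompany/MyResourceSet/MyResource/MyResourceList/MyResource[@displayName='Dummy']"
    println(count(full))                                  // 1: only the nested element
    println(count("//MyResource[@displayName='Dummy']"))  // 2: outer and nested elements
    println(count(full.replace("'Dummy'", "'dummy'")))    // 0: the comparison is case-sensitive
  }
}
```

The libraries are therefore only answering the same question when they are given the same path and the same attribute value.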