Scales Xml Memory Optimisation

Disclaimer

Measuring and comparing XML models is fraught with difficulty and will often vary based on individual XML documents and application behaviour.

Introduction

A basic premise of Scales Xml is to separate the structure of XML from its contents, allowing optimisation on both axes. Due to the Elem also being the child node container scala.xml can only leverage identical subtrees, not individual elements. scala.xml can also optimise via string inlining and namespace map caching, which is much more likely than optimising the subtrees during a parse.

The speed of parsing is also directly related to how much garbage is produced during the parse. For example pre 0.1 versions of Scales used Path directly to build the resulting tree, creating many intermediate Path objects - 99.9% of which were immediately ready for garbage collection. Swapping this into a mutable stack of mutable trees improved parsing performance 20-30%.

Scales takes a flexible approach to memory management allowing the user to tune how much work is performed to reduce allocations during the parse and total resulting memory usage.

It should also be noted that the memory usage for a non deferred Xerces DOM is high enough that parsing a 6MB message (on disk, 12MB in memory, Recon 40,000) is only possible with -Xm135M, whereas Scales can process this with -Xmx102M. Indeed for -Xm135M Scales can parse a Recon of 53,000 yielding a Doc of 46Mb (from 16MB xml).

Although Scales allows customisable memory optimisation strategies the base children IndexedSeq is already optimised for memory usage. Vector is appropriate for large structures (> 30 records) but inappropriate for smaller collections, taking many Mb of unnecessary memory usage.

The same is true of a simple immutable array, the costs are too high for items less than 4 in size.

As such Scales provides an abstraction around array like IndexedSeq - ImmutableArrayProxy, which provides a One, Two and Three Seq, an ImmutableArray for less than 32 and a Vector wrapper for greater than 32. There is also ImmutableArrayAll, which reduces offset information to further reduce memory usage.

A last concession to memory performance (and cpu performance) was the replacement of Either as a container and the creating of EitherLike. EitherLike has the similar properties but is a trait applied to classes, fold and projections are still present but the memory usage from the level of indirection is removed. Tests showed that 10-15% of the memory usage was held purely by the Either ADT, and performance was around 4-5% impact across the board.

Options for memory optimisation

Scales Xmls separation of structure and content in particular allows the following areas of memory optimisation:

During several optimisation rounds it was clear that the largest wins were due to QName and simple Elem (no attributes or namespaces) usage, and as such forms basis for the default parsing behaviour.

QNames often repeat during individual XML documents and, of course, across them in a given domain. QNames can also be shared between Attributes and Elems. There are two direct benefits of optimising at the QName/ level:

  1. Reduced memory consumption
  2. Reduced garbage during parsing

However the time taken to cache QNames is significant for smaller documents (e.g. 150-200 elements) where typically any garbage effects are less likely to impact performance.

As can be seen in the below recon parsing performance diagrams the pure performance behaviour of caching QNames (QNameAndSpeed) is less than simply optimising the children building (MutableVectorLike) for most small documents. When the size of the document increases to a level that provokes garbage collection the default caching behaviour starts to provide a direct parsing speed improvement as well.

Parsing Performance Of Smaller XML
Parsing Performance Of Larger XML

The most important point is that the developer has a choice, when needed, in how to optimise and that Scales allows the developer to easily enhance the default optimisations. The defaults use a thread safe singleton to cache, but developers can also choose to implement a ThreadLocal or a per parse strategy if it better fits. If in doubt profile.

Resulting Sizes

The Recon test (See ParsingPerformance - PerfData.reconDoc) favours the default Scales Xml memory optimisation as it does not contain attributes, however it is taken from a real world scenario. The 40,000 Recon size is chosen to demonstrate the resulting memory sizes (NB obtained by the excellent Yourkit profiler):

Memory Usage

As can be seen the size of the Non Optimised version is very large at 65.3MB, whereas the default optimisation (which is also 15-20% faster on larger documents) is roughly halve the size. NB The resulting QName and Elem caches are 2.24KB and 2.18KB respectively, showing more than a clear space saving.

In fact, in addition to being 10MB less than Scala XML for the same document, it comes in at slightly less memory cost as the Xerces deferred DOM impl.

This benefit in memory reduction can also be extended to full Elem caching but this reduces the overall performance by up to 15%, putting it around (and sometimes below) the parsing performance of Scala XML. The potential memory savings are also limited by how often the attribute or namespace values are identical, and it is often better to simply cache attribute values if there is a restricted range available.

Overall Parsing Performance

The overall parsing performance of Scales Xml, with default Optimisation, is around 10-20% faster than Scala XML and a similar amount slower than JAXP.

The larger the amount of data and the level of repeated structure the faster Scales Xml can parse.

Special Case - Pull Parsing via onQNames

A key feature of Scales is the XML Pull (via StAX) based parsing that leverages Scalaz Iteratees. A shining example of this being onQNames, the recon file being a driving force behind its design.

In the recon example what is actually interesting is three simple hash maps (Int -> Int), as such the memory usage of a onQNames -> map was examined. Both the overall resulting savings (over a full tree) and the runtime memory requirements to parse are examined using the following code:

    object Recon {
      import PerfData.ns

      val Part = List(ns("Recon"), ns("Parts"), ns("Part"))
      val Bom = List(ns("Recon"), ns("BOMs"), ns("BOM"))
      val Rec = List(ns("Recon"), ns("Records"), ns("Record"))

      val id = ns("id")
      val version = ns("version")

    }

    case class Recon(val parts : Map[Int, Int], 
		 val boms : Map[Int, Int],
		 val recs : Map[Int, Int])


    import Recon._
    import Elements.Functions.text

    val xml = pullXml(new java.io.StringReader(s)).it
    foldOnDone( xml )( Recon(Map[Int, Int](), Map[Int, Int](),Map[Int, Int]()), 
		      onDone( List(onQNames(Part), 
				   onQNames(Bom), 
				   onQNames(Rec)) )) {
      (recon, qNamesMatch) => 
	if (qNamesMatch.size == 0)
	  recon
	else {
	  // we expect only one to match in this pattern	  
	  val matched = qNamesMatch.head
	  val qnames = matched._1  // to get an onDone it must be defined
	  val x = matched._2.get
	  // only one child
	  val pair = (text(x.\*(id).head).toInt, text(x.\*(version).head).toInt)

	  qnames match {
	    case Part => recon.copy( parts = recon.parts + pair )
	    case Bom => recon.copy( boms = recon.boms + pair )
	    case Rec => recon.copy( recs = recon.recs + pair )
	  }
	}			   
    
    }

In short the memory requirements for the maps are 13.7MB, whereas the test itself can run in under 45MB, albeit with a high GC overhead. As such thats roughly 25MB required to actually parse the entire document in such a high level api.

_Also worth noting is that it takes 42MB to generate the document_

See the PullTests and intro for examples on how this approach can be best leveraged in your code.