Streaming Into LINQ to XML Using C# Custom Iterators and XmlReader

In this post, I’ll show you how to create LINQ to XML queries that don’t require you to first load and cache XML content into the in-memory LINQ to XML DOM (that is, without first populating an XDocument or XElement query source), but instead operate against an input stream implemented with a C# custom iterator method and an old-fashioned XmlReader object.

LINQ to XML

LINQ to XML, introduced with the .NET Framework 3.5, is a huge win for developers working with XML in any shape or form. Whether XML is being queried, parsed, or transformed, LINQ to XML can almost always be used as an easier alternative to previous technologies that are based on XML-specific languages (XPath, XQuery, XSLT).

At the center of the LINQ to XML stage lies a new DOM for caching XML data in memory. This object model, based on either a root XDocument object or independent XElement objects, represents a major improvement over the older XmlDocument-based DOM in numerous ways (details of which will serve as the topic for a future post). In terms of querying, the XDocument and XElement objects provide methods (such as Descendants) that expose collections of nodes which can be iterated by a LINQ to XML query.

Consuming Sequences with LINQ

A common misconception by many developers learning LINQ is that LINQ only consumes collections. That is not surprising, since one of first benefits developers come to understand is that LINQ can be used to easily query an input collection without coding a foreach loop. While this is certainly true, it’s a very limited view of what can really be accomplished with LINQ.

The broader view is that LINQ works against sequences, which are often — but certainly not always — collections. A sequence can be generated by any object that implements IEnumerable<T> or any method that returns an IEnumerable<T>. Such objects and methods provide iterators that LINQ calls to retrieve one element after another from the source sequence being queried. For a collection, the enumerator simply walks the elements of the collection and returns them one at a time. But you can create your own class that implements IEnumerable<T> or your own custom iterator method which serves as the enumerator that returns a sequence based on something other than a collection.

The point? LINQ is not limited to querying in-memory collections. It can be used to query any sequence, which is fed to the LINQ query by enumerator methods exposed by all classes that implement IEnumerable<T>. What this means is that you don’t necessarily need to load an entire XML document into an in-memory cache before you can query it using LINQ to XML. If you are querying very large documents in order to extract just a few elements of interest, you can achieve better performance by having your LINQ to XML query stream through the XML—without ever caching the XML in memory.

Querying Cached XML vs. Streamed XML

How do you write LINQ to XML queries that consume a read-only, forward-only input stream instead of a pre-populated in-memory cache? Easy. Refer to a custom iterator method instead of a pre-populated XDocument or XElement in the from clause of your query.

For example, consider the following query:

var xd = XDocument.Load("Customers.xml");
var domQuery =
  from c in xd.Descendants("Customer")
  where (string)c.Attribute("Country") == "UK"
  select c;

Now compare that query with this version:

var streamQuery =
  from c in StreamElements("Customers.xml", "Customer")
  where (string)c.Attribute("Country") == "UK"
  select c;

Both versions query over the same XML and produce the same output result (a sequence of customer nodes from the UK), but they consume their input sequences in completely differently ways. The first version first loads the XML content into an XDocument (in-memory cache), and then queries the collection of nodes returned by the Descendants method (all <Customer> nodes) for those in the UK. The larger the XML document being queried, the more memory is consumed by this approach. The second version queries directly against an input stream that feeds the XML content as a sequence using a custom iterator method named StreamElements. This version will consume no more memory for querying a huge XML file than it will for a tiny one. Here’s the implementation of the custom iterator method:

private static IEnumerable<XElement> StreamElements(
  string fileName,
  string elementName)
{
  using (var rdr = XmlReader.Create(fileName))
  {
    rdr.MoveToContent();
    while (rdr.Read())
    {
      if ((rdr.NodeType == XmlNodeType.Element) && (rdr.Name == elementName))
      {
        var e = XElement.ReadFrom(rdr) as XElement;
        yield return e;
      }
    }
    rdr.Close();
  }
}

Understanding C# Custom Iterators

By definition, this method is a custom iterator since it returns an IEnumerable<T> and has a yield return statement in it. By returning IEnumerable<XElement> specifically (just like the Descendants method in the cached DOM version of the query does), this custom iterator is suitable as the source for a LINQ to XML query. The implementation of this method is a beautiful demonstration of how seamlessly new technology (LINQ to XML) integrates with old technology (the XmlReader object has been around since .NET 1.0). Let’s examine the method piece by piece to understand exactly how it works.

When the query first begins to execute, and LINQ needs to start scanning the input sequence, the StreamElements custom iterator method is called with two parameters. The first parameter is the name of the input XML file (“Customers.xml”) and the second parameter is the element of interest to be queried (“Customer”, for each <Customer> node in the XML file).  The method then opens an “old-fashioned” XmlReader against the XML file and advances the stream to the beginning of its content by invoking MoveToContent. It then enters a loop that reads from the stream one element at a time. Each element is tested against the second parameter (“Customer”, in this example). Each matching element (which would be all <Customer> elements) is converted into in XElement object by invoking the static XElement.ReadFrom method against the reader. The XElement object representing the matching node is then yield returned to the LINQ query.

As the query continues to execute, and LINQ needs to continue scanning the input sequence for additional elements, the StreamElements custom iterator method continues execution right after the point at which it yield returned the previous element, rather than entering at the top of the method like an ordinary method would. In this manner, the input stream advances and returns one matching XElement after another while the LINQ query consumes that sequence and filters for UK to produce a new output sequence with the results. When the end of the input stream is reached, the reader’s Read method will return false, which will end the loop, close the reader, and finally exit the method. When the custom iterator method exits, that signals the LINQ query that there are no more input elements in the sequence and the query completes execution at that point.

One important point that may seem obvious but is worth calling out anyway, is that running streaming queries more than once results in reading through the entire stream each time. So to best apply this technique, you should try to extract everything you need from the XML content in one pass (that is, with one LINQ query). If it turns out that you need to query over the stream multiple times, you’ll need to reconsider matters to determine if you aren’t better off caching the XML content once and then querying over the in-memory cache multiple times.

Custom iterators are a very powerful C# language feature, not necessarily limited for use with LINQ. Streaming input into LINQ to XML queries as an alternative to using a cached input source is just one example, but there are numerous others ways to leverage custom iterators. To learn how they can be used with SQL Server 2008 Table-Valued Parameters to marshal an entire business object collection to a stored procedure in a single round-trip, view my earlier blog post that explains the details: SQL Server 2008 Table-Valued Parameters and C# Custom Iterators: A Match Made In Heaven!

Advertisements