Tuesday, May 26, 2009

Caching results of .NET IEnumerable<T> generator methods

If you're already familiar with generator methods and want to jump to intelligent caching of their results, skip further down in this blog post.

In C#, generator methods are methods that use yield return to return an IEnumerable<T> where the elements enumerated over are generated on-demand.  These are useful for a couple of scenarios.  One is deferred execution – not doing work until you absolutely need to. The other is because the method may generated infinite results, or more results than the caller may care to enumerate over – thus avoiding unnecessary computation altogether.

Here’s an example non-generator method. This showcases some simple use cases although it’s obviously not computationally intensive in this example:

/// Returns a sequence of elements as a complete list.
public IEnumerable<int> GetNumbers() {
	List<int> results = new List<int>();
	results.Add(1);
	results.Add(2);
	for (int i = 3; i < 10; i++) {
		results.Add(i);
	}
	return results;
}

And it's "generator method" style equivalent:

/// Returns a sequence of elements, generating them on-demand.
public IEnumerable<int> GetNumbers() {
	yield return 1;
	yield return 2;
	for (int i = 3; i < 10; i++) {
		yield return i;
	}
}

In the first example all the results are generated and collected immediately and returned as a batch. In the second example, the initial call to GetNumbersGenerator() doesn’t execute anything until the Enumerable<int> actually is queried for its first element.  For each enumerated element, only enough code is executed to determine the next element.  The end of the method isn’t executed until the end of the sequence is reached. 

The important thing to note from these examples is that the signatures of the methods are the same.  They both return IEnumerable<int> and yet one defers execution and the other proactively generates all results before returning.  Why not return a IList<int> from the non-generator method?  Because that locks you into this style and also implies that the caller might have modification access to the list.  By returning IEnumerable<T>, the method can be re-implemented as a generator method later if desired, and the caller has no doubt that it is receiving a read-only copy.

Now consider how these methods might be used:

public void PrintNumbers() {
	IEnumerable<int> numbers = GetNumbers();

	// Print to the screen.
	foreach(int element in numbers) {
		Console.WriteLine(element);
	}

	// Print to the log file as well.
	foreach(int element in numbers) {
		Logger.Log(element);
	}
}

If GetNumbers() is implemented as returning a List<int> object, then the work to generate the sequence of numbers is only done once. But if GetNumbers() is implemented as a generator method, the work is done twice: once in each foreach loop.  Since the point of generator methods is to decrease CPU work load, this obviously is counter-productive.  Also, since the caller code is unchanged and the signatures of the methods are the same, this code may be written without any realization of the inefficiency. 

What we need is a way to consume generator methods, leveraging their goodness for deferred execution, without the risk of causing processing to be done twice where that processing is expensive and the cost of storing a cached result is relatively inexpensive.

Introducing IEnumerable<T>.CacheGeneratedResults()

To solve this problem, I have written an extension method called CacheGeneratedResults.  Since it’s an extension method, it automatically is available on all IEnumerable<T> objects.  It preserves the deferred execution goodness of generator methods, but avoids repeat sequence generation if it is enumerated over multiple times.  It does this by caching all generated results as they are pulled by the caller in an hidden List<T>.  To fix the above example, all you have to do is add .CacheGeneratedResults() to the code:

public void PrintNumbers() {
	IEnumerable<int> numbers = GetNumbers().CacheGeneratedResults();

	// Print to the screen.
	foreach(int element in numbers) {
		Console.WriteLine(element);
	}

	// Print to the log file as well.
	foreach(int element in numbers) {
		Logger.Log(element);
	}
}

In the above example, GetNumbers() may or may not be implemented as a generator method, but this code will always execute the code to build up the sequence of numbers only once either way.  This is similar to calling the LINQ extension method IEnumerable<T>.ToList() instead of CacheGeneratedResults(), except that ToList() will always pull all results from a generator method rather than just those needed by the caller, and it always makes a copy of the results as a new List<T>, even if the object you call ToList() on is already an IList<T>, thus doubling the memory you need to store the list.

The CacheGeneratedResults() method checks the IEnumerable<T> instance that is passed to it to see if it needs caching.  Lists, collections, and arrays are passed straight through to the caller with no caching added since they don’t represent extra work for repeated enumerations.  For objects that only expose IEnumerable<T>, as generator methods do, they wrap the generator method’s IEnumerable<T> object by a private type that implements the same interface.  This wrapper intercepts all calls to the IEnumerator<T> and injects the cache between the caller and the live object to protect against repeat processing.

I’ve released the CacheGeneratedResults method and several unit tests for it under the liberal Ms-PL open source license.  You can get it from Github.

2 comments:

  1. This is a wonderful gem of a technique. You might want to consider submitting it to extensionmethod.net? I did find a threading issue in the MoveNext function. The lock should really include the increment and check on cachePosition. As a test, try putting a sleep statement inside the lock before the parent enumerator MoveNext call, and fire off a few threads that enumerate and display the cached list. You'll find that some elements never show up because the parent enumerator was incremented while it was sleeping. The best solution IMO is to check both inside and outside the lock so that you don't take the lock unnecessarily. Thanks again for sharing this code.

    ReplyDelete
  2. Thanks, Marty. I've updated the code.

    ReplyDelete