Tuesday, May 26, 2009

Caching results of .NET IEnumerable<T> generator methods

If you're already familiar with generator methods and want to jump to intelligent caching of their results, skip further down in this blog post.

In C#, generator methods are methods that use yield return to return an IEnumerable<T> where the elements enumerated over are generated on-demand.  These are useful for a couple of scenarios.  One is deferred execution – not doing work until you absolutely need to. The other is because the method may generated infinite results, or more results than the caller may care to enumerate over – thus avoiding unnecessary computation altogether.

Here’s an example non-generator method. This showcases some simple use cases although it’s obviously not computationally intensive in this example:

/// Returns a sequence of elements as a complete list.
public IEnumerable<int> GetNumbers() {
	List<int> results = new List<int>();
	results.Add(1);
	results.Add(2);
	for (int i = 3; i < 10; i++) {
		results.Add(i);
	}
	return results;
}

And it's "generator method" style equivalent:

/// Returns a sequence of elements, generating them on-demand.
public IEnumerable<int> GetNumbers() {
	yield return 1;
	yield return 2;
	for (int i = 3; i < 10; i++) {
		yield return i;
	}
}

In the first example all the results are generated and collected immediately and returned as a batch. In the second example, the initial call to GetNumbersGenerator() doesn’t execute anything until the Enumerable<int> actually is queried for its first element.  For each enumerated element, only enough code is executed to determine the next element.  The end of the method isn’t executed until the end of the sequence is reached. 

The important thing to note from these examples is that the signatures of the methods are the same.  They both return IEnumerable<int> and yet one defers execution and the other proactively generates all results before returning.  Why not return a IList<int> from the non-generator method?  Because that locks you into this style and also implies that the caller might have modification access to the list.  By returning IEnumerable<T>, the method can be re-implemented as a generator method later if desired, and the caller has no doubt that it is receiving a read-only copy.

Now consider how these methods might be used:

public void PrintNumbers() {
	IEnumerable<int> numbers = GetNumbers();

	// Print to the screen.
	foreach(int element in numbers) {
		Console.WriteLine(element);
	}

	// Print to the log file as well.
	foreach(int element in numbers) {
		Logger.Log(element);
	}
}

If GetNumbers() is implemented as returning a List<int> object, then the work to generate the sequence of numbers is only done once. But if GetNumbers() is implemented as a generator method, the work is done twice: once in each foreach loop.  Since the point of generator methods is to decrease CPU work load, this obviously is counter-productive.  Also, since the caller code is unchanged and the signatures of the methods are the same, this code may be written without any realization of the inefficiency. 

What we need is a way to consume generator methods, leveraging their goodness for deferred execution, without the risk of causing processing to be done twice where that processing is expensive and the cost of storing a cached result is relatively inexpensive.

Introducing IEnumerable<T>.CacheGeneratedResults()

To solve this problem, I have written an extension method called CacheGeneratedResults.  Since it’s an extension method, it automatically is available on all IEnumerable<T> objects.  It preserves the deferred execution goodness of generator methods, but avoids repeat sequence generation if it is enumerated over multiple times.  It does this by caching all generated results as they are pulled by the caller in an hidden List<T>.  To fix the above example, all you have to do is add .CacheGeneratedResults() to the code:

public void PrintNumbers() {
	IEnumerable<int> numbers = GetNumbers().CacheGeneratedResults();

	// Print to the screen.
	foreach(int element in numbers) {
		Console.WriteLine(element);
	}

	// Print to the log file as well.
	foreach(int element in numbers) {
		Logger.Log(element);
	}
}

In the above example, GetNumbers() may or may not be implemented as a generator method, but this code will always execute the code to build up the sequence of numbers only once either way.  This is similar to calling the LINQ extension method IEnumerable<T>.ToList() instead of CacheGeneratedResults(), except that ToList() will always pull all results from a generator method rather than just those needed by the caller, and it always makes a copy of the results as a new List<T>, even if the object you call ToList() on is already an IList<T>, thus doubling the memory you need to store the list.

The CacheGeneratedResults() method checks the IEnumerable<T> instance that is passed to it to see if it needs caching.  Lists, collections, and arrays are passed straight through to the caller with no caching added since they don’t represent extra work for repeated enumerations.  For objects that only expose IEnumerable<T>, as generator methods do, they wrap the generator method’s IEnumerable<T> object by a private type that implements the same interface.  This wrapper intercepts all calls to the IEnumerator<T> and injects the cache between the caller and the live object to protect against repeat processing.

I’ve released the CacheGeneratedResults method and several unit tests for it under the liberal Ms-PL open source license.  You can get it from Github.

Monday, May 11, 2009

Uri.EscapeDataPath and HttpUtility.UrlEncode are NOT the same

For some reason Microsoft defined URI escaping twice: Uri.EscapeDataString and HttpUtility.UrlEncode seem to cover the same need. There’s another pair: Uri.EscapeUriString and HttpUtility.UrlPathEncode which again seem to be redundant with each other. But in particular I found a small difference in behavior between the first two methods that should be called out.

System.Web.HttpUtility.UrlEncode escapes the tilde (~) character. System.Uri.EscapeDataString does not. For every other character their behavior appears to be the same (in my tests anyway). One overall difference though is that HttpUtility.UrlEncode uses lowercase hex encoding whereas Uri.EscapeDataString uses uppercase hex encoding. The RFC 3986 says uppercase should be used.

Incidentally, contrary to the MSDN documentation for Uri.EscapeDataString, turning on the IRI parsing option in the (web) application’s .config file does NOT turn on RFC 3986 compliant URL escaping, so the default RFC 2396 escaping is always used. So since OpenID and OAuth require that RFC 3986 URI escaping be used, I had to write my own RFC 3986 escaping “upgrader” method:

/// <summary>
/// The set of characters that are unreserved in RFC 2396 but are NOT unreserved in RFC 3986.
/// </summary>
private static readonly string[] UriRfc3986CharsToEscape = new[] { "!", "*", "'", "(", ")" };

/// <summary>
/// Escapes a string according to the URI data string rules given in RFC 3986.
/// </summary>
/// <param name="value">The value to escape.</param>
/// <returns>The escaped value.</returns>
/// <remarks>
/// The <see cref="Uri.EscapeDataString"/> method is <i>supposed</i> to take on
/// RFC 3986 behavior if certain elements are present in a .config file.  Even if this
/// actually worked (which in my experiments it <i>doesn't</i>), we can't rely on every
/// host actually having this configuration element present.
/// </remarks>
internal static string EscapeUriDataStringRfc3986(string value) {
 // Start with RFC 2396 escaping by calling the .NET method to do the work.
 // This MAY sometimes exhibit RFC 3986 behavior (according to the documentation).
 // If it does, the escaping we do that follows it will be a no-op since the
 // characters we search for to replace can't possibly exist in the string.
 StringBuilder escaped = new StringBuilder(Uri.EscapeDataString(value));

 // Upgrade the escaping to RFC 3986, if necessary.
 for (int i = 0; i < UriRfc3986CharsToEscape.Length; i++) {
  escaped.Replace(UriRfc3986CharsToEscape[i], Uri.HexEscape(UriRfc3986CharsToEscape[i][0]));
 }

 // Return the fully-RFC3986-escaped string.
 return escaped.ToString();
}