Monday, May 11, 2009

Uri.EscapeDataString and HttpUtility.UrlEncode are NOT the same

For some reason Microsoft defined URI escaping twice: Uri.EscapeDataString and HttpUtility.UrlEncode seem to cover the same need. There’s another pair: Uri.EscapeUriString and HttpUtility.UrlPathEncode which again seem to be redundant with each other. But in particular I found a small difference in behavior between the first two methods that should be called out.

System.Web.HttpUtility.UrlEncode escapes the tilde (~) character. System.Uri.EscapeDataString does not. Apart from the space character (UrlEncode emits + where EscapeDataString emits %20), their behavior appears to be the same for every other character (in my tests anyway). One overall difference though is that HttpUtility.UrlEncode uses lowercase hex encoding whereas Uri.EscapeDataString uses uppercase hex encoding. RFC 3986 treats the tilde as an unreserved character that need not be escaped, and says (section 2.1) that uppercase hex should be used.
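To see the differences side by side, here is a minimal sketch that runs the same inputs through both methods. The class and method names (EscapingComparison, Describe, PrintAll) are made up for illustration, and the exact output may depend on the framework version you run it on.

```csharp
using System;
using System.Web; // HttpUtility; on the .NET Framework this needs a reference to System.Web.dll

// Hypothetical helper for illustration only; not part of any library.
public static class EscapingComparison
{
    // Formats both escapings of the same input so they can be compared by eye.
    public static string Describe(string input)
    {
        string viaHttpUtility = HttpUtility.UrlEncode(input);
        string viaUri = Uri.EscapeDataString(input);
        return string.Format(
            "{0} -> UrlEncode: {1}, EscapeDataString: {2}",
            input, viaHttpUtility, viaUri);
    }

    // Prints the comparison for a tilde, a reserved character, and a space.
    public static void PrintAll()
    {
        foreach (string sample in new[] { "~", "a/b", "a b" })
        {
            Console.WriteLine(Describe(sample));
        }
    }
}
```

Calling EscapingComparison.PrintAll() should show the tilde escaped only by UrlEncode, the slash escaped with lowercase hex by UrlEncode and uppercase hex by EscapeDataString, and the space handled differently by each.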

Incidentally, contrary to the MSDN documentation for Uri.EscapeDataString, turning on the IRI parsing option in the (web) application’s .config file does NOT turn on RFC 3986 compliant URL escaping, so the default RFC 2396 escaping is always used. Because OpenID and OAuth require RFC 3986 URI escaping, I had to write my own RFC 3986 escaping “upgrader” method:

// Requires: using System; using System.Text;

/// <summary>
/// The set of characters that are unreserved in RFC 2396 but are NOT unreserved in RFC 3986.
/// </summary>
private static readonly string[] UriRfc3986CharsToEscape = new[] { "!", "*", "'", "(", ")" };

/// <summary>
/// Escapes a string according to the URI data string rules given in RFC 3986.
/// </summary>
/// <param name="value">The value to escape.</param>
/// <returns>The escaped value.</returns>
/// <remarks>
/// The <see cref="Uri.EscapeDataString"/> method is <i>supposed</i> to take on
/// RFC 3986 behavior if certain elements are present in a .config file.  Even if this
/// actually worked (which in my experiments it <i>doesn't</i>), we can't rely on every
/// host actually having this configuration element present.
/// </remarks>
internal static string EscapeUriDataStringRfc3986(string value) {
    // Start with RFC 2396 escaping by calling the .NET method to do the work.
    // This MAY sometimes exhibit RFC 3986 behavior (according to the documentation).
    // If it does, the escaping we do that follows it will be a no-op since the
    // characters we search for to replace can't possibly exist in the string.
    StringBuilder escaped = new StringBuilder(Uri.EscapeDataString(value));

    // Upgrade the escaping to RFC 3986, if necessary.
    for (int i = 0; i < UriRfc3986CharsToEscape.Length; i++) {
        escaped.Replace(UriRfc3986CharsToEscape[i], Uri.HexEscape(UriRfc3986CharsToEscape[i][0]));
    }

    // Return the fully-RFC3986-escaped string.
    return escaped.ToString();
}
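For comparison, here is a rough sketch of a different approach that does not build on Uri.EscapeDataString at all: it percent-encodes the UTF-8 bytes of every character outside the RFC 3986 unreserved set. This is not the method above, and the names Rfc3986Escaper and Escape are hypothetical. It may be handy as a cross-check when you cannot be sure which escaping rules the host framework applies.

```csharp
using System;
using System.Text;

// Hypothetical alternative for illustration; the method above builds on
// Uri.EscapeDataString instead of encoding from scratch.
public static class Rfc3986Escaper
{
    // Unreserved characters per RFC 3986 section 2.3:
    // ALPHA / DIGIT / "-" / "." / "_" / "~"
    private const string Unreserved =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~";

    public static string Escape(string value)
    {
        if (value == null) throw new ArgumentNullException("value");

        StringBuilder escaped = new StringBuilder(value.Length * 3);

        // Walk the UTF-8 bytes; copy unreserved ASCII through and
        // percent-encode everything else.
        foreach (byte b in Encoding.UTF8.GetBytes(value))
        {
            char c = (char)b;
            if (b < 0x80 && Unreserved.IndexOf(c) >= 0)
            {
                escaped.Append(c);
            }
            else
            {
                // "X2" yields two uppercase hex digits, as RFC 3986 section 2.1 recommends.
                escaped.Append('%').Append(b.ToString("X2"));
            }
        }

        return escaped.ToString();
    }
}
```

With this sketch, Rfc3986Escaper.Escape("it's (ok)!") gives it%27s%20%28ok%29%21, which is what the upgrader method is meant to produce for the same input.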

4 comments:

  1. Thanks for sharing, Andrew. I have always used the Encode pair and never noticed the other pair.

    Too bad an 'upgrader' is needed for this.

  2. I have also found that UrlDecode converts a plus (+) sign to a space. Thanks for the info.

    http://geekswithblogs.net/mikehuguet/archive/2009/08/16/134123.aspx

  3. Thank you :) I used your concept to solve my problem. For details, see http://frank-it-beratung.com/2010/11/04/twitter-mit-oauth-und-c-so-gehts-auch-mit-umlauten-und-sonderzeichen/
