Tuesday, December 18, 2012

The Demand For Computer Science Graduates

Today, Olivia Leonardi is contributing a post about the growing demand for computer science graduates and the new, open source programs designed to help meet this need. This article is meant to help those who stumble across the posts regarding the technicalities of C# found on NerdBank a way to break into the industry and join in the conversation. Typically Leonardi is a writer for a website offering information about computer science certification.

The Demand For Computer Science Graduates

Trained computer programmers are needed now more than ever. Modern society has experienced an exponential growth of technology that has revolutionized the economy and the workforce and the education system is struggling to keep up. Based on current figures, there are simply not enough individuals who have been properly trained to design and implement these various programs. However, this deficit may improve as a result of web-based programs that allow students to bypass brick-and-mortar university studies, most of which are extremely competitive and drop the majority of students who apply to them, and learn various aspects of computer programming online. Often times, for free.
According to the US Bureau of Labor Statistics, trained computer programmers essentially act as intermediaries between software engineers and computer systems. Many of them are fluent in multiple code languages; the most widely used languages today include Java, C, C++ and HTML. As new software is developed, programmers must test it and, if necessary, debug programs by correcting errors in the code. This all-important task has implications in virtually every area of web-based development; from corporate websites and online business journals to blogs and social media. Project lengths vary; smartphone apps, for instance, take days to develop and optimize, while complex systems may take more than a year to finish.
Programmers are well compensated for their expertise, and also for their relative rarity in the job market. The BLS reported in 2010 that the median wage for computer programmers was $71,380 – one of the highest in the nation, and according to Forbes, the mid-career median salary for those who earn a master’s degree in computer science is $109,000. These high paying jobs are also not few and far between as associated employment opportunities are projected to grow by 22.3 percent.
The problem, according to a recent article in Science, stems from shortcomings in computer science education programs. Technology is evolving at a much more rapid rate than the curricula taught to college-level computer science students. As a result, most receive a very generalized education without learning supplementary skills, such as specific code languages or complex systems programming. Fortunately, aspiring programmers now have access to several web-based outlets if they wish to learn these supplementary skills. Some of these programs include:
Codecademy | Launched in August 2011, this startup offers free lessons for programmers. Current courses include JavaScript Fundamentals, Python, JQuery and Code Year. According to Forbes, the site received 200,000 visitors within 72 hours of launching – and to date has logged more than 50 million completed exercises.
C# Corner | This step-by-step tutorial trains students how to “write and compile C# programs, understand C# syntaxes” and perform other functions related to this commonly used language. The course is free, though previous code experience (such as C#, Pascal or Java) is recommended.
Coursera | Launched in April 2012, this educational platform has served more than 1 million students to date. The site aggregates free courses offered by a number of prominent institutions, including Massachusetts Institute of Technology, Stanford University, Duke University, and Johns Hopkins University. Upcoming courses related to programming include Functional Programming Principles in Scala (offered by École Polytechnique Fédérale de Lausanne) and Heterogeneous Parallel Programming (offered by University of Illinois at Urbana-Champaign).
University of Oklahoma Online | The OU Supercomputing Center for Education & Research currently offers an online tutorial that introduces principles of supercomputing. Though the concepts are complex, the course is designed to approach topics in a user-friendly manner that, with diligent study, will be digestible for students.
Online courses like these allow computer scientists to fill any gaps left by limited high education programs. If they prove to be effective, then the deficit of trained programmers stands to improve considerably in the coming years. 

Wednesday, July 04, 2012

DotNetOpenAuth ships with Visual Studio 2012

Some of you may have already heard the announcement on .NET Rocks (time index 41:30), but in case you haven't: Visual Studio 2012 will include DotNetOpenAuth in all its ASP.NET project templates.  Woot!  Yes, my friends, it has finally happened.

Q&A

Q: With the announcement, Scott Hunter mentions being able to hack into a site that accepted Twitter logins, and may have suggested that the site used DotNetOpenAuth.  He also mentioned that Microsoft "hardened" the library to fix some security issues.  Does DotNetOpenAuth have security holes?

A: Please don't be alarmed.  I worked closely with the Microsoft ASP.NET team who added DotNetOpenAuth to VS2012, and they never found any security holes in the library.  I don't actually know what Scott is referring to here.  Since a web site can certainly use DNOA improperly, security holes in web sites certainly are possible, but not likely from a fault in the library.

Q: Is DotNetOpenAuth a Microsoft project now?

A: No.  DotNetOpenAuth continues to be independently owned and developed, although Microsoft has begun making source contributions to it.  Another example of this is jQuery, which Microsoft ships with ASP.NET as well, but Microsoft doesn't own or directly control it.

Q: Will DotNetOpenAuth continue to get regular releases, or are they now tied to Visual Studio releases?

A: DotNetOpenAuth will continue to ship independently of VS.  The copy of DotNetOpenAuth that ships with Visual Studio 2012 is actually one of DotNetOpenAuth's own past releases.  The way the ASP.NET project templates bundle it is as a NuGet package, so if you create a web project based on these templates, NuGet can help you upgrade to the latest version of DotNetOpenAuth right away.

Q: Is Microsoft contributing to DotNetOpenAuth now?

A: Yes.  You may have noticed the new DotNetOpenAuth.AspNet package available on NuGet.  This is Microsoft's contribution to the DotNetOpenAuth library.  Whereas the rest of DotNetOpenAuth is strictly a protocol library (no specific APIs for individual popular online services), the DotNetOpenAuth.AspNet package adds an assembly which does target all the popular identity providers explicitly, making it easier to add buttons for "log in with Google", "log in with LinkedIn", etc. 

Q: What is DotNetOpenAuth.AspNet? 

A: DotNetOpenAuth.AspNet is Microsoft's front-end to DotNetOpenAuth that the ASP.NET project templates use instead of DotNetOpenAuth directly. Internally, DotNetOpenAuth.AspNet uses DotNetOpenAuth so you're still using the time-tested library underneath.  So if you create a new ASP.NET project in VS2012, you'll be able to switch on DotNetOpenAuth.AspNet by uncommenting a few lines of code.  

Q: How does DotNetOpenAuth.AspNet compare with the core DotNetOpenAuth library?  Should I switch my existing code to use DotNetOpenAuth.AspNet?

A: To keep things simple, the ASP.NET team decided to conceal the core DotNetOpenAuth API behind theirs and they set several DotNetOpenAuth settings to their simple or "stateless" mode values.  This means that things are more likely to work in web farms right out of the box if you're using the DotNetOpenAuth.AspNet frontend, at the cost of a slightly less secure set of defaults.  Some settings are harder to reach if you're using the DotNetOpenAuth.AspNet frontend.  If you have existing code by all means keep it.  The library you're using is still active developed and what DotNetOpenAuth.AspNet uses behind the scenes so you're by no means going to be using obsolete code in the foreseeable future.  If you're writing new code, DotNetOpenAuth is still a fine choice, but if you're just acting as a relying party and want a simpler API to use against only popular identity providers, then DotNetOpenAuth.AspNet is a reasonable choice.

Q: How else is DotNetOpenAuth.AspNet different than the rest of the library?

A: It is less configurable than the rest of the library.  In particular it doesn't offer any way to be configured via your web.config file.  Although since the core DotNetOpenAuth is still active underneath the API, the normal web.config settings you use for DNOA may still apply in some cases.  Also, I don't actively support DotNetOpenAuth.AspNet myself.  Any feedback or issues filed against it I forward to the ASP.NET team, who maintains that project.

Q: So… this is a Good Thing, right?

A: Absolutely. It has never been easier for ASP.NET developers to build web sites that delegate authentication responsibilities to popular identity providers.  This is a boon for web developers, and especially a boon for all our web visitors out there that are now more likely to be able to reuse a few Internet identities instead of maintaining dozens of accounts on all the web sites they visit.

Sunday, June 24, 2012

GC pressure series: hidden boxing

Last time we talked about GC pressure from enumerating collections using their interfaces rather than their concrete types.  It turns out that the garbage produced by using the interfaces is directly caused by boxing, which we’ll explore more in depth and in a broader context here.

Before diving into boxing and when it can implicitly occur, let’s first review some terminology.  If you’re already fluent in value types vs. reference types, feel free to skip to the boxing section.

Value types

In C#, value types are structs.  Primitives you’re very familiar with like integer, float, and Guid are structs, as well as the ones you may define yourself.  If you pass a struct as a parameter to another method, that method gets a copy of that struct – not the original struct.  Consider this example:

static void Main() {
   int seeds = 5;
   AddOneSeed(5);
   Debug.Assert(seeds == 5);
}

static void AddOneSeed(int seeds) {
   seeds++;
}

The AddOneSeed method received a copy of the integer, which is a struct.  Therefore any changes it made do not propagate back to the caller.  Note that if a “ref” keyword is added to the parameter in the method’s signature that the method accesses the caller’s original struct instead of a copy, allowing the method to modify its caller’s value.

Value types store their data directly in their container.  In the above case, the value types appear as local variables within methods, so the data of these value types is stored directly on the stack of the thread executing these methods.  This also means that value types can vary in size, and the container must know exactly how large the value type is so that enough memory can be allocated for it.

Reference types

In C#, reference types are classes.  Reference types store their data on the heap, and anyone holding a reference type is actually just holding a reference: a memory pointer to that object on the heap.  Classes may contain structs as fields.  This means that value types may actually appear on the heap.  Consider this simple example:

class Fruit {
   int seeds;
}

Fruit is a reference type that contains a field that is a value type.  This value type always appears on the heap because it’s within an object on the heap.  The memory allocated on the heap by the Fruit object actually includes the value type directly.  In the case of Fruit above, the object on the heap is 12 bytes in size: 8 for the CLR header that exists for every object, and 4 more bytes for the seeds field.

Boxing: When a value type becomes a reference type

Sometimes a value type must be represented as a reference type.  The CLR allows anything to be represented as a System.Object.  But System.Object is a reference type.  Anyone with a reference to Object expects it to be a pointer to an object on the heap, and the size of Object to the holder of the reference is always 4 bytes regardless of the size of that object on the heap (or 8 bytes in a 64-bit process).  How can a value type, which may vary in size, be assigned to a System.Object, which is always exactly 4 bytes?  Boxing.

Boxing happens when the compiler has to assign a value-type to a reference type.  When boxing, the CLR allocates a new object on the heap just large enough to store the particular value type being boxed, and copies the value type into that object.  Now the reference to that object can be passed around simply as System.Object without anyone knowing how large the object really is. To actually look at the value type again, it must be unboxed, where the value is copied back onto the stack, and possibly into a field typed as the matching value type.

Let’s consider the simplest example of boxing.  We have an integer value type on the stack as a local variable in this method.  Then we assign this value type to a reference type.  IComparable is an interface.  Value types can implement interfaces.  But as soon as you refer to the value type by an interface it implements it gets boxed.  Check this out:

static void Main() {
   int number = 5;
   IComparable comparable = number;
}

The C# compiler compiles the above code to the below IL:

.locals init (
        [0] int32 number,
        [1] class [mscorlib]System.IComparable comparable)
nop 
ldc.i4.5 
stloc.0 
ldloc.0 
box int32
stloc.1

It’s not important that you understand all the IL, but notice the box instruction.  This tells the CLR to allocate an object and copy the value at the top of the stack into it, and then push a reference to that object onto the stack in its place.  The last instruction then assigns that object reference into the local variable typed as an IComparable. 

But wait… if a struct implements an interface already, why does it have to be boxed when it is assigned to a variable typed as that interface?  The answer lies in its size.  Anything can implement an interface (struct or class), and sizes can vary.  But folks holding a reference to an interface have to know in advance how much space to allocate for that reference.  Since value types vary in size, value types don’t qualify.  Instead, when you’re dealing with interfaces you’re always dealing with reference types, meaning that value types get boxed inside an object on the heap as a reference type.

The important thing to realize here is that although the C# code did not have a new keyword anywhere, an object was allocated anyway.  This is under-the-covers magic by C# and the CLR so that things just work, and normally it’s a good thing.  But this is a post on GC pressure, so of course we’re going to study how boxing can work against you, and how you can avoid it.

How to avoid boxing

The easiest way to avoid boxing is to always refer to a value type by its concrete type rather than by System.Object or an interface the value type implements.  You can still call methods on that struct that implement interfaces just by calling those methods directly.  For example, consider these two calls to CompareTo:

static void Main() {
   int number = 5;

   // CompareTo is found on the IComparable interface
   IComparable comparable = number; // box number
   int result = comparable.CompareTo(8); // box 8

   // CompareTo is also a public method on int directly.
   // In fact two overloads of CompareTo are available
   // on int, one that takes System.Object and another
   // that takes an int, so the C# compiler will choose
   // to call the one that takes int, avoiding both 
   // boxes that the above example causes.
   int result2 = number.CompareTo(8);
}

The example demonstrates two things.  First is the discrete second boxing that can happen.  Once we’ve boxed the int to an IComparable, there is only one CompareTo method for the compiler to choose from: CompareTo(object).  That means that in addition to having already boxed the 5, the 8 must also be boxed in order to call CompareTo(object).  On the other hand, leaving 5 as an integer avoids the first box, and allows the CompareTo(8) method to bind to int.CompareTo(int), which means the 8 can also remain as a value type.  To better understand what is happening, consider how the Int32 and IComparable types are defined:

public struct Int32 : IComparable, IFormattable, IConvertible, IComparable, IEquatable {
   public int CompareTo(int value);
   public int CompareTo(object value);
}

public interface IComparable {
   int CompareTo(object obj);
}

As you can see, when you have an IComparable reference, all the compiler can do is call CompareTo(object). But when you have an Int32, you also can call CompareTo(int) and avoid both boxes.

Keep in mind as well that returning a value type from a method whose signature returns an interface also boxes:

static IComparable GetNumber() {
   return 5; // boxed!
}

Some pathological cases

An occasional box here and there isn’t going to hurt the performance of your app.  But there are some things you can do that can lead to enough boxing to really hurt in frequently called code.

Boxing in a loop

Suppose you have no choice but to box a value type to call a method that takes an interface.  If you were calling that method in a loop with the same value type each time, you can take care to avoid boxing for each iteration of the loop.

static void BoxEachIteration() {
   int number = 5;
   for (int i = 0; i < 500; i++) {
      Work(number, i);
   }
}

static void BoxJustOnce() {
   int number = 5;
   IComparable boxedNumber = number;
   for (int i = 0; i < 500; i++) {
      Work(boxedNumber, i);
   }
}

static void Work(IComparable number, int iteration) {
}
Boxing in non-mutating methods

When you implement public methods that don’t mutate state, you should generally be very careful not to box at all.  For example, folks expect that retrieving a value from a dictionary should have no side effects including not allocating memory.  This is important because your cheap, non-mutating methods are often the ones that end up being called in loops which may end up being perf critical to the application.

Earlier this week I investigated a performance problem that I found was due to GC pressure caused by a hashtable-like collection that boxed in its Get method.  Ouch.  Here was the Get method:

public T Get(string key)
{
    return this.Get(new KeyedObject(key));
}

There's a new keyword there, but KeyedObject was a struct, so supposedly no heap allocations there. But it gets more subtle. Take a look at how the struct is defined, and how the Get overload that is being called is defined:

private struct KeyedObject : IKeyed
{
    private string name;
    internal KeyedObject(string name);
    string IKeyed.Key { get; }
}

public T Get(IKeyed item)
{
  // ...
}

Do you see the problem? Although the Get(string) method creates a struct, which doesn’t cause GC pressure, it then passes that struct to a method that accepts an interface.  That struct implements that interface, so while the C# compiler allows it, it first has to box the struct, which means all the work the authoring developer went to by using a struct to avoid allocations didn’t actually avoid the allocation after all.  Since this Get(string) method was being called thousands of times in this perf critical scenario, we were generating garbage like crazy and the GC had to keep interrupting the app to clean up.

One possible fix to this is to add a Get(KeyedObject) overload that does the work directly with the value type.  Then the C# compiler would call that one instead of the one taking the interface, and the boxing would disappear.

Non-generic collections

Generic collections aren’t just sugar.  They can help performance tremendously.  Consider a table of value type pairs.  If you use the non-generic Hashtable collection, where keys and values are typed as System.Object, each key and value must get boxed before storage.  That at least is a somewhat fixed cost – it’s not directly garbage because the box is persistent for as long as that entry is in the collection.  But now think about every lookup operation.  Let’s say you have a hashtable where the keys and values are both integers.  You have a key and you want to look up the value.  You are compelled to box the key first, making it impossible to query for values without allocating memory.

var table = new Hashtable();
table[5] = 8; // boxes 5 and 8
int value = (int)table[5]; // box 5 (again)

If on the other hand you use Dictionary<int, int>, all the internal storage mechanisms and methods are strongly typed to take and return integers rather than System.Object, so no boxing occurs. This can have a dramatic impact on performance due to GC pressure as well as memory availability since each integer takes only 4 bytes in memory instead of the minimum 12 bytes required by every GC heap allocated object.

Should I define my type as struct or class?

That’s a broad question and there is more to consider than GC pressure when making this decision.  In my personal experience, almost every time I’ve nobly started with a struct I’ve ended up converting it to a class.  So my personal default is to use class unless struct is required.  There are several reasons for this, but most pertinent to this post are two things to consider:

  1. Structs allow you to allocate and pass around the values without allocating memory on the heap.  Classes must always be allocated on the heap.
  2. Structs can be boxed many times, but classes will only allocate memory once

Every time you assign a struct such that boxing is required, a new object is allocated.  So the same struct can actually end up acquiring lots of memory and producing lots of garbage.  Whereas if you’d just chosen to use a class, the memory is allocated once and shared across all who reference it.  So be careful when you define structs and ask yourself how folks with that struct will use it.  If the likelihood is high that most instances of the struct will get boxed sometime in their lifetime, consider making it a class instead to help keep the GC pressure down.