Monday, November 10, 2014

Optimizing Conway's Game of Life with Parallel.For

I've been continuing my coding practice with Conway's Game of Life. Last time, we added a simple UI (just a console application). This time, we'll see if we can get our calculations to run faster.

As a reminder, you can get to the other articles and download the code from here:

Ready to be Parallelized
When I wrote the game rule method (in our first article), I kept in mind that I may want to parallelize the code at some point. Here's the method we created:
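The listing itself isn't reproduced here, so this is a sketch of the standard Game of Life rules written as a static, side-effect-free method, matching the description above (the `CellState` and `LifeRules` names are my assumptions, not necessarily the original identifiers):

```csharp
// Sketch of the rule method described in the article. A live cell
// survives with 2 or 3 live neighbors; a dead cell comes to life
// with exactly 3; everything else ends up dead.
public enum CellState { Dead, Alive }

public static class LifeRules
{
    // Static and side-effect free: it takes the current state and the
    // live-neighbor count, and returns the new state without touching
    // any instance or shared data.
    public static CellState GetNewState(CellState currentState, int liveNeighbors)
    {
        switch (currentState)
        {
            case CellState.Alive:
                if (liveNeighbors == 2 || liveNeighbors == 3)
                    return CellState.Alive;
                break;
            case CellState.Dead:
                if (liveNeighbors == 3)
                    return CellState.Alive;
                break;
        }
        // Every other combination results in a dead cell.
        return CellState.Dead;
    }
}
```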

So what makes this conducive to running in parallel? First, I created this as a static method. This means that the method does not belong to a particular instance of a class; it stands on its own.

Next, the method itself does not modify any state or instance data. The method takes in 2 values (the current state and the number of live neighbors) and returns a new value (the new state of the cell). Since "CellState" is an enum (a value type), this means that the return value is a separate value even if we return the parameter unmodified.

Because this method is a pure, self-contained operation that does not modify shared state, we can call this method at the same time on different threads without worry.

The Serial Call
When we process our grid, our current method call processes all of the operations serially -- that is, only one call runs at a time. Here's that code:
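Since the listing isn't included here, this is a rough reconstruction of what such a serial update might look like. The identifiers (`LifeGrid`, `CurrentState`, `GetLiveNeighbors`, `UpdateState`) are assumptions based on the method names mentioned later in the article, and the rule method is repeated inline so the sketch is self-contained:

```csharp
using System;

public enum CellState { Dead, Alive }

public class LifeGrid
{
    public readonly int Height;
    public readonly int Width;
    public CellState[,] CurrentState;

    public LifeGrid(int height, int width)
    {
        Height = height;
        Width = width;
        CurrentState = new CellState[height, width];
    }

    // Count the live cells among the (up to) 8 neighbors, clipping at the edges.
    public int GetLiveNeighbors(int row, int col)
    {
        int count = 0;
        for (int i = Math.Max(0, row - 1); i <= Math.Min(Height - 1, row + 1); i++)
            for (int j = Math.Max(0, col - 1); j <= Math.Min(Width - 1, col + 1); j++)
                if (!(i == row && j == col) && CurrentState[i, j] == CellState.Alive)
                    count++;
        return count;
    }

    // Serial update: the outer loop walks the rows, the inner loop the
    // columns, and each cell's new state lands in the same position of
    // a fresh grid.
    public void UpdateState()
    {
        var nextState = new CellState[Height, Width];
        for (int i = 0; i < Height; i++)          // rows
        {
            for (int j = 0; j < Width; j++)       // columns
            {
                int liveNeighbors = GetLiveNeighbors(i, j);
                nextState[i, j] = GetNewState(CurrentState[i, j], liveNeighbors);
            }
        }
        CurrentState = nextState;
    }

    // Same static rule method discussed earlier.
    public static CellState GetNewState(CellState current, int liveNeighbors)
    {
        if (current == CellState.Alive && (liveNeighbors == 2 || liveNeighbors == 3))
            return CellState.Alive;
        if (current == CellState.Dead && liveNeighbors == 3)
            return CellState.Alive;
        return CellState.Dead;
    }
}
```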

I mentioned that I wasn't a big fan of having the nested "for" loops here, and that remains true. But we need the indexers into the multi-dimensional array, so we don't have much of a choice with the way things are set up right now.

The outer "for" loop deals with the rows of our array, and the inner "for" loop deals with the columns. Inside the loop, we get the number of live neighbors for a cell and then get the new state for that cell. Once we have that, we assign it to the same X-Y position in the new grid.

Adding a Parallel.For
My first idea was to take the "for" loops and simply change them to "Parallel.For" loops. This didn't feel right, but I wasn't sure where else to start.

The Parallel.For method takes 3 parameters: the inclusive start index, the exclusive end index, and a delegate that represents the action to be performed for each index. Since I love lambdas, I used a lambda expression for the delegate.

Here's the code:
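A sketch of the nested version follows. To keep it self-contained, a caller-supplied `calculate` delegate stands in for the real per-cell work (the `GetLiveNeighbors` call plus the rule method); the shape of the two `Parallel.For` loops is the point here, and the `ParallelGridUpdate` name is my own:

```csharp
using System;
using System.Threading.Tasks;

public static class ParallelGridUpdate
{
    // Nested Parallel.For: parallel over the rows AND parallel over the
    // columns. Each (i, j) pair writes to its own element of nextState,
    // so no two iterations touch the same slot -- but we pay scheduling
    // overhead for every row and every column.
    public static int[,] Update(int[,] current, Func<int[,], int, int, int> calculate)
    {
        int height = current.GetLength(0);
        int width = current.GetLength(1);
        var nextState = new int[height, width];

        Parallel.For(0, height, i =>        // outer loop: rows
        {
            Parallel.For(0, width, j =>     // inner loop: columns
            {
                nextState[i, j] = calculate(current, i, j);
            });
        });

        return nextState;
    }
}
```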

This is a bit confusing since we have nested Parallel.For loops. The outer loop goes from "0" to the height of the grid, and we use the variable "i" for the indexer (just like our original outer "for" loop).

The body of the lambda expression is the nested Parallel.For loop. This loop goes from "0" to the width of the grid, and we use the variable "j" for the indexer (just like our original inner "for" loop).

Then inside the body of that lambda expression, we call the code to get the number of live neighbors, calculate the new cell state, and update our new grid.

This Smells Bad
I knew when I was writing this that it was not the right way to go. But let's run some metrics to verify that.

There's no way that we would be able to see the difference by simply running our console UI application. To see noticeable differences, we'll need to call the calculations lots of times. So, I created a new project (called "Conway.PerformanceTest") with a console application to do just that.

This application creates a new LifeGrid using the same dimensions as our UI application. The difference is that instead of displaying results, we call the "UpdateState" method many times in a row (based on the "iterations" variable). And we've added a stopwatch so that we can see how long each of these takes.
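The timing portion of such a harness might look like this sketch. I've pulled the loop-plus-stopwatch into a method that takes an `Action`, which is my own shape (the original presumably calls the grid's `UpdateState` directly), but the Stopwatch pattern is the same:

```csharp
using System;
using System.Diagnostics;

public static class PerfTest
{
    // Run the supplied update action a given number of times and
    // report how long the whole run took in milliseconds.
    public static long TimeIterations(Action updateState, int iterations)
    {
        var stopwatch = Stopwatch.StartNew();
        for (int n = 0; n < iterations; n++)
        {
            updateState();
        }
        stopwatch.Stop();
        return stopwatch.ElapsedMilliseconds;
    }
}
```

Each block in the test application would then time its own version -- for example, `PerfTest.TimeIterations(grid.UpdateState, iterations)` for the serial loop, with the result written to the console.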

The first block has our original nested for loop; the last block has our new nested Parallel.For loop. Here are the results of the run:

(Note: even though the source code says "10,000" iterations, this output shows a run of "1,000" iterations.)

What we can see is that I really screwed up the performance. Instead of taking 285 milliseconds to process 1,000 iterations, the "optimized" version took 485 milliseconds -- about 70 percent longer.

And I'm not very surprised by this result. The nested parallel loops would end up spinning up a lot of threads. I'm sure that the application spends more time managing threads than doing the actual calculation.

So, we've over-parallelized this. Let's try again.

A Bit Less Parallel
For the next try, I decided that I would only use Parallel.For for *one* of the loops -- not for both. Here's the code for that:
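Here's a sketch of that shape, using the same stand-in `calculate` delegate for the per-cell work as before (the `RowParallelGridUpdate` name is my own). The only change from the nested version is that the inner loop is back to a plain "for":

```csharp
using System;
using System.Threading.Tasks;

public static class RowParallelGridUpdate
{
    // Parallel.For over the rows only; a normal "for" loop over the
    // columns. Multiple rows are processed at once, but within each
    // row the columns are processed sequentially.
    public static int[,] Update(int[,] current, Func<int[,], int, int, int> calculate)
    {
        int height = current.GetLength(0);
        int width = current.GetLength(1);
        var nextState = new int[height, width];

        Parallel.For(0, height, i =>           // rows in parallel
        {
            for (int j = 0; j < width; j++)    // columns sequentially
            {
                nextState[i, j] = calculate(current, i, j);
            }
        });

        return nextState;
    }
}
```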

This shows that we have a "Parallel.For" for our rows (the "i" dimension), but we have a normal "for" loop for the columns (the "j" dimension). The result is that we will process multiple rows at one time even though we're processing the columns sequentially within each row.

The results of this are much better:

Now we can see that our original nested "for" loop was 287 milliseconds, and our updated single-level "Parallel.For" loop is 192 milliseconds. So we actually cut off some of the processing time here.

More Iterations
But what happens if we add more iterations? Will we see the same performance difference? Let's give it a try with 10,000 iterations.

The results show us that things pretty much scale. When we do 10 times more iterations, our durations are 10 times longer.

What we've seen is that we don't just want to tack "parallel" on to any loop that we have. We have to balance the time it takes to do the calculations with the time required to manage the threads. In this case, since our calculation was quick and easy, we ended up spending more time dealing with threading than we did with our calculations. If our calculation had been more complex (such as finding prime numbers), then it might have made sense to keep the nested parallel loops. But not this code.

Additionally, this code will perform differently based on the "shape" of our grid. We set this up for 25 x 70. This means we have 25 rows and 70 columns. If we were to reverse this, then it may be better to set up the "parallel" on the columns rather than the rows.

Finally, we need to look at the other things we are doing inside our parallel loops -- we need to verify that we aren't risking updating the same shared state or risking race conditions.

Inside our loop, we call "GetLiveNeighbors". This deals with instance data: we're looking at the grid to find the neighbors of a particular cell. Then we're counting how many of these neighbors are alive. But we are only reading data here (no updates), so we don't need to worry about race conditions with the data.

Next, we're calling "GetNewState". As we've already seen, this does not deal with any instance data.

Finally, we are setting the cell state in the new grid -- this is where we set "nextState[i, j]". This is potentially dangerous. We are dealing with a shared object here (the "nextState" grid). But since we have the indexers from our loop, we know that we will only update each cell 1 time -- so we don't need to worry about 2 threads trying to update the same array value.

Wrap Up
We need to be careful when we're trying to parallelize our applications. If we're not careful, we may end up making performance worse. One way that we can avoid these problems is to look at the different patterns for parallelizing code. A good resource for this is Parallel Programming with .NET (Jeremy's review). This covers a number of different problem spaces and the techniques that we can use.

This is all just programming by experimentation. And I'm doing this mainly for coding practice -- part of this is exploration, part of this is to try new things, part of this is to find out what doesn't work. This process helps us get a better understanding of how different techniques work so that we can be more effective in our day-to-day programming.

Happy Coding!


  1. (apologies if this is a duplicate -- not sure if Blogspot swallowed it when signing me in)

    Funny thing --- after I read just the first paragraph, I started thinking about Life for the first time in many years. The part that always seemed to be the most time-consuming is the GetLiveNeighbors check --- since it involves looking at 8 other cells for each cell in the grid, it's O(8ij). I then considered an alternative method -- instead of scanning the grid, counting neighbors, and determining new life all in one pass, we do two passes: first seeking live cells, then broadcasting that information to their neighbors when found. This would make it O(1ij+8x), with x being the number of live cells, which is generally a small fraction of total cells.

    However, having eventually read the full article, we learn that GetLiveNeighbors's key feature is that it's read only/no update, a trait we lose with the ImAlive broadcast --- we can still parallelize, but now we have to worry about simultaneous updates.

    1. Hi, James,
      I'll agree with you about the GetLiveNeighbors method. It does feel a bit off (and part of that is the nested "for" loops). I haven't done any research into optimizing Conway's Game of Life, and I've taken the Mythbusters approach to the optimization -- trial and error (but with fewer explosions (so far)). I took this approach because I'm using this as coding practice and looking to see what solutions I come up with on my own.

      As you noted, having a method that only reads data was important to me. I've been working on ways to bring functional(ish)-style programming into my code since it is more conducive to running in parallel without nasty side effects. And I had that in the back of my mind when I was coding up the method.

      I was thinking about other possibilities -- such as slicing up the grid and sending a copy of those portions to GetLiveNeighbors. But my feeling is that it would be more expensive than what I have now. Since I have the base methods in place, it's probably time for me to start looking at how other folks have optimized the solution. Then I can compare solutions to see how I can code things up better in the future.