Friday, December 22, 2017

Your Ideas Needed: Other Ways to Run Code in Parallel

My last article, How Does Task in C# Affect Performance?, drew quite a few suggestions on how to improve the performance. If you'd like to contribute your code, now's your chance!

As mentioned in the prior article, the technique of generating a bunch of tasks is a brute force approach. The reason that I like it is that it gets me the UI behavior that I really want. The goal is machine learning for recognizing hand-written digits, but the visualization is the whole point of this application.

Since the process of making a prediction takes a bit of time, I want the UI to update as each item is predicted. Here's an animation of the application running the current code (available on GitHub: https://github.com/jeremybytes/digit-display):


The point of this application is visualization (back to the first iteration): I want to see different algorithms side-by-side to understand what each one gets right and wrong. If they vary in accuracy, are they making the same errors or completely different ones? The speed is also part of that.

Here are the things I like about this particular implementation:
  1. It's easy to tell by the animation above that the Manhattan Classifier runs faster than the Euclidean Classifier.
  2. We don't have to wait for complete results to start analyzing the data.
  3. It gives me an idea of the progress and how much longer the process will take.
These are things that the brute-force method accomplishes. You can look at the previous article to see the code that runs this.

A Different Attempt
Before I did the manual Tasks, I tried to use a Parallel.ForEach. It was a while back, and I remember that I couldn't get it to update the UI the way that I wanted.

I thought I would take another stab at it. Unfortunately, I ended up with an application that went into a "Not Responding" state and updated the UI in a block:


Instead of showing two different algorithms, this shows two different methods of running the tasks in parallel.

On the left, the "Parallel Manhattan Classifier" runs in the "ParallelRecognizerControl". This is a user control that uses a Parallel.ForEach. On the right, the "Manhattan Classifier" runs in the "RecognizerControl". This is a user control that uses the brute-force approach described previously.

A couple things to note:
  1. The application goes into a "Not Responding" state. This means that we have locked our UI thread.
  2. The results all appear at once. This means that we are not putting things into the UI until after all the processes have completed.
This code is available in the "Parallel" branch of the GitHub project; specifically, we can look in the ParallelRecognizerControl.xaml.cs file.


This uses the "Parallel.ForEach" to loop through our data. Then it calls the long-running process: "Recognizer.predict()".

After getting the prediction, we call "CreateUIElements" to put the results into our UI. The challenge is that we need to run this on the UI thread. If we try to run this directly, we run into threading issues. The "Task.Factory.StartNew()" allows us to specify a TaskScheduler that we can use to get back to the UI thread (more on this here: Task, Await, and Asynchronous Methods).
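The root of the "Not Responding" behavior is that Parallel.ForEach is a blocking call: it does not return until every item has been processed. Here's a minimal console sketch (not the actual WPF code) that demonstrates this, with Thread.Sleep standing in for the long-running predict() call:

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

var processed = new ConcurrentBag<int>();

// Parallel.ForEach blocks the calling thread until every item completes.
// Called from a UI thread, this is exactly what produces "Not Responding".
Parallel.ForEach(Enumerable.Range(0, 100), item =>
{
    Thread.Sleep(5);          // stand-in for the long-running predict() call
    processed.Add(item);
});

// By the time the call returns, all items are done; nothing was
// reported incrementally along the way.
Console.WriteLine(processed.Count); // 100
```

The work itself runs on multiple threads, but the caller waits for the whole batch, which is why the results appear in one block at the end.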

But as we can see from the animation, this does not produce the desired results.

I tried a few approaches, including a concurrent queue (part of that is shown in the commented code). That got pretty complicated pretty quickly, so I didn't take it too far.
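For reference, here's a hedged sketch of the concurrent-queue idea (not the commented code from the repository): workers enqueue results as they finish, and a separate consumer drains the queue. In the real application the consumer would run on the UI thread (for example, via a DispatcherTimer); here a simple drain loop stands in for it.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Workers enqueue results as they finish; a consumer drains the queue.
var results = new ConcurrentQueue<string>();

// Wrapping Parallel.ForEach in Task.Run keeps the blocking call off the
// calling thread -- on a UI thread, this is what avoids "Not Responding".
var producer = Task.Run(() =>
{
    Parallel.ForEach(new[] { "1", "7", "3" }, item =>
        results.Enqueue(item + ":predicted"));   // stand-in for predict()
});

producer.Wait();   // the real consumer would poll while the producer runs

int drained = 0;
while (results.TryDequeue(out _)) drained++;
Console.WriteLine(drained); // 3
```

The tricky part (and where my attempt got complicated) is coordinating the consumer: knowing how often to poll, and when the producer is actually finished.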

How You Can Help
If you're up for a challenge, here's what you can do.
The application is already configured to run the "ParallelRecognizerControl" in the left panel and the "RecognizerControl" in the right panel. So you should only have to modify the one file.

If you come up with something good (or simply fun, interesting, or elegant), submit a pull request, and we'll take a look at the different options in future articles.

If you don't want to hash out the code yourself, leave a comment with your ideas along with your approach.

Remember: We're looking for something that gives us a UI that updates as the items are processed. There are much faster ways that we can approach this without the visualization. But the visualization is why this application exists.

Happy Coding!

Sunday, December 17, 2017

How Does Task in C# Affect Performance?

When I talk about using Task (and await -- which is built on top of Task), often there's a question about performance. The concurrent nature of Task means that there is some overhead. We don't get these things for free.
The good news is that the cost is negligible for the types of things we generally deal with in user space.
To test this out, I did some experimentation with my Digit Recognizer application (GitHub project). In that project, I create a large number of Tasks in order to maximize the usage of the CPU cores on my machine.

There are a couple of caveats with regard to this particular project:
  1. The parallel tasks are CPU-bound, meaning the CPU speed is the constraint. This is different from tasks that are I/O-bound (whether waiting for a file system, network, or database).
  2. The number of tasks is a "reasonable" size. The definition of reasonable will vary based on the application, but we'll see this as we take a closer look at the code.
The Existing Application
This is a machine learning application that is designed to try out different algorithms for recognizing hand-written digits. Here's a typical output (from the master branch of the project):


This shows 2 algorithms side-by-side: the left side uses Manhattan distance, and the right side uses Euclidean distance. As we can see, the Euclidean distance takes a bit longer, but it is more accurate.

The way that I maximize CPU usage is to split each item into a separate task. That allows me to use all 8 cores on my machine.

Here's the code to run things in parallel (from RecognizerControl.xaml.cs in the DigitDisplay project):


This uses a bit of a brute-force method of running things in parallel. The string array that comes into this method represents the items we're trying to recognize (each string is a separate digit - you can read more about it here: Displaying Images from Pixel Data).

The "foreach" loop grabs each individual string and creates a Task that will run it through the classifier. The "Recognizer.predict()" function is the one that does the heavy lifting.

After kicking off the task (Task.Run creates a Task and runs it as soon as possible), we add a continuation (in the "ContinueWith()" method). This runs the "CreateUIElements()" method that will create a bitmap, pair it with the prediction result, and then display it in the WrapPanel of our UI. This uses the "TaskScheduler.FromCurrentSynchronizationContext" to get back to the UI thread. You can learn more about this from my articles and videos about Task: Task, Await, and Asynchronous Methods.

The last line of the method ("Task.WhenAny()") is there to reset the timer appropriately. This is how we get the "Duration" on the screen.
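The shape of that loop can be sketched like this (a simplified console version, not the actual RecognizerControl.xaml.cs code -- in the real app the continuation passes TaskScheduler.FromCurrentSynchronizationContext() so CreateUIElements runs on the UI thread, which a console app doesn't have):

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

int completed = 0;
string[] digits = { "d1", "d2", "d3", "d4" };   // stand-in for the string array of digits
var tasks = new List<Task>();

foreach (var item in digits)
{
    // Task.Run creates the Task and starts it as soon as possible
    Task<int> recognition = Task.Run(() => item.Length);   // stand-in for Recognizer.predict()

    // The continuation runs after the prediction finishes; in the WPF app
    // this is where CreateUIElements() puts the result on the screen.
    tasks.Add(recognition.ContinueWith(prev =>
        Interlocked.Increment(ref completed)));
}

Task.WaitAll(tasks.ToArray());   // the real code awaits rather than blocking
Console.WriteLine(completed); // 4
```

Each item gets its own Task and its own continuation, which is what lets results trickle into the UI one at a time instead of arriving in a block.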

A Lot of Tasks
This generates *a lot* of Tasks. A Task is created for each item (242 for each classifier), and these are all told to "start now". It's then up to the Task Scheduler to figure out when and where to run each of these tasks.

So I'm creating 484 tasks, and saying "run these when you can".
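It's worth noting that 484 Tasks does not mean 484 threads. The Task Scheduler runs queued Tasks on a small pool of worker threads, which we can see in a quick sketch:

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Record which thread each Task actually runs on.
var threadIds = new ConcurrentDictionary<int, bool>();

var tasks = Enumerable.Range(0, 484)
    .Select(_ => Task.Run(() =>
    {
        threadIds.TryAdd(Thread.CurrentThread.ManagedThreadId, true);
        Thread.SpinWait(10_000);   // a little CPU work per task
    }))
    .ToArray();

Task.WaitAll(tasks);

// Far fewer threads than Tasks: the scheduler reuses pool threads.
Console.WriteLine(threadIds.Count < 484); // True
```

So the "run these when you can" approach leans on the scheduler to keep the thread count reasonable while the work queue drains.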

That can't be good for performance, right?

It runs faster than when it was single-threaded (and that makes a whole lot of sense). But what is it really doing for performance?

Performance Test
I was talking to someone about Tasks and performance, and it made me curious as to how creating all of these Tasks impacts the performance of this application.

I figured that if I stripped away the CPU-intensive processing, what we would have left would be the overhead of using Tasks this way. So I started by creating a "Null Classifier" that didn't do any work.

Initial results were promising:


If you'd like to see this code, then you'll need to switch over to the "NullClassifier" branch of the GitHub project: https://github.com/jeremybytes/digit-display/tree/NullClassifier.

Here's the very complex code for the Null Classifier (from "FunRecognizer.fs" in the "FunctionalRecognizer" project):


The "nullClassifier" takes in an integer array, and always returns "0". This doesn't do any processing, but it's also not very accurate :)
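A C# analog of that one-liner would look something like this (the actual code is F#, in FunRecognizer.fs; the int[]-in, string-out signature is my assumption for the sketch):

```csharp
using System;

// A classifier that ignores its input and always predicts "0" --
// zero CPU work, so any elapsed time is pure Task overhead.
Func<int[], string> nullClassifier = pixels => "0";

Console.WriteLine(nullClassifier(new[] { 1, 2, 3 })); // 0
```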

To get a clearer picture of what was going on, I also increased the number of records from 242 to 5,000 (which is actually 10,000 records since we have 2 classifiers). In the results, we can see that the "Manhattan Classifier" took 180 seconds to finish; the "Null Classifier" took 6 seconds to finish.

That implies that there is 6 seconds of overhead for using Task this way.

But that's also a bit misleading.

A More Accurate Test
The 6 seconds didn't seem like that much (relatively speaking), but it also seemed a bit high for what was going on. As it turned out, I wasn't eliminating all of the processing.

A big part of the processing was creating the bitmaps and putting them into the WrapPanel of the UI. So I went through and commented out the code that generated the UI elements and ran my test again. This time, the results were pretty surprising (even to me):


This processed the same number of records: 5000. The "Manhattan Classifier" took 164 seconds to complete, so it was a little bit faster. But the "Null Classifier" took no time at all (well, less than a second).

At first, I thought I may have taken out a bit too much code. Maybe the continuations weren't really running. But each record was processed, and we can see that the error count is the same as before: 4515. The error count is incremented in the continuation, so I knew that was still running.
So by eliminating the CPU-intensive operation and the UI-intensive code, I found that the Task overhead is negligible for this type of work.
That was good news to me. Performance is way better than single-threading (3 to 6 times better depending on the machine and the number of cores).

Caveats
It's very easy to mess up parallel programming. If we have interactions between the processes, things can get ugly very quickly. This code was designed to be parallelized from the ground up. This means that I made sure that each Task could be isolated and run without outside dependencies.

I also have a CPU-bound operation. This means that by using Task this way (and letting the Task Scheduler do its job of figuring out the best way to mete out the jobs), I can take advantage of multi-core machines. In fact, I just got a new laptop with 8 cores, and this code runs about twice as fast as on the 4-core machine I was using before.

I'm also running this on a reasonable number of records. In typical usage, this application would process 600 to 800 records -- the number that would fit on the screen. Even when I increased that 10-fold, I had good performance. If I were trying to process hundreds of thousands of records, I would expect this to break down pretty quickly.
For different scenarios, I'd definitely take a look at the advice in Parallel Programming with Microsoft .NET (my review). It has different patterns that we can follow depending on what our needs are.
Update: If you'd like to contribute your own ideas to improving this code, take a look at the next article which describes the application goals: Your Ideas Needed: Other Ways to Run in Parallel. Feel free to submit a pull request.

Wrap Up
I really like using Task. Whenever I tried to do my own threading, bad things would generally happen. That's why I've been a big fan of the BackgroundWorker component in the past and of Task and await today. These abstractions give us a higher-level way of dealing with these concurrent processes. And we can leave it up to the compiler and .NET run-time designers to take care of the details. I'm happy to leave things to people who know way more than I do.

If you want to explore some more, be sure to take a look at the articles in the README.md of the "digit-display" project on GitHub, and also check out the articles and videos on getting started with Task and await.

Happy Coding!