The good news is that the cost is negligible for the types of things we generally deal with in user space. To test this out, I did some experimentation with my Digit Recognizer application (GitHub project). In that project, I create a large number of Tasks in order to maximize the usage of the CPU cores on my machine.
There are a couple of caveats with regard to this particular project:
- The parallel tasks are CPU-bound, meaning the CPU speed is the constraint. This is different from tasks that are I/O-bound (whether waiting for a file system, network, or database).
- The number of tasks is a "reasonable" size. The definition of reasonable will vary based on the application, but we'll see this as we take a closer look at the code.
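The CPU-bound vs. I/O-bound distinction drives everything that follows, so here's a minimal sketch of the difference. The method names and the idea of reading a file are just illustrations (not from the project):

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

class BoundednessSketch
{
    // CPU-bound: the thread is busy the whole time, so spreading the work
    // across cores (Task.Run) is what speeds things up.
    static Task<long> SumSquaresAsync(int n) =>
        Task.Run(() =>
        {
            long total = 0;
            for (int i = 0; i < n; i++) total += (long)i * i;
            return total;
        });

    // I/O-bound: the thread mostly waits on something external, so async I/O
    // (not more cores) is what helps.
    static async Task<int> ReadLengthAsync(string path) =>
        (await File.ReadAllTextAsync(path)).Length;

    static async Task Main()
    {
        long sum = await SumSquaresAsync(1000);
        Console.WriteLine(sum); // 332833500
    }
}
```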
This is a machine learning application that is designed to try out different algorithms for recognizing hand-written digits. Here's a typical output (from the master branch of the project):
This shows 2 algorithms side by side: Manhattan distance on the left and Euclidean distance on the right. As we can see, the Euclidean distance takes a bit longer, but it is more accurate.
The way that I maximize CPU usage is to split each item into a separate task. That allows me to use all 8 cores on my machine.
Here's the code to run things in parallel (from RecognizerControl.xaml.cs in the DigitDisplay project):
This uses a bit of a brute-force approach to running things in parallel. The string array that comes into this method represents the items we're trying to recognize (each string is a separate digit - you can read more about it here: Displaying Images from Pixel Data).
The "foreach" loop grabs each individual string and creates a Task that will run it through the classifier. The "Recognizer.predict()" function is the one that does the heavy lifting.
After kicking off the task (Task.Run creates a Task and runs it as soon as possible), we add a continuation (in the "ContinueWith()" method). This runs the "CreateUIElements()" method that will create a bitmap, pair it with the prediction result, and then display it in the WrapPanel of our UI. This uses the "TaskScheduler.FromCurrentSynchronizationContext" to get back to the UI thread. You can learn more about this from my articles and videos about Task: Task, Await, and Asynchronous Methods.
The last line of the method ("Task.WhenAny()") is there to reset the timer appropriately. This is how we get the "Duration" on the screen.
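The method itself is shown as a screenshot in the original post. Stripped of the WPF-specific pieces (the bitmaps, the WrapPanel, and the UI-thread scheduler), the pattern looks roughly like this -- names like "Predict" are stand-ins for illustration, not the project's actual API:

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

class ParallelPatternSketch
{
    // Stand-in for Recognizer.predict() -- the real classifier lives in the F# project.
    static int Predict(string digitData) => digitData.Length % 10;

    static async Task Main()
    {
        string[] rawData = { "digit-one", "digit-two", "digit-three" };
        var results = new ConcurrentBag<int>();
        var tasks = new List<Task>();

        foreach (string item in rawData)
        {
            // Task.Run creates the Task and starts it as soon as possible;
            // ContinueWith handles the result (in the real app, the
            // continuation runs on the UI thread via
            // TaskScheduler.FromCurrentSynchronizationContext()).
            tasks.Add(Task.Run(() => Predict(item))
                          .ContinueWith(t => results.Add(t.Result)));
        }

        await Task.WhenAll(tasks);
        Console.WriteLine(results.Count); // 3
    }
}
```

The real method also wires up the timer reset described above; this sketch just shows the fan-out/continuation shape.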
A Lot of Tasks
This generates *a lot* of Tasks. A Task is created for each item (242 for each classifier), and these are all told to "start now". It's then up to the Task Scheduler to figure out when and where to run each of these tasks.
So I'm creating 484 tasks, and saying "run these when you can".
That can't be good for performance, right?
It runs faster than when it was single-threaded (and that makes a whole lot of sense). But what is it really doing for performance?
I was talking to someone about Tasks and performance, and it made me curious as to how creating all of these Tasks impacts the performance of this application.
I figured that if I stripped away the CPU-intensive processing, what we would have left would be the overhead of using Tasks this way. So I started by creating a "Null Classifier" that didn't do any work.
Initial results were promising:
If you'd like to see this code, then you'll need to switch over to the "NullClassifier" branch of the GitHub project: https://github.com/jeremybytes/digit-display/tree/NullClassifier.
Here's the very complex code for the Null Classifier (from "FunRecognizer.fs" in the "FunctionalRecognizer" project):
The "nullClassifier" takes in an integer array, and always returns "0". This doesn't do any processing, but it's also not very accurate :)
To get a clearer picture of what was going on, I also increased the number of records from 242 to 5,000 (which is actually 10,000 records since we have 2 classifiers). In the results, we can see that the "Manhattan Classifier" took 180 seconds to finish; the "Null Classifier" took 6 seconds to finish.
That implies that there is about 6 seconds of overhead for using Task this way (roughly 0.6 milliseconds per task across 10,000 tasks).
But that's also a bit misleading.
A More Accurate Test
The 6 seconds didn't seem like much (relatively speaking), but it also seemed a bit high for what was going on. As it turned out, I wasn't eliminating all of the processing.
A big part of the processing was creating the bitmaps and putting them into the WrapPanel of the UI. So I went through and commented out the code that generated the UI elements and ran my test again. This time, the results were pretty surprising (even to me):
This processed the same number of records: 5,000. The "Manhattan Classifier" took 164 seconds to complete, so it was a little bit faster. But the "Null Classifier" took no time at all (well, less than a second).
At first, I thought I may have taken out a bit too much code. Maybe the continuations weren't really running. But each record was processed, and we can see that the error count is the same as before: 4515. The error count is incremented in the continuation, so I knew that was still running.
So by eliminating the CPU-intensive operation and the UI-intensive code, I found that the Task overhead is negligible for this type of work. That was good news to me. Performance is way better than single-threading (3 to 6 times better, depending on the machine and the number of cores).
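If you'd like to sanity-check that overhead claim outside of the application, a console micro-benchmark along these lines works. This is a sketch (not code from the project), and the exact timing will vary by machine:

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

class TaskOverheadSketch
{
    static void Main()
    {
        const int taskCount = 10_000; // same order of magnitude as 2 x 5,000 records
        var tasks = new Task[taskCount];

        var timer = Stopwatch.StartNew();
        for (int i = 0; i < taskCount; i++)
        {
            // No-op body: whatever time this takes is pure scheduling overhead.
            tasks[i] = Task.Run(() => { });
        }
        Task.WaitAll(tasks);
        timer.Stop();

        Console.WriteLine($"{taskCount} no-op tasks in {timer.ElapsedMilliseconds} ms");
    }
}
```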
It's very easy to mess up parallel programming. If we have interactions between the processes, things can get ugly very quickly. This code was designed to be parallelized from the ground up. This means that I made sure that each Task could be isolated and run without outside dependencies.
I also have a CPU-bound operation. This means that by using Task this way (and letting the Task Scheduler do its job of figuring out the best way to mete out the jobs), I can take advantage of multi-core machines. In fact, I just got a new laptop with 8 cores, and this code runs about twice as fast as on the 4-core machine I was using before.
I'm also running this on a reasonable number of records. In typical usage, this application would process 600 to 800 records -- the number that would fit on the screen. Even when I increased that 10-fold, I had good performance. If I were trying to process hundreds of thousands of records, I would expect this to break down pretty quickly.
For different scenarios, I'd definitely take a look at the advice in Parallel Programming with Microsoft .NET (my review). It has different patterns that we can follow depending on what our needs are.

Update: If you'd like to contribute your own ideas to improving this code, take a look at the next article which describes the application goals: Your Ideas Needed: Other Ways to Run in Parallel. Feel free to submit a pull request.
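As one example of an alternative pattern from that space, Parallel.ForEach partitions the input across a bounded set of worker threads instead of creating one Task per item, which tends to hold up better at very large record counts. A minimal sketch (the "Predict" stand-in is hypothetical, not the project's classifier):

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

class PartitionedSketch
{
    // Hypothetical stand-in for the real classifier.
    static int Predict(string item) => item.Length % 10;

    static void Main()
    {
        string[] rawData = Enumerable.Range(0, 5000)
                                     .Select(i => $"digit-{i}")
                                     .ToArray();
        int processed = 0;

        // The runtime partitions rawData and reuses a bounded pool of
        // worker threads rather than queueing 5,000 separate Tasks.
        Parallel.ForEach(rawData, item =>
        {
            Predict(item);
            Interlocked.Increment(ref processed);
        });

        Console.WriteLine(processed); // 5000
    }
}
```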
I really like using Task. Whenever I tried to do my own threading, bad things would generally happen. That's why I've been a big fan of the BackgroundWorker component in the past and of Task and await today. These abstractions give us a higher-level way of dealing with these concurrent processes. And we can leave it up to the compiler and .NET runtime designers to take care of the details. I'm happy to leave things to people who know way more than I do.
If you want to explore some more, be sure to take a look at the articles in the README.md of the "digit-display" project on GitHub, and also check out the articles and videos on getting started with Task and await.