Wednesday, January 25, 2017

Machine Learning Digit Recognizer: Automatically Recognizing Errors

As an ongoing project, I've been working on an application that lets us visualize the results of recognizing hand-written digits (a machine learning project). To see a history, check out the "Machine Learning (sort of)" section in my Functional Programming Articles. You can grab the code from GitHub: jeremybytes/digit-display.

One thing that has kept me from working with this project is that looking for mistakes is tedious. I had to scan through the results, which included the bitmap and the computer's prediction, and determine whether each one was correct. I had a bit of a brainstorm on how I could automate this process, and that's what this is about.

Here's the result:

All of the "red" marks are done automatically. No more manual error counting.

Using a Single File
The key to getting this to work was to use a single data file instead of two data files.

Originally, I used a separate training set (approximately 42,000 records) and validation set (approximately 28,000 records). These files came from the Kaggle challenge that I talked about way back in my original article (Wow, have I really been looking at this for 2-1/2 years?). Both files contain the bitmap data for the hand-written digits. But the training set also includes a field for the actual value represented.

Rather than using both files, I decided to use the training set for both purposes. This way, I could check the actual value to see if the prediction was correct.

There is a bit of a problem, though. If I used the same records for both training and validation, I would end up with 100% accuracy because the records are exactly the same. So the trick was to take a single file, cut out the bit that I wanted to use for the validation set, and exclude it from the training set.

Here's an example. Let's say that I had a training set with 21 values:

5, 3, 6, 7, 1, 8, 2, 9, 2, 6, 0, 3, 3, 4, 2, 1, 7, 0, 7, 2, 5

What we can do is carve up the set so that we use some of it for training and some for validation. So, we can use 5 numbers for validation starting at an offset of 2:

5, 3, [6, 7, 1, 8, 2], 9, 2, 6, 0, 3, 3, 4, 2, 1, 7, 0, 7, 2, 5

This leaves us with 2 separate sets:

Training: 5, 3, 9, 2, 6, 0, 3, 3, 4, 2, 1, 7, 0, 7, 2, 5 
Validation: 6, 7, 1, 8, 2
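
To make the carve-out concrete, here is a minimal F# sketch of the same split. The "carve" function and its parameter names are made up for this illustration; they are not taken from the project.

```fsharp
// Hypothetical helper (not from the project): split an array into
// a training part and a validation part.
let carve (offset: int) (count: int) (data: 'a[]) =
    // The validation window: 'count' items starting at 'offset'.
    let validation = data |> Array.skip offset |> Array.take count
    // Everything before the window plus everything after it.
    let training =
        Array.append (Array.take offset data) (Array.skip (offset + count) data)
    training, validation

let values = [| 5; 3; 6; 7; 1; 8; 2; 9; 2; 6; 0; 3; 3; 4; 2; 1; 7; 0; 7; 2; 5 |]

// 5 values for validation, starting at an offset of 2.
let training, validation = carve 2 5 values

printfn "Training: %A" training       // the 16 values outside the window
printfn "Validation: %A" validation   // 6, 7, 1, 8, 2
```

Because the two parts come from the same offset and count, no value can land in both sets.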

This is obviously a simplified example. In our real file of 42,000 records, we'll be carving out a set of 325 records that we can work with. This still leaves us with lots of training data.

Note: This code is available in the "AutomaticErrorDetection" branch of the GitHub project.

Configuration Values
To hold the values, I opted to use the App.config file. I'm not very happy with this solution, but it works. I would much rather be able to select the values in the application itself, but that was troublesome. I'll come back to talk about this a bit later.

Here's the App Settings section of the configuration file:

This shows that our training file and data file now point to the same thing ("train.csv"). It also shows that we want to use 325 records for prediction, and that we want to start with record number 1,000 in the file.

Loading Training Data
This means that when we load up the records to use as a training set, we want to take the first 1,000 records, then skip 325 records, and then take the rest.

Here is the original function to load up the training set (in "FunRecognizer.fs"):

Original Loader - All Records

This just loaded up all of the records in the file.

Here are the updated functions to load up just the parts of the file we want:

New Loader - Skip Over the Records to be Validated

First we pull some values out of the configuration file, including the file name, the offset, and the record count. One thing I like here is that we can pipe the values for "offset" and "recordCount" to "Int32.Parse" to convert them from string values to integer values really easily.

Then we load up the data. By using "Array.concat", we can take two separate arrays and combine them into a single array. In this case, we're looking at the data in the file. The first part resolves to "data.[1..1000]", which gives us the first 1,000 records (the slice starts at index 1, presumably because index 0 holds the CSV header row). The second part resolves to "data.[1000+325+1..]", which is "data.[1326..]". Since there is no second number, this starts at record 1,326 and reads to the end of the array (file).

The effect is that when we load up the training set, we skip over 325 records that we can then use as our validation set.
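
Here is a small sketch of that slicing, scaled down so the result is easy to check by eye. It is not the code from "FunRecognizer.fs". The tiny in-memory "file" (one header line plus ten records) and the offset and record count of 3 and 4 are assumptions made for the illustration.

```fsharp
// Stand-in for the lines of the CSV file: index 0 is the header row
// (an assumption), followed by ten records named r1..r10.
let data =
    Array.append [| "header" |] (Array.init 10 (fun i -> sprintf "r%d" (i + 1)))

// Scaled-down stand-ins for the configured values (1,000 and 325 in the post).
let offset = 3
let recordCount = 4

// Same shape as the slices described above:
//   data.[1 .. offset]                  -> the records before the window
//   data.[offset + recordCount + 1 ..]  -> the records after the window
let trainingRows =
    Array.concat [ data.[1 .. offset]; data.[offset + recordCount + 1 ..] ]

printfn "%A" trainingRows   // r1, r2, r3, r8, r9, r10
```

Records r4 through r7 are left out of the training rows, which is the window reserved for validation.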

Loading Validation Data
We do something similar on the validation side. When we load up the data, we'll use the same file, and we'll pull out just the 325 records that we want.

Here's the method for that (in "FileLoader.cs"):

This starts by grabbing the values from configuration.

Then using a bit of LINQ, we Skip the first 1,000 records (in the case of the configuration shown above), then we Take 325 records (also based on the configuration). The effect is that we get the 325 records that were *not* used in the training set above.

By doing this, we can use the same file, but we don't have to be concerned that we're using the same records.
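
The project does this step in C# with LINQ. As a rough F# equivalent, "Seq.skip" and "Seq.take" play the same roles as Skip and Take. The sketch below is an illustration only: the ten-record stand-in file, the header row at index 0, and the small offset and record count are all assumptions.

```fsharp
// Stand-in for the lines of the CSV file: a header row (an assumption)
// followed by ten records named r1..r10.
let data =
    Array.append [| "header" |] (Array.init 10 (fun i -> sprintf "r%d" (i + 1)))

// Scaled-down stand-ins for the configured values (1,000 and 325 in the post).
let offset = 3
let recordCount = 4

let validationRows =
    data
    |> Seq.skip 1             // drop the header row
    |> Seq.skip offset        // skip the records before the window
    |> Seq.take recordCount   // keep only the validation window
    |> Seq.toArray

printfn "%A" validationRows   // r4, r5, r6, r7
```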

Marking Errors
There were a few more changes to allow the marking of errors. We can pull out the actual value from the data in the file and use that to see if our prediction is correct.

I added a parameter to the method that creates all of the UI elements (in "RecognizerControl.xaml.cs"):

The new parameter is "string actual". I'm using a string here rather than an integer because the prediction coming from our digit recognizer is a string.

Then in the body of this method, we just see if the prediction and actual are the same:

If they don't match, then we'll set the background color and increment the number of errors. There is a little more code here, but this gives enough to show what direction we're headed.
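
Stripped of the UI work, the check comes down to a string comparison and a counter. The sketch below shows that logic in F# rather than the project's C#. The sample predictions and the "red" and "none" labels are invented for the example.

```fsharp
// Invented sample data: (prediction, actual) pairs, both as strings.
let results = [ ("7", "7"); ("4", "9"); ("1", "1"); ("3", "5"); ("0", "0") ]

// A prediction is an error when it differs from the actual value.
let isError (prediction: string, actual: string) = prediction <> actual

// Pair each result with the background it would get on screen.
let marked =
    results |> List.map (fun r -> r, (if isError r then "red" else "none"))

// Count the errors instead of tallying them by hand.
let errorCount = results |> List.filter isError |> List.length

printfn "%d errors out of %d" errorCount results.Length   // 2 errors out of 5
```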

The result is that the errors are marked automatically. This saves a huge amount of time (and tedium) since we don't have to mark them manually. (Plus, I was never confident that I caught them all.)

Exploring Data
I wanted the values to be configurable because it's really easy to tune algorithms to work with a particular set of data. I wanted to be able to easily try different sets of data. Even with the simple algorithms that we have in this code, we can see differences.

If we pick an offset of 1,000, we get these results:

But if we pick an offset of 10,000, we get these results:

With the first set of data, the Euclidean Classifier looks a lot more accurate. But with the second set of data, the Manhattan Classifier looks to be more accurate. So I want to be able to try different validation sets to make sure that I'm not tuning things just for specific values.
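
To see how two classifiers can disagree on the same record, here is a toy sketch. It assumes that each classifier picks the closest training record and that the two differ only in how they measure distance, which this post does not spell out. The two-pixel "images" and the names "A" and "B" are invented.

```fsharp
// Manhattan distance: sum of absolute pixel differences.
let manhattan (a: int[]) (b: int[]) =
    Array.map2 (fun x y -> abs (x - y)) a b |> Array.sum

// Squared Euclidean distance: sum of squared pixel differences.
// (Skipping the square root does not change which record is closest.)
let euclideanSquared (a: int[]) (b: int[]) =
    Array.map2 (fun x y -> (x - y) * (x - y)) a b |> Array.sum

// Invented two-pixel "images".
let unknown = [| 0; 0 |]
let candidates = [ ("A", [| 3; 3 |]); ("B", [| 0; 5 |]) ]

// Return the name of the candidate closest to 'unknown' under a distance.
let nearest distance =
    candidates
    |> List.minBy (fun (_, pixels) -> distance unknown pixels)
    |> fst

printfn "Manhattan picks %s" (nearest manhattan)          // B (5 vs. 6)
printfn "Euclidean picks %s" (nearest euclideanSquared)   // A (18 vs. 25)
```

With the same data, the two measures choose different neighbors, so which one looks more accurate can depend on the validation window.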

I also like the side-by-side comparison, which shows whether the two classifiers make the same errors or different ones.

Easily Changing Values
In earlier versions of this application, the "Record Count" and "Offset" values on the screen were editable values. This made it really easy to change the values and click "Go" to see the new results. But that's not possible when we're using the configuration file. So why the change?

On my first attempt, I tried to figure out how to get the values from the screen to the relevant parts of the application. This was pretty easy to do in the validation set code, but it got a bit trickier to get it into the training set code.

The training set code is nested several levels deep in the functional code. This meant adding some parameters to the "reader" function that is shown above. But because this is called within a function that is called within a function that is called within a function, how could I get those values in there?

I tried adding a parameter and then bubbling that parameter up. This became problematic from a syntactical standpoint, but I also wasn't happy exposing those values in that way. It seemed very "leaky".

So an easy way to fix this was to store the values in a central location that everyone could access. And that's why I created the configuration file.

Future Updates
Now that I have this working, I'm going to do a bit more experimentation. I would like to have the ability to change the values without having to restart the application. So, I'm going to put back the editable text boxes and see if I can work with a separate object to hold these values.

This would have to be in a separate project to prevent circular dependencies. And I would also want to figure out how to use the same immutable object for the training code and validation code. This will ensure that both are using the same values. If they use different values, then things will fall apart pretty quickly.
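
One possible shape for that shared object is an F# record, which is immutable by default. This is only a sketch of the idea, not a design from the project. The "ValidationWindow" type, both function names, and the 42,000-element stand-in array are all hypothetical.

```fsharp
// Hypothetical shared settings: one immutable value used by both sides.
type ValidationWindow = { Offset: int; RecordCount: int }

// Training side: everything outside the window.
let trainingRecords (w: ValidationWindow) (records: 'a[]) =
    Array.append
        (Array.take w.Offset records)
        (Array.skip (w.Offset + w.RecordCount) records)

// Validation side: only the window.
let validationRecords (w: ValidationWindow) (records: 'a[]) =
    records |> Array.skip w.Offset |> Array.take w.RecordCount

// The values from the post's configuration.
let window = { Offset = 1000; RecordCount = 325 }

// Stand-in for the parsed records (header already removed).
let records = Array.init 42000 id

let training = trainingRecords window records
let validation = validationRecords window records

printfn "%d training, %d validation" training.Length validation.Length
// 41675 training, 325 validation
```

Passing the same "window" value to both functions means the two sides cannot drift apart, which is the failure described above.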

Wrap Up
It's always fun to play and experiment with different things. Now that I've made this application easier to work with (and much less tedious), I'm more likely to explore different algorithms. I've done a little bit of experimentation in the past, but it will be easier to see results now.

As I continue, I'll look at converting more of this code to F# (I still need more practice with that). And we'll see if we can teach the computer to get better at recognizing those hand-written digits.

Happy Coding!
