Thursday, July 16, 2015

Getting More LINQ-y

As I explore functional programming more, I'm learning how I haven't been using LINQ as much as I could be.

Last time, I talked about how I was excited about what I saw in Mathias Brandewinder's book Machine Learning Projects for .NET Developers (Amazon link). I made it through the first chapter, and I'm absorbing lots of stuff about machine learning and F#, but I also came across something simple that I hadn't thought about before:
LINQ can be used much more extensively than I've been using it.
Now, I've been a huge proponent of LINQ, but I usually think of it as a way to work with existing data sets. I don't really think of it as a way to create those data sets.

Functionalizing Code
I'll show what I'm talking about by going back to an article from last year where I needed to load data from a text file (Coding Practice: Displaying Bitmaps from Pixel Data).

Here's my original load from file method:


This is fairly straightforward (and a bit verbose) code. It gets the data file name from configuration, then reads the file, and returns an array of strings (one for each line in the file).

There are a couple of quirks. Notice the "ReadLine()" with the comment "skip the first line". The first line contains header information, so I just did a read to ignore that.

The other quirk is that I have a parameter for the number of lines that I want to read. If the parameter is supplied, then I only want to read that number of lines (the original data file has 40,000 lines, and it was much easier to deal with smaller data sets). And if the parameter is not supplied, then we read in all the lines.
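To make that description concrete, here's a rough sketch of what a stream-reader method like this could look like. This is not the listing from the repository. The class and method names, the "fileName" parameter (passed in directly instead of looked up from configuration), and the "threshold" parameter name are all assumptions for illustration.

```csharp
using System.Collections.Generic;
using System.IO;

public static class StreamReaderLoader
{
    // Hypothetical signature: a threshold of 0 means "read every line".
    public static string[] LoadDataStrings(string fileName, int threshold = 0)
    {
        var lines = new List<string>();

        using (var reader = new StreamReader(fileName))
        {
            reader.ReadLine(); // skip the first line (header information)

            string line;
            while ((line = reader.ReadLine()) != null)
            {
                lines.Add(line);

                // Stop early once the requested number of lines is collected.
                if (threshold > 0 && lines.Count >= threshold)
                    break;
            }
        }

        return lines.ToArray();
    }
}
```

Calling "LoadDataStrings(path, 100)" on a file shaped like this would return the first 100 data lines, and leaving the second argument off would return all of them.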

If you want to look at this code, it's available on GitHub: jeremybytes/digit-display. Look at the initial revision of the file here: FileLoader.cs original.

LINQ to the Rescue!
Now, Mathias shows code to load data from the same file/format (we'll look at that in just a bit). But instead of using a stream reader like I did, he used LINQ pretty much straight across.

So I went back and retro-fitted my file loading method. Here's what's left:


This code does the same thing as the method above. Well, not exactly the same, but close enough for government work.

Rather than using a stream reader, we use "ReadAllLines" to bring in the entire file. (I'll talk about the performance implications of this in just a bit). After that, we just use the standard LINQ methods to get the data we want.

The "Skip(1)" call skips the header row. The "Take" method limits the number of records to the value passed in to the parameter. As a side note, notice that I changed the parameter default from "0" to "int.MaxValue". This way, if the parameter is omitted, the entire file is read. (Well, up to the max integer value anyway. That would be a problem for large data sets, but then so would the rest of this application.)

Then we just use the "ToArray" method to get it into the final format that we want.
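Putting those three calls together, a LINQ-based loader along these lines might look like the sketch below. Again, this is an illustration and not the file from the repository: the class name, the method name, and the "fileName" and "threshold" parameter names are assumptions, and the path is passed in rather than read from configuration.

```csharp
using System.IO;
using System.Linq;

public static class LinqLoader
{
    // Hypothetical signature: omitting threshold reads the whole file
    // (up to int.MaxValue lines).
    public static string[] LoadDataStrings(string fileName, int threshold = int.MaxValue)
    {
        return File.ReadAllLines(fileName) // reads the entire file into memory
            .Skip(1)                       // skip the header row
            .Take(threshold)               // limit the number of records
            .ToArray();                    // final format: string[]
    }
}
```

The whole method body is a single expression, and each step reads the way the description does: skip one, take some, make an array.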

I don't know why I never thought of doing things this way before. There is a performance difference, though. Since "ReadAllLines" reads the entire file, we bring in more data than we need whenever we only ask for a subset of the lines. The heavy lifting of this application is done after this step, so the difference is negligible here. But these are the things we need to think about as we make changes to our code.

[Update: 07/20/2015: As suggested in the comments, I explore the difference between "ReadAllLines" and "ReadLines" in Getting a Bit LINQ-ier.]
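For reference, the difference between the two calls comes down to when the file is read. "File.ReadAllLines" returns a string array with every line already loaded, while "File.ReadLines" returns a lazily evaluated enumerable, so "Take" can stop the enumeration before the end of the file. Here's a minimal sketch of the lazy variant; the class, method, and parameter names are assumptions for illustration, not code from the follow-up article.

```csharp
using System.IO;
using System.Linq;

public static class LazyLoader
{
    // Hypothetical signature, mirroring the loader described above.
    public static string[] LoadDataStrings(string fileName, int threshold = int.MaxValue)
    {
        return File.ReadLines(fileName) // lazy: lines are read as they are enumerated
            .Skip(1)                    // skip the header row
            .Take(threshold)            // enumeration stops once enough lines are taken
            .ToArray();
    }
}
```

The results are the same either way. What changes is how much of the file has to be read to get them.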

If you want to look at this code yourself, the latest version of this file is on GitHub: jeremybytes/digit-display. Here's a link to the file: FileLoader.cs current.

The Inspiration
Like I said, the way that Mathias loads the data in his book was the eye-opener. Here's that code (from Chapter 1):


This is doing multiple steps. The second method (ReadObservations) reads the data from the file. In addition to skipping the first line, it does a data transformation using the "Select" method.

And we can see this in the first method. It takes the string data (which is a collection of comma-separated integers) and turns it into an integer array. This process skips the first value because it tells which digit the data represents. Everything after that on the line is the pixel data.
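The shape of that two-step approach can be sketched roughly like this. To be clear, this is not the listing from the book. "ReadObservations" is the name mentioned above, but "ParsePixels", the class name, and the "int[][]" return type are assumptions made to keep the example small.

```csharp
using System;
using System.IO;
using System.Linq;

public static class ObservationReader
{
    // Turns one line of comma-separated integers into pixel values.
    // The first value is the digit label, so it is skipped here.
    public static int[] ParsePixels(string line)
    {
        return line.Split(',')
            .Skip(1)
            .Select(value => Convert.ToInt32(value))
            .ToArray();
    }

    // Reads the file, skips the header line, and transforms each
    // remaining line from a string into an integer array.
    public static int[][] ReadObservations(string dataPath)
    {
        return File.ReadAllLines(dataPath)
            .Skip(1)
            .Select(line => ParsePixels(line))
            .ToArray();
    }
}
```

The interesting part is that "Select" shows up twice: once to convert each value on a line, and once to convert each line in the file.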

More LINQ-y Goodness
So, I wasn't content with just the file loading part of the application. I also needed to take the string data and convert it into a list of integer values.

Here's my original attempt at that (just part of this particular method):


This isn't bad code. I split the line on the commas, then "foreach" over the elements to convert them from strings to integers. File here: DigitBitmap.cs original.
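As a rough sketch of that split-then-"foreach" pattern (not the code from the repository; the class and method names here are made up for illustration):

```csharp
using System.Collections.Generic;

public static class ForeachParser
{
    // Hypothetical helper: converts every comma-separated item to an integer,
    // including the first one.
    public static List<int> ParseLine(string line)
    {
        var values = new List<int>();
        var items = line.Split(',');

        foreach (var item in items)
        {
            values.Add(int.Parse(item));
        }

        return values;
    }
}
```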

And here's the updated code:


Here's a link to the file on GitHub: DigitBitmap.cs current.

As far as lines of code are concerned, these are about equivalent. You might notice that the original code doesn't have the "Skip 1" functionality. That's because I took care of it later in the method (which resulted in an off-by-one error that you can read about in the original article).
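A LINQ version of this kind of conversion, with the "Skip 1" step included, might look something like the sketch below. It is an illustration under assumptions (the class and method names are invented, and the result type is a guess), not the file linked above.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class LinqParser
{
    // Hypothetical helper: drops the first item (the digit label) and
    // converts the remaining comma-separated items to integers.
    public static List<int> ParseLine(string line)
    {
        return line.Split(',')
            .Skip(1)
            .Select(item => int.Parse(item))
            .ToList();
    }
}
```

With the skip done at the same place as the conversion, there is no later step that has to remember to ignore the first value.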

A Bit of a Stumbling Block
The thing that I hadn't really thought about before was using "Select" to transform the data from string to integer. And this was a bit of a stumbling block for me.

I have seen "Select" misused in the past. The code sample that I saw called a method in "Select" that had side effects -- it mutated state on another object rather than simply returning data. This smelled really bad to me when I saw it. Since then, I've been careful and have shied away from using "Select" with other methods.

But in this case, it's perfectly appropriate. We are not mutating data. We are not changing state. We are transforming the data -- and this is exactly what "Select" is designed for. This was a real eye-opener, and I'm going to be a bit more creative about how I use LINQ in the future.
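Here's a small sketch of the difference between the two uses. Both methods are hypothetical and written just for this comparison; the second one is not the code sample mentioned above, only an example of the same kind of problem.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class SelectExamples
{
    // Pure transformation: the result depends only on the input,
    // and nothing outside the query changes.
    public static int[] ToIntegers(IEnumerable<string> values)
    {
        return values.Select(v => int.Parse(v)).ToArray();
    }

    // Side-effecting projection (the thing to avoid): the lambda mutates
    // the 'seen' list. Because Select is deferred, the mutation happens
    // again every time the returned query is enumerated.
    public static IEnumerable<int> ToIntegersAndRecord(
        IEnumerable<string> values, List<string> seen)
    {
        return values.Select(v =>
        {
            seen.Add(v); // changes state outside the query
            return int.Parse(v);
        });
    }
}
```

If the query returned by the second method is enumerated twice, the list ends up with every value recorded twice. That kind of surprise is why the side-effect version smells, and why the plain transformation doesn't.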

Wrap Up
I've been a big fan of LINQ for a really long time. And I really like showing people how to use it (and if you don't believe me, just check out this video series: Lambdas and LINQ in C#). I'm really happy that I can still extend my outlook and find new and exciting things to do with my existing tools.

And I find myself getting sucked more into the functional world. The problems are interesting, the techniques are intriguing, and the people I know are awesome.

I'm pretty sure that by the time I get through this book, I'll be looking for ways to use functional programming (and most likely F#) as a primary tool in my toolbox.

Happy Coding!

3 comments:

  1. Have you thought of using File.ReadLines() instead? It returns a lazily evaluated enumerable. It's arguably a bit more LINQy, and perhaps fits your use case a bit better (selecting subsets of the file)?

    1. Interestingly enough, Mathias has a sidebar that talks about File.ReadAllLines() and File.ReadLines() and notes that the distinction is particularly important when dealing with extremely large files that are common in machine learning scenarios.

      You're right that it would be more appropriate here -- especially since we have the Take() method that could stop the enumeration before reaching the end of the file. I'll do some experimentation and post a follow-up. It will also be interesting to compare results with different file sizes (the current file is 73MB).

    2. I did some experimentation and File.ReadLines() is the much better choice for this application. It also opens up some downstream options by having an enumeration in place. Read more here: Getting a Bit LINQ-ier.
