Sunday, July 19, 2015

Getting a Bit LINQ-ier

Last time we converted a file load method to be a bit more LINQ-y. The result was a compact and very readable method:

But should we be happy with this? Commenter "TomThumb" is not:
"Have you thought of using File.ReadLines() instead? It returns a lazily evaluated enumerable. It's arguably a bit more LINQy, and perhaps fits your use case a bit better (selecting subsets of the file)?"
Yep, this is probably a good idea. Let's take a closer look and do some informal metrics.

ReadAllLines vs. ReadLines
We'll start by looking at the difference between the methods ReadAllLines() and ReadLines(). Here are the method signatures from the documentation:

This shows us that "ReadAllLines" returns a string array, and "ReadLines" returns a string enumeration (an "IEnumerable<string>").

What's the difference? Well, "ReadAllLines" reads the entire file before it returns. "ReadLines" reads one line at a time as we enumerate through the file. If we stop enumerating, then "ReadLines" stops reading from the file. So this gives us an opportunity to short-circuit the file read.

And this is important because of how we're using the data from the file:

Pay attention to the "Take" method. Our file has about 40,000 records in it. But we may only "Take" 1,000 records, so there's no need for us to read the rest of the file.
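To make the short-circuit visible, here is a minimal sketch (not code from the project). The "LazyReadDemo" class and its "CountPulls" helper are names I made up for illustration; the helper simply counts how many lines actually get pulled out of "File.ReadLines" when "Take" sits on top of it.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical demo class; not part of the digit-display project.
public static class LazyReadDemo
{
    // Passes lines through unchanged and calls onPull once per line
    // that a downstream consumer actually asks for.
    public static IEnumerable<string> CountPulls(IEnumerable<string> source, Action onPull)
    {
        foreach (var line in source)
        {
            onPull();
            yield return line;
        }
    }

    // Returns how many lines were pulled from the file when we Take(count).
    public static int LinesPulledWhenTaking(string path, int count)
    {
        int pulled = 0;

        // File.ReadLines returns IEnumerable<string>, so nothing is read
        // until ToArray starts enumerating, and Take stops the enumeration
        // once it has handed out "count" lines.
        CountPulls(File.ReadLines(path), () => pulled++)
            .Take(count)
            .ToArray();

        return pulled;
    }
}
```

If the file has at least 1,000 lines, calling "LinesPulledWhenTaking(path, 1000)" should report 1,000 pulls no matter how long the file is. With "File.ReadAllLines" there is nothing to count: the whole file is already in the array before "Take" ever runs.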

This sounds good in theory, but let's put it into practice and see if we get a performance difference.

Comparing Performance
For this, we'll compare our original method (using "ReadAllLines") with a new method that uses "ReadLines". Here's the updated method:
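As a rough sketch of what a "ReadLines"-based loader might look like: the "fileName" and "recordCount" parameters and the "FileLoaderSketch" class name are assumptions for illustration, and the actual method in FileLoader.cs may differ in its details.

```csharp
using System.IO;
using System.Linq;

// Hypothetical stand-in for the project's file loader class.
public static class FileLoaderSketch
{
    // fileName and recordCount are assumed parameters, not the project's real signature.
    public static string[] LoadDataStrings(string fileName, int recordCount)
    {
        return File.ReadLines(fileName)   // lazy: reads a line only when asked
                   .Take(recordCount)     // stop asking after recordCount lines
                   .ToArray();            // force the read and return an array
    }
}
```

Swapping "File.ReadLines" for "File.ReadAllLines" in this sketch gives the same result, but only after the whole file has been read into memory.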

And I'm curious about differences in performance based on how much of the file we read. So we'll do tests reading 1,000 records, 10,000 records, and the entire file (about 40,000 records).

1,000 Records
The application itself already has a duration timer built in, so we'll just use this. Now this shows the duration for the entire process which includes reading the file, processing the data into bitmaps, and loading them into a WPF list box.
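If you want to take similar informal timings in your own code, a "Stopwatch" wrapper is usually enough. This is a generic sketch, not the timer from the application; "TimingSketch" and "Measure" are made-up names.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper for informal timings.
public static class TimingSketch
{
    // Runs the given work once and returns how long it took.
    public static TimeSpan Measure(Action work)
    {
        var stopwatch = Stopwatch.StartNew();
        work();
        stopwatch.Stop();
        return stopwatch.Elapsed;
    }
}
```

A call like "TimingSketch.Measure(() => LoadAndDisplay())" (where "LoadAndDisplay" stands in for whatever you want to time) returns a "TimeSpan", and "TotalSeconds" gives a value comparable to the averages in this article. Single runs are noisy, so average several of them.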

The file load process is the fastest part of this process, so I expect to see the biggest differences when I only load a small portion of the data file.

Here are the results when we limit things to 1,000 records:

ReadAllLines - 1,000 Records
ReadLines - 1,000 Records

As expected, this is a pretty dramatic difference. Over several runs, the average with "ReadAllLines" was 1.316 seconds. The average with "ReadLines" was 0.951 seconds. That's about 0.365 seconds faster, or roughly 28% less time.

This makes sense since we don't have to read 39,000 records from the file, just the 1,000 that we're using. But let's keep going. I'm curious as to whether we'll keep that advantage with more records.

10,000 Records
The next tests were with 10,000 records. And the difference is still noticeable:

ReadAllLines - 10,000 Records
ReadLines - 10,000 Records

The average with "ReadAllLines" with 10,000 records was 9.311 seconds. The average with "ReadLines" was 9.042 seconds. Again, the bulk of the work being done here is creating bitmaps and loading the list box. But we see that there is about a 1/4 second difference in our results.

40,000 Records
For the last set of tests, I read the entire file, which is about 40,000 records. This is where I was most curious. I wondered if the overhead of the enumeration would outweigh the efficiency of reading all the data into memory at once.

Let's look at the results:

ReadAllLines - 40,000 Records
ReadLines - 40,000 Records

This is where the tables turn -- but just a little bit. The average with "ReadAllLines" when we read the entire file is 37.811 seconds. The average with "ReadLines" is 37.877 seconds.

These results are so close (less than a tenth of a second apart on a run of almost 38 seconds) that I don't feel right calling one "faster" than the other.

What this tells us is that when we "short-circuit" the file reading process (by only reading a portion of the file), we do get a definite advantage from "ReadLines". And when we read the entire file, we do not get a noticeable performance hit from using "ReadLines".

So, in this particular instance, it would be better for us to use "ReadLines".

I have updated the code in the GitHub project: jeremybytes/digit-display. If you want to play with this code yourself, the file load method is in the FileLoader.cs file, and the read threshold is set in the MainWindow.xaml.cs file.

Here's our file loader method:

Now that we've made this method more LINQ-y by using a Read method that returns an IEnumerable instead of an array, we can look at taking this further.

I'm a big fan of enumerations and IEnumerable. There are some really cool things that we can do (particularly around lazy-loading). In this method, when we call "ToArray" it causes the enumeration to be evaluated. So any lazy-loading goes away right there. We force all of the records to be enumerated.

But we have another option. What if we were to push the enumeration further down in our application? So instead of returning a "string[]" from our "LoadDataStrings" method, we could return "IEnumerable<string>".

How would that affect our downstream code? Well here's where this method is used (in our MainWindow.xaml.cs file):

Notice that "rawData" is a string array, but this could just as easily be an "IEnumerable<string>". This would change how our code runs. This may be good or bad depending on our needs.

If we switch to an enumeration (rather than an array), then our "foreach" will end up going all the way back to our original enumeration from "ReadLines". This means that a line will be read from the file, then processed into a bitmap, then loaded into the list box before the next line is read from the file.
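Here is a small sketch of that ordering difference. It is not the project's code: "DeferredDemo", "LogReads", and the "materialize" flag are names I made up, and "process" just stands in for the bitmap and list box work.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical demo class; not part of the digit-display project.
public static class DeferredDemo
{
    // Records a "read" entry each time a line is pulled from the source.
    private static IEnumerable<string> LogReads(IEnumerable<string> lines, List<string> log)
    {
        foreach (var line in lines)
        {
            log.Add("read " + line);
            yield return line;
        }
    }

    // materialize = true mimics the current ToArray approach;
    // materialize = false mimics returning IEnumerable<string>.
    public static List<string> Run(string path, int count, bool materialize)
    {
        var log = new List<string>();
        IEnumerable<string> rawData = LogReads(File.ReadLines(path), log).Take(count);

        if (materialize)
        {
            // All reads happen here, before any processing starts.
            rawData = rawData.ToArray();
        }

        foreach (var line in rawData)
        {
            log.Add("process " + line);
        }

        return log;
    }
}
```

For a file containing the lines a, b, and c with a count of 2, the lazy version logs "read a, process a, read b, process b", while the "ToArray" version logs "read a, read b, process a, process b". In the lazy version the file is only released when the "foreach" finishes and disposes the enumerator.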

Is this a good thing? I'm not quite sure -- I'll need to think about this some more. One downside to doing things this way is that the file stays open during the entire process -- even while we're loading our UI controls.

With our current code (which calls "ToArray"), we're done with the file as soon as the file load method is finished. So the file can be closed before we start doing any processing on the data.

I'll leave things the way they are (with the arrays) for now. If you have a preference one way or the other, be sure to leave a comment.

Wrap Up
Thanks to "TomThumb" for making me think about things a bit more. Mathias actually does have a sidebar in his book that talks about the difference between "ReadAllLines" and "ReadLines". I appreciate the push to explore this a bit further.

When the question "How LINQ-y can you get?" comes up, the answer is often "a bit LINQ-ier". I'll look forward to exploring this (and a bunch of F#) as I dive deeper.

Happy Coding!
