CSV Parsing In .NET Core

It’s almost a rite of passage for a junior developer to kludge together their own CSV parser using a simple string.Split(','), and then subsequently work out that there is a bit more to this whole CSV thing than just separating out values by a comma. I recently took a look at my options for parsing CSVs in .NET Core, here’s what I found!
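Just to make that concrete, here’s a quick sketch (with a made up line of data) of where the naive approach falls over the moment a field is quoted:
// The comma inside the quoted field is data, not a delimiter...
var line = "\"Toyota, Inc.\",Corolla,1990";

// ...but string.Split cuts straight through it :
var fields = line.Split(',');
// fields = { "\"Toyota", " Inc.\"", "Corolla", "1990" } - 4 fields instead of 3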

CSV Gotchas

There are a couple of CSV gotchas that have to be brought up before we dive deeper. They should explain why rolling your own parser is sometimes more pain than it’s worth.
  • A CSV may or may not have a header row. If there is a header row, the order of the columns is not important, since you can detect what is actually in each column. If there is no header row, you rely on the order of the columns being the same. Any CSV parser should be able to read columns both by “header” value and by index.
  • Any field may be contained in quotes. However, fields that contain a line break, comma, or quotation marks must be contained in quotes.
  • To re-emphasize the above, line breaks within a field are allowed in a CSV as long as the field is wrapped in quotation marks. This is what trips up most people who simply read line by line as if it were a regular text file.
  • Quote marks within a field are escaped by doubling them (as opposed to, say, an escape character like a backslash).
  • Each row should have the same number of columns; however, in the RFC this is labelled as a “should” and not a “must”.
  • While yes, the C in CSV stands for comma, ideally a CSV parser can also handle things like TSV (that is, using tabs instead of commas).
And obviously this is just for parsing CSV files into primitives; in something like .NET we will also need:
  • Deserializing into a list of objects
  • Handling of enum values
  • Custom mappings (so the header value may or may not match the name of the class property in C#)
  • Mapping of nested objects

Setup For Testing

For testing out these libraries, I wanted a typical import scenario. So this included the use of different primitive types (strings, decimals, etc.), a line break within a string (valid as long as it’s contained within quotes), quotes within a field, and the use of a “nested” object in C#.
So my C# model ended up looking like:
class Automobile
{
    public string Make { get; set; }
    public string Model { get; set; }
    public AutomobileType Type { get; set; }
    public int Year { get; set; }
    public decimal Price { get; set; }
    public AutomobileComment Comment { get; set; }
}

class AutomobileComment
{
    public string Comment { get; set; }
}

enum AutomobileType
{
    None,
    Car, 
    Truck, 
    Motorbike
}
And our CSV file ended up looking like:
Make,Model,Type,Year,Price,Comment
"Toyota",Corolla,Car,1990,2000.99,"Comment with a 
line break and "" quotes"
So a couple of things to point out:
  • Our “Make” is enclosed in quotes but our “Model” is not. Both are valid.
  • The “Type” is actually an enum and should be deserialized as such.
  • The “Comment” field is a bit of a pain. It contains quotes, it has a line break, and in our C# code it’s actually a nested object.
All of this, in my opinion, is a pretty common setup for a CSV file import, so let’s see how we go.

CSV Libraries

So we’ve read all of the above and we’ve decided that rolling our own library is just too damn hard. Someone’s probably already solved these issues for us, right? As it happens, yes they have. Let’s take a look at what’s out there in terms of libraries. Realistically, I think there are only two that really matter, CsvHelper and TinyCsvParser, so let’s narrow our focus down to them.

CsvHelper

Website: https://joshclose.github.io/CsvHelper/
CsvHelper is the granddaddy of working with CSV in C#. I figured I would have to do a little bit of manual mapping for the comment field, but hopefully the rest would work out of the box. Well… I didn’t even have to do any mapping at all. It managed to work everything out itself with nothing but the following code:
TextReader reader = new StreamReader("import.txt");
var csvReader = new CsvReader(reader); // note: recent versions of CsvHelper also require a CultureInfo argument here
var records = csvReader.GetRecords<Automobile>();
Really, really impressive. And it handled double quotes, new lines, and enum parsing all on its own.
The thing that wowed me the most about this library is how extensible it is. It can handle completely custom mappings, custom type conversion (so you could split a field into, say, a dictionary or a child list), and even has really impressive error handling options.
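As a taste of that extensibility, here’s a rough sketch of what an explicit mapping could look like if the auto mapping ever isn’t enough. The mapping class itself is my own invention for illustration, but ClassMap, Name, ConvertUsing and RegisterClassMap are the relevant pieces of the CsvHelper API of this era (newer versions renamed ConvertUsing to Convert):
using CsvHelper.Configuration;

class AutomobileMap : ClassMap<Automobile>
{
    public AutomobileMap()
    {
        // Map by header name rather than column index
        Map(m => m.Make).Name("Make");
        Map(m => m.Model).Name("Model");
        Map(m => m.Type).Name("Type");
        Map(m => m.Year).Name("Year");
        Map(m => m.Price).Name("Price");
        // Custom conversion for the nested comment object
        Map(m => m.Comment).ConvertUsing(row => new AutomobileComment { Comment = row.GetField("Comment") });
    }
}

// Registered on the reader before calling GetRecords :
// csvReader.Configuration.RegisterClassMap<AutomobileMap>();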
The fact that in 3 lines of code I’m basically done with the CSV parsing really points out how good this thing is. I could go on and on about its features, but we do have one more parser to test!

TinyCsvParser

Website: http://bytefish.github.io/TinyCsvParser/index.html
Next up was TinyCsvParser. Someone had pointed me to this one because its speed is supposedly pretty impressive. We’ll take a look at that in a bit, but for now we just wanted to see how it handled our file.
That’s where things started to fall down. It seems that TinyCsvParser doesn’t have “auto” mappings; instead you have to create a class to map them yourself. Mine ended up looking like this:
class CsvAutomobileMapping : CsvMapping<Automobile>
{
    public CsvAutomobileMapping() : base()
    {
        // TinyCsvParser maps by column index rather than header name
        MapProperty(0, x => x.Make);
        MapProperty(1, x => x.Model);
        MapProperty(2, x => x.Type, new EnumConverter<AutomobileType>());
        MapProperty(3, x => x.Year);
        MapProperty(4, x => x.Price);
        MapProperty(5, x => x.Comment, new AutomobileCommentTypeConverter());
    }
}

class AutomobileCommentTypeConverter : ITypeConverter<AutomobileComment>
{
    public Type TargetType => typeof(AutomobileComment);

    // Wraps the raw CSV field in our nested comment object
    public bool TryConvert(string value, out AutomobileComment result)
    {
        result = new AutomobileComment
        {
            Comment = value
        };
        return true;
    }
}
So, right from the outset there is a bit of overhead here. I also don’t like how I have to map to a specific column index instead of a column heading. You will also notice that I had to create my own type converter for the nested comment object. I feel like this was expected, but the documentation doesn’t specify how to actually create one, and I had to delve into the source code to work out what I needed to do.
And once we run it, oof! While it does handle almost every scenario, it doesn’t handle line breaks in a field. Removing the line break did mean that we could parse everything else, including the enums, double quotes, and (eventually) the nested object.
The code to actually parse the file (together with the mappings above) was:
CsvParserOptions csvParserOptions = new CsvParserOptions(true, ','); // true = skip the header row, ',' = the field separator
var csvParser = new CsvParser<Automobile>(csvParserOptions, new CsvAutomobileMapping());
var records = csvParser.ReadFromFile("import.txt", Encoding.UTF8);
So not a hell of a lot of code to actually parse the CSV, but it does require a bit of setup. Overall, I didn’t think it was even in the same league as CsvHelper.
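One more wrinkle worth noting before we move on: from what I could tell, ReadFromFile hands back mapping results rather than the entities themselves, so you have to unwrap them (and ideally check validity) yourself:
// Each record is a CsvMappingResult<Automobile>, not an Automobile,
// so filter out any rows that failed to map and pull out the entity :
var automobiles = records
    .Where(x => x.IsValid)
    .Select(x => x.Result)
    .ToList();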

Benchmarking

The thing is, while I felt CsvHelper was miles more user friendly than TinyCsvParser, the entire reason the latter was recommended to me was that it’s supposedly faster. So let’s put that to the test.
I’m using BenchmarkDotNet (guide here) to do my benchmarking, and my code looks like the following:
[MemoryDiagnoser]
public class CsvBenchmarking
{
    [Benchmark(Baseline = true)]
    public IEnumerable<Automobile> CSVHelper()
    {
        TextReader reader = new StreamReader("import.txt");
        var csvReader = new CsvReader(reader);
        var records = csvReader.GetRecords<Automobile>();
        return records.ToList();
    }

    [Benchmark]
    public IEnumerable<Automobile> TinyCsvParser()
    {
        CsvParserOptions csvParserOptions = new CsvParserOptions(true, ',');
        var csvParser = new CsvParser<Automobile>(csvParserOptions, new CsvAutomobileMapping());

        var records = csvParser.ReadFromFile("import.txt", Encoding.UTF8);

        return records.Select(x => x.Result).ToList();
    }
}
A quick note on this: I tried to keep it fairly simple, but I also had to ensure that I used “ToList()” to make sure I was actually getting the complete list back, even though this adds quite a bit of overhead. Without it, I’d get back an IEnumerable that might never actually be enumerated at all.
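For completeness, the benchmarks are kicked off with the standard BenchmarkDotNet runner (in a Release build, otherwise the numbers are meaningless):
using BenchmarkDotNet.Running;

class Program
{
    static void Main(string[] args)
    {
        // Runs every method marked [Benchmark] on the class above
        var summary = BenchmarkRunner.Run<CsvBenchmarking>();
    }
}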
First I tried a 100,000 line CSV file that used our “complicated” format above (without line breaks in fields, however, as TinyCsvParser does not handle these).
        Method |       Mean | Scaled | Allocated |
-------------- |-----------:|-------:|----------:|
     CSVHelper | 1,404.5 ms |   1.00 | 244.39 MB |
 TinyCsvParser |   381.6 ms |   0.27 |  32.53 MB |
TinyCsvParser was quite a bit faster, almost 4x in fact. And when it came to memory usage, it was way down on CsvHelper.
Let’s see what happens when we up the size to a million rows:
        Method |     Mean | Scaled | Allocated |
-------------- |---------:|-------:|----------:|
     CSVHelper | 14.340 s |   1.00 | 2438.4 MB |
 TinyCsvParser |  4.142 s |   0.29 | 318.65 MB |
It seems pretty linear here in terms of both memory allocated and the time it takes to complete. What’s most interesting here is that the file itself is only 7.5MB at a million rows, and yet CsvHelper is allocating close to 2.5GB to handle it all, pretty crazy!
So the performance claims about TinyCsvParser definitely hold up, it’s much faster and uses far less memory, but to my mind CsvHelper is still the much more user friendly library if you are processing smaller files.
