Quick CSV streamer is a high performance CSV parsing library with Java 8 Stream API. The library operates in "zero-copy" mode and only parses what is required by the client. Amount of garbage produced is also optimized, reducing pressure on the garbage collector. Parallel, multi-core parsing is supported transparently via Java Stream API.
Compared to other open source Java CSV parsing libraries Quick CSV achieves speed ups at 2x - 10x range in sequential, single thread, mode. Naturally parallel mode improves performance further. See benchmarking results below for more details.
The library is limited to so called "line-optimal" charsets like UTF-8, US-ASCII, ISO-8859-1 and some others. Such line-optimal charsets have the property that line feed ('\n'), carriage return ('\r'), CSV separator are easily identifiable from other encoded characters.
Available from Maven Central:
<dependency>
<groupId>uk.elementarysoftware</groupId>
<artifactId>quick-csv-streamer</artifactId>
<version>0.2.4</version>
</dependency>
Suppose following CSV file needs to be parsed
Country,City,AccentCity,Region,Population,Latitude,Longitude
ad,andorra,Andorra,07,,42.5,1.5166667
gb,city of london,City of London,H9,,51.514125,-.093689
ua,kharkiv,Kharkiv,07,,49.980814,36.252718
First define Java class to represent the records as follows
public class City {
private final String city;
private final int population;
private final double latitude;
private final double longitude;
...
}
here we will be sourcing 4 fields from the source file, ignoring other 3.
Parsing the file is simple
import uk.elementarysoftware.quickcsv.api.*;
CSVParser<City> parser = CSVParserBuilder.aParser(City::new, City.CSVFields.class).forRfc4180().build();
the parser will be using CSV separators as per RFC 4180, default encoding and will be expecting header as first record in the source. Custom separators, quotes, encodings and header sources are supported.
Actual mapping is done in City
constructor
public class City {
public static enum CSVFields {
AccentCity,
Population,
Latitude,
Longitude
}
public City(CSVRecordWithHeader<CSVFields> r) {
this.city = r.getField(CSVFields.AccentCity).asString();
this.population = r.getField(CSVFields.Population).asInt();
this.latitude = r.getField(CSVFields.Latitude).asDouble();
this.longitude = r.getField(CSVFields.Longitude).asDouble();
}
first CSVFields
enum specifies which fields should be sourced and only these fields will be actually parsed. After that CSVRecordWithHeader
instance is used to populate City
instance fields, refering to CSV fields by enum values.
Of course mapping can also be done outside domain class constructor, just pass different Function<CSVRecordWithHeader, City>
to CSVParserBuilder
.
Resulting stream can be processed in parallel or sequentially with usual Java stream API. For example to parse sequentially on a single thread
Stream<City> stream = parser.parse(source).sequential();
stream.forEach(System.out::println);
By default parser will operate in parallel mode.
Please see sample project for full source code of the above example.
When header contains special characters the fields can not be simply encoded by enum literals. In such cases toString
should be overwritten, for example
enum Fields {
Latitude("City Latitude"),
Longitude("City Longitude"),
City("City name"),
Population("City Population");
private final String headerFieldName;
private Fields(String headerFieldName) {
this.headerFieldName = headerFieldName;
}
@Override public String toString() {
return headerFieldName;
}
}
If header is missing from the source it can be supplied during parser constuction
CSVParserBuilder
.aParser(City::new, City.CSVFields.class)
.usingExplicitHeader("Country", "City", "AccentCity", "Region", "Population", "Latitude", "Longitude")
.build();
About 10% performance improvement compared to normal usage can be achieved by referencing the fields by position instead of name. In this case parser construction is even simpler
CSVParser<City> parser = CSVParserBuilder.aParser(City::new).build();
as enumeration specifying field names is not needed. However now constructor will be using CSVRecord
interface
public City(CSVRecord r) {
r.skipFields(2);
this.city = r.getNextField().asString();
r.skipField();
this.population = r.getNextField().asInt();
this.latitude = r.getNextField().asDouble();
this.longitude = r.getNextField().asDouble();
}
effectively this encodes field order in the CSV source.
Best way to check performance of the library is to run benchmark on your target system with
gradle jmh
reports can be then found in build/reports/jmh.
It is very important to appreciate that performance might vary dramattically depending on the actual CSV content. As a very rough guideline see below sample output of "gradle jmh" on i7 2700k Ubuntu system, which uses cities.txt
similar to example above, expanded to have 3173800 rows and 157 MB in size:
Benchmark | Mode | Cnt | Score | Error | Units |
---|---|---|---|---|---|
OpenCSVParser | avgt | 5 | 2393.921 | ± 262.347 | ms/op |
Quick CSV Parallel with header | avgt | 5 | 205.013 | ± 1.739 | ms/op |
Quick CSV Parallel (advanced) | avgt | 5 | 177.262 | ± 1.739 | ms/op |
Quick CSV Sequential | avgt | 5 | 648.462 | ± 45.991 | ms/op |
Comparison is done with OpenCSV library v3.8, performance of other libraries can be extrapolated using chart from https://github.com/uniVocity/csv-parsers-comparison
Quick CSV Streamer library requires Java 8, it has no other dependencies.
Library is licensed under the terms of GPL v2.0 license. Please contact me if you wish to use this library under more commercially friendly license or want to extend it, for example to add async parsing or support different file formats.