You can read data by constructing an instance of the ParquetReader class or by using one of the helper methods. In the simplest case, to read a file located at c:\data\input.parquet,
you would write the following code:
using System.IO;
using Parquet;
using Parquet.Data;

using (Stream fs = File.OpenRead("c:\\data\\input.parquet"))
{
    using (var reader = new ParquetReader(fs))
    {
        DataSet ds = reader.Read();
    }
}
DataSet is a rich structure representing the data from the input file.
You can also do the same thing in one line with a helper method:
DataSet ds = ParquetReader.ReadFile("c:\\data\\input.parquet");
Another helper method reads from a stream:
using (Stream fs = File.OpenRead("c:\\data\\input.parquet"))
{
    DataSet ds = ParquetReader.Read(fs);
}
Parquet.Net supports reading portions of files using the Offset and Count properties. To do that, pass a ReaderOptions instance with the desired values. Every Read method accepts it as an optional parameter.
For example, to read five rows of input.parquet starting at offset 10, use the following code:
var options = new ReaderOptions { Offset = 10, Count = 5 };
DataSet ds = ParquetReader.ReadFile("c:\\data\\input.parquet", null, options);
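To make the windowing concrete, here is a minimal sketch that does not use Parquet.Net at all. It applies the same offset and count to an in-memory sequence with LINQ, assuming that Offset means "number of rows to skip, counted from zero" and Count means "maximum number of rows to return". That interpretation is an assumption for illustration; check it against the library before relying on exact row boundaries.

```csharp
using System;
using System.Linq;

class OffsetCountSketch
{
    static void Main()
    {
        // Stand-in for the rows of a file: row indexes 0..19.
        int[] rows = Enumerable.Range(0, 20).ToArray();

        int offset = 10; // rows to skip (assumed zero-based)
        int count = 5;   // maximum number of rows to return

        // Skip/Take selects the window [offset, offset + count).
        int[] window = rows.Skip(offset).Take(count).ToArray();

        Console.WriteLine(string.Join(", ", window)); // prints "10, 11, 12, 13, 14"
    }
}
```

Under this reading, a count of 5 yields five rows, so the last row returned is at index 14 rather than 15.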
Parquet.Net also allows you to read only a specific set of columns. This might come in handy when your dataset consists of a large number of columns, or just as a performance optimisation when you know beforehand which data you are interested in.
Reading a subset of columns makes the Parquet reader completely ignore data in other columns, which greatly improves data retrieval speed.
To set which columns to read, pass their names to ReaderOptions, for example:
var options = new ReaderOptions
{
    Columns = new[] { "n_name", "n_regionkey" }
};
DataSet ds = ParquetReader.ReadFile("path_to_file.parquet", null, options);
When reading, Parquet.Net uses the defaults specified in ParquetOptions.cs. You can override them by passing a ParquetOptions instance to a Read method.
For example, to force the reader to treat byte arrays as strings, use the following code:
var options = new ParquetOptions { TreatByteArrayAsString = true };
DataSet ds = ParquetReader.ReadFile("c:\\data\\input.parquet", options, null);
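Since ReadFile takes both a ParquetOptions and a ReaderOptions argument, the two kinds of settings can presumably be supplied in a single call. The sketch below only reuses the types, properties, and file path from the examples above. Setting Offset, Count, and Columns on the same ReaderOptions instance is an assumption, as are the class name and the variable names.

```csharp
using Parquet;
using Parquet.Data;

class CombinedOptionsSketch
{
    static void Main()
    {
        // Format-level override: treat byte arrays as strings.
        var formatOptions = new ParquetOptions { TreatByteArrayAsString = true };

        // Read-level settings: a row window plus a column subset.
        // Combining all three properties in one instance is assumed to work.
        var readerOptions = new ReaderOptions
        {
            Offset = 10,
            Count = 5,
            Columns = new[] { "n_name", "n_regionkey" }
        };

        DataSet ds = ParquetReader.ReadFile("c:\\data\\input.parquet", formatOptions, readerOptions);
    }
}
```

Keeping the two option objects in separate variables makes it easier to see which argument position each one occupies, since the earlier examples pass null for whichever one is not needed.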