This repository has been archived by the owner on Jul 22, 2024. It is now read-only.

Reading Data

You can read data by constructing an instance of the ParquetReader class or by using one of the helper methods. In the simplest case, to read a file located at c:\data\input.parquet you would write the following code:

using System.IO;
using Parquet;
using Parquet.Data;

using(Stream fs = File.OpenRead("c:\\data\\input.parquet"))
{
	using(var reader = new ParquetReader(fs))
	{
		DataSet ds = reader.Read();
	}
}

DataSet is a rich structure representing the data from the input file.

You can also do the same thing in one line with a helper method:

DataSet ds = ParquetReader.ReadFile("c:\\data\\input.parquet");

Another helper method reads from a stream:

using(Stream fs = File.OpenRead("c:\\data\\input.parquet"))
{
	DataSet ds = ParquetReader.Read(fs);
}

Reading parts of a file

Offsets and Counts

Parquet.Net supports reading portions of files using the Offset and Count properties. To do so, pass a ReaderOptions instance with the desired values. Every Read method accepts it as an optional parameter.

For example, to read five rows of input.parquet starting at offset 10, use the following code:

var options = new ReaderOptions { Offset = 10, Count = 5 };
DataSet ds = ParquetReader.ReadFile("c:\\data\\input.parquet", null, options);

Limiting by columns

Parquet.Net also allows you to read only a specific set of columns. This comes in handy when your dataset has a large number of columns, or simply as a performance optimisation when you know beforehand which data you are interested in.

Reading a subset of columns makes the Parquet reader completely ignore data in the other columns, which greatly improves data retrieval speed.

To choose which columns to read, pass their names in ReaderOptions, for example:

var options = new ReaderOptions
{
   Columns = new[] { "n_name", "n_regionkey" }
};

DataSet ds = ParquetReader.ReadFile("path_to_file.parquet", null, options);

Using format options

When reading, Parquet.Net uses the defaults specified in ParquetOptions.cs; you can override them by passing a ParquetOptions instance to a Read method.

For example, to force the reader to treat byte arrays as strings, use the following code:

var options = new ParquetOptions { TreatByteArrayAsString = true };
DataSet ds = ParquetReader.ReadFile("c:\\data\\input.parquet", options, null);
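The calls above pass format options as the second argument of ReadFile and reader options as the third, so both can presumably be supplied in one call. The following is a minimal sketch of that combination, not a verified recipe: it assumes that Offset, Count and Columns can be set together on one ReaderOptions instance, and that c:\data\input.parquet contains the n_name and n_regionkey columns used in the earlier example.

```csharp
using Parquet;
using Parquet.Data;

class ReadSlice
{
    static void Main()
    {
        // Format options: decode byte arrays as strings.
        var formatOptions = new ParquetOptions { TreatByteArrayAsString = true };

        // Reader options: five rows starting at offset 10, two columns only.
        // Assumption: row limits and column selection may be combined.
        var readerOptions = new ReaderOptions
        {
            Offset = 10,
            Count = 5,
            Columns = new[] { "n_name", "n_regionkey" }
        };

        // Argument order follows the calls shown earlier:
        // path, then format options, then reader options.
        DataSet ds = ParquetReader.ReadFile(
            "c:\\data\\input.parquet", formatOptions, readerOptions);
    }
}
```

If only one kind of option is needed, pass null for the other, as in the earlier examples.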