
[WIP] Apache Parquet reader #85

Closed
wants to merge 91 commits into from

Conversation

aocsa
Contributor

@aocsa aocsa commented Aug 2, 2018

WORK COMPLETED:

  • Because the lower-level parquet-cpp APIs are private and a context object is used when reading a Parquet file, we had to write our own context object in order to plug in our custom classes. This context object contains the row group reader, which in turn contains a column reader, which in turn contains a page reader, which in turn can have decompressors and decoders. We therefore created our own row group reader, column reader, and page readers, so that we can build a context object with the same interface and pass it to the Parquet reader. Much of this custom reader code is based on parquet-cpp, and the decompression parts on arrow-cpp. This design makes it easy to move the data to the GPU, do the decompression there, and leave the decompressed data in gdf_column format.
  • In addition to the custom classes for reading a Parquet file, we also implemented: a) unit tests for the custom page reads from a Parquet file, and b) a basic benchmark for the custom page reads from a Parquet file.
  • Parquet stores a stream of bytes that is the RLE / bit-packed hybrid compressed data, with a known bit width (for the bit-packing) and a known data type. Each run starts with a VLQ-encoded value of a variable number of bytes, which tells you whether the following run is RLE or bit-packed and how many values it has (see the run-header sketch after this list). We put our main effort into decoding this encoding, because it is by far the most common one and one of the most computationally expensive I/O tasks. The decoding starts with a preliminary phase that parses the whole compressed data stream and collects all the information our custom decoders need; this preliminary phase is serial and better suited to the CPU. The three main phases that follow were done on the GPU.
  1. We collected the run lengths and values of all the RLE runs, and the run lengths of the bit-packed runs, into a vector of runs and a vector of values, using a default value (probably 0) for the bit-packed sections. We then decoded that with a GPU expand algorithm using Thrust (see the expand sketch after this list). At this point we have a full-sized output into which phases 2 and 3 write their results.
  2. We collected all the bit-packed runs that have more than 32 elements and broke them up into 32-element sections (the remainders are handled in phase 3). Because of the way Parquet packs the data, 32-element groups always end up bit-aligned, so we can process them easily and efficiently in 32-element chunks without worrying about boundary conditions. This was done with thrust::transform calls that receive a vector of pointers to the beginning of each 32-element section and a vector of pointers to the output locations (see the unpacking sketch after this list).
  3. We collected all the remainders of the bit-packed runs and processed each of them in a kernel that effectively extracts one element at a time from the remainder. As in phase 2, it takes vectors of pointers to the input and output locations as inputs.
  • In addition to the custom classes for decoding Parquet buffers, we also implemented: a) unit tests that use our custom decoders when reading a Parquet file, and b) a basic GPU vs. CPU benchmark for bit-packing, showing roughly a 25x improvement.
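
To make the run structure above concrete, here is a minimal sketch of the serial preliminary pass over the RLE / bit-packed hybrid stream. It only illustrates the format: each run starts with a VLQ (ULEB128) header whose low bit says whether the run is bit-packed; the remaining bits give the number of 8-value groups for a bit-packed run or the repetition count for an RLE run. The `RunInfo` struct and the `read_vlq` / `collect_runs` names are illustrative, not the PR's actual classes.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Descriptor for one run in the RLE / bit-packed hybrid stream
// (illustrative; not the PR's actual data structure).
struct RunInfo {
    bool     is_bitpacked;  // true: bit-packed run, false: RLE run
    uint32_t value_count;   // number of logical values in the run
    size_t   data_offset;   // byte offset of the run's payload in the stream
};

// Decode one unsigned VLQ / ULEB128 integer and advance the offset.
static uint32_t read_vlq(const uint8_t* data, size_t& offset) {
    uint32_t value = 0;
    int shift = 0;
    uint8_t byte;
    do {
        byte = data[offset++];
        value |= static_cast<uint32_t>(byte & 0x7F) << shift;
        shift += 7;
    } while (byte & 0x80);
    return value;
}

// Serial pass that records the kind, length, and payload offset of every run
// in a hybrid-encoded buffer of `num_values` values of width `bit_width`.
std::vector<RunInfo> collect_runs(const uint8_t* data, size_t size,
                                  int bit_width, size_t num_values) {
    std::vector<RunInfo> runs;
    size_t offset = 0;
    size_t seen = 0;
    while (offset < size && seen < num_values) {
        const uint32_t header = read_vlq(data, offset);
        RunInfo run;
        run.is_bitpacked = (header & 1) != 0;
        run.data_offset  = offset;
        if (run.is_bitpacked) {
            // header >> 1 is the number of 8-value groups; each group of
            // 8 values occupies exactly bit_width bytes.
            run.value_count = (header >> 1) * 8;
            offset += static_cast<size_t>(header >> 1) * bit_width;
        } else {
            // RLE run: header >> 1 repetitions of a single value stored in
            // ceil(bit_width / 8) bytes.
            run.value_count = header >> 1;
            offset += (bit_width + 7) / 8;
        }
        seen += run.value_count;
        runs.push_back(run);
    }
    return runs;
}
```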
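
Phase 1 is essentially a run-length expansion on the GPU. The sketch below shows the standard Thrust expand pattern (exclusive scan of the run lengths, scatter of run indices, max-scan, gather) that the description refers to; the `expand_runs` name and the `int` element type are assumptions for illustration, not the PR's implementation.

```cpp
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/gather.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/reduce.h>
#include <thrust/scan.h>
#include <thrust/scatter.h>

// Expand (run_length, value) pairs into a full-size output: value[i] is
// repeated run_lengths[i] times. RLE runs carry their repeated value and
// bit-packed runs carry a placeholder that phases 2 and 3 overwrite later.
thrust::device_vector<int> expand_runs(const thrust::device_vector<int>& run_lengths,
                                       const thrust::device_vector<int>& run_values) {
    const int n     = static_cast<int>(run_lengths.size());
    const int total = thrust::reduce(run_lengths.begin(), run_lengths.end());

    // Starting output position of each run.
    thrust::device_vector<int> output_offsets(n);
    thrust::exclusive_scan(run_lengths.begin(), run_lengths.end(), output_offsets.begin());

    // Scatter each run's index to the start of its output span
    // (skipping zero-length runs)...
    thrust::device_vector<int> run_index(total, 0);
    thrust::scatter_if(thrust::counting_iterator<int>(0),
                       thrust::counting_iterator<int>(n),
                       output_offsets.begin(),
                       run_lengths.begin(),
                       run_index.begin());

    // ...then propagate it across the span with a max-scan.
    thrust::inclusive_scan(run_index.begin(), run_index.end(),
                           run_index.begin(), thrust::maximum<int>());

    // Finally gather the run's value for every output element.
    thrust::device_vector<int> output(total);
    thrust::gather(run_index.begin(), run_index.end(),
                   run_values.begin(), output.begin());
    return output;
}
```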
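
Phase 2 can be pictured as one GPU work item per bit-aligned 32-value section, each reading from a pointer to the section's first byte and writing to a pointer to its 32-element output slot. The sketch below expresses that with thrust::for_each over a zip of the two pointer vectors (the PR describes it with thrust::transform, but the idea is the same); the `unpack32` functor and its naive bit-at-a-time loop are purely illustrative.

```cpp
#include <cstdint>
#include <thrust/device_vector.h>
#include <thrust/for_each.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/tuple.h>

// Unpack one bit-aligned 32-value section of a bit-packed run directly into
// its final output location (illustrative, not the PR's tuned kernel).
struct unpack32 {
    int bit_width;

    __device__ void operator()(thrust::tuple<const uint8_t*, int*> t) const {
        const uint8_t* in  = thrust::get<0>(t);
        int*           out = thrust::get<1>(t);
        for (int i = 0; i < 32; ++i) {
            const int bit_offset = i * bit_width;
            uint32_t value = 0;
            // Read the bit_width bits of value i (LSB-first packing).
            for (int b = 0; b < bit_width; ++b) {
                const int bit = bit_offset + b;
                value |= static_cast<uint32_t>((in[bit >> 3] >> (bit & 7)) & 1u) << b;
            }
            out[i] = static_cast<int>(value);
        }
    }
};

// Launch one work item per 32-element section, given vectors of input and
// output pointers collected in the preliminary phase.
void unpack_sections(const thrust::device_vector<const uint8_t*>& section_inputs,
                     const thrust::device_vector<int*>& section_outputs,
                     int bit_width) {
    auto begin = thrust::make_zip_iterator(
        thrust::make_tuple(section_inputs.begin(), section_outputs.begin()));
    auto end = thrust::make_zip_iterator(
        thrust::make_tuple(section_inputs.end(), section_outputs.end()));
    thrust::for_each(begin, end, unpack32{bit_width});
}
```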

STILL NEED TO COMPLETE:

  • Improve the current decoding algorithm and handle memory alignment efficiently in CUDA. We are going to change our approach: instead of the three-phase approach described above, we will use a two-stage approach (stage 1 stays the same) where, in the second stage, we copy all the bit-packed data into a contiguous buffer, decompress it with nvcomp (or something else), and then scatter the results into their final locations.
  • Improve Parquet API to optimize some processes and make testing easier.
  • Extensive benchmarking of the CPU and GPU versions of Parquet file reading.

PUBLIC APIs:

libgdf/include/gdf/parquet/api.h

source code:

libgdf/src/parquet/

libgdf/src/arrow/

unit tests: 

libgdf/src/tests/parquet/decoding/decoding-tests.cpp

libgdf/src/tests/parquet/file_reader/file_reader-test.cpp

libgdf/src/tests/parquet/decoding/file_reader-benchmark.cpp

For more details about the Apache Parquet data structure, compressions and encodings, see here: 

https://docs.google.com/document/d/e/2PACX-1vQArNTYCnn1-Ca1nX72nVOtOk8vn-ARqPujeQpQn5McyS0VcREFsVFA1ExapHucThIGSvT5gxLbSDKl/pub


@GPUtester
Collaborator

Can one of the admins verify this patch?

@aocsa changed the title from Parquet reader to (WIP) Apache Parquet reader into libgdf on Aug 2, 2018
@kkraus14
Contributor

kkraus14 commented Aug 2, 2018

add to whitelist

@aocsa changed the title from (WIP) Apache Parquet reader into libgdf to [WIP] Apache Parquet reader on Aug 2, 2018
@nsakharnykh
Contributor

@williamBlazing can we close this one, since we're working on the multi-threaded PR now?

@nsakharnykh
Contributor

Closing since this is superseded by #146
