
[WIP] Apache Parquet reader #85

Closed
wants to merge 91 commits into from

Conversation

aocsa
Contributor

@aocsa aocsa commented Aug 2, 2018

WORK COMPLETED:

  • Because the lower-level parquet-cpp APIs are private and a context object is used when reading a Parquet file, we had to write our own context object in order to plug in our custom classes. This context object contains the row group reader, which in turn contains a column reader, which in turn contains a page reader, which in turn can have decompressors and decoders. We therefore created our own row group reader, column reader, and page readers, so that we can build a context object with the same interface and pass it to the Parquet reader. Much of this custom reader code is based on parquet-cpp, and the decompression parts on arrow-cpp. This design makes it easy to move the data to the GPU, do the decompression there, and leave the decompressed data in gdf_column format.
  • In addition to the custom classes for reading a Parquet file, we also implemented: a) unit tests for the custom page reads from a Parquet file, and b) a basic benchmark for the custom page reads from a Parquet file.
  • Parquet stores a stream of bytes that is the RLE / bit-packed hybrid compressed data, with a known bit width (for the bit-packing) and a known data type. Each run starts with a VLQ-encoded value of a variable number of bytes, which tells you whether the following run is RLE or bit-packed and how many values it has (see the run-header sketch after this list). We put our main effort into decoding this encoding, because it is by far the most common one and one of the most computationally expensive I/O tasks. The decoding starts with a preliminary phase that parses the whole compressed data stream and collects all the information our custom decoders need; this preliminary phase is serial and better suited to the CPU. The three main phases that follow were done on the GPU.
  1. We collected the run lengths and values of all the RLE runs, and the run lengths of the bit-packed runs, into a vector of runs and a vector of values, using a default value (probably 0) for the bit-packed sections. We then decoded that with a GPU expand algorithm using Thrust (see the expand sketch after this list). At this point we have a full-sized output into which phases 2 and 3 write their results.
  2. We collected all the bit-packed runs that have more than 32 elements and broke them up into 32-element sections (the remainders are handled in phase 3). Because of the way Parquet packs the data, 32-element groups always end up bit-aligned, so we can process them easily and efficiently in 32-element chunks without worrying about boundary conditions. This was done with thrust::transform calls that receive a vector of pointers to the beginning of each 32-element section and a vector of pointers to the output locations (see the unpacking sketch after this list).
  3. We collected all the remainders of the bit-packed runs and processed each of them in a kernel that effectively extracts one element at a time from the remainder. As in phase 2, it takes vectors of pointers to the input and output locations as inputs.
  • In addition to the custom classes for decoding Parquet buffers, we also implemented: a) unit tests that use our custom decoders when reading a Parquet file, and b) a basic GPU vs. CPU benchmark for bit-packing, showing roughly a 25x improvement.
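
To make the run structure above concrete, here is a minimal sketch of the serial preliminary pass over the RLE / bit-packed hybrid stream. It only illustrates the format: each run starts with a VLQ (ULEB128) header whose low bit says whether the run is bit-packed; the remaining bits give the number of 8-value groups for a bit-packed run or the repetition count for an RLE run. The `RunInfo` struct and the `read_vlq` / `collect_runs` names are illustrative, not the PR's actual classes.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Descriptor for one run in the RLE / bit-packed hybrid stream
// (illustrative; not the PR's actual data structure).
struct RunInfo {
    bool     is_bitpacked;  // true: bit-packed run, false: RLE run
    uint32_t value_count;   // number of logical values in the run
    size_t   data_offset;   // byte offset of the run's payload in the stream
};

// Decode one unsigned VLQ / ULEB128 integer and advance the offset.
static uint32_t read_vlq(const uint8_t* data, size_t& offset) {
    uint32_t value = 0;
    int shift = 0;
    uint8_t byte;
    do {
        byte = data[offset++];
        value |= static_cast<uint32_t>(byte & 0x7F) << shift;
        shift += 7;
    } while (byte & 0x80);
    return value;
}

// Serial pass that records the kind, length, and payload offset of every run
// in a hybrid-encoded buffer of `num_values` values of width `bit_width`.
std::vector<RunInfo> collect_runs(const uint8_t* data, size_t size,
                                  int bit_width, size_t num_values) {
    std::vector<RunInfo> runs;
    size_t offset = 0;
    size_t seen = 0;
    while (offset < size && seen < num_values) {
        const uint32_t header = read_vlq(data, offset);
        RunInfo run;
        run.is_bitpacked = (header & 1) != 0;
        run.data_offset  = offset;
        if (run.is_bitpacked) {
            // header >> 1 is the number of 8-value groups; each group of
            // 8 values occupies exactly bit_width bytes.
            run.value_count = (header >> 1) * 8;
            offset += static_cast<size_t>(header >> 1) * bit_width;
        } else {
            // RLE run: header >> 1 repetitions of a single value stored in
            // ceil(bit_width / 8) bytes.
            run.value_count = header >> 1;
            offset += (bit_width + 7) / 8;
        }
        seen += run.value_count;
        runs.push_back(run);
    }
    return runs;
}
```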
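
Phase 1 is essentially a run-length expansion on the GPU. The sketch below shows the standard Thrust expand pattern (exclusive scan of the run lengths, scatter of run indices, max-scan, gather) that the description refers to; the `expand_runs` name and the `int` element type are assumptions for illustration, not the PR's implementation.

```cpp
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/gather.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/reduce.h>
#include <thrust/scan.h>
#include <thrust/scatter.h>

// Expand (run_length, value) pairs into a full-size output: value[i] is
// repeated run_lengths[i] times. RLE runs carry their repeated value and
// bit-packed runs carry a placeholder that phases 2 and 3 overwrite later.
thrust::device_vector<int> expand_runs(const thrust::device_vector<int>& run_lengths,
                                       const thrust::device_vector<int>& run_values) {
    const int n     = static_cast<int>(run_lengths.size());
    const int total = thrust::reduce(run_lengths.begin(), run_lengths.end());

    // Starting output position of each run.
    thrust::device_vector<int> output_offsets(n);
    thrust::exclusive_scan(run_lengths.begin(), run_lengths.end(), output_offsets.begin());

    // Scatter each run's index to the start of its output span
    // (skipping zero-length runs)...
    thrust::device_vector<int> run_index(total, 0);
    thrust::scatter_if(thrust::counting_iterator<int>(0),
                       thrust::counting_iterator<int>(n),
                       output_offsets.begin(),
                       run_lengths.begin(),
                       run_index.begin());

    // ...then propagate it across the span with a max-scan.
    thrust::inclusive_scan(run_index.begin(), run_index.end(),
                           run_index.begin(), thrust::maximum<int>());

    // Finally gather the run's value for every output element.
    thrust::device_vector<int> output(total);
    thrust::gather(run_index.begin(), run_index.end(),
                   run_values.begin(), output.begin());
    return output;
}
```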
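
Phase 2 can be pictured as one GPU work item per bit-aligned 32-value section, each reading from a pointer to the section's first byte and writing to a pointer to its 32-element output slot. The sketch below expresses that with thrust::for_each over a zip of the two pointer vectors (the PR describes it with thrust::transform, but the idea is the same); the `unpack32` functor and its naive bit-at-a-time loop are purely illustrative.

```cpp
#include <cstdint>
#include <thrust/device_vector.h>
#include <thrust/for_each.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/tuple.h>

// Unpack one bit-aligned 32-value section of a bit-packed run directly into
// its final output location (illustrative, not the PR's tuned kernel).
struct unpack32 {
    int bit_width;

    __device__ void operator()(thrust::tuple<const uint8_t*, int*> t) const {
        const uint8_t* in  = thrust::get<0>(t);
        int*           out = thrust::get<1>(t);
        for (int i = 0; i < 32; ++i) {
            const int bit_offset = i * bit_width;
            uint32_t value = 0;
            // Read the bit_width bits of value i (LSB-first packing).
            for (int b = 0; b < bit_width; ++b) {
                const int bit = bit_offset + b;
                value |= static_cast<uint32_t>((in[bit >> 3] >> (bit & 7)) & 1u) << b;
            }
            out[i] = static_cast<int>(value);
        }
    }
};

// Launch one work item per 32-element section, given vectors of input and
// output pointers collected in the preliminary phase.
void unpack_sections(const thrust::device_vector<const uint8_t*>& section_inputs,
                     const thrust::device_vector<int*>& section_outputs,
                     int bit_width) {
    auto begin = thrust::make_zip_iterator(
        thrust::make_tuple(section_inputs.begin(), section_outputs.begin()));
    auto end = thrust::make_zip_iterator(
        thrust::make_tuple(section_inputs.end(), section_outputs.end()));
    thrust::for_each(begin, end, unpack32{bit_width});
}
```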

STILL NEED TO COMPLETE:

  • Improve the current decoding algorithm and handle memory alignment efficiently in CUDA. We are going to change our approach: instead of the three-phase approach described above, we will use a two-stage approach (stage 1 stays the same) where, in the second stage, we copy all the bit-packed data into a contiguous buffer, decompress it with nvcomp (or something else), and then scatter the results into their final locations.
  • Improve Parquet API to optimize some processes and make testing easier.
  • Extensive benchmarking of the CPU and GPU versions of Parquet file reading.

PUBLIC APIs:

libgdf/include/gdf/parquet/api.h

source code:

libgdf/src/parquet/

libgdf/src/arrow/

unit tests: 

libgdf/src/tests/parquet/decoding/decoding-tests.cpp

libgdf/src/tests/parquet/file_reader/file_reader-test.cpp

libgdf/src/tests/parquet/decoding/file_reader-benchmark.cpp

For more details about the Apache Parquet data structure, compressions and encodings, see here: 

https://docs.google.com/document/d/e/2PACX-1vQArNTYCnn1-Ca1nX72nVOtOk8vn-ARqPujeQpQn5McyS0VcREFsVFA1ExapHucThIGSvT5gxLbSDKl/pub


@GPUtester
Collaborator

Can one of the admins verify this patch?

@aocsa changed the title from Parquet reader to (WIP) Apache Parquet reader into libgdf on Aug 2, 2018
@kkraus14
Contributor

kkraus14 commented Aug 2, 2018

add to whitelist

@aocsa changed the title from (WIP) Apache Parquet reader into libgdf to [WIP] Apache Parquet reader on Aug 2, 2018
@nsakharnykh
Contributor

@williamBlazing can we close this one, since we're working on the multi-threaded PR now?

@nsakharnykh
Contributor

Closing since this is superseded by #146
