This repository has been archived by the owner on Dec 21, 2018. It is now read-only.

[Review] Parquet reader multithread #146

Open — wants to merge 109 commits into base: master

Commits (109)
6cb51df
[parquet-reader] Add parquet reader wrapper
gcca Jul 17, 2018
bbe9467
[parquet-reader] Add column reader
gcca Jul 18, 2018
6ced85b
[parquet-reader] Enable read new page call
gcca Jul 20, 2018
16b40cb
WIP: add custom decoder
aocsa Jul 20, 2018
fc57ccb
[parquet-reader] Update parquet API to v1.3.1
gcca Jul 23, 2018
3000f89
[parquet-reader] Read batch as gdf column
gcca Jul 25, 2018
a6e7d0e
arrow decoder
aocsa Jul 26, 2018
7c24364
merge with parquet-reader
aocsa Jul 26, 2018
3b9af0e
Merge branch 'parquet-reader' into parquet-decoder
aocsa Jul 26, 2018
4593968
[parquet-reader] Add gdf column read test
gcca Jul 26, 2018
abe73d3
[parquet-reader] Add file reader by columns benchmark
gcca Jul 27, 2018
a384b15
decoder using host
aocsa Jul 27, 2018
79470ea
decoder using gpu
aocsa Jul 27, 2018
3ef6ecd
[parquet-reader] Read spaced batches to gdf column
gcca Jul 30, 2018
4282650
Merge branch 'parquet-reader' into parquet-decoder
aocsa Aug 1, 2018
819af4e
use specific gpu-decoder for int32
aocsa Aug 1, 2018
5713017
[parquet-reader] Add API to read a parquet file
gcca Aug 2, 2018
7ad9972
[parquet-reader] Merge from parquet-decoder
gcca Aug 2, 2018
882a296
[parquet-reader] Fix template definitions for readers
gcca Aug 2, 2018
e8068eb
[parquet-reader] Merger from LibGDF/master
gcca Aug 2, 2018
e407912
[parquet-reader] Fix testing files
gcca Aug 2, 2018
9ba5d7e
[parquet-reader] Move tests to src
gcca Aug 2, 2018
6aaaa51
[parquet-reader] Fix access to parquetcpp repository
gcca Aug 2, 2018
13e27c7
[parquet-reader] Fix benchmark test building
gcca Aug 2, 2018
15ff796
[parquet-reader] Fix build moving tests into src
gcca Aug 2, 2018
d7bed6a
[parquet-reader] Update tests building process
gcca Aug 2, 2018
92d89e9
[parquet-reader] Add conda dependencies for Thrift
gcca Aug 3, 2018
f56a978
[parquet-reader] Check gdf dtype from parquet type
gcca Aug 6, 2018
9043c7a
[parquet-reader] Apply batch spaced reading on tests
gcca Aug 6, 2018
9d2275e
[parquet-reader] Add column filter from file
gcca Aug 7, 2018
d0b265c
[parquet-reader] Add read to gdf column method
gcca Aug 7, 2018
3b464bd
[parquet-reader] Remove ReadGdfColumn method
gcca Aug 7, 2018
f92a931
decode bitpacking data using pinned memory
aocsa Aug 7, 2018
d25db66
Merge branch 'parquet-reader' of https://github.com/BlazingDB/libgdf …
aocsa Aug 7, 2018
1716e81
[parquet-reader] Add parquet target for linking
gcca Aug 8, 2018
9e39227
decode bitpacking data using pinned memory: merge
aocsa Aug 8, 2018
ab07b56
bitpacking decoding for all types
aocsa Aug 9, 2018
5ebc08c
start gpu benchmark for parquet reader
aocsa Aug 13, 2018
54a63a1
improve copy scheme from pinned memory to device memory
aocsa Aug 15, 2018
7ee8760
init benchmark for parquet reader
aocsa Aug 16, 2018
2ad9c25
wip: decode using only gpu
aocsa Aug 21, 2018
02c1132
gdf_column in device and benchmark for parquet reader
aocsa Aug 21, 2018
8be8e9e
implemented new expand function. Commented out problematic tests. sta…
Aug 21, 2018
273e17d
benckmark with huge parquet file
aocsa Aug 22, 2018
30c581a
added compact_to_sparse_for_nulls
Aug 23, 2018
c129c94
starting with kernel
Aug 23, 2018
298dc3d
starting with kernel
Aug 23, 2018
7f0f570
[parquet-reader]: ToGdfColumn using gpu using ReadBatch
aocsa Aug 23, 2018
7da1549
reimplemented compact_to_sparse_for_nulls
Aug 23, 2018
6979c33
added includes
Aug 23, 2018
fbae2c8
Merge branch 'willParquetExp' into willParquetKernelExp
Aug 24, 2018
bceb98b
fixed build errors but commented out usage of compact_to_sparse_for_n…
Aug 24, 2018
26a5ce5
Merge branch 'willParquetExp' into willParquetKernelExp
Aug 24, 2018
869d9eb
[parquet-reader] toGdfColumn valid support and expand using ReadBatch
aocsa Aug 24, 2018
55c53ae
kernel compiles
Aug 24, 2018
3c97bb2
improved kernel call
Aug 24, 2018
8f06c8f
improved kernel call
Aug 24, 2018
12f6404
[parquet-reader]: custom gpu kernel for definition levels to valid_bits
aocsa Aug 24, 2018
149f8d3
[parquet-reader] Add test for valid and nulls
gcca Aug 25, 2018
93a0235
[parquet-reader] Merged from branch
gcca Aug 25, 2018
d4f0be9
[parquet-reader] Test nulls with two row groups
gcca Aug 25, 2018
616b303
[parquet-reader] Update conversion to gdf column
gcca Aug 27, 2018
ce430a4
Merge branch 'parquet-reader' into willParquetKernelExp
Aug 27, 2018
67068eb
changed unpack_using_gpu to use new kernel. Changed metadata gatherin…
Aug 27, 2018
98940b8
[parquet-reader]: ReadBatchSpace support on gpu
aocsa Aug 27, 2018
f639c2b
[parquet-reader] Remove unexistent directory
gcca Aug 27, 2018
51f7479
[parquet-reader] check unit test and benchmark
aocsa Aug 28, 2018
4f88e80
changed bitpack remainders implementation
Aug 28, 2018
9f6adb7
[parquet-reader] Read filtering by row_groups and columns indices
gcca Aug 28, 2018
19628d5
Merge branch 'parquet-reader' of github.com:BlazingDB/libgdf into par…
gcca Aug 28, 2018
42bf16d
[parquet-reader] Merged from master
gcca Aug 29, 2018
e6810b5
[parquet-reader] Update to work with arrow 0.9
gcca Aug 29, 2018
81d8cb9
merged in bitpacking kernels
Aug 31, 2018
dbcf578
[parquet-reader] Fix broken ByIdsInOrder unit test
gcca Aug 31, 2018
6d2e4b3
[parquet-reader] update benchmark
aocsa Aug 31, 2018
6646f09
Merge branch 'parquet-reader' of https://github.com/BlazingDB/libgdf …
aocsa Aug 31, 2018
94ea6a4
[parquet-reader] Add read column method
gcca Aug 31, 2018
2950374
fixed an issue with parquet-benchmark test
Sep 5, 2018
fc0a72e
[parquet-reader]: fix parquet reader (tested with mortgage data)
aocsa Sep 7, 2018
73703b0
implemented solution, need to change it to read valids separatelly an…
Sep 7, 2018
a905116
wip
Sep 7, 2018
d7740ca
Merge branch 'parquet-reader' into parquet-reader-multithread
Sep 7, 2018
74f741a
created seams for bitmaks, need to apply them back into device valid
Sep 7, 2018
fc85c2e
[parquet-reader] fix parquet benchmark
aocsa Sep 11, 2018
b6784de
[parquet-reader] rebase and fix types conversion
aocsa Sep 18, 2018
4eae308
modified unit test. Troubleshooting bugs
Sep 18, 2018
0f9cbf6
created single threaded version for debugging
Sep 18, 2018
849c866
Merge branch 'parquet-reader' into parquet-reader-multithread
Sep 18, 2018
e3d270e
fixed build errors, and issues with tests. Still getting errors with …
Sep 18, 2018
ea06079
[parquet-reader]: fix warnings
aocsa Sep 18, 2018
31326fa
[parquet-reader] Downgrade bison and flex
gcca Sep 18, 2018
55ab718
[parquet-reader] Add global ParquetCpp include directories
gcca Sep 18, 2018
c3f2552
[parquet-reader] Fix compiling warnings
gcca Sep 18, 2018
07e6e85
fixed bug in guard in bitpacking kernel
Sep 19, 2018
dc76e3d
[parquet-reader] fix bitpacking decoder and transform_valid
aocsa Sep 19, 2018
8bf8311
[parquet-reader]: merge with last fixes
aocsa Sep 19, 2018
951cbf9
[parquet-reader]: fix warnings
aocsa Sep 19, 2018
ab57c53
cleaned up code. Using _ReadFileMultiThread where it needs to. All te…
Sep 19, 2018
5002683
made small change to unit test and found more issues
Sep 19, 2018
a7ce67a
fixed bug in allocator function
Sep 19, 2018
9cd6e16
[parquet-reader-multithread] fix warnings
aocsa Sep 20, 2018
52b03f7
[parquet-reader-multithread] remove dead code and add comments
aocsa Sep 21, 2018
efcffd4
added new parquet-multithread-benchmark test. Fixed parquet-reader ap…
Sep 24, 2018
dd9a65f
fixed benchmark unit test
Sep 25, 2018
b342fe4
moved parquet benchmarks to bench folder
Sep 26, 2018
95b16e3
Merge branch 'master' into parquet-reader-multithread
Sep 27, 2018
d1e8ff7
added a new public API which takes in an file reading interface. Adde…
Oct 1, 2018
ec54c9a
Merge branch 'master' into parquet-reader-multithread
Oct 2, 2018
b7c2686
fixed interface implementation to be RandomAccessFile which is an int…
Oct 11, 2018
2 changes: 2 additions & 0 deletions .gitignore
@@ -18,3 +18,5 @@ python/libgdf_cffi/libgdf_cffi.py

## eclipse
.project

build2/
Review comment (Member):
Is "build2" a common directory we need to gitignore? Perhaps this file was committed accidentally?

25 changes: 24 additions & 1 deletion CMakeLists.txt
@@ -1,6 +1,7 @@
#=============================================================================
# Copyright 2018 BlazingDB, Inc.
# Copyright 2018 Percy Camilo Triveño Aucahuasi <[email protected]>
# Copyright 2018 Cristhian Alberto Gonzales Castillo <[email protected]>
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -25,7 +26,7 @@

PROJECT(libgdf)

cmake_minimum_required(VERSION 2.8) # not sure about version required
cmake_minimum_required(VERSION 3.3) # not sure about version required
Review comment (Member):
libgdf's CMakeLists.txt requires CMake version 3.11... Maybe match that and remove the unsure comment.


set(CMAKE_CXX_STANDARD 11)
message(STATUS "Using C++ standard: c++${CMAKE_CXX_STANDARD}")
@@ -46,6 +47,7 @@ include(CTest)
# Include custom modules (see cmake directory)
include(ConfigureGoogleTest)
include(ConfigureArrow)
include(ConfigureParquetCpp)

find_package(CUDA)
set_package_properties(
@@ -83,12 +85,15 @@ else()
message(FATAL_ERROR "Apache Arrow not found, please check your settings.")
endif()

get_property(PARQUETCPP_INCLUDE_DIRS TARGET Apache::ParquetCpp PROPERTY INTERFACE_INCLUDE_DIRECTORIES)

include_directories(
"${CMAKE_CURRENT_SOURCE_DIR}/include"
"${CMAKE_CURRENT_SOURCE_DIR}/thirdparty/cub"
"${CMAKE_CURRENT_SOURCE_DIR}/thirdparty/moderngpu/src"
"${CUDA_INCLUDE_DIRS}"
"${ARROW_INCLUDEDIR}"
"${PARQUETCPP_INCLUDE_DIRS}"
)

IF(CUDA_VERSION_MAJOR GREATER 7)
@@ -119,6 +124,19 @@ if(HT_LEGACY_ALLOCATOR)
set(CUDA_NVCC_FLAGS ${CUDA_NVCC_FLAGS};-DHT_LEGACY_ALLOCATOR)
endif()

cuda_add_library(gdf-parquet
src/parquet/api.cpp
src/parquet/column_reader.cu
src/parquet/file_reader.cpp
src/parquet/file_reader_contents.cpp
src/parquet/page_reader.cpp
src/parquet/row_group_reader_contents.cpp
src/parquet/decoder/cu_level_decoder.cu
src/arrow/cu_decoder.cu
src/arrow/util/pinned_allocator.cu
)

target_link_libraries(gdf-parquet Apache::ParquetCpp)

cuda_add_library(gdf SHARED
src/binaryops.cu
@@ -198,5 +216,10 @@ if(GTEST_FOUND)
else()
message(AUTHOR_WARNING "Google C++ Testing Framework (Google Test) not found: automated tests are disabled.")
endif()

if(GDF_BENCHMARK)
add_subdirectory(${CMAKE_CURRENT_SOURCE_DIR}/src/bench)
endif()

# Print the project summary
feature_summary(WHAT ALL INCLUDE_QUIET_PACKAGES FATAL_ON_MISSING_REQUIRED_PACKAGES)
3 changes: 2 additions & 1 deletion cmake/Modules/ConfigureArrow.cmake
@@ -1,6 +1,7 @@
#=============================================================================
# Copyright 2018 BlazingDB, Inc.
# Copyright 2018 Percy Camilo Triveño Aucahuasi <[email protected]>
# Copyright 2018 Cristhian Alberto Gonzales Castillo <[email protected]>
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -15,7 +16,7 @@
# limitations under the License.
#=============================================================================

set(ARROW_DOWNLOAD_BINARY_DIR ${CMAKE_BINARY_DIR}${CMAKE_FILES_DIRECTORY}/thirdparty/arrow-download/)
set(ARROW_DOWNLOAD_BINARY_DIR ${CMAKE_BINARY_DIR}${CMAKE_FILES_DIRECTORY}/thirdparty/arrow-download)

# Download and unpack arrow at configure time
configure_file(${CMAKE_SOURCE_DIR}/cmake/Templates/Arrow.CMakeLists.txt.cmake ${ARROW_DOWNLOAD_BINARY_DIR}/CMakeLists.txt COPYONLY)
89 changes: 89 additions & 0 deletions cmake/Modules/ConfigureParquetCpp.cmake
@@ -0,0 +1,89 @@
#=============================================================================
# Copyright 2018 BlazingDB, Inc.
# Copyright 2018 Cristhian Alberto Gonzales Castillo <[email protected]>
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#=============================================================================

# Download and unpack ParquetCpp at configure time
configure_file(${CMAKE_SOURCE_DIR}/cmake/Templates/ParquetCpp.CMakeLists.txt.cmake ${CMAKE_BINARY_DIR}${CMAKE_FILES_DIRECTORY}/thirdparty/parquetcpp-download/CMakeLists.txt)

execute_process(
COMMAND ${CMAKE_COMMAND} -G "${CMAKE_GENERATOR}" .
RESULT_VARIABLE result
WORKING_DIRECTORY ${CMAKE_BINARY_DIR}${CMAKE_FILES_DIRECTORY}/thirdparty/parquetcpp-download/
)

if(result)
message(FATAL_ERROR "CMake step for ParquetCpp failed: ${result}")
endif()

# Transitive dependencies
set(ARROW_TRANSITIVE_DEPENDENCIES_PREFIX ${ARROW_DOWNLOAD_BINARY_DIR}/arrow-prefix/src/arrow-build)
set(BROTLI_TRANSITIVE_DEPENDENCY_PREFIX ${ARROW_TRANSITIVE_DEPENDENCIES_PREFIX}/brotli_ep/src/brotli_ep-install/lib/x86_64-linux-gnu)
set(BROTLI_STATIC_LIB_ENC ${BROTLI_TRANSITIVE_DEPENDENCY_PREFIX}/libbrotlienc.a)
set(BROTLI_STATIC_LIB_DEC ${BROTLI_TRANSITIVE_DEPENDENCY_PREFIX}/libbrotlidec.a)
set(BROTLI_STATIC_LIB_COMMON ${BROTLI_TRANSITIVE_DEPENDENCY_PREFIX}/libbrotlicommon.a)
set(SNAPPY_STATIC_LIB ${ARROW_TRANSITIVE_DEPENDENCIES_PREFIX}/snappy_ep/src/snappy_ep-install/lib/libsnappy.a)
set(ZLIB_STATIC_LIB ${ARROW_TRANSITIVE_DEPENDENCIES_PREFIX}/zlib_ep/src/zlib_ep-install/lib/libz.a)
set(LZ4_STATIC_LIB ${ARROW_TRANSITIVE_DEPENDENCIES_PREFIX}/lz4_ep-prefix/src/lz4_ep/lib/liblz4.a)
set(ZSTD_STATIC_LIB ${ARROW_TRANSITIVE_DEPENDENCIES_PREFIX}/zstd_ep-prefix/src/zstd_ep/lib/libzstd.a)
set(ARROW_HOME ${ARROW_ROOT})

set(ENV{BROTLI_STATIC_LIB_ENC} ${BROTLI_STATIC_LIB_ENC})
set(ENV{BROTLI_STATIC_LIB_DEC} ${BROTLI_STATIC_LIB_DEC})
set(ENV{BROTLI_STATIC_LIB_COMMON} ${BROTLI_STATIC_LIB_COMMON})
set(ENV{SNAPPY_STATIC_LIB} ${SNAPPY_STATIC_LIB})
set(ENV{ZLIB_STATIC_LIB} ${ZLIB_STATIC_LIB})
set(ENV{LZ4_STATIC_LIB} ${LZ4_STATIC_LIB})
set(ENV{ZSTD_STATIC_LIB} ${ZSTD_STATIC_LIB})
set(ENV{ARROW_HOME} ${ARROW_HOME})

execute_process(
COMMAND ${CMAKE_COMMAND} --build .
RESULT_VARIABLE result
WORKING_DIRECTORY ${CMAKE_BINARY_DIR}${CMAKE_FILES_DIRECTORY}/thirdparty/parquetcpp-download)

if(result)
message(FATAL_ERROR "Build step for ParquetCpp failed: ${result}")
endif()

# Add transitive dependency: Thrift
set(THRIFT_ROOT ${CMAKE_BINARY_DIR}${CMAKE_FILES_DIRECTORY}/thirdparty/parquetcpp-build/thrift_ep/src/thrift_ep-install)

# Locate ParquetCpp package
set(PARQUETCPP_ROOT ${CMAKE_BINARY_DIR}${CMAKE_FILES_DIRECTORY}/thirdparty/parquetcpp-install)
set(PARQUETCPP_BINARY_DIR ${CMAKE_BINARY_DIR}${CMAKE_FILES_DIRECTORY}/thirdparty/parquetcpp-build)
set(PARQUETCPP_SOURCE_DIR ${CMAKE_BINARY_DIR}${CMAKE_FILES_DIRECTORY}/thirdparty/parquetcpp-src)

# Dependency interfaces
find_package(Boost REQUIRED COMPONENTS regex)

add_library(Apache::Thrift INTERFACE IMPORTED)
set_target_properties(Apache::Thrift
PROPERTIES INTERFACE_INCLUDE_DIRECTORIES ${THRIFT_ROOT}/include)
set_target_properties(Apache::Thrift
PROPERTIES INTERFACE_LINK_LIBRARIES ${THRIFT_ROOT}/lib/libthrift.a)

add_library(Apache::Arrow INTERFACE IMPORTED)
set_target_properties(Apache::Arrow
PROPERTIES INTERFACE_INCLUDE_DIRECTORIES ${ARROW_ROOT}/include)
set_target_properties(Apache::Arrow
PROPERTIES INTERFACE_LINK_LIBRARIES "${ARROW_ROOT}/lib/libarrow.a;${BROTLI_STATIC_LIB_ENC};${BROTLI_STATIC_LIB_DEC};${BROTLI_STATIC_LIB_COMMON};${SNAPPY_STATIC_LIB};${ZLIB_STATIC_LIB};${LZ4_STATIC_LIB};${ZSTD_STATIC_LIB}")

add_library(Apache::ParquetCpp INTERFACE IMPORTED)
set_target_properties(Apache::ParquetCpp
PROPERTIES INTERFACE_INCLUDE_DIRECTORIES
"${PARQUETCPP_ROOT}/include;${PARQUETCPP_BINARY_DIR}/src;${PARQUETCPP_SOURCE_DIR}/src")
set_target_properties(Apache::ParquetCpp
PROPERTIES INTERFACE_LINK_LIBRARIES "${PARQUETCPP_ROOT}/lib/libparquet.a;Apache::Arrow;Apache::Thrift;Boost::regex")
14 changes: 5 additions & 9 deletions cmake/Templates/Arrow.CMakeLists.txt.cmake
@@ -1,6 +1,7 @@
#=============================================================================
# Copyright 2018 BlazingDB, Inc.
# Copyright 2018 Percy Camilo Triveño Aucahuasi <[email protected]>
# Copyright 2018 Cristhian Alberto Gonzales Castillo <[email protected]>
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -23,7 +24,7 @@ project(arrow-download NONE)

include(ExternalProject)

set(ARROW_VERSION "apache-arrow-0.10.0")
set(ARROW_VERSION "apache-arrow-0.9.0")

if (NOT "$ENV{PARQUET_ARROW_VERSION}" STREQUAL "")
set(ARROW_VERSION "$ENV{PARQUET_ARROW_VERSION}")
@@ -34,24 +35,19 @@ message(STATUS "Using Apache Arrow version: ${ARROW_VERSION}")
set(ARROW_URL "https://github.com/apache/arrow/archive/${ARROW_VERSION}.tar.gz")

set(ARROW_CMAKE_ARGS
#Arrow dependencies
-DARROW_WITH_LZ4=OFF
-DARROW_WITH_ZSTD=OFF
-DARROW_WITH_BROTLI=OFF
-DARROW_WITH_SNAPPY=OFF
-DARROW_WITH_ZLIB=OFF

#Build settings
-DARROW_BUILD_STATIC=ON
-DARROW_BUILD_SHARED=OFF
-DARROW_BOOST_USE_SHARED=ON
-DARROW_BUILD_TESTS=OFF
-DARROW_TEST_MEMCHECK=OFF
-DARROW_BUILD_BENCHMARKS=OFF
-DARROW_BUILD_UTILITIES=OFF
-DARROW_JEMALLOC=OFF

#Arrow modules
-DARROW_IPC=ON
-DARROW_COMPUTE=OFF
-DARROW_COMPUTE=ON
-DARROW_GPU=OFF
-DARROW_JEMALLOC=OFF
-DARROW_BOOST_VENDORED=OFF
44 changes: 44 additions & 0 deletions cmake/Templates/ParquetCpp.CMakeLists.txt.cmake
@@ -0,0 +1,44 @@
#=============================================================================
# Copyright 2018 BlazingDB, Inc.
# Copyright 2018 Cristhian Alberto Gonzales Castillo <[email protected]>
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#=============================================================================

cmake_minimum_required(VERSION 2.8.12)

project(parquetcpp-download NONE)

include(ExternalProject)

set(PARQUET_VERSION apache-parquet-cpp-1.4.0)

if (NOT "$ENV{PARQUET_VERSION}" STREQUAL "")
set(PARQUET_VERSION $ENV{PARQUET_VERSION})
endif()

message(STATUS "Using Apache ParquetCpp version: ${PARQUET_VERSION}")

ExternalProject_Add(parquetcpp
BINARY_DIR "${CMAKE_BINARY_DIR}${CMAKE_FILES_DIRECTORY}/thirdparty/parquetcpp-build"
CMAKE_ARGS
-DCMAKE_BUILD_TYPE=RELEASE
-DCMAKE_INSTALL_PREFIX=${CMAKE_BINARY_DIR}${CMAKE_FILES_DIRECTORY}/thirdparty/parquetcpp-install
-DPARQUET_ARROW_LINKAGE=static
-DPARQUET_BUILD_SHARED=OFF
-DPARQUET_BUILD_TESTS=OFF
GIT_REPOSITORY https://github.com/apache/parquet-cpp.git
GIT_TAG apache-parquet-cpp-1.4.0
INSTALL_DIR "${CMAKE_BINARY_DIR}${CMAKE_FILES_DIRECTORY}/thirdparty/parquetcpp-install"
SOURCE_DIR "${CMAKE_BINARY_DIR}${CMAKE_FILES_DIRECTORY}/thirdparty/parquetcpp-src"
)
2 changes: 2 additions & 0 deletions conda_environments/dev_py35.yml
@@ -24,4 +24,6 @@ dependencies:
- llvmlite=0.18.0=py35_0
- numba=0.34.0.dev=np112py35_316
- cmake=3.6.3=0
- flex=2.6.0
- bison=3.0.4
- pyarrow=0.10.0
2 changes: 2 additions & 0 deletions include/gdf/cffi/types.h
@@ -48,6 +48,8 @@ typedef enum {
GDF_INVALID_API_CALL, /**< The arguments passed into the function were invalid */
GDF_JOIN_DTYPE_MISMATCH, /**< Datatype mismatch between corresponding columns in left/right tables in the Join function */
GDF_JOIN_TOO_MANY_COLUMNS, /**< Too many columns were passed in for the requested join operation*/

GDF_IO_ERROR, /**< Error occurred in the parquet-reader API that loads a parquet file into gdf_columns */
Review comment (Member):
Hmm, IO_ERROR seems generic enough of a name to apply to more than the parquet reader. Suggest either narrowing the name or broadening the comment.

GDF_DTYPE_MISMATCH, /**< Type mismatch between columns that should be the same type */
GDF_UNSUPPORTED_METHOD, /**< The method requested to perform an operation was invalid or unsupported (e.g., hash vs. sort)*/
GDF_INVALID_AGGREGATOR, /**< Invalid aggregator was specified for a groupby*/
80 changes: 80 additions & 0 deletions include/gdf/parquet/api.h
@@ -0,0 +1,80 @@
/*
* Copyright 2018 BlazingDB, Inc.
* Copyright 2018 Cristhian Alberto Gonzales Castillo <[email protected]>
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

#include <gdf/gdf.h>

#ifdef __cplusplus
#define BEGIN_NAMESPACE_GDF_PARQUET \
namespace gdf { \
namespace parquet {
#define END_NAMESPACE_GDF_PARQUET \
} \
}
#else
#define BEGIN_NAMESPACE_GDF_PARQUET
#define END_NAMESPACE_GDF_PARQUET
#endif

BEGIN_NAMESPACE_GDF_PARQUET

/// \brief Read parquet file from file path into array of gdf columns
/// \param[in] filename path to parquet file
/// \param[in] columns will be read from the file
/// \param[out] out_gdf_columns array
/// \param[out] out_gdf_columns_length number of columns
extern "C" gdf_error read_parquet(const char *const filename,
const char *const *const columns,
gdf_column **const out_gdf_columns,
size_t *const out_gdf_columns_length);

END_NAMESPACE_GDF_PARQUET

#ifdef __cplusplus

#include <string>
#include <vector>
#include <arrow/io/file.h>

namespace gdf {
Review comment (Member):
Why do you use the BEGIN_NAMESPACE_GDF_PARQUET macro above, but not here?

namespace parquet {

/// \brief Read parquet file from file path into array of gdf columns
/// \param[in] filename path to parquet file
/// \param[in] row_group_indices indices of the row groups that will be read from the file
/// \param[in] column_indices indices of the columns that will be read from the file
/// \param[out] out_gdf_columns vector of gdf_column pointers. The data read.
gdf_error
read_parquet_by_ids(const std::string & filename,
Review comment (Member):
What does "by_ids" signify? It's not explained in the comment.

const std::vector<std::size_t> &row_group_indices,
const std::vector<std::size_t> &column_indices,
std::vector<gdf_column *> & out_gdf_columns);

/// \brief Read parquet file from file interface into array of gdf columns
/// \param[in] file file interface to the parquet data
/// \param[in] row_group_indices indices of the row groups that will be read from the file
/// \param[in] column_indices indices of the columns that will be read from the file
/// \param[out] out_gdf_columns vector of gdf_column pointers. The data read.
gdf_error
read_parquet_by_ids(std::shared_ptr<::arrow::io::RandomAccessFile> file,
const std::vector<std::size_t> &row_group_indices,
const std::vector<std::size_t> &column_indices,
std::vector<gdf_column *> & out_gdf_columns);

} // namespace parquet
} // namespace gdf

#endif