-
Notifications
You must be signed in to change notification settings - Fork 43
[Review] Parquet reader multithread #146
base: master
Are you sure you want to change the base?
Conversation
include/gdf/cffi/types.h
Outdated
@@ -44,6 +44,7 @@ typedef enum { | |||
GDF_JOIN_DTYPE_MISMATCH, /**< Datatype mismatch between corresponding columns in left/right tables in the Join function */ | |||
GDF_JOIN_TOO_MANY_COLUMNS, /**< Too many columns were passed in for the requested join operation*/ | |||
GDF_GROUPBY_TOO_MANY_COLUMNS, | |||
GDF_IO_ERROR, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it is good practice to insert into the middle of an enum. Also, can you please provide a docstring. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok! Thanks for the feedback.
src/parquet/dictionary_decoder.cuh
Outdated
// std::memcpy(bytes_data + offset, dictionary_[i].ptr, fixed_len); | ||
// dictionary_[i].ptr = bytes_data + offset; | ||
// } | ||
// } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove dead code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all right!
src/parquet/plain_decoder.cuh
Outdated
// data_size -= type_length; | ||
// } | ||
// return bytes_to_decode; | ||
// } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dead code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all right!
…i so that it ignores string columns.
…erface instead of ReadableFile which was a class
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a ton of code to review. We should probably have an overview meeting like we discussed having for the binary ops PR. Unless you have already done this!
@@ -18,3 +18,5 @@ python/libgdf_cffi/libgdf_cffi.py | |||
|
|||
## eclipse | |||
.project | |||
|
|||
build2/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is "build2" a common directory we need to gitignore? Perhaps this file was committed accidentally?
@@ -25,7 +26,7 @@ | |||
|
|||
PROJECT(libgdf) | |||
|
|||
cmake_minimum_required(VERSION 2.8) # not sure about version required | |||
cmake_minimum_required(VERSION 3.3) # not sure about version required |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
libgdf CMakeLists.txt requires make version 3.11... Maybe match that and remove the unsure comment.
@@ -48,6 +48,8 @@ typedef enum { | |||
GDF_INVALID_API_CALL, /**< The arguments passed into the function were invalid */ | |||
GDF_JOIN_DTYPE_MISMATCH, /**< Datatype mismatch between corresponding columns in left/right tables in the Join function */ | |||
GDF_JOIN_TOO_MANY_COLUMNS, /**< Too many columns were passed in for the requested join operation*/ | |||
|
|||
GDF_IO_ERROR, /**< Error occured in a parquet-reader api which load a parquet file into gdf_columns */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, IO_ERROR seems generic enough of a name to apply to more than the parquet reader. Suggest either narrowing the name or broadening the comment.
#include <vector> | ||
#include <arrow/io/file.h> | ||
|
||
namespace gdf { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you use the BEGIN_NAMESPACE_GDF_PARQUET macro above, but not here?
/// \param[in] indices of the columns that will be read from the file | ||
/// \param[out] out_gdf_columns vector of gdf_column pointers. The data read. | ||
gdf_error | ||
read_parquet_by_ids(const std::string & filename, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does "by_ids" signify? It's not explained in the comment.
namespace gdf { | ||
namespace parquet { | ||
|
||
/// \brief Read parquet file from file path into array of gdf columns |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please follow the comment templates given in https://github.com/rapidsai/libgdf/blob/master/src/example_documentation.cpp
This pull request is ontop of the parquet-reader branch pull request. It uses the same APIs, but it will read all column and rowgroups in parallel using a number of threads up to hardware_concurrency. This PR also contains some improvements to the file_reader-test unit test and a bug fix