Skip to content
This repository has been archived by the owner on Dec 21, 2018. It is now read-only.

[Review] Parquet reader multithread #146

Open
wants to merge 109 commits into
base: master
Choose a base branch
from
Open

[Review] Parquet reader multithread #146

wants to merge 109 commits into from

Conversation

wmalpica
Copy link
Contributor

This pull request is ontop of the parquet-reader branch pull request. It uses the same APIs, but it will read all column and rowgroups in parallel using a number of threads up to hardware_concurrency. This PR also contains some improvements to the file_reader-test unit test and a bug fix

gcca and others added 30 commits July 19, 2018 12:18
@wmalpica wmalpica changed the title Parquet reader multithread [Review] Parquet reader multithread Sep 19, 2018
@@ -44,6 +44,7 @@ typedef enum {
GDF_JOIN_DTYPE_MISMATCH, /**< Datatype mismatch between corresponding columns in left/right tables in the Join function */
GDF_JOIN_TOO_MANY_COLUMNS, /**< Too many columns were passed in for the requested join operation*/
GDF_GROUPBY_TOO_MANY_COLUMNS,
GDF_IO_ERROR,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it is good practice to insert into the middle of an enum. Also, can you please provide a docstring. Thanks!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok! Thanks for the feedback.

// std::memcpy(bytes_data + offset, dictionary_[i].ptr, fixed_len);
// dictionary_[i].ptr = bytes_data + offset;
// }
// }
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove dead code

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all right!

// data_size -= type_length;
// }
// return bytes_to_decode;
// }
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dead code

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all right!

@kkraus14 kkraus14 added the 3 - Ready for Review Ready for review by team label Sep 24, 2018
@nsakharnykh nsakharnykh mentioned this pull request Oct 24, 2018
3 tasks
Copy link
Member

@harrism harrism left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a ton of code to review. We should probably have an overview meeting like we discussed having for the binary ops PR. Unless you have already done this!

@@ -18,3 +18,5 @@ python/libgdf_cffi/libgdf_cffi.py

## eclipse
.project

build2/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is "build2" a common directory we need to gitignore? Perhaps this file was committed accidentally?

@@ -25,7 +26,7 @@

PROJECT(libgdf)

cmake_minimum_required(VERSION 2.8) # not sure about version required
cmake_minimum_required(VERSION 3.3) # not sure about version required
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

libgdf CMakeLists.txt requires make version 3.11... Maybe match that and remove the unsure comment.

@@ -48,6 +48,8 @@ typedef enum {
GDF_INVALID_API_CALL, /**< The arguments passed into the function were invalid */
GDF_JOIN_DTYPE_MISMATCH, /**< Datatype mismatch between corresponding columns in left/right tables in the Join function */
GDF_JOIN_TOO_MANY_COLUMNS, /**< Too many columns were passed in for the requested join operation*/

GDF_IO_ERROR, /**< Error occured in a parquet-reader api which load a parquet file into gdf_columns */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, IO_ERROR seems generic enough of a name to apply to more than the parquet reader. Suggest either narrowing the name or broadening the comment.

#include <vector>
#include <arrow/io/file.h>

namespace gdf {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you use the BEGIN_NAMESPACE_GDF_PARQUET macro above, but not here?

/// \param[in] indices of the columns that will be read from the file
/// \param[out] out_gdf_columns vector of gdf_column pointers. The data read.
gdf_error
read_parquet_by_ids(const std::string & filename,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does "by_ids" signify? It's not explained in the comment.

namespace gdf {
namespace parquet {

/// \brief Read parquet file from file path into array of gdf columns
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mike-wendt mike-wendt added 4 - Needs Rework Additional work is needed migrate to cudf and removed 3 - Ready for Review Ready for review by team labels Oct 24, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
4 - Needs Rework Additional work is needed migrate to cudf
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants