PARQUET-2261 Size Statistics #14000

Merged
merged 173 commits into from
Dec 6, 2023
9589fc3
stub in new SizeStatistics
etseidl Aug 25, 2023
21ab768
use block_reduce to generate histograms. cuts gpuEncodePages time con…
etseidl Aug 29, 2023
76d54cf
move gen_hist to anonymous namespace
etseidl Aug 29, 2023
7d2c41e
add FIXME for ColumnIndex struct
etseidl Aug 29, 2023
58e64a1
update FIXME for column index size calc
etseidl Aug 29, 2023
d7a6591
Merge branch 'rapidsai:branch-23.10' into size_statistics
etseidl Aug 30, 2023
b52d226
Merge remote-tracking branch 'origin/branch-23.10' into size_statistics
etseidl Aug 30, 2023
3de71b0
Merge branch 'rapidsai:branch-23.10' into size_statistics
etseidl Aug 31, 2023
b81d2cc
checkpoint option 2
etseidl Aug 31, 2023
a2ad28e
do not write histograms if neither crosses the threshold
etseidl Aug 31, 2023
dc3c637
get rid of unnecessary friends
etseidl Sep 1, 2023
b61de00
fix bug in histogram calc
etseidl Sep 1, 2023
da4aac1
Merge branch 'size_statistics' into sizes_opt2
etseidl Sep 1, 2023
44051f3
Merge branch 'rapidsai:branch-23.10' into size_statistics
etseidl Sep 1, 2023
44e0f23
Merge remote-tracking branch 'origin/branch-23.10' into sizes_opt2
etseidl Sep 1, 2023
6a5b021
Merge branch 'sizes_opt2' into size_statistics
etseidl Sep 1, 2023
51773e3
latest from #197
etseidl Sep 1, 2023
5fc47b9
a few more tweaks
etseidl Sep 1, 2023
abba811
add some comments
etseidl Sep 1, 2023
be7f7f1
Merge branch 'rapidsai:branch-23.10' into size_statistics
etseidl Sep 4, 2023
5843073
move cutoffs to header
etseidl Sep 5, 2023
ee205cd
update column index size calc
etseidl Sep 5, 2023
09c0c49
fix handling of optional list of structs
etseidl Sep 5, 2023
18d4158
add page indexes to ColumnChunk struct
etseidl Sep 5, 2023
fb3b6cc
Merge branch 'rapidsai:branch-23.10' into size_statistics
etseidl Sep 7, 2023
94a8d64
Merge branch 'rapidsai:branch-23.10' into size_statistics
etseidl Sep 7, 2023
0c23dca
latest from parquet-2261
etseidl Sep 7, 2023
06f0bef
add TODO
etseidl Sep 8, 2023
74da962
Merge branch 'rapidsai:branch-23.10' into size_statistics
etseidl Sep 8, 2023
00974b8
rename optional functors to match existing precedent
etseidl Sep 8, 2023
4140d30
Merge branch 'rapidsai:branch-23.10' into size_statistics
etseidl Sep 9, 2023
db0779d
rework compact protocol reader
etseidl Sep 11, 2023
da2fb0e
Merge branch 'branch-23.10' into size_statistics
etseidl Sep 11, 2023
5376a83
fix for string list
etseidl Sep 11, 2023
cf531cf
clean up more inadvertent overloads of field()
etseidl Sep 11, 2023
971931d
clean up enums
etseidl Sep 11, 2023
c4126c1
refactor list functors
etseidl Sep 11, 2023
fae64c0
fix for int list functor
etseidl Sep 11, 2023
6a50029
more list refactoring
etseidl Sep 12, 2023
9725f70
move functors to cpp file. they're only used by the read methods, so
etseidl Sep 12, 2023
c3b6422
refactor compact protocol reader
etseidl Sep 12, 2023
e84baf6
use CRTP to get rid of pure virtual
etseidl Sep 12, 2023
fa50d62
Revert "use CRTP to get rid of pure virtual"
etseidl Sep 12, 2023
28d71c0
fix get_uxx functions
etseidl Sep 12, 2023
a474bec
replace pure virtual read_value with std::function
etseidl Sep 12, 2023
9220f35
rework implementation of the `column_orders` field in file meta data
etseidl Sep 12, 2023
9f2e898
clean up
etseidl Sep 12, 2023
a91a196
more cleanup
etseidl Sep 12, 2023
fd9e3f8
Merge remote-tracking branch 'github/refactor_parquet_thrift' into si…
etseidl Sep 12, 2023
4a677f6
clean up remaining single-line if statements
etseidl Sep 12, 2023
e70b810
Merge branch 'branch-23.10' into refactor_parquet_thrift
etseidl Sep 12, 2023
bf9e073
Merge remote-tracking branch 'github/refactor_parquet_thrift' into si…
etseidl Sep 12, 2023
eb519f2
more consts
etseidl Sep 12, 2023
943be91
what can you apply apart from const...more const!
etseidl Sep 12, 2023
1cfe326
clean up header
etseidl Sep 12, 2023
9823f11
Merge branch 'rapidsai:branch-23.10' into refactor_parquet_thrift
etseidl Sep 13, 2023
a0a758f
Merge branch 'rapidsai:branch-23.10' into size_statistics
etseidl Sep 13, 2023
ac7b665
add FIXME
etseidl Sep 14, 2023
76f16dd
Merge branch 'refactor_parquet_thrift' of github.com:etseidl/cudf int…
etseidl Sep 14, 2023
4f49ef1
Merge branch 'branch-23.10' into refactor_parquet_thrift
etseidl Sep 14, 2023
e757616
add documentation to empty struct functor
etseidl Sep 14, 2023
71e8eab
Merge branch 'rapidsai:branch-23.10' into size_statistics
etseidl Sep 14, 2023
13200ff
clean up skip_struct_field some more
etseidl Sep 14, 2023
16df9e5
Merge remote-tracking branch 'github/refactor_parquet_thrift' into si…
etseidl Sep 15, 2023
075e11e
Merge branch 'rapidsai:branch-23.10' into size_statistics
etseidl Sep 15, 2023
8a4820b
Merge branch 'rapidsai:branch-23.10' into refactor_parquet_thrift
etseidl Sep 15, 2023
5700b21
convert union to enum with state
etseidl Sep 18, 2023
45f3249
Merge branch 'branch-23.10' into refactor_parquet_thrift
etseidl Sep 18, 2023
0ade852
use thrust::optional rather than std::optional as some fields may
etseidl Sep 18, 2023
aac2f33
Merge branch 'branch-23.10' into refactor_parquet_thrift
etseidl Sep 18, 2023
0ae2fc4
missed a use of std::optional
etseidl Sep 18, 2023
f6dcb52
Merge remote-tracking branch 'github/refactor_parquet_thrift' into si…
etseidl Sep 19, 2023
10df4a0
Merge remote-tracking branch 'origin/branch-23.10' into size_statistics
etseidl Sep 19, 2023
15a2831
more snake case
etseidl Sep 19, 2023
aad3908
Merge remote-tracking branch 'origin/branch-23.10' into size_statistics
etseidl Sep 20, 2023
ec81897
finish merge
etseidl Sep 20, 2023
9fafa22
Merge branch 'rapidsai:branch-23.10' into size_statistics
etseidl Sep 20, 2023
ae164b9
Merge branch 'branch-23.10' into size_statistics
etseidl Sep 21, 2023
05a7fa2
Merge branch 'rapidsai:branch-23.10' into size_statistics
etseidl Sep 22, 2023
3cc69eb
Merge branch 'rapidsai:branch-23.10' into size_statistics
etseidl Sep 27, 2023
5ac163b
Merge branch 'branch-23.12' into size_statistics
etseidl Sep 27, 2023
9680b3d
Merge branch 'branch-23.12' into size_statistics
etseidl Sep 27, 2023
6854683
Merge branch 'rapidsai:branch-23.10' into size_statistics
etseidl Sep 28, 2023
ec1f138
Merge remote-tracking branch 'origin/branch-23.12' into size_statistics
etseidl Sep 28, 2023
b2a9488
Merge branch 'rapidsai:branch-23.10' into size_statistics
etseidl Sep 28, 2023
348bf11
Merge branch 'rapidsai:branch-23.12' into size_statistics
etseidl Sep 28, 2023
57b64ba
Merge branch 'rapidsai:branch-23.12' into size_statistics
etseidl Sep 29, 2023
3b405c8
Merge branch 'rapidsai:branch-23.12' into size_statistics
etseidl Oct 3, 2023
6d53225
Merge branch 'branch-23.12' into size_statistics
etseidl Oct 3, 2023
b53bca1
Merge branch 'branch-23.12' into size_statistics
etseidl Oct 4, 2023
e96b225
Merge branch 'branch-23.12' into size_statistics
etseidl Oct 4, 2023
069eb3d
Merge branch 'branch-23.12' into size_statistics
etseidl Oct 4, 2023
4c17cf0
Merge remote-tracking branch 'origin/branch-23.12' into size_statistics
etseidl Oct 6, 2023
fe6f1a8
Merge remote-tracking branch 'origin/branch-23.12' into size_statistics
etseidl Oct 9, 2023
5576e6d
Merge branch 'rapidsai:branch-23.12' into size_statistics
etseidl Oct 10, 2023
3334cec
Merge branch 'rapidsai:branch-23.12' into size_statistics
etseidl Oct 11, 2023
b111ff2
Merge branch 'rapidsai:branch-23.12' into size_statistics
etseidl Oct 16, 2023
e4c9911
Merge branch 'rapidsai:branch-23.12' into size_statistics
etseidl Oct 18, 2023
1b0b435
Merge branch 'rapidsai:branch-23.12' into size_statistics
etseidl Oct 20, 2023
e311781
Merge remote-tracking branch 'origin/branch-23.12' into size_statistics
etseidl Oct 25, 2023
07805bb
finish merge
etseidl Oct 25, 2023
df13cb7
fix alignment issue
etseidl Oct 26, 2023
2223be2
Merge branch 'rapidsai:branch-23.12' into size_statistics
etseidl Oct 27, 2023
551554d
Merge branch 'branch-23.12' into size_statistics
etseidl Oct 28, 2023
a028df8
revert some changes not related to PARQUET-2261 (https://github.com/a…
etseidl Oct 28, 2023
244c369
more clean up
etseidl Oct 28, 2023
03ea33a
Merge remote-tracking branch 'origin/branch-23.12' into size_statistics
etseidl Oct 30, 2023
74f25d7
remove initializers for optional members
etseidl Oct 30, 2023
f8481c9
clean up comments
etseidl Oct 30, 2023
1bc77cd
more cleanup
etseidl Oct 30, 2023
1eeaee0
add test of histograms and string sizes
etseidl Oct 30, 2023
6e4b9b1
use a faster histogram generator where possible
etseidl Oct 31, 2023
536642c
Merge branch 'rapidsai:branch-23.12' into size_statistics
etseidl Oct 31, 2023
59982b2
Merge branch 'rapidsai:branch-23.12' into size_statistics
etseidl Nov 1, 2023
1393232
add some documentation
etseidl Nov 1, 2023
ddc5463
change signature of gen_histograms_list_col
etseidl Nov 1, 2023
7826c0c
change gen_hist to a better name
etseidl Nov 1, 2023
b6067ac
fix up leaf nullability check
etseidl Nov 1, 2023
cf973f4
better way to figure out leaf nullability
etseidl Nov 1, 2023
8746540
fix docstring
etseidl Nov 1, 2023
43c980f
rename function
etseidl Nov 1, 2023
72b9a57
add some more histogram checks
etseidl Nov 2, 2023
ff7ffbc
add some more stats checks
etseidl Nov 2, 2023
69a081b
clean up docs, no need to pass lvl_start as it is always 1
etseidl Nov 2, 2023
d4a1dba
clean up another comment
etseidl Nov 2, 2023
a6f772d
var_bytes calculation was wrong...need valid count for each fragment
etseidl Nov 3, 2023
12ed759
add num_valid to EncPage
etseidl Nov 3, 2023
717d53b
remove unused aliases
etseidl Nov 3, 2023
551f728
Merge remote-tracking branch 'origin/branch-23.12' into size_statistics
etseidl Nov 6, 2023
1953714
Merge branch 'rapidsai:branch-23.12' into size_statistics
etseidl Nov 7, 2023
cf793d6
Merge remote-tracking branch 'origin/branch-23.12' into size_statistics
etseidl Nov 8, 2023
90829f8
clean up some tech debt found in review
etseidl Nov 8, 2023
fa36c23
Merge branch 'size_statistics' of github.com:etseidl/cudf into size_s…
etseidl Nov 8, 2023
3ae2289
do not rely on arithmetic conversion in if statement
etseidl Nov 8, 2023
26fc762
Merge remote-tracking branch 'origin/branch-23.12' into size_statistics
etseidl Nov 8, 2023
18d4939
Merge branch 'branch-23.12' into size_statistics
etseidl Nov 9, 2023
7cb0524
Merge branch 'rapidsai:branch-23.12' into size_statistics
etseidl Nov 9, 2023
9440cd0
Merge branch 'branch-23.12' into size_statistics
etseidl Nov 11, 2023
cebc8ee
Merge branch 'branch-24.02' into size_statistics
etseidl Nov 13, 2023
e1b427d
implement suggestion from september
etseidl Nov 14, 2023
2155f18
Merge remote-tracking branch 'origin/branch-24.02' into size_statistics
etseidl Nov 16, 2023
206e741
Merge branch 'branch-24.02' into size_statistics
etseidl Nov 16, 2023
220704a
Merge remote-tracking branch 'origin/branch-24.02' into size_statistics
etseidl Nov 17, 2023
68dcc58
Merge branch 'branch-24.02' into size_statistics
etseidl Nov 17, 2023
ed573ea
Merge remote-tracking branch 'origin/branch-24.02' into size_statistics
etseidl Nov 20, 2023
fe24d51
Merge remote-tracking branch 'origin/branch-24.02' into size_statistics
etseidl Nov 20, 2023
24c73f8
no need to pass in valid_count as it is on the page now
etseidl Nov 21, 2023
6687b32
fix typo
etseidl Nov 21, 2023
7d2075b
Merge branch 'branch-24.02' into size_statistics
etseidl Nov 21, 2023
335d0e7
clean up some histogram finishing work
etseidl Nov 28, 2023
01fc44f
add helper functions to calculate num data pages
etseidl Nov 28, 2023
b06d515
a few fixes from review
etseidl Nov 28, 2023
0dcb674
the spec removed the struct enclosing the histograms
etseidl Nov 28, 2023
9d66218
rework stats encoding to use iterators
etseidl Nov 28, 2023
223bfab
rework host-side histogram collection per review suggestion
etseidl Nov 28, 2023
2001271
use device_uvector rather than hostdevice_vector for histograms
etseidl Nov 28, 2023
ef38d1e
Merge remote-tracking branch 'origin/branch-24.02' into size_statistics
etseidl Nov 28, 2023
7eaad88
fix cut-and-paste error
etseidl Nov 28, 2023
2c4bebb
use num_data_pages in col index size calc
etseidl Nov 28, 2023
b458eb6
add comment to explain 5 byte requirement
etseidl Nov 28, 2023
0665183
histograms can be initialized asynchronously
etseidl Nov 28, 2023
c5d7e50
missed another place to use num_data_pages()
etseidl Nov 28, 2023
9f859be
get rid of unnecessary casts
etseidl Nov 28, 2023
acef3d5
remove unused function
etseidl Nov 28, 2023
6a8a02e
move num_data_pages/num_dict_pages into EncColumnChunk
etseidl Nov 29, 2023
49ea23f
Merge branch 'branch-24.02' into size_statistics
etseidl Nov 29, 2023
0c6b6b5
Merge branch 'branch-24.02' into size_statistics
etseidl Nov 29, 2023
33ab1e9
Merge branch 'branch-24.02' into size_statistics
etseidl Dec 2, 2023
36785e2
Merge branch 'rapidsai:branch-24.02' into size_statistics
etseidl Dec 5, 2023
862752e
clean up some comments
etseidl Dec 5, 2023
3bacb21
Merge remote-tracking branch 'origin/branch-24.02' into size_statistics
etseidl Dec 6, 2023
78efd2e
cast signed to unsigned for comparison
etseidl Dec 6, 2023
97f2344
Merge branch 'branch-24.02' into size_statistics
ttnghia Dec 6, 2023
31 changes: 26 additions & 5 deletions cpp/src/io/parquet/compact_protocol_reader.cpp
@@ -289,7 +289,7 @@ class parquet_field_union_struct : public parquet_field {
inline bool operator()(CompactProtocolReader* cpr, int field_type)
{
T v;
bool const res = parquet_field_struct<T>(field(), v).operator()(cpr, field_type);
bool const res = parquet_field_struct<T>{field(), v}(cpr, field_type);
Contributor
This is great 👍

Contributor Author
thanks @vuule :D

if (!res) {
val = v;
enum_val = static_cast<E>(field());
@@ -424,7 +424,7 @@ class parquet_field_optional : public parquet_field {
inline bool operator()(CompactProtocolReader* cpr, int field_type)
{
T v;
bool const res = FieldFunctor(field(), v).operator()(cpr, field_type);
bool const res = FieldFunctor{field(), v}(cpr, field_type);
if (!res) { val = v; }
return res;
}
@@ -631,6 +631,8 @@ bool CompactProtocolReader::read(ColumnChunk* c)

bool CompactProtocolReader::read(ColumnChunkMetaData* c)
{
using optional_size_statistics =
parquet_field_optional<SizeStatistics, parquet_field_struct<SizeStatistics>>;
auto op = std::make_tuple(parquet_field_enum<Type>(1, c->type),
parquet_field_enum_list(2, c->encodings),
parquet_field_string_list(3, c->path_in_schema),
@@ -641,7 +643,8 @@ bool CompactProtocolReader::read(ColumnChunkMetaData* c)
parquet_field_int64(9, c->data_page_offset),
parquet_field_int64(10, c->index_page_offset),
parquet_field_int64(11, c->dictionary_page_offset),
parquet_field_struct(12, c->statistics));
parquet_field_struct(12, c->statistics),
optional_size_statistics(16, c->size_statistics));
return function_builder(this, op);
}

@@ -700,17 +703,35 @@ bool CompactProtocolReader::read(PageLocation* p)

bool CompactProtocolReader::read(OffsetIndex* o)
{
auto op = std::make_tuple(parquet_field_struct_list(1, o->page_locations));
using optional_list_i64 = parquet_field_optional<std::vector<int64_t>, parquet_field_int64_list>;

auto op = std::make_tuple(parquet_field_struct_list(1, o->page_locations),
optional_list_i64(2, o->unencoded_byte_array_data_bytes));
return function_builder(this, op);
}

bool CompactProtocolReader::read(SizeStatistics* s)
{
using optional_i64 = parquet_field_optional<int64_t, parquet_field_int64>;
using optional_list_i64 = parquet_field_optional<std::vector<int64_t>, parquet_field_int64_list>;

auto op = std::make_tuple(optional_i64(1, s->unencoded_byte_array_data_bytes),
optional_list_i64(2, s->repetition_level_histogram),
optional_list_i64(3, s->definition_level_histogram));
return function_builder(this, op);
}

bool CompactProtocolReader::read(ColumnIndex* c)
{
using optional_list_i64 = parquet_field_optional<std::vector<int64_t>, parquet_field_int64_list>;

auto op = std::make_tuple(parquet_field_bool_list(1, c->null_pages),
parquet_field_binary_list(2, c->min_values),
parquet_field_binary_list(3, c->max_values),
parquet_field_enum<BoundaryOrder>(4, c->boundary_order),
parquet_field_int64_list(5, c->null_counts));
parquet_field_int64_list(5, c->null_counts),
optional_list_i64(6, c->repetition_level_histogram),
optional_list_i64(7, c->definition_level_histogram));
return function_builder(this, op);
}
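
For readers following the reader changes: the optional wrappers in this file appear to populate their target only when the wrapped functor succeeds, and a functor returning `true` signals failure (see `if (!res) { val = v; }` in `parquet_field_optional`). Below is a minimal sketch of that pattern. It assumes `std::optional` as a stand-in for the optional type used in the PR. `toy_reader`, `toy_field_int64`, and `toy_field_optional` are hypothetical names for illustration and are not part of cudf.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical stand-in for CompactProtocolReader: a cursor over pre-decoded values.
struct toy_reader {
  std::vector<int64_t> values;
  std::size_t pos = 0;
};

// Reads one int64 into `val`. Like the functors in the diff, it returns true on FAILURE.
struct toy_field_int64 {
  int field_id;
  int64_t& val;
  bool operator()(toy_reader* r, int /*field_type*/)
  {
    if (r->pos >= r->values.size()) { return true; }  // nothing left to read
    val = r->values[r->pos++];
    return false;
  }
};

// Wraps another functor; the optional is engaged only if the inner read succeeds.
template <typename T, typename FieldFunctor>
struct toy_field_optional {
  int field_id;
  std::optional<T>& val;
  bool operator()(toy_reader* r, int field_type)
  {
    T v{};
    bool const res = FieldFunctor{field_id, v}(r, field_type);
    if (!res) { val = v; }
    return res;
  }
};
```

With this shape, a field that is absent from the input simply leaves the `std::optional` disengaged, which is what lets the writer side test `has_value()` before emitting anything.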

1 change: 1 addition & 0 deletions cpp/src/io/parquet/compact_protocol_reader.hpp
@@ -116,6 +116,7 @@ class CompactProtocolReader {
bool read(KeyValue* k);
bool read(PageLocation* p);
bool read(OffsetIndex* o);
bool read(SizeStatistics* s);
bool read(ColumnIndex* c);
bool read(Statistics* s);
bool read(ColumnOrder* c);
38 changes: 35 additions & 3 deletions cpp/src/io/parquet/compact_protocol_writer.cpp
@@ -182,6 +182,7 @@ size_t CompactProtocolWriter::write(ColumnChunkMetaData const& s)
if (s.index_page_offset != 0) { c.field_int(10, s.index_page_offset); }
if (s.dictionary_page_offset != 0) { c.field_int(11, s.dictionary_page_offset); }
c.field_struct(12, s.statistics);
if (s.size_statistics.has_value()) { c.field_struct(16, s.size_statistics.value()); }
return c.value();
}

@@ -210,6 +211,24 @@ size_t CompactProtocolWriter::write(OffsetIndex const& s)
{
CompactProtocolFieldWriter c(*this);
c.field_struct_list(1, s.page_locations);
if (s.unencoded_byte_array_data_bytes.has_value()) {
c.field_int_list(2, s.unencoded_byte_array_data_bytes.value());
}
return c.value();
}

size_t CompactProtocolWriter::write(SizeStatistics const& s)
{
CompactProtocolFieldWriter c(*this);
if (s.unencoded_byte_array_data_bytes.has_value()) {
c.field_int(1, s.unencoded_byte_array_data_bytes.value());
}
if (s.repetition_level_histogram.has_value()) {
c.field_int_list(2, s.repetition_level_histogram.value());
}
if (s.definition_level_histogram.has_value()) {
c.field_int_list(3, s.definition_level_histogram.value());
}
return c.value();
}
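
As background for the three fields written above: in the Parquet format, a level histogram has one bucket per level, so its length is `max_level + 1`, and bucket `i` counts how often level `i` occurs. The following is a host-side sketch only. The PR computes histograms on the GPU, and `size_statistics_sketch` and `level_histogram` are hypothetical names used for illustration, with `std::optional` assumed as a stand-in for the optional type in cudf.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Host-side stand-in mirroring the three optional fields the writer emits.
struct size_statistics_sketch {
  std::optional<int64_t> unencoded_byte_array_data_bytes;
  std::optional<std::vector<int64_t>> repetition_level_histogram;
  std::optional<std::vector<int64_t>> definition_level_histogram;
};

// Counts occurrences of each level in [0, max_level]; out-of-range levels are ignored.
inline std::vector<int64_t> level_histogram(std::vector<uint8_t> const& levels,
                                            uint8_t max_level)
{
  std::vector<int64_t> hist(static_cast<std::size_t>(max_level) + 1, 0);
  for (auto const lvl : levels) {
    if (lvl <= max_level) { ++hist[lvl]; }
  }
  return hist;
}
```

For example, definition levels `{0, 1, 1, 2, 0, 1}` with a maximum level of 2 give the histogram `{2, 3, 1}`.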

@@ -286,13 +305,26 @@ inline void CompactProtocolFieldWriter::field_int(int field, int64_t val)
current_field_value = field;
}

template <>
inline void CompactProtocolFieldWriter::field_int_list<int64_t>(int field,
std::vector<int64_t> const& val)
{
put_field_header(field, current_field_value, ST_FLD_LIST);
put_byte(static_cast<uint8_t>((std::min(val.size(), 0xfUL) << 4) | ST_FLD_I64));
if (val.size() >= 0xfUL) { put_uint(val.size()); }
for (auto const v : val) {
put_int(v);
}
current_field_value = field;
}

template <typename Enum>
inline void CompactProtocolFieldWriter::field_int_list(int field, std::vector<Enum> const& val)
{
put_field_header(field, current_field_value, ST_FLD_LIST);
put_byte((uint8_t)((std::min(val.size(), (size_t)0xfu) << 4) | ST_FLD_I32));
if (val.size() >= 0xf) put_uint(val.size());
for (auto& v : val) {
put_byte(static_cast<uint8_t>((std::min(val.size(), 0xfUL) << 4) | ST_FLD_I32));
if (val.size() >= 0xfUL) { put_uint(val.size()); }
for (auto const& v : val) {
put_int(static_cast<int32_t>(v));
}
current_field_value = field;
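
The `field_int_list` variants above follow the Thrift compact protocol's list layout: a header byte carries the element count in its high nibble and the element type in its low nibble, and counts of 15 or more store `0xF` in the nibble followed by the real count as a varint. Here is a standalone sketch of the `int64_t` case. It omits the field header, assumes that `put_int` writes zigzag varints, and uses hypothetical names (`put_uvarint`, `zigzag`, `encode_i64_list`) that are not cudf functions.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr uint8_t type_i64 = 6;  // Thrift compact-protocol type id for i64

// Unsigned LEB128 varint: 7 payload bits per byte, high bit set on all but the last.
inline void put_uvarint(std::vector<uint8_t>& out, uint64_t v)
{
  while (v > 0x7f) {
    out.push_back(static_cast<uint8_t>((v & 0x7f) | 0x80));
    v >>= 7;
  }
  out.push_back(static_cast<uint8_t>(v));
}

// Zigzag maps signed to unsigned so small magnitudes stay short: 0,-1,1,-2 -> 0,1,2,3.
// Relies on arithmetic right shift of negative values (guaranteed since C++20).
inline uint64_t zigzag(int64_t v)
{
  return (static_cast<uint64_t>(v) << 1) ^ static_cast<uint64_t>(v >> 63);
}

// Encodes the list header and elements; the enclosing field header is omitted.
inline std::vector<uint8_t> encode_i64_list(std::vector<int64_t> const& val)
{
  std::vector<uint8_t> out;
  auto const short_size = std::min<std::size_t>(val.size(), 0xf);
  out.push_back(static_cast<uint8_t>((short_size << 4) | type_i64));
  if (val.size() >= 0xf) { put_uvarint(out, val.size()); }
  for (auto const v : val) { put_uvarint(out, zigzag(v)); }
  return out;
}
```

Encoding `{1, -1, 2}` yields the bytes `0x36 0x02 0x01 0x04`. The sketch spells the `std::min` template argument as `std::size_t` explicitly so both operands have the same type regardless of platform.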
5 changes: 5 additions & 0 deletions cpp/src/io/parquet/compact_protocol_writer.hpp
@@ -51,6 +51,7 @@ class CompactProtocolWriter {
size_t write(Statistics const&);
size_t write(PageLocation const&);
size_t write(OffsetIndex const&);
size_t write(SizeStatistics const&);
size_t write(ColumnOrder const&);

protected:
@@ -113,4 +114,8 @@ class CompactProtocolFieldWriter {
inline void set_current_field(int const& field);
};

template <>
inline void CompactProtocolFieldWriter::field_int_list<int64_t>(int field,
std::vector<int64_t> const& val);

} // namespace cudf::io::parquet::detail