-
Notifications
You must be signed in to change notification settings - Fork 904
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PARQUET-2261 Size Statistics #14000
PARQUET-2261 Size Statistics #14000
Changes from 130 commits
9589fc3
21ab768
76d54cf
7d2c41e
58e64a1
d7a6591
b52d226
3de71b0
b81d2cc
a2ad28e
dc3c637
b61de00
da4aac1
44051f3
44e0f23
6a5b021
51773e3
5fc47b9
abba811
be7f7f1
5843073
ee205cd
09c0c49
18d4158
fb3b6cc
94a8d64
0c23dca
06f0bef
74da962
00974b8
4140d30
db0779d
da2fb0e
5376a83
cf531cf
971931d
c4126c1
fae64c0
6a50029
9725f70
c3b6422
e84baf6
fa50d62
28d71c0
a474bec
9220f35
9f2e898
a91a196
fd9e3f8
4a677f6
e70b810
bf9e073
eb519f2
943be91
1cfe326
9823f11
a0a758f
ac7b665
76f16dd
4f49ef1
e757616
71e8eab
13200ff
16df9e5
075e11e
8a4820b
5700b21
45f3249
0ade852
aac2f33
0ae2fc4
f6dcb52
10df4a0
15a2831
aad3908
ec81897
9fafa22
ae164b9
05a7fa2
3cc69eb
5ac163b
9680b3d
6854683
ec1f138
b2a9488
348bf11
57b64ba
3b405c8
6d53225
b53bca1
e96b225
069eb3d
4c17cf0
fe6f1a8
5576e6d
3334cec
b111ff2
e4c9911
1b0b435
e311781
07805bb
df13cb7
2223be2
551554d
a028df8
244c369
03ea33a
74f25d7
f8481c9
1bc77cd
1eeaee0
6e4b9b1
536642c
59982b2
1393232
ddc5463
7826c0c
b6067ac
cf973f4
8746540
43c980f
72b9a57
ff7ffbc
69a081b
d4a1dba
a6f772d
12ed759
717d53b
551f728
1953714
cf793d6
90829f8
fa36c23
3ae2289
26fc762
18d4939
7cb0524
9440cd0
cebc8ee
e1b427d
2155f18
206e741
220704a
68dcc58
ed573ea
fe24d51
24c73f8
6687b32
7d2075b
335d0e7
01fc44f
b06d515
0dcb674
9d66218
223bfab
2001271
ef38d1e
7eaad88
2c4bebb
b458eb6
0665183
c5d7e50
9f859be
acef3d5
6a8a02e
49ea23f
0c6b6b5
33ab1e9
36785e2
862752e
3bacb21
78efd2e
97f2344
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
|
@@ -182,6 +182,7 @@ size_t CompactProtocolWriter::write(ColumnChunkMetaData const& s) | |||||
if (s.index_page_offset != 0) { c.field_int(10, s.index_page_offset); } | ||||||
if (s.dictionary_page_offset != 0) { c.field_int(11, s.dictionary_page_offset); } | ||||||
c.field_struct(12, s.statistics); | ||||||
if (s.size_statistics.has_value()) { c.field_struct(16, s.size_statistics.value()); } | ||||||
return c.value(); | ||||||
} | ||||||
|
||||||
|
@@ -210,6 +211,24 @@ size_t CompactProtocolWriter::write(OffsetIndex const& s) | |||||
{ | ||||||
CompactProtocolFieldWriter c(*this); | ||||||
c.field_struct_list(1, s.page_locations); | ||||||
if (s.unencoded_byte_array_data_bytes.has_value()) { | ||||||
c.field_int_list(2, s.unencoded_byte_array_data_bytes.value()); | ||||||
} | ||||||
return c.value(); | ||||||
} | ||||||
|
||||||
size_t CompactProtocolWriter::write(SizeStatistics const& s) | ||||||
{ | ||||||
CompactProtocolFieldWriter c(*this); | ||||||
if (s.unencoded_byte_array_data_bytes.has_value()) { | ||||||
c.field_int(1, s.unencoded_byte_array_data_bytes.value()); | ||||||
} | ||||||
if (s.repetition_level_histogram.has_value()) { | ||||||
c.field_int_list(2, s.repetition_level_histogram.value()); | ||||||
} | ||||||
if (s.definition_level_histogram.has_value()) { | ||||||
c.field_int_list(3, s.definition_level_histogram.value()); | ||||||
} | ||||||
return c.value(); | ||||||
} | ||||||
|
||||||
|
@@ -298,6 +317,17 @@ inline void CompactProtocolFieldWriter::field_int_list(int field, std::vector<En | |||||
current_field_value = field; | ||||||
} | ||||||
|
||||||
inline void CompactProtocolFieldWriter::field_int_list(int field, const std::vector<int64_t>& val) | ||||||
{ | ||||||
put_field_header(field, current_field_value, ST_FLD_LIST); | ||||||
put_byte((uint8_t)((std::min(val.size(), (size_t)0xfu) << 4) | ST_FLD_I64)); | ||||||
if (val.size() >= 0xf) put_uint(val.size()); | ||||||
for (auto& v : val) { | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Could we get away with
Suggested change
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Probably. And it doesn't need to be a reference either. Cut and paste error :P |
||||||
put_int(static_cast<int32_t>(v)); | ||||||
} | ||||||
current_field_value = field; | ||||||
} | ||||||
|
||||||
template <typename T> | ||||||
inline void CompactProtocolFieldWriter::field_struct(int field, T const& val) | ||||||
{ | ||||||
|
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
|
@@ -51,6 +51,7 @@ class CompactProtocolWriter { | |||||
size_t write(Statistics const&); | ||||||
size_t write(PageLocation const&); | ||||||
size_t write(OffsetIndex const&); | ||||||
size_t write(SizeStatistics const&); | ||||||
size_t write(ColumnOrder const&); | ||||||
|
||||||
protected: | ||||||
|
@@ -90,6 +91,8 @@ class CompactProtocolFieldWriter { | |||||
template <typename Enum> | ||||||
inline void field_int_list(int field, std::vector<Enum> const& val); | ||||||
|
||||||
inline void field_int_list(int field, const std::vector<int64_t>& val); | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Nitpick: We might consider east
Suggested change
Also, pardon my C++, but wouldn't this new function be covered already as a specialization of There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Nice catch...some of the "new" code here predates the great east const migration of 2023 :) Will fix. I'll see if I can get away with the one declaration and keep the int64_t specialization in the source file, since the field type is different (ST_FLD_I64 vs ST_FLD_I32 for the enums). |
||||||
|
||||||
template <typename T> | ||||||
inline void field_struct(int field, T const& val); | ||||||
|
||||||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pardon me, I'm not familiar with cuIO.
Isn't
0xfu
auint32_t
? Why the cast tosize_t
?Also, on the next line, are we not comparing a
uint32_t
(val.size()
) to anint32_t
(0xf)? Should there not be a cast here?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Old code, but my guess is because
val.size()
is asize_t
, then the other arg tostd::min
has to be as well. But it should be astatic_cast
. As for the next line, it's always a mystery to me what the compiler will complain about and what it will let slide...comparing an unsigned 64-bit int to a signed 32-bit int, you'd think there'd be a warning. 🤷♂️ I'll clean these up some.I went and read Da Rulez, and apparently the comparison is ok because the uint64 has a higher rank than the int32, resulting in the int32 being converted to a uint64 before the comparison. IIUC comparing to
-1
would have been problematic as the-1
would become an unsigned0xffff'ffff'ffff'ffff
😨 (but the compiler would likely flag that). Regardless, I don't want to rely on implicit conversions so changed the literal to0xfUL
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could it be declared as
0xflu
to make it asize_t
?I've never combined hex and suffixes.