PARQUET-2261 Size Statistics #14000
Merged
173 commits
9589fc3
stub in new SizeStatistics
etseidl 21ab768
use block_reduce to generate histograms. cuts gpuEncodePages time con…
etseidl 76d54cf
move gen_hist to anonymous namespace
etseidl 7d2c41e
add FIXME for ColumnIndex struct
etseidl 58e64a1
update FIXME for column index size calc
etseidl d7a6591
Merge branch 'rapidsai:branch-23.10' into size_statistics
etseidl b52d226
Merge remote-tracking branch 'origin/branch-23.10' into size_statistics
etseidl 3de71b0
Merge branch 'rapidsai:branch-23.10' into size_statistics
etseidl b81d2cc
checkpoint option 2
etseidl a2ad28e
do not write histograms if neither crosses the threshold
etseidl dc3c637
get rid of unnecessary friends
etseidl b61de00
fix bug in histogram calc
etseidl da4aac1
Merge branch 'size_statistics' into sizes_opt2
etseidl 44051f3
Merge branch 'rapidsai:branch-23.10' into size_statistics
etseidl 44e0f23
Merge remote-tracking branch 'origin/branch-23.10' into sizes_opt2
etseidl 6a5b021
Merge branch 'sizes_opt2' into size_statistics
etseidl 51773e3
latest from #197
etseidl 5fc47b9
a few more tweaks
etseidl abba811
add some comments
etseidl be7f7f1
Merge branch 'rapidsai:branch-23.10' into size_statistics
etseidl 5843073
move cutoffs to header
etseidl ee205cd
update column index size calc
etseidl 09c0c49
fix handling of optional list of structs
etseidl 18d4158
add page indexes to ColumnChunk struct
etseidl fb3b6cc
Merge branch 'rapidsai:branch-23.10' into size_statistics
etseidl 94a8d64
Merge branch 'rapidsai:branch-23.10' into size_statistics
etseidl 0c23dca
latest from parquet-2261
etseidl 06f0bef
add TODO
etseidl 74da962
Merge branch 'rapidsai:branch-23.10' into size_statistics
etseidl 00974b8
rename optional functors to match existing precedent
etseidl 4140d30
Merge branch 'rapidsai:branch-23.10' into size_statistics
etseidl db0779d
rework compact protocol reader
etseidl da2fb0e
Merge branch 'branch-23.10' into size_statistics
etseidl 5376a83
fix for string list
etseidl cf531cf
clean up more inadvertent overloads of field()
etseidl 971931d
clean up enums
etseidl c4126c1
refactor list functors
etseidl fae64c0
fix for int list functor
etseidl 6a50029
more list refactoring
etseidl 9725f70
move functors to cpp file. they're only used by the read methods, so
etseidl c3b6422
refactor compact protocol reader
etseidl e84baf6
use CRTP to get rid of pure virtual
etseidl fa50d62
Revert "use CRTP to get rid of pure virtual"
etseidl 28d71c0
fix get_uxx functions
etseidl a474bec
replace pure virtual read_value with std::function
etseidl 9220f35
rework implementation of the `column_orders` field in file meta data
etseidl 9f2e898
clean up
etseidl a91a196
more cleanup
etseidl fd9e3f8
Merge remote-tracking branch 'github/refactor_parquet_thrift' into si…
etseidl 4a677f6
clean up remaining single-line if statements
etseidl e70b810
Merge branch 'branch-23.10' into refactor_parquet_thrift
etseidl bf9e073
Merge remote-tracking branch 'github/refactor_parquet_thrift' into si…
etseidl eb519f2
more consts
etseidl 943be91
what can you apply apart from const...more const!
etseidl 1cfe326
clean up header
etseidl 9823f11
Merge branch 'rapidsai:branch-23.10' into refactor_parquet_thrift
etseidl a0a758f
Merge branch 'rapidsai:branch-23.10' into size_statistics
etseidl ac7b665
add FIXME
etseidl 76f16dd
Merge branch 'refactor_parquet_thrift' of github.com:etseidl/cudf int…
etseidl 4f49ef1
Merge branch 'branch-23.10' into refactor_parquet_thrift
etseidl e757616
add documentation to empty struct functor
etseidl 71e8eab
Merge branch 'rapidsai:branch-23.10' into size_statistics
etseidl 13200ff
clean up skip_struct_field some more
etseidl 16df9e5
Merge remote-tracking branch 'github/refactor_parquet_thrift' into si…
etseidl 075e11e
Merge branch 'rapidsai:branch-23.10' into size_statistics
etseidl 8a4820b
Merge branch 'rapidsai:branch-23.10' into refactor_parquet_thrift
etseidl 5700b21
convert union to enum with state
etseidl 45f3249
Merge branch 'branch-23.10' into refactor_parquet_thrift
etseidl 0ade852
use thrust::optional rather than std::optional as some fields may
etseidl aac2f33
Merge branch 'branch-23.10' into refactor_parquet_thrift
etseidl 0ae2fc4
missed a use of std::optional
etseidl f6dcb52
Merge remote-tracking branch 'github/refactor_parquet_thrift' into si…
etseidl 10df4a0
Merge remote-tracking branch 'origin/branch-23.10' into size_statistics
etseidl 15a2831
more snake case
etseidl aad3908
Merge remote-tracking branch 'origin/branch-23.10' into size_statistics
etseidl ec81897
finish merge
etseidl 9fafa22
Merge branch 'rapidsai:branch-23.10' into size_statistics
etseidl ae164b9
Merge branch 'branch-23.10' into size_statistics
etseidl 05a7fa2
Merge branch 'rapidsai:branch-23.10' into size_statistics
etseidl 3cc69eb
Merge branch 'rapidsai:branch-23.10' into size_statistics
etseidl 5ac163b
Merge branch 'branch-23.12' into size_statistics
etseidl 9680b3d
Merge branch 'branch-23.12' into size_statistics
etseidl 6854683
Merge branch 'rapidsai:branch-23.10' into size_statistics
etseidl ec1f138
Merge remote-tracking branch 'origin/branch-23.12' into size_statistics
etseidl b2a9488
Merge branch 'rapidsai:branch-23.10' into size_statistics
etseidl 348bf11
Merge branch 'rapidsai:branch-23.12' into size_statistics
etseidl 57b64ba
Merge branch 'rapidsai:branch-23.12' into size_statistics
etseidl 3b405c8
Merge branch 'rapidsai:branch-23.12' into size_statistics
etseidl 6d53225
Merge branch 'branch-23.12' into size_statistics
etseidl b53bca1
Merge branch 'branch-23.12' into size_statistics
etseidl e96b225
Merge branch 'branch-23.12' into size_statistics
etseidl 069eb3d
Merge branch 'branch-23.12' into size_statistics
etseidl 4c17cf0
Merge remote-tracking branch 'origin/branch-23.12' into size_statistics
etseidl fe6f1a8
Merge remote-tracking branch 'origin/branch-23.12' into size_statistics
etseidl 5576e6d
Merge branch 'rapidsai:branch-23.12' into size_statistics
etseidl 3334cec
Merge branch 'rapidsai:branch-23.12' into size_statistics
etseidl b111ff2
Merge branch 'rapidsai:branch-23.12' into size_statistics
etseidl e4c9911
Merge branch 'rapidsai:branch-23.12' into size_statistics
etseidl 1b0b435
Merge branch 'rapidsai:branch-23.12' into size_statistics
etseidl e311781
Merge remote-tracking branch 'origin/branch-23.12' into size_statistics
etseidl 07805bb
finish merge
etseidl df13cb7
fix alignment issue
etseidl 2223be2
Merge branch 'rapidsai:branch-23.12' into size_statistics
etseidl 551554d
Merge branch 'branch-23.12' into size_statistics
etseidl a028df8
revert some changes not related to PARQUET-2261 (https://github.com/a…
etseidl 244c369
more clean up
etseidl 03ea33a
Merge remote-tracking branch 'origin/branch-23.12' into size_statistics
etseidl 74f25d7
remove initializers for optional members
etseidl f8481c9
clean up comments
etseidl 1bc77cd
more cleanup
etseidl 1eeaee0
add test of histograms and string sizes
etseidl 6e4b9b1
use a faster histogram generator where possible
etseidl 536642c
Merge branch 'rapidsai:branch-23.12' into size_statistics
etseidl 59982b2
Merge branch 'rapidsai:branch-23.12' into size_statistics
etseidl 1393232
add some documentation
etseidl ddc5463
change signature of gen_histograms_list_col
etseidl 7826c0c
change gen_hist to a better name
etseidl b6067ac
fix up leaf nullability check
etseidl cf973f4
better way to figure out leaf nullability
etseidl 8746540
fix docstring
etseidl 43c980f
rename function
etseidl 72b9a57
add some more histogram checks
etseidl ff7ffbc
add some more stats checks
etseidl 69a081b
clean up docs, no need to pass lvl_start as it is always 1
etseidl d4a1dba
clean up another comment
etseidl a6f772d
var_bytes calculation was wrong...need valid count for each fragment
etseidl 12ed759
add num_valid to EncPage
etseidl 717d53b
remove unused aliases
etseidl 551f728
Merge remote-tracking branch 'origin/branch-23.12' into size_statistics
etseidl 1953714
Merge branch 'rapidsai:branch-23.12' into size_statistics
etseidl cf793d6
Merge remote-tracking branch 'origin/branch-23.12' into size_statistics
etseidl 90829f8
clean up some tech debt found in review
etseidl fa36c23
Merge branch 'size_statistics' of github.com:etseidl/cudf into size_s…
etseidl 3ae2289
do not rely on arithmetic conversion in if statement
etseidl 26fc762
Merge remote-tracking branch 'origin/branch-23.12' into size_statistics
etseidl 18d4939
Merge branch 'branch-23.12' into size_statistics
etseidl 7cb0524
Merge branch 'rapidsai:branch-23.12' into size_statistics
etseidl 9440cd0
Merge branch 'branch-23.12' into size_statistics
etseidl cebc8ee
Merge branch 'branch-24.02' into size_statistics
etseidl e1b427d
implement suggestion from september
etseidl 2155f18
Merge remote-tracking branch 'origin/branch-24.02' into size_statistics
etseidl 206e741
Merge branch 'branch-24.02' into size_statistics
etseidl 220704a
Merge remote-tracking branch 'origin/branch-24.02' into size_statistics
etseidl 68dcc58
Merge branch 'branch-24.02' into size_statistics
etseidl ed573ea
Merge remote-tracking branch 'origin/branch-24.02' into size_statistics
etseidl fe24d51
Merge remote-tracking branch 'origin/branch-24.02' into size_statistics
etseidl 24c73f8
no need to pass in valid_count as it is on the page now
etseidl 6687b32
fix typo
etseidl 7d2075b
Merge branch 'branch-24.02' into size_statistics
etseidl 335d0e7
clean up some histogram finishing work
etseidl 01fc44f
add helper functions to calculate num data pages
etseidl b06d515
a few fixes from review
etseidl 0dcb674
the spec removed the struct enclosing the histograms
etseidl 9d66218
rework stats encoding to use iterators
etseidl 223bfab
rework host-side histogram collection per review suggestion
etseidl 2001271
use device_uvector rather than hostdevice_vector for histograms
etseidl ef38d1e
Merge remote-tracking branch 'origin/branch-24.02' into size_statistics
etseidl 7eaad88
fix cut-and-paste error
etseidl 2c4bebb
use num_data_pages in col index size calc
etseidl b458eb6
add comment to explain 5 byte requirement
etseidl 0665183
histograms can be initialized asynchronously
etseidl c5d7e50
missed another place to use num_data_pages()
etseidl 9f859be
get rid of unnecessary casts
etseidl acef3d5
remove unused function
etseidl 6a8a02e
move num_data_pages/num_dict_pages into EncColumnChunk
etseidl 49ea23f
Merge branch 'branch-24.02' into size_statistics
etseidl 0c6b6b5
Merge branch 'branch-24.02' into size_statistics
etseidl 33ab1e9
Merge branch 'branch-24.02' into size_statistics
etseidl 36785e2
Merge branch 'rapidsai:branch-24.02' into size_statistics
etseidl 862752e
clean up some comments
etseidl 3bacb21
Merge remote-tracking branch 'origin/branch-24.02' into size_statistics
etseidl 78efd2e
cast signed to unsigned for comparison
etseidl 97f2344
Merge branch 'branch-24.02' into size_statistics
ttnghia
This is great 👍
thanks @vuule :D