GH-37212: [C++] IO: Add FromString to ::arrow::io::BufferReader #37360
@lidavidm I've re-created the patch here. Do you think the deprecation is OK?
@@ -519,13 +519,17 @@ arrow::Result<std::shared_ptr<PreparedStatement>> PreparedStatement::ParseRespon

  std::shared_ptr<Schema> dataset_schema;
  if (!serialized_dataset_schema.empty()) {
    io::BufferReader dataset_schema_reader(serialized_dataset_schema);
    // Create a non-owned Buffer to avoid copying
@lidavidm do you think this is OK? It seems this only returns the result of ReadSchema here, so I think keeping it zero-copy is fine?
Should be OK.
My lifetime comment (on DeserializeFunctionOptions) does not apply here, since there is no way the deserialized Schema could reference memory in the buffer we're reading; zero copy LGTM.
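To illustrate the point that a borrowed input is harmless when the parsed result owns all of its data, here is a minimal sketch in plain C++. It does not use Arrow: `ParsedSchema` and `ParseSchema` are hypothetical names, and the comma-separated format is an assumption made only for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a deserialized schema: it owns its field names.
struct ParsedSchema {
  std::vector<std::string> field_names;
};

// Parses a comma-separated list of names from a borrowed byte range.
// Every name is copied into a std::string, so the result never points
// back into `data` and may outlive it.
ParsedSchema ParseSchema(const char* data, std::size_t size) {
  ParsedSchema schema;
  std::string current;
  for (std::size_t i = 0; i < size; ++i) {
    if (data[i] == ',') {
      schema.field_names.push_back(current);
      current.clear();
    } else {
      current.push_back(data[i]);
    }
  }
  // Flush the last name (an empty input yields no fields at all).
  if (size > 0) schema.field_names.push_back(current);
  return schema;
}
```

Because the parser copies what it keeps, the caller can drop the serialized bytes as soon as `ParseSchema` returns, which is the property that makes the zero-copy read safe in this case.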
@@ -83,7 +83,8 @@ Result<std::unique_ptr<FunctionOptions>> GenericOptionsType::Deserialize(

Result<std::unique_ptr<FunctionOptions>> DeserializeFunctionOptions(
    const Buffer& buffer) {
  io::BufferReader stream(buffer);
  // Create a non-owned Buffer to avoid copying
This makes me slightly nervous, since some function options have scalar or buffer DataMembers which could be zero-copied (leaving such a member with a dangling buffer, because it points into this non-owning one). I can't immediately identify where they are actually copied, but I wrote this ad hoc test to check that copying does happen somewhere:
TEST(TestCumulative, AdHoc) {
  CumulativeOptions options;
  options.start = MakeScalar("hello");
  // Round-trip the options through Serialize/Deserialize.
  ASSERT_OK_AND_ASSIGN(auto buf, options.options_type()->Serialize(options));
  ASSERT_OK_AND_ASSIGN(auto rt_start,
                       options.options_type()->Deserialize(*buf).Map([](auto rt) {
                         return *checked_cast<CumulativeOptions&>(*rt).start;
                       }));
  ASSERT_EQ(rt_start->type->id(), Type::STRING);
  // Print the address and size of a buffer so the two can be compared by eye.
  auto prbuf = [](const Buffer& buf) {
    std::cout << "@" << reinterpret_cast<uintptr_t>(buf.data()) << "[" << buf.size()
              << "]\n";
  };
  prbuf(*buf);
  prbuf(*checked_cast<const StringScalar&>(*rt_start).value);
}
I think we should add (a more formal version of) this test to ensure that future refactoring never allows dangling function options to be produced. Alternatively, we could explicitly ensure deep copying here or in GenericFromScalar.
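One way a more formal version of that check could be expressed is as an address-range comparison instead of printing addresses. The following is only a sketch in plain C++ without Arrow types; `RangesOverlap` is a hypothetical helper, not an existing Arrow utility.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Returns true if the byte range [inner, inner + inner_size) overlaps
// [outer, outer + outer_size). Empty ranges never overlap anything.
// Addresses are compared as integers, as in the ad hoc test.
inline bool RangesOverlap(const void* outer, std::size_t outer_size,
                          const void* inner, std::size_t inner_size) {
  if (outer_size == 0 || inner_size == 0) return false;
  const auto outer_begin = reinterpret_cast<std::uintptr_t>(outer);
  const auto inner_begin = reinterpret_cast<std::uintptr_t>(inner);
  return inner_begin < outer_begin + outer_size &&
         outer_begin < inner_begin + inner_size;
}
```

A test along these lines would assert that the round-tripped value's buffer does not overlap the serialized buffer, so a future change that started zero-copying into the result would fail the assertion rather than go unnoticed.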
So Options should copy, since copying is not expensive here, but Schema can be zero-copy?
I've changed this to BufferReader::FromString. Another approach might be better, but I think FunctionOptions will not be too large, so maybe FromString is OK here?
  io::BufferReader stream(buffer);
  ARROW_ASSIGN_OR_RAISE(auto reader, ipc::RecordBatchFileReader::Open(&stream));
  // Copying the buffer here is not ideal, but we need to do it to avoid
  // lifetime issues with the zero-copy buffer read.
Which lifetime issues?
Changed to "use-after-free" here.
But which use-after-free issues exactly?
Mentioned here: #37360 (comment)
It seems FunctionOptions cannot guarantee that it has no Buffer member, and such a member might point into the ::arrow::Buffer being read, which would cause a use-after-free. Should I make the comment clearer?
Perhaps we can simply switch DeserializeFunctionOptions to take a std::shared_ptr<Buffer>? The only place where it's used is PyArrow.
I can open a separate patch to do this. I think the options are currently not too large, so copying them is not expensive.
Taking a std::shared_ptr<Buffer> might be better, but GenericOptionsType::Deserialize also takes a const Buffer&. I wonder if it's OK to replace that as well (since it's an ARROW_EXPORT class).
Hmm, ok. We can revisit later if needed.
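The two ownership options discussed in this thread (copying the bytes into the reader, or sharing ownership of an existing buffer) can be sketched in plain C++ as follows. This is a simplified model and not Arrow's implementation: `OwningReader` is a hypothetical class, and `std::string` stands in for an Arrow buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>

// Hypothetical reader that always owns (or co-owns) the bytes it reads.
class OwningReader {
 public:
  // Share ownership with the caller: no copy, and the bytes live as long
  // as any holder of the shared_ptr.
  explicit OwningReader(std::shared_ptr<const std::string> data)
      : data_(std::move(data)), pos_(0) {}

  // Take the string by value: the caller either copies or moves it in.
  static std::unique_ptr<OwningReader> FromString(std::string data) {
    return std::unique_ptr<OwningReader>(new OwningReader(
        std::make_shared<const std::string>(std::move(data))));
  }

  // Returns up to `n` bytes starting at the current position.
  std::string Read(std::size_t n) {
    std::string out = data_->substr(pos_, n);
    pos_ += out.size();
    return out;
  }

 private:
  std::shared_ptr<const std::string> data_;
  std::size_t pos_;
};
```

In this model, `FromString` costs one copy (or a move) of the payload, which is acceptable for small inputs, while the `shared_ptr` constructor avoids the copy at the price of changing what the caller must pass in.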
### Rationale for this change
GH-37360 deprecated `BufferReader(const uint8_t*, int64_t)`.
### What changes are included in this PR?
Use `BufferReader(std::shared_ptr<Buffer>)` instead.
### Are these changes tested?
Yes.
### Are there any user-facing changes?
No.
* Closes: #37485
Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 9639e52. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about possible false positives for unstable benchmarks that are known to sometimes produce them.
…apache#37360)
### Rationale for this change
Previously, when given a non-owned string, `arrow::io::BufferReader` would zero-copy it, which could cause lifetime problems. This patch adds `FromString` to help build a reader from a `std::string`.
### What changes are included in this PR?
* Add a ctor for FromString(std::string), and deprecate non-owning ctors
### Are these changes tested?
Yes.
### Are there any user-facing changes?
Some APIs are being deprecated. Users can use the new interface.
* Closes: apache#37212
Authored-by: mwish <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
…structor (#38721)
### Rationale for this change
The PR [GH-37360](#37360) for issue [GH-37212](#37212) deprecated three BufferReader constructors, and I noticed one of them was missing a `\deprecated` command.
### What changes are included in this PR?
This adds a `\deprecated` command, with a message, to one of the deprecated constructors. The other two didn't need it because they weren't already documented, i.e., they weren't listed under https://arrow.apache.org/docs/cpp/api/io.html, so documenting them at this point just to add `\deprecated` wouldn't make sense.
### Are these changes tested?
No, this is a simple docs change.
### Are there any user-facing changes?
Yes, this adds a notice to the docs for this particular method.
Authored-by: Bryce Mecum <[email protected]>
Signed-off-by: Raúl Cumplido <[email protected]>
Rationale for this change
Previously, when given a non-owned string, arrow::io::BufferReader would zero-copy it, which could cause lifetime problems. This patch adds FromString to help build a reader from a std::string.
What changes are included in this PR?
Add FromString(std::string) and deprecate the non-owning constructors.
Are these changes tested?
Yes.
Are there any user-facing changes?
Some APIs are being deprecated. Users can use the new interface.