Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-37212: [C++] IO: Add FromString to ::arrow::io::BufferReader #37360

Merged
merged 9 commits into from
Aug 30, 2023

Conversation

mapleFU
Copy link
Member

@mapleFU mapleFU commented Aug 24, 2023

Rationale for this change

Previously, when input an non-owned string, arrow::io::BufferReader would zero-copy it. It would cause lifetime problem. This patch add FromString to help build from std::string.

What changes are included in this PR?

  • Add a ctor for FromString(std::string), and deprecate non-owning ctors

Are these changes tested?

Yes.

Are there any user-facing changes?

Some APIs are being deprecated. Users can use the new interface.

@mapleFU mapleFU requested a review from wgtmac as a code owner August 24, 2023 14:29
@mapleFU mapleFU force-pushed the add-from-string-to-buffer-reader branch from bc7b793 to cbe761e Compare August 24, 2023 16:36
@mapleFU mapleFU requested a review from lidavidm as a code owner August 24, 2023 17:33
@mapleFU mapleFU force-pushed the add-from-string-to-buffer-reader branch 2 times, most recently from e3ebbb4 to 6b80126 Compare August 24, 2023 17:51
@mapleFU mapleFU force-pushed the add-from-string-to-buffer-reader branch from 6b80126 to f79a72f Compare August 24, 2023 17:52
@mapleFU
Copy link
Member Author

mapleFU commented Aug 24, 2023

@lidavidm I've re-create a patch here. Would you think the deprecate is ok?

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Aug 25, 2023
@@ -519,13 +519,17 @@ arrow::Result<std::shared_ptr<PreparedStatement>> PreparedStatement::ParseRespon

std::shared_ptr<Schema> dataset_schema;
if (!serialized_dataset_schema.empty()) {
io::BufferReader dataset_schema_reader(serialized_dataset_schema);
// Create a non-owned Buffer to avoid copying
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lidavidm do you think this is ok? Seems it only return ReadSchema here, I think just keep zero-copy is ok?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be OK.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My above lifetime comment does not apply here since there's no way the deserialized Schema could reference memory in the buffer we're reading; zero copy lgtm

@@ -519,13 +519,17 @@ arrow::Result<std::shared_ptr<PreparedStatement>> PreparedStatement::ParseRespon

std::shared_ptr<Schema> dataset_schema;
if (!serialized_dataset_schema.empty()) {
io::BufferReader dataset_schema_reader(serialized_dataset_schema);
// Create a non-owned Buffer to avoid copying
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be OK.

@lidavidm
Copy link
Member

CC @pitrou or @bkietz any concerns here?

@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting committer review Awaiting committer review labels Aug 28, 2023
@@ -83,7 +83,8 @@ Result<std::unique_ptr<FunctionOptions>> GenericOptionsType::Deserialize(

Result<std::unique_ptr<FunctionOptions>> DeserializeFunctionOptions(
const Buffer& buffer) {
io::BufferReader stream(buffer);
// Create a non-owned Buffer to avoid copying
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes me slightly nervous since some function options have scalar or buffer DataMembers which could be zero copied (leaving that one with a buffer which dangles because it points into this non-owning one). I can't immediately identify where they are actually copied, but I wrote this ad hoc test to check that copying does happen somewhere

TEST(TestCumulative, AdHoc) {
  CumulativeOptions options;
  options.start = MakeScalar("hello");
  ASSERT_OK_AND_ASSIGN(auto buf, options.options_type()->Serialize(options));
  ASSERT_OK_AND_ASSIGN(auto rt_start,
                       options.options_type()->Deserialize(*buf).Map([](auto rt) {
                         return *checked_cast<CumulativeOptions&>(*rt).start;
                       }));
  ASSERT_EQ(rt_start->type->id(), Type::STRING);

  auto prbuf = [](const Buffer& buf) {
    std::cout << "@" << reinterpret_cast<uintptr_t>(buf.data()) << "[" << buf.size()
              << "]\n";
  };
  prbuf(*buf);
  prbuf(*checked_cast<const StringScalar&>(*rt_start).value);
}

I think we should add (a more formal version of) this test to ensure future refactoring will never admit production of dangling function options. Alternatively we could explicitly ensure deep copying here or in GenericFromScalar

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So Options should copy and it's not heavy to do this, but schema can zero-copy?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed to BufferReader::FromString, though other way might be better but I think FunctionOptions would not be too large, so maybe FromString it is ok?

@@ -519,13 +519,17 @@ arrow::Result<std::shared_ptr<PreparedStatement>> PreparedStatement::ParseRespon

std::shared_ptr<Schema> dataset_schema;
if (!serialized_dataset_schema.empty()) {
io::BufferReader dataset_schema_reader(serialized_dataset_schema);
// Create a non-owned Buffer to avoid copying
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My above lifetime comment does not apply here since there's no way the deserialized Schema could reference memory in the buffer we're reading; zero copy lgtm

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting merge Awaiting merge labels Aug 28, 2023
@github-actions github-actions bot removed the awaiting changes Awaiting changes label Aug 28, 2023
@github-actions github-actions bot added the awaiting change review Awaiting change review label Aug 28, 2023
@mapleFU mapleFU requested a review from bkietz August 29, 2023 02:06
@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting change review Awaiting change review labels Aug 29, 2023
cpp/src/arrow/io/memory.h Outdated Show resolved Hide resolved
io::BufferReader stream(buffer);
ARROW_ASSIGN_OR_RAISE(auto reader, ipc::RecordBatchFileReader::Open(&stream));
// Copying the buffer here is not ideal, but we need to do it to avoid
// lifetime issues with the zero-copy buffer read.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which lifetime issues?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change to use-after-free here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But which use-after-free issues exactly?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mentioned here #37360 (comment)

Seems that FunctionOptions might cannot gurantee that it doesn't has Buffer member. And it might bound to ::arrow::Buffer, which cause use-after-free. Should I make it more clear?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we can simply switch DeserializeFunctionOptions to take a std::shared_ptr<Buffer>? The only place where it's used is PyArrow.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can separate a new Patch to do this. I think currently the options should not be too large, so copying it is not heavy.

Taking a std::shared_ptr<Buffer> might be better. But GenericOptionsType::Deserialize also take a const Buffer&. I wonder if it's ok to replace it as well. ( Since it's an ARROW_EXPORT class )

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, ok. We can revisit later if needed.

@mapleFU mapleFU force-pushed the add-from-string-to-buffer-reader branch from 093f1b6 to 30b4cca Compare August 30, 2023 08:22
@pitrou pitrou merged commit 9639e52 into apache:main Aug 30, 2023
42 checks passed
@pitrou pitrou removed the awaiting merge Awaiting merge label Aug 30, 2023
@github-actions github-actions bot added the awaiting committer review Awaiting committer review label Aug 30, 2023
@mapleFU mapleFU deleted the add-from-string-to-buffer-reader branch August 30, 2023 09:45
pitrou pushed a commit that referenced this pull request Aug 31, 2023
### Rationale for this change

GH-37360 deprecated `BufferReader(const uint8_t*, int64_t)`.

### What changes are included in this PR?

Use `BufferReader(std::shared_ptr<Buffer>)` instead.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.
* Closes: #37485

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
@conbench-apache-arrow
Copy link

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 9639e52.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about possible false positives for unstable benchmarks that are known to sometimes produce them.

loicalleyne pushed a commit to loicalleyne/arrow that referenced this pull request Nov 13, 2023
…apache#37360)

### Rationale for this change

Previously, when input an non-owned string, `arrow::io::BufferReader` would zero-copy it. It would cause lifetime problem. This patch add `FromString` to help build from `std::string`.

### What changes are included in this PR?

* Add a ctor for FromString(std::string), and deprecate non-owning ctors

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Some APIs are being deprecated. Users can use the new interface.

* Closes: apache#37212

Authored-by: mwish <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
loicalleyne pushed a commit to loicalleyne/arrow that referenced this pull request Nov 13, 2023
…pache#37486)

### Rationale for this change

apacheGH-37360 deprecated `BufferReader(const uint8_t*, int64_t)`.

### What changes are included in this PR?

Use `BufferReader(std::shared_ptr<Buffer>)` instead.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.
* Closes: apache#37485

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
raulcd pushed a commit that referenced this pull request Nov 23, 2023
…structor (#38721)

### Rationale for this change

The PR [GH-37360](#37360) for issue [GH-37212](#37212) deprecated three BufferReader constructors and I noticed one of them was missing a `\deprecated` command.

### What changes are included in this PR?

This adds a `\deprecated` command one of the deprecated constructors with a message. The other two didn't need it because they weren't already documented. i.e., they weren't listed under https://arrow.apache.org/docs/cpp/api/io.html so documenting them at this point just to add `\deprecated` wouldn't make sense.

### Are these changes tested?

No, this is a simple docs change.

### Are there any user-facing changes?

Yes, this adds a notice to the docs for this particular method.

Authored-by: Bryce Mecum <[email protected]>
Signed-off-by: Raúl Cumplido <[email protected]>
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…apache#37360)

### Rationale for this change

Previously, when input an non-owned string, `arrow::io::BufferReader` would zero-copy it. It would cause lifetime problem. This patch add `FromString` to help build from `std::string`.

### What changes are included in this PR?

* Add a ctor for FromString(std::string), and deprecate non-owning ctors

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Some APIs are being deprecated. Users can use the new interface.

* Closes: apache#37212

Authored-by: mwish <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…pache#37486)

### Rationale for this change

apacheGH-37360 deprecated `BufferReader(const uint8_t*, int64_t)`.

### What changes are included in this PR?

Use `BufferReader(std::shared_ptr<Buffer>)` instead.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.
* Closes: apache#37485

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…structor (apache#38721)

### Rationale for this change

The PR [apacheGH-37360](apache#37360) for issue [apacheGH-37212](apache#37212) deprecated three BufferReader constructors and I noticed one of them was missing a `\deprecated` command.

### What changes are included in this PR?

This adds a `\deprecated` command one of the deprecated constructors with a message. The other two didn't need it because they weren't already documented. i.e., they weren't listed under https://arrow.apache.org/docs/cpp/api/io.html so documenting them at this point just to add `\deprecated` wouldn't make sense.

### Are these changes tested?

No, this is a simple docs change.

### Are there any user-facing changes?

Yes, this adds a notice to the docs for this particular method.

Authored-by: Bryce Mecum <[email protected]>
Signed-off-by: Raúl Cumplido <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[C++] BufferReader should zero-copy if and only if given a Buffer, not a pointer and length
5 participants