[FEA] CSV option to strip trailing white space after a quoted field. #13892

Closed
revans2 opened this issue Aug 16, 2023 · 0 comments · Fixed by #15727
Labels
0 - Backlog (In queue waiting for assignment), cuIO (cuIO issue), feature request (New feature or request), libcudf (Affects libcudf (C++/CUDA) code), Spark (Functionality that helps Spark RAPIDS)

revans2 commented Aug 16, 2023

Is your feature request related to a problem? Please describe.
By default, Spark strips out trailing white space that appears after a quoted value.

So if I have a CSV file with something like

"A"    ,"B"

Spark will strip out the trailing white space and treat it just like it was

"A","B"

CUDF does not do this. Instead, when it sees white space after the closing quote, it treats the entry as if it were not quoted at all.

CUDF produces "A" as the value returned, but Spark produces just A.
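
To make the difference concrete, here is a minimal Python sketch of the two behaviors on the example row. It is only a model of what is described above, not the Spark or CUDF implementation; `parse_field` and `strip_after_quote` are hypothetical names, and the naive comma split assumes no field contains a comma.

```python
def parse_field(raw: str, strip_after_quote: bool) -> str:
    """Model one CSV field (hypothetical helper, not a cuDF or Spark API)."""
    # Optionally ignore white space that follows the closing quote.
    candidate = raw.rstrip(" \t") if strip_after_quote else raw
    # A field counts as quoted only if it starts and ends with a quote.
    if len(candidate) >= 2 and candidate[0] == '"' and candidate[-1] == '"':
        return candidate[1:-1]
    # Otherwise the field is returned untouched, quotes included.
    return raw


row = '"A"    ,"B"'
fields = row.split(",")  # naive split; assumes no commas inside fields

spark_like = [parse_field(f, strip_after_quote=True) for f in fields]
cudf_like = [parse_field(f, strip_after_quote=False) for f in fields]

print(spark_like)
print(cudf_like)
```

In this model the Spark-like path yields `A` for the first field, while the CUDF-like path keeps the quotes because the field no longer ends with a quote character.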

Describe the solution you'd like
I would love a config flag that would let us do this automatically.

Describe alternatives you've considered
We could also do something similar to what is happening with JSON, where we can ask for the string data to be returned with quotes intact so that we can handle cleaning it up ourselves. But I am not sure how that might interact with escaping, so ideally we would just go with the first option.

revans2 added the "feature request", "Needs Triage", and "Spark" labels on Aug 16, 2023
GregoryKimball added the "0 - Backlog", "libcudf", and "cuIO" labels and removed the "Needs Triage" label on Aug 18, 2023
rapids-bot bot pushed a commit that referenced this issue May 22, 2024
This PR adds an option to CSV parsing to detect quotes even if they are surrounded by whitespace.

Current behavior when `options.keepquotes == false`:
- `"A"` ->  `A`
- `  "A"  ` -> `  "A"  ` (The spaces around the 'A' are not removed and the quotes are kept)

New behavior after enabling the new option:
- `"A"` -> `A`
- `  "A"  ` -> `A`

The new option is false by default to avoid breaking any code that relied on the old behavior.
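
The before/after behavior listed above can be sketched in a few lines of Python. This is an illustrative model only, assuming `options.keepquotes == false`; the function name `unquote_field` and the flag name `detect_quotes_in_whitespace` are hypothetical and are not the names used in the PR. Leaving unquoted fields untouched is also an assumption, and escaped quotes are not handled.

```python
def unquote_field(raw: str, detect_quotes_in_whitespace: bool = False) -> str:
    """Model of the quote handling described above (hypothetical names)."""
    # With the flag on, look for quotes after ignoring surrounding white space.
    candidate = raw.strip(" \t") if detect_quotes_in_whitespace else raw
    if len(candidate) >= 2 and candidate.startswith('"') and candidate.endswith('"'):
        # Quoted field: drop the quotes (and, with the flag on, the padding).
        return candidate[1:-1]
    # Not recognized as quoted: return the field exactly as it was read.
    return raw


for field in ['"A"', '  "A"  ']:
    old = unquote_field(field)  # flag off, the default
    new = unquote_field(field, detect_quotes_in_whitespace=True)
    print(repr(field), "->", repr(old), "(old)", repr(new), "(new)")
```

Defaulting the flag to off mirrors the choice in the PR: callers that relied on padded, quoted fields being returned verbatim see no change unless they opt in.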

Closes #13892.

Authors:
  - Mohamed Thabet (https://github.com/thabetx)
  - Shruti Shivakumar (https://github.com/shrshi)

Approvers:
  - Shruti Shivakumar (https://github.com/shrshi)
  - Lawrence Mitchell (https://github.com/wence-)

URL: #15727