Handling '\u0000' in Parquet Data Causes Error: Unsupported Unicode Escape Sequence #188

Open
2 tasks done
kysshsy opened this issue Jan 2, 2025 · 2 comments
Labels
bug (Something isn't working), priority-medium (Medium priority issue), user-request (This issue was directly requested by a user)

Comments

@kysshsy
Contributor

kysshsy commented Jan 2, 2025

What happens?

Querying data that contains '\u0000' fails with the error "unsupported Unicode escape sequence":

pg_analytics=# select * from tulu_3_sft_mixture1 ;
ERROR:  unsupported Unicode escape sequence
DETAIL:  \u0000 cannot be converted to text.
CONTEXT:  JSON data, line 1: ...产生一种隐约的紧张感。因此,\u0000...
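
For reference, the error does not appear to be specific to Parquet. The following is a minimal sketch, assuming a stock PostgreSQL server with default settings (standard_conforming_strings on); the key name "content" is made up for illustration. It shows the same message when a JSON string containing \u0000 is converted to jsonb or text, while the json type accepts the input because it stores the text without decoding escapes.

```sql
-- Fails: jsonb decodes escapes on input, and \u0000 cannot be represented in text.
SELECT '{"content": "a\u0000b"}'::jsonb;

-- Succeeds: json keeps the input text as-is and does not decode the escape.
SELECT '{"content": "a\u0000b"}'::json;

-- Fails again: extracting the value as text forces the escape to be decoded.
SELECT '{"content": "a\u0000b"}'::json ->> 'content';
```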

To Reproduce

After #187 is merged, load a dataset from Hugging Face and query it.

OS:

x86

ParadeDB Version:

v0.2.4

Are you using ParadeDB Docker, Helm, or the extension(s) standalone?

ParadeDB pg_analytics Extension

Full Name:

kysshsy

Affiliation:

NA

Did you include all relevant data sets for reproducing the issue?

Yes

Did you include the code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configurations (e.g., CPU architecture, PostgreSQL version, Linux distribution) to reproduce the issue?

  • Yes, I have
@kysshsy kysshsy added the bug Something isn't working label Jan 2, 2025
@philippemnoel philippemnoel added priority-medium Medium priority issue user-request This issue was directly requested by a user labels Jan 2, 2025
@kysshsy
Contributor Author

kysshsy commented Jan 18, 2025

It seems that Postgres does not support \u0000 in the JSONB and TEXT types.

However, there is a workaround: convert the column to the JSON type, or specify JSON as the column type when creating the table.
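
A rough sketch of the second option, declaring the column as json when the foreign table is created. Everything named here is an assumption for illustration: the server name parquet_server, the table name messages_raw, the column name messages, and the file path are placeholders, not taken from this issue, and the server is assumed to already exist.

```sql
-- Hypothetical names throughout; adjust to the actual server, table, and file.
CREATE FOREIGN TABLE messages_raw (
    messages json   -- json instead of jsonb, so \u0000 escapes are kept verbatim
)
SERVER parquet_server
OPTIONS (files '/path/to/data.parquet');

-- Reading the column as json should avoid the error. Converting values that
-- contain \u0000 to text (for example with ->> or ::text) would raise it again.
SELECT messages FROM messages_raw LIMIT 1;
```

The trade-off is that json is stored as plain text, so jsonb-only operators and indexing are not available on that column.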

@philippemnoel
Collaborator

It seems that Postgres does not support \u0000 in the JSONB and TEXT types.

However, there is a workaround: convert the column to the JSON type, or specify JSON as the column type when creating the table.

That sounds like a good workaround to me.
