[MINOR][DOCS] Add a migration guide for encode/decode unmappable characters

### What changes were proposed in this pull request?

Add a migration guide for encode/decode unmappable characters

### Why are the changes needed?

Provide an upgrade guide for users.

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
passing doc build

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #49058 from yaooqinn/minor.

Authored-by: Kent Yao <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
yaooqinn authored and MaxGekk committed Dec 4, 2024
1 parent 3d063a0 commit 74c3757
Showing 1 changed file with 1 addition and 0 deletions: docs/sql-migration-guide.md
@@ -33,6 +33,7 @@ license: |
- Since Spark 4.0, `spark.sql.parquet.compression.codec` drops support for the codec name `lz4raw`; please use `lz4_raw` instead.
- Since Spark 4.0, when an overflow occurs while casting a timestamp to byte/short/int under non-ANSI mode, Spark returns null instead of a wrapped value.
- Since Spark 4.0, the `encode()` and `decode()` functions support only the following charsets: 'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-16LE', 'UTF-16', 'UTF-32'. To restore the previous behavior, in which the functions accept any charset supported by the JDK used by Spark, set `spark.sql.legacy.javaCharsets` to `true`.
- Since Spark 4.0, the `encode()` and `decode()` functions raise a `MALFORMED_CHARACTER_CODING` error when handling unmappable characters, whereas in Spark 3.5 and earlier these characters were replaced with mojibake. To restore the previous behavior, set `spark.sql.legacy.codingErrorAction` to `true`. For example, if you `decode` the value `tést` / [116, -23, 115, 116] (encoded in Latin-1) with 'UTF-8' under the previous behavior, you get `t�st`.
- Since Spark 4.0, the legacy datetime rebasing SQL configs with the prefix `spark.sql.legacy` are removed. To restore the previous behavior, use the following configs:
- `spark.sql.parquet.int96RebaseModeInWrite` instead of `spark.sql.legacy.parquet.int96RebaseModeInWrite`
- `spark.sql.parquet.datetimeRebaseModeInWrite` instead of `spark.sql.legacy.parquet.datetimeRebaseModeInWrite`
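Below is a minimal sketch of the behavior change described by the new entry, assuming a Spark 4.0 `spark-sql` session. The hex literal `X'74E97374'` is the byte sequence [116, -23, 115, 116] (`tést` encoded in Latin-1) from the example above; the exact error message text is illustrative.

```sql
-- Spark 4.0 default: decoding unmappable bytes fails.
SELECT decode(X'74E97374', 'UTF-8');
-- raises a MALFORMED_CHARACTER_CODING error (exact message may vary)

-- Restore the Spark 3.5 behavior: unmappable bytes are replaced instead of failing.
SET spark.sql.legacy.codingErrorAction=true;
SELECT decode(X'74E97374', 'UTF-8');
-- t�st
```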
