[WIP][SPARK-50323][CONNECT][PYTHON] Add missing schema check for createDataFrame from numpy ndarray on Spark Connect #48887

xinrong-meng · 2024-11-19T06:14:51Z

What changes were proposed in this pull request?

Add missing schema check for createDataFrame from numpy ndarray on Spark Connect

Why are the changes needed?

Currently, on Spark Connect, the conversion from ndarray to pa.Table doesn't consider the schema at all.

Handling the schema separately for ndarray -> Arrow would add complexity and could introduce inconsistencies with pandas DataFrame behavior. In Spark Classic, the process is ndarray -> pdf -> Arrow.

To maintain consistency and simplicity, this PR follows the same ndarray -> pdf -> Arrow approach in Spark Connect.

Does this PR introduce any user-facing change?

Yes. Schema checking and verification for createDataFrame from a numpy ndarray on Spark Connect now align with Spark Classic.

How was this patch tested?

Unit tests.

Was this patch authored or co-authored using generative AI tooling?

No.

This reverts commit 244d833.

zhengruifeng · 2024-11-19T10:38:43Z

My concern is that the pandas conversion is outside our control. What about introducing a similar check in numpy -> pyarrow instead?

xinrong-meng added 3 commits November 19, 2024 14:02

try 1

244d833

Revert "try 1"

0f94d5f

This reverts commit 244d833.

to pdf

e9e488b

github-actions bot added SQL PYTHON CONNECT labels Nov 19, 2024

refactor test

fa86f0e

xinrong-meng changed the title ~~[SPARK-50323][CONNECT][PYTHON] Add missing schema check for createDataFrame from numpy ndarray~~ [SPARK-50323][CONNECT][PYTHON] Add missing schema check for createDataFrame from numpy ndarray on Spark Connect Nov 19, 2024

xinrong-meng changed the title ~~[SPARK-50323][CONNECT][PYTHON] Add missing schema check for createDataFrame from numpy ndarray on Spark Connect~~ [WIP][SPARK-50323][CONNECT][PYTHON] Add missing schema check for createDataFrame from numpy ndarray on Spark Connect Nov 20, 2024

xinrong-meng marked this pull request as draft November 20, 2024 01:53
