[WIP][SPARK-50323][CONNECT][PYTHON] Add missing schema check for createDataFrame from numpy ndarray on Spark Connect #48887

Draft
wants to merge 4 commits into
base: master

xinrong-meng
Member

@xinrong-meng xinrong-meng commented Nov 19, 2024

What changes were proposed in this pull request?

Add missing schema check for createDataFrame from numpy ndarray on Spark Connect

Why are the changes needed?

Currently, the conversion from a numpy ndarray to a pa.Table on Spark Connect does not take the user-provided schema into account at all.

Handling the schema separately for the ndarray -> Arrow path would add complexity and could introduce inconsistencies with pandas DataFrame behavior. In Spark Classic, the conversion goes ndarray -> pdf -> Arrow.

To keep the behavior consistent and the code simple, Spark Connect now follows the same ndarray -> pdf -> Arrow path.

Does this PR introduce any user-facing change?

Yes. Schema checking and verification for createDataFrame from a numpy ndarray on Spark Connect now align with Spark Classic.

How was this patch tested?

Unit tests.

Was this patch authored or co-authored using generative AI tooling?

No.

@xinrong-meng xinrong-meng changed the title from "[SPARK-50323][CONNECT][PYTHON] Add missing schema check for createDataFrame from numpy ndarray" to "[SPARK-50323][CONNECT][PYTHON] Add missing schema check for createDataFrame from numpy ndarray on Spark Connect" Nov 19, 2024
@zhengruifeng
Contributor

My concern is that the pandas conversion is outside our control. What about introducing a similar check in the numpy -> pyarrow path?

@xinrong-meng xinrong-meng changed the title from "[SPARK-50323][CONNECT][PYTHON] Add missing schema check for createDataFrame from numpy ndarray on Spark Connect" to "[WIP][SPARK-50323][CONNECT][PYTHON] Add missing schema check for createDataFrame from numpy ndarray on Spark Connect" Nov 20, 2024
@xinrong-meng xinrong-meng marked this pull request as draft November 20, 2024 01:53