[SPARK-42859][CONNECT][PS] Basic support for pandas API on Spark Connect
### What changes were proposed in this pull request?

This PR proposes to support pandas API on Spark for Spark Connect.

It includes the minimal changes needed to support basic functionality of the pandas API in Spark Connect, and sets up a testing environment in `pyspark/pandas/tests/connect` that reuses all existing pandas API on Spark test bases to exercise the pandas API on Spark in a remote Spark session.

Here is a summary of the key tasks:

1. All pandas-on-Spark tests under the `python/pyspark/pandas/tests/` directory can now be run in Spark Connect by adding corresponding tests to the `python/pyspark/pandas/tests/connect/` directory.
2. Unlike with Spark SQL, we did not create a separate package directory such as `python/pyspark/sql/connect` for Spark Connect, so I modified the existing files of `pyspark.pandas`. This allows users to use the existing pandas-on-Spark code as it is on Spark Connect.
3. Because of item 2, I added two typing rules to `python/pyspark/pandas/_typing.py` so that a single code path can handle both PySpark and Spark Connect objects.
   - Added `GenericColumn` for typing both PySpark Column and Spark Connect Column.
   - Added `GenericDataFrame` for typing both PySpark DataFrame and Spark Connect DataFrame.

### Why are the changes needed?

Supporting the pandas API in Spark Connect can significantly improve usability for existing PySpark and pandas users.

### Does this PR introduce _any_ user-facing change?

No. It is designed so that existing code written for regular Spark sessions can be used without any user-facing changes other than switching the regular Spark session to a remote Spark session. However, some features of the existing pandas API on Spark are not fully supported yet, so some functionality may be limited.

### How was this patch tested?

A test bed has been set up to reproduce all existing pandas-on-Spark tests for Spark Connect, ensuring that the existing tests can be replicated in Spark Connect.
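As a rough illustration of the parity layout described above, the sketch below writes test bodies once in a mixin and runs them against two session flavors, skipping a case that the remote flavor does not support yet. Every name here (`FakeClassicSession`, `FakeRemoteSession`, `SeriesTestsMixin`, `SeriesTests`, `SeriesParityTests`, `run_case`) is a hypothetical stand-in chosen for this example, and plain `unittest` is used instead of the real PySpark test bases, so this is not the code from this PR.

```python
import unittest


class FakeClassicSession:
    """Hypothetical stand-in for a regular Spark session."""

    def total(self, values):
        return sum(values)

    def rolling_sum(self, values, window):
        # Sum over each full window of the given size.
        return [sum(values[i - window + 1 : i + 1]) for i in range(window - 1, len(values))]


class FakeRemoteSession:
    """Hypothetical stand-in for a remote (Spark Connect) session."""

    def total(self, values):
        return sum(values)

    def rolling_sum(self, values, window):
        # Assumption for this sketch: the feature is not supported remotely yet.
        raise NotImplementedError("rolling_sum is not supported in this sketch")


class SeriesTestsMixin:
    """Test bodies written once, unaware of which session type is in use."""

    def test_total(self):
        self.assertEqual(self.session.total([1, 2, 3]), 6)

    def test_rolling_sum(self):
        self.assertEqual(self.session.rolling_sum([1, 2, 3, 4], 2), [3, 5, 7])


class SeriesTests(SeriesTestsMixin, unittest.TestCase):
    """Existing-style test class bound to the regular session."""

    def setUp(self):
        self.session = FakeClassicSession()


class SeriesParityTests(SeriesTestsMixin, unittest.TestCase):
    """Parity class: same test bodies, bound to the remote session."""

    def setUp(self):
        self.session = FakeRemoteSession()

    @unittest.skip("Hypothetical reason: not yet supported on the remote session.")
    def test_rolling_sum(self):
        super().test_rolling_sum()


def run_case(case):
    """Run one TestCase class and return its unittest.TestResult."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(case)
    result = unittest.TestResult()
    suite.run(result)
    return result


if __name__ == "__main__":
    for case in (SeriesTests, SeriesParityTests):
        r = run_case(case)
        print(case.__name__, "run:", r.testsRun, "skipped:", len(r.skipped))
```

In a layout like this, the gap between the total and passed counts for a parity file corresponds to the cases that are skipped or still failing on the remote session.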
The current results for all tests are as follows:

| Test file | Test total | Test passed | Coverage |
| --------------------------------------------------- | ---------- | ----------- | -------- |
| test_parity_dataframe.py | 105 | 85 | 80.95% |
| test_parity_dataframe_slow.py | 66 | 48 | 72.73% |
| test_parity_dataframe_conversion.py | 11 | 11 | 100.00% |
| test_parity_dataframe_spark_io.py | 8 | 7 | 87.50% |
| test_parity_ops_on_diff_frames.py | 75 | 75 | 100.00% |
| test_parity_series.py | 131 | 104 | 79.39% |
| test_parity_series_datetime.py | 41 | 34 | 82.93% |
| test_parity_categorical.py | 29 | 22 | 75.86% |
| test_parity_config.py | 7 | 7 | 100.00% |
| test_parity_csv.py | 18 | 18 | 100.00% |
| test_parity_default_index.py | 4 | 1 | 25.00% |
| test_parity_ewm.py | 3 | 1 | 33.33% |
| test_parity_expanding.py | 22 | 2 | 9.09% |
| test_parity_extention.py | 7 | 7 | 100.00% |
| test_parity_frame_spark.py | 6 | 2 | 33.33% |
| test_parity_generic_functions.py | 4 | 1 | 25.00% |
| test_parity_groupby.py | 49 | 36 | 73.47% |
| test_parity_groupby_slow.py | 205 | 147 | 71.71% |
| test_parity_indexing.py | 3 | 3 | 100.00% |
| test_parity_indexops_spark.py | 3 | 3 | 100.00% |
| test_parity_internal.py | 1 | 0 | 0.00% |
| test_parity_namespace.py | 29 | 26 | 89.66% |
| test_parity_numpy_compat.py | 6 | 4 | 66.67% |
| test_parity_ops_on_diff_frames_groupby.py | 22 | 13 | 59.09% |
| test_parity_ops_on_diff_frames_groupby_expanding.py | 7 | 0 | 0.00% |
| test_parity_ops_on_diff_frames_groupby_rolling.py | 7 | 0 | 0.00% |
| test_parity_ops_on_diff_frames_slow.py | 22 | 15 | 68.18% |
| test_parity_repr.py | 5 | 5 | 100.00% |
| test_parity_resample.py | 5 | 3 | 60.00% |
| test_parity_reshape.py | 10 | 8 | 80.00% |
| test_parity_rolling.py | 21 | 1 | 4.76% |
| test_parity_scalars.py | 1 | 1 | 100.00% |
| test_parity_series_conversion.py | 2 | 2 | 100.00% |
| test_parity_series_string.py | 56 | 55 | 98.21% |
| test_parity_spark_functions.py | 1 | 1 | 100.00% |
| test_parity_sql.py | 7 | 4 | 57.14% |
| test_parity_stats.py | 15 | 7 | 46.67% |
| test_parity_typedef.py | 10 | 10 | 100.00% |
| test_parity_utils.py | 5 | 5 | 100.00% |
| test_parity_window.py | 2 | 2 | 100.00% |
| test_parity_frame_plot.py | 7 | 5 | 71.43% |
| plot/test_parity_frame_plot_matplotlib.py | 13 | 11 | 84.62% |
| plot/test_parity_frame_plot_plotly.py | 12 | 9 | 75.00% |
| plot/test_parity_series_plot.py | 3 | 3 | 100.00% |
| plot/test_parity_series_plot_matplotlib.py | 14 | 8 | 57.14% |
| plot/test_parity_series_plot_plotly.py | 9 | 7 | 77.78% |
| indexes/test_parity_base.py | 144 | 75 | 52.08% |
| indexes/test_parity_category.py | 16 | 7 | 43.75% |
| indexes/test_parity_datetime.py | 13 | 11 | 84.62% |
| indexes/test_parity_timedelta.py | 2 | 1 | 50.00% |
| data_type_ops/test_parity_base.py | 2 | 2 | 100.00% |
| data_type_ops/test_parity_binary_ops.py | 30 | 25 | 83.33% |
| data_type_ops/test_parity_boolean_ops.py | 31 | 26 | 83.87% |
| data_type_ops/test_parity_categorical_ops.py | 30 | 23 | 76.67% |
| data_type_ops/test_parity_complex_ops.py | 30 | 30 | 100.00% |
| data_type_ops/test_parity_date_ops.py | 30 | 25 | 83.33% |
| data_type_ops/test_parity_datetime_ops.py | 30 | 25 | 83.33% |
| data_type_ops/test_parity_null_ops.py | 26 | 19 | 73.08% |
| data_type_ops/test_parity_num_ops.py | 33 | 25 | 75.76% |
| data_type_ops/test_parity_string_ops.py | 30 | 23 | 76.67% |
| data_type_ops/test_parity_timedelta_ops.py | 26 | 19 | 73.08% |
| data_type_ops/test_parity_udf_ops.py | 26 | 18 | 69.23% |
| Total | 1588 | 1173 | 73.87% |

Closes apache#40525 from itholic/initial_pandas_connect.

Authored-by: itholic <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>