
[SPARK-49678][CORE] Support `spark.test.master` in `SparkSubmitArguments` #48126


Conversation

@dongjoon-hyun (Member) commented Sep 16, 2024

### What changes were proposed in this pull request?

This PR aims to support `spark.test.master` in `SparkSubmitArguments`.

### Why are the changes needed?

To allow users to control the default master setting during testing and documentation generation.

#### First, currently we cannot build the Python documentation on an M3 Max (or other high-core machines) without this; it succeeds only on GitHub Actions runners (4 cores) or an equivalent low-core Docker run. Please try the following on your Macs.

**BEFORE**
```
$ build/sbt package -Phive-thriftserver
$ cd python/docs
$ make html
...
java.lang.OutOfMemoryError: Java heap space
...
24/09/16 14:09:55 WARN PythonRunner: Incomplete task 7.0 in stage 30 (TID 177) interrupted: Attempting to kill Python Worker
...
make: *** [html] Error 2
```

**AFTER**
```
$ build/sbt package -Phive-thriftserver
$ cd python/docs
$ JDK_JAVA_OPTIONS="-Dspark.test.master=local[1]" make html
...
build succeeded.

The HTML pages are in build/html.
```
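For context (not part of the PR): `JDK_JAVA_OPTIONS` is honored by every JDK 9+ `java` launcher, which is why a single environment variable reaches each `SparkSubmit` JVM spawned during documentation generation (hence the repeated `NOTE: Picked up JDK_JAVA_OPTIONS` lines below). A minimal sketch to check what a launched JVM would see; the `PrintTestMaster` object name is illustrative, not from this PR:

```scala
// Prints the system property that each launched JVM observes.
// Run with: JDK_JAVA_OPTIONS="-Dspark.test.master=local[1]" scala PrintTestMaster.scala
object PrintTestMaster {
  def main(args: Array[String]): Unit =
    println(sys.props.getOrElse("spark.test.master", "<unset: would fall back to local[*]>"))
}
```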

#### Second, in general, we can control all `SparkSubmit` invocations (e.g., the Spark shells) like the following.

**BEFORE (`local[*]`)**
```
$ bin/pyspark
Python 3.9.19 (main, Jun 17 2024, 15:39:29)
[Clang 15.0.0 (clang-1500.3.9.4)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
WARNING: Using incubator modules: jdk.incubator.vector
Using Spark's default log4j profile: org/apache/spark/log4j2-pattern-layout-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/09/16 13:53:02 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 4.0.0-SNAPSHOT
      /_/

Using Python version 3.9.19 (main, Jun 17 2024 15:39:29)
Spark context Web UI available at http://localhost:4040
Spark context available as 'sc' (master = local[*], app id = local-1726519982935).
SparkSession available as 'spark'.
>>>
```

**AFTER (`local[1]`)**
```
$ JDK_JAVA_OPTIONS="-Dspark.test.master=local[1]" bin/pyspark
NOTE: Picked up JDK_JAVA_OPTIONS: -Dspark.test.master=local[1]
Python 3.9.19 (main, Jun 17 2024, 15:39:29)
[Clang 15.0.0 (clang-1500.3.9.4)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
NOTE: Picked up JDK_JAVA_OPTIONS: -Dspark.test.master=local[1]
NOTE: Picked up JDK_JAVA_OPTIONS: -Dspark.test.master=local[1]
WARNING: Using incubator modules: jdk.incubator.vector
Using Spark's default log4j profile: org/apache/spark/log4j2-pattern-layout-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/09/16 13:51:03 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 4.0.0-SNAPSHOT
      /_/

Using Python version 3.9.19 (main, Jun 17 2024 15:39:29)
Spark context Web UI available at http://localhost:4040
Spark context available as 'sc' (master = local[1], app id = local-1726519863363).
SparkSession available as 'spark'.
>>>
```

### Does this PR introduce _any_ user-facing change?

No. `spark.test.master` is a new parameter.

### How was this patch tested?

Manual tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions bot added the CORE label Sep 16, 2024
@dongjoon-hyun (Member, Author)

Could you review this PR, @viirya?

@dongjoon-hyun (Member, Author)

Please note that the Python package installation failures are unrelated to this PR; they seem to be happening in other recent PRs too.

[Screenshot of the CI failure, 2024-09-16 16:11:15]

@dongjoon-hyun (Member, Author)

Thank you, @viirya!

@dongjoon-hyun (Member, Author)

Merged to master.

@dongjoon-hyun deleted the SPARK-49678 branch September 17, 2024 03:53
```diff
@@ -43,7 +43,8 @@ private[deploy] class SparkSubmitArguments(args: Seq[String], env: Map[String, S
   extends SparkSubmitArgumentsParser with Logging {
   var maybeMaster: Option[String] = None
   // Global defaults. These should be keep to minimum to avoid confusing behavior.
-  def master: String = maybeMaster.getOrElse("local[*]")
+  def master: String =
+    maybeMaster.getOrElse(System.getProperty("spark.test.master", "local[*]"))
```
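To make the fallback order concrete, here is a minimal, self-contained sketch of the resolution this diff implements; `MasterResolutionSketch` and `resolveMaster` are illustrative names, not part of the PR:

```scala
object MasterResolutionSketch {
  // Mirrors the getOrElse fallback above:
  // explicit master argument > -Dspark.test.master > "local[*]".
  def resolveMaster(maybeMaster: Option[String]): String =
    maybeMaster.getOrElse(System.getProperty("spark.test.master", "local[*]"))

  def main(args: Array[String]): Unit = {
    println(resolveMaster(Some("yarn")))   // explicit --master wins: yarn
    System.setProperty("spark.test.master", "local[1]")
    println(resolveMaster(None))           // property fallback: local[1]
    System.clearProperty("spark.test.master")
    println(resolveMaster(None))           // hard default: local[*]
  }
}
```

In particular, an explicit `--master` still takes precedence: the system property is consulted only when no master was supplied.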
@HyukjinKwon (Member)

The change is fine, but just curious: why can't we just set spark.master instead? The Python documentation build actually works fine on my Mac, FWIW.

@dongjoon-hyun (Member, Author) commented Sep 17, 2024

Yes, we can change it to accept spark.master here.

> Why can't we just set spark.master instead?

I guess you already have #48129 to limit the number of SparkSubmit processes, which mitigates the situation. This PR and #48129 aim at two different layers (Spark submission and Spark local executor cores) during Python documentation generation. This PR is still valid for many-core machines like c6i.24xlarge.

> The Python documentation build actually works fine on my Mac, FWIW.

@dongjoon-hyun (Member, Author) commented Sep 17, 2024

BTW, the reason I didn't use spark.master directly is that it could have side effects on the K8s side. Let me check that part to make sure, @HyukjinKwon.

@dongjoon-hyun (Member, Author)

I made a follow-up to address the comments, @HyukjinKwon. After looking at the code, it seems to be okay. The follow-up PR is currently running the CIs; if CI passes, it should be fine. Thank you for the suggestion.

@dongjoon-hyun (Member, Author) commented Sep 18, 2024

To @HyukjinKwon: I closed the above follow-up and kept this original PR. Regarding the comment (#48134 (comment)):

> Could we maybe just reuse 'PYSPARK_SUBMIT_ARGS' environment variable set to '--master local[1]' if this is just dev/test only? I haven't tested cuz I am away from keyboard today but I think it should work.

This PR contributes more than Python doc generation. For example, PYSPARK_SUBMIT_ARGS cannot support spark-shell, spark-sql, and so on:

```
$ JDK_JAVA_OPTIONS="-Dspark.test.master=local[1]" bin/spark-shell
NOTE: Picked up JDK_JAVA_OPTIONS: -Dspark.test.master=local[1]
NOTE: Picked up JDK_JAVA_OPTIONS: -Dspark.test.master=local[1]
WARNING: Using incubator modules: jdk.incubator.vector
Using Spark's default log4j profile: org/apache/spark/log4j2-pattern-layout-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 4.0.0-SNAPSHOT
      /_/

Using Scala version 2.13.14 (OpenJDK 64-Bit Server VM, Java 21.0.4)
Type in expressions to have them evaluated.
Type :help for more information.
24/09/17 17:44:03 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context Web UI available at http://localhost:4040
Spark context available as 'sc' (master = local[1], app id = local-1726620243965).
Spark session available as 'spark'.

scala>
```

```
$ JDK_JAVA_OPTIONS="-Dspark.test.master=local[1]" bin/spark-sql
NOTE: Picked up JDK_JAVA_OPTIONS: -Dspark.test.master=local[1]
NOTE: Picked up JDK_JAVA_OPTIONS: -Dspark.test.master=local[1]
WARNING: Using incubator modules: jdk.incubator.vector
Using Spark's default log4j profile: org/apache/spark/log4j2-pattern-layout-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/09/17 17:44:19 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
24/09/17 17:44:22 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 2.3.0
24/09/17 17:44:22 WARN ObjectStore: setMetaStoreSchemaVersion called but recording version is disabled: version = 2.3.0, comment = Set by MetaStore [email protected]
24/09/17 17:44:22 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
Spark Web UI available at http://localhost:4040
Spark master: local[1], Application Id: local-1726620260476
spark-sql (default)>
```

@HyukjinKwon (Member) commented Sep 18, 2024

Sure, that's fine! Thanks for taking a look.

@dongjoon-hyun (Member, Author)

Thank you. I'll keep this as a test setting for now; we can revisit it later.

attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024

[SPARK-49678][CORE] Support `spark.test.master` in `SparkSubmitArguments`

Closes apache#48126 from dongjoon-hyun/SPARK-49678.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
himadripal pushed a commit to himadripal/spark that referenced this pull request Oct 19, 2024

[SPARK-49678][CORE] Support `spark.test.master` in `SparkSubmitArguments`

Closes apache#48126 from dongjoon-hyun/SPARK-49678.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>