[SPARK-49533][CORE][TESTS] Change default ivySettings in the `IvyTestUtils#withRepository` function to use `.ivy2.5.2` as the Default Ivy User Dir

### What changes were proposed in this pull request?
This pull request changes the default value of the `ivySettings` parameter of the `IvyTestUtils#withRepository` function. When the default `IvySettings` object is constructed, an additional call to the `MavenUtils.processIvyPathArg` function now adjusts its `DefaultIvyUserDir` and `DefaultCache` configurations:

1. The `DefaultIvyUserDir` is set to `${user.home}/.ivy2.5.2`.
2. The `DefaultCache` is set to the `cache` directory under the modified Ivy user dir, i.e. `${user.home}/.ivy2.5.2/cache` (a plain `new IvySettings` defaults to `${user.home}/.ivy2/cache`).

These changes address a bad case in the testing process.

Additionally, to allow `IvyTestUtils` to invoke the `MavenUtils.processIvyPathArg` function, the visibility of the `processIvyPathArg` function has been adjusted from `private` to `private[util]`.
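
For context, here is a minimal standalone sketch of what `MavenUtils.processIvyPathArg` effectively does when no explicit Ivy path is supplied. The helper name `applyDefaultIvyDir` is hypothetical, and the two setter calls at the end are a reconstruction for illustration, consistent with the behavior described above and with the hunk shown at the end of this page:

```scala
import java.io.File
import org.apache.ivy.core.settings.IvySettings

// Sketch only: with no explicit ivyPath and no ivy.home system property,
// the Ivy user dir falls back to ~/.ivy2.5.2 and the default cache to
// ~/.ivy2.5.2/cache.
def applyDefaultIvyDir(ivySettings: IvySettings, ivyPath: Option[String]): Unit = {
  val alternateIvyDir = ivyPath.filterNot(_.trim.isEmpty).getOrElse {
    System.getProperty("ivy.home",
      System.getProperty("user.home") + File.separator + ".ivy2.5.2")
  }
  ivySettings.setDefaultIvyUserDir(new File(alternateIvyDir))
  ivySettings.setDefaultCache(new File(alternateIvyDir, "cache"))
}
```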

### Why are the changes needed?
To fix a bad case in the tests. The reproduction steps are as follows:

1. Clean up files and directories related to `mylib-0.1.jar` under `~/.ivy2.5.2`
2. Execute the following tests using Java 21:

```
java -version
openjdk version "21.0.4" 2024-07-16 LTS
OpenJDK Runtime Environment Zulu21.36+17-CA (build 21.0.4+7-LTS)
OpenJDK 64-Bit Server VM Zulu21.36+17-CA (build 21.0.4+7-LTS, mixed mode, sharing)
build/sbt clean "connect-client-jvm/testOnly org.apache.spark.sql.application.ReplE2ESuite" -Phive
```

```
Deleting /Users/yangjie01/.ivy2/cache/my.great.lib, exists: false
file:/Users/yangjie01/SourceCode/git/spark-sbt/target/tmp/spark-2a9107ea-4e09-4dfe-a270-921d799837fb/ added as a remote repository with the name: repo-1
:: loading settings :: url = jar:file:/Users/yangjie01/Library/Caches/Coursier/v1/https/maven-central.storage-download.googleapis.com/maven2/org/apache/ivy/ivy/2.5.2/ivy-2.5.2.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /Users/yangjie01/.ivy2.5.2/cache
The jars for the packages stored in: /Users/yangjie01/.ivy2.5.2/jars
my.great.lib#mylib added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-5827ff8a-7a85-4598-8ced-e949457752e4;1.0
	confs: [default]
	found my.great.lib#mylib;0.1 in repo-1
downloading file:/Users/yangjie01/SourceCode/git/spark-sbt/target/tmp/spark-2a9107ea-4e09-4dfe-a270-921d799837fb/my/great/lib/mylib/0.1/mylib-0.1.jar ...
	[SUCCESSFUL ] my.great.lib#mylib;0.1!mylib.jar (1ms)
:: resolution report :: resolve 4325ms :: artifacts dl 2ms
	:: modules in use:
	my.great.lib#mylib;0.1 from repo-1 in [default]
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	---------------------------------------------------------------------
	|      default     |   1   |   1   |   1   |   0   ||   1   |   1   |
	---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-5827ff8a-7a85-4598-8ced-e949457752e4
	confs: [default]
	1 artifacts copied, 0 already retrieved (0kB/6ms)
Deleting /Users/yangjie01/.ivy2/cache/my.great.lib, exists: false
[info] - External JAR (6 seconds, 288 milliseconds)
...
[info] Run completed in 40 seconds, 441 milliseconds.
[info] Total number of tests run: 26
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 26, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
```

3. Re-execute the above tests using Java 17:

```
java -version
openjdk version "17.0.12" 2024-07-16 LTS
OpenJDK Runtime Environment Zulu17.52+17-CA (build 17.0.12+7-LTS)
OpenJDK 64-Bit Server VM Zulu17.52+17-CA (build 17.0.12+7-LTS, mixed mode, sharing)
build/sbt clean "connect-client-jvm/testOnly org.apache.spark.sql.application.ReplE2ESuite" -Phive
```

```
[info] - External JAR *** FAILED *** (1 second, 626 milliseconds)
[info]   isContain was false Ammonite output did not contain 'Array[Int] = Array(1, 2, 3, 4, 5)':
[info]   scala>

[info]   scala> // this import will fail

[info]   scala> import my.great.lib.MyLib

[info]   scala>

[info]   scala> // making library available in the REPL to compile UDF

[info]   scala> import coursierapi.{Credentials, MavenRepository}
import coursierapi.{Credentials, MavenRepository}
[info]
[info]   scala> interp.repositories() ++= Seq(MavenRepository.of("file:/Users/yangjie01/SourceCode/git/spark-sbt/target/tmp/spark-6e6bc234-758f-44f1-a8b3-fbb79ed74647/"))

[info]
[info]   scala> import $ivy.`my.great.lib:mylib:0.1`
import $ivy.$
[info]
[info]   scala>

[info]   scala> val func = udf((a: Int) => {
[info]            import my.great.lib.MyLib
[info]            MyLib.myFunc(a)
[info]          })
func: org.apache.spark.sql.expressions.UserDefinedFunction = SparkUserDefinedFunction(
[info]     f = ammonite.$sess.cmd28$Helper$$Lambda$3059/0x0000000801da4218721b2487,
[info]     dataType = IntegerType,
[info]     inputEncoders = ArraySeq(Some(value = PrimitiveIntEncoder)),
[info]     outputEncoder = Some(value = BoxedIntEncoder),
[info]     givenName = None,
[info]     nullable = true,
[info]     deterministic = true
[info]   )
[info]
[info]   scala>

[info]   scala> // add library to the Executor

[info]   scala> spark.addArtifact("ivy://my.great.lib:mylib:0.1?repos=file:/Users/yangjie01/SourceCode/git/spark-sbt/target/tmp/spark-6e6bc234-758f-44f1-a8b3-fbb79ed74647/")

[info]
[info]   scala>

[info]   scala> spark.range(5).select(func(col("id"))).as[Int].collect()

[info]   scala>

[info]   scala> semaphore.release()

[info]   Error Output: Compiling (synthetic)/ammonite/predef/ArgsPredef.sc
[info]   Compiling /Users/yangjie01/SourceCode/git/spark-sbt/connector/connect/client/jvm/(console)
[info]   cmd25.sc:1: not found: value my
[info]   import my.great.lib.MyLib
[info]          ^
[info]   Compilation Failed
[info]   org.apache.spark.SparkException: [FAILED_EXECUTE_UDF] User defined function (` (cmd28$Helper$$Lambda$3054/0x0000007002189800)`: (int) => int) failed due to: java.lang.UnsupportedClassVersionError: my/great/lib/MyLib has been compiled by a more recent version of the Java Runtime (class file version 65.0), this version of the Java Runtime only recognizes class file versions up to 61.0. SQLSTATE: 39000
[info]     org.apache.spark.sql.errors.QueryExecutionErrors$.failedExecuteUserDefinedFunctionError(QueryExecutionErrors.scala:195)
[info]     org.apache.spark.sql.errors.QueryExecutionErrors.failedExecuteUserDefinedFunctionError(QueryExecutionErrors.scala)
[info]     org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(generated.java:114)
[info]     org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
[info]     org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:50)
[info]     org.apache.spark.sql.execution.arrow.ArrowConverters$ArrowBatchIterator.hasNext(ArrowConverters.scala:100)
[info]     scala.collection.Iterator$$anon$9.hasNext(Iterator.scala:583)
[info]     scala.collection.mutable.Growable.addAll(Growable.scala:61)
[info]     scala.collection.mutable.Growable.addAll$(Growable.scala:57)
[info]     scala.collection.mutable.ArrayBuilder.addAll(ArrayBuilder.scala:75)
[info]     scala.collection.IterableOnceOps.toArray(IterableOnce.scala:1505)
[info]     scala.collection.IterableOnceOps.toArray$(IterableOnce.scala:1498)
[info]     scala.collection.AbstractIterator.toArray(Iterator.scala:1303)
[info]     org.apache.spark.sql.connect.execution.SparkConnectPlanExecution.$anonfun$processAsArrowBatches$5(SparkConnectPlanExecution.scala:183)
[info]     org.apache.spark.SparkContext.$anonfun$submitJob$1(SparkContext.scala:2608)
[info]     org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)
[info]     org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:171)
[info]     org.apache.spark.scheduler.Task.run(Task.scala:146)
[info]     org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$5(Executor.scala:644)
[info]     org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
[info]     org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
[info]     org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:99)
[info]     org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:647)
[info]     java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
[info]     java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
[info]     java.lang.Thread.run(Thread.java:840)
[info]   org.apache.spark.SparkException: java.lang.UnsupportedClassVersionError: my/great/lib/MyLib has been compiled by a more recent version of the Java Runtime (class file version 65.0), this version of the Java Runtime only recognizes class file versions up to 61.0
[info]     java.lang.ClassLoader.defineClass1(Native Method)
[info]     java.lang.ClassLoader.defineClass(ClassLoader.java:1017)
[info]     java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150)
[info]     java.net.URLClassLoader.defineClass(URLClassLoader.java:524)
[info]     java.net.URLClassLoader$1.run(URLClassLoader.java:427)
[info]     java.net.URLClassLoader$1.run(URLClassLoader.java:421)
[info]     java.security.AccessController.doPrivileged(AccessController.java:712)
[info]     java.net.URLClassLoader.findClass(URLClassLoader.java:420)
[info]     java.lang.ClassLoader.loadClass(ClassLoader.java:592)
[info]     org.apache.spark.util.ChildFirstURLClassLoader.loadClass(ChildFirstURLClassLoader.java:55)
[info]     java.lang.ClassLoader.loadClass(ClassLoader.java:579)
[info]     org.apache.spark.util.ParentClassLoader.loadClass(ParentClassLoader.java:40)
[info]     java.lang.ClassLoader.loadClass(ClassLoader.java:525)
[info]     org.apache.spark.executor.ExecutorClassLoader.findClass(ExecutorClassLoader.scala:109)
[info]     java.lang.ClassLoader.loadClass(ClassLoader.java:592)
[info]     java.lang.ClassLoader.loadClass(ClassLoader.java:525)
[info]     ammonite.$sess.cmd28$Helper.$anonfun$func$1(cmd28.sc:3)
[info]     ammonite.$sess.cmd28$Helper.$anonfun$func$1$adapted(cmd28.sc:1)
[info]     org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(generated.java:112)
[info]     org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
[info]     org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:50)
[info]     org.apache.spark.sql.execution.arrow.ArrowConverters$ArrowBatchIterator.hasNext(ArrowConverters.scala:100)
[info]     scala.collection.Iterator$$anon$9.hasNext(Iterator.scala:583)
[info]     scala.collection.mutable.Growable.addAll(Growable.scala:61)
[info]     scala.collection.mutable.Growable.addAll$(Growable.scala:57)
[info]     scala.collection.mutable.ArrayBuilder.addAll(ArrayBuilder.scala:75)
[info]     scala.collection.IterableOnceOps.toArray(IterableOnce.scala:1505)
[info]     scala.collection.IterableOnceOps.toArray$(IterableOnce.scala:1498)
[info]     scala.collection.AbstractIterator.toArray(Iterator.scala:1303)
[info]     org.apache.spark.sql.connect.execution.SparkConnectPlanExecution.$anonfun$processAsArrowBatches$5(SparkConnectPlanExecution.scala:183)
[info]     org.apache.spark.SparkContext.$anonfun$submitJob$1(SparkContext.scala:2608)
[info]     org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)
[info]     org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:171)
[info]     org.apache.spark.scheduler.Task.run(Task.scala:146)
[info]     org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$5(Executor.scala:644)
[info]     org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
[info]     org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
[info]     org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:99)
[info]     org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:647)
[info]     java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
[info]     java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
[info]     java.lang.Thread.run(Thread.java:840) (ReplE2ESuite.scala:117)
```

I suspect the causes of the aforementioned bad case are as follows:

1. Following #45075, to address compatibility issues, Spark 4.0 adopted `~/.ivy2.5.2` as the default Ivy user directory. When tests are executed with Java 21, the compiled `mylib-0.1.jar` is published to the directory `~/.ivy2.5.2/cache/my.great.lib/mylib/jars`.

2. However, the `getDefaultCache` method of a default-constructed `IvySettings` instance still returns `~/.ivy2/cache`. Consequently, when the `purgeLocalIvyCache` function is called within the `withRepository` function, it attempts to clean the artifact and dependency directories under `~/.ivy2/cache`, and therefore fails to clean up the `mylib-0.1.jar` that Java 21 published to `~/.ivy2.5.2/cache/my.great.lib/mylib/jars`. When the tests are then executed with Java 17 and attempt to load this Java 21-compiled `mylib-0.1.jar` (class file version 65.0, while a Java 17 runtime only accepts class file versions up to 61.0), they fail with the `UnsupportedClassVersionError` shown above.

https://github.com/apache/spark/blob/9269a0bfed56429e999269dfdfd89aefcb1b7261/common/utils/src/test/scala/org/apache/spark/util/IvyTestUtils.scala#L361-L371

https://github.com/apache/spark/blob/9269a0bfed56429e999269dfdfd89aefcb1b7261/common/utils/src/test/scala/org/apache/spark/util/IvyTestUtils.scala#L392-L403

To address this issue, the pull request modifies the default configuration of the `IvySettings` instance used by `withRepository`, ensuring that `purgeLocalIvyCache` properly cleans up the corresponding cache files under `~/.ivy2.5.2/cache`, which fixes the problem above.
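
To make the mismatch concrete, here is a small illustrative check (not part of the PR; since `processIvyPathArg` is `private[util]`, this snippet assumes it runs from code inside the `org.apache.spark.util` package):

```scala
import org.apache.ivy.core.settings.IvySettings
import org.apache.spark.util.MavenUtils

// A plain IvySettings still points at the old location ...
val plain = new IvySettings
println(plain.getDefaultCache)   // ${user.home}/.ivy2/cache

// ... while the patched default used by withRepository points at the
// directory the test artifacts are actually published to.
val patched = new IvySettings
MavenUtils.processIvyPathArg(ivySettings = patched, ivyPath = None)
println(patched.getDefaultCache) // ${user.home}/.ivy2.5.2/cache
```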

### Does this PR introduce _any_ user-facing change?
No, this is a test-only change.

### How was this patch tested?
1. Pass GitHub Actions
2. Manually running the tests described above succeeds, and the `~/.ivy2.5.2/cache/my.great.lib` directory is confirmed to be cleaned up promptly.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #48006 from LuciferYang/IvyTestUtils-withRepository.

Authored-by: yangjie01 <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
LuciferYang authored and dongjoon-hyun committed Sep 6, 2024
1 parent 62cdc56 commit b5e345c

Showing 2 changed files with 14 additions and 2 deletions.
`MavenUtils.scala`:

```diff
@@ -342,7 +342,7 @@ private[spark] object MavenUtils extends Logging {
   }
 
   /* Set ivy settings for location of cache, if option is supplied */
-  private def processIvyPathArg(ivySettings: IvySettings, ivyPath: Option[String]): Unit = {
+  private[util] def processIvyPathArg(ivySettings: IvySettings, ivyPath: Option[String]): Unit = {
     val alternateIvyDir = ivyPath.filterNot(_.trim.isEmpty).getOrElse {
       // To protect old Ivy-based systems like old Spark from Apache Ivy 2.5.2's incompatibility.
       System.getProperty("ivy.home",
```
`IvyTestUtils.scala`:

```diff
@@ -365,7 +365,7 @@ private[spark] object IvyTestUtils {
       useIvyLayout: Boolean = false,
       withPython: Boolean = false,
       withR: Boolean = false,
-      ivySettings: IvySettings = new IvySettings)(f: String => Unit): Unit = {
+      ivySettings: IvySettings = defaultIvySettings())(f: String => Unit): Unit = {
     val deps = dependencies.map(MavenUtils.extractMavenCoordinates)
     purgeLocalIvyCache(artifact, deps, ivySettings)
     val repo = createLocalRepositoryForTests(artifact, dependencies, rootDir, useIvyLayout,
@@ -401,4 +401,16 @@ private[spark] object IvyTestUtils {
     }
   }
+
+  /**
+   * Creates and initializes a new instance of IvySettings with default configurations.
+   * The method processes the Ivy path argument using MavenUtils to ensure proper setup.
+   *
+   * @return A newly created and configured instance of IvySettings.
+   */
+  private def defaultIvySettings(): IvySettings = {
+    val settings = new IvySettings
+    MavenUtils.processIvyPathArg(ivySettings = settings, ivyPath = None)
+    settings
+  }
 }
```
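
For illustration, a hedged sketch of how a test picks up the new default. The exact `withRepository` signature beyond the parameters visible in the hunk above, and the location of `MavenCoordinate` inside `MavenUtils`, are assumptions:

```scala
import org.apache.spark.util.IvyTestUtils
import org.apache.spark.util.MavenUtils.MavenCoordinate

// Callers that omit ivySettings now get defaultIvySettings(), so the
// pre-test purge and the published artifacts agree on ~/.ivy2.5.2/cache.
val artifact = MavenCoordinate("my.great.lib", "mylib", "0.1")
IvyTestUtils.withRepository(artifact, dependencies = None, rootDir = None) { repo =>
  // `repo` is the URI of the temporary local repository containing mylib-0.1.jar,
  // e.g. usable as spark.addArtifact(s"ivy://my.great.lib:mylib:0.1?repos=$repo")
}
```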
