
[SPARK-23464][MESOS] Fix mesos cluster scheduler options double-escaping #20641

Closed
wants to merge 2 commits

Conversation


@krcz krcz commented Feb 20, 2018

What changes were proposed in this pull request?

Don't enclose --conf option values in "", as they are already escaped by shellEscape; the additional wrapping cancels out that escaping, making the driver fail to start.

This reverts commit 9b377aa [SPARK-18114].
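A quick shell sketch (illustrative values only, not Spark's actual code) shows how the extra wrapping breaks a value that shellEscape already quoted:

```shell
# A value containing quotes, escaped once (as shellEscape would emit it):
escaped='"-Dparam=\"value 1\""'

# Correct: one layer of quoting; the shell reconstructs the original value.
sh -c "printf '%s\n' $escaped"
# -Dparam="value 1"

# Buggy (pre-fix): wrapping the already-escaped value in an extra pair of
# quotes cancels the escaping -- the embedded \" now terminates the outer
# quotes early and the value is split into two mangled arguments:
sh -c "printf '%s\n' \"$escaped\""
# -Dparam="value
# 1"
```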

How was this patch tested?

Manual test with driver command logging added.

Author: Marcin Kurczych [email protected]

@susanxhuynh
Contributor

Thanks for the PR! It seems that the previous attempt to fix this (SPARK-18114) was wrong -- I'm not sure why we didn't catch the problem before, maybe lack of testing? @krcz My suggestion for this patch is to add a test, in order to prevent another regression in the future. I've written a unit test for this -- you could do something similar: d2iq-archive@4812ba3 I will also do more testing with my own integration tests. cc @skonto

@krcz
Author

krcz commented Mar 1, 2018

@susanxhuynh Thanks for the example! I have added a similar test, covering more cases. After you or someone else reviews it, I'll rebase the pull request.

Contributor

@susanxhuynh susanxhuynh left a comment


Thanks, @krcz . Overall looks good ... I left one question about one of the test parameters. Also, have you verified in the driver / executors that they receive the correct Java system properties (with the "-D" options)? -- I was going to test that myself.

"-XX+PrintGC -Dparam1=val1 -Dparam2=val2",
// special characters, to be escaped
"spark.executor.extraJavaOptions" ->
"""-Dparam1="value 1" -Dparam2=value\ 2 -Dpath=$PATH"""),
Contributor


I'm not sure the second param (param2) would be parsed correctly if you actually ran the command. Doesn't there need to be quotes around the space? Have you tested it and checked if the executor gets the correct value for param2?

Author

@krcz krcz Mar 12, 2018


@susanxhuynh
I have checked it and it works. There is no need for quotes, as the space is escaped: the backslash stops it from being interpreted as a boundary between arguments and makes it a literal space inside the value. This is standard bash (and sh) behaviour.
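That behaviour is easy to verify directly in a shell (count_args is just an illustrative helper, not part of Spark):

```shell
# Prints the number of arguments the shell actually delivered.
count_args() { printf '%s\n' "$#"; }

# Unescaped space: the shell splits at it, so two arguments arrive.
count_args -Dparam2=value 2
# 2

# Backslash-escaped space: a literal space inside a single argument.
count_args -Dparam2=value\ 2
# 1
```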

@@ -199,6 +199,38 @@ class MesosClusterSchedulerSuite extends SparkFunSuite with LocalSparkContext wi
})
}

test("properly wraps and escapes parameters passed to driver command") {
Contributor


Does this test fail with the old code?

Author


It does.

[info] - properly wraps and escapes parameters passed to driver command *** FAILED *** (154 milliseconds)
[info]   "test/./bin/spark-submit --name test --master mesos://mesos://localhost:5050 --driver-cores 1.0 --driver-memory 1000M --class mainClass --conf "spark.app.name=test" --conf "spark.mesos.executor.home=test" --conf "spark.executor.extraJavaOptions="-Dparam1=\"value 1\" -Dparam2=value\\ 2 -Dpath=\$PATH"" --conf "spark.driver.extraJavaOptions="-XX+PrintGC -Dparam1=val1 -Dparam2=val2"" ./jar arg" did not contain "--conf spark.driver.extraJavaOptions="-XX+PrintGC -Dparam1=val1 -Dparam2=val2"" (MesosClusterSchedulerSuite.scala:227)
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528)
[info]   at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
[info]   at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501)
[info]   at org.apache.spark.scheduler.cluster.mesos.MesosClusterSchedulerSuite$$anonfun$16.apply(MesosClusterSchedulerSuite.scala:227)
[info]   at org.apache.spark.scheduler.cluster.mesos.MesosClusterSchedulerSuite$$anonfun$16.apply(MesosClusterSchedulerSuite.scala:202)


This looks ready to merge; anything else needed @skonto or @susanxhuynh ?


It would be great to merge it if you think it's ready.

Contributor


Sorry for the delay. I was going to test this in DC/OS and haven't gotten a chance to do so.

Contributor


@susanxhuynh I can do a test for this.

@skonto
Contributor

skonto commented Jun 12, 2018

@susanxhuynh I tested this. Let me know if I need to test anything else. Here is what you get:

 ./bin/spark-submit --deploy-mode cluster --name test --master mesos://spark.marathon.mesos:13830 --driver-cores 1.0 --driver-memory 1000M --class org.apache.spark.examples.SparkPi --conf spark.executor.extraJavaOptions="-Dparam1=\"value 1\" -Dparam2=value\\ 2 -Dpath=\$PATH" --conf spark.mesos.containerizer=mesos  --conf spark.mesos.executor.docker.image=skonto/spark-test:test https://...jar 10000

"sparkProperties" : {
  "spark.executor.extraJavaOptions" : "-Dparam1="value 1" -Dparam2=value\ 2 -Dpath=$PATH",
  "spark.mesos.containerizer" : "mesos",
  "spark.jars" : "....jar",
  "spark.driver.supervise" : "false",
  "spark.app.name" : "test",
  "spark.driver.memory" : "1000M",
  "spark.mesos.executor.docker.image" : "skonto/spark-test:test",
  "spark.driver.cores" : "1.0",
  "spark.submit.deployMode" : "cluster",
  "spark.master" : "mesos://spark.marathon.mesos:13830"
}
}

At the driver side in DC/OS:

/usr/lib/jvm/jre1.8.0_152/bin/java -cp /opt/spark/dist/conf/:/opt/spark/dist/jars/*:/etc/hadoop/ -Dspark.mesos.driver.frameworkId=bddf96ce-45f6-45f7-ac83-5a3622cafc41-0030-driver-20180612154509-0007 -Xmx1000M org.apache.spark.deploy.SparkSubmit --master mesos://zk://master.mesos:2181/mesos --conf spark.driver.memory=1000M --conf spark.driver.cores=1.0 --conf spark.mesos.executor.docker.image=skonto/spark-test:test --conf spark.app.name=test --conf spark.mesos.containerizer=mesos --conf spark.executor.extraJavaOptions=-Dparam1="value 1" -Dparam2=value\ 2 -Dpath=$PATH --conf spark.driver.supervise=false --class org.apache.spark.examples.SparkPi --name test --driver-cores 1.0 /mnt/mesos/sandbox/....jar 10000

Also tried:
--conf spark.executor.extraJavaOptions="'--verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps'"

Got:

/usr/lib/jvm/jre1.8.0_152/bin/java -cp /opt/spark/dist/conf/:/opt/spark/dist/jars/*:/etc/hadoop/ -Dspark.mesos.driver.frameworkId=bddf96ce-45f6-45f7-ac83-5a3622cafc41-0030-driver-20180612155144-0008 -Xmx1000M org.apache.spark.deploy.SparkSubmit --master mesos://zk://master.mesos:2181/mesos --conf spark.driver.memory=1000M --conf spark.driver.cores=1.0 --conf spark.mesos.executor.docker.image=skonto/spark-test:test --conf spark.app.name=test --conf spark.mesos.containerizer=mesos --conf spark.executor.extraJavaOptions=--verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps --conf spark.driver.supervise=false --class org.apache.spark.examples.SparkPi --name test --driver-cores 1.0 /mnt/mesos/sandbox/....jar 10000

"sparkProperties" : {
  "spark.executor.extraJavaOptions" : "'--verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps'",
  "spark.mesos.containerizer" : "mesos",
  "spark.jars" : ".....jar",
  "spark.driver.supervise" : "false",
  "spark.app.name" : "test",
  "spark.driver.memory" : "1000M",
  "spark.mesos.executor.docker.image" : "skonto/spark-test:test",
  "spark.driver.cores" : "1.0",
  "spark.submit.deployMode" : "cluster",
  "spark.master" : "mesos://spark.marathon.mesos:13830"
}
@krcz Could you update the PR to remove conflicts?
@felixcheung could you help merge this one once updated?

@krcz
Author

krcz commented Jun 12, 2018

@skonto
The command you pasted under "At the driver side in DC/OS:" concerns me. It doesn't seem properly escaped. Is it the command as run by the shell, or is it just a space-joined argument list (as in the Spark Dispatcher web UI), which would explain why it is not escaped?

@skonto
Contributor

skonto commented Jun 12, 2018

@krcz that is the output of ps, probably not the best tool here. Btw, both jobs ran successfully, although the options might not have been escaped properly. I will check with some other tool, like printing the exact commands, plus what the process gets as Java properties. The Spark UI shows everything as expected, btw.

@krcz
Author

krcz commented Jun 13, 2018

@skonto The one directly below "At the driver side in DC/OS:" in your comment. But if it's the output of ps, that's OK, as ps wouldn't be able to distinguish between the command "foo bar" and the command foo bar.
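The ps caveat can be demonstrated with a small sketch (join_argv is a hypothetical helper): joining argv elements with spaces, which is essentially what ps prints, is lossy.

```shell
# Render argv the way ps does: elements joined with spaces.
join_argv() { printf '%s ' "$@"; echo; }

join_argv foo "bar baz"   # argv has 2 elements
join_argv foo bar baz     # argv has 3 elements
# Both print exactly the same text, so ps output cannot tell them apart.
```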

@skonto
Contributor

skonto commented Jun 13, 2018

@krcz yes, it is OK, and yes, ps has limitations. So I verified it, just to be on the safe side.
The Dispatcher passes the following command to Mesos:

18/06/13 10:30:45 INFO MesosClusterScheduler: Command for the driver (test):./bin/spark-submit --name test --master mesos://zk://master.mesos:2181/mesos --driver-cores 1.0 --driver-memory 1000M --class org.apache.spark.examples.SparkPi --conf spark.executor.extraJavaOptions="-Dexecutor.test.param1=\"value 1\" -Dexecutor.test.param2=value\\ 2 -Dexecutor.test.path=\$PATH" --conf spark.mesos.containerizer=mesos --conf spark.driver.supervise=false --conf spark.app.name=test --conf spark.driver.memory=1000M --conf spark.mesos.executor.docker.image=skonto/spark-test:test --conf spark.driver.cores=1.0 --conf spark.driver.extraJavaOptions="-Dspark.test.param1=\"value 1\" -Dspark.test.param2=value\\ 2 -Dspark.test.path=\$PATH" $MESOS_SANDBOX/spark-examples_2.11-2.3.0.jar 10000

Spark-class will run this at the executor side:

exec /usr/lib/jvm/jre1.8.0_152/bin/java -cp '/opt/spark/dist/conf/:/opt/spark/dist/jars/*:/etc/hadoop/' '-Dexecutor.test.param1=value 1' '-Dexecutor.test.param2=value 2' '-Dexecutor.test.path=$PATH' -Xmx1024m org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://[email protected]:33887 --executor-id 9 --hostname 10.0.0.77 --cores 3 --app-id bddf96ce-45f6-45f7-ac83-5a3622cafc41-0032-driver-20180613103044-0001

@susanxhuynh if there is no issue, we can move on with this fix. @krcz it would be good to add more tests, like a scenario where pure GC properties are passed, to make it more realistic. This is pretty common in the field: d2iq-archive/spark-build#263

For example: "'--verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps'". Btw, I tested that and it passes, but it would be good to have such a test to avoid regressions.
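To see what the shell actually delivers for that quoting pattern (show_args is only an illustrative helper, not Spark code):

```shell
# Print each delivered argument on its own line, bracketed.
show_args() { printf '[%s]\n' "$@"; }

# The single quotes sit inside double quotes, so they pass through literally
# and the whole option string stays a single argument:
show_args --conf spark.executor.extraJavaOptions="'--verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps'"
# [--conf]
# [spark.executor.extraJavaOptions='--verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps']
```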

@susanxhuynh
Contributor

@skonto Thanks for testing it. Tests results look good.

@felixcheung
Member

ok, could you rebase this PR

@felixcheung
Member

ok to test

@felixcheung
Member

@mgummelt too

@SparkQA

SparkQA commented Jun 14, 2018

Test build #91820 has finished for PR 20641 at commit 22c6739.

  • This patch fails due to an unknown error code, -9.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@krcz
Author

krcz commented Jun 18, 2018

@felixb @skonto
I've tried rebasing, but it looks like the issue has meanwhile been fixed independently in another pull request, while someone was working on SPARK-23941. It seems there is no unit test there, though, so I could modify that PR to incorporate just the test. Do you want that, or should I just close this pull request?

@felixb

felixb commented Jun 18, 2018

@krcz
@felixcheung <- different felix ;)

@felixcheung
Member

@krcz how is your PR different from 21014?

@krcz
Author

krcz commented Jun 26, 2018

@felixcheung It fixes the same problem, so in terms of implementation it is not very different. But when I created this one, 21014 didn't exist.

The only difference is that this PR adds a unit test, which 21014 does not have. That's why I'm asking whether I should modify this PR to just add the test.

@felixcheung
Member

yap, that sounds like a good idea

@AmplabJenkins

Can one of the admins verify this patch?

@vanzin
Contributor

vanzin commented Dec 13, 2018

I'm closing this since there's been no activity. If you plan on adding the test, update your branch, and tag this with the other bug number (I duped this one to the bug with the fix attached to it).

@vanzin vanzin closed this Dec 13, 2018