[pyspark] Support stage-level scheduling for training #9519
Conversation
@WeichenXu123 @trivialfis @eordentlich please help to review. Thanks!
Thank you for the excellent feature! You might need to update the documentation as well.
python-package/xgboost/spark/core.py
Outdated
```python
# Each training task requires cpu cores > total executor cores//2 + 1 which can
# make the tasks be sent to different executors.
#
# Please note that we can't set task_cpu_cores to the value which is smaller than
# total executor cores/2 because only task_gpu_amount can't make sure the tasks be
# sent to different executor even task_gpus=1.0

# Spark-rapids is a project to leverage GPUs to accelerate spark SQL and it can
# really help the performance. If spark-rapids is enabled. we don't allow other
# ETL gpu tasks running alongside training tasks to avoid OOM
```
I'm a bit confused by this comment. Let me try to rephrase it and see if I got it right.
To send the tasks to different executors, the task must require at least `total executor cores // 2 + 1` CPU cores.
Without configuring the `task_cpu_cores` to a larger value, we can't make sure tasks are sent to different executors even if `task_gpu_amount` is set to 1.
When the `spark-rapids` plugin is enabled for SQL acceleration, we need to prevent other GPU-based ETL tasks from running during training to avoid OOM errors.
Yes, you're correct.
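For readers following along, here is a minimal PySpark sketch of how a stage-level `ResourceProfile` using this CPU-core trick could be built. The variable names and the final `withResources` call are illustrative, not copied from the PR:

```python
from pyspark.sql import SparkSession
from pyspark.resource import ResourceProfileBuilder, TaskResourceRequests

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Illustrative: in the real code this value comes from the Spark conf.
executor_cores = int(sc.getConf().get("spark.executor.cores", "1"))

# Requesting more than half of the executor's cores per task guarantees
# that at most one training task fits on each executor.
task_cpus = executor_cores // 2 + 1

treqs = TaskResourceRequests().cpus(task_cpus).resource("gpu", 1.0)
rp = ResourceProfileBuilder().require(treqs).build

# The barrier training stage would then run with this profile, e.g.:
# rdd.withResources(rp).barrier().mapPartitions(train_fn)
```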
Could you please update the comments with a cleaner explanation?
Done.
python-package/xgboost/spark/core.py
Outdated
```python
# task_gpus means how many gpu slots the task requires in a single GPU,
# it doesn't mean how many gpu shares it would like to require, so we
# can set it to any value of (0, 0.5] or 1.
```
But not [0.5, 1)?
Yeah, Spark will throw an exception if it's in [0.5, 1).
Why?
It's a bug in Spark; I deleted the comment.
python-package/xgboost/spark/core.py
Outdated
```python
# Each training task requires cpu cores > total executor cores//2 + 1 which can
# make the tasks be sent to different executors.
```
This is hacky, and it wastes CPU cores: one xgboost train worker only uses one CPU core, but this makes Spark allocate multiple CPU cores to the task, and those cores cannot be used by other Spark tasks until the xgboost train worker completes.
Yes, that's correct. It will waste some CPU cores. Let's assume executor_cores=12; then the training task will require 7, but there are still 5 CPU cores left for other Spark tasks.
The ideal way to limit the task number would rely on task_gpu_amount, but the current Spark implementation does not support it. We can report an issue to Spark, and once Spark fixes it, we can rework this.
Q:

> make the tasks be sent to different executors.

Why do we need this rule? One executor can hold multiple Spark tasks; what issue does that cause? E.g., we can configure one executor to own 4 CPU cores and 4 GPUs; then in this executor we can start 4 xgboost train workers, each allocated 1 CPU and 1 GPU exclusively.
This PR bypasses the scenario where executor.gpu.amount > 1, since it requires some special configuration to make it work.
Then I think we can directly set executor.gpu.amount=1 and executor.cores=1 when we start the Spark application?
@WeichenXu123, we can't use GPU resources to limit the task number. I wrote a design doc (supporting stage-level scheduling) for spark-rapids-ml with a design pattern similar to xgboost's: https://docs.google.com/document/d/1C6s1nqNkOl0roNtRA5x1JPUuM2jykuR00JyMzgt0HvU/edit?usp=sharing. It explains why we can't use GPU resources to do the task limitation.
This is a bug.

Repro code:

```scala
import org.apache.spark.resource.{ResourceProfileBuilder, TaskResourceRequests}

val rdd = sc.range(0, 100, 1, 4)
var rdd1 = rdd.repartition(3)
val treqs = new TaskResourceRequests().cpus(1).resource("gpu", 1.0)
val rp = new ResourceProfileBuilder().require(treqs).build
rdd1 = rdd1.mapPartitions(iter => { Thread.sleep(30000); iter }).withResources(rp)
rdd1.collect() // 3 spark tasks run in parallel on one executor
```

Launched with:

```
spark-shell --master spark://192.168.31.236:7077 --conf spark.executor.cores=12 --conf spark.executor.resource.gpu.amount=1
```

Let's see whether the Spark team can address the bug. If it can't be fixed for now, let's use the current workaround.
Hi @WeichenXu123, yeah, it looks like a Spark bug. But using task.cpus to limit the task number in this PR can still achieve the same goal as using task.gpu.amount, except for wasting some CPU cores. Can we merge this PR first, and change it after Spark fixes the bug?
I did a performance test.

Env: Spark standalone cluster with 2 nodes, each with 12 CPU cores, 64 GB memory, and 1 TITAN V GPU (12 GB memory).
Dataset: Mortgage, 255,962,232 rows in total, 27 features + 1 label.

Result:

Config 1:

```python
.config("spark.executor.memory", "60g")
.config("spark.executor.cores", 12)
.config("spark.task.cpus", 12)
.config("spark.executor.resource.gpu.amount", 1)
.config("spark.task.resource.gpu.amount", 1)
.config("spark.sql.execution.arrow.maxRecordsPerBatch", 2000000)
```

Training times: 532.36 s and 534.46 s; average: 533.41 s.

Config 2 (with this PR):

```python
.config("spark.executor.memory", "60g")
.config("spark.executor.cores", 12)
.config("spark.task.cpus", 1)
.config("spark.executor.resource.gpu.amount", 1)
.config("spark.task.resource.gpu.amount", 0.08)
.config("spark.sql.execution.arrow.maxRecordsPerBatch", 2000000)
```

Training times: 383.34 s and 383.11 s; average: 383.225 s.

The ratio is 533.41 / 383.225 = 1.39, which means this PR gives a 39% speedup. The ETL phase here is just reading the parquet file and repartitioning; if we involved more real ETL operations, I'd guess stage-level scheduling could bring an even bigger speedup.
Any update on the status of this PR?

Hi @WeichenXu123, could you help review it again? Thanks very much.
LGTM. Once #9519 (comment) is fixed, please file a follow-up PR to remove the workaround code.
Hi @trivialfis, could you help trigger CI?
Could you please provide an example usage that gets everything right (with no warning), and add comments on the example for each configuration with an explanation for why the configuration is set in such a way? If I were a user I would struggle to understand and address all potential warnings.
I will file a follow-up PR to add related docs.
```python
if int(executor_cores) == 1:
    # there will be only 1 task running at any time.
    self.logger.info(
```
Why is this `info` instead of `warning`? Can we pick one consistently? (I like info a bit more, but feel free to pick one as long as it's consistent.)
@wbo4958 Let's draw the line here:
- warning for things that users should fix/address, similar to a bug but slightly less critical.
- info for things that are "good to know", but totally fine if I don't know them.
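A minimal illustration of that line; the messages are hypothetical, not the actual ones from `core.py`:

```python
import logging

logger = logging.getLogger(__name__)

# warning: actionable, the user should change their configuration.
logger.warning(
    "spark.task.resource.gpu.amount is as large as the executor's GPUs; "
    "consider lowering it so ETL tasks can run concurrently."
)

# info: good to know, but nothing for the user to fix.
logger.info("executor_cores == 1, so only one task runs at a time.")
```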
Changed to info
python-package/xgboost/spark/core.py
Outdated
```python
if float(task_gpu_amount) == float(executor_gpus):
    self.logger.warning(
        "The configuration of cores (exec=%s, task=%s, runnable tasks=1) will "
```
Is this CPU cores or GPU cores (I don't know what a GPU core is)? The condition is testing the number of GPUs.
No, it's testing how many tasks will run at the same time on the executor.
task_gpu_amount is a fractional value; we can use executor_gpus / task_gpu_amount to get the number of concurrent tasks.
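As a concrete illustration, using the numbers from the benchmark configs earlier in this thread:

```python
# 1 GPU per executor; each task requests 0.08 of a GPU.
executor_gpus = 1.0
task_gpu_amount = 0.08

# floor(1.0 / 0.08) = 12 tasks can run concurrently on one executor.
concurrent_tasks = int(executor_gpus / task_gpu_amount)

# When task_gpu_amount == executor_gpus, this evaluates to 1, which is
# exactly the single-runnable-task case the warning above is checking for.
```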
Since you are raising a warning, could you please revise the message so that users understand what you mean by "cores"?
Removed this duplicated message, since Spark itself will warn about the wasted resources.
```python
if not _is_standalone_or_localcluster(sc):
    self.logger.warning(
        "Stage-level scheduling in xgboost requires spark standalone or "
```
Can it be used on other types of clusters (maybe in the future)? Is there any cloud service like EMR or Dataproc that can be used with this?
Yes. From Spark 3.5.1, we can enable this feature on Spark standalone / k8s / YARN clusters.
Do we need a condition to check for 3.5.1?
3.5.1 is not released yet. I will file another PR once it's available.
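For what it's worth, a version gate along these lines might look like the sketch below; the helper name is hypothetical, and a real check would also need to distinguish cluster managers (standalone/local-cluster already works on Spark 3.4 per this PR):

```python
from pyspark.sql import SparkSession

def _spark_version_at_least(spark: SparkSession, required: tuple) -> bool:
    # Assumes a plain "x.y.z" version string; vendor suffixes would need
    # extra parsing.
    parts = tuple(int(p) for p in spark.version.split(".")[:3])
    return parts >= required

# Hypothetical usage: only enable k8s/YARN stage-level scheduling on >= 3.5.1.
# if _spark_version_at_least(spark, (3, 5, 1)): ...
```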
An XGBoost training task requires taking the whole GPU exclusively, so previously we limited execution to only 1 task running at a time even for the ETL stages, which typically decreases the parallelism of Spark tasks and ultimately hurts the end-to-end performance of xgboost training.
The stage-level scheduling used in this PR does not change executor resources and hence does not require dynamic allocation to be enabled; the original executors are retained. The features involved, however, are new in Spark 3.4. We also change the CPU task resources in that profile to ensure only 1 task per GPU, but only for the barrier training stage.
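To close, a rough end-to-end sketch of the usage pattern described above. The Spark configs are taken from the benchmark earlier in the thread; the read path is a placeholder, and the `SparkXGBClassifier` arguments are simplified:

```python
from pyspark.sql import SparkSession
from xgboost.spark import SparkXGBClassifier

spark = (
    SparkSession.builder
    # Executor resources stay fixed; stage-level scheduling only overrides
    # per-task requests for the training barrier stage.
    .config("spark.executor.cores", "12")
    .config("spark.executor.resource.gpu.amount", "1")
    # Small task requests keep ETL stages highly parallel (12 tasks/executor).
    .config("spark.task.cpus", "1")
    .config("spark.task.resource.gpu.amount", "0.08")
    .getOrCreate()
)

train_df = spark.read.parquet("/path/to/mortgage")  # placeholder path

# During fit(), the barrier training stage applies its own ResourceProfile
# so that each executor runs exactly one training task with a whole GPU.
clf = SparkXGBClassifier(device="cuda", num_workers=2, label_col="label")
model = clf.fit(train_df)
```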