
Add qualification support for Photon jobs in the Python Tool #1409

Merged

Conversation


@parthosa parthosa commented Nov 2, 2024

Issue #251.

This PR introduces support for recommending Photon applications, using a separate strategy for categorizing them:

  • Spark Runtime: Recommend apps with a speedup greater than 1.3x.
  • Photon Runtime: Recommend apps with a speedup greater than 1x.

Additionally, the Small category for Photon applications is different from that of Spark-based applications:

  • Spark Runtime: Apps with a speedup in the range of 1.3x to 2x are categorized as Small.
  • Photon Runtime: Apps with a speedup in the range of 1x to 2x are categorized as Small.
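
The per-runtime thresholds above can be sketched as follows. This is a minimal illustration, not the actual implementation: the real thresholds live in qualification-conf.yaml and are applied by the SpeedupStrategy class in speedup_category.py; the field names, the `categorize` helper, the boundary handling, and the "Medium/Large" placeholder label are all assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class SpeedupStrategy:
    # Hypothetical fields mirroring the thresholds in the PR description.
    recommend_min: float   # minimum speedup for an app to be recommended at all
    small_max: float       # upper bound of the "Small" category

# Spark apps need > 1.3x to be recommended; Photon apps only need > 1x.
STRATEGIES = {
    'SPARK': SpeedupStrategy(recommend_min=1.3, small_max=2.0),
    'PHOTON': SpeedupStrategy(recommend_min=1.0, small_max=2.0),
}

def categorize(runtime: str, speedup: float) -> str:
    """Return a speedup category for a single app based on its runtime."""
    strategy = STRATEGIES[runtime]
    if speedup <= strategy.recommend_min:
        return 'Not Recommended'
    if speedup <= strategy.small_max:
        return 'Small'
    return 'Medium/Large'  # placeholder for the larger categories
```

Note how the same 1.2x speedup is "Not Recommended" for a Spark app but "Small" (i.e., recommended) for a Photon app.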

Note

  • Speedup Strategy is assigned on a per-app basis, enabling support for heterogeneous cases.
  • Hence, if a user provides both Photon and Spark event logs, the Python Tool will apply a separate strategy to each app based on its execution engine (Spark or Photon).

Output

  • As this is a metadata property, an entry sparkRuntime is included for each app in app_metadata.json:
  {
    "appId": "app-20240818062343-0000",
    "appName": "Databricks Shell",
    "eventLog": "file:/path/to/log/photon_eventlog",
    "sparkRuntime": "PHOTON",
    "estimatedGpuSpeedupCategory": "Not Recommended"
  }
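
A downstream consumer could use this new field to split apps by runtime before applying the per-runtime strategies. The sketch below is illustrative only: it assumes app_metadata.json is a JSON array of entries shaped like the sample above, and the `group_apps_by_runtime` helper is hypothetical.

```python
import json

def group_apps_by_runtime(metadata_json: str) -> dict:
    """Group appIds by their sparkRuntime field; default to SPARK if absent."""
    apps = json.loads(metadata_json)
    grouped: dict = {}
    for app in apps:
        grouped.setdefault(app.get('sparkRuntime', 'SPARK'), []).append(app['appId'])
    return grouped

# Hypothetical two-app metadata file mixing Photon and Spark event logs.
sample = '''[
  {"appId": "app-20240818062343-0000", "appName": "Databricks Shell",
   "sparkRuntime": "PHOTON", "estimatedGpuSpeedupCategory": "Not Recommended"},
  {"appId": "app-20240818062343-0001", "appName": "Spark Shell",
   "sparkRuntime": "SPARK", "estimatedGpuSpeedupCategory": "Small"}
]'''
```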

Changes

Enhancements and New Features:

  • tool_ctxt.py: Introduced a new method get_metrics_output_folder to fetch the metrics output directory.
  • qualification-conf.yaml: Updated configuration to include new metrics subfolder and execution engine settings. [1] [2] [3] [4]
  • enums.py: Added a new ExecutionEngine class to represent different execution engines.
  • speedup_category.py: Introduced SpeedupStrategy class and refactored methods to accommodate execution engine-specific speedup strategies. [1] [2] [3] [4]

Refactoring and Utility Improvements:

  • qualification.py: Added a helper method _read_qualification_metric_file to read metric files and _assign_execution_engine_to_apps to assign execution engines to applications.
  • util.py: Added a utility method convert_df_to_dict to convert DataFrames to dictionaries.
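
As a rough idea of what such a utility might look like, here is a minimal DataFrame-to-dict helper. The actual convert_df_to_dict in util.py may have a different signature and semantics; the column-pair mapping shown here is an assumption.

```python
import pandas as pd

def convert_df_to_dict(df: pd.DataFrame, key_col: str, value_col: str) -> dict:
    """Map each value in key_col to the corresponding value in value_col."""
    return dict(zip(df[key_col], df[value_col]))
```

For example, mapping appId to sparkRuntime: `convert_df_to_dict(df, 'appId', 'sparkRuntime')`.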

Tests:

  • event_log_processing.feature: Added new test scenarios to validate the execution engine assignment.
  • e2e_utils.py and test_steps.py: Updated end-to-end test utilities to support new features. [1] [2] [3]


@parthosa parthosa added feature request New feature or request user_tools Scope the wrapper module running CSP, QualX, and reports (python) labels Nov 2, 2024
@parthosa parthosa self-assigned this Nov 2, 2024
Signed-off-by: Partho Sarthi <[email protected]>
@parthosa parthosa marked this pull request as ready for review November 4, 2024 20:19
@parthosa parthosa added the affect-output A change that modifies the output (add/remove/rename files, add/remove/rename columns) label Nov 4, 2024
@amahussein amahussein left a comment


Thanks @parthosa!
Just for the sake of confirmation:

  • Is there another follow-up PR to change the QualX module to read app_meta.json to decide whether an app is Photon or not? In that case, the PR description is not accurate because it gives the impression that it adds support end-to-end.
  • I am concerned about how we can troubleshoot and validate app_meta.json. The wrapper reads the AutoTuner's output and copies some of the fields to that file at the upper level. With this PR, we are adding a new field derived from Python logic. Later, we will hit the question "Where does each field come from?" (this becomes even more challenging if fields can be overridden by the Python wrapper). CC: @tgravescs

upperBound: 1000000.0
- columnName: 'Unsupported Operators Stage Duration Percent'
lowerBound: 0.0
upperBound: 25.0
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs some thinking about its design impact.
It introduces a platform configuration inside the tool's conf, while, on the other hand, we already have a configuration file per platform.

Collaborator Author

I was thinking that, since all platforms would have the same value for the Spark case, we would be duplicating the configuration in each platform. In the future, if we have different values for different platforms, we could move these into separate platform config files.

Collaborator

It is a valid point that there are some common settings between platforms.
In the future, we can improve our config structure to have a common parent or something shared between all the platforms.
The other way around, specifying platform behavior inside the tool's config, will create a design inconsistency moving forward, especially with every contributor's preference on where a newly added config should go.

Collaborator

I guess it is okay for now to keep that in order to unblock the photon feature.
Later, we can revisit this.

user_tools/src/spark_rapids_tools/enums.py (outdated, resolved)
user_tools/src/spark_rapids_tools/utils/util.py (outdated, resolved)

parthosa commented Nov 6, 2024

Following offline discussions with @amahussein and @leewyang, we are moving the detection of the runtime (Spark/Photon/Velox) to Scala.

This PR will be refactored afterwards.

@parthosa parthosa marked this pull request as draft November 6, 2024 23:12
@parthosa parthosa marked this pull request as ready for review November 12, 2024 22:38
@parthosa (Collaborator Author)

@amahussein

Is there another followup PR to change the QualX module to read the app_meta.json to decide whether this app is photon or not?

I am concerned about how we can troubleshoot and validate app_meta.json. The wrapper reads the AutoTuner's output and copies some of the fields to that file at the upper level. With this PR, we are adding a new field derived from Python logic. Later, we will hit the question "Where does each field come from?" (this becomes even more challenging if fields can be overridden by the Python wrapper).

  • Similarly, all values in app_meta.json will now be derived from Scala logic, with no Python logic involved.

cindyyuanjiang previously approved these changes Nov 14, 2024

@cindyyuanjiang cindyyuanjiang left a comment


Thanks @parthosa!


@amahussein amahussein left a comment


Thanks @parthosa

Just add a comment in the config file to explain why we picked those new thresholds for the Photon categories.

user_tools/src/spark_rapids_tools/storagelib/csppath.py (outdated, resolved)


@amahussein amahussein left a comment


Thanks @parthosa


@cindyyuanjiang cindyyuanjiang left a comment


Thanks @parthosa! LGTM.

@parthosa parthosa merged commit 43825d8 into NVIDIA:dev Nov 14, 2024
14 checks passed
@parthosa parthosa deleted the spark-rapids-tools-251-support-photon-in-python branch November 14, 2024 19:37