Releases: NVIDIA/spark-rapids-tools

v24.10.2

06 Dec 16:35

Changes

User Tools

  • Update models for latest tools code (#1448)
  • More flexible regexes; fix default split function (#1443)
  • Update models for latest code and dataset JSON (#1442)
  • Add model for databricks-azure_photon and update combined model (#1427)
  • Remove custom-speedup module from user-tools (#1425)

Core

  • Count expressions per Exec in SQLPlanParser (#1449)
  • Report all operators in the output file (#1444)
  • Fix missing exec-to-stageId mapping in Qual tool (#1437)
  • [BUG] Fix Profiler tool index out of bound exception when generating diagnostic metrics (#1439)
  • Sort Qual execs report by sqlId and nodeId (#1436)
  • Include expression parsers for HashAggregate and ObjectHashAggregate (#1432)
  • [FEA] Add stage/task level diagnostic output for GPU slowness in Profiler tool (#1375)
  • Reduce the log noise caused by core report summary (#1426)
  • Trigger GC at the beginning of each benchmark iteration (#1424)

Miscellaneous

  • [BUG] Fix sync plugin files script to handle empty or non-existing CSV files (#1446)
  • Enable license header check (#1440)

v24.10.1

15 Nov 02:36

Changes

User Tools

  • Add qualification support for Photon jobs in the Python Tool (#1409)
  • Add qualx support for platform runtime variants (DB AWS) (#1417)
  • Update models for latest emr, onprem eventlogs (#1410)

Core

  • Adding EMR-specific tunings for shuffle manager and ignoring jar (#1419)
  • Changing autotuner memory error to warning in comments (#1418)
  • Add sparkRuntime property to capture runtime type in application_information (#1414)
  • Refactor Exec Parsers - remove individual parser classes (#1396)
  • Remove estimated GPU duration from qualification output (#1412)

v24.10.0

04 Nov 23:23

Changes

User Tools

  • [FEA] Allow users to specify custom Dependency jars (#1395)
  • Reduce default memory allocation to the java process (#1407)
  • Update error handling in python for parsing cluster information (#1394)
  • user-tools should add xms argument to java cmd (#1391)
  • Use environment variables to set thresholds in static yaml configurations (#1389)
  • Use StorageLib to download dependencies (#1383)
  • Remove total core second heuristic and filter apps only in top candidate view (#1376)
  • Generate log files for Python Profiling cli (#1366)
  • Update models for updated datasets and latest code (#1365)
  • Isolate dataset for qualx plugin invocations (#1361)
  • [FEA] Add total core seconds into top candidate view (#1342)
  • Fix python tool picking up wrong JAR version in Fat wheel mode (#1357)
  • [FOLLOWUP-1326] Set Spark version to 3.4.2 by default for onprem environment (#1358)
  • Disable too-many-positional-arguments in pylintrc (#1353)
  • Reduce console output tree level, exclude JAR tool output files and remove incorrect logging (#1340)

Core

  • Add support for Photon-specific SQL Metrics (#1390)
  • Add support for processing Photon event logs in Scala (#1338)
  • Add Reflection to support custom Spark Implementation at Runtime (#1362)
  • Improve AQE support by capturing SQLPlan versions (#1354)
  • Add PartitionFilters and DataFilters to the dataSourceInfo table (#1346)
  • Add support to ArrayJoin in Qualification tool (#1345)

Miscellaneous

  • Cluster information should handle dynamic allocation and nodes being removed and added (#1369)
  • Rename tag core to core_tools (#1350)

v24.08.2

10 Sep 21:25

Changes

User Tools

  • Add end-to-end behavioural tests for the python CLI (#1313)
  • Add documentation for qualx plugins (#1337)
  • Allow spark dependency to be configured dynamically (#1326)
  • Follow-up 1318: Fix QualX fallback with default speedup and duration columns (#1330)
  • Updated models for EMR NDS-H dataset (#1331)

Core

  • [FEA] Add total core seconds in Qualification core tool output (#1320)
  • Add support to MaxBy and MinBy in Qualification tool (#1335)
  • Add safeguards to prevent older attempts from generating metrics output in Scala Tool (#1324)
  • Sync up DAYTIME and YEARMONTH fields with CSV plugin files (#1328)

Miscellaneous

  • Update signoff usage [skip ci] (#1332)

v24.08.1

04 Sep 01:06

Changes

User Tools

  • [DOC] spark_rapids CLI help cmd still shows cost savings (#1317)
  • Fix Qualification and Profiling tools CLI argument shorthands (#1312)
  • Raise error for enum creation from invalid string values (#1300)
  • Append HADOOP_CONF_DIR to the tools CLASSPATH execution cmd (#1308)
  • Fix key error and cross-join error during qualx evaluate (#1298)
  • Qual tool: Print more useful log messages when failures happen downloading dependencies (#1292)
  • Fix --help text for custom_model_file option (#1285)

Core

  • Remove legacy SpeedupFactor from core output files (#1318)
  • Mark decimalsum as supported in Qualification tool (#1323)
  • Mark SMJ as unsupported operator for corner cases in left join (#1309)
  • Remove arguments and code related to the html-report (#1311)
  • Handle SparkRapidsBuildInfoEvent in GPU event logs (#1203)
  • Enable recursive search for event logs by default and optional --no-recursion flag (#1297)
  • Qualification tool support filtering by a filesystem time range (#1299)
  • Skip generating timeline for stages that do not have completion time (#1290)
  • Save core tools logs to output log file (#1269)
  • Qualification tool - Add option to filter by minimum event log size (#1291)
  • Include exception message for unknown app status in core tool (#1281)

Miscellaneous

  • Remove restricted google sheets link and outdated TCO section (#1289)

v24.08.0

13 Aug 02:52

Changes

User Tools

  • Remove calculation of gpu cluster recommendation from python tool when cluster argument is passed (#1278)
  • Remove unused argument --target_platform in Python Tool (#1279)
  • Qualification tool: Add output stats file for Execs(operators) (#1225)
  • Include GPU information in the cluster recommendation for Dataproc and OnPrem (#1265)
  • Remove speedup based recommendation column from qual_summary csv (#1268)
  • Fix prediction CSV files for multiple qual directories (#1267)
  • Clean up tools after removing CLI dependency (#1256)
  • Rename cluster shape columns to use 'worker' prefix in the output files and rename metadata file (#1258)
  • Remove CLI dependency in Dataproc _pull_gpu_hw_info implementation (#1245)
  • Replace split_nds with split_train_val (#1252)
  • Update xgboost models and metrics (#1244)
  • Add footnotes for config recommendations and speedup category in top candidate view (#1243)
  • [BUG] Update Dataproc instance catalog for n1 series GPU info (#1242)
  • Improvements in Cluster Config Recommender (#1241)
  • Improve console output from python tool for failed/gpu/photon event logs (#1235)
  • [FEA] Generate and use instance description file for Databricks-Azure platform (#1232)
  • Remove arguments related to cost-savings (#1230)
  • Updated models for latest databricks-aws datasets (#1231)
  • Refactor QualX for Linter and Test Compatibility (#1228)
  • Generate summary metadata file and fix node recommendation in python (#1216)
  • [FEA] Remove gcloud CLI dependency for Dataproc platform (#1223)
  • Updated models for latest dataproc eventlogs (#1226)
  • Remove estimation-model column from qualification summary (#1220)
  • Add option to add features.csv files to training set (#1212)
  • Disable cost saving functionality (#1218)
  • [FEA] Remove CLI dependency for EMR and Databricks-AWS platforms in user tool (#1196)
  • Fix some basic pylint errors in qualx code (#1210)
  • Qual tool tuning rec based on CPU event log coherently recommend tunings and node setup and infer cluster from eventlog (#1188)
  • Add shap command to internal CLI for debugging (#1197)
  • Add internal CLI to generate instance descriptions for CSPs (#1137)
  • [FEA] Support custom XGBoost model file via user tools CLI (#1184)
  • Updated models for new training data (#1186)
  • Add evaluate_summary command to internal CLI (#1185)
  • [DOC] Fix broken link to qualX docs and update python prerequisites (#1180)
  • Bump to certifi-2024.7.4 and urllib3-1.26.19 (#1173)
  • Disable UI-HTML report by default in Qualification tool (#1168)
  • Fix parsing App IDs inside metrics directory in QualX (#1167)
  • Refactor Databricks-AWS Qual tool to cache and process pricing info from DB website (#1141)
  • Add plugin mechanism for dataset-specific preprocessing in qualx (#1148)
  • Unsupported op logic should read action column from qual's output (#1150)
  • Update qualx readme for training (#1140)
  • Disable pylint-unreachable code in tox.ini (#1145)

Core

  • Include GPU information in the cluster recommendation for Dataproc and OnPrem (#1265)
  • [TASK] Optimize the storage of accumulables in core tools (#1263)
  • Sync GetJsonObject support with Rapids-Plugin (#1266)
  • Do not create new StageInfo object (#1261)
  • [FEA] Add support for map_from_arrays in qualification tools (#1248)
  • Rename cluster shape columns to use 'worker' prefix in the output files and rename metadata file (#1258)
  • Fix stage level metrics output csv file (#1251)
  • Handle event logs with wildcards in status report generation (#1237)
  • Fix duplicate records in DataSourceInfo report (#1227)
  • Reduce memory footprint of stageInfo (#1222)
  • Ensure UTF-8 encoding for reading non-english characters (#1211)
  • Sync plugin support for hash-hive and shift operators (#1198)
  • Sync-up the support of parse_url in qualification tool (#1195)
  • Include status information for failed event logs in core tool (#1187)
  • [FEA] Adding Benchmarking classes to evaluate core tools performance (#1169)
  • [BUG] Fix handling of non-english characters in tools output files (#1189)
  • [BUG] Fix Java Qual tool handling of --platform argument (#1161)
  • Add all stage metrics to tools output (#1151)
  • Follow-up 1142: remove TODO line (#1146)
  • Mark wholestageCodeGen as shouldRemove when child nodes are removed (#1142)
  • [FEA] Display full failure messages in failed CSV files (#1135)

Miscellaneous

  • Qualification tool: Add option to filter event logs for a maximum file system size (#1275)
  • Qualification tool should print Kryo related recommendations (#1204)
  • Fix header check script to exclude files (#1224)
  • Update header check script for pre-commit hooks (#1219)
  • Follow-up 1189: handle non-english characters in data-output.js (#1208)
  • Update pre-commit hooks to check for headers and white-spaces (#1205)
  • user-tools: Update --help for cluster argument (#1178)
  • Support fine-tuning models (#1174)

v24.06.1

18 Jun 22:44

Changes

User Tools

  • Fix Python runtime error caused by numpy 2.0.0 release (#1130)
  • Disable the spark_rapids bootstrap command (#1114)

Core

  • Handle different exception thrown by incomplete eventlogs (#1124)
  • Include number of executors per node in cluster information (#1119)

v24.06.0

12 Jun 20:07

Changes

User Tools

  • Add support to Python 3.12 (#1111)
  • user-tools: Update log messages (#1110)
  • Enable xgboost prediction model by default (#1108)
  • Add support to Python 3.11 (#1105)
  • Fix nan label issue in training (#1104)
  • Fix qualx app metrics (#1102)
  • Clip appDuration to at least Duration (#1096)
  • Fix missing assignment to savings_recommendations (#1098)
  • Handle QualX behaviour when Qual Tool does not generate any outputs (#1095)
  • Fix internal predict CLI and remove preprocessed argument (#1093)
  • Update QualX to return default speedups and fix App Duration for incomplete apps (#1089)
  • Fix signature error from overlapping merges (#1084)
  • Sync with internal repo; update models (#1083)
  • Reduce the maximum number of Java threads in CLI (#1082)
  • Remove using Profiler metrics for QualX and Heuristics (#1080)
  • Port QualX repo and add CLI for train (#1076)
  • User tools fallback to default zone/region (#1054)
  • Handle missing pricing info for user qual tool on Databricks platforms (#1053)
  • Split job and stage level aggregated metrics into different files (#1050)
  • Skip Cluster Inference when CSP CLIs are missing or not configured (#1035)
  • Store Cluster Shape Recommendation in User Tools Qualification Output (#1005)
  • Fix calculation of unsupported operators stage duration percentage (#1006)
  • Update Databricks Azure qual tool to set env variable for ABFS paths (#1016)
  • Add heuristics using stage spill metrics to skip apps (#1002)
  • Fix failure in github workflow's pylint (#1015)
  • Updating qual validation script to directly use top candidate view recommendation (#1001)

Core

  • Fix typo in Profiler class using qual instead of prof (#1113)
  • Fix missing appEndTime in raw_metrics folder (#1092)
  • Sync tools with plugin newly supported operators (#1066)
  • Fix java Qual tool Autotuner output when GPU device is missing (#1085)
  • Update the Qual tool AutoTuner Heuristics against CPU event logs (#1069)
  • Handling FileNotFound exception in AutoTuner (#1065)
  • Handle metric names from legacy spark (#1052)
  • Split job and stage level aggregated metrics into different files (#1050)
  • Refactor ProfileResult classes to implement new interface design and add CSV output to Qual Tool (#1043)
  • Hook up the auto tuner in the qualification tool (#1039)
  • Profiler should identify the delta log ops and generate views for non-delta logs (#1031)
  • Qualification tool - Handle cancelled jobs and stages better and don't skip the app (#1033)
  • [FEA] Generate Status Report for Profiling Tool (#1012)
  • Fix calculation of unsupported operators stage duration percentage (#1006)
  • Fix potential problems and AQE updates in Qual tool (#1021)
  • Sync supported operators with plugin changes and update default score (#1020)
  • Refactor TaskEnd to be accessible by Q/P tools (#1000)

Miscellaneous

  • Bump requests from 2.31.0 to 2.32.2 in /data_validation (#1077)

v24.04.0

07 May 21:20

Changes

User Tools

  • [FEA] Add CLI to run prediction on estimation_model (#961)
  • Adding SHAP predict values as new output file (#982)
  • Update docs for building to clarify to build in a virtual environment (#976)

Core

  • [BUG] Catch Profiler error when app info is empty (#994)
  • Get stages from sqlId for collecting info for output writer functions (#996)
  • Account for joboverhead time in qualification tool estimation (#992)
  • [Followup] Fix handling of clusterTags and SparkVersion in Q/P Tools (#993)
  • Fix handling of clusterTags and SparkVersion in Q/P Tools (#991)
  • Refactor AppBase to use common AppMetaData between Q/P tools (#983)
  • Refactor Stage info code between Q/P tools (#971)

v24.02.4

30 Apr 17:07

Changes

User Tools

  • Fix Hadoop Azure version to be compatible with Spark-3.5.0 (#975)
  • Add speedup categories in qualification summary output (#958)
  • Improve cluster node initialisation for CSPs (#964)

Core

  • Remove databricks profiling recommendation for dynamicFilePruning (#972)
  • Add AQEShuffleRead and WriteFiles execs to the supportedOps and score files (#963)
  • [FEA] Automate appending new operators to the platform score sheets (#954)
  • Add support for InSubqueryExec Expression (#960)

Miscellaneous

  • Bump dev version to 24.02.4 (#968)
  • Revert versions back to 24.02.3 (#967)