This sample demonstrates how to:
- Emit custom metrics from your Job Runs.
- Query service metrics and custom metrics for your Job Runs.
You need a dynamic group that includes your job run resources. Example matching rule:

```
all {resource.type='datasciencejobrun'}
```
You need a policy that enables resources in this dynamic group to post metrics (see https://docs.oracle.com/iaas/Content/Identity/Reference/monitoringpolicyreference.htm). Example policy:

```
allow dynamic-group my_job_runs to use metrics in tenancy where target.metrics.namespace='my_custom_namespace'
```
You need a policy that enables your user to read metrics (see https://docs.oracle.com/iaas/Content/Identity/Reference/monitoringpolicyreference.htm). Example policy:

```
allow group metric_reader_group to read metrics in compartment my_compartment
```
On your local machine, you need to install the OCI Python SDK: https://docs.oracle.com/en-us/iaas/tools/python/latest/
You also need to set up an API Key: https://docs.oracle.com/iaas/Content/API/Concepts/apisigningkey.htm
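To confirm the SDK and API Key are wired up correctly, you can run a quick sanity check like the following (an optional snippet, not part of the sample):

```python
# Optional sanity check: confirm the SDK can load and use your API Key config.
import oci

config = oci.config.from_file()      # reads ~/.oci/config, DEFAULT profile
oci.config.validate_config(config)   # raises if a required field is missing

# A lightweight authenticated call: look up your own user record.
identity = oci.identity.IdentityClient(config)
print(identity.get_user(config["user"]).data.name)
```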
The sample contains the following files:

- The `artifact` directory contains all the files included in the job artifact.
  - The `custom_metrics` directory defines classes used to create the custom metrics.
    - `custom_metrics_provider.py` - Includes a base class for custom metric providers.
    - `gpu_metrics_provider.py` - Invokes `nvidia-smi` to query GPU properties and builds custom metrics for GPU power draw, temperature, GPU utilization, and memory usage.
    - `random_metrics_provider.py` - A very simple custom metric provider example that generates a random value.
  - `entrypoint.sh` - The job entrypoint script. It starts the `metrics_submitter.py` script in the background and then sleeps for 10 minutes.
  - `metrics_submitter.py` - Every minute, this script invokes the GPU metrics provider and the random metrics provider, then pushes the custom metrics to the OCI Monitoring Service. It uses Resource Principal authentication to submit metrics (a simplified sketch follows this list).
- `generate_csv.py` - Queries metric values over the specified time interval and saves them to a CSV file.
- `get_current_metrics.py` - Queries the current metric values of an ongoing job run. Refreshes the display every minute.
- `package_artifact.sh` - Creates the job artifact.
- `query_helpers.py` - Helper class with common methods used by both `generate_csv.py` and `get_current_metrics.py`.
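In essence, `metrics_submitter.py` builds datapoints and posts them to the Monitoring Service with Resource Principal authentication. A simplified sketch of that flow with the OCI SDK (the dimension name and the `JOB_RUN_*` environment variables here are illustrative assumptions, not the sample's exact code):

```python
# Simplified sketch of posting a custom metric from inside a job run.
import datetime
import os

import oci

# Resource Principal auth is available inside the job run container.
signer = oci.auth.signers.get_resource_principals_signer()

# Custom metrics are posted to the region's telemetry-ingestion endpoint,
# not the default Monitoring query endpoint.
monitoring = oci.monitoring.MonitoringClient(
    config={},
    signer=signer,
    service_endpoint=f"https://telemetry-ingestion.{signer.region}.oraclecloud.com",
)

metric = oci.monitoring.models.MetricDataDetails(
    namespace=os.environ["METRICS_NAMESPACE"],
    compartment_id=os.environ["JOB_RUN_COMPARTMENT_OCID"],  # assumed env var
    name="random",
    dimensions={"job_run_ocid": os.environ["JOB_RUN_OCID"]},  # illustrative
    datapoints=[
        oci.monitoring.models.Datapoint(
            timestamp=datetime.datetime.now(datetime.timezone.utc),
            value=42.0,
        )
    ],
)
monitoring.post_metric_data(
    post_metric_data_details=oci.monitoring.models.PostMetricDataDetails(
        metric_data=[metric]
    )
)
```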
To create the job artifact, navigate to the sample directory and run:

```
$ ./package_artifact.sh
```

This script will archive the contents of the `artifact` directory to `build/artifact.tar.gz`.
Using the console, create a job (see https://docs.oracle.com/iaas/data-science/using/jobs-create.htm). For the job artifact, use the `build/artifact.tar.gz` file created in the previous step.
You must specify the following environment variables:

- `METRICS_NAMESPACE` - The namespace to store your custom metrics under. Each metric is associated with a single namespace, and you must provide this namespace when querying the metrics later.
- `JOB_RUN_ENTRYPOINT` - Set this to `entrypoint.sh`.
If you select a standard shape or flex shape, the job run will only output a single custom metric named `random`.

If you select a GPU shape, the job run will emit the `random` metric as well as 4 custom GPU metrics:

- `gpu.power_draw`
- `gpu.temperature`
- `gpu.gpu_utilization`
- `gpu.memory_usage`
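Under the hood, a provider like `gpu_metrics_provider.py` can gather these values by shelling out to `nvidia-smi`. A rough sketch of that kind of query (the exact fields and parsing in the sample may differ):

```python
# Rough sketch: query per-card GPU stats via nvidia-smi.
import csv
import subprocess

result = subprocess.run(
    [
        "nvidia-smi",
        "--query-gpu=pci.bus_id,power.draw,temperature.gpu,utilization.gpu,memory.used",
        "--format=csv,noheader,nounits",  # nounits keeps values purely numeric
    ],
    capture_output=True,
    text=True,
    check=True,
)

# One CSV row per GPU card; the PCI bus ID distinguishes the cards.
for row in csv.reader(result.stdout.splitlines()):
    pci_bus, power_w, temp_c, util_pct, mem_mib = (field.strip() for field in row)
    print(pci_bus, float(power_w), float(temp_c), float(util_pct), float(mem_mib))
```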
NOTE: The service itself already emits a GPU Utilization metric that averages the utilization of all GPU cards. With these custom metrics, one metric is emitted per card, and the PCI bus ID is added as a metric dimension to distinguish them. When querying metrics, the aggregated data points are grouped by their dimensions. You can also include dimensions in your MQL queries to filter your results; see https://docs.oracle.com/iaas/Content/Monitoring/Reference/mql.htm#dimension
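For example, assuming the sample publishes the bus ID under a dimension named `pci_bus` (the actual key may differ), an MQL query such as `gpu.gpu_utilization[1m]{pci_bus = "00000000:3B:00.0"}.mean()` would chart the utilization of a single card.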
For networking, if you choose to provide your own subnet, it must support egress to the OCI Monitoring Service, either through a Service Gateway or a NAT Gateway.
Once you've created your job, you can start a job run.
Once your job run has reached the `IN PROGRESS` state, you can run `get_current_metrics.py` to query the current values of the service metrics and custom metrics emitted by your job run.
The script will query the OCI Monitoring Service every minute.
This script uses API Key authentication to query metrics.
```
$ python ./get_current_metrics.py --help
usage: get_current_metrics.py [-h] [--namespace NAMESPACE] [--config CONFIG]
                              [--profile PROFILE]
                              job_run_ocid

positional arguments:
  job_run_ocid          The OCID of the job run whose metrics should be
                        displayed.

optional arguments:
  -h, --help            show this help message and exit
  --namespace NAMESPACE
                        The namespace to query for custom metrics. If
                        unspecified, only service metrics will be displayed.
  --config CONFIG       The path to the OCI config file. Defaults to
                        ~/.oci/config
  --profile PROFILE     The OCI config profile to use. Defaults to DEFAULT
```
Example invocation:

```
$ python ./get_current_metrics.py ocid1.datasciencejobrun.oc1.aaaaa --namespace my_namespace
```
If your job run has already completed, or if you're interested in viewing metric data over a given time interval, use the `generate_csv.py` script.
This script uses API Key authentication to query metrics.
```
$ python ./generate_csv.py --help
usage: generate_csv.py [-h] [--out OUT] [--namespace NAMESPACE]
                       [--config CONFIG] [--profile PROFILE]
                       [--start-utc START_UTC] [--end-utc END_UTC]
                       job_run_ocid

positional arguments:
  job_run_ocid          The OCID of the job run whose metrics should be
                        queried.

optional arguments:
  -h, --help            show this help message and exit
  --out OUT             The path to the output csv file. Defaults to
                        metrics.csv
  --namespace NAMESPACE
                        The namespace to query for custom metrics. If
                        unspecified, only service metrics will be queried.
  --config CONFIG       The path to the OCI config file. Defaults to
                        ~/.oci/config
  --profile PROFILE     The OCI config profile to use. Defaults to DEFAULT
  --start-utc START_UTC
                        The UTC start time for the metric query in ISO format.
                        Ex: 2023-01-31 or '2023-01-31 18:45:30'. Defaults to
                        24 hours ago.
  --end-utc END_UTC     The UTC end time for the metric query in ISO format.
                        Ex: 2023-01-31 or '2023-01-31 18:45:30'. Defaults to
                        the current time.
```
Example invocation:

```
$ python ./generate_csv.py ocid1.datasciencejobrun.oc1.aaaaa --namespace my_namespace --start-utc '2023-01-05 18:00:00' --end-utc '2023-01-05 19:00:00'
```
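Both query scripts ultimately wrap the Monitoring Service's summarize endpoint. A minimal standalone sketch of that call (the compartment OCID, namespace, and query string below are placeholders):

```python
# Minimal sketch of the underlying metric query with the OCI SDK.
import datetime

import oci

config = oci.config.from_file()  # API Key auth, as in the sample scripts
monitoring = oci.monitoring.MonitoringClient(config)

end = datetime.datetime.now(datetime.timezone.utc)
details = oci.monitoring.models.SummarizeMetricsDataDetails(
    namespace="my_namespace",            # placeholder
    query="random[1m].mean()",           # placeholder MQL query
    start_time=end - datetime.timedelta(hours=24),
    end_time=end,
)
response = monitoring.summarize_metrics_data(
    compartment_id="ocid1.compartment.oc1..example",  # placeholder
    summarize_metrics_data_details=details,
)

# Each result carries its dimensions plus the aggregated datapoints.
for metric in response.data:
    print(metric.name, metric.dimensions)
    for point in metric.aggregated_datapoints:
        print(" ", point.timestamp, point.value)
```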
If you prefer, you can craft metric queries in the OCI console and customize graphs displaying different metrics over custom time ranges. See https://docs.oracle.com/iaas/Content/Monitoring/Tasks/buildingqueries.htm#console
The `entrypoint.sh` script in this sample just sleeps after starting the metric submitter. You can replace the `sleep` call with your own commands, or with calls to your own custom scripts.

If you want to include additional files in your job artifact, just copy them under the `artifact` directory. The `package_artifact.sh` script will automatically include all files under this directory in the generated `build/artifact.tar.gz` file.
To define your own custom metrics:

- Create a class in the `artifact/custom_metrics` directory that extends `CustomMetricsProvider`, and implement the `get_metrics()` function to produce your own custom metric values.
- In `artifact/metrics_submitter.py`, initialize your custom metrics class in the `metric_providers` collection defined near the top of the file.

Use the existing `GpuMetricsProvider` and `RandomMetricsProvider` classes for reference.
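As a rough illustration of the shape such a provider might take (the `get_metrics()` return format here is an assumption; mirror the actual contract in `custom_metrics_provider.py` and the existing providers):

```python
# artifact/custom_metrics/load_average_provider.py -- hypothetical example.
import os

from custom_metrics_provider import CustomMetricsProvider  # assumed import path


class LoadAverageMetricsProvider(CustomMetricsProvider):
    """Emits the 1-minute system load average as a custom metric."""

    def get_metrics(self):
        load_1m, _, _ = os.getloadavg()
        # Return shape is assumed; match whatever the sample's providers return.
        return {"load_average_1m": load_1m}
```

The new provider would then be added to the `metric_providers` collection in `metrics_submitter.py` alongside the existing ones.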
- OCI Data Science Jobs - https://docs.oracle.com/iaas/data-science/using/jobs-about.htm
- OCI Monitoring Service - https://docs.oracle.com/iaas/Content/Monitoring/home.htm