AIOpsLab is a holistic framework to enable the design, development, and evaluation of autonomous AIOps agents that, additionally, serves the purpose of building reproducible, standardized, interoperable and scalable benchmarks. AIOpsLab can deploy microservice cloud environments, inject faults, generate workloads, and export telemetry data, while orchestrating these components and providing interfaces for interacting with and evaluating agents.
Moreover, AIOpsLab provides a built-in benchmark suite with a set of problems to evaluate AIOps agents in an interactive environment. This suite can be easily extended to meet user-specific needs. See the problem list here.
This project offers flexible setup options to accommodate different user environments. Depending on your current setup, you can choose from one of the following paths:
-
Using Existing VMs with a Kubernetes cluster:
You can clone the repository using the following command. We recommend
poetry
for managing dependencies. You can also use a standardpip install -e .
to install the package.$ git clone <CLONE_PATH_TO_THE_REPO> $ cd AIOpsLab $ pip install poetry $ poetry install -vvv $ poetry shell
After entering the poetry virtual environment, setting Up AIOpsLab:
$ cd scripts $ ./setup.sh $(hostname) # or <YOUR_NODE_NAME>
-
Provisioning VMs and Kubernetes on cloud
Users can follow the instructions here, to create a two-node Kubernetes cluster on Azure. It can also be used as a starting point for creating more complex deployments, or deployments on other cloud. Then go to step 1 to set up the AIOpsLab's dependencies.
-
Self-managed Kubernetes cluster
You can also have a self-managed Kubernetes (k8s) cluster running as prerequisites. You can refer to our k8s installation, which installs k8s directly on the server (note that this is an installation example instead of an executable script; you may need to modify some parts to suit your case, e.g., node name and cert hash in the script). Then go to step 1 to set up the AIOpsLab's dependencies.
Human as the agent:
$ python3 cli.py
(aiopslab) $ start misconfig_app_hotel_res-detection-1 # or choose any problem you want to solve
# ... wait for the setup ...
(aiopslab) $ submit("Yes") # submit solution
Run GPT-4 baseline agent:
$ export OPENAI_API_KEY=<YOUR_OPENAI_API_KEY>
$ python3 clients/gpt.py # you can also change the prolem to solve in the script
You can check the running status of the cluster using k9s or other cluster monitoring tools conveniently.
AIOpsLab can be used in the following ways:
AIOpsLab makes it extremely easy to develop and evaluate your agents. You can onboard your agent to AIOpsLab in 3 simple steps:
-
Create your agent: You are free to develop agents using any framework of your choice. The only requirements are:
-
Wrap your agent in a Python class, say
Agent
-
Add an async method
get_action
to the class:# given current state and returns the agent's action async def get_action(self, state: str) -> str: # <your agent's logic here>
-
-
Register your agent with AIOpsLab: You can now register the agent with AIOpsLab's orchestrator. The orchestrator will manage the interaction between your agent and the environment:
from aiopslab.orchestrator import Orchestrator agent = Agent() # create an instance of your agent orch = Orchestrator() # get AIOpsLab's orchestrator orch.register_agent(agent) # register your agent with AIOpsLab
-
Evaluate your agent on a problem:
-
Initialize a problem: AIOpsLab provides a list of problems that you can evaluate your agent on. Find the list of available problems here or using
orch.probs.get_problem_ids()
. Now initialize a problem by its ID:problem_desc, instructs, apis = orch.init_problem("k8s_target_port-misconfig-mitigation-1")
-
Set agent context: Use the problem description, instructions, and APIs available to set context for your agent. (This step depends on your agent's design and is left to the user)
-
Start the problem: Start the problem by calling the
start_problem
method. You can specify the maximum number of steps too:import asyncio asyncio.run(orch.start_problem(max_steps=30))
-
This process will create a Session
with the orchestrator, where the agent will solve the problem. The orchestrator will evaluate your agent's solution and provide results (stored under data/results/
). You can use these to improve your agent.
AIOpsLab provides a default list of applications to evaluate agents for operations tasks. However, as a developer you can add new applications to AIOpsLab and design problems around them.
Note: for auto-deployment of some apps with K8S, we integrate Helm charts (you can also use
kubectl
to install as HotelRes application). More on Helm here.
To add a new application to AIOpsLab with Helm, you need to:
-
Add application metadata
-
Application metadata is a JSON object that describes the application.
-
Include any field such as the app's name, desc, namespace, etc.
-
We recommend also including a special
Helm Config
field, as follows:"Helm Config": { "release_name": "<name for the Helm release to deploy>", "chart_path": "<path to the Helm chart of the app>", "namespace": "<K8S namespace where app should be deployed>" }
Note: The
Helm Config
is used by the orchestrator to auto-deploy your app when a problem associated with it is started.Note: The orchestrator will auto-provide all other fields as context to the agent for any problem associated with this app.
Create a JSON file with this metadata and save it in the
metadata
directory. For example thesocial-network
app: social-network.json -
-
Add application class
Extend the base class in a new Python file in the
apps
directory:from aiopslab.service.apps.base import Application class MyApp(Application): def __init__(self): super().__init__("<path to app metadata JSON>")
The
Application
class provides a base implementation for the application. You can override methods as needed and add new ones to suit your application's requirements, but the base class should suffice for most applications.
Similar to applications, AIOpsLab provides a default list of problems to evaluate agents. However, as a developer you can add new problems to AIOpsLab and design them around your applications.
Each problem in AIOpsLab has 5 components:
- Application: The application on which the problem is based.
- Task: The AIOps task that the agent needs to perform. Currently we support: Detection, Localization, Analysis, and Mitigation.
- Fault: The fault being introduced in the application.
- Workload: The workload that is generated for the application.
- Evaluator: The evaluator that checks the agent's performance.
To add a new problem to AIOpsLab, create a new Python file
in the problems
directory, as follows:
-
Setup. Import your chosen application (say
MyApp
) and task (sayLocalizationTask
):from aiopslab.service.apps.myapp import MyApp from aiopslab.orchestrator.tasks.localization import LocalizationTask
-
Define. To define a problem, create a class that inherits from your chosen
Task
, and defines 3 methods:start_workload
,inject_fault
, andeval
:class MyProblem(LocalizationTask): def __init__(self): self.app = MyApp() def start_workload(self): # <your workload logic here> def inject_fault(self) # <your fault injection logic here> def eval(self, soln, trace, duration): # <your evaluation logic here>
-
Register. Finally, add your problem to the orchestrator's registry here.
See a full example of a problem here.
Click to show the description of the problem in detail
-
start_workload
: Initiates the application's workload. Use your own generator or AIOpsLab's default, which is based on wrk2:from aiopslab.generator.workload.wrk import Wrk wrk = Wrk(rate=100, duration=10) wrk.start_workload(payload="<wrk payload script>", url="<app URL>")
Relevant Code: aiopslab/generators/workload/wrk.py
-
inject_fault
: Introduces a fault into the application. Use your own injector or AIOpsLab's built-in one which you can also extend. E.g., a misconfig in the K8S layer:from aiopslab.generators.fault.inject_virtual import * inj = VirtualizationFaultInjector(testbed="<namespace>") inj.inject_fault(microservices=["<service-name>"], fault_type="misconfig")
Relevant Code: aiopslab/generators/fault
-
eval
: Evaluates the agent's solution using 3 params: (1) soln: agent's submitted solution if any, (2) trace: agent's action trace, and (3) duration: time taken by the agent.Here, you can use built-in default evaluators for each task and/or add custom evaluations. The results are stored in
self.results
:def eval(self, soln, trace, duration) -> dict: super().eval(soln, trace, duration) # default evaluation self.add_result("myMetric", my_metric(...)) # add custom metric return self.results
Note: When an agent starts a problem, the orchestrator creates a
Session
object that stores the agent's interaction. Thetrace
parameter is this session's recorded trace.Relevant Code: aiopslab/orchestrator/evaluators/
aiopslab
Generators
generators - the problem generators for aiopslab ├── fault - the fault generator organized by fault injection level │ ├── base.py │ ├── inject_app.py │ ... │ └── inject_virtual.py └── workload - the workload generator organized by workload type └── wrk.py - wrk tool interface
Orchestrator
orchestrator ├── orchestrator.py - the main orchestration engine ├── parser.py - parser for agent responses ├── evaluators - eval metrics in the system │ ├── prompts.py - prompts for LLM-as-a-Judge │ ├── qualitative.py - qualitative metrics │ └── quantitative.py - quantitative metrics ├── problems - problem definitions in aiopslab │ ├── k8s_target_port_misconfig - e.g., A K8S TargetPort misconfig problem │ ... │ └── registry.py ├── actions - actions that agents can perform organized by AIOps task type │ ├── base.py │ ├── detection.py │ ├── localization.py │ ├── analysis.py │ └── mitigation.py └── tasks - individual AIOps task definition that agents need to solve ├── base.py ├── detection.py ├── localization.py ├── analysis.py └── mitigation.py
Service
service ├── apps - interfaces/impl. of each app ├── helm.py - helm interface to interact with the cluster ├── kubectl.py - kubectl interface to interact with the cluster ├── shell.py - shell interface to interact with the cluster ├── metadata - metadata and configs for each apps └── telemetry - observability tools besides observer, e.g., in-memory log telemetry for the agent
Observer
observer ├── filebeat - Filebeat installation ├── logstash - Logstash installation ├── prometheus - Prometheus installation ├── log_api.py - API to store the log data on disk ├── metric_api.py - API to store the metrics data on disk └── trace_api.py - API to store the traces data on disk
Utils
├── config.yml - aiopslab configs ├── config.py - config parser ├── paths.py - paths and constants ├── session.py - aiopslab session manager └── utils ├── actions.py - helpers for actions that agents can perform ├── cache.py - cache manager └── status.py - aiopslab status, error, and warnings
cli.py
: A command line interface to interact with AIOpsLab, e.g., used by human operators.@inproceedings{shetty2024building,
title = {Building AI Agents for Autonomous Clouds: Challenges and Design Principles},
author = {Shetty, Manish and Chen, Yinfang and Somashekar, Gagan and Ma, Minghua and Simmhan, Yogesh and Zhang, Xuchao and Mace, Jonathan and Vandevoorde, Dax and Las-Casas, Pedro and Gupta, Shachee Mishra and Nath, Suman and Bansal, Chetan and Rajmohan, Saravan},
year = {2024},
booktitle = {Proceedings of 15th ACM Symposium on Cloud Computing},
}
@misc{chen2024aiopslab,
title = {AIOpsLab: A Holistic Framework to Evaluate AI Agents for Enabling Autonomous Clouds},
author = {Chen, Yinfang and Shetty, Manish and Somashekar, Gagan and Ma, Minghua and Simmhan, Yogesh and Mace, Jonathan and Bansal, Chetan and Wang, Rujia and Rajmohan, Saravan},
year = {2024},
url = {https://www.microsoft.com/en-us/research/publication/aiopslab-a-holistic-framework-for-evaluating-ai-agents-for-enabling-autonomous-cloud/}
}
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.
Copyright (c) Microsoft Corporation. All rights reserved.
Licensed under the MIT license.
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft’s Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third-party’s policies.