Multi Agent Reinforcement Learning examples
copandrej committed Nov 13, 2024
1 parent 4f15eb3 commit 342e98a
Showing 9 changed files with 849 additions and 8 deletions.
51 changes: 51 additions & 0 deletions MADRL_examples/README.md
@@ -0,0 +1,51 @@
# Examples of Multi-Agent Deep Reinforcement Learning (MADRL), enabled by NAOMI

## Important

- The multi-agent server/client example is not supported by NAOMI 0.1.0. Please use the source code to install the Helm charts after modifying the values file:
```bash
cd helm_charts/NAOMI
helm install naomi . -f values_madrl_example.yaml
```
- Use `values_madrl_example.yaml` to install the Helm charts with the values for the MADRL examples.
- Install `requirements.txt` (same as NAOMI, plus the `gymnasium` and `highway-env` packages).


## Examples
These examples are in development and may require some tweaking of the configs to run.

Change the `VM_IP` variable in the examples to the IP of the VM where NAOMI is deployed.

### Multi-agent highway environment with RLlib. The env is stepped by RLlib.

- `multi_agent_highway.py`
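
The multi-agent variant builds on highway-env's own multi-agent mode: several controlled vehicles, each with its own observation and action, which RLlib then trains as a multi-agent environment. Below is a minimal, hedged sketch of that environment side only; it is not the contents of `multi_agent_highway.py`, and the config values (two vehicles, Kinematics observations, DiscreteMetaAction) are illustrative assumptions.

```python
# Minimal sketch (not taken from multi_agent_highway.py) of highway-env's
# multi-agent mode: two controlled vehicles, each with its own obs and action.
import gymnasium as gym
import highway_env  # noqa: F401  (needed so the highway-* environments are registered)

env = gym.make("highway-v0")
env.unwrapped.configure({
    "controlled_vehicles": 2,  # assumption: number of agents
    "observation": {
        "type": "MultiAgentObservation",
        "observation_config": {"type": "Kinematics"},
    },
    "action": {
        "type": "MultiAgentAction",
        "action_config": {"type": "DiscreteMetaAction"},
    },
})
obs, info = env.reset()  # obs is a tuple with one entry per controlled vehicle

terminated = truncated = False
while not (terminated or truncated):
    actions = env.action_space.sample()  # tuple of random actions, one per vehicle
    obs, reward, terminated, truncated, info = env.step(actions)
```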

### External highway environment with RLlib, client-server configuration. The environment requests actions from the RLlib server.

- `highway_client.py`
- `highway_server.py`

The multi-agent highway environment is not yet supported with the external (client-server) RLlib setup. A rough sketch of the server side is shown below.
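
For reference, the server side of this configuration follows Ray RLlib's external-env pattern: the algorithm is configured without an internal environment and receives experiences through a `PolicyServerInput`. The sketch below is an assumption-laden outline in the spirit of Ray's `cartpole_server.py`, not the actual `highway_server.py`; copy the observation/action spaces printed by `highway_client.py`, and make sure the port matches the NodePort the client connects to (30070 by default).

```python
# Hedged sketch of a policy server (not the repo's highway_server.py).
# The spaces below are placeholders: copy the values printed by highway_client.py.
import numpy as np
from gymnasium import spaces
from ray.rllib.algorithms.ppo import PPOConfig
from ray.rllib.env.policy_server_input import PolicyServerInput

obs_space = spaces.Box(-np.inf, np.inf, (5, 5), np.float32)  # as printed by the client
act_space = spaces.Discrete(5)                               # as printed by the client

config = (
    PPOConfig()
    # No env on the server side; observations arrive from the external client.
    .environment(env=None, observation_space=obs_space, action_space=act_space)
    # Read experiences from the HTTP policy server instead of sampling a local env.
    .offline_data(input_=lambda ioctx: PolicyServerInput(ioctx, "0.0.0.0", 30070))
    .rollouts(num_rollout_workers=0)
)

algo = config.build()
while True:
    print(algo.train())
```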


## Infrastructure configuration
Examples can be run on a single machine using multiple Ray workers, or on multiple machines connected with k8s.
In our configuration we use one VM and two Raspberry Pi 5s, with a Ray worker deployed on each RPi and on the VM.
The reinforcement learning examples are configured to run the environment on the RPis and the algorithm training on the VM.
This can be modified in the examples depending on your infrastructure (see the placement sketch below).
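
How work ends up on the RPis versus the VM depends on how the Ray workers are started and how tasks request resources. One common way to express such placement, shown purely as an illustration (not necessarily how NAOMI or these examples do it), is Ray custom resources: start the RPi workers with a custom resource and request it from the code that should run there. The resource name `rpi` below is an assumption.

```python
# Illustration only: pin a task to nodes that advertise a custom "rpi" resource.
# Assumes the Raspberry Pi Ray workers were started with something like:
#   ray start --address=<head_ip>:6379 --resources='{"rpi": 1}'
import ray
import socket

ray.init(address="auto")  # connect to the running Ray cluster

@ray.remote(resources={"rpi": 0.5})
def run_environment():
    # The environment-stepping loop (e.g. the PolicyClient loop) would go here;
    # Ray schedules this task onto a node that provides the "rpi" resource.
    return socket.gethostname()

print("Environment task ran on:", ray.get(run_environment.remote()))
```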

This corresponds to a use case where agents run on edge devices or on-board units while training happens on a central server or roadside unit: for example, autonomous driving, where the agents are vehicles and training is done on roadside units. Once the algorithm has been trained on new observations, the updated reinforcement learning policies are sent back to the agents.

## Results of RL training using TensorBoard

This works if the RL examples are run with RLlib and Ray Tune; the training results are then read from the `s3://raybuck/rllib/` bucket:

```bash
export AWS_ACCESS_KEY_ID=minio
export AWS_SECRET_ACCESS_KEY=miniostorage
export S3_ENDPOINT=http://<CHANGE_ME_TO_NAOMI_IP>:30085
export S3_VERIFY_SSL=0
export S3_USE_HTTPS=0
tensorboard --logdir=s3://raybuck/rllib/ --host=0.0.0.0
```
141 changes: 141 additions & 0 deletions MADRL_examples/highway_client.py
@@ -0,0 +1,141 @@
#!/usr/bin/env python

"""
Adapted from: https://github.com/ray-project/ray/blob/master/rllib/examples/envs/external_envs/cartpole_client.py
Example of running an external simulator (a Highway env
in this case) against an RLlib policy server listening on one or more
HTTP-speaking port(s). See `highway_server.py` in this same directory for
how to start this server.

This script will only create a single env altogether, to illustrate
that RLlib can run without needing an internalized environment.

Setup:
1) Start the policy server:
    See `highway_server.py` on how to do this.
2) Run this client:
    $ python highway_client.py --inference-mode=local|remote --[other options]
    Use --help for help.
    Note: "local" should be used by default; "remote" did not work for me.

In "local" inference-mode, the action computations are performed
inside the PolicyClient used in this script w/o sending an HTTP request
to the server. This reduces network communication overhead, but requires
the PolicyClient to create its own RolloutWorker (+Policy) based on
the server's config. The PolicyClient will retrieve this config automatically.
You do not need to define the RLlib config dict here!
In "remote" inference mode, the PolicyClient will send action requests to the
server and not compute its own actions locally. The server then performs the
inference forward pass and returns the action to the client.
In either case, the user of PolicyClient must:
- Declare new episodes and finished episodes to the PolicyClient.
- Log rewards to the PolicyClient.
- Call `get_action` to receive an action from the PolicyClient (whether it'd be
computed locally or remotely).
- Besides `get_action`, the user may let the PolicyClient know about
off-policy actions having been taken via `log_action`. This can be used in
combination with `get_action`, but will only work if the connected server
runs an off-policy RL algorithm (such as DQN, SAC, or DDPG).
"""

import argparse

import gymnasium as gym
import highway_env  # noqa: F401  (needed so the highway-* environments are registered)
from ray.rllib.env.policy_client import PolicyClient

VM_IP = "<CHANGE_ME>"  # IP of the VM where NAOMI is deployed

parser = argparse.ArgumentParser()
parser.add_argument(
    "--no-train", action="store_true", help="Whether to disable training."
)
parser.add_argument(
    "--inference-mode", type=str, default="local", choices=["local", "remote"]
)
parser.add_argument(
    "--off-policy",
    action="store_true",
    help="Whether to compute random actions instead of on-policy "
    "(Policy-computed) ones.",
)
parser.add_argument(
    "--stop-reward",
    type=float,
    default=9999,
    help="Stop once the specified reward is reached.",
)
parser.add_argument(
    "--port",
    type=int,
    default=30070,
    help="The policy server port to use (NodePort on the NAOMI cluster).",
)

if __name__ == "__main__":
    args = parser.parse_args()

    # The following line is the only place where an actual env will
    # be created in this entire example (including the server side!).
    # This is to demonstrate that RLlib does not require you to create
    # unnecessary env objects within the PolicyClient/Server objects, but
    # that only the following env and the loop below run the entire
    # training process.
    env = gym.make('highway-fast-v0')

    # Get and print the observation and action spaces.
    # You can copy this info into highway_server.py to define its config.
    obs_space = env.observation_space
    act_space = env.action_space
    print("Observation space:", obs_space)
    print("Action space:", act_space)

    # In our case, by using a NodePort all workers listen on the same port and
    # load balancing is handled by Kubernetes.
    # Note that this is different from the original example provided by Ray RLlib.
    client = PolicyClient(
        f"http://{VM_IP}:{args.port}", inference_mode=args.inference_mode
    )

    # In the following, we will use our external environment (the Highway env
    # we created above) in connection with the PolicyClient to query
    # actions (from the server if "remote"; if "local" we'll compute them
    # on this client side), and send back observations and rewards.

    # Start a new episode.
    obs, info = env.reset()
    print(obs)
    eid = client.start_episode(training_enabled=not args.no_train)

    rewards = 0.0
    while True:
        # Compute an action randomly (off-policy) and log it.
        if args.off_policy:
            action = env.action_space.sample()
            client.log_action(eid, obs, action)
        # Compute an action locally or remotely (on the server).
        # No need to log it here, as get_action already records it.
        else:
            action = client.get_action(eid, obs)

        # Perform a step in the external simulator (env).
        obs, reward, terminated, truncated, info = env.step(action)
        rewards += reward

        # Log next-obs, rewards, and infos.
        client.log_returns(eid, reward, info=info)
        env.render()  # only renders if the env was created with a render_mode
        # Reset the episode if done.
        if terminated or truncated:
            print("Total reward:", rewards)
            if rewards >= args.stop_reward:
                print("Target reward achieved, exiting")
                exit(0)

            rewards = 0.0

            # End the old episode.
            client.end_episode(eid, obs)

            # Start a new episode.
            obs, info = env.reset()
            eid = client.start_episode(training_enabled=not args.no_train)
