Feat Sebulba recurrent IQL #1148
base: develop
Conversation
I've looked through everything except the system file and it looks good, Sebulba utils especially! Just some relatively minor style changes
@@ -11,13 +11,13 @@ add_agent_id: True
min_buffer_size: 32
update_batch_size: 1 # Number of vectorised gradient updates per device.

-rollout_length: 2 # Number of environment steps per vectorised environment.
+rollout_length: 2 # Number of environment steps per vectorised enviro²nment.
Not sure how the ^2 got in there 😂
-rollout_length: 2 # Number of environment steps per vectorised enviro²nment.
+rollout_length: 2 # Number of environment steps per vectorised environment.
@@ -21,22 +21,22 @@
from jumanji.env import State
from typing_extensions import NamedTuple, TypeAlias

-from mava.types import Observation
+from mava.types import MavaObservation, Observation
Where do we still use Observation? Can we use MavaObservation everywhere?
# PPO specifique check
if "num_minibatches" in config.system:
    assert num_eval_samples % config.system.num_minibatches == 0, (
        f"Number of training samples per evaluator ({num_eval_samples})"
        + f"must be divisible by num_minibatches ({config.system.num_minibatches})."
    )
A thought on this: maybe we can split these up into multiple methods, e.g. check_num_updates, check_num_envs, etc. Then have a check_sebulba_config_ppo, check_anakin_config_ppo and a check_sebulba_config_iql which each call the relevant methods? Something like the sketch below.
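A minimal sketch of that split, assuming an OmegaConf config; the helper names follow the comment, and the config fields used inside each check (num_updates, num_evaluation, num_minibatches) are illustrative rather than Mava's actual schema:

# Illustrative only: field names are assumptions, not Mava's real config schema.
from omegaconf import DictConfig


def check_num_updates(config: DictConfig) -> None:
    assert config.system.num_updates % config.arch.num_evaluation == 0, (
        "num_updates must be divisible by num_evaluation."
    )


def check_num_minibatches(config: DictConfig, num_eval_samples: int) -> None:
    assert num_eval_samples % config.system.num_minibatches == 0, (
        f"Number of training samples per evaluator ({num_eval_samples}) "
        f"must be divisible by num_minibatches ({config.system.num_minibatches})."
    )


def check_sebulba_config_ppo(config: DictConfig, num_eval_samples: int) -> None:
    check_num_updates(config)
    check_num_minibatches(config, num_eval_samples)


def check_sebulba_config_iql(config: DictConfig) -> None:
    check_num_updates(config)  # IQL needs no minibatch divisibility check.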
# todo: remove the ppo dependencies when we make sebulba for other systems |
This is a good point though, maybe there's something we can do about it 🤔 Maybe a protocol that has action, obs and reward? Not sure if there are any other common attributes.
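For illustration, a minimal sketch of such a structural type; the class name TransitionLike is a placeholder and the attribute set just mirrors the comment:

from typing import Protocol

from chex import Array


class TransitionLike(Protocol):
    """Any transition exposing these fields could be accepted by the shared Sebulba utils."""

    action: Array
    obs: Array
    reward: Array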
# from https://github.com/EdanToledo/Stoix/blob/feat/sebulba-dqn/stoix/utils/rate_limiters.py
class RateLimiter:
This file is getting quite long, can we structure it like this:
-utils/
--sebulba/
---utils.py
---rate_limiters.py
---pipelines.py
self.rate_limiter.sample()

if not self._queue.empty():
Can we rename this to metrics_queue? It wasn't clear what this was storing.
self.inserts = 0.0
self.samples = 0
self.deletes = 0
Wonder if we need deletes here?
def __init__(
    self, samples_per_insert: float, min_size_to_sample: int, min_diff: float, max_diff: float
):
Can we please add a good docstring here 🙏
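Something along these lines could work, assuming the limiter tracks the difference samples_per_insert * inserts - samples the way the Reverb/Stoix limiters it is ported from do (wording is only a suggestion):

def __init__(
    self, samples_per_insert: float, min_size_to_sample: int, min_diff: float, max_diff: float
):
    """Blocks inserts/samples to keep samples close to samples_per_insert * inserts.

    Args:
        samples_per_insert: target average number of times each item is sampled
            for every insert into the buffer.
        min_size_to_sample: minimum number of inserts before sampling is allowed.
        min_diff: lower bound on samples_per_insert * inserts - samples; sampling
            blocks while the difference is below this value.
        max_diff: upper bound on the same difference; inserting blocks while the
            difference is above it.
    """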
Raises:
    ValueError: If error_buffer is smaller than max(1.0, samples_per_inserts).
"""
if isinstance(error_buffer, float) or isinstance(error_buffer, int):
nit
-if isinstance(error_buffer, float) or isinstance(error_buffer, int):
+if isinstance(error_buffer, (int, float)):
terminated = np.repeat(
    terminated[..., np.newaxis], repeats=self.num_agents, axis=-1
)  # (B,) --> (B, N)
Does this already happen for smax and lbf?
Great work here! Really minor changes required. Happy to merge this pending some benchmarks
action = eps_greedy_dist.sample(seed=key)
action = action[0, ...]  # (1, B, A) -> (B, A)
A bit safer, as this will error if the action's 0th dim is ever larger than 1.
-action = eps_greedy_dist.sample(seed=key)
-action = action[0, ...]  # (1, B, A) -> (B, A)
+action = eps_greedy_dist.sample(seed=key).squeeze(0)  # (B, A)
next_timestep = env.step(cpu_action)

# Prepare the transation
terminal = (1 - timestep.discount[..., 0, jnp.newaxis]).astype(bool)
Are you sure we want to remove the agent dim here?
    target: Array,
) -> Tuple[Array, Metrics]:
    # axes switched here to scan over time
    hidden_state, obs_term_or_trunc = prep_inputs_to_scannedrnn(obs, term_or_trunc)
A general comment: I think this would be a lot easier to read if we used done to mean term_or_trunc, which I think is a reasonable thing. We would have to make the change in Anakin as well though :/
timing_dict = tree.map(lambda *x: np.mean(x), *rollout_times) | learn_times
timing_dict = tree.map(np.mean, timing_dict, is_leaf=lambda x: isinstance(x, list))
Two things: can we call this time_metrics, and can you add a shape-explainer comment? It's a bit hard to work out what is happening here.
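A minimal, self-contained sketch of what that could look like; the keys, list lengths and the jax.tree import are illustrative assumptions:

import numpy as np
from jax import tree

# One timing dict per actor rollout in this eval period: List[Dict[str, float]]
rollout_times = [{"single_rollout_time": 0.51}, {"single_rollout_time": 0.49}]
# Learner timings collected over the eval period: Dict[str, List[float]]
learn_times = {"learner_time_per_eval": [0.20, 0.22, 0.21]}

# Mean over rollouts -> Dict[str, float], then merge in the (still list-valued) learner dict.
time_metrics = tree.map(lambda *x: np.mean(x), *rollout_times) | learn_times
# Reduce the remaining list-valued leaves (the learner timings) to scalars.
time_metrics = tree.map(np.mean, time_metrics, is_leaf=lambda x: isinstance(x, list))
print(time_metrics)  # {'single_rollout_time': 0.5, 'learner_time_per_eval': ~0.21}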
""" | ||
|
||
eps = jnp.maximum( | ||
config.system.eps_min, 1 - (t / config.system.eps_decay) * (1 - config.system.eps_min) |
Would be nice if we could set a different decay per actor, although I think that's out of scope for this PR. If you could open an issue to add some of the Ape-X DQN features, that would be great.
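For reference, a hedged sketch of the per-actor exploration schedule from Ape-X DQN (Horgan et al., 2018); the function name and defaults are illustrative, not part of this PR:

import numpy as np


def apex_epsilons(num_actors: int, base_eps: float = 0.4, alpha: float = 7.0) -> np.ndarray:
    """epsilon_i = base_eps ** (1 + i / (N - 1) * alpha) for actor i in 0..N-1."""
    i = np.arange(num_actors)
    return base_eps ** (1 + i / max(num_actors - 1, 1) * alpha)


print(apex_epsilons(4))  # actor 0 explores the most, actor N-1 is nearly greedy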
]:
    """Initialise learner_fn, network and learner state."""

    # create temporory envoirnments.
nit
-# create temporory envoirnments.
+# create temporary environments.
@@ -31,3 +31,7 @@ gamma: 0.99 # discount factor

eps_min: 0.05
eps_decay: 1e5

# --- Sebulba parameters ---
data_sample_mean: 150 # Average number of times the learner should sample each item from the replay buffer.
Can we rather call this mean_data_sample_rate? It wasn't clear to me what it was when I read it in the system file.
config.sample_per_insert = config.system.data_sample_mean * insert_to_sample_ratio
config.tolerance = config.sample_per_insert * config.system.error_tolerance

min_num_inserts = max(
    config.system.sample_sequence_length // config.system.rollout_length,
    config.system.min_buffer_size // config.system.rollout_length,
    1,
)

rate_limiter = SampleToInsertRatio(config.sample_per_insert, min_num_inserts, config.tolerance)
We should probably put them in the system config so it's easier to find on things like neptune
-config.sample_per_insert = config.system.data_sample_mean * insert_to_sample_ratio
-config.tolerance = config.sample_per_insert * config.system.error_tolerance
-min_num_inserts = max(
-    config.system.sample_sequence_length // config.system.rollout_length,
-    config.system.min_buffer_size // config.system.rollout_length,
-    1,
-)
-rate_limiter = SampleToInsertRatio(config.sample_per_insert, min_num_inserts, config.tolerance)
+config.system.sample_per_insert = config.system.data_sample_mean * insert_to_sample_ratio
+config.system.tolerance = config.system.sample_per_insert * config.system.error_tolerance
+min_num_inserts = max(
+    config.system.sample_sequence_length // config.system.rollout_length,
+    config.system.min_buffer_size // config.system.rollout_length,
+    1,
+)
+rate_limiter = SampleToInsertRatio(config.system.sample_per_insert, min_num_inserts, config.system.tolerance)
train_metrics["learner_step"] = (eval_step + 1) * config.system.num_updates_per_eval
train_metrics["learner_steps_per_second"] = (
    config.system.num_updates_per_eval
) / time_metrics["learner_time_per_eval"]
logger.log(train_metrics, t, eval_step, LogEvent.TRAIN)
I think we should always log train metrics even if an episode hasn't finished yet, what do you think?
    episode_return=episode_return,
)

if config.arch.absolute_metric and max_episode_return <= episode_return:
I never know what order the bools will be evaluated in so I always add brackets, because it might be doing (config.arch.absolute_metric and max_episode_return) <= episode_return
-if config.arch.absolute_metric and max_episode_return <= episode_return:
+if config.arch.absolute_metric and (max_episode_return <= episode_return):
What?
A recurrent IQL implementation using the Sebulba architecture.
Why?
An off-policy Sebulba base and non-JAX envs in Mava.
How?
Mixed the Sebulba structure from PPO with the learner code from Anakin IQL.
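To make the "How?" concrete, here is a minimal, self-contained sketch of the actor/learner split described above: actor threads push rollouts into a pipeline while a learner thread consumes them. All names (Pipeline, actor_thread, learner_thread) and the toy data are illustrative assumptions, not Mava's actual API; the real system adds a replay buffer, a rate limiter and the recurrent Q-network.

import queue
import threading

import numpy as np


class Pipeline:
    """Bounded queue moving rollouts from actor threads to the learner thread."""

    def __init__(self, max_size: int = 8):
        self._rollout_queue: queue.Queue = queue.Queue(maxsize=max_size)

    def put(self, rollout: np.ndarray) -> None:
        self._rollout_queue.put(rollout)

    def get(self) -> np.ndarray:
        return self._rollout_queue.get()


def actor_thread(pipeline: Pipeline, rollout_length: int, num_rollouts: int) -> None:
    # Stand-in for vectorised environment stepping; the real actor runs the
    # recurrent Q-network and picks epsilon-greedy actions.
    for _ in range(num_rollouts):
        rollout = np.random.rand(rollout_length, 4)  # (rollout_length, obs_dim)
        pipeline.put(rollout)


def learner_thread(pipeline: Pipeline, num_updates: int) -> None:
    # Stand-in for the Anakin IQL update; the real learner samples sequences
    # from a replay buffer gated by the rate limiter.
    for step in range(num_updates):
        batch = pipeline.get()
        loss = float(np.mean(batch**2))  # placeholder for the TD loss
        print(f"update {step}: loss={loss:.3f}")


if __name__ == "__main__":
    pipe = Pipeline()
    actor = threading.Thread(target=actor_thread, args=(pipe, 2, 5))
    learner = threading.Thread(target=learner_thread, args=(pipe, 5))
    actor.start()
    learner.start()
    actor.join()
    learner.join()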