Evaluating the reward function for arbitrary states and actions #326

jamesheald · 2023-03-31T01:04:55Z


I'm implementing model predictive control and I want to be able to evaluate the reward associated with a rollout performed under a learned model. Is there a generic way (that works for any environment) to evaluate the reward associated with a state-action(-next state, if applicable) tuple?

btaba · 2023-03-31T02:41:20Z

btaba
Mar 31, 2023
Maintainer

If you're using the Env API, the reward would be stored in the State:

brax/brax/envs/env.py

Line 36 in b373f5a

reward: jp.ndarray

and if you want Transitions, you could use the actor_step function:

brax/brax/training/acting.py

Lines 31 to 51 in b373f5a

    
    def actor_step(
        env: envs.Env,
        env_state: envs.State,
        policy: Policy,
        key: PRNGKey,
        extra_fields: Sequence[str] = ()
    ) -> Tuple[envs.State, Transition]:
      """Collect data."""
      actions, policy_extras = policy(env_state.obs, key)
      nstate = env.step(env_state, actions)
      state_extras = {x: nstate.info[x] for x in extra_fields}
      return nstate, Transition(  # pytype: disable=wrong-arg-types  # jax-ndarray
          observation=env_state.obs,
          action=actions,
          reward=nstate.reward,
          discount=1 - nstate.done,
          next_observation=nstate.obs,
          extras={
              'policy_extras': policy_extras,
              'state_extras': state_extras
          })

7 replies

jamesheald Apr 3, 2023
Author

Thank you for your response. In a model-based RL context, it would be useful to be able to evaluate the reward function without having to simulate a state transition via env.step (or write the reward function manually). I would like to test the model I'm developing on many different environments, and manually writing reward functions for all of them is tedious and potentially error-prone. Maybe this feature could be added?

btaba Apr 3, 2023
Maintainer

Hi @jamesheald, I'm not really understanding what the requested feature would be. Are you trying to programmatically generate rewards for arbitrary environments? It sounds like you also don't want $r(s, s')$ but $r(s)$ (reward not dependent on the state transition)?

jamesheald Apr 3, 2023
Author

I would ideally like to be able to evaluate the (standard, pre-specified) reward function of any brax environment for any s, a and s' without having to (re)write the reward function myself. My understanding is that currently the only way to get a reward from brax is to call env.step (with arguments s and a), which performs a state transition in the environment (returning s' and r).

btaba Apr 3, 2023
Maintainer

Ok I think I understand! The most generic reward function is $r(s, a, s')$, and to get $s'$ one needs to compute the transition $f(s, a) = s'$, which is what env.step does. This is more or less why the reward function $r$ is rolled up in env.step, as it is in other Env APIs such as gymnasium.
For model-based RL, I could see why you'd want a separate $r$. A short-term alternative is to override the pipeline_step function

brax/brax/envs/env.py

Line 116 in b373f5a

def pipeline_step(

to supply the $s'$ you'd want the reward to be calculated for. Another alternative is to learn the reward along with the transition, as is done in many model-based RL algorithms.
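To make the override idea concrete, here is a minimal sketch on a toy environment. Everything in it is an assumption made for illustration: `ToyState`, `ToyEnv`, `InjectedNextStateEnv` and `reward_for` are hypothetical names, not brax classes, and the toy `pipeline_step` signature is not brax's. It only shows the pattern: the subclass replaces the physics step with a caller-supplied $s'$, so the unchanged reward code in `step` is evaluated for that $s'$.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyState:
    """Toy stand-in for an environment state (hypothetical, not brax's State)."""
    x: float             # position of a point mass
    reward: float = 0.0  # reward produced by the step that led to this state


class ToyEnv:
    """Hypothetical 1-D environment whose step() bundles dynamics and reward."""
    dt = 0.1

    def pipeline_step(self, x: float, action: float) -> float:
        # "Physics": move the point by action * dt.
        return x + self.dt * action

    def step(self, state: ToyState, action: float) -> ToyState:
        x_next = self.pipeline_step(state.x, action)
        # r(s, a, s'): forward velocity minus a quadratic control cost.
        reward = (x_next - state.x) / self.dt - 0.5 * action ** 2
        return ToyState(x=x_next, reward=reward)


class InjectedNextStateEnv(ToyEnv):
    """Reuses ToyEnv's reward code, but s' is supplied by the caller."""

    def __init__(self, x_next: float):
        self._x_next = x_next

    def pipeline_step(self, x: float, action: float) -> float:
        # Ignore the physics and return the injected next state instead.
        return self._x_next


def reward_for(x: float, action: float, x_next: float) -> float:
    """Evaluate the toy reward for an arbitrary (s, a, s') tuple."""
    return InjectedNextStateEnv(x_next).step(ToyState(x=x), action).reward
```

With this pattern, `reward_for(x, a, x_next)` can be called with an `x_next` predicted by a learned model. In a real brax environment the pipeline state is a richer structure, so the injected value would have to match whatever that environment's `step` expects.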

jamesheald Apr 4, 2023
Author

Ok thanks for the response. In general, I think it would be nice to have a separate reward function that the user can call directly (and that env.step calls) for model-based scenarios where you only want to learn the transition dynamics but not the reward function.
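A rough sketch of what that separation could look like, again on a toy problem rather than brax: `reward_fn`, `env_step`, `learned_model`, `rollout_return` and `best_plan` are all hypothetical names, and the dynamics and reward are invented for illustration. The point is only that `env_step` calls the same `reward_fn` that a planner can call directly on states predicted by a learned model.

```python
from typing import Callable, List, Sequence


def reward_fn(x: float, action: float, x_next: float, dt: float = 0.1) -> float:
    """Toy r(s, a, s'): forward velocity minus a quadratic control cost."""
    return (x_next - x) / dt - 0.5 * action ** 2


def env_step(x: float, action: float, dt: float = 0.1):
    """True toy dynamics; the reward comes from the shared reward_fn."""
    x_next = x + dt * action
    return x_next, reward_fn(x, action, x_next, dt)


def learned_model(x: float, action: float) -> float:
    """Stand-in for learned dynamics (deliberately slightly inaccurate)."""
    return x + 0.09 * action


def rollout_return(
    x0: float,
    actions: Sequence[float],
    model: Callable[[float, float], float],
    reward: Callable[[float, float, float], float],
) -> float:
    """Sum of rewards along a rollout simulated with `model`, not the env."""
    total, x = 0.0, x0
    for a in actions:
        x_next = model(x, a)
        total += reward(x, a, x_next)
        x = x_next
    return total


def best_plan(
    x0: float,
    candidates: Sequence[List[float]],
    model: Callable[[float, float], float],
    reward: Callable[[float, float, float], float],
) -> List[float]:
    """Random-shooting-style MPC step: pick the highest-return action sequence."""
    return max(candidates, key=lambda plan: rollout_return(x0, plan, model, reward))
```

For example, `best_plan(0.0, [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]], learned_model, reward_fn)` scores each candidate under the learned model without ever calling `env_step`.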
