
Fine-Tuning in Relative Action Space #14

Open
lakomchik opened this issue Nov 6, 2024 · 9 comments

@lakomchik

I would like to fine-tune RDT in a relative action space and have a question regarding the best method for mapping actions and proprioception.

Question: For fine-tuning a model in relative action space, would it be preferable to:

  • Map relative joint positions directly into a unified action space, as per existing guidelines?
  • Normalize values (e.g., scaling from -1 to 1) and then project these normalized values into the unified action space?

Using relative actions results in a smaller range for proprioception and action values. I’m curious if normalizing these values could help them better align with the model’s expected action space.
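To make the second option concrete, I mean something like per-dimension min-max scaling into [-1, 1] (a sketch; the min/max statistics would come from the fine-tuning dataset):

```python
import numpy as np

# Sketch of option 2: per-dimension min-max scaling to [-1, 1].
# x_min/x_max would be computed over the fine-tuning dataset.
def normalize_to_unit_range(x, x_min, x_max, eps=1e-8):
    x = np.asarray(x, dtype=np.float32)
    return 2.0 * (x - x_min) / (x_max - x_min + eps) - 1.0

# e.g., delta joint positions with a small physical range
delta_q = np.array([0.01, -0.02, 0.005])
print(normalize_to_unit_range(delta_q, -0.05, 0.05))  # values in [-1, 1]
```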

@csuastt
Collaborator

csuastt commented Nov 8, 2024

Map them into the velocity slots of the unified action space (e.g., delta EEF positions should go into the EEF position velocity slots). You could apply normalization when fine-tuning.
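A minimal sketch of this mapping, assuming STATE_VEC_IDX_MAPPING from configs/state_vec.py and a 128-dimensional unified vector (the gripper_open key name is an assumption to double-check):

```python
import numpy as np

from configs.state_vec import STATE_VEC_IDX_MAPPING  # RDT repo

UNI_VEC_DIM = 128  # dimensionality of the unified action space

def delta_action_to_unified(delta_pos, delta_rpy, gripper_open):
    vec = np.zeros(UNI_VEC_DIM, dtype=np.float32)
    # Delta EEF positions go into the linear-velocity slots
    for axis, v in zip(("x", "y", "z"), delta_pos):
        vec[STATE_VEC_IDX_MAPPING[f"eef_vel_{axis}"]] = v
    # Delta RPY goes into the angular-velocity slots
    for axis, v in zip(("roll", "pitch", "yaw"), delta_rpy):
        vec[STATE_VEC_IDX_MAPPING[f"eef_angular_vel_{axis}"]] = v
    # Gripper slot; key name assumed, check STATE_VEC_IDX_MAPPING
    vec[STATE_VEC_IDX_MAPPING["gripper_open"]] = gripper_open
    return vec
```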

@lakomchik
Author

@csuastt Thank you for your answer!

@budzianowski

budzianowski commented Nov 14, 2024

@csuastt - all the preprocessing scripts use eef_delta_pos_x, for example:

"eef_delta_pos_x, eef_delta_pos_y, eef_delta_pos_z, eef_delta_angle_x, eef_delta_angle_y, eef_delta_angle_z, eef_delta_angle_w, gripper_open"

(and all the other preprocess_scripts for OXE as well). I can't find the place where these slots are mapped to eef_vel_x?

@alik-git

@csuastt Follow-up question: in the example above, the model is predicting eef_ang_x, y, z, w for a quaternion, but I don't see a way to map these directly to velocities, because in STATE_VEC_IDX_MAPPING the angular velocities are roll, pitch, yaw only, see here:

'eef_angular_vel_roll': 42,

Could you please clarify which indices you use and how exactly you map the quaternion eef_ang_x, y, z, w into STATE_VEC_IDX_MAPPING? That would be greatly appreciated. Thank you!

@csuastt
Collaborator

csuastt commented Nov 15, 2024

@alik-git @budzianowski Sorry, it is our mistake :( In the current implementation, we do not use any actions in the TFDataset; we use future states instead. To use actions, you may need to make some modifications:

  1. In this line, remove the function converting RPY to quat; it was a mistake and we forgot to delete it:

eef_ang = euler_to_quaternion(eef_ang)

The original action is already delta RPY, which is the angular velocity.

  2. You may need to modify the follow-up preprocessing script to make the producer generate the actions instead of the future states. See this readme:

https://github.com/thu-ml/RoboticsDiffusionTransformer/blob/main/docs/pretrain.md?plain=1#L242
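A rough sketch of the corrected step (after removing the quaternion conversion, the action keeps its original layout of 3 delta positions, 3 delta RPY angles, and the gripper; helper and variable names here are placeholders):

```python
import numpy as np

def process_action(raw_action):
    eef_pos = raw_action[:3]   # delta EEF position
    eef_ang = raw_action[3:6]  # delta RPY; already an angular velocity
    # eef_ang = euler_to_quaternion(eef_ang)  # <- mistaken line, removed
    gripper = raw_action[6:]   # gripper_open
    return np.concatenate([eef_pos, eef_ang, gripper]).astype(np.float32)
```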

@budzianowski

Thanks for the prompt reply, this is very helpful! One more question - does the model used in the demos from the paper also follow this logic, or was the fine-tuning performed with the modified logic?

@ethan-iai
Contributor

ethan-iai commented Nov 16, 2024

To clarify: in the demos mentioned in the paper, we predict the actions rather than the future states. It depends on your robot; on our robot (ALOHA), the future states and the actions are different. Please let me know if you’d like further details!

@budzianowski

@ethan-iai Thanks for the helpful explanation! If that's the case, I'm still puzzled by the agilex fine-tuning setup, where actions are used?

@alik-git

@ethan-iai @csuastt I just want to clarify: when you say "predict the actions", are you saying that the neural network directly outputs action deltas as the logits? Or are you saying that the model directly outputs future states, and you then manually compute the action deltas (future_state - current_state = action_deltas)?

The reason I ask is that during pretraining the model predicts future states as the logits (please correct me if that's wrong), so why not keep that consistent during fine-tuning as well?

Just for context: we are trying to evaluate RDT on controlling a WidowX robot arm. We are wondering whether, during our fine-tuning, it would be better to have the ground-truth labels be the future states and then compute the action deltas manually during deployment, or to fine-tune with action deltas directly as the ground-truth labels. My naive assumption was that fine-tuning with action deltas directly would be worse, since the model has to relearn more (due to differences in scale and representation, e.g., smaller ranges for action deltas compared to joint positions).
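For concreteness, by "compute the action deltas manually" I mean something like this sketch (names are placeholders):

```python
import numpy as np

def states_to_action_deltas(current_state, pred_future_states):
    # current_state: (D,), pred_future_states: (T, D)
    traj = np.concatenate([current_state[None], pred_future_states], axis=0)
    return np.diff(traj, axis=0)  # deltas[0] = pred[0] - current_state, etc.
```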

But if you fine-tuned directly on action deltas (and empirically found it to be better), then we should reconsider our approach of fine-tuning on future states. Sorry for the long question; I just wanted to be extra clear about what the confusion is. Thank you for your time in answering all these questions, we greatly appreciate it!
