
Issue with Mobile Aloha Inference in MuJoCo: Robot Wandering Without Performing Actions #24

Open
yongzhengqi opened this issue Nov 18, 2024 · 18 comments


@yongzhengqi

Hi folks,

First off, I want to say amazing work—I'm really impressed by this project!

I've been trying to perform inference on the Mobile Aloha robot in MuJoCo, but I'm encountering an issue: the robot seems to wander aimlessly and doesn't perform any meaningful actions. Do you have any suggestions for resolving this?

Here’s my setup:

  • I’ve wrapped MuJoCo in a ROS node to interact with agilex_inference.py (a rough sketch of such a wrapper follows this list).
  • The simulation setup is cloned from Agilex's GitHub repository.
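
For reference, such a wrapper looks roughly like the sketch below. This is only illustrative: the topic names, camera name, and XML path are placeholders and have to match whatever agilex_inference.py actually publishes and subscribes to.

```python
# Illustrative MuJoCo-to-ROS bridge (placeholder topic/camera names and paths,
# not the actual interface of agilex_inference.py).
import mujoco
import numpy as np
import rospy
from cv_bridge import CvBridge
from sensor_msgs.msg import Image, JointState


class MujocoAlohaBridge:
    def __init__(self, xml_path):
        self.model = mujoco.MjModel.from_xml_path(xml_path)
        self.data = mujoco.MjData(self.model)
        # 480x640 matches the camera image size used in this setup.
        self.renderer = mujoco.Renderer(self.model, height=480, width=640)
        self.bridge = CvBridge()
        self.cmd = None

        # Placeholder topics; rename to whatever agilex_inference.py expects.
        self.img_pub = rospy.Publisher("/camera_front/image_raw", Image, queue_size=1)
        self.state_pub = rospy.Publisher("/puppet/joint_states", JointState, queue_size=1)
        rospy.Subscriber("/policy/joint_cmd", JointState, self._on_cmd, queue_size=1)

    def _on_cmd(self, msg):
        self.cmd = np.asarray(msg.position)

    def step(self):
        if self.cmd is not None:
            # Assumes position actuators ordered like the command vector.
            self.data.ctrl[: len(self.cmd)] = self.cmd
        mujoco.mj_step(self.model, self.data)

        # Render one camera and publish it as an RGB image.
        self.renderer.update_scene(self.data, camera="front")  # camera name is a placeholder
        frame = self.renderer.render()  # (480, 640, 3) uint8
        self.img_pub.publish(self.bridge.cv2_to_imgmsg(frame, encoding="rgb8"))

        # Publish the current joint positions for the policy's state input.
        js = JointState()
        js.position = self.data.qpos[:14].tolist()  # 14-dim layout assumed
        self.state_pub.publish(js)


if __name__ == "__main__":
    rospy.init_node("mujoco_aloha_bridge")
    node = MujocoAlohaBridge("aloha_scene.xml")  # placeholder scene file
    rate = rospy.Rate(25)  # control frequency (assumed)
    while not rospy.is_shutdown():
        node.step()
        rate.sleep()
```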

The only modifications I’ve made to RDT's code are adjustments to align the gripper action range with Agilex's setup (0 to 0.0475). Specifically (a rough sketch follows the list):

  1. In _format_joint_to_state(), I changed:
    [[[1, 1, 1, 1, 1, 1, 4.7908, 1, 1, 1, 1, 1, 1, 4.7888]]]
    to:
    [[[1, 1, 1, 1, 1, 1, 0.0475, 1, 1, 1, 1, 1, 1, 0.0475]]]
  2. In _unformat_action_to_joint(), I changed:
    [[[1, 1, 1, 1, 1, 1, 11.8997, 1, 1, 1, 1, 1, 1, 13.9231]]]
    to:
    [[[1, 1, 1, 1, 1, 1, 0.0475, 1, 1, 1, 1, 1, 1, 0.0475]]]
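
Roughly, the change amounts to the following sketch (not the exact RDT code; I'm assuming the first vector is applied as an element-wise divisor on the state and the second as an element-wise multiplier on the action, with the arm joints kept at a factor of 1):

```python
# Sketch of the gripper rescaling (assumed element-wise use of the vectors).
import numpy as np

# State side: divide raw joint readings so the gripper lands in [0, 1].
# Originally 4.7908 / 4.7888 (real Agilex gripper range); in MuJoCo the
# gripper joint runs from 0 to 0.0475, hence the change.
STATE_SCALE = np.array([[[1, 1, 1, 1, 1, 1, 0.0475,
                          1, 1, 1, 1, 1, 1, 0.0475]]])

# Action side: multiply the policy's normalized gripper output back into
# simulator units (originally 11.8997 / 13.9231 for the real robot).
ACTION_SCALE = np.array([[[1, 1, 1, 1, 1, 1, 0.0475,
                           1, 1, 1, 1, 1, 1, 0.0475]]])


def format_joint_to_state(joints):
    """Normalize a (1, 1, 14) joint vector before feeding it to the policy."""
    return joints / STATE_SCALE


def unformat_action_to_joint(action):
    """Rescale a (1, 1, 14) policy output back into simulator joint commands."""
    return action * ACTION_SCALE
```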

Could this modification be causing the issue? Or is there another step I might have missed in setting up Mobile Aloha in MuJoCo for inference?

Thanks again for the great work—I'm eager to get this working and achieve good results in MuJoCo!

@yongzhengqi
Author

Here's the video for the current result.

The command is "Grab the soda can and put it in the plate." and I'm running your 1B model.
Screencast from 11-18-2024 12:02:33 AM.webm

@yongzhengqi
Author

yongzhengqi commented Nov 18, 2024

And here are the images that each camera captures at frame 0. Each camera image is 480 × 640 × 3 (height, width, channels).

Left Camera (image attached)

Right Camera (image attached)

Front Camera (image attached)

@yongzhengqi
Author

yongzhengqi commented Nov 18, 2024

I noticed a discrepancy between the simulation environment and the joint range reported in your paper. For joint 3 of the arm, the paper specifies a range of -3.05433 to 0, while Agilex’s simulation setup defines the range as 0 to 3.14.

To address this, I tried flipping the joint position in _format_joint_to_state() and _unformat_action_to_joint(). However, despite this adjustment, the robot continues to exhibit meaningless, erratic movements. I also attempted to apply an offset of 3.054 in both functions, but unfortunately, that did not work either.
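
Concretely, the two variants I tried look roughly like this (the joint indices below are an assumption about the 14-dim layout: six arm joints plus one gripper per arm):

```python
# Sketch of the two joint-3 remaps tried above (sign flip vs. constant offset).
import numpy as np

JOINT3_IDX = [2, 9]  # joint 3 of the left and right arm in a 14-dim vector (assumed layout)

def sim_to_model(joints, mode="flip"):
    """Map simulator joint 3 (range 0 to 3.14) toward the paper's range (-3.05433 to 0)."""
    out = np.array(joints, dtype=float, copy=True)
    if mode == "flip":
        out[..., JOINT3_IDX] *= -1.0      # 0..3.14  ->  -3.14..0
    elif mode == "offset":
        out[..., JOINT3_IDX] -= 3.054     # 0..3.14  ->  -3.054..0.086
    return out

def model_to_sim(actions, mode="flip"):
    """Inverse mapping applied to the policy's output before sending it to MuJoCo."""
    out = np.array(actions, dtype=float, copy=True)
    if mode == "flip":
        out[..., JOINT3_IDX] *= -1.0
    elif mode == "offset":
        out[..., JOINT3_IDX] += 3.054
    return out
```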

Please let me know if you need more information.

@yongzhengqi
Author

At this point, I’m thinking that fine-tuning might be required to run inference in simulation environments. OpenVLA also seems to require fine-tuning to achieve reasonable performance on Libero.

On the other hand, @chjchjchjchjchj's pull request gives me hope: while the WidowX arm didn’t complete the task, it at least approached the spool (the task instruction is to place the spool on the towel).

@csuastt
Collaborator

csuastt commented Nov 18, 2024

Yes, fine-tuning is needed.

By the way, do not use SimplerEnv currently. The origin of their coordinate system is different from that of the real-world data. We are trying to build simulation inference. Stay tuned!

@yongzhengqi
Author

Thank you for your prompt reply! I’ll look forward to the good news.

Regarding the need for fine-tuning, do you think it’s primarily required on the perception side, the control side, or both?

@csuastt
Collaborator

csuastt commented Nov 20, 2024

Both. We usually call it the embodiment gap.

@yongzhengqi
Author

Philosophically speaking, is it correct to understand that a more diverse set of robots in the training set leads to a smaller embodiment gap in practice (i.e., less fine-tuning needed) when adapting to new robots?

@csuastt
Collaborator

csuastt commented Nov 21, 2024

Yes, I think so. However, at present, the embodiment diversity of the pre-training datasets is far from sufficient.

@yongzhengqi
Author

If the fine-tuning is solely for closing the embodiment gap, is it correct to assume that a diverse set of objects or tasks is not strictly necessary (although, of course, it would be beneficial to include them)? Can I assume the model has already learned aspects beyond the embodiment gap (e.g., visual reasoning, task planning, etc.) during pre-training?

@csuastt
Collaborator

csuastt commented Nov 22, 2024

Yes, you are right.

@zzl410

zzl410 commented Nov 24, 2024

We encountered the same issue in a real-world scenario and are unsure where the problem lies. We hope to receive some help.
The command is "Pour water from the bottle into the mug."

Here's the video of our current result:

WeChat_20241124224257.mp4

@csuastt
Collaborator

csuastt commented Nov 25, 2024

@zzl410 Have you fine-tuned the model? It seems quite abnormal...

@zzl410

zzl410 commented Nov 25, 2024

Thank you for your attention. We did not fine-tune the models; we used only the two base models, rdt-1b and rdt-170m. We tested the following three instructions:

  • Pour water from the bottle into the mug.
  • Pick up the black marker on the right and put it into the packaging box on the left.
  • Fold the basketball shorts into a rectangle.

In all cases, a similar issue occurred: the robotic arm moved upwards during the grasping motion.
The command is:

python -m scripts.agilex_inference \
    --use_actions_interpolation \
    --pretrained_model_name_or_path /home/mobilealoha/RoboticsDiffusionTransformer/robotics-diffusion-transformer/rdt-1b \
    --lang_embeddings_path outs/Pour_water.pt \
    --ctrl_freq 25

@zzl410

zzl410 commented Nov 25, 2024

Interestingly, even in a completely dark experimental environment, the robotic arm still exhibits the same issue. We verified the reception of camera image data, confirming that it is complete and accurate.

@csuastt
Collaborator

csuastt commented Nov 25, 2024

You should fine-tune first, since the pre-trained checkpoint has never seen your embodiment before.

@ROSKING

ROSKING commented Nov 27, 2024

> Yes, fine-tuning is needed.
>
> By the way, do not use SimplerEnv currently. The origin of their coordinate system is different from that of the real-world data. We are trying to build simulation inference. Stay tuned!

Nice! When will the simulation inference version be released?

@zzl410

zzl410 commented Nov 27, 2024

> You should fine-tune first, since the pre-trained checkpoint has never seen your embodiment before.

Thank you, it works.
