
openvla policy integration #10

Open

wants to merge 9 commits into main

Conversation

@DelinQu

DelinQu commented Jul 5, 2024

This pull request integrates the OpenVLA policy. The evaluation scripts under ./scripts/ remain consistent with the original repo.

@xuanlinli17
Collaborator

Thanks for the contribution! Could you post the success rates for each task?

Could you post a source for the implementation of horizon, pred_action_horizon, exec_horizon, action ensemble, and the sticky gripper matching? Do they match the real eval?

Code comments:

  • Remove " + OpenVLA policy" in readme
  • OpenVALInference -> OpenVLAInference
  • self.sticky_action_is_on is False in L147 of openvla_model.py: change to (not self.sticky_action_is_on)
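
On the last point: comparing against False with "is" is also fragile if the flag ever comes from a numpy expression (gripper-related values in this code are numpy arrays); a minimal illustration, not taken from the repo:

import numpy as np

flag = np.bool_(False)   # e.g. a flag produced by a numpy comparison
print(flag is False)     # False: the identity check misses numpy's bool scalar
print(not flag)          # True: truthiness works for both builtin and numpy bools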

@DelinQu
Author

DelinQu commented Jul 6, 2024

Hi @xuanlinli17, I have corrected all the typos. The implementation of horizon, pred_action_horizon, exec_horizon, action ensemble, etc., is migrated from Octo. The evaluation results can be downloaded at
report_openvla.log

@xuanlinli17
Collaborator

xuanlinli17 commented Jul 6, 2024

I see. You might want to get help from the official authors to validate/revise the implementation, as the Bridge results are near zero for some reason, and pick coke can has large variance across different backgrounds. Additionally, it's possible that OpenVLA does not follow the Octo implementation in real deployment.

@xuanlinli17
Collaborator

xuanlinli17 commented Jul 6, 2024

Also, you can modify https://github.com/simpler-env/SimplerEnv/blob/main/tools/calc_metrics_evaluation_videos.py to quickly summarize the results for OpenVLA (just put dummy numbers in place of the real-robot numbers and don't push the script). You can ignore the NaNs.

@HuFY-dev

HuFY-dev commented Jul 7, 2024

I'm also working on integrating OpenVLA into SimplerEnv, and I had the same issue: OpenVLA fails drastically on Bridge. I wonder if that has anything to do with the controller mentioned in #11

@sombitd

sombitd commented Jul 8, 2024

Same here: severe lack of performance of OpenVLA on the WidowX robot.

@xuanlinli17
Collaborator

I checked with the authors and I don't think there is action ensembling or action history. Here is the updated code, which you can try:

from typing import Optional, Sequence
import os
import matplotlib.pyplot as plt
import numpy as np
from transforms3d.euler import euler2axangle
from transformers import AutoModelForVision2Seq, AutoProcessor
from PIL import Image
import torch
import cv2 as cv


class OpenVLAInference:
    def __init__(
        self,
        saved_model_path: str = "openvla/openvla-7b",
        unnorm_key: Optional[str] = None,
        policy_setup: str = "widowx_bridge",
        horizon: int = 1,
        pred_action_horizon: int = 1,
        exec_horizon: int = 1,
        image_size: list[int] = [224, 224],
        action_scale: float = 1.0,
    ) -> None:
        os.environ["TOKENIZERS_PARALLELISM"] = "false"
        if policy_setup == "widowx_bridge":
            unnorm_key = "bridge_orig" if unnorm_key is None else unnorm_key
            self.sticky_gripper_num_repeat = 1
        elif policy_setup == "google_robot":
            unnorm_key = "fractal20220817_data" if unnorm_key is None else unnorm_key
            self.sticky_gripper_num_repeat = 15
        else:
            raise NotImplementedError(
                f"Policy setup {policy_setup} not supported for octo models. The other datasets can be found in the huggingface config.json file."
            )
        self.policy_setup = policy_setup
        self.unnorm_key = unnorm_key

        print(f"*** policy_setup: {policy_setup}, unnorm_key: {unnorm_key} ***")
        self.processor = AutoProcessor.from_pretrained(saved_model_path, trust_remote_code=True)
        self.vla = AutoModelForVision2Seq.from_pretrained(
            "openvla/openvla-7b",
            attn_implementation="flash_attention_2",  # [Optional] Requires `flash_attn`
            torch_dtype=torch.bfloat16,
            low_cpu_mem_usage=True,
            trust_remote_code=True,
        ).cuda()

        self.image_size = image_size
        self.action_scale = action_scale
        self.horizon = horizon
        self.pred_action_horizon = pred_action_horizon
        self.exec_horizon = exec_horizon

        self.sticky_action_is_on = False
        self.gripper_action_repeat = 0
        self.sticky_gripper_action = 0.0
        self.previous_gripper_action = None

        self.task = None
        self.task_description = None
        self.num_image_history = 0

    def reset(self, task_description: str) -> None:
        self.task_description = task_description
        self.num_image_history = 0

        self.sticky_action_is_on = False
        self.gripper_action_repeat = 0
        self.sticky_gripper_action = 0.0
        self.previous_gripper_action = None

    def step(
        self, image: np.ndarray, task_description: Optional[str] = None, *args, **kwargs
    ) -> tuple[dict[str, np.ndarray], dict[str, np.ndarray]]:
        """
        Input:
            image: np.ndarray of shape (H, W, 3), uint8
            task_description: Optional[str], task description; if different from previous task description, policy state is reset
        Output:
            raw_action: dict; raw policy action output
            action: dict; processed action to be sent to the maniskill2 environment, with the following keys:
                - 'world_vector': np.ndarray of shape (3,), xyz translation of robot end-effector
                - 'rot_axangle': np.ndarray of shape (3,), axis-angle representation of end-effector rotation
                - 'gripper': np.ndarray of shape (1,), gripper action
                - 'terminate_episode': np.ndarray of shape (1,), 1 if episode should be terminated, 0 otherwise
        """
        if task_description is not None:
            if task_description != self.task_description:
                self.reset(task_description)

        assert image.dtype == np.uint8
        image = self._resize_image(image)

        image: Image.Image = Image.fromarray(image)
        prompt = task_description

        # predict action (7-dof; un-normalize for bridgev2)
        inputs = self.processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
        raw_actions = self.vla.predict_action(**inputs, unnorm_key=self.unnorm_key, do_sample=False)[None]
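        # raw_actions has shape (1, 7): xyz translation delta, roll/pitch/yaw rotation delta, gripper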
        # print(f"*** raw actions {raw_actions} ***")

        raw_action = {
            "world_vector": np.array(raw_actions[0, :3]),
            "rotation_delta": np.array(raw_actions[0, 3:6]),
            "open_gripper": np.array(raw_actions[0, 6:7]),  # range [0, 1]; 1 = open; 0 = close
        }

        # process raw_action to obtain the action to be sent to the maniskill2 environment
        action = {}
        action["world_vector"] = raw_action["world_vector"] * self.action_scale
        action_rotation_delta = np.asarray(raw_action["rotation_delta"], dtype=np.float64)
        roll, pitch, yaw = action_rotation_delta
        action_rotation_ax, action_rotation_angle = euler2axangle(roll, pitch, yaw)
        action_rotation_axangle = action_rotation_ax * action_rotation_angle
        action["rot_axangle"] = action_rotation_axangle * self.action_scale

        if self.policy_setup == "google_robot":
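            # Convert the policy's absolute gripper output into a relative action and keep
            # ("stick to") it for sticky_gripper_num_repeat steps, matching the sticky
            # gripper handling used for the Google Robot setup in SimplerEnv.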
            current_gripper_action = raw_action["open_gripper"]
            if self.previous_gripper_action is None:
                relative_gripper_action = np.array([0])
            else:
                relative_gripper_action = self.previous_gripper_action - current_gripper_action
            self.previous_gripper_action = current_gripper_action

            if np.abs(relative_gripper_action) > 0.5 and (not self.sticky_action_is_on):
                self.sticky_action_is_on = True
                self.sticky_gripper_action = relative_gripper_action

            if self.sticky_action_is_on:
                self.gripper_action_repeat += 1
                relative_gripper_action = self.sticky_gripper_action

            if self.gripper_action_repeat == self.sticky_gripper_num_repeat:
                self.sticky_action_is_on = False
                self.gripper_action_repeat = 0
                self.sticky_gripper_action = 0.0

            action["gripper"] = relative_gripper_action

        elif self.policy_setup == "widowx_bridge":
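            # Binarize the [0, 1] open-gripper output at 0.5 and map it to {-1, +1}
            # for the WidowX gripper controller.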
            action["gripper"] = 2.0 * (raw_action["open_gripper"] > 0.5) - 1.0

        action["terminate_episode"] = np.array([0.0])

        return raw_action, action

    def _resize_image(self, image: np.ndarray) -> np.ndarray:
        image = cv.resize(image, tuple(self.image_size), interpolation=cv.INTER_AREA)
        return image

    def visualize_epoch(
        self, predicted_raw_actions: Sequence[np.ndarray], images: Sequence[np.ndarray], save_path: str
    ) -> None:
        images = [self._resize_image(image) for image in images]
        ACTION_DIM_LABELS = ["x", "y", "z", "roll", "pitch", "yaw", "grasp"]

        img_strip = np.concatenate(np.array(images[::3]), axis=1)

        # set up plt figure
        figure_layout = [["image"] * len(ACTION_DIM_LABELS), ACTION_DIM_LABELS]
        plt.rcParams.update({"font.size": 12})
        fig, axs = plt.subplot_mosaic(figure_layout)
        fig.set_size_inches([45, 10])

        # plot actions
        pred_actions = np.array(
            [
                np.concatenate([a["world_vector"], a["rotation_delta"], a["open_gripper"]], axis=-1)
                for a in predicted_raw_actions
            ]
        )
        for action_dim, action_label in enumerate(ACTION_DIM_LABELS):
            # actions have batch, horizon, dim, in this example we just take the first action for simplicity
            axs[action_label].plot(pred_actions[:, action_dim], label="predicted action")
            axs[action_label].set_title(action_label)
            axs[action_label].set_xlabel("Time in one episode")

        axs["image"].imshow(img_strip)
        axs["image"].set_xlabel("Time in one episode (subsampled)")
        plt.legend()
        plt.savefig(save_path)
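
For reference, a minimal usage sketch of the class above (illustrative only: the dummy image and task string are stand-ins, and in SimplerEnv the observation would come from the environment and the action dict would be mapped to the controller):

import numpy as np

policy = OpenVLAInference(policy_setup="widowx_bridge")
policy.reset("put the spoon on the towel")

dummy_image = np.zeros((480, 640, 3), dtype=np.uint8)  # stand-in for an environment render
raw_action, action = policy.step(dummy_image, "put the spoon on the towel")
print(action["world_vector"], action["rot_axangle"], action["gripper"])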

@xuanlinli17
Collaborator

OpenVLA setup requirements:

pip install torch==2.3.1 torchvision==0.18.1 timm==0.9.10 tokenizers==0.15.2 accelerate==0.32.1
pip install flash-attn==2.6.1 --no-build-isolation

Please add these instructions to the README under a new "OpenVLA Inference Setup" section.

@xuanlinli17
Collaborator

Typos in scripts/openvla_put_in_drawer_variant_agg.sh and scripts/openvla_put_in_drawer_visual_matching.sh: should be "declare -a ckpt_paths=("openvla/openvla-7b")"

@erjiaxiao

I checked with the authors and I don't think there is action ensembling or action history. Here is the updated code, which you can try:

[code omitted; identical to the snippet quoted above]

Hello @xuanlinli17, I tried the code above but OpenVLA still fails on WidowX tasks. Is it possibly an implementation problem? Should I set any parameters for WidowX?

@xuanlinli17
Collaborator

xuanlinli17 commented Jul 22, 2024

Yeah, that's my finding too, but I don't think the authors did any special treatment to evaluate OpenVLA on Bridge.

There might be some coordinate transform on Bridge that is different.

@erjiaxiao

@xuanlinli17 Thank you! I will continue to look into this. If I find any additional information or solutions, I'll make sure to share it with you.

@RZFan525

When I run the script bash scripts/openvla_bridge.sh in ./SimplerEnv-OpenVLA, an error appears:

from simpler_env.policies.openvla.openvla_model import OpenVLAInference

ModuleNotFoundError: No module named 'simpler_env.policies.openvla'

How to fix it?

@xuanlinli17
Collaborator

^ Run pip install -e . in the SimplerEnv-OpenVLA repo. Also, I don't think the authors have updated the scripts yet; I'll try to push some updates.

@hilookas

hilookas commented Jul 29, 2024

I have made a run; here is the full result of OpenVLA:

I used my branch as the codebase: https://github.com/hilookas/SimplerEnv

(note: the "real" result is set to 0)

(openvla) ubuntu@soda:~/SimplerEnv$ python tools/calc_metrics_evaluation_videos.py --task=pick_coke_can
***Pick coke can results***
--------------------
horizontal sim variant avg success {'openvla': 0.711111111111111}
horizontal real success {'openvla': 0}
horizontal MMRV 0.0
horizontal pearson correlation 1
vertical sim variant avg success {'openvla': 0.27111111111111114}
vertical real success {'openvla': 0}
vertical MMRV 0.0
vertical pearson correlation 1
standing sim variant avg success {'openvla': 0.6533333333333333}
standing real success {'openvla': 0}
standing MMRV 0.0
standing pearson correlation 1
avg_orientation_sim_variant_results [0.5451851851851851]
avg_orientation_real_results [0.0]
mean_maximum_rank_violation(avg_orientation_sim_variant_results, avg_orientation_real_results) 0.0
pearson_correlation(avg_orientation_sim_variant_results, avg_orientation_real_results) 1
--------------------
Orientation horizontal, ckpt openvla all robot arm visual matching success: [0.28, 0.2, 0.32, 0.28]
Orientation vertical, ckpt openvla all robot arm visual matching success: [0.0, 0.08, 0.04, 0.0]
Orientation standing, ckpt openvla all robot arm visual matching success: [0.2, 0.08, 0.36, 0.12]
horizontal visual matching sim success {'openvla': 0.27}
horizontal real success {'openvla': 0}
horizontal MMRV 0.0
horizontal pearson correlation 1
horizontal kruskal:
       each checkpoint kruskal:
             KruskalResult(statistic=7.976744186046492, pvalue=0.004738208143666835)
vertical visual matching sim success {'openvla': 0.03}
vertical real success {'openvla': 0}
vertical MMRV 0.0
vertical pearson correlation 1
vertical kruskal:
       each checkpoint kruskal:
             KruskalResult(statistic=0.9999999999995735, pvalue=0.3173105078630144)
standing visual matching sim success {'openvla': 0.19}
standing real success {'openvla': 0}
standing MMRV 0.0
standing pearson correlation 1
standing kruskal:
       each checkpoint kruskal:
             KruskalResult(statistic=5.444444444444439, pvalue=0.019630657257290737)
avg_orientation_sim_visual_matching_results [0.16333333333333336]
avg_orientation_real_results [0.0]
mean_maximum_rank_violation(avg_orientation_sim_visual_matching_results, avg_orientation_real_results) 0.0
pearson_correlation(avg_orientation_sim_visual_matching_results, avg_orientation_real_results) 1
avg kruskal:
       each checkpoint kruskal:
             KruskalResult(statistic=12.956521739130327, pvalue=0.0003188089440220541)
********************



(openvla) ubuntu@soda:~/SimplerEnv$ python tools/calc_metrics_evaluation_videos.py --task=move_near
***Move Near results***
--------------------
sim variant avg success {'openvla': 0.4770833333333333}
real success {'openvla': 0}
MMRV 0.0
pearson correlation 1
--------------------
Ckpt openvla all robot arm visual matching success: [0.5333333333333333, 0.43333333333333335, 0.4, 0.48333333333333334]
sim visual matching success {'openvla': 0.4625}
real success {'openvla': 0}
visual matching MMRV 0.0
visual matching pearson correlation 1
avg kruskal:
       each checkpoint kruskal:
             KruskalResult(statistic=36.21739130434784, pvalue=1.7648851059881935e-09)
********************



(openvla) ubuntu@soda:~/SimplerEnv$ python tools/calc_metrics_evaluation_videos.py --task=drawer
***Drawer results***
--------------------
open sim variant avg success {'openvla': 0.15873015873015872}
open real success {'openvla': 0}
open MMRV 0.0
open pearson correlation 1
close sim variant avg success {'openvla': 0.19576719576719576}
close real success {'openvla': 0}
close MMRV 0.0
close pearson correlation 1
avg_sim_variant_results [0.17724867724867724]
avg_real_results [0.0]
mean_maximum_rank_violation(avg_sim_variant_results, avg_real_results) 0.0
pearson_correlation(avg_sim_variant_results, avg_real_results) 1
--------------------
Drawer task open, ckpt openvla all robot arm visual matching success: [0.2222222222222222, 0.2222222222222222, 0.1388888888888889, 0.2222222222222222]
Drawer task close, ckpt openvla all robot arm visual matching success: [0.5555555555555556, 0.5, 0.5, 0.5185185185185186]
open visual matching sim success {'openvla': 0.19444444444444442}
open real success {'openvla': 0}
open MMRV 0.0
open pearson correlation 1
open kruskal:
       each checkpoint kruskal:
             KruskalResult(statistic=5.408163265306165, pvalue=0.02004279392017093)
close visual matching sim success {'openvla': 0.5185185185185185}
close real success {'openvla': 0}
close MMRV 0.0
close pearson correlation 1
close kruskal:
       each checkpoint kruskal:
             KruskalResult(statistic=18.549999999999994, pvalue=1.655052286653112e-05)
avg_sim_visual_matching_results [0.35648148148148145]
avg_real_results [0.0]
mean_maximum_rank_violation(avg_sim_visual_matching_results, avg_real_results) 0.0
pearson_correlation(avg_sim_visual_matching_results, avg_real_results) 1
avg kruskal:
       each checkpoint kruskal:
             KruskalResult(statistic=22.842696629213542, pvalue=1.7581614124344239e-06)
********************



(openvla) ubuntu@soda:~/SimplerEnv$ python tools/calc_metrics_evaluation_videos.py --task=bridge_put_on
***Bridge Put On Env results***
********** Results for put_spoon_on_tablecloth **********
sim visual matching partial success {'openvla': 0.041666666666666664}
real partial success {'openvla': 0}
visual matching MMRV (partial success) 0.0
visual matching pearson correlation (partial success)  1
avg kruskal (partial success):
       each checkpoint kruskal:
             KruskalResult(statistic=0.9999999999997733, pvalue=0.3173105078629661)
sim visual matching success {'openvla': 0.0}
real success {'openvla': 0}
visual matching MMRV 0.0
visual matching pearson correlation 1
avg kruskal:
       each checkpoint kruskal:
             all same, 1.0
********************



********** Results for put_carrot_on_plate **********
sim visual matching partial success {'openvla': 0.3333333333333333}
real partial success {'openvla': 0}
visual matching MMRV (partial success) 0.0
visual matching pearson correlation (partial success)  1
avg kruskal (partial success):
       each checkpoint kruskal:
             KruskalResult(statistic=9.399999999999977, pvalue=0.002169854427130147)
sim visual matching success {'openvla': 0.0}
real success {'openvla': 0}
visual matching MMRV 0.0
visual matching pearson correlation 1
avg kruskal:
       each checkpoint kruskal:
             all same, 1.0
********************



********** Results for stack_green_block_on_yellow_block **********
sim visual matching partial success {'openvla': 0.125}
real partial success {'openvla': 0}
visual matching MMRV (partial success) 0.0
visual matching pearson correlation (partial success)  1
avg kruskal (partial success):
       each checkpoint kruskal:
             KruskalResult(statistic=3.133333333333267, pvalue=0.07670675171090896)
sim visual matching success {'openvla': 0.0}
real success {'openvla': 0}
visual matching MMRV 0.0
visual matching pearson correlation 1
avg kruskal:
       each checkpoint kruskal:
             all same, 1.0
********************



********** Results for put_eggplant_in_basket **********
sim visual matching partial success {'openvla': 0.08333333333333333}
real partial success {'openvla': 0}
visual matching MMRV (partial success) 0.0
visual matching pearson correlation (partial success)  1
avg kruskal (partial success):
       each checkpoint kruskal:
             KruskalResult(statistic=2.0434782608695747, pvalue=0.15285977016579425)
sim visual matching success {'openvla': 0.041666666666666664}
real success {'openvla': 0}
visual matching MMRV 0.0
visual matching pearson correlation 1
avg kruskal:
       each checkpoint kruskal:
             KruskalResult(statistic=0.9999999999997733, pvalue=0.3173105078629661)
********************



@xuanlinli17
Collaborator

xuanlinli17 commented Jul 29, 2024

For Google Robot pick coke can, it looks like the variant aggregation eval of OpenVLA is a lot better than visual matching, which is interesting...

horizontal sim variant avg success {'openvla-7b': 0.6444444444444444}
vertical sim variant avg success {'openvla-7b': 0.2177777777777778}
standing sim variant avg success {'openvla-7b': 0.7288888888888888}
avg_orientation_sim_variant_results [ 0.5303703703703704]

@Toradus

Toradus commented Aug 3, 2024

@hilookas Thanks for providing the results!
It seems like the open_drawer_variant_agg results are missing, as the log shows "horizontal sim variant avg success {'openvla': nan}".
Would it be possible for you to also run that evaluation and update your results list?

@QuanyiLi

QuanyiLi commented Aug 3, 2024

@hilookas Thank you for your great work! I want to try out OpenVLA in the simulator as well. But I wonder why the performance in sim is not as good as what the paper claims on the real-world benchmark. It should not be due to the sim-to-real gap, right? After all, SIMPLER is designed to mitigate this gap.

@xuanlinli17
Collaborator

xuanlinli17 commented Aug 3, 2024

@QuanyiLi OpenVLA did 5 trials in real for each task (and there's no grid-based evals with >= 50 trials per task for Google Robot like in Simpler). Task settings like the cabinets and backgrounds used in the real world can also be different. We are requesting paired sim-real evaluation from Google following Simpler's protocol (and the same backgrounds, cabinets, etc).
image

@QuanyiLi

QuanyiLi commented Aug 3, 2024

[quoting the reply above]

Thanks. Look forward to the updated results!

@hilookas

hilookas commented Aug 4, 2024

[quoting the request above]

Sure! I have updated the results. Please see the log above.

My results are slightly different, but not by much, from xuanlinli17's run:

[quoting xuanlinli17's pick-coke-can numbers above]

@hilookas

hilookas commented Aug 4, 2024

Here is updated table:

image

image

@yxchng

yxchng commented Aug 5, 2024

How much memory is needed to run OpenVLA? I tried a 3090 and a 40GB A100, but both ran out of memory.

@QuanyiLi

QuanyiLi commented Aug 5, 2024

@yxchng It takes about 15 GB of VRAM on a 4090 for me, following the official instructions to use bf16.

@hilookas

hilookas commented Aug 7, 2024

How much memory is needed to run OpenVLA? I tried 3090 and 40GB A100 but both go out of memory.

One 3090 is enough. Just remember not to run more than one env and inference process at a time.

@SiyuanHuang95

Here is updated table:

image

image

@hilookas Would you mind telling me where the tables come from? I did not see them in the OpenVLA paper.

@hilookas

hilookas commented Aug 8, 2024

[quoting the table post and question above]

I made it :D

Based on my experiment above.

If you have results from another run, please let me know!

@SiyuanHuang95

[quoting the exchange above]

I do not have enough GPU resources attached to a screen, which the SAPIEN simulator requires. :-( Running OpenVLA locally is quite a burden for most consumer-level PCs.

xuanlinli17 self-requested a review on August 19, 2024 02:48
@4evertutelary

OpenVLA easily succeeds on the Google Robot tasks BUT always fails on WidowX-related tasks.
(prompt = "In: What action should the robot take to {}?\nOut:")

@xuanlinli17
Would you mind having a try on WidowX?

@xuanlinli17
Collaborator

Simpler-OpenVLA on WidowX is known to have some strange behaviors, and I don't yet know why... Would you investigate too?

@cyanzhao42

cyanzhao42 commented Aug 24, 2024

[quoting the results table above]

Hi, I am wondering what the experiment setup is for the numbers in the paper: are those numbers obtained by running SimplerEnv/scripts, i.e. for "pick_coke_can_visual_matching" it is basically 100 trials (25*4) for each coke can layout? Thanks!

@xuanlinli17
Collaborator

[quoting the question above]

Yes.

@hahans

hahans commented Aug 24, 2024

Screenshot from 2024-08-24 11-14-44
How do I solve this problem?

@xuanlinli17
Collaborator

Screenshot from 2024-08-24 11-14-44 How do I solve this problem

Please run all scripts under the root SimplerEnv repo.

@hahans

hahans commented Aug 24, 2024

[quoting the reply above]

thanks.

@hahans

hahans commented Aug 24, 2024

Screenshot from 2024-08-24 12-51-45
I ran the bash script, but nothing happened.
Is there a configuration problem?

@xuanlinli17
Collaborator

xuanlinli17 commented Aug 24, 2024

Please follow the troubleshooting section in the README; if the issue still persists, please open a new issue & discussion.

@hahans

hahans commented Aug 24, 2024

@xuanlinli17 Maybe I didn't make myself clear. What I meant was that I ran the bash file inside the scripts folder, but the robot simulation window didn't appear. What am I supposed to do to make the picture appear?

@eyucherin

eyucherin commented Sep 8, 2024

@xuanlinli17 @DelinQu Hi! I am trying to run OpenVLA on SimplerEnv and I am getting this error when I run pip install flash-attn==2.6.1 --no-build-isolation

Here is the error message.

Collecting flash-attn==2.6.1
  Using cached flash_attn-2.6.1.tar.gz (2.6 MB)
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error

  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [12 lines of output]
      fatal: not a git repository (or any of the parent directories): .git
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/tmp/pip-install-ehx9vciz/flash-attn_3fef33cfddd542b995a983488d613f27/setup.py", line 115, in <module>
          raise RuntimeError(
      RuntimeError: FlashAttention is only supported on CUDA 11.6 and above.  Note: make sure nvcc has a supported version by running nvcc -V.


      torch.__version__  = 2.3.1+cu121


      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

and here is my GPU information

nvidia-smi
Sun Sep  8 22:01:12 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        Off |   00000000:01:00.0  On |                  Off |
| 31%   31C    P8             21W /  450W |     557MiB /  24564MiB |      4%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      1177      G   /usr/lib/xorg/Xorg                            295MiB |
|    0   N/A  N/A      1312      G   /usr/bin/gnome-shell                           86MiB |
|    0   N/A  N/A      4067      G   ...seed-version=20240906-130113.352000         73MiB |
|    0   N/A  N/A      8205      G   ...erProcess --variations-seed-version         21MiB |
|    0   N/A  N/A     20334    C+G   warp-terminal                                  31MiB |
+-----------------------------------------------------------------------------------------+

any insight on this would be greatly appreciated!

@xuanlinli17
Collaborator

^ I think you need a local CUDA version >= 11.6, probably one that matches your torch build (cu121).
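
A quick way to compare the two versions (a minimal sketch, not from this thread):

import subprocess
import torch

print("torch CUDA build:", torch.version.cuda)  # e.g. 12.1 for torch==2.3.1+cu121
print(subprocess.run(["nvcc", "-V"], capture_output=True, text=True).stdout)  # local CUDA toolkit version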

@RZFan525

RZFan525 commented Oct 7, 2024

When I run openvla_bridge.sh, I encounter an error:
scripts/openvla_bridge.sh: line 18: 92071 Floating point exception(core dumped)
I debugged and found that the problem is in def step of class OpenVLAInference:
raw_actions = self.vla.predict_action(**inputs, unnorm_key=self.unnorm_key, do_sample=False)[None]
I don't know why, but it works when I run a simple script that uses the OpenVLA model:

# Install minimal dependencies (`torch`, `transformers`, `timm`, `tokenizers`, ...)
# > pip install -r https://raw.githubusercontent.com/openvla/openvla/main/requirements-min.txt
from transformers import AutoModelForVision2Seq, AutoProcessor
from PIL import Image

import torch

# Load Processor & VLA
processor = AutoProcessor.from_pretrained("/home/rzfan/ckpts/openvla/openvla-7b", trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    "/home/rzfan/ckpts/openvla/openvla-7b",
    attn_implementation="flash_attention_2",  # [Optional] Requires `flash_attn`
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True
).to("cuda:0")

# Grab image input & format prompt
image: Image.Image = Image.open("datasets/traj0/images0/im_0.jpg")
INSTRUCTION = "Put the cucumber next to the banana"
prompt = f"In: What action should the robot take to {INSTRUCTION}?\nOut:"

# Predict Action (7-DoF; un-normalize for BridgeData V2)
inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
print(inputs)
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
print(action)

My machine has an RTX 4090 and runs Ubuntu.

@lucywang720

Hello,

I'm curious why you didn't add "In: What action should the robot take to {INSTRUCTION}?\nOut:" when passing prompts to the processor. Would adding or omitting this template affect the results?
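
For concreteness, a minimal sketch of applying that template inside step() (the task string is illustrative, and whether this changes SimplerEnv results is exactly the open question above):

# Sketch: wrap the raw task description in OpenVLA's prompt template before
# calling the processor, instead of passing the task description directly.
task_description = "put the spoon on the towel"  # illustrative
prompt = f"In: What action should the robot take to {task_description}?\nOut:"
# inputs = self.processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)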
