openvla policy intergration #10

DelinQu · 2024-07-05T10:27:04Z

This pull request integrates openvla policy. The evaluation scripts remain consistent with the original repo under ./scripts/

xuanlinli17 · 2024-07-05T16:10:49Z

Thanks for the contribution! Could you post the success rates for each task?

Could you post a source for the implementation of horizon, pred_action_horizon, exec_horizon, action ensemble, sticky gripper match, and do they match the real eval?

Code comments:

Remove " + OpenVLA policy" in readme
OpenVALInference -> OpenVLAInference
self.sticky_action_is_on is False in L147 of openvla_model.py: change to (not self.sticky_action_is_on)

DelinQu · 2024-07-06T14:00:54Z

Hi xuanlinli17, I have corrected all the typos. The implementation of horizon, pred_action_horizon, exec_horizon, action ensemble, etc., are migrating from octo. The evaluation results can be downloaded at
report_openvla.log

xuanlinli17 · 2024-07-06T15:15:39Z

I see. Might want to get help from the official authors to validate / revise the implementation as the Bridge results are near zero for some reason, and pick coke can has large variance across different backgrounds. Additionally, it's possible that OpenVLA might not follow Octo implementation in real deployment.

xuanlinli17 · 2024-07-06T15:19:24Z

Also you can modify https://github.com/simpler-env/SimplerEnv/blob/main/tools/calc_metrics_evaluation_videos.py to quickly summarize the results for OpenVLA (just put dummy numbers for the real numbers and don't push the script). You can ignore the nans.

HuFY-dev · 2024-07-07T01:55:59Z

I'm also working on implementing OpenVLA into SimplerEnv, and I had the same issue: OpenVLA fails drastically on Bridge. I wonder if that has anything to do with the controller mentioned in #11

sombitd · 2024-07-08T09:20:26Z

Same here, Severe lack of performance of OpenVLA on WindowX robot.

xuanlinli17 · 2024-07-20T00:07:59Z

I checked with the authors and I don't think there is action ensembling or action history. Here is the updated code, which you can try:

from typing import Optional, Sequence
import os
import matplotlib.pyplot as plt
import numpy as np
from transforms3d.euler import euler2axangle
from transformers import AutoModelForVision2Seq, AutoProcessor
from PIL import Image
import torch
import cv2 as cv


class OpenVLAInference:
    def __init__(
        self,
        saved_model_path: str = "openvla/openvla-7b",
        unnorm_key: Optional[str] = None,
        policy_setup: str = "widowx_bridge",
        horizon: int = 1,
        pred_action_horizon: int = 1,
        exec_horizon: int = 1,
        image_size: list[int] = [224, 224],
        action_scale: float = 1.0,
    ) -> None:
        os.environ["TOKENIZERS_PARALLELISM"] = "false"
        if policy_setup == "widowx_bridge":
            unnorm_key = "bridge_orig" if unnorm_key is None else unnorm_key
            self.sticky_gripper_num_repeat = 1
        elif policy_setup == "google_robot":
            unnorm_key = "fractal20220817_data" if unnorm_key is None else unnorm_key
            self.sticky_gripper_num_repeat = 15
        else:
            raise NotImplementedError(
                f"Policy setup {policy_setup} not supported for octo models. The other datasets can be found in the huggingface config.json file."
            )
        self.policy_setup = policy_setup
        self.unnorm_key = unnorm_key

        print(f"*** policy_setup: {policy_setup}, unnorm_key: {unnorm_key} ***")
        self.processor = AutoProcessor.from_pretrained(saved_model_path, trust_remote_code=True)
        self.vla = AutoModelForVision2Seq.from_pretrained(
            "openvla/openvla-7b",
            attn_implementation="flash_attention_2",  # [Optional] Requires `flash_attn`
            torch_dtype=torch.bfloat16,
            low_cpu_mem_usage=True,
            trust_remote_code=True,
        ).cuda()

        self.image_size = image_size
        self.action_scale = action_scale
        self.horizon = horizon
        self.pred_action_horizon = pred_action_horizon
        self.exec_horizon = exec_horizon

        self.sticky_action_is_on = False
        self.gripper_action_repeat = 0
        self.sticky_gripper_action = 0.0
        self.previous_gripper_action = None

        self.task = None
        self.task_description = None
        self.num_image_history = 0

    def reset(self, task_description: str) -> None:
        self.task_description = task_description
        self.num_image_history = 0

        self.sticky_action_is_on = False
        self.gripper_action_repeat = 0
        self.sticky_gripper_action = 0.0
        self.previous_gripper_action = None

    def step(
        self, image: np.ndarray, task_description: Optional[str] = None, *args, **kwargs
    ) -> tuple[dict[str, np.ndarray], dict[str, np.ndarray]]:
        """
        Input:
            image: np.ndarray of shape (H, W, 3), uint8
            task_description: Optional[str], task description; if different from previous task description, policy state is reset
        Output:
            raw_action: dict; raw policy action output
            action: dict; processed action to be sent to the maniskill2 environment, with the following keys:
                - 'world_vector': np.ndarray of shape (3,), xyz translation of robot end-effector
                - 'rot_axangle': np.ndarray of shape (3,), axis-angle representation of end-effector rotation
                - 'gripper': np.ndarray of shape (1,), gripper action
                - 'terminate_episode': np.ndarray of shape (1,), 1 if episode should be terminated, 0 otherwise
        """
        if task_description is not None:
            if task_description != self.task_description:
                self.reset(task_description)

        assert image.dtype == np.uint8
        image = self._resize_image(image)

        image: Image.Image = Image.fromarray(image)
        prompt = task_description

        # predict action (7-dof; un-normalize for bridgev2)
        inputs = self.processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
        raw_actions = self.vla.predict_action(**inputs, unnorm_key=self.unnorm_key, do_sample=False)[None]
        # print(f"*** raw actions {raw_actions} ***")

        raw_action = {
            "world_vector": np.array(raw_actions[0, :3]),
            "rotation_delta": np.array(raw_actions[0, 3:6]),
            "open_gripper": np.array(raw_actions[0, 6:7]),  # range [0, 1]; 1 = open; 0 = close
        }

        # process raw_action to obtain the action to be sent to the maniskill2 environment
        action = {}
        action["world_vector"] = raw_action["world_vector"] * self.action_scale
        action_rotation_delta = np.asarray(raw_action["rotation_delta"], dtype=np.float64)
        roll, pitch, yaw = action_rotation_delta
        action_rotation_ax, action_rotation_angle = euler2axangle(roll, pitch, yaw)
        action_rotation_axangle = action_rotation_ax * action_rotation_angle
        action["rot_axangle"] = action_rotation_axangle * self.action_scale

        if self.policy_setup == "google_robot":
            current_gripper_action = raw_action["open_gripper"]
            if self.previous_gripper_action is None:
                relative_gripper_action = np.array([0])
            else:
                relative_gripper_action = self.previous_gripper_action - current_gripper_action
            self.previous_gripper_action = current_gripper_action

            if np.abs(relative_gripper_action) > 0.5 and (not self.sticky_action_is_on):
                self.sticky_action_is_on = True
                self.sticky_gripper_action = relative_gripper_action

            if self.sticky_action_is_on:
                self.gripper_action_repeat += 1
                relative_gripper_action = self.sticky_gripper_action

            if self.gripper_action_repeat == self.sticky_gripper_num_repeat:
                self.sticky_action_is_on = False
                self.gripper_action_repeat = 0
                self.sticky_gripper_action = 0.0

            action["gripper"] = relative_gripper_action

        elif self.policy_setup == "widowx_bridge":
            action["gripper"] = 2.0 * (raw_action["open_gripper"] > 0.5) - 1.0

        action["terminate_episode"] = np.array([0.0])

        return raw_action, action

    def _resize_image(self, image: np.ndarray) -> np.ndarray:
        image = cv.resize(image, tuple(self.image_size), interpolation=cv.INTER_AREA)
        return image

    def visualize_epoch(
        self, predicted_raw_actions: Sequence[np.ndarray], images: Sequence[np.ndarray], save_path: str
    ) -> None:
        images = [self._resize_image(image) for image in images]
        ACTION_DIM_LABELS = ["x", "y", "z", "roll", "pitch", "yaw", "grasp"]

        img_strip = np.concatenate(np.array(images[::3]), axis=1)

        # set up plt figure
        figure_layout = [["image"] * len(ACTION_DIM_LABELS), ACTION_DIM_LABELS]
        plt.rcParams.update({"font.size": 12})
        fig, axs = plt.subplot_mosaic(figure_layout)
        fig.set_size_inches([45, 10])

        # plot actions
        pred_actions = np.array(
            [
                np.concatenate([a["world_vector"], a["rotation_delta"], a["open_gripper"]], axis=-1)
                for a in predicted_raw_actions
            ]
        )
        for action_dim, action_label in enumerate(ACTION_DIM_LABELS):
            # actions have batch, horizon, dim, in this example we just take the first action for simplicity
            axs[action_label].plot(pred_actions[:, action_dim], label="predicted action")
            axs[action_label].set_title(action_label)
            axs[action_label].set_xlabel("Time in one episode")

        axs["image"].imshow(img_strip)
        axs["image"].set_xlabel("Time in one episode (subsampled)")
        plt.legend()
        plt.savefig(save_path)

xuanlinli17 · 2024-07-20T18:06:10Z

OpenVLA setup requirements:

pip install torch==2.3.1 torchvision==0.18.1 timm==0.9.10 tokenizers==0.15.2 accelerate==0.32.1
pip install flash-attn==2.6.1 --no-build-isolation

Please add these instructions to Readme and add an "OpenVLA Inference Setup" section

xuanlinli17 · 2024-07-21T16:46:56Z

Typos in scripts/openvla_put_in_drawer_variant_agg.sh and scripts/openvla_put_in_drawer_visual_matching.sh: should be "declare -a ckpt_paths=("openvla/openvla-7b")"

erjiaxiao · 2024-07-22T07:12:43Z

I checked with the authors and I don't think there is action ensembling or action history. Here is the updated code, which you can try:

from typing import Optional, Sequence
import os
import matplotlib.pyplot as plt
import numpy as np
from transforms3d.euler import euler2axangle
from transformers import AutoModelForVision2Seq, AutoProcessor
from PIL import Image
import torch
import cv2 as cv


class OpenVLAInference:
    def __init__(
        self,
        saved_model_path: str = "openvla/openvla-7b",
        unnorm_key: Optional[str] = None,
        policy_setup: str = "widowx_bridge",
        horizon: int = 1,
        pred_action_horizon: int = 1,
        exec_horizon: int = 1,
        image_size: list[int] = [224, 224],
        action_scale: float = 1.0,
    ) -> None:
        os.environ["TOKENIZERS_PARALLELISM"] = "false"
        if policy_setup == "widowx_bridge":
            unnorm_key = "bridge_orig" if unnorm_key is None else unnorm_key
            self.sticky_gripper_num_repeat = 1
        elif policy_setup == "google_robot":
            unnorm_key = "fractal20220817_data" if unnorm_key is None else unnorm_key
            self.sticky_gripper_num_repeat = 15
        else:
            raise NotImplementedError(
                f"Policy setup {policy_setup} not supported for octo models. The other datasets can be found in the huggingface config.json file."
            )
        self.policy_setup = policy_setup
        self.unnorm_key = unnorm_key

        print(f"*** policy_setup: {policy_setup}, unnorm_key: {unnorm_key} ***")
        self.processor = AutoProcessor.from_pretrained(saved_model_path, trust_remote_code=True)
        self.vla = AutoModelForVision2Seq.from_pretrained(
            "openvla/openvla-7b",
            attn_implementation="flash_attention_2",  # [Optional] Requires `flash_attn`
            torch_dtype=torch.bfloat16,
            low_cpu_mem_usage=True,
            trust_remote_code=True,
        ).cuda()

        self.image_size = image_size
        self.action_scale = action_scale
        self.horizon = horizon
        self.pred_action_horizon = pred_action_horizon
        self.exec_horizon = exec_horizon

        self.sticky_action_is_on = False
        self.gripper_action_repeat = 0
        self.sticky_gripper_action = 0.0
        self.previous_gripper_action = None

        self.task = None
        self.task_description = None
        self.num_image_history = 0

    def reset(self, task_description: str) -> None:
        self.task_description = task_description
        self.num_image_history = 0

        self.sticky_action_is_on = False
        self.gripper_action_repeat = 0
        self.sticky_gripper_action = 0.0
        self.previous_gripper_action = None

    def step(
        self, image: np.ndarray, task_description: Optional[str] = None, *args, **kwargs
    ) -> tuple[dict[str, np.ndarray], dict[str, np.ndarray]]:
        """
        Input:
            image: np.ndarray of shape (H, W, 3), uint8
            task_description: Optional[str], task description; if different from previous task description, policy state is reset
        Output:
            raw_action: dict; raw policy action output
            action: dict; processed action to be sent to the maniskill2 environment, with the following keys:
                - 'world_vector': np.ndarray of shape (3,), xyz translation of robot end-effector
                - 'rot_axangle': np.ndarray of shape (3,), axis-angle representation of end-effector rotation
                - 'gripper': np.ndarray of shape (1,), gripper action
                - 'terminate_episode': np.ndarray of shape (1,), 1 if episode should be terminated, 0 otherwise
        """
        if task_description is not None:
            if task_description != self.task_description:
                self.reset(task_description)

        assert image.dtype == np.uint8
        image = self._resize_image(image)

        image: Image.Image = Image.fromarray(image)
        prompt = task_description

        # predict action (7-dof; un-normalize for bridgev2)
        inputs = self.processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
        raw_actions = self.vla.predict_action(**inputs, unnorm_key=self.unnorm_key, do_sample=False)[None]
        # print(f"*** raw actions {raw_actions} ***")

        raw_action = {
            "world_vector": np.array(raw_actions[0, :3]),
            "rotation_delta": np.array(raw_actions[0, 3:6]),
            "open_gripper": np.array(raw_actions[0, 6:7]),  # range [0, 1]; 1 = open; 0 = close
        }

        # process raw_action to obtain the action to be sent to the maniskill2 environment
        action = {}
        action["world_vector"] = raw_action["world_vector"] * self.action_scale
        action_rotation_delta = np.asarray(raw_action["rotation_delta"], dtype=np.float64)
        roll, pitch, yaw = action_rotation_delta
        action_rotation_ax, action_rotation_angle = euler2axangle(roll, pitch, yaw)
        action_rotation_axangle = action_rotation_ax * action_rotation_angle
        action["rot_axangle"] = action_rotation_axangle * self.action_scale

        if self.policy_setup == "google_robot":
            current_gripper_action = raw_action["open_gripper"]
            if self.previous_gripper_action is None:
                relative_gripper_action = np.array([0])
            else:
                relative_gripper_action = self.previous_gripper_action - current_gripper_action
            self.previous_gripper_action = current_gripper_action

            if np.abs(relative_gripper_action) > 0.5 and (not self.sticky_action_is_on):
                self.sticky_action_is_on = True
                self.sticky_gripper_action = relative_gripper_action

            if self.sticky_action_is_on:
                self.gripper_action_repeat += 1
                relative_gripper_action = self.sticky_gripper_action

            if self.gripper_action_repeat == self.sticky_gripper_num_repeat:
                self.sticky_action_is_on = False
                self.gripper_action_repeat = 0
                self.sticky_gripper_action = 0.0

            action["gripper"] = relative_gripper_action

        elif self.policy_setup == "widowx_bridge":
            action["gripper"] = 2.0 * (raw_action["open_gripper"] > 0.5) - 1.0

        action["terminate_episode"] = np.array([0.0])

        return raw_action, action

    def _resize_image(self, image: np.ndarray) -> np.ndarray:
        image = cv.resize(image, tuple(self.image_size), interpolation=cv.INTER_AREA)
        return image

    def visualize_epoch(
        self, predicted_raw_actions: Sequence[np.ndarray], images: Sequence[np.ndarray], save_path: str
    ) -> None:
        images = [self._resize_image(image) for image in images]
        ACTION_DIM_LABELS = ["x", "y", "z", "roll", "pitch", "yaw", "grasp"]

        img_strip = np.concatenate(np.array(images[::3]), axis=1)

        # set up plt figure
        figure_layout = [["image"] * len(ACTION_DIM_LABELS), ACTION_DIM_LABELS]
        plt.rcParams.update({"font.size": 12})
        fig, axs = plt.subplot_mosaic(figure_layout)
        fig.set_size_inches([45, 10])

        # plot actions
        pred_actions = np.array(
            [
                np.concatenate([a["world_vector"], a["rotation_delta"], a["open_gripper"]], axis=-1)
                for a in predicted_raw_actions
            ]
        )
        for action_dim, action_label in enumerate(ACTION_DIM_LABELS):
            # actions have batch, horizon, dim, in this example we just take the first action for simplicity
            axs[action_label].plot(pred_actions[:, action_dim], label="predicted action")
            axs[action_label].set_title(action_label)
            axs[action_label].set_xlabel("Time in one episode")

        axs["image"].imshow(img_strip)
        axs["image"].set_xlabel("Time in one episode (subsampled)")
        plt.legend()
        plt.savefig(save_path)

Hello @xuanlinli17, I tried the code above but openvla still fails on widowx tasks. Is it possibly an implementation problem? Should I set any params for widowx?

xuanlinli17 · 2024-07-22T19:50:57Z

Yeah that's my finding too, but I don't think the authors did some special treatments to evaluate OpenVLA on Bridge.

There might be some coordinate transforms on Bridge which is different.

erjiaxiao · 2024-07-23T00:41:31Z

@xuanlinli17 Thank you! I will continue to look into this. If I find any additional information or solutions, I'll make sure to share it with you.

RZFan525 · 2024-07-28T05:00:12Z

When I run the scripts bash scripts/openvla_bridge.sh in ./SimplerEnv-OpenVLA, an error appears:

from simpler_env.policies.openvla.openvla_model import OpenVLAInference

ModuleNotFoundError: No module named 'simpler_env.policies.openvla'

How to fix it?

xuanlinli17 · 2024-07-28T16:51:04Z

^ Run pip install -e . in the SimplerEnv-OpenVLA repo. Also I don't think the authors have updated the scripts yet. I'll just add try to push some updates.

hilookas · 2024-07-29T08:11:27Z

I have made a run, here is the full result of vla:

I use my branch as codebase: https://github.com/hilookas/SimplerEnv

(note: the "real" result is set to 0)

(openvla) ubuntu@soda:~/SimplerEnv$ python tools/calc_metrics_evaluation_videos.py --task=pick_coke_can
***Pick coke can results***
--------------------
horizontal sim variant avg success {'openvla': 0.711111111111111}
horizontal real success {'openvla': 0}
horizontal MMRV 0.0
horizontal pearson correlation 1
vertical sim variant avg success {'openvla': 0.27111111111111114}
vertical real success {'openvla': 0}
vertical MMRV 0.0
vertical pearson correlation 1
standing sim variant avg success {'openvla': 0.6533333333333333}
standing real success {'openvla': 0}
standing MMRV 0.0
standing pearson correlation 1
avg_orientation_sim_variant_results [0.5451851851851851]
avg_orientation_real_results [0.0]
mean_maximum_rank_violation(avg_orientation_sim_variant_results, avg_orientation_real_results) 0.0
pearson_correlation(avg_orientation_sim_variant_results, avg_orientation_real_results) 1
--------------------
Orientation horizontal, ckpt openvla all robot arm visual matching success: [0.28, 0.2, 0.32, 0.28]
Orientation vertical, ckpt openvla all robot arm visual matching success: [0.0, 0.08, 0.04, 0.0]
Orientation standing, ckpt openvla all robot arm visual matching success: [0.2, 0.08, 0.36, 0.12]
horizontal visual matching sim success {'openvla': 0.27}
horizontal real success {'openvla': 0}
horizontal MMRV 0.0
horizontal pearson correlation 1
horizontal kruskal:
       each checkpoint kruskal:
             KruskalResult(statistic=7.976744186046492, pvalue=0.004738208143666835)
vertical visual matching sim success {'openvla': 0.03}
vertical real success {'openvla': 0}
vertical MMRV 0.0
vertical pearson correlation 1
vertical kruskal:
       each checkpoint kruskal:
             KruskalResult(statistic=0.9999999999995735, pvalue=0.3173105078630144)
standing visual matching sim success {'openvla': 0.19}
standing real success {'openvla': 0}
standing MMRV 0.0
standing pearson correlation 1
standing kruskal:
       each checkpoint kruskal:
             KruskalResult(statistic=5.444444444444439, pvalue=0.019630657257290737)
avg_orientation_sim_visual_matching_results [0.16333333333333336]
avg_orientation_real_results [0.0]
mean_maximum_rank_violation(avg_orientation_sim_visual_matching_results, avg_orientation_real_results) 0.0
pearson_correlation(avg_orientation_sim_visual_matching_results, avg_orientation_real_results) 1
avg kruskal:
       each checkpoint kruskal:
             KruskalResult(statistic=12.956521739130327, pvalue=0.0003188089440220541)
********************



(openvla) ubuntu@soda:~/SimplerEnv$ python tools/calc_metrics_evaluation_videos.py --task=move_near
***Move Near results***
--------------------
sim variant avg success {'openvla': 0.4770833333333333}
real success {'openvla': 0}
MMRV 0.0
pearson correlation 1
--------------------
Ckpt openvla all robot arm visual matching success: [0.5333333333333333, 0.43333333333333335, 0.4, 0.48333333333333334]
sim visual matching success {'openvla': 0.4625}
real success {'openvla': 0}
visual matching MMRV 0.0
visual matching pearson correlation 1
avg kruskal:
       each checkpoint kruskal:
             KruskalResult(statistic=36.21739130434784, pvalue=1.7648851059881935e-09)
********************



(openvla) ubuntu@soda:~/SimplerEnv$ python tools/calc_metrics_evaluation_videos.py --task=drawer
***Drawer results***
--------------------
open sim variant avg success {'openvla': 0.15873015873015872}
open real success {'openvla': 0}
open MMRV 0.0
open pearson correlation 1
close sim variant avg success {'openvla': 0.19576719576719576}
close real success {'openvla': 0}
close MMRV 0.0
close pearson correlation 1
avg_sim_variant_results [0.17724867724867724]
avg_real_results [0.0]
mean_maximum_rank_violation(avg_sim_variant_results, avg_real_results) 0.0
pearson_correlation(avg_sim_variant_results, avg_real_results) 1
--------------------
Drawer task open, ckpt openvla all robot arm visual matching success: [0.2222222222222222, 0.2222222222222222, 0.1388888888888889, 0.2222222222222222]
Drawer task close, ckpt openvla all robot arm visual matching success: [0.5555555555555556, 0.5, 0.5, 0.5185185185185186]
open visual matching sim success {'openvla': 0.19444444444444442}
open real success {'openvla': 0}
open MMRV 0.0
open pearson correlation 1
open kruskal:
       each checkpoint kruskal:
             KruskalResult(statistic=5.408163265306165, pvalue=0.02004279392017093)
close visual matching sim success {'openvla': 0.5185185185185185}
close real success {'openvla': 0}
close MMRV 0.0
close pearson correlation 1
close kruskal:
       each checkpoint kruskal:
             KruskalResult(statistic=18.549999999999994, pvalue=1.655052286653112e-05)
avg_sim_visual_matching_results [0.35648148148148145]
avg_real_results [0.0]
mean_maximum_rank_violation(avg_sim_visual_matching_results, avg_real_results) 0.0
pearson_correlation(avg_sim_visual_matching_results, avg_real_results) 1
avg kruskal:
       each checkpoint kruskal:
             KruskalResult(statistic=22.842696629213542, pvalue=1.7581614124344239e-06)
********************



(openvla) ubuntu@soda:~/SimplerEnv$ python tools/calc_metrics_evaluation_videos.py --task=bridge_put_on
***Bridge Put On Env results***
********** Results for put_spoon_on_tablecloth **********
sim visual matching partial success {'openvla': 0.041666666666666664}
real partial success {'openvla': 0}
visual matching MMRV (partial success) 0.0
visual matching pearson correlation (partial success)  1
avg kruskal (partial success):
       each checkpoint kruskal:
             KruskalResult(statistic=0.9999999999997733, pvalue=0.3173105078629661)
sim visual matching success {'openvla': 0.0}
real success {'openvla': 0}
visual matching MMRV 0.0
visual matching pearson correlation 1
avg kruskal:
       each checkpoint kruskal:
             all same, 1.0
********************



********** Results for put_carrot_on_plate **********
sim visual matching partial success {'openvla': 0.3333333333333333}
real partial success {'openvla': 0}
visual matching MMRV (partial success) 0.0
visual matching pearson correlation (partial success)  1
avg kruskal (partial success):
       each checkpoint kruskal:
             KruskalResult(statistic=9.399999999999977, pvalue=0.002169854427130147)
sim visual matching success {'openvla': 0.0}
real success {'openvla': 0}
visual matching MMRV 0.0
visual matching pearson correlation 1
avg kruskal:
       each checkpoint kruskal:
             all same, 1.0
********************



********** Results for stack_green_block_on_yellow_block **********
sim visual matching partial success {'openvla': 0.125}
real partial success {'openvla': 0}
visual matching MMRV (partial success) 0.0
visual matching pearson correlation (partial success)  1
avg kruskal (partial success):
       each checkpoint kruskal:
             KruskalResult(statistic=3.133333333333267, pvalue=0.07670675171090896)
sim visual matching success {'openvla': 0.0}
real success {'openvla': 0}
visual matching MMRV 0.0
visual matching pearson correlation 1
avg kruskal:
       each checkpoint kruskal:
             all same, 1.0
********************



********** Results for put_eggplant_in_basket **********
sim visual matching partial success {'openvla': 0.08333333333333333}
real partial success {'openvla': 0}
visual matching MMRV (partial success) 0.0
visual matching pearson correlation (partial success)  1
avg kruskal (partial success):
       each checkpoint kruskal:
             KruskalResult(statistic=2.0434782608695747, pvalue=0.15285977016579425)
sim visual matching success {'openvla': 0.041666666666666664}
real success {'openvla': 0}
visual matching MMRV 0.0
visual matching pearson correlation 1
avg kruskal:
       each checkpoint kruskal:
             KruskalResult(statistic=0.9999999999997733, pvalue=0.3173105078629661)
********************

xuanlinli17 · 2024-07-29T17:12:00Z

For Google Robot pick coke can, looks like the variant aggregation eval of OpenVLA is a lot better than visual matching, which is interesting...

horizontal sim variant avg success {'openvla-7b': 0.6444444444444444}
vertical sim variant avg success {'openvla-7b': 0.2177777777777778}
standing sim variant avg success {'openvla-7b': 0.7288888888888888}
avg_orientation_sim_variant_results [ 0.5303703703703704]

Toradus · 2024-08-03T12:05:45Z

@hilookas Thanks for providing the results!
It seems like the open_drawer_variant_agg version is missing, as "horizontal sim variant avg success {'openvla': nan}".
Is it possible for you to also do that evaluation and edit your results list?

QuanyiLi · 2024-08-03T13:06:44Z

@hilookas Thank you for your great work! I want to try out the openVLA in the simulator as well. But I wonder why the performance in the sim is not as good as what is claimed by the paper in the real-world benchmark. It should not be due to the sim-to-real gap, right? Cause SIMPLER is designed to mitigate this gap.

xuanlinli17 · 2024-08-03T15:31:39Z

@QuanyiLi OpenVLA did 5 trials in real for each task (and there's no grid-based evals with >= 50 trials per task for Google Robot like in Simpler). Task settings like the cabinets and backgrounds used in the real world can also be different. We are requesting paired sim-real evaluation from Google following Simpler's protocol (and the same backgrounds, cabinets, etc).

QuanyiLi · 2024-08-03T19:43:11Z

@QuanyiLi OpenVLA did 5 trials in real for each task (and there's no grid-based evals with >= 50 trials per task for Google Robot like in Simpler). Task settings like the cabinets and backgrounds used in the real world can also be different. We are requesting paired sim-real evaluation from Google following Simpler's protocol (and the same backgrounds, cabinets, etc).

Thanks. Look forward to the updated results!

hilookas · 2024-08-04T08:37:35Z

@hilookas Thanks for providing the results! It seems like the open_drawer_variant_agg version is missing, as "horizontal sim variant avg success {'openvla': nan}". Is it possible for you to also do that evaluation and edit your results list?

Sure! I have update the result. Please see log above.

My result is slightly different but not much from xuanlinli17's run in

For Google Robot pick coke can, looks like the variant aggregation eval of OpenVLA is a lot better than visual matching, which is interesting...
horizontal sim variant avg success {'openvla-7b': 0.6444444444444444}
vertical sim variant avg success {'openvla-7b': 0.2177777777777778}
standing sim variant avg success {'openvla-7b': 0.7288888888888888}
avg_orientation_sim_variant_results [ 0.5303703703703704]

hilookas · 2024-08-04T08:41:03Z

Here is updated table:

yxchng · 2024-08-05T14:23:34Z

How much memory is needed to run OpenVLA? I tried 3090 and 40GB A100 but both go out of memory.

QuanyiLi · 2024-08-05T14:29:53Z

@yxchng It takes me 15G vram for 4090, following the official instructions to use bf16.

hilookas · 2024-08-07T03:16:49Z

How much memory is needed to run OpenVLA? I tried 3090 and 40GB A100 but both go out of memory.

1x3090 is enough. Just remember don't open env and inference process more than 1.

SiyuanHuang95 · 2024-08-08T04:44:37Z

Here is updated table:

@hilookas Would like to know where the tables come from? I did not see them in the OpenVLA paper

hilookas · 2024-08-08T07:37:58Z

Here is updated table:

@hilookas Would like to know where the tables come from? I did not see them in the OpenVLA paper

I made it :D

Based on my experiment above.

If you have another run result, please let me know!

SiyuanHuang95 · 2024-08-09T03:41:11Z

Here is updated table:

@hilookas Would like to know where the tables come from? I did not see them in the OpenVLA paper

I made it :D

Based on my experiment above.

If you have another run result, please let me know!

I do not have enough GPU resources attached to a Screen the Sapien simulator requires. :-(, running OpenVLA locally is quite a burden for most consumer-level PC.

simpler_env/main_inference.py

simpler_env/policies/openvla/openvla_model.py

4evertutelary · 2024-08-20T00:55:05Z

Easy to succeed on google robot task BUT always failed on WidowX related task.
(prompt = "In: What action should the robot take to {}?\nOut:")

@xuanlinli17
Would you mind having a try on WidowX?

xuanlinli17 · 2024-08-20T00:58:55Z

Simpler-OpenVLA on WidowX is known to have some strange behaviors that I don't yet know why... Would you investigate too?

cyanzhao42 · 2024-08-24T01:43:54Z

Here is updated table:

Hi, i am wondering what are the experiment setup to get the numbers in the paper - are those number obtained by running - SimplerEnv/scripts, i.e. for "pick_coke_can_visual_matching" it is basically 100 trial (25*4) for each coke can layout? Thanks!

xuanlinli17 · 2024-08-24T01:58:29Z

Here is updated table:

Hi, i am wondering what are the experiment setup to get the numbers in the paper - are those number obtained by running - SimplerEnv/scripts, i.e. for "pick_coke_can_visual_matching" it is basically 100 trial (25*4) for each coke can layout? Thanks!

Yes.

hahans · 2024-08-24T04:13:32Z

How do I solve this problem

xuanlinli17 · 2024-08-24T04:19:49Z

How do I solve this problem

Please run all scripts under the root SimplerEnv repo.

hahans · 2024-08-24T04:24:49Z

How do I solve this problem

Please run all scripts under the root SimplerEnv repo.

thanks.

hahans · 2024-08-24T04:53:34Z

I run the bash ,but nothing happened.
Is there a configuration problem?

xuanlinli17 · 2024-08-24T05:18:16Z

Please follow the troubleshooting section in readme; if issue still persists, please open a new issue & discussion

hahans · 2024-08-24T06:01:23Z

@xuanlinli17 Maybe I didn't make myself clear, what I meant was that I ran the bash file inside the scripts file, but the robot simulation screen didn't appear. What am I supposed to do? Make the picture appear.

eyucherin · 2024-09-08T13:02:21Z

@xuanlinli17 @DelinQu Hi! I am trying to run openVLA on simple Env and I am getting this error when I run pip install flash-attn==2.6.1 --no-build-isolation

Here is the error message.

Collecting flash-attn==2.6.1
  Using cached flash_attn-2.6.1.tar.gz (2.6 MB)
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error

  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [12 lines of output]
      fatal: not a git repository (or any of the parent directories): .git
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/tmp/pip-install-ehx9vciz/flash-attn_3fef33cfddd542b995a983488d613f27/setup.py", line 115, in <module>
          raise RuntimeError(
      RuntimeError: FlashAttention is only supported on CUDA 11.6 and above.  Note: make sure nvcc has a supported version by running nvcc -V.


      torch.__version__  = 2.3.1+cu121


      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

and here is my GPU information

nvidia-smi
Sun Sep  8 22:01:12 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        Off |   00000000:01:00.0  On |                  Off |
| 31%   31C    P8             21W /  450W |     557MiB /  24564MiB |      4%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      1177      G   /usr/lib/xorg/Xorg                            295MiB |
|    0   N/A  N/A      1312      G   /usr/bin/gnome-shell                           86MiB |
|    0   N/A  N/A      4067      G   ...seed-version=20240906-130113.352000         73MiB |
|    0   N/A  N/A      8205      G   ...erProcess --variations-seed-version         21MiB |
|    0   N/A  N/A     20334    C+G   warp-terminal                                  31MiB |
+-----------------------------------------------------------------------------------------+

any insight on this would be greatly appreciated!

xuanlinli17 · 2024-09-10T14:35:46Z

^ I think you need a local cuda version >= 11.6, probably match your torch version.

RZFan525 · 2024-10-07T03:16:44Z

When I run openvla_bridge.sh, I encounter an error:
scripts/openvla_bridge.sh: line 18: 92071 Floating point exception(core dumped)
I debug and find that the problem is in def step of class OpenVLAInference:
raw_actions = self.vla.predict_action(**inputs, unnorm_key=self.unnorm_key, do_sample=False)[None]
I don't know why. But It works when I run a simple script to use OpenVLA model:

# Install minimal dependencies (`torch`, `transformers`, `timm`, `tokenizers`, ...)
# > pip install -r https://raw.githubusercontent.com/openvla/openvla/main/requirements-min.txt
from transformers import AutoModelForVision2Seq, AutoProcessor
from PIL import Image

import torch

# Load Processor & VLA
processor = AutoProcessor.from_pretrained("/home/rzfan/ckpts/openvla/openvla-7b", trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    "/home/rzfan/ckpts/openvla/openvla-7b",
    attn_implementation="flash_attention_2",  # [Optional] Requires `flash_attn`
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True
).to("cuda:0")

# Grab image input & format prompt
image: Image.Image = Image.open("datasets/traj0/images0/im_0.jpg")
INSTRUCTION = "Put the cucumber next to the banana"
prompt = f"In: What action should the robot take to {INSTRUCTION}?\nOut:"

# Predict Action (7-DoF; un-normalize for BridgeData V2)
inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
print(inputs)
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
print(action)

My computer is GTX 4090 and Ubuntu.

DelinQu force-pushed the main branch from 02ce175 to e95842c Compare July 6, 2024 13:50

ZSL98 mentioned this pull request Jul 9, 2024

Simulation for validation openvla/openvla#7

Closed

DelinQu and others added 6 commits August 17, 2024 23:53

openvla intergration

a8e37cb

intergrate openvla policy and PR

7bed3d4

openvla policy intergration pull request

f0674e7

update openvla inference scripts

2905379

update readme

52727f9

Add OpenVLA metrics

37d8fa8

xuanlinli17 force-pushed the main branch 2 times, most recently from 95cd66d to be0543f Compare August 18, 2024 06:57

update readme

3667e65

xuanlinli17 force-pushed the main branch from be0543f to 3667e65 Compare August 18, 2024 06:59

add openvla simple inference

2b0ecc8

xuanlinli17 force-pushed the main branch from 12e3747 to 2b0ecc8 Compare August 18, 2024 16:33

minor readme modification

2cd7aae

4evertutelary reviewed Aug 19, 2024

View reviewed changes

simpler_env/main_inference.py Outdated Show resolved Hide resolved

4evertutelary reviewed Aug 19, 2024

View reviewed changes

simpler_env/policies/openvla/openvla_model.py Outdated Show resolved Hide resolved

xuanlinli17 self-requested a review August 19, 2024 02:48

xuanlinli17 approved these changes Aug 19, 2024

View reviewed changes

openvla policy intergration #10

Are you sure you want to change the base?

openvla policy intergration #10

Conversation

DelinQu commented Jul 5, 2024

xuanlinli17 commented Jul 5, 2024

DelinQu commented Jul 6, 2024

xuanlinli17 commented Jul 6, 2024 • edited Loading

xuanlinli17 commented Jul 6, 2024 • edited Loading

HuFY-dev commented Jul 7, 2024

sombitd commented Jul 8, 2024

xuanlinli17 commented Jul 20, 2024

xuanlinli17 commented Jul 20, 2024

xuanlinli17 commented Jul 21, 2024

erjiaxiao commented Jul 22, 2024

xuanlinli17 commented Jul 22, 2024 • edited Loading

erjiaxiao commented Jul 23, 2024

RZFan525 commented Jul 28, 2024

xuanlinli17 commented Jul 28, 2024

hilookas commented Jul 29, 2024 • edited Loading

xuanlinli17 commented Jul 29, 2024 • edited Loading

Toradus commented Aug 3, 2024

QuanyiLi commented Aug 3, 2024

xuanlinli17 commented Aug 3, 2024 • edited Loading

QuanyiLi commented Aug 3, 2024

hilookas commented Aug 4, 2024 • edited Loading

hilookas commented Aug 4, 2024

yxchng commented Aug 5, 2024

QuanyiLi commented Aug 5, 2024

hilookas commented Aug 7, 2024 • edited Loading

SiyuanHuang95 commented Aug 8, 2024

hilookas commented Aug 8, 2024 • edited Loading

SiyuanHuang95 commented Aug 9, 2024

4evertutelary commented Aug 20, 2024

xuanlinli17 commented Aug 20, 2024

cyanzhao42 commented Aug 24, 2024 • edited Loading

xuanlinli17 commented Aug 24, 2024

hahans commented Aug 24, 2024

xuanlinli17 commented Aug 24, 2024

hahans commented Aug 24, 2024

hahans commented Aug 24, 2024

xuanlinli17 commented Aug 24, 2024 • edited Loading

hahans commented Aug 24, 2024

eyucherin commented Sep 8, 2024 • edited Loading

xuanlinli17 commented Sep 10, 2024

RZFan525 commented Oct 7, 2024

xuanlinli17 commented Jul 6, 2024 •

edited

Loading

xuanlinli17 commented Jul 6, 2024 •

edited

Loading

xuanlinli17 commented Jul 22, 2024 •

edited

Loading

hilookas commented Jul 29, 2024 •

edited

Loading

xuanlinli17 commented Jul 29, 2024 •

edited

Loading

xuanlinli17 commented Aug 3, 2024 •

edited

Loading

hilookas commented Aug 4, 2024 •

edited

Loading

hilookas commented Aug 7, 2024 •

edited

Loading

hilookas commented Aug 8, 2024 •

edited

Loading

cyanzhao42 commented Aug 24, 2024 •

edited

Loading

xuanlinli17 commented Aug 24, 2024 •

edited

Loading

eyucherin commented Sep 8, 2024 •

edited

Loading