This project is my implementation of a Generalist Robotics Policy (GRP) built on a Vision Transformer (ViT) architecture. The model processes multiple input modalities (observation images, text goals, and goal images) and produces continuous action outputs for robotic control. It draws inspiration from established approaches but is written from the ground up to deepen my understanding of GRP architectures; in short, it is a minimal version of Octo.
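As a rough illustration (not the repository's actual code), a minimal PyTorch sketch of this kind of multi-modal ViT policy might look like the following. All module names, token lengths, and sizes here are illustrative assumptions: the observation image, goal image, and tokenized text goal are embedded into one token sequence, passed through a transformer encoder, and a readout token is decoded into a continuous action.

```python
import torch
import torch.nn as nn

class MiniGRP(nn.Module):
    """Illustrative sketch of a ViT-style generalist robot policy.

    Observation image, goal image, and tokenized text goal are embedded into a
    shared token sequence, processed by a transformer encoder, and a learned
    readout token is mapped to a continuous action. Sizes are placeholders.
    """

    def __init__(self, image_size=64, patch_size=8, vocab_size=256,
                 text_len=16, embed_dim=128, depth=4, heads=4, action_dim=7):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        patch_dim = 3 * patch_size * patch_size
        self.patch_size = patch_size

        # Linear patch embedding shared by the observation and goal images.
        self.patch_embed = nn.Linear(patch_dim, embed_dim)
        # Token embedding for the (already tokenized) text goal.
        self.text_embed = nn.Embedding(vocab_size, embed_dim)
        # Positional embeddings over: obs patches + goal patches + text + readout.
        seq_len = 2 * num_patches + text_len + 1
        self.pos_embed = nn.Parameter(torch.zeros(1, seq_len, embed_dim))
        self.readout = nn.Parameter(torch.zeros(1, 1, embed_dim))

        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # Continuous action head applied to the readout token.
        self.action_head = nn.Linear(embed_dim, action_dim)

    def patchify(self, images):
        # (B, 3, H, W) -> (B, num_patches, 3 * p * p)
        b, c, h, w = images.shape
        p = self.patch_size
        x = images.unfold(2, p, p).unfold(3, p, p)   # (B, C, H/p, W/p, p, p)
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * p * p)
        return x

    def forward(self, obs_image, goal_image, text_tokens):
        obs = self.patch_embed(self.patchify(obs_image))
        goal = self.patch_embed(self.patchify(goal_image))
        text = self.text_embed(text_tokens)
        readout = self.readout.expand(obs.shape[0], -1, -1)
        tokens = torch.cat([obs, goal, text, readout], dim=1) + self.pos_embed
        encoded = self.encoder(tokens)
        # Decode the final (readout) token into a continuous action vector.
        return self.action_head(encoded[:, -1])


if __name__ == "__main__":
    model = MiniGRP()
    obs = torch.randn(2, 3, 64, 64)
    goal = torch.randn(2, 3, 64, 64)
    text = torch.randint(0, 256, (2, 16))
    print(model(obs, goal, text).shape)  # torch.Size([2, 7])
```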
- Clone the repository
- Create and activate a conda environment:
conda create -n grp python=3.10
conda activate grp
- Install PyTorch with CUDA support:
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
- Install additional dependencies:
pip install torch==2.4.0
pip install hydra-submitit-launcher --upgrade
pip install decorator==4.4.2 moviepy==1.0.3
- Install the required project dependencies:
pip install -r requirements.txt
- Run the main script:
python src/main.py
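- (Optional) Sanity-check the install before training; this assumes a CUDA-capable GPU is available:
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"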
- Complete debugging of the main training loop
- Create evaluation in a simulation environment
- Debug evaluation in a simulation environment
- Add visualization tools
- Incorporate Diffusion Models
- Use a text tokenization scheme such as BPE (see the sketch after this list)
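For the BPE item above, one possible direction (a hypothetical sketch, not the project's current approach) is to swap the text encoding for a pretrained BPE tokenizer, e.g. the GPT-2 tokenizer from the Hugging Face `transformers` library, which is an assumed extra dependency. The resulting token IDs would then feed the policy's text embedding layer.

```python
# Hypothetical sketch: tokenizing text goals with a pretrained BPE tokenizer.
# Assumes the Hugging Face `transformers` package, which the project does not
# currently install.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
# GPT-2 has no pad token by default; reuse the end-of-text token for padding.
tokenizer.pad_token = tokenizer.eos_token

goal_text = "pick up the red block and place it in the bowl"
# Pad/truncate to a fixed length so text goals batch cleanly with image tokens.
encoded = tokenizer(
    goal_text,
    padding="max_length",
    truncation=True,
    max_length=16,
    return_tensors="pt",
)
print(encoded.input_ids.shape)  # torch.Size([1, 16])
# These token IDs would replace the current inputs to the text embedding layer.
```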