News | Introduction | Preparation | Training | Demo | Acknowledgement | Statement
- [Sep 21 2024]: 🎉🎉🎉🎉🎉 We are excited to release LHRS-Bot-Nova, a powerful new model with an upgraded base model, enriched training data, and an improved training recipe. Please check out the nova branch here and follow the instructions to start using the model. The model checkpoint is available on huggingface.
- [Jul 15 2024]: We have updated our paper on arXiv.
- [Jul 12 2024]: We have posted the parts missing from our paper (some observations, considerations, and lessons) on Zhihu (in Chinese; please contact us if you need an English version).
- [Jul 09 2024]: We have released our evaluation benchmark LHRS-Bench.
- [Jul 02 2024]: Our paper has been accepted by ECCV 2024! We have open-sourced our training script and training data. Please follow the training instructions below and the data preparation instructions.
- [Feb 21 2024]: We have updated our evaluation code. Any advice is welcome!
- [Feb 7 2024]: Model weights are now available on both Google Drive and Baidu Disk.
- [Feb 6 2024]: Our paper is now available on arXiv.
- [Feb 2 2024]: We are excited to announce the release of our code and model checkpoint! Our dataset and training recipe will be updated soon!
We are excited to introduce LHRS-Bot, a multimodal large language model (MLLM) that leverages globally available volunteered geographic information (VGI) and remote sensing (RS) images. LHRS-Bot demonstrates a deep understanding of RS imagery and is capable of sophisticated reasoning within the RS domain. In this repository, we release our code, training framework, model weights, and dataset!
- Clone this repository.

      git clone git@github.com:NJU-LHRS/LHRS-Bot.git
      cd LHRS-Bot
- Create a new virtual environment.

      conda create -n lhrs python=3.10
      conda activate lhrs
- Install the dependencies and our package.

      pip install -e .
- LLaMA2-7B-Chat
  - Automatically download:
    Our framework is designed to automatically download the checkpoint when you start training or run a demo. However, you need to complete a few preparatory steps first:
    - Request access to the LLaMA2-7B-Chat models from Hugging Face.
    - After your request has been processed, log in to Hugging Face using your personal access token:

          huggingface-cli login

      Then paste your access token and press Enter.
    - Done!
  - Manually download:
    - Download all the files from Hugging Face.
    - Change the following line in each of these files to point to your download directory:
      - `/Config/multi_modal_stage{1, 2, 3}.yaml`

            ...
            text:
              ...
              path: ""  # TODO: Direct to your directory
            ...

      - `/Config/multi_modal_eval.yaml`

            ...
            text:
              ...
              path: ""  # TODO: Direct to your directory
            ...
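The manual edit above can also be scripted. The following is only a minimal sketch and is not part of the repository: `fill_llm_path` and `patch_config` are hypothetical helpers, and the sketch assumes the entry to change appears as an empty `path: ""` followed by a `# TODO` comment, as in the snippets above.

```python
import re
from pathlib import Path

# Assumption: the line to edit looks like `path: ""  # TODO: Direct to your directory`.
_TODO_PATH = re.compile(r'^(?P<indent>[ \t]*)path:[ \t]*""[ \t]*#[ \t]*TODO.*$', re.MULTILINE)


def fill_llm_path(config_text: str, llm_dir: str) -> str:
    """Return config_text with every empty TODO `path` entry set to llm_dir."""
    # A function is used as the replacement so that backslashes in llm_dir
    # are not interpreted as regex escapes.
    return _TODO_PATH.sub(lambda m: f'{m.group("indent")}path: "{llm_dir}"', config_text)


def patch_config(config_file: Path, llm_dir: str) -> None:
    """Rewrite config_file in place (hypothetical helper, for illustration only)."""
    config_file.write_text(fill_llm_path(config_file.read_text()), llm_dir) if False else \
        config_file.write_text(fill_llm_path(config_file.read_text(), llm_dir))
```

For example, calling `patch_config(Path("Config/multi_modal_eval.yaml"), "/path/to/llama2")` once per config file would replace the manual edit; the directory shown is a placeholder. The TODO comment is dropped from the rewritten line, so running the sketch twice leaves the file unchanged.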
- LHRS-Bot Checkpoints:

  | Stage1 | Stage2 | Stage3 |
  | :---: | :---: | :---: |
  | Baidu Disk, Google Drive | Baidu Disk, Google Drive | Baidu Disk, Google Drive |

  ⚠️ Ensure that the `TextLoRA` folder is located in the same directory as `FINAL.pt`. The name `TextLoRA` should remain unchanged. Our framework will automatically detect the version perceiver checkpoint and, if possible, load and merge the LoRA module.

- Development Checkpoint:
  We will continually update our model with advanced techniques. If you're interested, feel free to download it and have fun :)

  | Development |
  | :---: |
  | Baidu Disk, Google Drive |
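The layout requirement for the LHRS-Bot checkpoints (a `TextLoRA` folder next to the `.pt` file) can be checked before launching a demo. This is a minimal sketch under that stated requirement only; `check_checkpoint_layout` is a hypothetical helper, not something the repository provides.

```python
from pathlib import Path
from typing import List


def check_checkpoint_layout(checkpoint: Path) -> List[str]:
    """Return a list of problems; an empty list means the layout looks as described."""
    problems = []
    if checkpoint.suffix != ".pt":
        problems.append(f"{checkpoint.name} does not end with .pt")
    if not checkpoint.is_file():
        problems.append(f"{checkpoint} does not exist")
    # The folder name must stay exactly `TextLoRA` and sit beside the checkpoint.
    lora_dir = checkpoint.parent / "TextLoRA"
    if not lora_dir.is_dir():
        problems.append(f"no TextLoRA folder found in {checkpoint.parent}")
    return problems
```

Printing the returned list for a path such as `Path("checkpoints/FINAL.pt")` (a placeholder location) gives a quick sanity check after downloading.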
- Prepare and reformat your data following the instructions from here.
- Stage1
  - Fill in the `OUTPUT_DIR` and `DATA_DIR` of the script.
  - `cd Script; bash train_stage1.sh`
- Stage2
  - Fill in the `OUTPUT_DIR` and `DATA_DIR` of the script.
  - Fill in the `MODEL_PATH` for loading the Stage1 checkpoint.
  - `cd Script; bash train_stage2.sh`
- Stage3 is the same as Stage2, except that it uses a different folder and script (here).
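Before launching a stage, it may help to confirm that the variables listed above have actually been filled in. The sketch below is an illustration only: it assumes the training scripts set these variables with plain shell assignments such as `OUTPUT_DIR=...`, which has not been verified against the scripts, and `unfilled_variables` and `STAGE_VARIABLES` are hypothetical names.

```python
import re
from typing import Iterable, List

# Variables each stage asks you to fill in, as listed in the steps above.
STAGE_VARIABLES = {
    1: ("OUTPUT_DIR", "DATA_DIR"),
    2: ("OUTPUT_DIR", "DATA_DIR", "MODEL_PATH"),
    3: ("OUTPUT_DIR", "DATA_DIR", "MODEL_PATH"),  # Stage3 mirrors Stage2
}


def unfilled_variables(script_text: str, names: Iterable[str]) -> List[str]:
    """Return the names that never receive a non-empty value in script_text."""
    missing = []
    for name in names:
        # Assumption: assignments look like `NAME=value` or `export NAME="value"`.
        pattern = re.compile(
            rf"^[ \t]*(?:export[ \t]+)?{re.escape(name)}=(.*)$", re.MULTILINE
        )
        values = [
            m.group(1).split("#", 1)[0].strip().strip("\"'")
            for m in pattern.finditer(script_text)
        ]
        if not any(values):
            missing.append(name)
    return missing
```

For instance, `unfilled_variables(Path("Script/train_stage2.sh").read_text(), STAGE_VARIABLES[2])` (with `pathlib.Path` imported) would list anything still left empty before you run `bash train_stage2.sh`.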
- Online Web UI demo with Gradio:

      # -c: config file
      # --checkpoint-path: path to the checkpoint, ending in .pt
      # --server-port, --server-name: change if needed
      # --share: add only if you want to share the demo with others
      python lhrs_webui.py \
          -c Config/multi_modal_eval.yaml \
          --checkpoint-path ${PathToCheckpoint}.pt \
          --server-port 8000 \
          --server-name 127.0.0.1 \
          --share
- Command line demo:

      # -c: config file
      # --model-path: path to the checkpoint, ending in .pt
      # --image-file: path to the image file (only a single image file is supported)
      # --accelerator: change if needed; one of "mps", "cpu", "gpu"
      python cli_qa.py \
          -c Config/multi_modal_eval.yaml \
          --model-path ${PathToCheckpoint}.pt \
          --image-file ${TheImagePathYouWantToChat} \
          --accelerator "gpu" \
          --temperature 0.4 \
          --max-new-tokens 512
- Inference:
  - Classification

        # -c: config file
        # --model-path: path to the checkpoint, ending in .pt
        # --data-path: path to the classification image folder
        # --accelerator: change if needed; one of "mps", "cpu", "gpu"
        # --output: path to the output directory (results, metrics, etc.)
        python main_cls.py \
            -c Config/multi_modal_eval.yaml \
            --model-path ${PathToCheckpoint}.pt \
            --data-path ${ImageFolder} \
            --accelerator "gpu" \
            --workers 4 \
            --enabl-amp True \
            --output ${YourOutputDir} \
            --batch-size 8
  - Visual Grounding

        # -c: config file
        # --model-path: path to the checkpoint, ending in .pt
        # --data-path: path to the image folder
        # --accelerator: change if needed; one of "mps", "cpu", "gpu"
        # --output: path to the output directory (results, metrics, etc.)
        # --batch-size: batch size 1 is recommended, since we find batch inference is not stable
        python main_vg.py \
            -c Config/multi_modal_eval.yaml \
            --model-path ${PathToCheckpoint}.pt \
            --data-path ${ImageFolder} \
            --accelerator "gpu" \
            --workers 2 \
            --enabl-amp True \
            --output ${YourOutputDir} \
            --batch-size 1 \
            --data-target ${ParsedLabelJsonPath}
  - Visual Question Answering

        # -c: config file
        # --model-path: path to the checkpoint, ending in .pt
        # --data-path: path to the image folder
        # --accelerator: change if needed; one of "mps", "cpu", "gpu"
        # --output: path to the output directory (results, metrics, etc.)
        # --batch-size: batch size 1 is recommended, since we find batch inference is not stable
        # --data-type: choose from "HR", "LR"
        python main_vqa.py \
            -c Config/multi_modal_eval.yaml \
            --model-path ${PathToCheckpoint}.pt \
            --data-path ${Image} \
            --accelerator "gpu" \
            --workers 2 \
            --enabl-amp True \
            --output ${YourOutputDir} \
            --batch-size 1 \
            --data-target ${ParsedLabelJsonPath} \
            --data-type "HR"
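The three inference commands above share most of their flags, so they can be assembled programmatically when running several evaluations. The following is a rough sketch, not part of the repository: `build_eval_command` and `TASK_SCRIPTS` are hypothetical names, and the script names and flags are copied from the commands above exactly as written (including the spelling of `--enabl-amp`).

```python
from typing import List, Optional

# Script names as they appear in the inference commands above.
TASK_SCRIPTS = {"cls": "main_cls.py", "vg": "main_vg.py", "vqa": "main_vqa.py"}


def build_eval_command(
    task: str,
    checkpoint: str,
    data_path: str,
    output_dir: str,
    *,
    data_target: Optional[str] = None,  # parsed label JSON (used for VG and VQA above)
    data_type: Optional[str] = None,    # "HR" or "LR" (used for VQA above)
    accelerator: str = "gpu",
    workers: int = 2,
    batch_size: int = 1,
    config: str = "Config/multi_modal_eval.yaml",
) -> List[str]:
    """Build the argument list for one evaluation run (illustrative helper)."""
    if task not in TASK_SCRIPTS:
        raise ValueError(f"unknown task: {task!r}")
    if not checkpoint.endswith(".pt"):
        raise ValueError("the checkpoint path should end with .pt")
    cmd = [
        "python", TASK_SCRIPTS[task],
        "-c", config,
        "--model-path", checkpoint,
        "--data-path", data_path,
        "--accelerator", accelerator,
        "--workers", str(workers),
        "--enabl-amp", "True",
        "--output", output_dir,
        "--batch-size", str(batch_size),
    ]
    if data_target is not None:
        cmd += ["--data-target", data_target]
    if data_type is not None:
        cmd += ["--data-type", data_type]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root. Returning a list rather than a single string avoids shell quoting problems when paths contain spaces.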
- If you find our work useful, please give us a 🌟 on GitHub and consider citing our papers:
@InProceedings{10.1007/978-3-031-72904-1_26,
    author="Muhtar, Dilxat and Li, Zhenshi and Gu, Feng and Zhang, Xueliang and Xiao, Pengfeng",
    title="LHRS-Bot: Empowering Remote Sensing with VGI-Enhanced Large Multimodal Language Model",
    booktitle="Computer Vision -- ECCV 2024",
    year="2025",
    publisher="Springer Nature Switzerland",
    address="Cham",
    pages="440--457",
    isbn="978-3-031-72904-1"
}

@article{li2024lhrs,
    title={LHRS-Bot-Nova: Improved Multimodal Large Language Model for Remote Sensing Vision-Language Interpretation},
    author={Li, Zhenshi and Muhtar, Dilxat and Gu, Feng and Zhang, Xueliang and Xiao, Pengfeng and He, Guangjun and Zhu, Xiaoxiang},
    journal={arXiv preprint arXiv:2411.09301},
    year={2024}
}
- Licence: Apache