diff --git a/multimodal-gpt.md b/multimodal-gpt.md
new file mode 100644
index 0000000..52a0a09
--- /dev/null
+++ b/multimodal-gpt.md
@@ -0,0 +1,40 @@
+# MultiModal-GPT: A Vision and Language Model for Dialogue with Humans
+
+## Overview
+MultiModal-GPT is a vision-and-language model built for multi-turn dialogue with humans. It fine-tunes OpenFlamingo with LoRA adapters on a mix of vision-language and language-only instruction data, so one model can follow instructions about images while holding a coherent conversation. Because the underlying architecture accepts interleaved images and text, a dialogue can reference one or more images across multiple turns rather than being limited to single image-text exchanges.
+
+## Technical Details
+- **Paper:** [MultiModal-GPT: A Vision and Language Model for Dialogue with Humans](https://arxiv.org/abs/2305.04790)
+- **GitHub:** [open-mmlab/Multimodal-GPT](https://github.com/open-mmlab/Multimodal-GPT)
+- **Released:** May 2023
+- **License:** Apache 2.0
+
+### Architecture
+- Base Model: OpenFlamingo (kept frozen; fine-tuned with LoRA adapters)
+- Vision Encoder: CLIP ViT-L/14
+- Language Model: LLaMA-7B
+- Training Data: open vision-language instruction datasets (e.g., LLaVA, MiniGPT-4, A-OKVQA) plus language-only instruction data (e.g., Dolly 15k, GPT-4-generated Alpaca data)
+
+### Key Features
+- Multi-turn dialogue in which earlier images stay in context for later turns
+- Parameter-efficient training: only the LoRA weights are updated
+- A unified instruction template shared by vision-language and language-only data
+- Zero-shot generalization to instructions not seen during fine-tuning
+
+## Implementation Example
+The snippet below is illustrative: it assumes a simplified high-level wrapper, and the exact loading and inference interface is defined in the repository.
+
+```python
+from mmgpt.model import MultiModalGPT
+
+# Initialize the model from pretrained weights
+model = MultiModalGPT.from_pretrained("openmmlab/MultiModal-GPT")
+
+# Single-image dialogue turn
+response = model.chat(
+    text="What do you see in this image?",
+    image_path="example.jpg"
+)
+
+# Multi-image dialogue turn
+response = model.chat(
+    text="Compare these two images",
+    image_paths=["image1.jpg", "image2.jpg"]
+)
+```
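+
+Multi-turn usage will look roughly like the sketch below. The `history` argument and its format are assumptions made for illustration; check the repository for the interface it actually exposes.
+
+```python
+# Hypothetical multi-turn usage: keep the running dialogue and pass it back in.
+# The `history` parameter and its structure are illustrative assumptions only.
+history = []
+
+question = "What breed is the dog in this photo?"
+reply = model.chat(
+    text=question,
+    image_path="dog.jpg",
+    history=history,
+)
+history.append((question, reply))
+
+# A follow-up can refer back to the earlier image without re-attaching it.
+reply = model.chat(
+    text="Would that breed be a good fit for a small apartment?",
+    history=history,
+)
+```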
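+
+Under the hood, the paper trains with a single instruction-style template that interleaves an image placeholder with instruction/response markers, which is what lets the same format cover both vision-language and language-only examples. The header wording and tokens below are an approximation for illustration, not the exact strings from the released code.
+
+```python
+# Approximate reconstruction of the unified instruction template; the exact
+# header wording and special tokens may differ from the released code.
+PROMPT_HEADER = (
+    "Below is an instruction that describes a task. "
+    "Write a response that appropriately completes the request.\n\n"
+)
+
+def build_prompt(turns, with_image=True):
+    """Assemble a multi-turn prompt from (instruction, response) pairs.
+    A response of None marks where the model should continue generating."""
+    prompt = PROMPT_HEADER
+    if with_image:
+        prompt += "### Image:\n<image>\n"
+    for instruction, response in turns:
+        prompt += f"### Instruction:\n{instruction}\n### Response:\n"
+        if response is not None:
+            prompt += f"{response}\n"
+    return prompt
+
+# Example: a two-turn conversation about one image, awaiting the second answer.
+print(build_prompt([
+    ("What do you see in this image?", "A golden retriever playing in the snow."),
+    ("Is it an adult dog or a puppy?", None),
+]))
+```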
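+
+The LoRA fine-tuning mentioned above can be sketched generically with the Hugging Face `peft` library. This is a sketch of the parameter-efficient idea only: the base checkpoint path and the `target_modules` names are placeholders, and the actual repository wires LoRA into OpenFlamingo through its own builder rather than `transformers`.
+
+```python
+# Generic LoRA setup with Hugging Face peft -- illustrates parameter-efficient
+# fine-tuning, not the repository's actual training script.
+from transformers import AutoModelForCausalLM
+from peft import LoraConfig, get_peft_model
+
+# Placeholder checkpoint path; MultiModal-GPT itself builds on OpenFlamingo
+# (CLIP ViT-L/14 + LLaMA-7B) loaded through the project's own builder.
+base_model = AutoModelForCausalLM.from_pretrained("path/to/llama-7b")
+
+lora_config = LoraConfig(
+    r=16,                                  # rank of the low-rank update matrices
+    lora_alpha=32,                         # scaling applied to the update
+    lora_dropout=0.05,
+    target_modules=["q_proj", "v_proj"],   # placeholder attention projections
+    task_type="CAUSAL_LM",
+)
+
+peft_model = get_peft_model(base_model, lora_config)
+peft_model.print_trainable_parameters()  # only the LoRA adapters are trainable
+```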