# Qwen-VL

Paper link: [Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond (arxiv.org)](https://arxiv.org/abs/2308.12966 "Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond (arxiv.org)")

## 0. Abstract

- The Qwen-VL series sets new records on a range of vision-centric tasks, including image captioning, question answering, and visual grounding.
- The models support multilingual dialogue, multi-image input, text reading, and grounding.

## 1. Introduction

- Large language models (LLMs) are powerful at text generation and understanding, but they cannot process other common modalities such as images, speech, and video.
- To address this, researchers have developed large vision-language models (LVLMs) that extend LLMs with the ability to perceive and understand visual signals.

## 2. Model Architecture

The overall network of Qwen-VL consists of three components: a **large language model (LLM)**, a **visual encoder**, and a **position-aware vision-language adapter**.

### 2.1 Large Language Model (LLM)

- Qwen-VL adopts a large pretrained language model as its foundation component.
- The model is initialized with the pretrained weights of the Qwen-7B language model.

### 2.2 Visual Encoder

- The visual encoder of Qwen-VL uses the Vision Transformer (ViT) architecture.
- It is initialized with the pretrained weights of OpenCLIP's ViT-bigG.
- During both training and inference, **the input image is resized to a fixed resolution and then split into patches**, producing a sequence of image features.

### 2.3 Position-aware Vision-Language Adapter

- To address the efficiency problem caused by long image feature sequences, Qwen-VL introduces a vision-language adapter that compresses the image features with a cross-attention mechanism (a sketch follows this list).
- The adapter consists of a single cross-attention module that uses a set of trainable vectors (embeddings) as queries and the image features produced by the visual encoder as keys.
- Through this mechanism, the visual feature sequence is compressed to a fixed length of 256; the compressed sequence is then fed into the large language model.
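
The paper does not release this adapter as a standalone module, so the following is only a minimal PyTorch sketch of a single cross-attention block with 256 learnable queries; the hidden sizes (1664 for ViT-bigG, 4096 for Qwen-7B) and the omitted 2D positional encodings are assumptions.

```python
import torch
import torch.nn as nn

class VisionLanguageAdapter(nn.Module):
    """Compress a variable-length image feature sequence into a fixed number
    of query tokens with a single cross-attention layer (illustrative sketch)."""

    def __init__(self, vit_dim=1664, llm_dim=4096, num_queries=256, num_heads=16):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, llm_dim))  # 256 learnable queries
        self.proj = nn.Linear(vit_dim, llm_dim)                         # map ViT features to LLM width
        self.cross_attn = nn.MultiheadAttention(llm_dim, num_heads, batch_first=True)

    def forward(self, image_feats):                  # image_feats: (B, num_patches, vit_dim)
        kv = self.proj(image_feats)                  # keys/values come from the image features
        q = self.queries.unsqueeze(0).expand(image_feats.size(0), -1, -1)
        out, _ = self.cross_attn(q, kv, kv)          # (B, 256, llm_dim), fed to the LLM
        return out
```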

### 2.4 Input/Output Interface

- Image input: images are processed by the visual encoder and the adapter, producing a fixed-length sequence of image features.
- Text input: to distinguish it from the image feature sequence, special tokens (`<img>` and `</img>`) mark the start and end of image content (an illustrative snippet follows this list).
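
Purely for illustration (the file name and question below are invented), an input that interleaves text with one image might be laid out as follows; inside the model, the span between the two markers is replaced by the 256 compressed features from the adapter.

```python
# Hypothetical raw input text; <img>...</img> marks where the image content goes.
prompt = (
    "Picture 1: <img>demo/cat.jpg</img>\n"
    "How many cats are in Picture 1?"
)
```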

### 2.5 Model Parameters

- Visual encoder (ViT): about 1.9B parameters.
- Vision-language adapter (VL Adapter): about 0.08B parameters.
- Large language model (LLM): about 7.7B parameters.
- Total: about 9.6B parameters.

## 3. Training Details

The training of Qwen-VL consists of three stages: **two pretraining stages** and a **supervised fine-tuning (instruction tuning) stage**.

![](image/image_OkUkvkxhfu.png)

### 3.1 Pretraining (Stage 1)

- **Dataset**: large-scale, weakly labeled, web-crawled image-text pairs from several publicly accessible sources plus some in-house data.
- **Data cleaning**: the raw dataset contains 5 billion image-text pairs; 1.4 billion remain after cleaning, 77.3% of them English and 22.7% Chinese.
- **Model components**: in this stage **the language model (LLM) is frozen and only the visual encoder (ViT) and the vision-language adapter (VL Adapter) are optimized** (see the sketch after this list).
- **Input image size**: images are resized to 224×224.
- **Training objective**: minimize the cross-entropy of the text tokens.
- **Learning rate**: a peak learning rate of 2e-4.
- **Training steps**: 50,000 steps with a batch size of 30,720, covering roughly 1.5 billion image-text samples.
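
A minimal sketch of the stage-1 freezing scheme, with tiny placeholder modules standing in for the real ViT, adapter, and LLM (this is not the released training code):

```python
import torch
import torch.nn as nn

# Placeholders for the real sub-modules.
visual_encoder, vl_adapter, llm = nn.Linear(8, 8), nn.Linear(8, 8), nn.Linear(8, 8)

# Stage 1: freeze the LLM, train only the visual encoder and the adapter.
for p in llm.parameters():
    p.requires_grad = False

trainable = list(visual_encoder.parameters()) + list(vl_adapter.parameters())
optimizer = torch.optim.AdamW(trainable, lr=2e-4)   # peak learning rate from the paper
```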

### 3.2 Multi-task Pretraining (Stage 2)

- **Dataset**: high-quality, fine-grained vision-language annotation data, with a higher input resolution and interleaved image-text data.
- **Tasks**: Qwen-VL is trained on 7 tasks simultaneously, including image captioning, visual question answering (VQA), text generation, text-oriented VQA, visual grounding, reference grounding, and grounded captioning.
- **Input image size**: the visual encoder's input resolution is increased from 224×224 to 448×448 to reduce the information lost to image downsampling.
- **Model components**: **the language model is unfrozen and the whole model is trained.**
- **Training objective**: the same as in the pretraining stage.

### 3.3 Instruction Fine-tuning (Stage 3)

- **Goal**: instruction fine-tuning strengthens the instruction-following and dialogue abilities of the pretrained Qwen-VL model, producing the interactive Qwen-VL-Chat model.
- **Data**: the multimodal instruction-tuning data comes mainly from captioning or dialogue data generated through LLM self-instruction, plus additional dialogue datasets built through manual annotation, model generation, and strategy concatenation.
- **Model components**: in this stage, **the visual encoder is frozen while the language model and the adapter are optimized**.
- **Data volume**: the instruction-tuning data amounts to 350k samples.
- **Training objective**: multimodal and pure-text dialogue data are mixed during training to keep the model's dialogue ability general.

### 3.4 Training Settings

- **Optimizer**: AdamW is used in all training stages, with specific β1, β2, and ε values.
- **Learning-rate schedule**: a cosine decay schedule.
- **Weight decay**: 0.05.
- **Gradient clipping**: gradients are clipped at 1.0.
- **Batch size and gradient accumulation**: adjusted per training stage.
- **Numerical precision**: bfloat16.
- **Model parallelism**: model parallelism is used in stage 2 (a sketch of these settings follows this list).
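
Just to make these settings concrete, here is a minimal PyTorch sketch; the β/ε values, the step count, and the placeholder model are assumptions, since the actual training code is not public.

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR

model = torch.nn.Linear(8, 8)      # placeholder for the full Qwen-VL model
max_steps = 50_000                 # stage-1 step count from the paper

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=2e-4,                        # peak learning rate
    betas=(0.9, 0.98), eps=1e-6,    # assumed AdamW hyperparameters
    weight_decay=0.05,              # weight decay from the paper
)
scheduler = CosineAnnealingLR(optimizer, T_max=max_steps)   # cosine decay

def training_step(loss):
    # Training itself would run under bfloat16 autocast; noted here only as a comment.
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping at 1.0
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```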

## 4. Data Format

### 4.1 Multi-task Training Data Format

The figure below covers all 7 tasks: black text is the prefix sequence, on which no loss is computed, and blue text is the ground-truth target, on which the loss is computed.

![](image/image_sszW6nCzYk.png)

### 4.2 Fine-tuning Data Format

To better handle multi-image dialogue and multiple image inputs, the string `Picture id:` is added before each image, where `id` corresponds to the order of the image in the dialogue input. For the dialogue format, the instruction-tuning dataset is built in the ChatML (OpenAI) format, where each utterance in an interaction is marked with two special tokens (`<im_start>` and `<im_end>`) to facilitate dialogue termination.

![](image/image_o5Q33jgUeD.png)

During training, only the answers and the special tokens (blue in the example) are supervised, not the role names or the question prompts, which keeps the prediction and training distributions consistent. An illustrative sample follows.
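
For concreteness, a hypothetical single-image, single-turn sample in this format might look like the following (the image path, question, and answer are invented; only the assistant's answer and the closing special token would receive loss):

```python
# Hypothetical instruction-tuning sample in the ChatML-style format described above.
sample = (
    "<im_start>user\n"
    "Picture 1: <img>images/demo_001.jpg</img>\n"
    "What color is the car in Picture 1?<im_end>\n"
    "<im_start>assistant\n"
    "The car is red.<im_end>"
)
```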

# Fine-tuning a Multimodal Large Model with LoRA

This article uses LoRA to fine-tune the blip2-opt-2.7b image-captioning (image-to-text) model.

## 1. Dataset and Model Preparation

The dataset is a toy dataset of six football players, with captions that can be used to fine-tune any image-captioning model. Download: [https://huggingface.co/datasets/ybelkada/football-dataset](https://huggingface.co/datasets/ybelkada/football-dataset "https://huggingface.co/datasets/ybelkada/football-dataset")

![](image/image_63YbVoHkHF.png)

The model is a BLIP-2 model trained with OPT-2.7B. It is composed of three sub-models, described in detail below. Download: [https://huggingface.co/Salesforce/blip2-opt-2.7b](https://huggingface.co/Salesforce/blip2-opt-2.7b "https://huggingface.co/Salesforce/blip2-opt-2.7b")

## 2. A Brief Introduction to BLIP-2

**BLIP-2** improves multimodal performance and lowers training cost by leveraging pretrained vision and language models: the pretrained vision model supplies high-quality visual representations, and the pretrained language model supplies strong text generation. As shown in the figure below, it consists of a pretrained **Image Encoder**, a pretrained **Large Language Model**, and a learnable **Q-Former**.

![](image/image_TmufTgOGdk.png)

- **Image Encoder**: extracts visual features from the input image.
- **Large Language Model**: handles text generation.
- **Q-Former**: bridges the gap between the vision and language modalities. It is built from two sub-modules, an **Image Transformer** and a **Text Transformer**, which share the same self-attention layers, as shown in the figure below.
  - The **Image Transformer** extracts visual features by interacting with the image encoder. Its input is a set of learnable queries; these queries interact with each other through self-attention, attend to the frozen image features through cross-attention, and can also interact with the text through the shared self-attention layers.
  - The **Text Transformer** acts as both text encoder and text decoder. Its self-attention layers are shared with the Image Transformer, and depending on the pretraining task, different self-attention masks control how the queries and the text interact.

![](image/image_y7uRZzOPqC.png)

To reduce compute cost and avoid catastrophic forgetting, **BLIP-2 freezes the pretrained image model and language model during pretraining**. However, simply freezing the pretrained parameters makes it hard to align visual and textual features, so BLIP-2 pretrains the Q-Former in two stages to bridge the modality gap: **a representation learning stage and a generative learning stage**.

### (1) Representation Learning Stage

In the representation learning stage, **the Q-Former is connected to the frozen Image Encoder** and trained on image-text pairs. Three pretraining objectives are jointly optimized, each using a different attention-mask strategy between the queries and the text to control how the Image Transformer and the Text Transformer interact.

### (2) Generative Learning Stage

In the generative pretraining stage, **the Q-Former is connected to the frozen LLM to exploit the LLM's language-generation ability**. A fully connected layer linearly projects the output query embeddings to the same dimension as the LLM's text embeddings, and the projected query embeddings are prepended to the input text embeddings. Because the Q-Former has already been pretrained to extract visual representations that carry language information, it effectively serves as an information bottleneck: it passes the most useful information to the LLM while discarding irrelevant visual information, relieving the LLM of the burden of learning vision-language alignment.

![](image/image_hxxuMVrmOt.png)

The model can be used for tasks such as image captioning, visual question answering, and chat-like conversation by feeding the image and the previous dialogue to the model as a prompt.

First, prepare the processor, the model, and an input image.

```python
from PIL import Image
import requests
import torch
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the processor and the 8-bit quantized BLIP-2 (OPT-2.7B) model on GPU 0.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", load_in_8bit=True, device_map={"": 0}, torch_dtype=torch.float16
)

# Example image: two cats on a couch (COCO validation image).
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
```

An image-captioning example:

```python
inputs = processor(images=image, return_tensors="pt").to(device, torch.float16)

generated_ids = model.generate(**inputs)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
print(generated_text)
# two cats laying on a couch
```

A visual question answering (VQA) example:

```python
prompt = "Question: how many cats are there? Answer:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(device=device, dtype=torch.float16)

generated_ids = model.generate(**inputs)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
print(generated_text)
# two
```

## 3. A Brief Introduction to LoRA

The core idea of LoRA is to model the weight update with a low-rank decomposition, so that a large model can be trained indirectly with only a tiny number of additional parameters. A small numeric sketch follows.
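
As a minimal illustration of the idea (not the PEFT implementation), the frozen weight `W` is augmented with a trainable low-rank update `ΔW = B·A`, scaled by `α/r`; the dimensions below are arbitrary examples.

```python
import torch

d, k, r, alpha = 512, 512, 16, 32      # example dimensions, LoRA rank and scale
W = torch.randn(d, k)                  # frozen pretrained weight
A = torch.randn(r, k) * 0.01           # trainable low-rank factor A
B = torch.zeros(d, r)                  # B starts at zero, so the update starts at zero

x = torch.randn(k)
delta_W = (alpha / r) * (B @ A)        # low-rank update (never materialized in practice)
y = (W + delta_W) @ x                  # equivalent to W @ x + (alpha / r) * B @ (A @ x)

# Only A and B are trained: d*r + r*k parameters instead of d*k.
print((d * r + r * k) / (d * k))       # fraction of trainable parameters: 0.0625
```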

## 4. Fine-tuning the Model

Step 1: load the pretrained BLIP-2 model and its processor.

```python
from transformers import AutoModelForVision2Seq, AutoProcessor

# We load our model and processor using `transformers`;
# `pretrain_model_path` points to the local blip2-opt-2.7b checkpoint.
model = AutoModelForVision2Seq.from_pretrained(pretrain_model_path, load_in_8bit=True)
processor = AutoProcessor.from_pretrained(pretrain_model_path)
```

Step 2: create the LoRA configuration and wrap the base Transformer model by calling `get_peft_model`.

```python
from peft import LoraConfig, get_peft_model

# Define the LoraConfig
config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
)

# Get our PEFT model and print the number of trainable parameters
model = get_peft_model(model, config)
model.print_trainable_parameters()
```

Step 3: fine-tune the model. The training loop below assumes a `train_dataloader` built from the football dataset; a minimal sketch of one way to construct it follows.
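
This is only a sketch of how `train_dataloader` might be built, reusing the `processor` loaded earlier; the collate logic and batch size are assumptions rather than a fixed recipe.

```python
import torch
from datasets import load_dataset
from torch.utils.data import DataLoader, Dataset

class ImageCaptioningDataset(Dataset):
    """Turn each (image, text) item into processed pixel values plus its caption."""

    def __init__(self, dataset, processor):
        self.dataset = dataset
        self.processor = processor

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        item = self.dataset[idx]
        encoding = self.processor(images=item["image"], padding="max_length", return_tensors="pt")
        encoding = {k: v.squeeze() for k, v in encoding.items()}  # drop the batch dimension
        encoding["text"] = item["text"]
        return encoding

def collate_fn(batch):
    processed = {}
    for key in batch[0]:
        if key != "text":
            processed[key] = torch.stack([example[key] for example in batch])
        else:
            text_inputs = processor.tokenizer(
                [example["text"] for example in batch], padding=True, return_tensors="pt"
            )
            processed["input_ids"] = text_inputs["input_ids"]
            processed["attention_mask"] = text_inputs["attention_mask"]
    return processed

dataset = load_dataset("ybelkada/football-dataset", split="train")
train_dataloader = DataLoader(
    ImageCaptioningDataset(dataset, processor), batch_size=2, shuffle=True, collate_fn=collate_fn
)
```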

```python
# Set up the optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
device = "cuda" if torch.cuda.is_available() else "cpu"

model.train()
for epoch in range(11):
    print("Epoch:", epoch)
    for idx, batch in enumerate(train_dataloader):
        input_ids = batch.pop("input_ids").to(device)
        pixel_values = batch.pop("pixel_values").to(device, torch.float16)

        # The caption tokens serve as both input and labels (teacher forcing).
        outputs = model(input_ids=input_ids, pixel_values=pixel_values, labels=input_ids)
        loss = outputs.loss
        print("Loss:", loss.item())

        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        if idx % 10 == 0:
            # Generate a caption from the image
            generated_output = model.generate(pixel_values=pixel_values)
            # Decode the generated tokens
            print(processor.batch_decode(generated_output, skip_special_tokens=True))
```

Finally, save the trained adapter weights and configuration.

```python
# `peft_model_id` is the output directory for the LoRA adapter.
model.save_pretrained(peft_model_id)
```

## 5. Model Inference

To keep the article readable, the full code lives in the GitHub llm-action project, in the file blip2_lora_inference.py. Running `CUDA_VISIBLE_DEVICES=0 python blip2_lora_inference.py` performs image-to-text generation.

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForVision2Seq, Blip2Processor
from peft import PeftModel, PeftConfig

# Load the LoRA adapter config and the processor of its base model.
peft_model_id = "/workspace/output/multimodal/blip2"
config = PeftConfig.from_pretrained(peft_model_id)
processor = Blip2Processor.from_pretrained(config.base_model_name_or_path)

# Load the 8-bit base model and attach the trained LoRA adapter.
model = AutoModelForVision2Seq.from_pretrained(config.base_model_name_or_path, load_in_8bit=True, device_map="auto")
model = PeftModel.from_pretrained(model, peft_model_id)

# Local cache of the football dataset; alternatively load it from the Hub:
# dataset = load_dataset("ybelkada/football-dataset", split="train")
train_dataset_path = "/workspace/data/pytorch_data/multimodal/blip2/ybelkada___football-dataset/default-80f5618dafa96df9/0.0.0/0111277fb19b16f696664cde7f0cb90f833dec72db2cc73cfdf87e697f78fe02"
dataset = load_dataset(train_dataset_path, split="train")

item = dataset[0]

device = "cuda" if torch.cuda.is_available() else "cpu"
model.eval()

encoding = processor(images=item["image"], padding="max_length", return_tensors="pt")
# Remove the batch dimension added by the processor.
encoding = {k: v.squeeze() for k, v in encoding.items()}
encoding["text"] = item["text"]

print(encoding.keys())

# Re-batch the single example (stack tensors, tokenize the caption text).
processed_batch = {}
for key in encoding.keys():
    if key != "text":
        processed_batch[key] = torch.stack([example[key] for example in [encoding]])
    else:
        text_inputs = processor.tokenizer(
            [example["text"] for example in [encoding]], padding=True, return_tensors="pt"
        )
        processed_batch["input_ids"] = text_inputs["input_ids"]
        processed_batch["attention_mask"] = text_inputs["attention_mask"]

# Generate a caption from the image alone.
pixel_values = processed_batch.pop("pixel_values").to(device, torch.float16)
print("----------")
generated_output = model.generate(pixel_values=pixel_values)
print(processor.batch_decode(generated_output, skip_special_tokens=True))
```

<!-- _navbar.md -->

* Quick Links
  - [Tiny LLM zh](https://github.com/wdndev/tiny-llm-zh)
  - [Try Tiny LLM](https://www.modelscope.cn/studios/wdndev/tiny_llm_92m_demo/summary)

* [Home](/)
* [01.Sora](/01.Sora/)
  * [1. Sora Technical Principles Explained](/01.Sora/1.sora技术原理解析.md)
  * [2. Transformer Diffusion Papers](/01.Sora/2.transformers_diffusion论文.md)
  * [3. Preparing to Train Sora](/01.Sora/3.训练Sore准备工作.md)
* [02. MLLM Papers](/02.mllm论文/)
  * [0. From Visual Representation to Multimodal Large Models](/02.mllm论文/0.从视觉表征到多模态大模型.md)
  * [1. Qwen VL](/02.mllm论文/1.qwen_vl.md)
* [03. Finetune](/03.finetune/)
  * [1. Fine-tuning a Multimodal Large Model with LoRA](03.finetune/1.基于LoRA微调多模态大模型.md)