2. Qwen2-VL¶
Qwen2-VL is a multimodal model based on vision-language pre-training. It accepts joint image and text input and produces text output.

GitHub repository: https://github.com/QwenLM/Qwen2-VL
This section deploys the Qwen2-VL-2B-Instruct model on a LubanCat board and uses it to describe input images.
2.1. Using Qwen2-VL¶
The 2B and 7B models of the Qwen2-VL series, together with their quantized variants, are available on Hugging Face and ModelScope; see Qwen/Qwen2-VL-2B-Instruct.
The following tests Qwen2-VL with HF Transformers, following the description in the Qwen/Qwen2-VL-2B-Instruct Model Card. Set up the test environment:
# Create a simple test environment
conda create -n qwen2_vl python=3.10
conda activate qwen2_vl
# Install transformers and torch
(qwen2_vl) llh@llh:/xxx$ pip install transformers torch torchvision
# Install the qwen-vl-utils toolkit (optional)
(qwen2_vl) llh@llh:/xxx$ pip install qwen-vl-utils
Pull the model files:
# Install git-lfs
git lfs install
# Fetch the Qwen2-VL-2B-Instruct model files
git clone https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct
# Or fetch from the mirror site (optional)
git clone https://hf-mirror.com/Qwen/Qwen2-VL-2B-Instruct
# Alternatively, fetch the Qwen2-VL-7B-Instruct model files
git clone https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct
# Or fetch from the mirror site (optional)
git clone https://hf-mirror.com/Qwen/Qwen2-VL-7B-Instruct
Test program (the image path and other settings can be adjusted as needed):
from PIL import Image
import requests
import torch
from torchvision import io
from typing import Dict
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor

# Load the model in half-precision on the available device(s)
path = "path/to/Qwen2-VL-2B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    path, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(path)

# Image
#url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
#image = Image.open(requests.get(url, stream=True).raw)
image = Image.open('./data/demo.jpg')

conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preprocess the inputs
text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
# Expected output: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>
# Describe this image.<|im_end|>\n<|im_start|>assistant\n'
inputs = processor(
    text=[text_prompt], images=[image], padding=True, return_tensors="pt"
)
inputs = inputs.to("cuda")

# Inference: Generation of the output
output_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(inputs.input_ids, output_ids)
]
output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
)
print(output_text)
Run the program; it prints a text description of the image:
# Change the model path in the script to the directory where the model files were pulled
(qwen2_vl) llh@llh:/xxx$ python infer.py
Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'mrope_section'}
`Qwen2VLRotaryEmbedding` can now be fully parameterized by passing the model config through the `config` argument. All other arguments will be removed in v4.46
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████| 2/2 [01:16<00:00, 38.45s/it]
['This image depicts an astronaut resting on the lunar surface. The astronaut is wearing a white spacesuit, sitting next to a green cooler, and holding a bottle of green beer.
Earth and the starry sky are visible in the background, conveying the astronaut's solitude and tranquility on the Moon.
The rocks and sand of the lunar surface, as well as the distant Earth, are all clearly visible. The overall scene is filled with a science-fiction, exploratory atmosphere.']
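The optional qwen-vl-utils package installed above is not used by this script. Following the Model Card, it can instead resolve image entries written directly into the conversation. A minimal sketch reusing the model and processor loaded above (the image path and prompt text are placeholders):

from qwen_vl_utils import process_vision_info

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "./data/demo.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Build the prompt and extract the image inputs referenced in the conversation
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)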
2.2. Model Conversion¶
The Qwen2-VL model is a cascade of a ViT vision encoder and the Qwen2 language model. For deployment, the model is split into these two parts. The conversion scripts follow the examples in the rknn-llm project.
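To see this split directly, the two submodules of the Hugging Face checkpoint can be inspected. A small sketch (it assumes a transformers version around 4.45-4.49, where the vision tower is exposed as model.visual; the model path is a placeholder):

from transformers import Qwen2VLForConditionalGeneration

path = "path/to/Qwen2-VL-2B-Instruct"   # placeholder: local model directory
model = Qwen2VLForConditionalGeneration.from_pretrained(path, torch_dtype="auto")

# The ViT vision encoder is exported to ONNX/RKNN; the Qwen2 language model to RKLLM
vit_params = sum(p.numel() for p in model.visual.parameters())
llm_params = sum(p.numel() for p in model.model.parameters())
print(f"vision encoder (ViT): {vit_params / 1e6:.1f} M parameters")
print(f"language model (Qwen2): {llm_params / 1e6:.1f} M parameters")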
2.2.1. Exporting the vision ONNX model¶
# Fetch rknn-llm
git clone https://github.com/airockchip/rknn-llm
# Switch to the example directory
cd rknn-llm/examples/Qwen2-VL-2B_Demo
# Note: change the model path in the script, e.g.
path = 'path/Qwen2-VL-2B-Instruct'
(qwen2_vl) llh@llh:/xxx$ python export/export_vision.py
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'mrope_section'}
`Qwen2VLRotaryEmbedding` can now be fully parameterized by passing the model config through the `config` argument. All other arguments will be removed in v4.46
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████| 2/2 [00:36<00:00, 18.16s/it]
#...... omitted
This generates the qwen2_vl_2b_vision.onnx file in the onnx directory.
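Before converting it, the exported graph can be sanity-checked by listing its inputs and outputs with the onnx package. A short sketch (it assumes onnx is installed in the same environment and is run from the example directory):

import onnx

# Load only the graph structure; the weights are not needed for this check
m = onnx.load("onnx/qwen2_vl_2b_vision.onnx", load_external_data=False)

for t in m.graph.input:
    dims = [d.dim_value or d.dim_param for d in t.type.tensor_type.shape.dim]
    print("input :", t.name, dims)
for t in m.graph.output:
    dims = [d.dim_value or d.dim_param for d in t.type.tensor_type.shape.dim]
    print("output:", t.name, dims)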
2.2.2. Converting to an RKNN model¶
The qwen2_vl_2b_vision.onnx file generated above is converted to an RKNN model with the Toolkit2 tool. For setting up the rknn-Toolkit2 environment, refer to the earlier Toolkit2 chapter.
# The tutorial is tested on LubanCat-4; for LubanCat-3, set target_platform = "rk3576" and adjust the model path model_path
(toolkit2_2.3)llh@llh:/xxx$ python export/export_vision_rknn.py
I rknn-toolkit2 version: 2.3.0
I Loading : 100%|█████████████████████████████████████████████████| 551/551 [00:22<00:00, 24.37it/s]
I OpFusing 0: 100%|██████████████████████████████████████████████| 100/100 [00:00<00:00, 174.66it/s]
I OpFusing 1 : 100%|██████████████████████████████████████████████| 100/100 [00:02<00:00, 34.42it/s]
I OpFusing 0 : 100%|██████████████████████████████████████████████| 100/100 [00:11<00:00, 8.75it/s]
I OpFusing 1 : 100%|██████████████████████████████████████████████| 100/100 [00:12<00:00, 8.27it/s]
I OpFusing 0 : 100%|██████████████████████████████████████████████| 100/100 [00:14<00:00, 6.82it/s]
I OpFusing 1 : 100%|██████████████████████████████████████████████| 100/100 [00:14<00:00, 6.75it/s]
I OpFusing 2 : 100%|██████████████████████████████████████████████| 100/100 [00:14<00:00, 6.68it/s]
I OpFusing 0 : 100%|██████████████████████████████████████████████| 100/100 [00:15<00:00, 6.50it/s]
I OpFusing 1 : 100%|██████████████████████████████████████████████| 100/100 [00:15<00:00, 6.43it/s]
I OpFusing 2 : 100%|██████████████████████████████████████████████| 100/100 [00:31<00:00, 3.13it/s]
I Saving : 100%|██████████████████████████████████████████████████| 295/295 [00:16<00:00, 18.32it/s]
I rknn building ...
I rknn building done.
Since the tutorial targets LubanCat-4, the qwen2_vl_2b_vision_rk3588.rknn file is generated in the rknn directory.
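The conversion performed by export_vision_rknn.py corresponds roughly to the following rknn-toolkit2 calls. This is a simplified sketch, not the exact script (the real export_vision_rknn.py may set additional options such as explicit input shapes):

from rknn.api import RKNN

rknn = RKNN(verbose=False)
rknn.config(target_platform="rk3588")             # use "rk3576" on LubanCat-3
rknn.load_onnx(model="onnx/qwen2_vl_2b_vision.onnx")
rknn.build(do_quantization=False)                 # keep the vision encoder unquantized (FP16)
rknn.export_rknn("rknn/qwen2_vl_2b_vision_rk3588.rknn")
rknn.release()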
2.2.3. Exporting the RKLLM model¶
The rkllm-toolkit tool is used to export the RKLLM model. For setting up the rkllm-toolkit environment, refer to the earlier RKLLM chapter or see Rockchip_RKLLM_SDK_CN_xxx.pdf.
First, preprocess the raw data: the original JSON is converted into the form the model accepts and used as quantization calibration data.
# Change path to the directory where the Qwen2-VL-2B-Instruct model files were pulled earlier
(rkllm_1.1.4) llh@llh:/xxx$ python data/make_input_embeds_for_quantize.py
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'mrope_section'}
`Qwen2VLRotaryEmbedding` can now be fully parameterized by passing the model config through the `config` argument. All other arguments will be removed in v4.46
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████| 2/2 [00:01<00:00, 1.84it/s]
inputs_embeds torch.Size([1, 249, 1536])
#...... omitted
inputs_embeds torch.Size([1, 227, 1536])
inputs_embeds torch.Size([1, 280, 1536])
inputs_embeds torch.Size([1, 300, 1536])
inputs_embeds torch.Size([1, 334, 1536])
inputs_embeds torch.Size([1, 386, 1536])
100%|██████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:08<00:00, 2.46it/s]
Done
Export the RKLLM model; remember to adjust the model path and target platform.
# Modify the model path, and set target_platform='rk3588' and quantized_dtype='w8a8' in llm.build
# The tutorial is tested on LubanCat-4; if using LubanCat-3, change it to:
ret = llm.build(do_quantization=True, optimization_level=1, quantized_dtype='w4a16',
quantized_algorithm='normal', target_platform='rk3576', num_npu_core=2, extra_qparams=qparams, dataset=dataset)
# Run the script to load the model and export the RKLLM model
(rkllm_1.1.4) llh@llh:/xxx$ python export/export_rkllm.py
INFO: rkllm-toolkit version: 1.1.4
Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'mrope_section'}
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'mrope_section'}
`Qwen2VLRotaryEmbedding` can now be fully parameterized by passing the model config through the `config` argument. All other arguments will be removed in v4.46
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████| 2/2 [00:24<00:00, 12.20s/it]
WARNING: rkllm-toolkit only exports the language model of Qwen2VL!
Optimizing model: 100%|███████████████████████████████████████████████████████████████████████████████| 28/28 [01:03<00:00, 2.27s/it]
Building model: 100%|█████████████████████████████████████████████████████████████████████████████████| 399/399 [00:07<00:00, 53.74it/s]
INFO: The token_id of eos is set to 151645
INFO: The token_id of pad is set to 151643
INFO: The token_id of bos is set to 151643
Converting model: 100%|████████████████████████████████████████████████████████████████████| 339/339 [00:00<00:00, 3590578.42it/s]
INFO: Exporting the model, please wait ....
[=================================================>] 597/597 (100%)
INFO: Model has been saved to ./Qwen2-VL-2B-Instruct.rkllm!
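The complete export script follows the usual rkllm-toolkit flow of load, build, export. Below is a condensed sketch in the style of export_rkllm.py (the model path and the quantization dataset file name are assumptions, and the extra_qparams argument shown above is omitted):

from rkllm.api import RKLLM

modelpath = "path/to/Qwen2-VL-2B-Instruct"        # placeholder: local model directory
llm = RKLLM()

ret = llm.load_huggingface(model=modelpath)
assert ret == 0, "load model failed"

# LubanCat-4 (RK3588); dataset points to the quantization data generated above
ret = llm.build(do_quantization=True, optimization_level=1,
                quantized_dtype='w8a8', quantized_algorithm='normal',
                target_platform='rk3588', num_npu_core=3,
                dataset='data/inputs.json')        # assumed file name
assert ret == 0, "build model failed"

ret = llm.export_rkllm("./Qwen2-VL-2B-Instruct.rkllm")
assert ret == 0, "export model failed"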
2.3. Deployment Test¶
Fetch the rknn-llm project files on the board:
# Fetch the test example on the board (the tutorial is tested on LubanCat-4)
git clone https://github.com/airockchip/rknn-llm
# To be added
# Switch to the example directory
cd rknn-llm/examples/Qwen2-VL-2B_Demo/deploy
Compile the test example on the board:
# Native build on the board: change the compiler in build-linux.sh
GCC_COMPILER=aarch64-linux-gnu
cat@lubancat:~/rknn-llm/examples/Qwen2-VL-2B_Demo/deploy$ ./build-linux.sh
-- The C compiler identification is GNU 10.2.1
-- The CXX compiler identification is GNU 10.2.1
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/aarch64-linux-gnu-gcc - skipped
# omitted ..............
[ 70%] Linking CXX executable llm_test
[ 70%] Built target llm_test
[ 80%] Linking CXX executable llm
[ 80%] Built target llm
[ 90%] Linking CXX executable imgenc
[100%] Linking CXX executable demo
[100%] Built target imgenc
[100%] Built target demo
[ 30%] Built target demo
[ 50%] Built target llm_test
[ 70%] Built target llm
[100%] Built target imgenc
# omitted ..............
Switch to the install/demo_Linux_aarch64/ directory and run the demo program. When the user enters "<image>Please describe the image", the example describes the test image below:

# Transfer the qwen2_vl_2b_vision_rk3588.rknn and Qwen2-VL-2B-Instruct.rkllm models exported above to the board
# Run the demo program
# Usage: ./demo image_path encoder_model_path llm_model_path max_new_tokens max_context_len
cat@lubancat:~/xxx/install/demo_Linux_aarch64$ export LD_LIBRARY_PATH=./lib
cat@lubancat:~/xxx/install/demo_Linux_aarch64$ ./demo demo.jpg ~/qwen2_vl_2b_vision_rk3588.rknn ~/Qwen2-VL-2B-Instruct.rkllm 128 512
I rkllm: rkllm-runtime version: 1.1.4, rknpu driver version: 0.9.8, platform: RK3588
rkllm init success
main: LLM Model loaded in 2637.91 ms
model input num: 1, output num: 1
input tensors:
index=0, name=onnx::Expand_0, n_dims=4, dims=[1, 392, 392, 3], n_elems=460992, size=921984, fmt=NHWC, type=FP16, qnt_type=AFFINE, zp=0, scale=1.000000
output tensors:
index=0, name=6076, n_dims=2, dims=[196, 1536, 0, 0], n_elems=301056, size=602112, fmt=UNDEFINED, type=FP16, qnt_type=AFFINE, zp=0, scale=1.000000
model input height=392, width=392, channel=3
main: ImgEnc Model loaded in 2297.09 ms
user: <image>Please describe the image
robot: This picture shows an astronaut on the lunar surface. The astronaut is wearing a white spacesuit with a helmet and gloves, and appears to be resting or relaxing.
He is holding a green bottle and seems to be drinking from it. The background is the vast lunar surface, where some rocks and sand can be seen.
In the distance, Earth and other stars are visible, giving a sense of being out in space. The overall scene is full of science-fiction elements and the theme of space exploration.
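As an optional cross-check, the vision encoder RKNN model can also be run from Python on the board with rknn_toolkit_lite2 to confirm the output shape. This is only a rough sketch (the demo performs its image preprocessing in C and may normalize pixels differently):

import cv2
import numpy as np
from rknnlite.api import RKNNLite

rknn = RKNNLite()
rknn.load_rknn("./qwen2_vl_2b_vision_rk3588.rknn")
rknn.init_runtime()

# Resize to the model input size reported above: [1, 392, 392, 3], NHWC
img = cv2.imread("demo.jpg")
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
img = cv2.resize(img, (392, 392))
img = np.expand_dims(img, 0)

outputs = rknn.inference(inputs=[img])
print(outputs[0].shape)   # expected (196, 1536): one embedding per image patch token
rknn.release()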