5. YOLOv3 Object Detection

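YOLOv3 is an object detection algorithm and a major release in the YOLO family. Building on YOLOv2, it improves the backbone network, performs detection on multi-scale feature maps, and replaces softmax with multiple independent logistic classifiers for class prediction.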
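To illustrate why independent logistic classifiers differ from softmax, here is a minimal sketch in plain Python. The logits and the label names in the comment are hypothetical values chosen for illustration only; they do not come from the YOLOv3 model used in this tutorial.

```python
import math

def softmax(logits):
    """Mutually exclusive class probabilities that always sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def independent_logistic(logits):
    """One sigmoid per class; each score ignores the other classes."""
    return [1.0 / (1.0 + math.exp(-v)) for v in logits]

# Hypothetical logits for three labels, e.g. "person", "woman", "dog".
logits = [2.0, 1.5, -3.0]
soft = softmax(logits)
logi = independent_logistic(logits)

print("softmax :", [round(p, 3) for p in soft])
print("logistic:", [round(p, 3) for p in logi])
```

With softmax the first two labels compete for the same probability mass, whereas the per-class sigmoids can both exceed 0.5, which suits overlapping labels.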

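The reference implementation from the YOLOv3 paper uses Darknet, an open-source deep learning framework written in C and CUDA. The Darknet repository that implements YOLOv3 is at https://github.com/pjreddie/darknet/tree/master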

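5.1. Environment Setup

To set up the TPU-MLIR environment, see the earlier chapters.

Create a working directory and obtain the model files used in this tutorial:

# Create a yolov3 directory under the working directory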
root@cfa2f8976af9:/workspace# mkdir yolov3 && cd yolov3
root@cfa2f8976af9:/workspace/yolov3#

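Fetch the model files that accompany the example:

# Fetch yolov3_416.prototxt and yolov3_416.caffemodel
root@cfa2f8976af9:/workspace/yolov3# wget xxx  # download link to be added

Alternatively, obtain a Darknet model yourself (or train your own Darknet YOLOv3 model) and convert it to Caffe model files; conversion guides can be found online.

Then copy the dataset files needed for quantization into the yolov3 working directory (optional):

# Clone the tpu-mlir source; skip this step if you have already cloned it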
root@cfa2f8976af9:/workspace# git clone https://github.com/milkv-duo/tpu-mlir.git
root@cfa2f8976af9:/workspace# cd yolov3
root@cfa2f8976af9:/workspace/yolov3#  cp -rf ../tpu-mlir/regression/dataset/COCO2017/ ./

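5.2. Model Conversion

This tutorial tests a 416x416 YOLOv3 model in Caffe format. Note that the model includes post-processing: yolov3_416.prototxt contains an extra layer of type YoloDetection, named yolov3Detection, whose output blob (top) is output: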

layer {
    bottom: "layer106-conv"
    bottom: "layer94-conv"
    bottom: "layer82-conv"
    top: "output"
    name: "yolov3Detection"
    type: "YoloDetection"
    yolo_detection_param {
        net_input_w: 416
        net_input_h: 416
        nms_threshold: 0.45
        obj_threshold: 0.25
        keep_topk: 200
        class_num: 80
    }
}
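The parameters obj_threshold, nms_threshold and keep_topk suggest a score filter followed by non-maximum suppression. The sketch below shows what such parameters typically control, assuming greedy per-class NMS on boxes in (cx, cy, w, h) form. It is not the YoloDetection layer's actual implementation, which may combine objectness and class scores differently; the function names and sample detections are hypothetical.

```python
def iou_xywh(a, b):
    """Intersection over union of two boxes given as (cx, cy, w, h)."""
    ax1, ay1 = a[0] - a[2] / 2, a[1] - a[3] / 2
    ax2, ay2 = a[0] + a[2] / 2, a[1] + a[3] / 2
    bx1, by1 = b[0] - b[2] / 2, b[1] - b[3] / 2
    bx2, by2 = b[0] + b[2] / 2, b[1] + b[3] / 2
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def filter_and_nms(dets, obj_threshold=0.25, nms_threshold=0.45, keep_topk=200):
    """dets: list of (box, cls, score) tuples. Returns the surviving detections."""
    # 1. Drop low-score candidates and visit the rest from best to worst.
    cands = sorted((d for d in dets if d[2] >= obj_threshold),
                   key=lambda d: d[2], reverse=True)
    kept = []
    for d in cands:
        # 2. Keep a box unless it overlaps a better box of the same class too much.
        if all(d[1] != k[1] or iou_xywh(d[0], k[0]) <= nms_threshold for k in kept):
            kept.append(d)
    # 3. Cap the number of reported detections.
    return kept[:keep_topk]

# Hypothetical detections: two overlapping boxes of class 16, one of class 1,
# and one low-score box of class 7.
dets = [
    ((0.50, 0.5, 0.4, 0.4), 16, 0.9),
    ((0.52, 0.5, 0.4, 0.4), 16, 0.8),
    ((0.20, 0.2, 0.1, 0.1), 1, 0.6),
    ((0.80, 0.8, 0.1, 0.1), 7, 0.1),
]
kept = filter_and_nms(dets)
print(kept)
```

In this example the second class-16 box is suppressed by the first, and the class-7 box falls below obj_threshold, so two detections remain.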

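Use the model_transform tool (model_transform.py) to convert the model to MLIR:

# model_def specifies the model definition file; input_shapes specifies the input shape;
# mean and scale specify the mean and normalization factors; pixel_format specifies the input image format;
# test_input specifies the image used for validation; mlir specifies the name and path of the output MLIR file.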
root@cfa2f8976af9:/workspace/tpu/yolov3# model_transform.py --model_name yolov3 \
                                --model_def ./yolov3_416.prototxt \
                                --model_data ./yolov3_416.caffemodel \
                                --test_input ./dog.jpg \
                                --test_result yolov3_top_output.npz \
                                --input_shapes [[1,3,416,416]] \
                                --resize_dims=416,416 \
                                --keep_aspect_ratio \
                                --mean 0.0,0.0,0.0 \
                                --scale 0.00392,0.00392,0.00392 \
                                --pixel_format rgb \
                                --tolerance 0.99,0.99 \
                                --mlir yolov3.mlir
# --excepts output \
# (output omitted) ..................................................
[layer106-conv                   ]      SIMILAR [PASSED]
    (1, 255, 52, 52) float32
    cosine_similarity      = 1.000000
    euclidean_similarity   = 0.999999
    sqnr_similarity        = 115.155401
[output                        ]        CLOSE [PASSED]
    (1, 1, 200, 6) float32
    close order            = 5
104 compared
104 passed
1 equal, 1 close, 102 similar
0 failed
0 not equal, 0 not similar
min_similiarity = (0.9999997019767761, 0.9999982038586644, 112.97511100769043)
Target    yolov3_top_output.npz
Reference yolov3_ref_outputs.npz
npz compare PASSED.
compare output: 100%|█████████████████████████████████████████| 104/104 [00:07<00:00, 14.18it/s]
[Success]: npz_tool.py compare yolov3_top_output.npz yolov3_ref_outputs.npz --tolerance 0.99,0.99 --except - -vv
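The log above reports a cosine_similarity value for each compared tensor. As a rough illustration of that metric, here is a minimal sketch of cosine similarity on flattened tensors. It is not the code used by npz_tool.py, and the sample values are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two flattened tensors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Treat two all-zero tensors as identical, otherwise as unrelated.
        return 1.0 if norm_a == norm_b else 0.0
    return dot / (norm_a * norm_b)

# Hypothetical reference and converted outputs for one layer.
reference = [0.1, 0.5, -0.3, 0.9]
converted = [0.1001, 0.4999, -0.3, 0.9002]
print(cosine_similarity(reference, converted))
```

Values close to 1 mean the converted model's output points in almost the same direction as the reference, which is what the SIMILAR [PASSED] lines indicate.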

5.2.1. INT8 Quantized Model

Run run_calibration (run_calibration.py) to generate the calibration table. The input data is the 100 COCO2017 images copied earlier:

root@cfa2f8976af9:/workspace/tpu/yolov3# run_calibration yolov3.mlir \
        --dataset ./COCO2017/ \
        --input_num 100 \
        -o yolov3_cali_table
TPU-MLIR v1.8.1-20240712
GmemAllocator use FitFirstAssign
reused mem is 11075584, all mem is 132385491
2024/07/25 11:32:42 - INFO :
load_config Preprocess args :
        resize_dims           : [416, 416]
        keep_aspect_ratio     : True
        keep_ratio_mode       : letterbox
        pad_value             : 0
        pad_type              : center
        input_dims            : [416, 416]
        --------------------------
        mean                  : [0.0, 0.0, 0.0]
        scale                 : [0.00392, 0.00392, 0.00392]
        --------------------------
        pixel_format          : rgb
        channel_format        : nchw

last input data (idx=100) not valid, droped
input_num = 100, ref = 100
real input_num = 100
activation_collect_and_calc_th for op: layer107: 100%|████████████████████████████████████████████████████| 248/248 [02:56<00:00,  1.41it/s]
[2048] threshold: layer107: 100%|█████████████████████████████████████████████████████████████████| 248/248 [00:00<00:00, 846.23it/s]
[2048] threshold: layer107: 100%|████████████████████████████████████████████████████████████████| 248/248 [00:02<00:00, 89.39it/s]
GmemAllocator use FitFirstAssign
reused mem is 11075584, all mem is 132385491
GmemAllocator use FitFirstAssign
reused mem is 11075584, all mem is 132385491
prepare data from 100
tune op: layer107: 100%|██████████████████████████████████████████| 248/248 [04:50<00:00,  1.17s/it]
auto tune end, run time:294.8394968509674
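Calibration produces one threshold per tensor, which is later used to map float activations to INT8. The sketch below conveys the idea with the simplest possible rule, a maximum absolute value over the calibration samples, followed by symmetric quantization. The real tool uses a more elaborate threshold search and tuning step, so treat this as a stand-in; all names and sample values are hypothetical.

```python
def max_abs_threshold(batches):
    """Simplest calibration rule: largest absolute activation seen."""
    return max(abs(v) for batch in batches for v in batch)

def quantize_int8(x, threshold):
    """Symmetric quantization: [-threshold, threshold] maps to [-127, 127]."""
    q = round(x * 127.0 / threshold)
    return max(-128, min(127, q))  # saturate values outside the range

def dequantize_int8(q, threshold):
    """Map an INT8 value back to an approximate float."""
    return q * threshold / 127.0

# Hypothetical activations of one tensor collected from two calibration images.
batches = [[0.1, -2.0, 0.7], [1.5, -0.4, 3.2]]
threshold = max_abs_threshold(batches)

q = quantize_int8(-2.0, threshold)
print(threshold, q, dequantize_int8(q, threshold))
```

A smaller threshold gives finer resolution for common values but saturates outliers, which is why the choice of threshold affects INT8 accuracy.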

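Then use the model_deploy tool (model_deploy.py) to convert the MLIR model into an INT8 cvimodel. The --calibration_table argument must name the file written by run_calibration:

root@cfa2f8976af9:/workspace/tpu/yolov3# model_deploy.py \
                                --mlir yolov3.mlir \
                                --calibration_table yolov3_cali_table \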
                                --chip cv181x \
                                --quantize INT8 \
                                --quant_input \
                                --test_input ./dog.jpg \
                                --test_reference yolov3_top_output.npz \
                                --excepts output \
                                --tolerance 0.9,0.3 \
                                --fuse_preprocess \
                                --customization_format RGB_PLANAR \
                                --model yolov3_416.cvimodel
# (output omitted) .................................................
[Success]: tpuc-opt yolov3_cv181x_int8_sym_final.mlir --codegen="model_file=yolov3_416.cvimodel
embed_debug_info=false model_version=latest" -o /dev/null
[CMD]: model_runner.py --input yolov3_in_ori.npz --model yolov3_416.cvimodel --output yolov3_cv181x_int8_sym_model_outputs.npz
setenv:cv181x
Start TPU Simulator for cv181x
device[0] opened, 4294967296
version: 1.4.0
yolov3 Build at 2024-07-25 11:46:21 For platform cv181x
Cmodel: bm_load_cmdbuf
Max SharedMem size:8306688
Cmodel: bm_run_cmdbuf
device[0] closed
[Running]: npz_tool.py compare yolov3_cv181x_int8_sym_model_outputs.npz yolov3_cv181x_int8_sym_tpu_outputs.npz
 --tolerance 0.99,0.90 --except output -vv
compare output:   0%|                    | 0/1 [00:00<?, ?it/s][output                        ]        EQUAL [PASSED]
    (1, 1, 200, 6) float32
1 compared
1 passed
1 equal, 0 close, 0 similar
0 failed
0 not equal, 0 not similar
min_similiarity = (1.0, 1.0, inf)
Target    yolov3_cv181x_int8_sym_model_outputs.npz
Reference yolov3_cv181x_int8_sym_tpu_outputs.npz
npz compare PASSED.
compare output: 100%|█████████████████████████████████████| 1/1 [00:00<00:00, 32.32it/s]
[Success]: npz_tool.py compare yolov3_cv181x_int8_sym_model_outputs.npz yolov3_cv181x_int8_sym_tpu_outputs.npz
--tolerance 0.99,0.90 --except output -vv

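# The output model file is yolov3_416.cvimodel; copy it to the board for the deployment test that follows

The fuse_preprocess option fuses preprocessing into the model. Without it, the application must perform the corresponding steps, such as normalization and quantization, itself during preprocessing before inference.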
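As a concrete picture of the normalization that gets fused, here is a minimal sketch that applies the --mean and --scale values from the model_transform command to one RGB pixel, assuming the usual form y = (x - mean) * scale. The helper name and the sample pixel are hypothetical.

```python
# Values taken from the model_transform command in this tutorial.
MEAN = (0.0, 0.0, 0.0)
SCALE = (0.00392, 0.00392, 0.00392)  # roughly 1/255

def normalize_pixel(rgb):
    """Apply y = (x - mean) * scale to each channel of one pixel."""
    return tuple((v - m) * s for v, m, s in zip(rgb, MEAN, SCALE))

# Hypothetical pixel: full red, mid green, no blue.
out = normalize_pixel((255, 128, 0))
print(out)
```

With fuse_preprocess enabled, the application hands raw 8-bit pixels to the model and this arithmetic runs inside the cvimodel instead.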

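5.3. Deployment Test

Deployment uses the tpu-sdk library. For the API, see the CVITEK_TPU_SDK development guide listed in the TPU SDK development resources collection.

5.3.1. Example Code Walkthrough

Because the YOLOv3 model was converted with --fuse_preprocess, the application does not need to normalize the input image, but it must supply the data in RGB NCHW layout. In addition, the model includes a YoloDetection layer, so part of the post-processing runs inside the model: the example does not need to decode bounding boxes or run NMS, and only light processing is needed to produce the detection results.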
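OpenCV stores images in packed (interleaved, HWC) order, while the model expects planar (CHW) data. The following sketch shows the index mapping behind that conversion on a flat list. The function name and the tiny 1x2 image are hypothetical and only illustrate the layout change.

```python
def packed_to_planar(pixels, height, width, channels=3):
    """Reorder a flat HWC buffer (RGBRGB...) into a flat CHW buffer (RR..GG..BB..)."""
    plane = height * width
    planar = [0] * (channels * plane)
    for y in range(height):
        for x in range(width):
            for c in range(channels):
                planar[c * plane + y * width + x] = pixels[(y * width + x) * channels + c]
    return planar

# Hypothetical 1x2 RGB image: pixel 0 = (10, 20, 30), pixel 1 = (11, 21, 31).
packed = [10, 20, 30, 11, 21, 31]
planar = packed_to_planar(packed, height=1, width=2)
print(planar)
```

After the conversion, each channel occupies one contiguous block, which allows the whole plane to be copied into the input tensor in one step.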

yolov3.cpp
// (earlier code omitted) ...............
// Read the image with OpenCV
cv::Mat image;
image = cv::imread(argv[2]);
if (!image.data) {
    printf("Could not open or find the image\n");
    return -1;
}
cv::Mat cloned = image.clone();

detection dets[MAX_DET];
int32_t det_num = 0;

/* Preprocessing */
int ih = image.rows;
int iw = image.cols;
int oh = height;
int ow = width;
double scale = std::min((double)oh / ih, (double)ow / iw);
int nh = (int)(ih * scale);
int nw = (int)(iw * scale);
// resize & letterbox
cv::resize(image, image, cv::Size(nw, nh));
int top = (oh - nh) / 2;
int bottom = (oh - nh) - top;
int left = (ow - nw) / 2;
int right = (ow - nw) - left;
cv::copyMakeBorder(image, image, top, bottom, left, right, cv::BORDER_CONSTANT,
                    cv::Scalar::all(0));
cv::cvtColor(image, image, cv::COLOR_BGR2RGB);

//Packed2Planar
cv::Mat channels[3];
for (int i = 0; i < 3; i++) {
    channels[i] = cv::Mat(image.rows, image.cols, CV_8SC1);
}
cv::split(image, channels);

// fill data
int8_t *ptr = (int8_t *)CVI_NN_TensorPtr(input);
int channel_size = height * width;
for (int i = 0; i < 3; ++i) {
    memcpy(ptr + i * channel_size, channels[i].data, channel_size);
}

/* Inference */
gettimeofday(&start_time, NULL);
CVI_NN_Forward(model, input_tensors, input_num, output_tensors, output_num);
gettimeofday(&stop_time, NULL);
printf("CVI_NN_Forward using %f ms\n", (__get_us(stop_time) - __get_us(start_time)) / 1000);
printf("CVI_NN_Forward succeeded\n");

/* Post-processing */
float *output_ptr = (float *)CVI_NN_TensorPtr(output);
for (int i = 0; i < MAX_DET; ++i) {
    // keep only real detections (score > 0)
    if (output_ptr[i * 6 + 5] > 0) {
        // output: [x,y,w,h,cls,score]
        dets[det_num].bbox.x = output_ptr[i * 6 + 0];
        dets[det_num].bbox.y = output_ptr[i * 6 + 1];
        dets[det_num].bbox.w = output_ptr[i * 6 + 2];
        dets[det_num].bbox.h = output_ptr[i * 6 + 3];
        dets[det_num].cls = output_ptr[i * 6 + 4];
        dets[det_num].score = output_ptr[i * 6 + 5];
        det_num++;
    }
}
printf("get detection num: %d\n", det_num);

// correct box with origin image size
int restored_w = 0;
int restored_h = 0;
bool relative_position = false;
if (((float)width / cloned.cols) < ((float)height / cloned.rows)) {
    restored_w = width;
    restored_h = (cloned.rows * width) / cloned.cols;
} else {
    restored_h = height;
    restored_w = (cloned.cols * height) / cloned.rows;
}
for (int i = 0; i < det_num; ++i) {
    box b = dets[i].bbox;
    b.x = (b.x - (width - restored_w) / 2. / width) /
        ((float)restored_w / width);
    b.y = (b.y - (height - restored_h) / 2. / height) /
        ((float)restored_h / height);
    b.w *= (float)width / restored_w;
    b.h *= (float)height / restored_h;
    if (!relative_position) {
        b.x *= cloned.cols;
        b.w *= cloned.cols;
        b.y *= cloned.rows;
        b.h *= cloned.rows;
    }
    dets[i].bbox = b;
}

/* Draw the result boxes and print the detections */
printf("-------------------\n");
for (int i = 0; i < det_num; i++) {
    box b = dets[i].bbox;
    // xywh2xyxy
    int x1 = (b.x - b.w / 2);
    int y1 = (b.y - b.h / 2);
    int x2 = (b.x + b.w / 2);
    int y2 = (b.y + b.h / 2);
    printf("%s @ (%d %d %d %d) %.3f\n", coco_names[dets[i].cls], x1,y1,x2,y2,dets[i].score);
    cv::rectangle(cloned, cv::Point(x1, y1), cv::Point(x2, y2), cv::Scalar(255, 255, 0),
                3, 8, 0);
    cv::putText(cloned, coco_names[dets[i].cls], cv::Point(x1, y1),
                cv::FONT_HERSHEY_DUPLEX, 1.0, cv::Scalar(0, 0, 255), 2);
}
printf("-------------------\n");

// save or show picture
cv::imwrite(argv[3], cloned);

CVI_NN_CleanupModel(model);
printf("CVI_NN_CleanupModel succeeded\n");
return 0;
}
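The letterbox and box-correction arithmetic in the listing can be checked in isolation. The sketch below reproduces it in plain Python, using integer division for the resized size, as the box-correction part of the listing does. The 768x576 image size and the sample box are hypothetical, and the function names are introduced here for illustration.

```python
def letterbox_geometry(iw, ih, ow=416, oh=416):
    """Return (nw, nh, left, top): resized size and padding offsets."""
    if ow * ih < oh * iw:          # width is the limiting side
        nw, nh = ow, (ih * ow) // iw
    else:                          # height is the limiting side
        nw, nh = (iw * oh) // ih, oh
    return nw, nh, (ow - nw) // 2, (oh - nh) // 2

def restore_box(box, iw, ih, ow=416, oh=416):
    """Map a normalized (cx, cy, w, h) box on the padded input back to
    pixel coordinates in the original image."""
    nw, nh, _, _ = letterbox_geometry(iw, ih, ow, oh)
    cx = (box[0] - (ow - nw) / 2.0 / ow) / (nw / ow)  # remove horizontal padding
    cy = (box[1] - (oh - nh) / 2.0 / oh) / (nh / oh)  # remove vertical padding
    w = box[2] * ow / nw
    h = box[3] * oh / nh
    return cx * iw, cy * ih, w * iw, h * ih

def xywh_to_xyxy(box):
    """Convert a centre/size box to integer corner coordinates."""
    cx, cy, w, h = box
    return int(cx - w / 2), int(cy - h / 2), int(cx + w / 2), int(cy + h / 2)

# Hypothetical 768x576 source image and a box centred on the network input.
IMG_W, IMG_H = 768, 576
geometry = letterbox_geometry(IMG_W, IMG_H)
restored = restore_box((0.5, 0.5, 0.5, 0.375), IMG_W, IMG_H)
corners = xywh_to_xyxy(restored)
print(geometry, restored, corners)
```

For this example the image is scaled to 416x312 and padded by 52 rows at the top and bottom, and the box centred on the network input maps back to the centre of the original image.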

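5.3.2. Build and Test the Example

Run the commands below, choosing the cross compiler that matches the system on the target board: aarch64-linux-gnu for arm64, or riscv64-linux-musl-x86_64 for riscv64.

# Fetch the cross compiler (skip this if it was already downloaded for an earlier test).
# This tutorial targets a riscv64 system and adds the cross compiler to the PATH environment variable.
cd /workspace/tpu-mlir
wget https://sophon-file.sophon.cn/sophon-prod-s3/drive/23/03/07/16/host-tools.tar.gz
tar xvf host-tools.tar.gz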
cd host-tools
export PATH=$PATH:$(pwd)/gcc/riscv64-linux-musl-x86_64/bin
# For an aarch64 system, add this cross compiler to PATH instead
#export PATH=$PATH:$(pwd)/gcc/gcc-linaro-6.3.1-2017.05-x86_64_aarch64-linux-gnu/bin

# Verify the cross compiler
riscv64-unknown-linux-musl-gcc -v
Using built-in specs.
COLLECT_GCC=riscv64-unknown-linux-musl-gcc
COLLECT_LTO_WRAPPER=/home/dev/sg2000/cvi_mmf_sdk_test/host-tools/gcc/riscv64-linux-musl-x86_64/bin/
../libexec/gcc/riscv64-unknown-linux-musl/10.2.0/lto-wrapper
Target: riscv64-unknown-linux-musl
Configured with: /mnt/ssd/jenkins_iotsw/slave/workspace/Toolchain/build-gnu-riscv_4/
./source/riscv/riscv-gcc/configure # (omitted) ................
Thread model: posix
Supported LTO compression algorithms: zlib
gcc version 10.2.0 (Xuantie-900 linux-5.10.4 musl gcc Toolchain V2.6.1 B-20220906

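Fetch the companion example code:

# Clone the companion example repository
git clone https://gitee.com/LubanCat/lubancat_sg2000_application_code.git

Build the deployment example:

# Change to the yolov3 object detection example directory
cd lubancat_sg2000_application_code/examples/yolov3

# Build the yolov3 example; set -a to match the board's system: aarch64 for arm64, musl_riscv64 for riscv64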
./build.sh -a musl_riscv64
-- CMAKE_C_COMPILER: riscv64-unknown-linux-musl-gcc
-- CMAKE_CXX_COMPILER: riscv64-unknown-linux-musl-g++
-- CMAKE_C_COMPILER: riscv64-unknown-linux-musl-gcc
-- CMAKE_CXX_COMPILER: riscv64-unknown-linux-musl-g++
-- The C compiler identification is GNU 10.2.0
-- The CXX compiler identification is GNU 10.2.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /home/dev/host-tools/gcc/riscv64-linux-musl-x86_64/bin/riscv64-unknown-linux-musl-gcc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /home/dev//host-tools/gcc/riscv64-linux-musl-x86_64/bin/riscv64-unknown-linux-musl-g++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Configuring done (0.9s)
-- Generating done (0.0s)
-- Build files have been written to: /home/dev/xxx/samples/yolov3/build/build_riscv_musl
[2/3] Install the project...
-- Install configuration: "RELEASE"
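# (output omitted) .................

Run the program on the board to test inference:

# On a riscv system, set this environment variable first
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/lib64v0p7_xthead/lp64d

# Arguments: cvimodel path, test image path, output image path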
# ./yolov3 cvimodel image.jpg image_detected.jpg
root@lubancat:/home/cat# ./yolov3 yolov3_416.cvimodel dog.jpg out.jpg
version: 1.4.0
yolov3 Build at 2024-07-25 11:46:21 For platform cv181x
Max SharedMem size:8306688
CVI_NN_RegisterModel succeeded
Input Tensor Number  : 1
[0] data_raw, shape (1,3,416,416), count 519168, fmt 7
Output Tensor Number : 1
[0] layer107, shape (1,1,200,6), count 1200, fmt 0
CVI_NN_Forward using 295.636000 ms
CVI_NN_Forward succeeded
get detection num: 3
-------------------
truck @ (474 74 688 178) 0.950
bicycle @ (179 136 544 425) 0.995
dog @ (132 237 312 525) 0.984
-------------------
CVI_NN_CleanupModel succeeded

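The program prints the model's input and output tensor information along with the detection results, and saves the annotated image as out.jpg in the current directory.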
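The output tensor shape (1,1,200,6) means up to 200 rows of six floats. The sketch below shows how such a buffer can be read, assuming the [x, y, w, h, cls, score] row layout noted in the example source and that unused rows carry a score of 0. The buffer contents here are hypothetical.

```python
def parse_detections(flat, max_det=200, row_len=6):
    """Collect rows with score > 0 from a flat (max_det x 6) output buffer."""
    dets = []
    for i in range(max_det):
        x, y, w, h, cls, score = flat[i * row_len:(i + 1) * row_len]
        if score > 0:
            dets.append({"bbox": (x, y, w, h), "cls": int(cls), "score": score})
    return dets

# Hypothetical buffer: two valid rows followed by zero padding.
flat = [0.0] * (200 * 6)
flat[0:6] = [0.50, 0.55, 0.30, 0.60, 16.0, 0.98]
flat[6:12] = [0.75, 0.20, 0.25, 0.15, 7.0, 0.95]

dets = parse_detections(flat)
print(len(dets), [d["cls"] for d in dets])
```

The class index is stored as a float in the tensor, so it is converted to an integer before being used to look up a class name.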