19. Zipformer

Zipformer is the latest sequence modeling architecture developed by the Next-gen Kaldi team. Compared with mainstream ASR models such as Conformer, Squeezeformer, and E-Branchformer, Zipformer is more accurate, faster to compute, and more memory-efficient. At the time of its release, Zipformer achieved state-of-the-art ASR results on common datasets such as LibriSpeech, Aishell-1, and WenetSpeech.

19.1. Deploying and Testing Zipformer

This section deploys and tests Zipformer on the board, following the method provided in rknn_model_zoo.

19.1.1. Exporting the ONNX models

Install the icefall environment; for detailed installation commands, refer to the icefall documentation.

# Create a virtual environment with conda
conda create -n icefall python=3.11
conda activate icefall

# Install PyTorch according to your own environment; a simple reference command:
pip install torch==2.6.0 --index-url https://download.pytorch.org/whl/cpu

# Install k2
pip install k2==1.24.4.dev20250307+cpu.torch2.6.0 -f https://k2-fsa.github.io/k2/cpu-cn.html

# Install a few more required libraries
pip install torchaudio lhotse

# Get the icefall source code
git clone https://github.com/k2-fsa/icefall.git
cd icefall/egs/librispeech/ASR
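
After installation, a quick import check confirms the environment works before running any icefall script; a minimal sketch (package names as installed above):

# check_env.py -- verify that the core icefall dependencies import cleanly
import torch
import k2       # the k2 wheel must match the installed torch version
import lhotse

print("torch", torch.__version__)
print("lhotse", lhotse.__version__)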

Alternatively, it is recommended to simply use icefall's Docker environment image.

Then fetch the pretrained model:

repo_url=https://huggingface.co/csukuangfj/k2fsa-zipformer-bilingual-zh-en-t
GIT_LFS_SKIP_SMUDGE=1 git clone $repo_url
repo=$(basename $repo_url)

pushd $repo
git lfs pull --include "data/lang_char_bpe/bpe.model"
git lfs pull --include "exp/pretrained.pt"
cd exp
ln -s pretrained.pt epoch-99.pt
popd
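
Optionally verify that git lfs actually fetched the full checkpoint: with GIT_LFS_SKIP_SMUDGE=1, any file that was not pulled is a small text pointer that torch.load cannot read. A minimal sketch, run inside the repo:

# check_ckpt.py -- fails loudly if exp/pretrained.pt is still an LFS pointer
import torch

ckpt = torch.load("exp/pretrained.pt", map_location="cpu")
print(type(ckpt))
if isinstance(ckpt, dict):
    print(list(ckpt.keys()))  # icefall checkpoints typically contain a "model" entry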

Export the ONNX models with reference to export-for-onnx.sh:

#!/usr/bin/env bash

export CUDA_VISIBLE_DEVICES=""
set -ex

dir=path/to/k2fsa-zipformer-bilingual-zh-en-t
if [ ! -f $dir/exp/epoch-99.pt ]; then
  pushd $dir/exp
  ln -s pretrained.pt epoch-99.pt
  popd
fi

./pruned_transducer_stateless7_streaming/export-onnx-zh.py \
  --tokens $dir/data/lang_char_bpe/tokens.txt \
  --use-averaged-model 0 \
  --epoch 99 \
  --avg 1 \
  --exp-dir $dir/exp/ \
  --decode-chunk-len 96 \
  \
  --num-encoder-layers 2,2,2,2,2 \
  --feedforward-dims 768,768,768,768,768 \
  --nhead 4,4,4,4,4 \
  --encoder-dims 256,256,256,256,256 \
  --attention-dims 192,192,192,192,192 \
  --encoder-unmasked-dims 192,192,192,192,192 \
  \
  --zipformer-downsampling-factors "1,2,4,8,2" \
  --cnn-module-kernels "31,31,31,31,31" \
  --decoder-dim 512 \
  --joiner-dim 512 \
  --dynamic-batch 0

Note that you must set the dir variable to the path of the k2fsa-zipformer-bilingual-zh-en-t repository obtained above, and set the --dynamic-batch parameter to 0 (this option requires pulling the latest icefall), i.e. the exported models use a fixed batch size of 1.

# Copy the export-for-onnx.sh above into icefall's egs/librispeech/ASR directory, then run it
llh@llh:/xxx/icefall/egs/librispeech/ASR$ ./export-for-onnx.sh
2025-03-17 02:55:37,438 INFO [export-onnx-zh.py:520] device: cpu
2025-03-17 02:55:37,445 INFO [export-onnx-zh.py:531] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1,
'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4,
# omitted...........................
checkpoint = torch.load(filename, map_location="cpu")
2025-03-17 02:55:37,841 INFO [export-onnx-zh.py:635] encoder parameters: 19451231
2025-03-17 02:55:37,841 INFO [export-onnx-zh.py:636] decoder parameters: 3468800
2025-03-17 02:55:37,841 INFO [export-onnx-zh.py:637] joiner parameters: 3208302
2025-03-17 02:55:37,841 INFO [export-onnx-zh.py:638] total parameters: 26128333
2025-03-17 02:55:37,841 INFO [export-onnx-zh.py:651] Exporting encoder
2025-03-17 02:55:37,841 INFO [export-onnx-zh.py:303] decode_chunk_len: 96
2025-03-17 02:55:37,841 INFO [export-onnx-zh.py:304] pad_length: 7
# omitted...........................

This generates encoder-epoch-99-avg-1.onnx, decoder-epoch-99-avg-1.onnx, and joiner-epoch-99-avg-1.onnx in the exp directory of the k2fsa-zipformer-bilingual-zh-en-t repository.
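
icefall writes the streaming configuration (decode_chunk_len, encoder dims, and so on) into the ONNX metadata, and the conversion step in the next section reads it from there. A minimal inspection sketch, assuming the onnx Python package is installed:

# inspect_onnx.py -- print icefall's metadata and input shapes (run inside exp/)
import onnx

m = onnx.load("encoder-epoch-99-avg-1.onnx")
for p in m.metadata_props:
    print(p.key, "=", p.value)
for i in m.graph.input:
    dims = [d.dim_value or d.dim_param for d in i.type.tensor_type.shape.dim]
    print(i.name, dims)  # with --dynamic-batch 0 the batch dimension is fixed at 1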

19.1.2. Converting to RKNN models

Converting the ONNX models to RKNN models requires the rknn-toolkit2 tool; for its installation, refer to the earlier chapters.

For the conversion program, refer to convert.py in rknn_model_zoo, or the export_rknn.py example in icefall.

This tutorial uses export_rknn.py from icefall to convert the three ONNX models exported above directly into RKNN models.

# In the rknn-toolkit2 2.3.0 environment, set the three input model paths and the output model paths (tested on a LubanCat 4, so the target platform is rk3588)
(toolkit2_2.3) llh@llh:/xxx$ python export_rknn.py  --target-platform rk3588 --in-encoder ./encoder-epoch-99-avg-1.onnx \
--in-decoder ./decoder-epoch-99-avg-1.onnx  --in-joiner ./joiner-epoch-99-avg-1.onnx \
--out-encoder ./encoder-epoch-99-avg-1.rknn --out-decoder ./decoder-epoch-99-avg-1.rknn --out-joiner ./joiner-epoch-99-avg-1.rknn
{'target_platform': 'rk3588', 'in_encoder': './encoder-epoch-99-avg-1.onnx', 'in_decoder': './decoder-epoch-99-avg-1.onnx',
'in_joiner': './joiner-epoch-99-avg-1.onnx', 'out_encoder': './encoder-epoch-99-avg-1.rknn',
'out_decoder': './decoder-epoch-99-avg-1.rknn', 'out_joiner': './joiner-epoch-99-avg-1.rknn'}
{'cnn_module_kernels': '31,31,31,31,31', 'attention_dims': '192,192,192,192,192', 'encoder_dims': '256,256,256,256,256',
 'left_context_len': '192,96,48,24,96', 'num_encoder_layers': '2,2,2,2,2', 'T': '103', 'decode_chunk_len': '96',
 'version': '1', 'model_author': 'k2-fsa', 'model_type': 'zipformer'}
{'vocab_size': '6254', 'context_size': '2'}
I rknn-toolkit2 version: 2.3.0
I Loading : 100%|██████████████████████████████████████████████| 369/369 [00:00<00:00, 32542.01it/s]
# middle part omitted..........................
I OpFusing 1 : 100%|███████████████████████████████████████████| 100/100 [00:00<00:00, 11954.35it/s]
I OpFusing 2 : 100%|███████████████████████████████████████████| 100/100 [00:00<00:00, 10950.61it/s]
I OpFusing 0 : 100%|███████████████████████████████████████████| 100/100 [00:00<00:00, 10128.72it/s]
I OpFusing 1 : 100%|████████████████████████████████████████████| 100/100 [00:00<00:00, 9654.28it/s]
I OpFusing 2 : 100%|████████████████████████████████████████████| 100/100 [00:00<00:00, 3540.75it/s]
I rknn building ...
I rknn building done.
model_type=zipformer;attention_dims=192,192,192,192,192;encoder_dims=256,256,256,256,256;T=103;left_context_len=192,96,48,24,96;
decode_chunk_len=96;cnn_module_kernels=31,31,31,31,31;num_encoder_layers=2,2,2,2,2;context_size=2
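
For each model, the conversion boils down to the standard toolkit2 flow; a minimal sketch for the encoder (quantization disabled here as an assumption, not a verified copy of export_rknn.py):

# convert_one.py -- minimal ONNX -> RKNN conversion with rknn-toolkit2
from rknn.api import RKNN

rknn = RKNN(verbose=False)
rknn.config(target_platform="rk3588")   # rk3588 for the LubanCat 4
rknn.load_onnx(model="encoder-epoch-99-avg-1.onnx")
rknn.build(do_quantization=False)       # keep floating point, no calibration set
rknn.export_rknn("encoder-epoch-99-avg-1.rknn")
rknn.release()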

19.1.3. Testing the examples

1. Deploy on the board using the example provided in rknn_model_zoo.

# Companion example
// to be added

# Get the test example
git clone https://github.com/airockchip/rknn_model_zoo

Modify the compiler path in the build script (on the board itself, the default installed compiler can be used), copy the three converted RKNN models into the zipformer example's model directory, then compile the example on the board:

cat@lubancat:~/xxx$ ./build-linux.sh -t rk3588 -a aarch64 -d zipformer
===================================
# omitted................
===================================
-- Configuring done
-- Generating done
-- Build files have been written to: /xxx/build_rk3588_linux
[ 16%] Built target imageutils
[ 50%] Built target fileutils
[ 50%] Built target imagedrawing
[ 66%] Built target audioutils
[ 75%] Linking CXX executable rknn_zipformer_demo
[100%] Built target rknn_zipformer_demo
[ 16%] Built target fileutils
[ 33%] Built target audioutils
[ 66%] Built target rknn_zipformer_demo
[ 83%] Built target imageutils
[100%] Built target imagedrawing
# omitted................

Switch to the install/rk3588_linux directory, then run the test program:

# ./rknn_zipformer_demo <encoder_path> <decoder_path> <joiner_path> <audio_path>
cat@lubancat:~/xxx/install/rk3588_linux$ ./rknn_zipformer_demo ./model/encoder-epoch-99-avg-1.rknn
./model/decoder-epoch-99-avg-1.rknn ./model/joiner-epoch-99-avg-1.rknn model/test.wav
-- read_audio & convert_channels & resample_audio & read_vocab use: 4.621000 ms
model input num: 36, output num: 36
input tensors:
# omitted................................................................
-- init_zipformer_encoder_model use: 226.266006 ms
model input num: 1, output num: 1
input tensors:
index=0, name=y, n_dims=2, dims=[1, 2], n_elems=2, size=16, fmt=UNDEFINED, type=INT64, qnt_type=AFFINE, zp=0, scale=1.000000
output tensors:
index=0, name=decoder_out, n_dims=2, dims=[1, 512], n_elems=512, size=1024, fmt=UNDEFINED, type=FP16, qnt_type=AFFINE, zp=0, scale=1.000000
-- init_zipformer_decoder_model use: 10.225000 ms
model input num: 2, output num: 1
input tensors:
index=0, name=encoder_out, n_dims=2, dims=[1, 512], n_elems=512, size=1024, fmt=UNDEFINED, type=FP16, qnt_type=AFFINE, zp=0, scale=1.000000
index=1, name=decoder_out, n_dims=2, dims=[1, 512], n_elems=512, size=1024, fmt=UNDEFINED, type=FP16, qnt_type=AFFINE, zp=0, scale=1.000000
output tensors:
index=0, name=logit, n_dims=2, dims=[1, 6254], n_elems=6254, size=12508, fmt=UNDEFINED, type=FP16, qnt_type=AFFINE, zp=0, scale=1.000000
-- init_zipformer_joiner_model use: 7.846000 ms
-- inference_zipformer_model use: 1136.543945 ms

Real Time Factor (RTF): 1.137 / 5.841 = 0.195

Timestamp (s): 0.00, 0.48, 0.72, 0.88, 1.16, 1.40, 2.00, 2.04, 2.20, 2.36, 2.52, 2.68, 2.80, 3.36, 3.48,
3.64, 3.76, 3.88, 3.96, 4.04, 4.16, 4.28, 4.44, 4.60, 4.68, 5.16

Zipformer output: 对我做了介绍那么我想说的是大家如果对我的研究感兴趣呢
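
Under the hood, rknn_zipformer_demo runs streaming transducer greedy search over the three models: for each encoder output frame, the joiner scores the next token given the decoder's embedding of the last two emitted tokens (context_size=2, as shown in the conversion log above). A schematic NumPy sketch of that loop, where run_decoder and run_joiner are hypothetical wrappers around the decoder/joiner RKNN models (not the demo's actual C code):

# greedy_search.py -- schematic transducer greedy search
import numpy as np

def greedy_search(encoder_out, run_decoder, run_joiner, blank_id=0, context_size=2):
    # encoder_out: (T, encoder_dim) frames of one decoded chunk
    hyp = [blank_id] * context_size               # start from a blank context
    decoder_out = run_decoder(hyp[-context_size:])
    for t in range(encoder_out.shape[0]):
        logits = run_joiner(encoder_out[t], decoder_out)
        token = int(np.argmax(logits))
        if token != blank_id:                     # emit a symbol, refresh the decoder
            hyp.append(token)
            decoder_out = run_decoder(hyp[-context_size:])
    return hyp[context_size:]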

2. sherpa-onnx also supports RKNN deployment. sherpa-onnx is an open-source project developed by the Next-gen Kaldi team that provides efficient offline speech recognition and speech synthesis solutions.

# Get sherpa-onnx on the board (commit 823e2e6 at the time of writing)
git clone https://github.com/k2-fsa/sherpa-onnx.git

This tutorial builds sherpa-onnx with the build-rknn-linux-aarch64.sh script; you can also build directly with cmake commands. For details, refer to the sherpa-onnx documentation.

# Install the required packages
sudo apt update
sudo apt install make cmake libtool

# Get rknn-toolkit2
git clone https://github.com/airockchip/rknn-toolkit2.git

# Set the environment variable, or edit build-rknn-linux-aarch64.sh yourself to point it at the librknnrt.so library
cat@lubancat:$ export SHERPA_ONNX_RKNN_TOOLKIT2_PATH=/path/to/rknn-toolkit2

# Switch to the sherpa-onnx directory
cat@lubancat:$ cd sherpa-onnx
cat@lubancat:~/sherpa-onnx$ ./build-rknn-linux-aarch64.sh
+ CC=aarch64-linux-gnu-gcc
+ ./gitcompile --host=aarch64-linux-gnu
configure.ac:30: installing './compile'
configure.ac:15: installing './config.guess'
configure.ac:15: installing './config.sub'
configure.ac:16: installing './install-sh'
configure.ac:16: installing './missing'
alsalisp/Makefile.am: installing './depcomp'
parallel-tests: installing './test-driver'
# omitted..................

19.1.3.1. Recognizing a speech file

After the build finishes, switch to the build-rknn-linux-aarch64/install/bin directory and test the sherpa-onnx program on a speech file.

cat@lubancat:~/sherpa-onnx$ cd build-rknn-linux-aarch64/install/bin

# Copy the three converted RKNN models, vocab.txt, and the wav file into a suitable directory, then set the parameters:
# set --encoder, --decoder, and --joiner to the model paths above, and --provider to rknn
cat@lubancat:~/xxx-aarch64/install/bin$ ./sherpa-onnx --tokens=/home/cat/zipformer/vocab.txt \
    --encoder=/home/cat/zipformer/encoder-epoch-99-avg-1.rknn \
    --decoder=/home/cat/zipformer/decoder-epoch-99-avg-1.rknn \
    --joiner=/home/cat/zipformer/joiner-epoch-99-avg-1.rknn \
    --provider=rknn \
    ~/zipformer/test.wav

OnlineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_dim=80, low_freq=20, high_freq=-400, dither=0, normalize_samples=True,
snip_edges=False), model_config=OnlineModelConfig(transducer=OnlineTransducerModelConfig(encoder="/home/cat/zipformer/encoder-epoch-99-avg-1.rknn",
decoder="/home/cat/zipformer/decoder-epoch-99-avg-1.rknn", joiner="/home/cat/zipformer/joiner-epoch-99-avg-1.rknn"),
paraformer=OnlineParaformerModelConfig(encoder="", decoder=""), wenet_ctc=OnlineWenetCtcModelConfig(model="", chunk_size=16, num_left_chunks=4),
# omitted...........................
/home/cat/zipformer/test.wav
Number of threads: 1, Elapsed seconds: 0.52, Audio duration (s): 5.6, Real time factor (RTF) = 0.52/5.6 = 0.093
对我做了介绍那么我想说的是大家如果对我的研究感兴趣呢
{ "text": "对我做了介绍那么我想说的是大家如果对我的研究感兴趣呢", "tokens": ["对", "我", "做", "了", "介", "绍", "那", "么",
 "我", "想", "说", "的", "是", "大", "家", "如", "果", "对", "我", "的", "研", "究", "感", "兴", "趣", "呢"],
"timestamps": [0.00, 0.48, 0.72, 0.88, 1.16, 1.40, 2.00, 2.04, 2.20, 2.36, 2.52, 2.68, 2.80, 3.36, 3.48, 3.64,
3.76, 3.88, 3.96, 4.04, 4.16, 4.28, 4.44, 4.60, 4.68, 5.16],
 "ys_probs": [], "lm_probs": [], "context_scores": [], "segment": 0, "words": [], "start_time": 0.00, "is_final": false}

19.1.3.2. Real-time speech recognition

Test the sherpa-onnx-alsa program in the install/bin directory to recognize live speech. Note that with --provider=rknn, sherpa-onnx repurposes the --num-threads option to select NPU cores rather than a thread count; on RK3588, -4 selects all three NPU cores.

# List the capture devices (tested on a LubanCat 4); here it is card 2, device 0, so the command below uses the device name plughw:2,0
cat@lubancat:~/xxx-aarch64/install/bin$ arecord -l
**** List of CAPTURE Hardware Devices ****
card 2: rockchipes8388 [rockchip-es8388], device 0: dailink-multicodecs ES8323 HiFi-0 [dailink-multicodecs ES8323 HiFi-0]
Subdevices: 1/1
Subdevice #0: subdevice #0

# Real-time speech recognition
cat@lubancat:~/xxx-aarch64/install/bin$ ./sherpa-onnx-alsa \
    --tokens=/home/cat/zipformer/vocab.txt \
    --encoder=/home/cat/zipformer/encoder-epoch-99-avg-1.rknn \
    --decoder=/home/cat/zipformer/decoder-epoch-99-avg-1.rknn \
    --joiner=/home/cat/zipformer/joiner-epoch-99-avg-1.rknn \
    --provider=rknn \
    --num-threads=-4 \
    --decoding-method=greedy_search \
    plughw:2,0

/xxx/parse-options.cc:Read:375 ./bin/sherpa-onnx-alsa
--tokens=/home/cat/zipformer/vocab.txt --encoder=/home/cat/zipformer/encoder-epoch-99-avg-1.rknn
--decoder=/home/cat/zipformer/decoder-epoch-99-avg-1.rknn --joiner=/home/cat/zipformer/joiner-epoch-99-avg-1.rknn
--provider=rknn --num-threads=-4 --decoding-method=greedy_search plughw:2,0

OnlineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_dim=80,
low_freq=20, high_freq=-400, dither=0, normalize_samples=True, snip_edges=False),
model_config=OnlineModelConfig(transducer=OnlineTransducerModelConfig(encoder="/home/cat/zipformer/encoder-epoch-99-avg-1.rknn",
decoder="/home/cat/zipformer/decoder-epoch-99-avg-1.rknn", joiner="/home/cat/zipformer/joiner-epoch-99-avg-1.rknn"),
paraformer=OnlineParaformerModelConfig(encoder="", decoder=""), wenet_ctc=OnlineWenetCtcModelConfig(model="",
chunk_size=16, num_left_chunks=4), zipformer2_ctc=OnlineZipformer2CtcModelConfig(model=""), nemo_ctc=OnlineNeMoCtcModelConfig(model=""),
# 省略...........................
Current sample rate: 16000
Recording started!
Use recording device: plughw:2,0
Started! Please speak
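
The same real-time loop can also be driven from Python; a minimal microphone sketch using the third-party sounddevice package (sounddevice and a sherpa-onnx wheel with RKNN support are assumptions, not part of the original example):

# mic_decode.py -- continuous recognition from the default microphone
import numpy as np
import sounddevice as sd
import sherpa_onnx

recognizer = sherpa_onnx.OnlineRecognizer.from_transducer(
    tokens="/home/cat/zipformer/vocab.txt",
    encoder="/home/cat/zipformer/encoder-epoch-99-avg-1.rknn",
    decoder="/home/cat/zipformer/decoder-epoch-99-avg-1.rknn",
    joiner="/home/cat/zipformer/joiner-epoch-99-avg-1.rknn",
    provider="rknn",
    decoding_method="greedy_search",
)

sample_rate = 16000
stream = recognizer.create_stream()
with sd.InputStream(channels=1, dtype="float32", samplerate=sample_rate) as mic:
    print("Started! Please speak")
    while True:
        samples, _ = mic.read(int(0.1 * sample_rate))   # 100 ms per read
        stream.accept_waveform(sample_rate, samples.reshape(-1))
        while recognizer.is_ready(stream):
            recognizer.decode_stream(stream)
        print("\r" + recognizer.get_result(stream), end="", flush=True)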