20. TTS
TTS (Text To Speech) converts text into speech, i.e. speech synthesis.
20.1. MMS-TTS
MMS-TTS is part of Facebook's Massively Multilingual Speech (MMS) project, which aims to bring speech technology to a very wide range of languages.
The mms-tts models are built on the VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) architecture and can convert text into high-quality speech output.
20.1.1. Basic usage of mms-tts-eng
Install transformers and the related libraries in a new environment, then test mms-tts-eng.
conda create -n tts python=3.10
conda activate tts
# install the required libraries
pip install --upgrade transformers accelerate
Fetch the facebook/mms-tts-eng model files (optional).
git lfs install
# sudo apt update && sudo apt install git-lfs
git clone https://huggingface.co/facebook/mms-tts-eng
# or use a mirror site
git clone https://hf-mirror.com/facebook/mms-tts-eng
Referring to the example program, create a Python program file:
import torch
from transformers import VitsTokenizer, VitsModel, set_seed

#tokenizer = VitsTokenizer.from_pretrained("path/to/mms-tts-eng")
#model = VitsModel.from_pretrained("path/to/mms-tts-eng")
tokenizer = VitsTokenizer.from_pretrained("facebook/mms-tts-eng")
model = VitsModel.from_pretrained("facebook/mms-tts-eng")

inputs = tokenizer(text="Hello - my dog is cute", return_tensors="pt")

set_seed(555)  # make deterministic
with torch.no_grad():
    outputs = model(**inputs)

waveform = outputs.waveform[0]

# save the result as a .wav file
import scipy
scipy.io.wavfile.write("techno.wav", rate=model.config.sampling_rate, data=waveform.numpy())
Change the model path to the directory of the manually fetched facebook/mms-tts-eng model files. If you did not fetch the files manually, set the model path to "facebook/mms-tts-eng" and the files will be downloaded automatically when the program runs.
Running the program converts the configured English text into speech and saves it to the techno.wav file.
# run the test
(tts) llh@llh:/xxx$ python test.py
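The waveform returned by the model is floating point in [-1, 1], and scipy writes it as a float WAV, which some players handle poorly. A minimal standard-library sketch of converting such samples to 16-bit PCM before saving (the 16 kHz rate and the synthetic tone are stand-ins for the model output):

```python
import array
import math
import wave

def save_wav_int16(path, samples, rate):
    """Clamp float samples to [-1, 1], convert to 16-bit PCM, write a mono WAV."""
    pcm = array.array("h", (int(max(-1.0, min(1.0, s)) * 32767) for s in samples))
    with wave.open(path, "wb") as f:
        f.setnchannels(1)     # mono, as produced by the model
        f.setsampwidth(2)     # 16-bit samples
        f.setframerate(rate)
        f.writeframes(pcm.tobytes())

# demo input: 0.1 s of a 440 Hz tone at 16 kHz
rate = 16000
tone = [0.5 * math.sin(2 * math.pi * 440 * n / rate) for n in range(rate // 10)]
save_wav_int16("tone.wav", tone, rate)
```

In the test program above you would pass waveform.numpy() and model.config.sampling_rate instead of the synthetic tone.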
20.1.2. Model conversion
1. Convert to an ONNX model
Export the ONNX model by following the instructions in export_onnx.md from rknn_model_zoo.
In the environment used earlier to test mms-tts-eng, fetch modeling_vits_for_export_onnx.py from rknn_model_zoo and copy it over the modeling_vits.py file in the vits model directory of the transformers sources, e.g. /home/xxx/anaconda3/envs/tts/lib/python3.10/site-packages/transformers/models/vits/modeling_vits.py, then run the export_onnx.py program.
(tts) llh@llh:/xxx$ mkdir ../model
# if you fetched the mms-tts-eng model files yourself, change the model path in export_onnx.py
model, tokenizer = setup_model("path/to/mms-tts-eng")
# run export_onnx.py to export the ONNX models; the --max_length parameter can be changed (100, 200, 300)
(tts) llh@llh:/xxx$ python export_onnx.py --max_length 200
This generates the mms_tts_eng_decoder_200.onnx and mms_tts_eng_encoder_200.onnx files in the model directory.
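The transformers install path to overwrite differs from one environment to another; a small sketch for locating the modeling_vits.py file that needs to be replaced, which works whether or not transformers is installed:

```python
import importlib.util

def vits_module_path():
    """Return the install path of transformers' VITS module, or None if absent."""
    try:
        spec = importlib.util.find_spec("transformers.models.vits.modeling_vits")
    except ImportError:
        return None  # transformers itself is not installed
    return spec.origin if spec else None

print(vits_module_path())
```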
Tip
If the pytorch or onnx version in the conversion environment differs from the one listed in the export_onnx.md file of rknn_model_zoo, errors such as RuntimeError: Trying to create tensor with negative dimension -2 may occur; try changing the opset_version parameter of torch.onnx.export.
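For reference, the exported graphs have fixed shapes tied to --max_length: with max_length 200 the demo later reports a 400-frame decoder input and a 102400-sample waveform. A sketch of that relationship, where the 2x token-to-frame expansion and 256x vocoder upsampling factors are assumptions inferred from those tensor dimensions, not taken from the export scripts:

```python
def fixed_shapes(max_length, frame_factor=2, upsample=256):
    """Decoder frame count and waveform length implied by a given max_length.

    frame_factor and upsample are inferred from the tensor dimensions printed
    by the demo (400 frames, 102400 samples for max_length=200); check
    export_onnx.py for the authoritative values.
    """
    frames = max_length * frame_factor
    samples = frames * upsample
    return frames, samples

print(fixed_shapes(200))  # (400, 102400)
```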
2. Convert to an RKNN model
Use the toolkit2 tool to convert the ONNX models to RKNN models; for the conversion program refer to rknn_model_zoo, and set up the toolkit2 environment by following the earlier tutorial.
# run convert.py to export the RKNN models; the tutorial was tested on a LubanCat 4
(toolkit2.3) llh@llh:/xxx$ python convert.py ./mms_tts_eng_encoder_200.onnx rk3588
I rknn-toolkit2 version: 2.3.0
--> Config model
done
--> Loading model
I Loading : 100%|██████████████████████████████████████████████| 240/240 [00:00<00:00, 20289.29it/s]
W load_onnx: The config.mean_values is None, zeros will be set for input 0!
W load_onnx: The config.std_values is None, ones will be set for input 0!
W load_onnx: The config.mean_values is None, zeros will be set for input 1!
W load_onnx: The config.std_values is None, ones will be set for input 1!
W load_onnx: The config.mean_values is None, zeros will be set for input 2!
W load_onnx: The config.std_values is None, ones will be set for input 2!
W load_onnx: The config.mean_values is None, zeros will be set for input 3!
W load_onnx: The config.std_values is None, ones will be set for input 3!
done
--> Building model
I OpFusing 0: 100%|██████████████████████████████████████████████| 100/100 [00:00<00:00, 313.77it/s]
I OpFusing 1 : 100%|█████████████████████████████████████████████| 100/100 [00:00<00:00, 103.39it/s]
I OpFusing 0 : 100%|██████████████████████████████████████████████| 100/100 [00:01<00:00, 70.28it/s]
I OpFusing 1 : 100%|██████████████████████████████████████████████| 100/100 [00:01<00:00, 68.71it/s]
I OpFusing 2 : 100%|██████████████████████████████████████████████| 100/100 [00:03<00:00, 25.14it/s]
I rknn building ...
I rknn building done.
done
--> Export rknn model
done
(toolkit2.3) llh@llh:/xxx$ python convert.py ./mms_tts_eng_decoder_200.onnx rk3588
I rknn-toolkit2 version: 2.3.0
--> Config model
done
--> Loading model
I Loading : 100%|█████████████████████████████████████████████| 851/851 [00:00<00:00, 218082.28it/s]
done
--> Building model
W build: For tensor ['793'], the value smaller than -3e+38 has been corrected to -10000. Set opt_level to 2 or lower to disable this correction.
I OpFusing 0: 100%|███████████████████████████████████████████████| 100/100 [00:01<00:00, 94.64it/s]
I OpFusing 1 : 100%|██████████████████████████████████████████████| 100/100 [00:02<00:00, 33.52it/s]
I OpFusing 0 : 100%|██████████████████████████████████████████████| 100/100 [00:05<00:00, 19.85it/s]
I OpFusing 1 : 100%|██████████████████████████████████████████████| 100/100 [00:05<00:00, 18.75it/s]
I OpFusing 0 : 100%|██████████████████████████████████████████████| 100/100 [00:07<00:00, 14.09it/s]
I OpFusing 1 : 100%|██████████████████████████████████████████████| 100/100 [00:07<00:00, 13.98it/s]
I OpFusing 2 : 100%|██████████████████████████████████████████████| 100/100 [00:07<00:00, 13.79it/s]
I OpFusing 0 : 100%|██████████████████████████████████████████████| 100/100 [00:07<00:00, 12.75it/s]
I OpFusing 1 : 100%|██████████████████████████████████████████████| 100/100 [00:07<00:00, 12.65it/s]
I OpFusing 2 : 100%|██████████████████████████████████████████████| 100/100 [00:08<00:00, 11.76it/s]
I rknn building ...
# output omitted.....................
I rknn building done.
done
--> Export rknn model
done
20.1.3. Running the example
On the board, fetch the example from rknn_model_zoo, then place the RKNN models obtained earlier into the model directory:
# install git and other tools
sudo apt update
sudo apt install git make gcc g++ libsndfile1-dev
# fetch the rknn_model_zoo examples; see the project's README for the actual build steps
git clone https://github.com/airockchip/rknn_model_zoo.git
Switch to the rknn_model_zoo directory and modify the 3rdparty/CMakeLists.txt file so that the system-installed libsndfile library is used instead of the bundled one.
# libsndfile
#set(LIBSNDFILE_PATH ${CMAKE_CURRENT_SOURCE_DIR}/libsndfile)
#set(LIBSNDFILE_INCLUDES ${LIBSNDFILE_PATH}/include PARENT_SCOPE)
#set(LIBSNDFILE ${LIBSNDFILE_PATH}/${CMAKE_SYSTEM_NAME}/${TARGET_LIB_ARCH}/libsndfile.a PARENT_SCOPE)
set(LIBSNDFILE_PATH /usr/)
set(LIBSNDFILE_INCLUDES ${LIBSNDFILE_PATH}/include PARENT_SCOPE)
set(LIBSNDFILE ${LIBSNDFILE_PATH}/lib/aarch64-linux-gnu/libsndfile.so PARENT_SCOPE)
Then build the example (the tutorial was tested on a LubanCat 4 running Ubuntu, with the platform parameter set to rk3588), which produces the rknn_mms_tts_demo executable.
cat@lubancat:~$ cd rknn_model_zoo
# the -t parameter sets the platform; the tutorial uses a lubancat-4, so rk3588
cat@lubancat:/xxx/rknn_model_zoo$ ./build-linux.sh -t rk3588 -a aarch64 -d mms_tts
./build-linux.sh -t rk3588 -a aarch64 -d mms_tts
aarch64-linux-gnu
===================================
BUILD_DEMO_NAME=mms_tts
BUILD_DEMO_PATH=examples/mms_tts/cpp
TARGET_SOC=rk3588
TARGET_ARCH=aarch64
BUILD_TYPE=Release
ENABLE_ASAN=OFF
DISABLE_RGA=OFF
DISABLE_LIBJPEG=OFF
INSTALL_DIR=/home/cat/rknn_model_zoo/install/rk3588_linux_aarch64/rknn_mms_tts_demo
BUILD_DIR=/home/cat/rknn_model_zoo/build/build_rknn_mms_tts_demo_rk3588_linux_aarch64_Release
CC=aarch64-linux-gnu-gcc
CXX=aarch64-linux-gnu-g++
===================================
# output omitted...............................
[ 16%] Built target imagedrawing
[ 25%] Building C object utils.out/CMakeFiles/audioutils.dir/audio_utils.c.o
[ 58%] Built target fileutils
[ 58%] Built target imageutils
[ 66%] Linking C static library libaudioutils.a
[ 66%] Built target audioutils
[ 83%] Building CXX object CMakeFiles/rknn_mms_tts_demo.dir/rknpu2/mms_tts.cc.o
[ 83%] Building CXX object CMakeFiles/rknn_mms_tts_demo.dir/process.cc.o
[ 91%] Building CXX object CMakeFiles/rknn_mms_tts_demo.dir/main.cc.o
[100%] Linking CXX executable rknn_mms_tts_demo
[100%] Built target rknn_mms_tts_demo
[ 16%] Built target audioutils
[ 33%] Built target fileutils
[ 66%] Built target rknn_mms_tts_demo
[ 83%] Built target imagedrawing
[100%] Built target imageutils
# output omitted...............................
Run the rknn_mms_tts_demo example:
cd install/rk3588_linux_aarch64/rknn_mms_tts_demo/
# command usage
./rknn_mms_tts_demo <encoder_path> <decoder_path> <input_text>
# test on a lubancat-4
cat@lubancat:~/xxx$ ./rknn_mms_tts_demo ../../../model/mms_tts_eng_encoder_200.rknn
../../../model/mms_tts_eng_decoder_200.rknn '"Mister quilter is the apostle of the middle classes and we are glad to welcome his gospel."'
model input num: 2, output num: 4
input tensors:
index=0, name=input_ids, n_dims=2, dims=[1, 200], n_elems=200, size=1600, fmt=UNDEFINED, type=INT64, qnt_type=AFFINE, zp=0, scale=1.000000
index=1, name=attention_mask, n_dims=2, dims=[1, 200], n_elems=200, size=1600, fmt=UNDEFINED, type=INT64, qnt_type=AFFINE, zp=0, scale=1.000000
output tensors:
index=0, name=log_duration, n_dims=3, dims=[1, 1, 200], n_elems=200, size=400, fmt=UNDEFINED, type=FP16, qnt_type=AFFINE, zp=0, scale=1.000000
index=1, name=input_padding_mask, n_dims=3, dims=[1, 1, 200], n_elems=200, size=400, fmt=UNDEFINED, type=FP16, qnt_type=AFFINE, zp=0, scale=1.000000
index=2, name=prior_means, n_dims=3, dims=[1, 200, 192], n_elems=38400, size=76800, fmt=UNDEFINED, type=FP16, qnt_type=AFFINE, zp=0, scale=1.000000
index=3, name=prior_log_variances, n_dims=3, dims=[1, 200, 192], n_elems=38400, size=76800, fmt=UNDEFINED, type=FP16, qnt_type=AFFINE, zp=0, scale=1.000000
-- init_mms_tts_encoder_model use: 85.587997 ms
model input num: 4, output num: 1
input tensors:
index=0, name=attn, n_dims=4, dims=[1, 400, 200, 1], n_elems=80000, size=160000, fmt=NHWC, type=FP16, qnt_type=AFFINE, zp=0, scale=1.000000
index=1, name=output_padding_mask, n_dims=3, dims=[1, 1, 400], n_elems=400, size=800, fmt=UNDEFINED, type=FP16, qnt_type=AFFINE, zp=0, scale=1.000000
index=2, name=prior_means, n_dims=3, dims=[1, 200, 192], n_elems=38400, size=76800, fmt=UNDEFINED, type=FP16, qnt_type=AFFINE, zp=0, scale=1.000000
index=3, name=prior_log_variances, n_dims=3, dims=[1, 200, 192], n_elems=38400, size=76800, fmt=UNDEFINED, type=FP16, qnt_type=AFFINE, zp=0, scale=1.000000
output tensors:
index=0, name=waveform, n_dims=2, dims=[1, 102400], n_elems=102400, size=204800, fmt=UNDEFINED, type=FP16, qnt_type=AFFINE, zp=0, scale=1.000000
-- init_mms_tts_decoder_model use: 158.074005 ms
-- read_vocab use: 0.013000 ms
-- inference_mms_tts_model use: 669.695007 ms
Real Time Factor (RTF): 0.670 / 6.400 = 0.105
The output wav file is saved: output.wav
The synthesized speech is saved to output.wav in the current directory.
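The Real Time Factor (RTF) printed by the demo is the inference time divided by the duration of the generated audio; values below 1.0 mean synthesis runs faster than real time. A small sketch reproducing the figure from the log above (102400 samples at the model's 16 kHz sampling rate):

```python
def real_time_factor(infer_seconds, audio_samples, sample_rate):
    """RTF = processing time / audio duration; < 1.0 is faster than real time."""
    audio_seconds = audio_samples / sample_rate
    return infer_seconds / audio_seconds, audio_seconds

# values from the demo log: 0.670 s inference, 102400 samples at 16 kHz
rtf, duration = real_time_factor(0.670, 102400, 16000)
print(f"{rtf:.3f} ({duration:.1f} s of audio)")  # 0.105 (6.4 s of audio)
```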