VITS: Conditional Variational Autoencoder with Adversarial Learning
Project: https://github.com/CjangCjengh/vits.git
Dataset: ModelScope (魔搭社区)
References:
- https://zhuanlan.zhihu.com/p/679234403
- https://east.moe/archives/1342
1. MP3 to WAV
The VITS project expects audio in WAV format: mono, 22050 Hz, 16-bit PCM. The following code does the conversion:
import os
from tqdm import tqdm
from pydub import AudioSegment


def audio_formater(input_files, target_sr=22050, pcm=16):
    """Convert m4a (or other formats) to WAV: mono, 22050 Hz, 16-bit PCM."""
    for input_file in tqdm(input_files):
        file_prefix, file_format = os.path.splitext(input_file)
        file_format = file_format.lstrip('.').lower()
        # Convert the container format to WAV first.
        if file_format != 'wav':
            audio = AudioSegment.from_file(input_file)
            input_file = f'{file_prefix}.wav'
            audio.export(input_file, format='wav')
        # Resample to mono / target_sr / 16-bit PCM with ffmpeg.
        source_file = f'{file_prefix}.wav'
        target_file = f'{file_prefix}_resample.wav'
        cmd_str = f"ffmpeg -i {source_file} -ac 1 -ar {target_sr} -acodec pcm_s16le {target_file} -y"
        os.system(cmd_str)


input_files = [
    "your_path/audio1.m4a",
    "your_path/audio2.m4a",
]
audio_formater(input_files)
2. Slice the audio
The long recordings need to be cut into short clips, otherwise training becomes too heavy; use the audio-slicer project for this. Audio Slicer is an open-source audio processing tool that splits audio files by simple but effective silence detection, following preset conditions and parameters, and writes out the resulting clips. Clone the project from GitHub: https://github.com/openvpi/audio-slicer, set up its environment, and then run the following code:
import os
import sys

import librosa
import soundfile
from tqdm import tqdm

sys.path.append('./audio-slicer')
from slicer2 import Slicer


def audio_slicer(input_files):
    '''Slice the given WAV files into short clips.
    Ref: https://github.com/openvpi/audio-slicer
    Slicer parameters:
        threshold (dB): anything below this level is treated as silence and cut
        min_length (ms): minimum length of a clip
        min_interval (ms): minimum interval between cuts
        hop_size (ms): roughly the precision (10-20 is fine)
        max_sil_kept (ms): maximum silence kept around each clip
    '''
    for input_file in tqdm(input_files):
        audio, sr = librosa.load(input_file, sr=None, mono=True)  # Load an audio file with librosa.
        file_prefix = os.path.splitext(input_file)[0]
        # Prepare the output directory.
        save_path = f'{file_prefix}_clips'
        if not os.path.exists(save_path):
            os.makedirs(save_path)
        slicer = Slicer(
            sr=sr,
            threshold=-40,
            min_length=5000,
            min_interval=300,
            hop_size=10,
            max_sil_kept=1000)
        chunks = slicer.slice(audio)
        for i, chunk in enumerate(chunks):
            if len(chunk.shape) > 1:
                chunk = chunk.T  # Swap axes if the audio is stereo.
            # Save sliced audio files with soundfile.
            soundfile.write(f'{save_path}/audio_{i}.wav', chunk, sr)


input_files = [
    "your_path/audio1.wav",
    "your_path/audio2.wav",
]
audio_slicer(input_files)
3. Speech recognition & dataset split
Transcribe the sliced clips into text. Open-source ASR models can do this automatically, for example FunASR, Whisper-tiny, Whisper-base, or OmniSenseVoice.
import glob
import os
import time

import pandas as pd
import whisper


def calculate_time(func):
    def wrapper(*args, **kwargs):
        start_time = time.time()
        result = func(*args, **kwargs)
        end_time = time.time()
        print(f"Function {func.__name__} took {end_time - start_time:.2f} s")
        return result
    return wrapper


class WhisperASR:
    def __init__(self, model_path):
        self.LANGUAGES = {
            "en": "english",
            "zh": "chinese",
        }
        self.model = whisper.load_model(model_path)

    @calculate_time
    def transcribe(self, audio_file):
        result = self.model.transcribe(audio_file)
        return result["text"]


def get_wav_files(directory):
    # Make sure the directory exists.
    if not os.path.isdir(directory):
        raise ValueError(f"Directory does not exist: {directory}")
    # Collect all .wav files.
    wav_files = glob.glob(os.path.join(directory, "*.wav"))
    return wav_files


if __name__ == "__main__":
    # Build the ASR object once and transcribe every clip.
    model_path = "base.pt"  # path to the Whisper checkpoint
    audio_files = get_wav_files("D:\\PyCharmWorkSpace\\vits\\data\\20200327_2P_lenovo_iphonexr_66902_clips")
    asr = WhisperASR(model_path)
    file_paths, transcriptions = [], []
    for audio_file in audio_files:
        text = asr.transcribe(audio_file).strip()
        if not text.endswith('。'):
            text += '。'
        file_paths.append(audio_file)
        # Note: adjust the language tag as needed, e.g. [ZH] for Chinese, [EN] for English.
        transcriptions.append('[ZH]' + text + '[ZH]')
    annot_df = pd.DataFrame({'file_path': file_paths, 'transcription': transcriptions})
Prepare the training and validation sets for model training.
train_ratio = 0.85
train_len = int(len(annot_df) * train_ratio)

# Split the metadata.
train_metadata = annot_df.iloc[:train_len, :]
test_metadata = annot_df.iloc[train_len:, :]
print(f'Train:{len(train_metadata)}')
print(f'Test:{len(test_metadata)}')

# Save the results.
save_path = 'your_path'
all_annot_path = f'{save_path}/canton_single_speaker_all_filelist.txt'
train_annot_path = f'{save_path}/canton_single_speaker_train_filelist.txt'
test_annot_path = f'{save_path}/canton_single_speaker_val_filelist.txt'
annot_df.to_csv(all_annot_path, index=False, header=False, sep='|')
train_metadata.to_csv(train_annot_path, index=False, header=False, sep='|')
test_metadata.to_csv(test_annot_path, index=False, header=False, sep='|')
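For reference, each line of the resulting single-speaker filelist has the form audio_path|text; the paths and sentences below are made-up examples of what the file should look like:

your_path/audio_0.wav|[ZH]今天天气不错。[ZH]
your_path/audio_1.wav|[ZH]我们开始训练吧。[ZH]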
4. Preprocess the data
Run preprocess.py from the VITS project; just edit the argument defaults by hand:
parser.add_argument("--filelists", nargs="+", default=["D:\\PyCharmWorkSpace\\vits\\data\\data_final\\canton_single_speaker_train_filelist.txt", "D:\\PyCharmWorkSpace\\vits\\data\\data_final\\canton_single_speaker_val_filelist.txt"]) parser.add_argument("--text_cleaners", nargs="+", default=["chinese_cleaners"])
Error at opencc.OpenCC('jyutjyu'): the jyutjyu OpenCC config is missing.
Fix: download it from https://github.com/Keith-Hon/vits-cantonese; the file lives in that repo's opencc directory.
Error at converter = opencc.OpenCC('zaonhe')
Fix: comment out the Shanghainese and Cantonese conversions; the config files for these two dialects couldn't be found online.
# from text.shanghainese import shanghainese_to_ipa
# from text.cantonese import cantonese_to_ipa
# text = re.sub(r'\[SH\](.*?)\[SH\]', lambda x: shanghainese_to_ipa(x.group(1)).replace('1', '˥˧').replace('5', '˧˧˦').replace('6', '˩˩˧').replace('7', '˥').replace('8', '˩˨').replace('ᴀ', 'ɐ').replace('ᴇ', 'e')+' ', text)
# text = re.sub(r'\[GD\](.*?)\[GD\]', lambda x: cantonese_to_ipa(x.group(1))+' ', text)
5. Model training
First, open the training config file configs/chinese_base.json in the VITS project and modify the following:
"training_files":"filelists/juzi_train_filelist.txt.cleaned", "validation_files":"filelists/juzi_val_filelist.txt.cleaned", # 如果是单个人说话,设置 0 "n_speakers": 0, # 如果已经预处理了,设置 true "cleaned_text": true
train parameters:
- log_interval: logging interval (steps between progress logs);
- seed: random seed; the same seed and dataset reproduce the same model parameters;
- epochs: number of training epochs;
- learning_rate: learning rate;
- batch_size: batch size; set it according to your GPU memory (for example, 24 eats close to 10 GB of VRAM);
- fp16_run: whether to train in half precision; it saves VRAM and speeds training up somewhat, depending on your GPU's FP16 throughput.
data parameters:
- training_files: the speech-text filelist used for training;
- validation_files: the speech-text filelist used for validation;
- text_cleaners: the text cleaners; pick the ones matching the language of your speech;
- n_speakers: the number of speakers in the data; use 0 for a single speaker;
- cleaned_text: whether the text has already been preprocessed; true if it has, false otherwise.
speakers: the names or IDs of the speakers; if you only train a single speaker, this line is unnecessary and can simply be deleted;
symbols: characters outside the current language, plus punctuation that carries intonation. The list must stay consistent with the symbols of the chosen text_cleaner and must be Unicode-encoded; the snippet below prints such a list, and an illustrative config excerpt follows it.
import demjson3

# chinese_dialect_cleaners
_pad = '_'
_punctuation = ',.!?~…─'
_letters = '#Nabdefghijklmnoprstuvwxyzæçøŋœȵɐɑɒɓɔɕɗɘəɚɛɜɣɤɦɪɭɯɵɷɸɻɾɿʂʅʊʋʌʏʑʔʦʮʰʷˀː˥˦˧˨˩̥̩̃̚ᴀᴇ↑↓∅ⱼ '

# Export all symbols:
symbols = [_pad] + list(_punctuation) + list(_letters)

moetts_symbols_text = demjson3.encode(symbols)
print(moetts_symbols_text)
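Putting the pieces together, the relevant parts of chinese_base.json end up looking roughly like the excerpt below. This is only an illustrative sketch: the numeric values are placeholders, text_cleaners should match whatever preprocess.py used (chinese_cleaners above), and the symbols list should be the exact output of the snippet above.

"train": {
  "log_interval": 200,
  "seed": 1234,
  "epochs": 10000,
  "learning_rate": 2e-4,
  "batch_size": 24,
  "fp16_run": true,
  ...
},
"data": {
  "training_files": "filelists/canton_single_speaker_train_filelist.txt.cleaned",
  "validation_files": "filelists/canton_single_speaker_val_filelist.txt.cleaned",
  "text_cleaners": ["chinese_cleaners"],
  "n_speakers": 0,
  "cleaned_text": true,
  ...
},
"symbols": ["_", ",", ".", "!", "?", "~", "…", "─", "#", "N", "a", ...]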
【Single-speaker training】
python train.py -c D:\PyCharmWorkSpace\vits\configs\chinese_base.json -m D:\PyCharmWorkSpace\vits\model
【Multi-speaker training】
python train_ms.py -c <config> -m <model_save_folder>
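For multi-speaker training the filelists need an extra speaker-ID column, i.e. audio_path|speaker_id|text (the format used by the upstream vctk_audio_sid_text_* filelists), and n_speakers in the config must be set to the actual number of speakers. A made-up example line:

your_path/speaker0/audio_0.wav|0|[ZH]今天天气不错。[ZH]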
【Modifications】(needed on Windows, where the NCCL backend is not available)
- Add an environment variable:
os.environ["PL_TORCH_DISTRIBUTED_BACKEND"] = "gloo"
- Change the backend from nccl to gloo:
dist.init_process_group(backend='nccl', init_method='env://', world_size=n_gpus, rank=rank)
👇
dist.init_process_group(backend='gloo', init_method='env://', world_size=n_gpus, rank=rank)
【Error】AttributeError: module 'numpy' has no attribute 'object'
【Fix】Upgrade tensorboard (the pinned 2.3.0 is too old); that version only works with NumPy releases before 1.24, which still shipped the np.object alias.
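For example (assuming pip manages the environment):

pip install -U tensorboard

A reasonably recent TensorBoard release no longer references the removed np.object alias.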
【Error】ModuleNotFoundError: No module named 'monotonic_align.core'
- Run setup.py to build the extension in place:
python setup.py build_ext --inplace
- If it then fails with core.c(208): fatal error C1083: Cannot open include file: "longintrepr.h": No such file or directory, fixing that build is quite involved; just install the prebuilt package instead:
pip install monotonic-align
【Error】RuntimeError: use_libuv was requested but PyTorch was build without libuv support
os.environ["USE_LIBUV"] = "0"
【Error】IndexError: Replacement index 2 out of range for positional args tuple
【Fix】Modify data_utils.TextAudioLoader.get_audio:
raise ValueError("{} {} SR doesn't match target {} SR".format(sampling_rate, self.sampling_rate))
👇
raise ValueError("{} {} SR doesn't match target SR".format(sampling_rate, self.sampling_rate))
【Error】ValueError: 8000 22050 SR doesn't match target SR. The clip's sampling rate doesn't match the target; resample the offending clips to 22050 Hz, for example with the script below:
#!/bin/bash

# Input and output directories
INPUT_DIR="D:\\PyCharmWorkSpace\\vits\\data\\20200327_2P_lenovo_iphonexr_66902_clips"
OUTPUT_DIR="D:\\PyCharmWorkSpace\\vits\\data\\20200327_2P_lenovo_iphonexr_66902_clips_22050"

# Create the output directory if it does not exist
mkdir -p "$OUTPUT_DIR"

# Loop over every .wav file in the input directory
for input_file in "$INPUT_DIR"/*.wav; do
    # File name without path and extension
    filename=$(basename "$input_file" .wav)
    # Output file path
    output_file="$OUTPUT_DIR/${filename}.wav"
    # Run the ffmpeg conversion
    ffmpeg -i "$input_file" -ar 22050 "$output_file"
done
【Error】RuntimeError: stft requires the return_complex parameter be given for real inputs, and will further require that return_complex=True in a future PyTorch release.
# Modify mel_processing.py: pass return_complex explicitly
spec = torch.stft(y, n_fft, hop_length=hop_size, win_length=win_size,
                  window=hann_window[str(y.device)], center=center,
                  pad_mode='reflect', normalized=False, onesided=True,
                  return_complex=False)
【Error】AttributeError: 'FigureCanvasAgg' object has no attribute 'tostring_rgb'
【Fix】Pin matplotlib to an older release:
pip install matplotlib==3.7.0
【Error】...enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`
# Around line 104 of train.py, add find_unused_parameters=True to avoid the error
net_g = DDP(net_g, device_ids=[rank], find_unused_parameters=True)
net_d = DDP(net_d, device_ids=[rank], find_unused_parameters=True)
6. Model inference
Run inference.ipynb; use absolute paths everywhere!
utils.load_checkpoint("/path/to/model.pth", net_g, None)
👇
utils.load_checkpoint("/path/to/G_0.pth", net_g, None)  # use a saved generator checkpoint, e.g. G_0.pth
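For orientation, the core of that notebook boils down to roughly the sketch below (single-speaker case). It follows the upstream VITS inference notebook; in this fork the symbols live in the config and the text_to_sequence signature may differ slightly, so treat it as a sketch and check the actual inference.ipynb. The config/checkpoint paths and the test sentence are placeholders.

import torch
import commons
import utils
from models import SynthesizerTrn
from text import text_to_sequence
from scipy.io.wavfile import write

# Load hyperparameters and build the generator (single-speaker config).
hps = utils.get_hparams_from_file("D:/PyCharmWorkSpace/vits/configs/chinese_base.json")
net_g = SynthesizerTrn(
    len(hps.symbols),
    hps.data.filter_length // 2 + 1,
    hps.train.segment_size // hps.data.hop_length,
    **hps.model).cuda()
net_g.eval()
utils.load_checkpoint("D:/PyCharmWorkSpace/vits/model/G_0.pth", net_g, None)

# Text -> symbol-ID sequence, using the same cleaners as during preprocessing.
text = "[ZH]你好,这是一段测试语音。[ZH]"
seq = text_to_sequence(text, hps.data.text_cleaners)
if hps.data.add_blank:
    seq = commons.intersperse(seq, 0)
x = torch.LongTensor(seq).unsqueeze(0).cuda()
x_lengths = torch.LongTensor([x.size(1)]).cuda()

# Synthesize and save the waveform.
with torch.no_grad():
    audio = net_g.infer(x, x_lengths, noise_scale=.667, noise_scale_w=0.8,
                        length_scale=1)[0][0, 0].cpu().float().numpy()
write("output.wav", hps.data.sampling_rate, audio)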