VITS: Conditional Variational Autoencoder with Adversarial Learning
Project: https://github.com/CjangCjengh/vits.git
Dataset: ModelScope (魔搭社区)
References:
- https://zhuanlan.zhihu.com/p/679234403
- https://east.moe/archives/1342
1. MP3 to WAV
The VITS project expects audio in WAV format: mono, 22050 Hz, 16-bit PCM. The following code does the conversion:
import os
from tqdm import tqdm
from pydub import AudioSegment


def audio_formater(input_files, target_sr=22050, pcm=16):
    """Convert m4a (or other formats) to WAV: mono, 22050 Hz, 16-bit PCM."""
    for input_file in tqdm(input_files):
        file_prefix, file_format = os.path.splitext(input_file)
        file_format = file_format.lstrip('.').lower()
        # Convert the container format to WAV first.
        if file_format != 'wav':
            audio = AudioSegment.from_file(input_file)
            input_file = f'{file_prefix}.wav'
            audio.export(input_file, format='wav')
        # Resample to mono / target_sr / 16-bit PCM with ffmpeg.
        source_file = f'{file_prefix}.wav'
        target_file = f'{file_prefix}_resample.wav'
        cmd_str = f"ffmpeg -i {source_file} -ac 1 -ar {target_sr} -acodec pcm_s16le {target_file} -y"
        os.system(cmd_str)


input_files = [
    "your_path/audio1.m4a",
    "your_path/audio2.m4a",
]
audio_formater(input_files)
2. Slice the audio
The long recordings need to be cut into short clips, otherwise training becomes too heavy; use the audio-slicer project for this. Audio Slicer is an open-source audio processing tool that splits audio files by simple but effective silence detection, following preset conditions and parameters, and writes out the resulting clips. Clone the project from GitHub: https://github.com/openvpi/audio-slicer, set up its environment, and then run the following code:
import os
import sys

import librosa
import soundfile
from tqdm import tqdm

sys.path.append('./audio-slicer')
from slicer2 import Slicer


def audio_slicer(input_files):
    '''Slice the given WAV files into short clips.
    Ref: https://github.com/openvpi/audio-slicer
    Slicer parameters:
        threshold (dB): anything below this level is treated as silence and cut
        min_length (ms): minimum length of a clip
        min_interval (ms): minimum interval between cuts
        hop_size (ms): roughly the precision (10-20 is fine)
        max_sil_kept (ms): maximum silence kept around each clip
    '''
    for input_file in tqdm(input_files):
        audio, sr = librosa.load(input_file, sr=None, mono=True)  # Load an audio file with librosa.
        file_prefix = os.path.splitext(input_file)[0]
        # Prepare the output directory.
        save_path = f'{file_prefix}_clips'
        if not os.path.exists(save_path):
            os.makedirs(save_path)
        slicer = Slicer(
            sr=sr,
            threshold=-40,
            min_length=5000,
            min_interval=300,
            hop_size=10,
            max_sil_kept=1000)
        chunks = slicer.slice(audio)
        for i, chunk in enumerate(chunks):
            if len(chunk.shape) > 1:
                chunk = chunk.T  # Swap axes if the audio is stereo.
            # Save sliced audio files with soundfile.
            soundfile.write(f'{save_path}/audio_{i}.wav', chunk, sr)


input_files = [
    "your_path/audio1.wav",
    "your_path/audio2.wav",
]
audio_slicer(input_files)
3. Speech recognition & dataset split
Transcribe the sliced clips into text. Open-source ASR models can do this automatically, for example FunASR, Whisper-tiny, Whisper-base, or OmniSenseVoice.
import glob
import os
import time

import pandas as pd
import whisper


def calculate_time(func):
    def wrapper(*args, **kwargs):
        start_time = time.time()
        result = func(*args, **kwargs)
        end_time = time.time()
        print(f"Function {func.__name__} took {end_time - start_time:.2f} s")
        return result
    return wrapper


class WhisperASR:
    def __init__(self, model_path):
        self.LANGUAGES = {
            "en": "english",
            "zh": "chinese",
        }
        self.model = whisper.load_model(model_path)

    @calculate_time
    def transcribe(self, audio_file):
        result = self.model.transcribe(audio_file)
        return result["text"]


def get_wav_files(directory):
    # Make sure the directory exists.
    if not os.path.isdir(directory):
        raise ValueError(f"Directory does not exist: {directory}")
    # Collect all .wav files.
    wav_files = glob.glob(os.path.join(directory, "*.wav"))
    return wav_files


if __name__ == "__main__":
    # Build the ASR object once and transcribe every clip.
    model_path = "base.pt"  # path to the Whisper checkpoint
    audio_files = get_wav_files("D:\\PyCharmWorkSpace\\vits\\data\\20200327_2P_lenovo_iphonexr_66902_clips")
    asr = WhisperASR(model_path)
    file_paths, transcriptions = [], []
    for audio_file in audio_files:
        text = asr.transcribe(audio_file).strip()
        if not text.endswith('。'):
            text += '。'
        file_paths.append(audio_file)
        # Note: adjust the language tag as needed, e.g. [ZH] for Chinese, [EN] for English.
        transcriptions.append('[ZH]' + text + '[ZH]')
    annot_df = pd.DataFrame({'file_path': file_paths, 'transcription': transcriptions})
Prepare the training and validation sets for model training.
train_ratio = 0.85
train_len = int(len(annot_df) * train_ratio)

# Split the metadata.
train_metadata = annot_df.iloc[:train_len, :]
test_metadata = annot_df.iloc[train_len:, :]
print(f'Train:{len(train_metadata)}')
print(f'Test:{len(test_metadata)}')

# Save the results.
save_path = 'your_path'
all_annot_path = f'{save_path}/canton_single_speaker_all_filelist.txt'
train_annot_path = f'{save_path}/canton_single_speaker_train_filelist.txt'
test_annot_path = f'{save_path}/canton_single_speaker_val_filelist.txt'
annot_df.to_csv(all_annot_path, index=False, header=False, sep='|')
train_metadata.to_csv(train_annot_path, index=False, header=False, sep='|')
test_metadata.to_csv(test_annot_path, index=False, header=False, sep='|')
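For reference, each line of the resulting single-speaker filelist has the form audio_path|text; the paths and sentences below are made-up examples of what the file should look like:

your_path/audio_0.wav|[ZH]今天天气不错。[ZH]
your_path/audio_1.wav|[ZH]我们开始训练吧。[ZH]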
4. Preprocess the data
Run preprocess.py from the VITS project; just edit the argument defaults by hand:
parser.add_argument("--filelists", nargs="+", default=["D:\\PyCharmWorkSpace\\vits\\data\\data_final\\canton_single_speaker_train_filelist.txt", "D:\\PyCharmWorkSpace\\vits\\data\\data_final\\canton_single_speaker_val_filelist.txt"]) parser.add_argument("--text_cleaners", nargs="+", default=["chinese_cleaners"])
Error at opencc.OpenCC('jyutjyu'): the jyutjyu OpenCC config is missing.
Fix: download it from https://github.com/Keith-Hon/vits-cantonese; the file lives in that repo's opencc directory.
Error at converter = opencc.OpenCC('zaonhe')
Fix: comment out the Shanghainese and Cantonese conversions; the config files for these two dialects couldn't be found online.
# from text.shanghainese import shanghainese_to_ipa
# from text.cantonese import cantonese_to_ipa
# text = re.sub(r'\[SH\](.*?)\[SH\]', lambda x: shanghainese_to_ipa(x.group(1)).replace('1', '˥˧').replace('5', '˧˧˦').replace('6', '˩˩˧').replace('7', '˥').replace('8', '˩˨').replace('ᴀ', 'ɐ').replace('ᴇ', 'e')+' ', text)
# text = re.sub(r'\[GD\](.*?)\[GD\]', lambda x: cantonese_to_ipa(x.group(1))+' ', text)
5. Model training
First, open the training config file configs/chinese_base.json in the VITS project and modify the following:
"training_files":"filelists/juzi_train_filelist.txt.cleaned", "validation_files":"filelists/juzi_val_filelist.txt.cleaned", # 如果是单个人说话,设置 0 "n_speakers": 0, # 如果已经预处理了,设置 true "cleaned_text": true
train parameters:
- log_interval: logging interval (steps between progress logs);
- seed: random seed; the same seed and dataset reproduce the same model parameters;
- epochs: number of training epochs;
- learning_rate: learning rate;
- batch_size: batch size; set it according to your GPU memory (for example, 24 eats close to 10 GB of VRAM);
- fp16_run: whether to train in half precision; it saves VRAM and speeds training up somewhat, depending on your GPU's FP16 throughput.
data parameters:
- training_files: the speech-text filelist used for training;
- validation_files: the speech-text filelist used for validation;
- text_cleaners: the text cleaners; pick the ones matching the language of your speech;
- n_speakers: the number of speakers in the data; use 0 for a single speaker;
- cleaned_text: whether the text has already been preprocessed; true if it has, false otherwise.
speakers: the names or IDs of the speakers; if you only train a single speaker, this line is unnecessary and can simply be deleted;
symbols: characters outside the current language, plus punctuation that carries intonation. The list must stay consistent with the symbols of the chosen text_cleaner and must be Unicode-encoded; the snippet below prints such a list, and an illustrative config excerpt follows it.
import demjson3

# chinese_dialect_cleaners
_pad = '_'
_punctuation = ',.!?~…─'
_letters = '#Nabdefghijklmnoprstuvwxyzæçøŋœȵɐɑɒɓɔɕɗɘəɚɛɜɣɤɦɪɭɯɵɷɸɻɾɿʂʅʊʋʌʏʑʔʦʮʰʷˀː˥˦˧˨˩̥̩̃̚ᴀᴇ↑↓∅ⱼ '

# Export all symbols:
symbols = [_pad] + list(_punctuation) + list(_letters)

moetts_symbols_text = demjson3.encode(symbols)
print(moetts_symbols_text)
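Putting the pieces together, the relevant parts of chinese_base.json end up looking roughly like the excerpt below. This is only an illustrative sketch: the numeric values are placeholders, text_cleaners should match whatever preprocess.py used (chinese_cleaners above), and the symbols list should be the exact output of the snippet above.

"train": {
  "log_interval": 200,
  "seed": 1234,
  "epochs": 10000,
  "learning_rate": 2e-4,
  "batch_size": 24,
  "fp16_run": true,
  ...
},
"data": {
  "training_files": "filelists/canton_single_speaker_train_filelist.txt.cleaned",
  "validation_files": "filelists/canton_single_speaker_val_filelist.txt.cleaned",
  "text_cleaners": ["chinese_cleaners"],
  "n_speakers": 0,
  "cleaned_text": true,
  ...
},
"symbols": ["_", ",", ".", "!", "?", "~", "…", "─", "#", "N", "a", ...]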
【Single-speaker training】
python train.py -c D:\PyCharmWorkSpace\vits\configs\chinese_base.json -m D:\PyCharmWorkSpace\vits\model
【Multi-speaker training】
python train_ms.py -c <config> -m <model_save_folder>
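For multi-speaker training the filelists need an extra speaker-ID column, i.e. audio_path|speaker_id|text (the format used by the upstream vctk_audio_sid_text_* filelists), and n_speakers in the config must be set to the actual number of speakers. A made-up example line:

your_path/speaker0/audio_0.wav|0|[ZH]今天天气不错。[ZH]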
【Modifications】(needed on Windows, where the NCCL backend is not available)
- Add an environment variable:
os.environ["PL_TORCH_DISTRIBUTED_BACKEND"] = "gloo"
- Change the backend from nccl to gloo:
dist.init_process_group(backend='nccl', init_method='env://', world_size=n_gpus, rank=rank)
👇
dist.init_process_group(backend='gloo', init_method='env://', world_size=n_gpus, rank=rank)
【Error】AttributeError: module 'numpy' has no attribute 'object'
【Fix】Upgrade tensorboard (the pinned 2.3.0 is too old); that version only works with NumPy releases before 1.24, which still shipped the np.object alias.
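For example (assuming pip manages the environment):

pip install -U tensorboard

A reasonably recent TensorBoard release no longer references the removed np.object alias.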
【Error】ModuleNotFoundError: No module named 'monotonic_align.core'
- Run setup.py to build the extension in place:
python setup.py build_ext --inplace
- If it then fails with core.c(208): fatal error C1083: Cannot open include file: "longintrepr.h": No such file or directory, fixing that build is quite involved; just install the prebuilt package instead:
pip install monotonic-align
【Error】RuntimeError: use_libuv was requested but PyTorch was build without libuv support
os.environ["USE_LIBUV"] = "0"
【Error】IndexError: Replacement index 2 out of range for positional args tuple
【Fix】Modify data_utils.TextAudioLoader.get_audio:
raise ValueError("{} {} SR doesn't match target {} SR".format(sampling_rate, self.sampling_rate))
👇
raise ValueError("{} {} SR doesn't match target SR".format(sampling_rate, self.sampling_rate))
【Error】ValueError: 8000 22050 SR doesn't match target SR. The clip's sampling rate doesn't match the target; resample the offending clips to 22050 Hz, for example with the script below:
#!/bin/bash

# Input and output directories
INPUT_DIR="D:\\PyCharmWorkSpace\\vits\\data\\20200327_2P_lenovo_iphonexr_66902_clips"
OUTPUT_DIR="D:\\PyCharmWorkSpace\\vits\\data\\20200327_2P_lenovo_iphonexr_66902_clips_22050"

# Create the output directory if it does not exist
mkdir -p "$OUTPUT_DIR"

# Loop over every .wav file in the input directory
for input_file in "$INPUT_DIR"/*.wav; do
    # File name without path and extension
    filename=$(basename "$input_file" .wav)
    # Output file path
    output_file="$OUTPUT_DIR/${filename}.wav"
    # Run the ffmpeg conversion
    ffmpeg -i "$input_file" -ar 22050 "$output_file"
done
【Error】RuntimeError: stft requires the return_complex parameter be given for real inputs, and will further require that return_complex=True in a future PyTorch release.
# Modify mel_processing.py: pass return_complex explicitly
spec = torch.stft(y, n_fft, hop_length=hop_size, win_length=win_size,
                  window=hann_window[str(y.device)], center=center,
                  pad_mode='reflect', normalized=False, onesided=True,
                  return_complex=False)
【Error】AttributeError: 'FigureCanvasAgg' object has no attribute 'tostring_rgb'
【Fix】Pin matplotlib to an older release:
pip install matplotlib==3.7.0
【Error】...enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`
# Around line 104 of train.py, add find_unused_parameters=True to avoid the error
net_g = DDP(net_g, device_ids=[rank], find_unused_parameters=True)
net_d = DDP(net_d, device_ids=[rank], find_unused_parameters=True)
6. Model inference
Run inference.ipynb; use absolute paths everywhere!
utils.load_checkpoint("/path/to/model.pth", net_g, None)
👇
utils.load_checkpoint("/path/to/G_0.pth", net_g, None)  # use a saved generator checkpoint, e.g. G_0.pth
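For orientation, the core of that notebook boils down to roughly the sketch below (single-speaker case). It follows the upstream VITS inference notebook; in this fork the symbols live in the config and the text_to_sequence signature may differ slightly, so treat it as a sketch and check the actual inference.ipynb. The config/checkpoint paths and the test sentence are placeholders.

import torch
import commons
import utils
from models import SynthesizerTrn
from text import text_to_sequence
from scipy.io.wavfile import write

# Load hyperparameters and build the generator (single-speaker config).
hps = utils.get_hparams_from_file("D:/PyCharmWorkSpace/vits/configs/chinese_base.json")
net_g = SynthesizerTrn(
    len(hps.symbols),
    hps.data.filter_length // 2 + 1,
    hps.train.segment_size // hps.data.hop_length,
    **hps.model).cuda()
net_g.eval()
utils.load_checkpoint("D:/PyCharmWorkSpace/vits/model/G_0.pth", net_g, None)

# Text -> symbol-ID sequence, using the same cleaners as during preprocessing.
text = "[ZH]你好,这是一段测试语音。[ZH]"
seq = text_to_sequence(text, hps.data.text_cleaners)
if hps.data.add_blank:
    seq = commons.intersperse(seq, 0)
x = torch.LongTensor(seq).unsqueeze(0).cuda()
x_lengths = torch.LongTensor([x.size(1)]).cuda()

# Synthesize and save the waveform.
with torch.no_grad():
    audio = net_g.infer(x, x_lengths, noise_scale=.667, noise_scale_w=0.8,
                        length_scale=1)[0][0, 0].cpu().float().numpy()
write("output.wav", hps.data.sampling_rate, audio)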