
[Agent Development] Deploying IndexTTS

news 2025/9/6 15:31:49

We want the Agent to be able to "speak", so we deploy a TTS (text-to-speech) model. Here I use IndexTeam/IndexTTS-1.5, developed by the bilibili team.

TTS Deployment

Project: https://github.com/index-tts/index-tts
Model: https://modelscope.cn/models/IndexTeam/IndexTTS-1.5/

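Because downloads from Hugging Face are slow, the model link above points to ModelScope. Its command-line tool can download the model directly inside WSL, which saves me from downloading on the host and then copying the files into WSL.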

Before downloading, install ModelScope with the following command:

pip install modelscope

Then download the model:

modelscope download --model IndexTeam/IndexTTS-1.5 --local_dir ./IndexTTS-1.5

The remaining steps were done on Ubuntu 22.04 under WSL. My GPU is an RTX 4070, and the CUDA version is:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Wed_Oct_30_01:18:48_Pacific_Daylight_Time_2024
Cuda compilation tools, release 12.6, V12.6.85
Build cuda_12.6.r12.6/compiler.35059454_0

First, clone the project:

git clone https://github.com/index-tts/index-tts.git

Prepare a virtual environment for TTS. I don't like conda, so I simply use Python's built-in venv:

python3 -m venv tts

Activate the environment:

source tts/bin/activate

Then install torchaudio. I install it from the Tsinghua mirror:

pip install torchaudio -i https://pypi.tuna.tsinghua.edu.cn/simple

Switch to the project root and install index-tts:

pip install -e .

Once the installation finishes, everything is ready.

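A side note on CUDA. The CUDA Toolkit is installed on my Windows host:
[screenshot: windows]
but it is not installed in the Ubuntu instance running under WSL:
[screenshot: wsl]
This does not stop IndexTTS from using the GPU, and it made no difference when I ran Qwen3 earlier either. I am not sure why. A likely explanation is that the pip-installed PyTorch wheels bundle their own CUDA runtime libraries and WSL exposes the Windows NVIDIA driver to Linux, so the toolkit itself is only needed for compiling CUDA code. Since it works, I am leaving it alone.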
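To confirm that the GPU is visible from inside the virtual environment without relying on nvcc, a small check like the following can help. This is a minimal sketch, not part of IndexTTS: the function name cuda_report is made up for illustration, and it only uses standard PyTorch attributes (torch.version.cuda, torch.cuda.is_available, torch.cuda.get_device_name).

```python
import importlib.util


def cuda_report() -> dict:
    """Collect basic facts about the PyTorch/CUDA setup in the current environment."""
    report = {"torch_installed": importlib.util.find_spec("torch") is not None}
    if not report["torch_installed"]:
        return report

    import torch  # imported lazily so the script still runs without PyTorch

    report["torch_version"] = torch.__version__
    # CUDA runtime version the wheel was built against (None for CPU-only builds).
    report["wheel_cuda_version"] = torch.version.cuda
    report["cuda_available"] = torch.cuda.is_available()
    if report["cuda_available"]:
        report["device_name"] = torch.cuda.get_device_name(0)
    return report


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

If cuda_available is reported as True even though nvcc is missing inside WSL, the runtime is coming from the PyTorch wheel and the driver, not from a locally installed toolkit.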

Running

First, move the downloaded files into the checkpoints directory under the project. The structure looks roughly like this:

└── index-tts
    ├── DISCLAIMER
    ├── INDEX_MODEL_LICENSE
    ├── LICENSE
    ├── MANIFEST.in
    ├── README.md
    ├── assets
    │   ├── IndexTTS.png
    │   ├── img.png
    │   └── index_icon.png
    ├── checkpoints
    │   ├── README
    │   ├── README.md
    │   ├── README.md:Zone.Identifier
    │   ├── README:Zone.Identifier
    │   ├── bigvgan_discriminator.pth
    │   ├── bigvgan_discriminator.pth:Zone.Identifier
    │   ├── bigvgan_generator.pth
    │   ├── bigvgan_generator.pth:Zone.Identifier
    │   ├── bpe.model
    │   ├── bpe.model:Zone.Identifier
    │   ├── config.yaml
    │   ├── config.yaml:Zone.Identifier
    │   ├── configuration.json
    │   ├── configuration.json:Zone.Identifier
    │   ├── dvae.pth
    │   ├── dvae.pth:Zone.Identifier
    │   ├── gitattributes
    │   ├── gitattributes:Zone.Identifier
    │   ├── gpt.pth
    │   ├── gpt.pth:Zone.Identifier
    │   ├── unigram_12000.vocab
    │   └── unigram_12000.vocab:Zone.Identifier
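Before running inference it can be worth checking that the model files actually landed in checkpoints. The sketch below is an illustration only: the helper names are hypothetical, the file list is copied from the listing above (not every file is necessarily needed at inference time), and the cleanup step assumes the *:Zone.Identifier entries are leftover Windows metadata from copying files into WSL, which is safe to delete.

```python
from pathlib import Path

# File names taken from the checkpoints listing in this post.
EXPECTED_FILES = [
    "bigvgan_discriminator.pth",
    "bigvgan_generator.pth",
    "bpe.model",
    "config.yaml",
    "dvae.pth",
    "gpt.pth",
    "unigram_12000.vocab",
]


def missing_checkpoints(checkpoint_dir):
    """Return the expected files that are not present in checkpoint_dir."""
    root = Path(checkpoint_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


def remove_zone_identifiers(checkpoint_dir):
    """Delete '*:Zone.Identifier' files and return how many were removed."""
    removed = 0
    for path in Path(checkpoint_dir).glob("*:Zone.Identifier"):
        path.unlink()
        removed += 1
    return removed


if __name__ == "__main__":
    missing = missing_checkpoints("checkpoints")
    print("missing files:", missing if missing else "none")
```

Calling remove_zone_identifiers("checkpoints") is optional; the extra files do not appear to interfere with loading the model, they only clutter the directory.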

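Then create a test_data directory under the project root and put an input.wav in it as the reference audio. There are no requirements on the sample rate or file size. If you don't have an audio file at hand, you can use sample_prompt.wav from the tests directory.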
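Although no particular sample rate is required, it is sometimes handy to know what the reference audio looks like before feeding it in. The following sketch uses only Python's standard wave module, which reads uncompressed PCM WAV files; describe_wav is a name chosen for this example and is not part of IndexTTS.

```python
import wave


def describe_wav(path):
    """Return channel count, sample rate and duration (seconds) of a PCM WAV file."""
    with wave.open(str(path), "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate": rate,
            "duration_s": frames / rate,
        }
```

For example, describe_wav("test_data/input.wav") returns a dictionary whose duration_s should roughly match the "Reference audio length" that IndexTTS prints when it runs.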

Then run the following from the project root:

python indextts/infer.py

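The code uses quite a few relative paths, so if you run it from a different directory you may need to adjust the file locations. See the code for details.
This generates the speech file gen.wav in the project directory.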
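One way to sidestep the relative-path issue is to call IndexTTS from your own script and build absolute paths from the project root. The sketch below follows the usage shown in the project README (IndexTTS(model_dir=..., cfg_path=...) and tts.infer(voice, text, output_path)); the helper names resolve_paths and synthesize, and the directory layout they assume (checkpoints, test_data/input.wav, gen.wav), are taken from this post and are otherwise my own assumptions.

```python
from pathlib import Path


def resolve_paths(project_root):
    """Build absolute paths for the model, reference audio and output file."""
    root = Path(project_root).resolve()
    return {
        "model_dir": root / "checkpoints",
        "cfg_path": root / "checkpoints" / "config.yaml",
        "voice": root / "test_data" / "input.wav",
        "output": root / "gen.wav",
    }


def synthesize(project_root, text):
    """Generate speech for `text` using the reference voice; returns the output path."""
    # Imported lazily: requires index-tts to be installed (pip install -e .).
    from indextts.infer import IndexTTS

    paths = resolve_paths(project_root)
    tts = IndexTTS(
        model_dir=str(paths["model_dir"]),
        cfg_path=str(paths["cfg_path"]),
    )
    tts.infer(str(paths["voice"]), text, str(paths["output"]))
    return paths["output"]
```

With this in place, synthesize("/path/to/index-tts", "Hello") can be called from any working directory, since every path is resolved against the project root instead of the current directory.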

The generation speed is roughly as follows:

>> Reference audio length: 5.45 seconds
>> gpt_gen_time: 2.34 seconds
>> gpt_forward_time: 0.03 seconds
>> bigvgan_time: 0.25 seconds
>> Total inference time: 2.80 seconds
>> Generated audio length: 2.82 seconds
>> RTF: 0.9936
>> wav file saved to: gen.wav
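The RTF (real-time factor) in the log is the total inference time divided by the length of the generated audio, so a value below 1 means synthesis runs faster than playback. As a quick sanity check, the figure can be recomputed from the rounded numbers above; real_time_factor is just an illustrative helper.

```python
def real_time_factor(inference_s: float, audio_s: float) -> float:
    """Inference time divided by generated audio length; below 1 is faster than real time."""
    if audio_s <= 0:
        raise ValueError("audio length must be positive")
    return inference_s / audio_s


# Values copied from the log: 2.80 s of inference for 2.82 s of audio.
rtf = real_time_factor(2.80, 2.82)
print(f"RTF: {rtf:.4f}")
```

The result comes out at about 0.993, slightly different from the logged 0.9936, presumably because the log computes the ratio from unrounded timings.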