Kokoro: A Lightweight Text-to-Speech Model

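Kokoro is an open-weight TTS model with 82 million parameters. Despite its lightweight architecture, it delivers quality comparable to larger models while being faster and more cost-efficient. With Apache-licensed weights, Kokoro can be deployed anywhere from production environments to personal projects.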

Official page: hexgrad/kokoro: https://hf.co/hexgrad/Kokoro-82M

Now let's try Kokoro in practice.

Installing and using Kokoro on Linux

Install the libraries

pip install -q "kokoro>=0.8.2" "misaki[zh]>=0.8.2" soundfile
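
To confirm which versions actually ended up installed, a minimal sketch using the standard library's importlib.metadata could look like this. The helper name installed_version is hypothetical; the package names are the ones from the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Package names taken from the pip command above.
for name in ("kokoro", "misaki", "soundfile"):
    print(name, installed_version(name) or "not installed")
```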

One-step install and run

To keep things simple, you can follow the official example: paste the statements below into a Kaggle or Colab notebook and run them.

!pip install -q "kokoro>=0.9.4" soundfile
!apt-get -qq -y install espeak-ng > /dev/null 2>&1
from kokoro import KPipeline
from IPython.display import display, Audio
import soundfile as sf
import torch
pipeline = KPipeline(lang_code='a')
text = '''
[Kokoro](/kˈOkəɹO/) is an open-weight TTS model with 82 million parameters. Despite its lightweight architecture, it delivers comparable quality to larger models while being significantly faster and more cost-efficient. With Apache-licensed weights, [Kokoro](/kˈOkəɹO/) can be deployed anywhere from production environments to personal projects.
'''
generator = pipeline(text, voice='af_heart')
for i, (gs, ps, audio) in enumerate(generator):
    print(i, gs, ps)
    display(Audio(data=audio, rate=24000, autoplay=i==0))
    sf.write(f'{i}.wav', audio, 24000)

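These statements use pip to install the two Python packages kokoro and soundfile, and use apt to install the espeak-ng package (on Ubuntu).

Assign the text you want to convert to speech to the text variable, and the conversion can then run.

The English output is acceptable, but digits cannot be mixed into the text, and the pipeline fails if the text contains Chinese.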
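
One way to avoid feeding Chinese text to the English pipeline is to check the input first. The sketch below is an illustration only: the helper pick_lang_code is hypothetical, and it assumes that the basic CJK Unified Ideographs block is enough to recognize typical Mandarin text. The return values 'a' and 'z' are the KPipeline language codes for American English and Mandarin Chinese.

```python
import re

# Assumption: the basic CJK Unified Ideographs block covers typical Mandarin text.
_CJK = re.compile(r"[\u4e00-\u9fff]")


def pick_lang_code(text):
    """Return 'z' if the text contains Chinese characters, otherwise 'a'."""
    return "z" if _CJK.search(text) else "a"


print(pick_lang_code("Kokoro is an open-weight TTS model."))
print(pick_lang_code("千里之行,始于足下。"))
```

The result could then be passed as KPipeline(lang_code=...), together with a voice that belongs to the same language.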

Multiple languages

Install the Chinese library

pip install "misaki[zh]"

An example for a Kaggle or Colab notebook:

# 1️⃣ Install kokoro
!pip install -q "kokoro>=0.9.4" soundfile
# 2️⃣ Install espeak, used for English OOD fallback and some non-English languages
!apt-get -qq -y install espeak-ng > /dev/null 2>&1
# 3️⃣ Initialize a pipeline
from kokoro import KPipeline
from IPython.display import display, Audio
import soundfile as sf
import torch
# 🇺🇸 'a' => American English, 🇬🇧 'b' => British English
# 🇪🇸 'e' => Spanish es
# 🇫🇷 'f' => French fr-fr
# 🇮🇳 'h' => Hindi hi
# 🇮🇹 'i' => Italian it
# 🇯🇵 'j' => Japanese: pip install misaki[ja]
# 🇧🇷 'p' => Brazilian Portuguese pt-br
# 🇨🇳 'z' => Mandarin Chinese: pip install misaki[zh]
pipeline = KPipeline(lang_code='a') # <= make sure lang_code matches voice, reference above.
# This text is for demonstration purposes only, unseen during training
text = '''
The sky above the port was the color of television, tuned to a dead channel.
"It's not like I'm using," Case heard someone say, as he shouldered his way through the crowd around the door of the Chat. "It's like my body's developed this massive drug deficiency."
It was a Sprawl voice and a Sprawl joke. The Chatsubo was a bar for professional expatriates; you could drink there for a week and never hear two words in Japanese.

These were to have an enormous impact, not only because they were associated with Constantine, but also because, as in so many other areas, the decisions taken by Constantine (or in his name) were to have great significance for centuries to come. One of the main issues was the shape that Christian churches were to take, since there was not, apparently, a tradition of monumental church buildings when Constantine decided to help the Christian church build a series of truly spectacular structures. The main form that these churches took was that of the basilica, a multipurpose rectangular structure, based ultimately on the earlier Greek stoa, which could be found in most of the great cities of the empire. Christianity, unlike classical polytheism, needed a large interior space for the celebration of its religious services, and the basilica aptly filled that need. We naturally do not know the degree to which the emperor was involved in the design of new churches, but it is tempting to connect this with the secular basilica that Constantine completed in the Roman forum (the so-called Basilica of Maxentius) and the one he probably built in Trier, in connection with his residence in the city at a time when he was still caesar.

[Kokoro](/kˈOkəɹO/) is an open-weight TTS model with 82 million parameters. Despite its lightweight architecture, it delivers comparable quality to larger models while being significantly faster and more cost-efficient. With Apache-licensed weights, [Kokoro](/kˈOkəɹO/) can be deployed anywhere from production environments to personal projects.
'''
# 4️⃣ Generate, display, and save audio files in a loop.
generator = pipeline(
    text, voice='af_heart', # <= change voice here
    speed=1, split_pattern=r'\n+'
)
# Alternatively, load voice tensor directly:
# voice_tensor = torch.load('path/to/voice.pt', weights_only=True)
# generator = pipeline(
#     text, voice=voice_tensor,
#     speed=1, split_pattern=r'\n+'
# )
for i, (gs, ps, audio) in enumerate(generator):
    print(i)  # i => index
    print(gs) # gs => graphemes/text
    print(ps) # ps => phonemes
    display(Audio(data=audio, rate=24000, autoplay=i==0))
    sf.write(f'{i}.wav', audio, 24000) # save each audio file
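
The split_pattern=r'\n+' argument is why one input text produces several numbered .wav files. The sketch below only illustrates what that regular expression does to a multi-paragraph string, using Python's re module; it is an approximation of the splitting step, and the library itself may break long segments down further. The sample text is made up for the illustration.

```python
import re

# Made-up sample text with a blank line and a single line break.
text = '''
First paragraph, spoken as one segment.

Second paragraph.
Third line.
'''

# Same pattern as split_pattern above: one or more consecutive newlines.
segments = [s for s in re.split(r'\n+', text.strip()) if s]

for i, segment in enumerate(segments):
    print(i, segment)
```

Each segment then yields at least one (gs, ps, audio) tuple in the loop, so paragraph breaks in the input are a simple way to control where the audio files are cut.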

Installing on Windows

On Windows, the main extra step is installing espeak-ng. Download it here:

https://github.com/espeak-ng/espeak-ng/releases

Download the espeak-ng installer and run it.
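
After installing, a quick way to see whether the espeak-ng executable is discoverable is to look it up on PATH with shutil.which. This is a minimal sketch; the helper name find_espeak is hypothetical. A None result is not conclusive, since the installer may not add its directory to PATH and the Python packages may locate espeak-ng by other means.

```python
import shutil


def find_espeak(candidates=("espeak-ng", "espeak")):
    """Return the path of the first candidate executable found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


print(find_espeak() or "espeak-ng not found on PATH")
```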

ONNX deployment

Install the dependencies

!pip install -U kokoro-onnx soundfile 'misaki[zh]'

Download the model files

!wget https://github.com/thewh1teagle/kokoro-onnx/releases/download/model-files-v1.1/kokoro-v1.1-zh.onnx
!wget https://github.com/thewh1teagle/kokoro-onnx/releases/download/model-files-v1.1/voices-v1.1-zh.bin
!wget https://huggingface.co/hexgrad/Kokoro-82M-v1.1-zh/raw/main/config.json
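
Before loading the model, it can help to confirm that all three downloads are present and non-empty, since an interrupted wget leaves a missing or truncated file. The following is a small sketch using pathlib; the helper name missing_files is hypothetical, and the file names are taken from the wget commands above.

```python
from pathlib import Path

# File names taken from the wget commands above.
REQUIRED_FILES = ("kokoro-v1.1-zh.onnx", "voices-v1.1-zh.bin", "config.json")


def missing_files(directory=".", names=REQUIRED_FILES):
    """Return the names that are absent or empty in `directory`."""
    base = Path(directory)
    return [
        name for name in names
        if not (base / name).is_file() or (base / name).stat().st_size == 0
    ]


print(missing_files() or "all model files present")
```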

Converting text to speech

import soundfile as sf
from IPython.display import display, Audio
from misaki import zh
from kokoro_onnx import Kokoro

# Misaki G2P with espeak-ng fallback
g2p = zh.ZHG2P(version="1.1")

text = "千里之行,始于足下。欢迎使用Kokoro TTS,高效生成自然语音!"
voice = "zf_001"
kokoro = Kokoro("kokoro-v1.1-zh.onnx", "voices-v1.1-zh.bin", vocab_config="config.json")
phonemes, _ = g2p(text)
samples, sample_rate = kokoro.create(phonemes, voice=voice, speed=1.0, is_phonemes=True)
sf.write("audio.wav", samples, sample_rate)
print("Created audio.wav")
# Play the generated samples inline in the notebook
display(Audio(data=samples, rate=sample_rate, autoplay=True))

Inference is fairly fast.

However, English words embedded in Chinese text are not read aloud, so the result is still somewhat weaker.
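
One conceivable workaround, which is not something tested in this article, is to split mixed text into Chinese and English runs and send each run to a model for the matching language. The sketch below shows only the splitting step. The helper split_by_script is hypothetical, and it assumes that Chinese means the basic CJK Unified Ideographs block and that English runs consist of ASCII letters, spaces, and apostrophes. Punctuation and digits are dropped, which may affect prosody.

```python
import re

# Assumption: Chinese runs use the basic CJK block; English runs start and
# end with an ASCII letter and may contain spaces or apostrophes in between.
_RUNS = re.compile(r"[\u4e00-\u9fff]+|[A-Za-z](?:[A-Za-z' ]*[A-Za-z])?")


def split_by_script(text):
    """Split text into ('zh' | 'en', run) pairs, dropping punctuation and digits."""
    return [
        ("zh" if "\u4e00" <= m.group()[0] <= "\u9fff" else "en", m.group())
        for m in _RUNS.finditer(text)
    ]


for lang, run in split_by_script("欢迎使用Kokoro TTS,高效生成自然语音!"):
    print(lang, run)
```

Whether stitching the separately generated clips together sounds natural is an open question and would need listening tests.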

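Summary

Chinese output quality is relatively poor; English output is passable.

Then again, Kokoro has only 82M parameters, so results at this level are already quite good!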
