A Look at Spring AI Alibaba's YoutubeDocumentReader

This article walks through the YoutubeDocumentReader in Spring AI Alibaba.

YoutubeDocumentReader

community/document-readers/spring-ai-alibaba-starter-document-reader-youtube/src/main/java/com/alibaba/cloud/ai/reader/youtube/YoutubeDocumentReader.java

    public class YoutubeDocumentReader implements DocumentReader {

        private static final String WATCH_URL = "https://www.youtube.com/watch?v=%s";

        private final ObjectMapper objectMapper;

        private static final List<String> YOUTUBE_URL_PATTERNS = List.of("youtube\\.com/watch\\?v=([^&]+)",
                "youtu\\.be/([^?&]+)");

        private final String resourcePath;

        private static final int MEMORY_SIZE = 5;

        private static final int BYTE_SIZE = 1024;

        private static final int MAX_MEMORY_SIZE = MEMORY_SIZE * BYTE_SIZE * BYTE_SIZE;

        private static final WebClient WEB_CLIENT = WebClient.builder()
            .defaultHeader("Accept-Language", "en-US")
            .codecs(configurer -> configurer.defaultCodecs().maxInMemorySize(MAX_MEMORY_SIZE))
            .build();

        public YoutubeDocumentReader(String resourcePath) {
            Assert.hasText(resourcePath, "Query string must not be empty");
            this.resourcePath = resourcePath;
            this.objectMapper = new ObjectMapper();
        }

        @Override
        public List<Document> get() {
            List<Document> documents = new ArrayList<>();
            try {
                String videoId = extractVideoIdFromUrl(resourcePath);
                String subtitleContent = getSubtitleInfo(videoId);
                documents.add(new Document(StringEscapeUtils.unescapeHtml4(subtitleContent)));
            }
            catch (IOException e) {
                throw new RuntimeException("Failed to load document from Youtube: {}", e);
            }
            return documents;
        }

        // Method to extract the videoId from the resourcePath
        public String extractVideoIdFromUrl(String resourcePath) {
            for (String pattern : YOUTUBE_URL_PATTERNS) {
                Pattern regexPattern = Pattern.compile(pattern);
                Matcher matcher = regexPattern.matcher(resourcePath);
                if (matcher.find()) {
                    return matcher.group(1); // Extract the videoId (captured group)
                }
            }
            throw new IllegalArgumentException("Invalid YouTube URL: Unable to extract videoId.");
        }

        public String getSubtitleInfo(String videoId) throws IOException {
            // Step 1: Fetch the HTML content of the YouTube video page
            String url = String.format(WATCH_URL, videoId);
            String htmlContent = fetchHtmlContent(url).block(); // Blocking for simplicity in this example

            // Step 2: Extract the subtitle tracks from the HTML
            String captionsJsonString = extractCaptionsJson(htmlContent);
            if (captionsJsonString != null) {
                JsonNode captionsJson = objectMapper.readTree(captionsJsonString);
                JsonNode captionTracks = captionsJson.path("playerCaptionsTracklistRenderer").path("captionTracks");

                // Check if captionTracks exists and is an array
                if (captionTracks.isArray()) {
                    // Step 3: Extract and decode each subtitle track's URL
                    StringBuilder subtitleInfo = new StringBuilder();
                    JsonNode captionTrack = captionTracks.get(0);

                    // Safely access languageCode and baseUrl with null checks
                    String language = captionTrack.path("languageCode").asText("Unknown");
                    String urlEncoded = captionTrack.path("baseUrl").asText("");

                    // Decode the URL to avoid \u0026 issues
                    String decodedUrl = URLDecoder.decode(urlEncoded, StandardCharsets.UTF_8);
                    String subtitleText = fetchSubtitleText(decodedUrl);
                    subtitleInfo.append("Language: ")
                        .append(language)
                        .append("\n")
                        .append(subtitleText)
                        .append("\n\n");
                    return subtitleInfo.toString();
                }
                else {
                    return "No captions available.";
                }
            }
            else {
                return "No captions data found.";
            }
        }

        private Mono<String> fetchHtmlContent(String url) {
            // Use WebClient to fetch HTML content asynchronously
            return WEB_CLIENT.get().uri(url).retrieve().bodyToMono(String.class);
        }

        private String extractCaptionsJson(String htmlContent) {
            // Extract the captions JSON from the HTML content
            String marker = "\"captions\":";
            int startIndex = htmlContent.indexOf(marker);
            if (startIndex != -1) {
                int endIndex = htmlContent.indexOf("\"videoDetails", startIndex);
                if (endIndex != -1) {
                    String captionsJsonString = htmlContent.substring(startIndex + marker.length(), endIndex);
                    return captionsJsonString.trim();
                }
            }
            return null;
        }

        private String fetchSubtitleText(String decodedUrl) throws IOException {
            // Fetch the subtitle text by making a request to the decoded subtitle URL
            org.jsoup.nodes.Document doc = Jsoup.connect(decodedUrl).get();

            // Assuming the subtitle text is inside <transcript> tags, extract the text
            StringBuilder subtitleText = new StringBuilder();
            doc.select("text").forEach(textNode -> {
                String text = textNode.text();
                subtitleText.append(text).append("\n");
            });
            return subtitleText.toString();
        }

    }

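The YoutubeDocumentReader constructor takes a resourcePath (the video URL), and the class holds a static WebClient. Its get method first calls extractVideoIdFromUrl to obtain the videoId, then calls getSubtitleInfo to fetch the subtitles, and finally returns the result as a List<Document>. getSubtitleInfo requests https://www.youtube.com/watch?v=videoId, cuts the JSON that sits between the "captions": marker and the following "videoDetails key out of the returned HTML, parses it with Jackson, takes the first entry of captionTracks, and reads its languageCode and baseUrl. The subtitle text itself is then downloaded from the decoded baseUrl with Jsoup and prefixed with a "Language: ..." line.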
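The two pure string-handling steps described above (pulling the videoId out of the URL and slicing the captions JSON out of the page HTML) can be tried without any network access. The following is a minimal sketch that uses only the JDK. The class name YoutubePageParsingSketch and the sample HTML fragment are made up for illustration, and returning an Optional instead of null is a choice of this sketch rather than the library's behavior.

```java
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of spring-ai-alibaba.
final class YoutubePageParsingSketch {

    // The same two URL shapes as YOUTUBE_URL_PATTERNS in the quoted source,
    // compiled once here instead of on every call.
    private static final List<Pattern> URL_PATTERNS = List.of(
            Pattern.compile("youtube\\.com/watch\\?v=([^&]+)"),
            Pattern.compile("youtu\\.be/([^?&]+)"));

    // Returns the first capture group of the first pattern that matches.
    static String extractVideoId(String url) {
        for (Pattern pattern : URL_PATTERNS) {
            Matcher matcher = pattern.matcher(url);
            if (matcher.find()) {
                return matcher.group(1);
            }
        }
        throw new IllegalArgumentException("Unable to extract videoId from: " + url);
    }

    // Cuts out everything between the "captions": marker and the next
    // "videoDetails key. The slice is not validated as JSON here.
    static Optional<String> extractCaptionsJson(String html) {
        String marker = "\"captions\":";
        int start = html.indexOf(marker);
        if (start == -1) {
            return Optional.empty();
        }
        int end = html.indexOf("\"videoDetails", start);
        if (end == -1) {
            return Optional.empty();
        }
        return Optional.of(html.substring(start + marker.length(), end).trim());
    }

    public static void main(String[] args) {
        System.out.println(extractVideoId("https://www.youtube.com/watch?v=q-9wxg9tQRk&t=10s"));
        System.out.println(extractVideoId("https://youtu.be/q-9wxg9tQRk?si=example"));

        // Made-up fragment that only imitates the marker layout the code relies on.
        String html = "{\"captions\":{\"playerCaptionsTracklistRenderer\":{}},\"videoDetails\":{}}";
        System.out.println(extractCaptionsJson(html).orElse("no captions marker"));
    }
}
```

Note that the slice ends right before "videoDetails, so it still carries the comma that separated the two keys. The quoted source hands this string to objectMapper.readTree unchanged.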

Example

community/document-readers/spring-ai-alibaba-starter-document-reader-youtube/src/test/java/com/alibaba/cloud/ai/reader/youtube/YoutubeDocumentReaderTest.java

    public class YoutubeDocumentReaderTest {

        private static final Logger logger = LoggerFactory.getLogger(YoutubeDocumentReaderTest.class);

        @Test
        void youtubeDocumentReaderTest() {
            YoutubeDocumentReader youtubeDocumentReader = new YoutubeDocumentReader(
                    "https://www.youtube.com/watch?v=q-9wxg9tQRk");
            List<Document> documents = youtubeDocumentReader.get();
            logger.info("documents: {}", documents);
        }

    }
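One detail worth keeping in mind when using the reader: in the source quoted above, getSubtitleInfo does not throw when a video has no captions. It returns the text "No captions available." or "No captions data found.", and get() wraps that text in a Document like any other result. A caller may therefore want to check the document text before indexing it. Below is a minimal sketch of such a check. The class CaptionResultCheck and its method are hypothetical and operate on a plain String, so obtaining the text from the Document is left to the caller.

```java
import java.util.Set;

// Hypothetical helper class, not part of spring-ai-alibaba.
final class CaptionResultCheck {

    // The two fallback messages returned by getSubtitleInfo in the quoted source.
    private static final Set<String> FALLBACK_MESSAGES = Set.of(
            "No captions available.",
            "No captions data found.");

    // True only when the text is non-blank and is not one of the fallback messages.
    static boolean hasRealCaptions(String documentText) {
        return documentText != null
                && !documentText.isBlank()
                && !FALLBACK_MESSAGES.contains(documentText.trim());
    }

    public static void main(String[] args) {
        System.out.println(hasRealCaptions("Language: en\nHello everyone\n\n"));
        System.out.println(hasRealCaptions("No captions available."));
    }
}
```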

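Summary

spring-ai-alibaba-starter-document-reader-youtube provides YoutubeDocumentReader. It uses WebClient to request the watch page for the given video URL, extracts the language and text of the first caption track, and returns them as a List<Document> containing a single Document.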
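To illustrate the last step, turning the downloaded caption markup into plain text, here is a minimal sketch. The quoted source uses Jsoup and doc.select("text"); this sketch swaps in the JDK's built-in DOM parser so that it runs without extra dependencies. The sample XML is made up: the article does not show a real response, so the transcript/text layout is an assumption taken from the comment in fetchSubtitleText.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of spring-ai-alibaba.
final class TranscriptXmlSketch {

    // Collects the text of every <text> element, one caption per line,
    // mirroring what the quoted fetchSubtitleText does with Jsoup.
    static String toPlainText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getElementsByTagName("text");
            StringBuilder subtitleText = new StringBuilder();
            for (int i = 0; i < nodes.getLength(); i++) {
                subtitleText.append(nodes.item(i).getTextContent()).append("\n");
            }
            return subtitleText.toString();
        }
        catch (Exception e) {
            throw new IllegalStateException("Failed to parse transcript XML", e);
        }
    }

    public static void main(String[] args) {
        // Assumed layout, for illustration only.
        String xml = "<transcript>"
                + "<text start=\"0\" dur=\"1.5\">Hello</text>"
                + "<text start=\"1.5\" dur=\"2.0\">world</text>"
                + "</transcript>";
        System.out.print(toPlainText(xml));
    }
}
```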

doc

  • java2ai