
Getting Started with Weaviate: A Complete Guide to Building a Vector Database from Scratch

news 2025/8/29 11:11:58

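I. Introduction to Weaviate and Its Core Advantages

Weaviate is an open-source vector search engine designed for storing and retrieving high-dimensional vector data, with support for text, images, and other media types. Its core features include semantic search, question-answer extraction, and classification. Its main advantages are:

Low latency: queries typically return within milliseconds, which suits real-time scenarios.
Flexible scaling: supports billions of data objects, and the modular architecture can integrate custom models (for example PyTorch or TensorFlow models).
Multimodal support: works with text, images, audio, video, and other data types.
Cloud-native design: exposes GraphQL and REST APIs and integrates with existing stacks such as LangChain.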

II. Environment Setup and Deployment

1. Quick deployment with Docker

Start the Weaviate service with Docker Compose:

# docker-compose.yml
version: '3.4'
services:
  weaviate:
    image: semitechnologies/weaviate:latest
    ports:
      - "8080:8080"
      - "50051:50051"  # gRPC port (optional)
    environment:
      AUTHENTICATION_APIKEY_ENABLED: "true"
      AUTHENTICATION_APIKEY_ALLOWED_KEYS: "your-api-key"

Start command:

docker-compose up -d
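
After the container starts, it can take a few seconds before Weaviate accepts requests. The following is a minimal sketch of a readiness poll, assuming the instance is reachable at http://localhost:8080 and serves the /v1/.well-known/ready endpoint. The helper names http_ready and wait_until_ready are made up for this illustration and are not part of any Weaviate SDK.

```python
import time
import urllib.error
import urllib.request


def http_ready(url="http://localhost:8080/v1/.well-known/ready"):
    """Return True if the readiness endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet.
        return False


def wait_until_ready(check=http_ready, timeout=30.0, interval=1.0, sleep=time.sleep):
    """Poll `check` until it returns True or `timeout` seconds have elapsed."""
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

Calling wait_until_ready() before creating the client avoids connection errors caused by a container that is still starting.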

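2. Initializing the Python client

Install the SDK (for example with pip install "weaviate-client<4", since the code below uses the v3-style weaviate.Client API) and connect to the database:

import weaviate

client = weaviate.Client(
    url="http://localhost:8080",
    auth_client_secret=weaviate.AuthApiKey("your-api-key"),
)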
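
Hard-coding the key is fine for a local test, but it is easy to leak once the script is shared. A possible alternative, sketched below, reads the settings from environment variables. The names WEAVIATE_URL and WEAVIATE_API_KEY are a convention chosen for this example, and connect assumes the v3-style client used above.

```python
import os


def load_weaviate_settings(env=None):
    """Collect connection settings from environment variables.

    WEAVIATE_URL and WEAVIATE_API_KEY are names chosen for this sketch.
    """
    env = os.environ if env is None else env
    url = env.get("WEAVIATE_URL", "http://localhost:8080").rstrip("/")
    api_key = env.get("WEAVIATE_API_KEY")
    if not api_key:
        raise ValueError("WEAVIATE_API_KEY is not set")
    return {"url": url, "api_key": api_key}


def connect(settings):
    """Create a v3-style client from the settings dict."""
    import weaviate  # imported lazily so the settings helper works without the SDK

    return weaviate.Client(
        url=settings["url"],
        auth_client_secret=weaviate.AuthApiKey(settings["api_key"]),
    )
```

With both variables exported, client = connect(load_weaviate_settings()) would replace the hard-coded call.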

III. Data Modeling and Schema Definition

1. Creating a data class (Class)

Example: a schema for storing technical articles:

schema = {
    "class": "Article",
    "properties": [
        {"name": "title", "dataType": ["text"]},
        {"name": "content", "dataType": ["text"]},
        {"name": "tags", "dataType": ["text[]"]},
    ],
    "vectorizer": "text2vec-transformers",  # vectorizer module to use
    "vectorIndexConfig": {
        "distance": "cosine"  # distance metric (other options include l2-squared and dot)
    },
}
client.schema.create_class(schema)
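
The distance setting decides how two vectors are compared at query time. The sketch below spells out three common metrics in plain Python, following Weaviate's definitions as I understand them: cosine distance is 1 minus cosine similarity, the squared Euclidean metric is named l2-squared, and dot is the negative dot product, so that smaller always means closer. The function names are illustrative only.

```python
import math


def _check_same_length(a, b):
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")


def cosine_distance(a, b):
    """1 - cosine similarity: 0 for the same direction, 2 for opposite."""
    _check_same_length(a, b)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def l2_squared_distance(a, b):
    """Squared Euclidean distance."""
    _check_same_length(a, b)
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dot_distance(a, b):
    """Negative dot product, so that a larger dot product gives a smaller distance."""
    _check_same_length(a, b)
    return -sum(x * y for x, y in zip(a, b))
```

Cosine ignores vector length and compares direction only, which is why it is a common default for text embeddings. The other two metrics are sensitive to magnitude.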

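2. Custom vectorizer module

To integrate a custom model (such as a Hugging Face model), add the module container to the Docker configuration. The weaviate service must also enable the module and point to this container through its ENABLE_MODULES and TRANSFORMERS_INFERENCE_API environment variables:

services:
  t2v-transformers:
    image: soulteary/t2v-transformers:2024.06.27
    ports:
      - "9090:8080"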

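IV. Data Import and Vector Generation

1. Inserting a single object

data_object = {
    "title": "Getting Started with Weaviate",
    "content": "This article explains how to quickly set up a Weaviate vector database...",
    "tags": ["database", "AI"],
}
client.data_object.create(data_object, "Article")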

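2. Bulk-importing CSV data

Use Pandas to process the structured data:

import pandas as pd

df = pd.read_csv("articles.csv")
for _, row in df.iterrows():
    # One request per row; tags are stored as a comma-separated string in the CSV.
    client.data_object.create(
        {
            "title": row["title"],
            "content": row["content"],
            "tags": row["tags"].split(","),
        },
        "Article",
    )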
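
The loop above sends one request per row and calls split on every tags value, which raises an error when pandas reads an empty cell as NaN. The following sketch is a more defensive variant: it uses the standard csv module, assumes the same title/content/tags columns, and hands the objects to the v3 client's batch interface. The helper names are hypothetical, and import_articles assumes client.batch behaves as in the v3 Python client.

```python
import csv
import io


def parse_tags(raw):
    """Split a comma-separated tag string, dropping blanks and stray spaces."""
    if raw is None:
        return []
    return [tag.strip() for tag in raw.split(",") if tag.strip()]


def rows_to_objects(csv_file):
    """Turn CSV rows with title/content/tags columns into Article objects."""
    reader = csv.DictReader(csv_file)
    return [
        {
            "title": row["title"],
            "content": row["content"],
            "tags": parse_tags(row.get("tags")),
        }
        for row in reader
    ]


def import_articles(client, objects, batch_size=100):
    """Send objects through the v3 client's batch interface (assumed API)."""
    client.batch.configure(batch_size=batch_size)
    with client.batch as batch:
        for obj in objects:
            batch.add_data_object(obj, "Article")


# Example with in-memory data instead of articles.csv:
sample = io.StringIO('title,content,tags\nA,Text a,"db, AI"\nB,Text b,\n')
objects = rows_to_objects(sample)
```

With a real file, opening it with open("articles.csv", newline="", encoding="utf-8") and passing the handle to rows_to_objects would be the equivalent call.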

V. Vector Search in Practice

1. Basic semantic search

response = (
    client.query
    .get("Article", ["title", "content"])
    .with_near_text({"concepts": ["machine learning"]})
    .with_limit(5)
    .do()
)

for item in response["data"]["Get"]["Article"]:
    print(f"Title: {item['title']}\nSummary: {item['content'][:100]}...")
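
Indexing straight into response["data"]["Get"]["Article"] raises a KeyError when the query fails, because a GraphQL error response carries an errors list in place of the expected data. A small sketch of safer result handling follows. The helper names are illustrative, and the response shape is assumed to match what the v3 client's do() returns.

```python
def extract_articles(response, class_name="Article"):
    """Return the list of hits from a GraphQL Get response, or raise on errors."""
    if response.get("errors"):
        messages = "; ".join(
            str(err.get("message", err)) if isinstance(err, dict) else str(err)
            for err in response["errors"]
        )
        raise RuntimeError(f"query failed: {messages}")
    hits = ((response.get("data") or {}).get("Get") or {}).get(class_name)
    return hits or []


def summarize(item, width=100):
    """Format one hit as a title line plus a truncated content preview."""
    content = item.get("content") or ""
    suffix = "..." if len(content) > width else ""
    return f"Title: {item.get('title', '')}\nSummary: {content[:width]}{suffix}"
```

The print loop would then become: for item in extract_articles(response): print(summarize(item)).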

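2. Combined query (vector + structured filter)

response = (
    client.query
    .get("Article", ["title"])
    .with_where({
        "path": ["tags"],
        "operator": "ContainsAny",
        "valueText": ["AI"],
    })
    .with_near_vector({"vector": [0.1, -0.2, 0.5]})  # placeholder; length must match the stored vectors
    .with_additional(["distance"])
    .do()  # execute the query
)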
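
The three-element vector in this query is only a placeholder: a real query vector needs the same number of dimensions as the vectors stored for the class. The sketch below builds the filter dictionary and checks the vector length before the query is sent. Both helper names are hypothetical, and expected_dim is a value the caller is assumed to know from the embedding model in use.

```python
def tags_filter(tags, operator="ContainsAny"):
    """Build a where-filter dict matching objects by their tags property."""
    if operator not in ("ContainsAny", "ContainsAll"):
        raise ValueError(f"unsupported operator: {operator}")
    tags = list(tags)
    if not tags:
        raise ValueError("at least one tag is required")
    return {"path": ["tags"], "operator": operator, "valueText": tags}


def check_vector(vector, expected_dim):
    """Fail early if the query vector's length differs from the index dimension."""
    if len(vector) != expected_dim:
        raise ValueError(f"expected {expected_dim} dimensions, got {len(vector)}")
    return [float(x) for x in vector]
```

The results can be passed to with_where(tags_filter(["AI"])) and with_near_vector({"vector": check_vector(vec, expected_dim)}), so that a mismatched vector is caught locally with a clear message.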