
Local AI Q&A: Drop the Cloud Dependency and Build an Offline RAG Retrieval System with ChromaDB + HuggingFace Transformers

news 2025/9/4 15:07:06

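Large language models (LLMs) have reshaped how we interact with information, thanks to their ability to understand and generate text. To make an LLM answer questions about a specific domain or knowledge base, and to keep those answers current and accurate, Retrieval Augmented Generation (RAG) has become a key technique.

A traditional RAG system typically relies on:

External knowledge base: stores the text the LLM draws on when answering.

Vector database: stores vector embeddings of the text and enables efficient semantic search.

LLM: generates the final answer from the retrieved context.

In practice, many vector databases (such as Pinecone, Weaviate, and Milvus) are usually offered as cloud services or require a separately deployed server, which raises several potential challenges:

Data privacy and security: sensitive data has to be uploaded to the cloud, which can raise privacy concerns.

Cost: hosting and query fees for cloud services can become expensive as usage grows.

Offline capability: the system cannot be used without an internet connection.

Latency: cloud services can add network latency.

For many individual developers, small teams, and companies with strict data-privacy requirements, deploying RAG locally is the better choice. This article shows how to use ChromaDB as the vector database, HuggingFace's SentenceTransformers for text embeddings, and LLMs run locally through the HuggingFace Transformers library (such as GPT-2 or stronger quantized models) to build a complete offline RAG question-answering system.

We start from the basic principles of RAG, describe ChromaDB's features and its advantages for local deployment, and then provide an end-to-end Python example that walks you through building a question-answering bot that works offline.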

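1. Introduction: Why Local RAG? Privacy, Cost, and Offline Capability

1.1 RAG Core Principles in Brief

1.2 Challenges of Cloud-Based RAG

1.3 Advantages of Local RAG

1.4 ChromaDB and HuggingFace: An Ideal Combination for Local RAG

1.5 Goals and Outline of This Article

2. Core Components of a RAG System

2.1 Data Preprocessing (Document Loader & Text Splitter)

2.2 Embedding Model

2.3 Vector Store

2.4 LLM (Generator)

3. ChromaDB: A Lightweight, Local Vector Database

3.1 Introduction and Core Features

3.2 Advantages of ChromaDB: Ease of Use, Python-Native, Persistence

3.3 Comparison with Cloud Services Such as Pinecone

4. The HuggingFace Ecosystem: Strong Support for Offline Embeddings and LLMs

4.1 SentenceTransformers: Efficient Text Embedding Generation

4.2 HuggingFace Transformers Pipelines: Deploying a Local LLM with Little Effort

5. Hands-On: Building an Offline RAG Question-Answering System

5.1 Environment Setup and Library Installation

5.2 Step 1: Prepare the Knowledge Base and Load the Data

5.3 Step 2: Text Splitting (Chunking)

5.4 Step 3: Generate Text Embeddings

5.5 Step 4: Initialize ChromaDB and Build the Vector Index

5.6 Step 5: Deploy a Local LLM (HuggingFace Transformers)

5.7 Step 6: Build the RAG Chain (LangChain)

5.8 Run and Demo

6. Performance Optimization and Further Considerations

6.1 Choosing an Embedding Model

6.2 Tuning the Text Splitting Strategy

6.3 ChromaDB Persistence and Management

6.4 LLM Selection and Quantization

6.5 Real-World Considerations

7. Conclusion: Embracing the Possibilities of Local AI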

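1. Introduction: Why Local RAG? Privacy, Cost, and Offline Capability

1.1 RAG Core Principles in Brief

The core idea of RAG is "retrieve first, then generate." It links the LLM's generation process to an external knowledge base (such as internal company documents, technical manuals, or wikis), which compensates for the LLM's outdated knowledge and its tendency to state falsehoods with confidence (hallucination). The flow is roughly: the user asks a question -> the retriever finds relevant information in the knowledge base -> the LLM combines the retrieved information with the user's question to generate an answer.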
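The retrieve-then-generate flow can be sketched with nothing but the standard library. This is a toy illustration under stated assumptions, not the pipeline built later in the article: the word-overlap scorer stands in for embedding similarity, and `generate` is a hypothetical placeholder that only assembles the prompt a real LLM would receive.

```python
import re
from collections import Counter

KNOWLEDGE_BASE = [
    "ChromaDB is an open-source, embeddable vector database.",
    "SentenceTransformers turns text into dense embedding vectors.",
    "GPT-2 is a small language model that runs on a laptop CPU.",
]


def tokenize(text: str) -> Counter:
    """Lowercase bag of words; a crude stand-in for a real embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, docs: list[str], k: int = 1) -> list[str]:
    """Rank documents by the number of words they share with the question."""
    q = tokenize(question)
    ranked = sorted(docs, key=lambda d: sum((q & tokenize(d)).values()), reverse=True)
    return ranked[:k]


def generate(question: str, context: list[str]) -> str:
    """Hypothetical generator: a real system would send this prompt to an LLM."""
    joined = "\n".join(context)
    return f"Context:\n{joined}\n\nQuestion: {question}\nAnswer:"


question = "What kind of database is ChromaDB?"
context = retrieve(question, KNOWLEDGE_BASE)
prompt = generate(question, context)
print(prompt)
```

The retriever picks the ChromaDB sentence because it shares the most words with the question. Swapping the scorer for embedding similarity and `generate` for a model call gives the real system.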

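1.2 Challenges of Cloud-Based RAG

Cloud vector databases (such as Pinecone) offer scalable, high-performance services and simplify deployment, but they also bring significant challenges:

Data security and privacy: uploading private or sensitive documents to a third-party cloud service requires a high degree of trust in its security measures and may not meet compliance requirements.

Cost: hosting and API-call fees keep accumulating as data volume and query frequency grow.

Network dependency: the system's availability depends entirely on a stable network connection.

Control: users have relatively limited control over data processing, model selection, and deployment configuration on a cloud service.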

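1.3 Advantages of Local RAG

Deploying RAG locally significantly mitigates these problems:

Data privacy: all data (source documents, embedding vectors, LLM weights) stays on the local machine under your full control, with nothing uploaded to a third-party server.

Cost efficiency: there are no cloud service fees. The main costs are hardware (CPU/GPU) and a one-time model download.

Offline availability: once deployed, the system works without an internet connection.

Low latency: there is no network round trip to a cloud service, so retrieval avoids that delay; end-to-end response time then depends on the local hardware.

Full control: you are free to choose and configure the embedding model, vector database, and LLM, and to tune every parameter of the RAG system.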

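1.4 ChromaDB and HuggingFace: An Ideal Combination for Local RAG

Building a capable RAG system locally requires reliable components:

Vector database: ChromaDB is popular for being Python-native, easy to use, lightweight, and able to persist data to disk, which makes it a good fit for local and small-to-medium projects. It avoids the complexity of deploying and managing a standalone database server.

Embedding model: HuggingFace's SentenceTransformers library provides many pretrained models, such as all-MiniLM-L6-v2, that generate good-quality text embeddings efficiently on CPU or GPU, making them well suited to local RAG.

LLM: HuggingFace's transformers library lets us download and run a wide range of open-source LLMs (such as GPT-2, or small or quantized versions of Llama and Mistral), providing the local "Generator" for RAG.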

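1.5 Goals and Outline of This Article

This article guides you through building a fully functional local, offline RAG question-answering system with ChromaDB and the HuggingFace ecosystem. We cover:

The core concepts of a RAG system.

The advantages of ChromaDB and the HuggingFace ecosystem.

The complete steps for building a local RAG system: data loading, embedding, vector storage, local LLM integration, and construction of the RAG chain.

A runnable Python code example.

Performance optimization and more advanced directions.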

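2. Core Components of a RAG System

Before building, it helps to review the key parts of RAG:

2.1 Data Preprocessing (Document Loader & Text Splitter):

Document Loader: loads raw data from various sources (such as .txt, .pdf, .docx, web pages, and databases). LangChain provides a rich set of document loaders.

Text Splitter: large documents need to be split into smaller, semantically coherent chunks so the embedding model can process them and retrieval becomes more precise. A common choice is RecursiveCharacterTextSplitter, which splits text flexibly along its structure (specific characters, paragraphs, sentences).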
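To make the roles of chunk size and overlap concrete, here is a minimal fixed-window splitter. It is a deliberate simplification: RecursiveCharacterTextSplitter prefers to break at separators such as paragraph and sentence boundaries, whereas this sketch cuts at fixed character offsets. The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last three characters of the previous one, so a phrase that straddles a boundary still appears whole in at least one chunk.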

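2.2 Embedding Model:

Converts text (or other modalities, such as images) into high-dimensional vectors (embeddings) that capture its semantic content.

For local RAG, the SentenceTransformers library offers many efficient models that are easy to run locally, such as all-MiniLM-L6-v2, which strikes a good balance between speed and quality.

2.3 Vector Store:

Stores the text vectors produced by the embedding model and provides efficient similarity search (for example, K-Nearest Neighbors, KNN).

For local use, ChromaDB is an excellent choice: it supports both in-memory and persistent storage and needs no separate backend service.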
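At its core, similarity search scores every stored vector against the query vector and returns the best k. The sketch below does this by exhaustive scan in pure Python. The three-dimensional vectors are made up for illustration, and production vector stores generally rely on an approximate index rather than scanning everything.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query: list[float], vectors: dict[str, list[float]], k: int) -> list[tuple[str, float]]:
    """Return the k (id, score) pairs most similar to the query, best first."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in vectors.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Made-up 3-dimensional "embeddings"; real models emit hundreds of dimensions.
store = {
    "chunk-a": [0.9, 0.1, 0.0],
    "chunk-b": [0.0, 1.0, 0.0],
    "chunk-c": [0.7, 0.7, 0.1],
}
results = top_k([1.0, 0.2, 0.0], store, k=2)
for doc_id, score in results:
    print(doc_id, round(score, 3))
```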

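2.4 LLM (Generator):

Receives the user's original question together with the retrieved text chunks and generates a structured, meaningful answer from them.

For local deployment, we can load a wide range of open-source LLMs with the HuggingFace transformers library.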

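3. ChromaDB: A Lightweight, Local Vector Database

3.1 Introduction and Core Features

ChromaDB is an open-source embedded database designed for AI workloads. Its design philosophy is "simplicity is power": developers should be able to integrate embedding storage and retrieval into their applications with little effort.

Embedded: ChromaDB can be embedded directly into your application as a Python library, with no separate server process. It supports an in-memory mode and can also persist data to disk.

Python-native: it offers a concise Python API and integrates easily with frameworks such as LangChain.

Works out of the box: for small to medium datasets, you can start without complex configuration.

Persistence: data can be stored in a directory you specify, so the index does not have to be rebuilt every time the application restarts.

Multiple distance/similarity metrics: supports L2 (Euclidean), inner product, and cosine similarity.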
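The three metrics differ in how they treat vector length. The following sketch computes each in plain Python on hand-picked vectors, and checks the identity that, for unit-length vectors, squared L2 distance equals 2 minus twice the inner product, so all three produce the same ranking. The vectors and helper names are illustrative assumptions.

```python
import math


def inner_product(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def l2_distance(a: list[float], b: list[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a: list[float], b: list[float]) -> float:
    return inner_product(a, b) / (math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b)))


def normalize(v: list[float]) -> list[float]:
    norm = math.sqrt(inner_product(v, v))
    return [x / norm for x in v]


short = [1.0, 0.0]
long = [3.0, 0.0]  # same direction as `short`, three times the length

print("cosine:", cosine_similarity(short, long))   # direction only
print("L2:", l2_distance(short, long))             # sensitive to length
print("inner product:", inner_product(short, long))

# For unit vectors: ||u - v||^2 == 2 - 2 * (u . v)
u, v = normalize([1.0, 2.0]), normalize([2.0, 1.0])
lhs = l2_distance(u, v) ** 2
rhs = 2 - 2 * inner_product(u, v)
print(abs(lhs - rhs) < 1e-12)
```

Cosine similarity treats `short` and `long` as identical, while L2 and inner product do not. This is one reason embeddings are often normalized before indexing.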

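3.2 Advantages of ChromaDB: Ease of Use, Python-Native, Persistence

Ease of use: the API is intuitive; creating collections, adding documents, and running queries each take only a few lines.

Python-native: as a Python library, it integrates seamlessly with the Python ecosystem and works especially well alongside LangChain and SentenceTransformers.

Persistence: this is one of its key advantages. When you specify persist_directory, ChromaDB saves all data (metadata, text, embedding vectors) to that directory. The next load restores it directly without rebuilding the index, which speeds up development and makes the application more stable.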
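Persistence can also be tried directly with the chromadb package, without LangChain. The sketch below assumes a chromadb release that provides `PersistentClient` (0.4 or later). The collection name, ids, and three-dimensional vectors are made up, and the embeddings are passed in explicitly so that no embedding model is needed.

```python
def index_and_reload(path: str) -> int:
    """Store two records in a persistent collection, reopen it, and return the record count."""
    import chromadb  # imported here so the function can be defined without the package installed

    client = chromadb.PersistentClient(path=path)
    collection = client.get_or_create_collection(name="demo_docs")
    # upsert (rather than add) lets the sketch be rerun against the same directory
    collection.upsert(
        ids=["doc-1", "doc-2"],
        documents=["ChromaDB stores embeddings.", "GPT-2 generates text."],
        embeddings=[[0.1, 0.9, 0.0], [0.8, 0.1, 0.1]],  # made-up vectors, not model output
    )

    # A client opened on the same directory finds the records written above.
    reopened = chromadb.PersistentClient(path=path)
    return reopened.get_or_create_collection(name="demo_docs").count()
```

Calling `index_and_reload("chroma_demo")` is expected to return 2 and to leave a `chroma_demo` directory on disk. A later run can query that directory without re-indexing.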

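3.3 Comparison with Cloud Services Such as Pinecone

| Feature | ChromaDB (local) | Pinecone (cloud) |
|---|---|---|
| Deployment model | Embedded library; in-memory or persisted to local files | Managed cloud service accessed through an API |
| Data location | Entirely local | Cloud |
| Cost | Hardware cost, a one-time purchase (such as a GPU); no service fees | Pay per usage (storage, queries, number of nodes) |
| Privacy and security | Highest (data never leaves the machine) | Depends on trust in the provider and its security policies |
| Offline capability | Fully supported | None |
| Ease of use | Very high (Python-native API) | High (but involves API keys and service management) |
| Scalability | Limited (bound by local hardware); suited to small and medium datasets | Very high; designed for large-scale production |
| Latency | Very low (computed locally) | Affected by network conditions and cloud load |
| Configuration complexity | Low | Medium (API keys, instances, and so on) |
| Use cases | Personal projects, small applications, prototyping, scenarios with very strict privacy requirements | Large enterprise applications and scenarios that need high availability and large-scale growth |

For offline and local requirements, ChromaDB is the clear choice.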

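4. The HuggingFace Ecosystem: Strong Support for Offline Embeddings and LLMs

4.1 SentenceTransformers: Efficient Text Embedding Generation

sentence-transformers is a Python library for generating embeddings of sentences and text chunks that can be compared for semantic similarity. It wraps many pretrained Transformer models and provides a simple API for loading and using them.

Advantages:

Wide selection: many pretrained models are available, so you can choose based on your scenario and hardware.

Efficient: optimized for producing sentence embeddings.

Easy to use: a concise API that integrates easily with frameworks such as LangChain.

Runs locally: models can be downloaded and then run fully offline.

4.2 HuggingFace Transformers Pipelines: Deploying a Local LLM with Little Effort

HuggingFace's transformers library is the Swiss Army knife for integrating and running language models. Its pipeline API abstracts away model loading, tokenization, inference, and post-processing, which makes deploying a local LLM very simple.

Advantages:

Rich model selection: supports models ranging from hundreds of millions to hundreds of billions of parameters.

Flexible: choose a model and precision (FP32, FP16, INT8/INT4 quantization) to match your hardware (CPU/GPU).

Easy to use: a few lines of code load and run a model for text generation, question answering, and other tasks.

Offline deployment: model files can be downloaded and then run fully offline.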

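5. Hands-On: Building an Offline RAG Question-Answering System

We use LangChain to orchestrate the whole RAG flow. It provides convenient interfaces for integrating SentenceTransformers, ChromaDB, and HuggingFace Transformers.

5.1 Environment Setup and Library Installation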

First, install the required Python libraries:

pip install langchain langchain-community sentence-transformers chromadb datasets transformers accelerate torch pypdf

langchain: RAG orchestrator.

langchain-community: provides the langchain_community integrations (loaders, embeddings, vector stores) imported below.

sentence-transformers: generates text embeddings.

chromadb: the local vector store.

torch: PyTorch backend for HuggingFace models.

transformers: loads and runs local LLMs.

accelerate: required by transformers when a model is loaded with device_map="auto".

datasets: optional; useful for loading and handling data, including local text files.

pypdf: for loading PDF files if needed (this example uses a simple text file).

5.2 Step 1: Prepare the Knowledge Base and Load the Data

For the demo, we create a simple text file, knowledge_base.txt, as the knowledge source. Its contents:

# knowledge_base.txt

RAG (Retrieval Augmented Generation) is an AI framework that combines retrieval databases with generative language models. It improves the accuracy and relevance of LLM outputs by augmenting prompts with relevant external knowledge.

The core components of RAG are:

1. Document Loader: Loads data from various sources (text files, PDFs, web pages).

2. Text Splitter: Divides large documents into smaller, manageable chunks.

3. Embedding Model: Transforms text chunks into numerical vectors (embeddings).

4. Vector Store: Stores embeddings and provides efficient similarity search (e.g., ChromaDB, FAISS, Pinecone).

5. LLM: Generates coherent and relevant answers based on the query and retrieved context.

ChromaDB is an open-source, embeddable vector database. It's lightweight, Python-native, and supports data persistence, making it ideal for local RAG deployments where privacy and offline capability are crucial.

Persistent storage in ChromaDB allows you to save your index to disk, so you don't have to re-index data every time your application restarts. You specify a directory, and ChromaDB manages the storage.

HuggingFace's SentenceTransformers library offers a wide variety of pre-trained models for generating text embeddings. Models like 'all-MiniLM-L6-v2' are efficient and provide good quality for many tasks.

HuggingFace's Transformers library allows us to run LLMs locally. You can download models like GPT-2, Llama, or Mistral and use them as the 'Generator' component in your RAG pipeline without needing an internet connection or API keys.

5.3 Step 2: Text Splitting (Chunking)

We use LangChain's RecursiveCharacterTextSplitter to split the text into small chunks.

from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter


# --- 1. Data Loading ---
def load_documents(file_path="knowledge_base.txt"):
    """Load the knowledge base file into LangChain Document objects."""
    loader = TextLoader(file_path, encoding="utf-8")
    # For PDFs, you'd use:
    # from langchain_community.document_loaders import PyPDFLoader
    # loader = PyPDFLoader(file_path)
    return loader.load()


# --- 2. Text Splitting ---
def split_documents(documents):
    """Split documents into overlapping chunks sized for the embedding model."""
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,      # Max characters per chunk
        chunk_overlap=150,    # Characters to overlap between chunks
        length_function=len,
        add_start_index=True,
    )
    return text_splitter.split_documents(documents)


# Load and split documents
print("Loading and splitting documents...")
documents = load_documents()
text_chunks = split_documents(documents)

# Print info about chunks
print(f"Loaded {len(documents)} documents.")
print(f"Split into {len(text_chunks)} text chunks.")
# print(f"First chunk: {text_chunks[0].page_content[:200]}...")  # Preview first chunk

5.4 Step 3: Generate Text Embeddings

We use LangChain's SentenceTransformerEmbeddings, a class that wraps the sentence-transformers library.

from langchain_community.embeddings import SentenceTransformerEmbeddings

# --- 3. Embedding Generation ---
# Initialize SentenceTransformer embedding model
# 'all-MiniLM-L6-v2' is a good balance of speed and quality for local use.
embeddings = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
print(f"Embedding model loaded: {embeddings.model_name}")

5.5 Step 4: Initialize ChromaDB and Build the Vector Index

We use ChromaDB as the vector_store. The Chroma class comes from LangChain.

from langchain_community.vectorstores import Chroma

# --- 4. Vector Store Setup (ChromaDB) ---
# Directory where ChromaDB will store its data
PERSIST_DIRECTORY = "chroma_db_offline"

# Build the index from the chunks.
# If the directory does not exist, Chroma will create it.
# Note: from_documents always adds the given chunks, so rerunning it against an
# existing directory inserts them again. To reuse an existing index, load it
# as shown in the commented lines further down.
print(f"Initializing ChromaDB with persistence directory: {PERSIST_DIRECTORY}")
vector_store = Chroma.from_documents(
    documents=text_chunks,
    embedding=embeddings,
    persist_directory=PERSIST_DIRECTORY,  # This makes the vector store persistent
)

# Optional: older Chroma versions need an explicit persist() call;
# newer ones write to disk automatically.
# vector_store.persist()
# print("ChromaDB index persisted.")

# To load an existing persistent vector store:
# embeddings_for_loading = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
# vector_store = Chroma(persist_directory=PERSIST_DIRECTORY, embedding_function=embeddings_for_loading)

# Now, vector_store is ready for retrieval.
print(f"ChromaDB collection created/loaded with {vector_store._collection.count()} documents.")

5.6 Step 5: Deploy a Local LLM (HuggingFace Transformers)

We use HuggingFace's transformers library to load a small LLM (such as GPT-2) as the "Generator". The pipeline API makes this convenient.

import torch
from langchain_community.llms import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# --- 5. Local LLM Setup ---
# Choose a local LLM. GPT-2 is small and fast for demonstration.
# For better performance, you might want to use larger models like Mistral, Llama 2, etc.,
# potentially with quantization (e.g., bitsandbytes) if you have limited VRAM.
LLM_MODEL_NAME = "gpt2"  # Or "gpt2-medium", "distilgpt2", etc.


def load_model_and_tokenizer(model_name):
    """Load a causal LM and its tokenizer, using FP16 on GPU and FP32 on CPU."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
        device_map="auto",  # Use the GPU if available, else CPU (requires accelerate)
    )
    return tokenizer, model


print(f"Loading local LLM: {LLM_MODEL_NAME}...")
try:
    tokenizer, model = load_model_and_tokenizer(LLM_MODEL_NAME)
except Exception as e:
    print(f"Error loading model {LLM_MODEL_NAME}: {e}")
    print("Please ensure you have enough disk space and RAM/VRAM. Trying with a smaller model like 'distilgpt2' might help.")
    LLM_MODEL_NAME = "distilgpt2"  # Fallback to a smaller model
    print(f"Falling back to: {LLM_MODEL_NAME}")
    tokenizer, model = load_model_and_tokenizer(LLM_MODEL_NAME)

# Create a text generation pipeline; generation parameters are set here.
text_generation_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=250,  # Limit the length of the LLM's answer
    temperature=0.7,
    do_sample=True,  # Enable sampling for more varied answers
    repetition_penalty=1.1,
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token; reuse EOS
)

# Wrap the pipeline with LangChain's HuggingFacePipeline
local_llm = HuggingFacePipeline(pipeline=text_generation_pipeline)
print("Local LLM loaded and ready.")

5.7 Step 6: Build the RAG Chain (LangChain)

LangChain Expression Language (LCEL) provides a flexible way to compose the retrieval and generation parts of RAG.

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough

# --- 6. Building the RAG Chain ---
# Define the prompt template
template = """
You are a helpful assistant that answers questions based on the provided context.
If you don't know the answer, just say you don't know.
Do not use your general knowledge. Only use the context below.

Context:
{context}

Question: {question}

Answer:
"""
prompt = PromptTemplate.from_template(template)

# Create the retriever from the ChromaDB vector store
# k=3 means we fetch the 3 most relevant chunks
retriever = vector_store.as_retriever(search_kwargs={"k": 3})


def format_docs(docs):
    """Join the retrieved chunks into a single context string."""
    return "\n".join(doc.page_content for doc in docs)


# Build the LCEL chain
# 1. The retriever fetches chunks for the question and format_docs joins them
#    (LCEL wraps a plain function in a RunnableLambda when it is piped).
#    RunnablePassthrough forwards the question unchanged; it does not transform
#    its input, so it cannot be used for the join.
# 2. The prompt is filled with the context and the question
# 3. The LLM generates, and StrOutputParser returns a plain string
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | local_llm
    | StrOutputParser()
)
print("RAG Chain created.")
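To see what data moves through the chain, the same flow can be emulated with plain functions. This is a conceptual sketch only: `Doc`, `fake_retriever`, and `echo_llm` are hypothetical stand-ins for LangChain's Document, the Chroma retriever, and the local model, and the template is shortened.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Doc:
    """Stand-in for LangChain's Document; only page_content is modeled."""
    page_content: str


TEMPLATE = "Context:\n{context}\n\nQuestion: {question}\nAnswer:"


def run_chain(question: str,
              retriever: Callable[[str], list[Doc]],
              llm: Callable[[str], str]) -> str:
    # Step 1: the dict at the head of the chain feeds the same question to two branches.
    inputs = {
        "context": "\n".join(doc.page_content for doc in retriever(question)),
        "question": question,  # passed through unchanged
    }
    # Step 2: the prompt template fills its placeholders from that dict.
    prompt = TEMPLATE.format(**inputs)
    # Step 3: the LLM turns the prompt into text, returned here as a plain string.
    return llm(prompt)


def fake_retriever(question: str) -> list[Doc]:
    return [Doc("ChromaDB persists data to disk."), Doc("GPT-2 runs locally.")]


def echo_llm(prompt: str) -> str:
    return prompt  # echo the prompt so the assembled input is visible


result = run_chain("Where does ChromaDB store data?", fake_retriever, echo_llm)
print(result)
```

Printing the echoed prompt is a handy debugging habit with the real chain too: it shows what the model actually received.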

5.8 Run and Demo

Now we can ask questions and let the system answer.

# --- 7. Querying the RAG System ---
print("\n--- Let's start the Q&A session (type 'quit' to exit) ---")
while True:
    query = input("Your question: ")
    if query.lower() == 'quit':
        break
    if query.strip() == "":
        continue
    try:
        # The RAG chain takes the question string as input
        answer = rag_chain.invoke(query)
        print(f"Answer: {answer}\n")
    except Exception as e:
        print(f"An error occurred during query: {e}\n")

print("Exiting Q&A session. Local RAG system shut down.")

Complete code example (main.py):

import os

import torch
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import TextLoader
from langchain_community.embeddings import SentenceTransformerEmbeddings
from langchain_community.llms import HuggingFacePipeline
from langchain_community.vectorstores import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# --- Configuration ---
KNOWLEDGE_BASE_FILE = "knowledge_base.txt"
PERSIST_DIRECTORY = "chroma_db_offline"
EMBEDDING_MODEL_NAME = "all-MiniLM-L6-v2"  # SentenceTransformer model
LLM_MODEL_NAME = "gpt2"  # HuggingFace LLM for generation
CHUNK_SIZE = 1000
CHUNK_OVERLAP = 150
TOP_K_RETRIEVAL = 3

# --- Helper Functions ---
def load_documents(file_path):
    """Loads documents from a text file."""
    if not os.path.exists(file_path):
        # Create a dummy file if it doesn't exist
        print(f"Knowledge base file '{file_path}' not found. Creating a dummy one.")
        dummy_content = """
The core components of RAG are:
1. Document Loader: Loads data from various sources (text files, PDFs, web pages).
2. Text Splitter: Divides large documents into smaller, manageable chunks.
3. Embedding Model: Transforms text chunks into numerical vectors (embeddings).
4. Vector Store: Stores embeddings and provides efficient similarity search (e.g., ChromaDB, FAISS, Pinecone).
5. LLM: Generates coherent and relevant answers based on the query and retrieved context.
"""
        with open(file_path, "w", encoding="utf-8") as f:
            f.write(dummy_content)
    loader = TextLoader(file_path, encoding="utf-8")
    return loader.load()


def split_documents(documents, chunk_size, chunk_overlap):
    """Splits documents into smaller chunks."""
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len,
        add_start_index=True,
    )
    return text_splitter.split_documents(documents)


def setup_embedding_model(model_name):
    """Initializes the SentenceTransformer embedding model."""
    print(f"Loading embedding model: {model_name}...")
    return SentenceTransformerEmbeddings(model_name=model_name)


def setup_chromadb(persist_directory, embedding_function, documents):
    """Initializes or loads ChromaDB vector store."""
    print(f"Initializing/Loading ChromaDB from: {persist_directory}")
    # Index from scratch only if the directory is missing or empty;
    # otherwise load the existing index so chunks are not inserted twice.
    if not os.path.exists(persist_directory) or not os.listdir(persist_directory):
        print("Creating new ChromaDB collection...")
        vector_store = Chroma.from_documents(
            documents=documents,
            embedding=embedding_function,
            persist_directory=persist_directory,
        )
        # Older Chroma versions need an explicit persist();
        # newer ones write to disk automatically.
        if hasattr(vector_store, "persist"):
            vector_store.persist()
        print("ChromaDB collection created and persisted.")
    else:
        print("Loading existing ChromaDB collection...")
        vector_store = Chroma(persist_directory=persist_directory, embedding_function=embedding_function)
        print("ChromaDB collection loaded.")
    print(f"ChromaDB collection has {vector_store._collection.count()} documents.")
    return vector_store

def setup_local_llm(model_name):
    """Sets up a local LLM using HuggingFace Transformers."""

    def load(name):
        tokenizer = AutoTokenizer.from_pretrained(name)
        # Half precision on GPU for faster inference, full precision on CPU.
        # device_map="auto" places the model on the GPU if present (requires accelerate).
        model = AutoModelForCausalLM.from_pretrained(
            name,
            torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
            device_map="auto",
        )
        return tokenizer, model

    print(f"Loading local LLM: {model_name}...")
    try:
        tokenizer, model = load(model_name)
    except Exception as e:
        print(f"Error loading model {model_name}: {e}")
        print("Falling back to a smaller model like 'distilgpt2' if available.")
        model_name = "distilgpt2"  # Fallback to a smaller model to ensure it runs
        tokenizer, model = load(model_name)

    # Create a text generation pipeline
    # Adjust max_new_tokens and other parameters as needed
    text_generation_pipeline = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        max_new_tokens=250,
        temperature=0.7,
        do_sample=True,
        repetition_penalty=1.1,
        # Add padding token if missing (many models don't have one by default)
        pad_token_id=tokenizer.eos_token_id if tokenizer.pad_token is None else tokenizer.pad_token_id,
    )
    return HuggingFacePipeline(pipeline=text_generation_pipeline)

def build_rag_chain(vector_store, llm):
    """Builds the RAG chain using LangChain Expression Language."""
    # Create the retriever
    retriever = vector_store.as_retriever(search_kwargs={"k": TOP_K_RETRIEVAL})

    # Define the prompt template
    template = """
You are a helpful AI assistant that answers questions based only on the provided context.
If the context does not contain the answer, please state that you don't know.
Avoid using any prior knowledge.

Context:
{context}

Question: {question}

Answer:
"""
    prompt = PromptTemplate.from_template(template)

    def format_docs(docs):
        """Join the retrieved chunks into a single context string."""
        return "\n".join(doc.page_content for doc in docs)

    # Build the RAG chain
    # The retriever is called first and format_docs joins its results into the context
    # (LCEL wraps a plain function in a RunnableLambda when it is piped).
    # RunnablePassthrough forwards the question unchanged.
    # The question and context are then combined into the prompt.
    # Finally, the prompt is passed to the LLM.
    rag_chain = (
        {"context": retriever | format_docs, "question": RunnablePassthrough()}
        | prompt
        | llm
        | StrOutputParser()  # Parses the output to a plain string
    )
    return rag_chain

# --- Main Execution ---
if __name__ == "__main__":
    # 1. Load and split documents
    documents = load_documents(KNOWLEDGE_BASE_FILE)
    text_chunks = split_documents(documents, CHUNK_SIZE, CHUNK_OVERLAP)

    # 2. Setup embedding model
    embeddings = setup_embedding_model(EMBEDDING_MODEL_NAME)

    # 3. Setup ChromaDB
    vector_store = setup_chromadb(PERSIST_DIRECTORY, embeddings, text_chunks)

    # 4. Setup local LLM
    local_llm = setup_local_llm(LLM_MODEL_NAME)

    # 5. Build RAG chain
    rag_chain = build_rag_chain(vector_store, local_llm)

    # --- Start Q&A Session ---
    print("\n--- Local RAG Q&A Session Started ---")
    print(f"Using LLM: '{LLM_MODEL_NAME}'. Type 'quit' to exit.")
    while True:
        query = input("Your question: ")
        if query.lower() == 'quit':
            break
        if not query.strip():
            continue
        try:
            # Invoke the RAG chain with the user's query
            answer = rag_chain.invoke(query)
            print(f"Answer: {answer}\n")
        except Exception as e:
            print(f"An error occurred during query: {e}\n")

    print("Exiting Q&A session. Local RAG system shut down.")

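6. Performance Optimization and Further Considerations

6.1 Choosing an Embedding Model:

all-MiniLM-L6-v2 is a good starting point: fast, with reasonable quality.

For higher-quality embeddings, try a larger model from the SBERT (sentence-transformers) library, such as all-mpnet-base-v2.

Note: the larger the model, the more memory and compute it needs, and the slower it may be to embed each query before the LLM runs.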
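Model size also affects storage, because larger models usually emit longer vectors. all-MiniLM-L6-v2 produces 384-dimensional embeddings and all-mpnet-base-v2 produces 768-dimensional ones. The estimate below assumes float32 storage and counts raw vectors only; index structures, stored text, and metadata add to it, and the 100,000-chunk corpus is a hypothetical figure.

```python
BYTES_PER_FLOAT32 = 4
EMBEDDING_DIMS = {
    "all-MiniLM-L6-v2": 384,
    "all-mpnet-base-v2": 768,
}


def raw_vector_megabytes(num_chunks: int, dims: int) -> float:
    """Size of the raw embedding matrix in megabytes (10**6 bytes)."""
    return num_chunks * dims * BYTES_PER_FLOAT32 / 1_000_000


NUM_CHUNKS = 100_000  # hypothetical corpus size
for name, dims in EMBEDDING_DIMS.items():
    print(f"{name}: {raw_vector_megabytes(NUM_CHUNKS, dims):.1f} MB")
```

Under these assumptions, doubling the dimensionality doubles the raw vector storage, from about 153.6 MB to about 307.2 MB.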

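6.2 Tuning the Text Splitting Strategy:

chunk_size directly affects the granularity and completeness of the retrieved context. Too small and chunks lose context; too large and they carry too much irrelevant text, which can drown the LLM.

chunk_overlap preserves continuity across chunk boundaries and helps the LLM find complete answers near split points.

Experiment with different splitting parameters to find the settings that best suit your knowledge base.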
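One hard constraint on chunk size is the generator's context window: the retrieved chunks, the prompt template, the question, and the generated answer all have to fit. GPT-2's window is 1024 tokens. The check below is a rough worst-case sketch. The ratio of 4 characters per token is a common heuristic for English text rather than a measured value, and the 330 characters of template and question overhead is an assumed figure.

```python
import math

GPT2_CONTEXT_WINDOW = 1024  # tokens


def estimated_prompt_tokens(chunk_size: int, top_k: int, overhead_chars: int,
                            chars_per_token: float = 4.0) -> int:
    """Worst-case token estimate: top_k full-size chunks plus template and question text."""
    return math.ceil((chunk_size * top_k + overhead_chars) / chars_per_token)


def fits(chunk_size: int, top_k: int, max_new_tokens: int, overhead_chars: int = 330) -> bool:
    """True if the estimated prompt plus the generated tokens fit in the context window."""
    prompt_tokens = estimated_prompt_tokens(chunk_size, top_k, overhead_chars)
    return prompt_tokens + max_new_tokens <= GPT2_CONTEXT_WINDOW


for size in (1000, 500):
    print(size, fits(chunk_size=size, top_k=3, max_new_tokens=250))
```

Under these assumptions, three full 1000-character chunks plus 250 new tokens would exceed GPT-2's window, while 500-character chunks fit comfortably. For an exact count, tokenize the assembled prompt with the model's own tokenizer.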

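6.3 ChromaDB Persistence and Management:

persist_directory: make sure you specify a directory for the data. The files in it contain all vectors, text, and metadata.

Data synchronization: if your knowledge base changes over time, plan how to regenerate embeddings and update ChromaDB. You can delete the old directory and re-index, or use ChromaDB's lower-level API to update incrementally.

Version control: consider keeping persist_directory in a Git repository (mind the file sizes), or use another versioning tool to manage different versions of the index.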
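For incremental updates, one possible approach is to derive each chunk's id from a hash of its content and compare the ids already indexed with the ids of the current chunks. Only the difference then needs to be embedded or deleted. The helper names below are hypothetical, and applying the plan to the store (for example through the collection's add and delete methods) is left out.

```python
import hashlib


def chunk_id(text: str) -> str:
    """Stable id derived from the chunk's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


def plan_sync(indexed_ids: set[str], current_chunks: list[str]) -> tuple[dict[str, str], set[str]]:
    """Return (chunks to add keyed by id, ids to delete) to bring the index up to date."""
    current = {chunk_id(text): text for text in current_chunks}
    to_add = {cid: text for cid, text in current.items() if cid not in indexed_ids}
    to_delete = indexed_ids - set(current)
    return to_add, to_delete


# Yesterday's index held "alpha" and "beta"; today's knowledge base has "beta" and "gamma".
indexed = {chunk_id("alpha"), chunk_id("beta")}
to_add, to_delete = plan_sync(indexed, ["beta", "gamma"])
print(sorted(to_add.values()), len(to_delete))
```

Unchanged chunks ("beta" here) are skipped entirely, so embedding cost scales with the size of the change rather than the size of the knowledge base.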

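6.4 LLM Selection and Quantization:

gpt2 is a very small model: fine for quick tests, but limited in capability. It is also not instruction-tuned, so it often ignores the instructions in the prompt.

For serious applications, consider stronger models on the HuggingFace Hub, such as Llama 2, Mistral, or smaller Mixtral variants, or optimized formats (for example GGML/GGUF models, which can be loaded through libraries like llama-cpp-python and usually need extra dependencies).

Quantization: quantizing a model to 4-bit or 8-bit with libraries such as bitsandbytes can greatly reduce the VRAM and RAM required and may speed up inference, at a possible small cost in accuracy.

GPU acceleration: if you have an NVIDIA GPU, make sure PyTorch and Transformers are configured to use it (through device_map="auto" and torch_dtype=torch.float16), which greatly speeds up LLM inference.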
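The memory effect of precision follows from simple arithmetic: weight memory is roughly the parameter count times the bits per weight. The sketch below counts weights only; activations, the KV cache, and quantization bookkeeping come on top. The 7-billion-parameter model is a hypothetical example, and 124 million is the parameter count of the smallest GPT-2.

```python
BITS_PER_WEIGHT = {"fp32": 32, "fp16": 16, "int8": 8, "int4": 4}


def weight_gigabytes(num_params: float, precision: str) -> float:
    """Approximate memory for the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * BITS_PER_WEIGHT[precision] / 8 / 1e9


for label, params in (("gpt2 (124M)", 124e6), ("hypothetical 7B model", 7e9)):
    sizes = ", ".join(f"{p}: {weight_gigabytes(params, p):.2f} GB" for p in BITS_PER_WEIGHT)
    print(f"{label} -> {sizes}")
```

For the 7B example this gives about 28 GB at FP32, 14 GB at FP16, 7 GB at INT8, and 3.5 GB at INT4. That arithmetic is why 4-bit quantization is what usually brings such models within reach of a single consumer GPU.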

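6.5 Real-World Considerations:

Preprocessing pipeline: large, complex data sources (such as PDFs) need a more robust preprocessing pipeline, which may include PDF parsing, OCR (optical character recognition), and multimodal extraction (such as recognizing charts and linking images to text).

Retrieval diversity: consider retrieving richer context from the knowledge base, possibly drawn from different document chunks, to give more complete answers.

User experience: consider how to improve user input handling, query parsing, and answer presentation for a smoother interaction.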
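One common technique for retrieval diversity, not used in the code above, is maximal marginal relevance (MMR): each pick is scored by its relevance to the query minus its similarity to chunks already picked. The pure-Python sketch below uses made-up two-dimensional vectors. In LangChain, vector store retrievers can be asked for this behavior with `as_retriever(search_type="mmr")`.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr_select(query: list[float], candidates: dict[str, list[float]],
               k: int, lambda_mult: float = 0.5) -> list[str]:
    """Greedily pick k ids, trading relevance to the query against similarity to earlier picks."""
    selected: list[str] = []
    remaining = set(candidates)
    while remaining and len(selected) < k:
        def score(doc_id: str) -> float:
            relevance = cosine(query, candidates[doc_id])
            redundancy = max((cosine(candidates[doc_id], candidates[s]) for s in selected),
                             default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(sorted(remaining), key=score)  # sorted() makes ties deterministic
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.0]
candidates = {
    "a": [1.0, 0.10],
    "b": [1.0, 0.12],   # near-duplicate of "a"
    "c": [0.6, -0.80],  # less relevant, but different content
}
by_relevance = sorted(candidates, key=lambda d: cosine(query, candidates[d]), reverse=True)[:2]
diverse = mmr_select(query, candidates, k=2)
print(by_relevance, diverse)
```

Plain top-2 returns the two near-duplicates, while MMR keeps the best match and swaps the duplicate for the chunk with different content.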

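7. Conclusion: Embracing the Possibilities of Local AI

In this article we used ChromaDB and the HuggingFace ecosystem to build a complete local, offline RAG question-answering system. Besides giving us a capable, private, and flexible knowledge Q&A tool, it lets developers build and control AI applications independently, without relying on cloud services.

ChromaDB's light weight and persistence make it a strong choice of vector database for local deployment, and HuggingFace's large model library lets us generate text embeddings and run LLMs locally. LangChain, as the orchestration framework, ties these components together into a coherent RAG flow.

Whether you are working on a personal project, academic research, or an enterprise application with strict data-privacy requirements, local RAG is a direction worth exploring in depth. We hope this tutorial helps you get started with local AI application development and gives you more autonomy and creative freedom.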
