[Exclusive First Release] Bulk Importing and Updating a Knowledge Base
Preface
In the previous post, automatic database updates let the agent query the latest data in real time. But database search is rigid: it only supports SQL queries and cannot use RAG.
The difference between a knowledge base and a database
A knowledge base is searched with RAG, which does a good job of recommending the products a user actually wants. A database is searched with SQL, which is too literal to match suitable products. For example, if a user says, "I have a headache, can you recommend some medicine?", a model working purely through SQL cannot match a drug like ibuprofen, but RAG can.
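A toy illustration of that gap (not the platform's actual retrieval code): a literal SQL match never connects "headache" to "ibuprofen", while similarity over embeddings does. The tiny hand-made 3-d vectors below are stand-ins for a real embedding model's output.

```python
import sqlite3
import math

# A tiny in-memory product table, standing in for the real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT)")
conn.execute("INSERT INTO products VALUES ('ibuprofen')")

# 1) Literal SQL search: the user's words must appear verbatim.
query = "headache medicine"
sql_hits = conn.execute("SELECT name FROM products WHERE name LIKE ?",
                        (f"%{query}%",)).fetchall()
print("SQL hits:", sql_hits)  # [] -- 'ibuprofen' never contains 'headache'

# 2) RAG-style search: compare meaning via dense vectors.
embeddings = {
    "headache medicine": [0.9, 0.8, 0.1],
    "ibuprofen":         [0.85, 0.75, 0.2],
    "office chair":      [0.05, 0.1, 0.9],
}

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b)) / (math.hypot(*a) * math.hypot(*b))

best = max(["ibuprofen", "office chair"],
           key=lambda name: cosine(embeddings[query], embeddings[name]))
print("Semantic hit:", best)  # ibuprofen
```

The SQL query returns nothing because string matching has no notion of meaning; the vector comparison ranks "ibuprofen" closest to the query even though they share no characters.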
However, Excel files cannot be uploaded to the knowledge base from the backend; when uploading through code, the platform only accepts Word and PDF. So why not convert the spreadsheet into a structured Word document and upload that to the knowledge base through code? So I tried something fun.
Converting an Excel spreadsheet into structured text
```python
import csv
from docx import Document
from docx.oxml.ns import qn
from docx.shared import Pt
import chardet

def detect_encoding(file_path):
    with open(file_path, 'rb') as f:
        data = f.read(10000)
    result = chardet.detect(data)
    return result['encoding']

def csv_to_word(csv_file, word_file):
    # Detect the file encoding
    encoding = detect_encoding(csv_file)
    print(f"Detected file encoding: {encoding}")
    document = Document()
    document.styles['Normal'].font.name = u'宋体'  # Set the Chinese font
    document.styles['Normal']._element.rPr.rFonts.set(qn('w:eastAsia'), u'宋体')
    with open(csv_file, encoding=encoding, errors='ignore') as file:
        reader = csv.reader(file)
        rows = list(reader)
    table = document.add_table(rows=len(rows), cols=len(rows[0]))
    table.style = 'Table Grid'
    for i, row in enumerate(rows):
        row_cells = table.rows[i].cells
        for j, cell in enumerate(row):
            row_cells[j].text = cell
    for row in table.rows:
        for cell in row.cells:
            for paragraph in cell.paragraphs:
                for run in paragraph.runs:
                    run.font.size = Pt(10)
    document.save(word_file)
    print("Conversion complete!")

csv_to_word('********.csv', '*********.docx')
```
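If chardet isn't installed, a rough stdlib-only fallback is to try decoding with a few candidate encodings in order. This is only a sketch, not equivalent to chardet's statistical detection, and the candidate list is an assumption suited to Chinese CSV exports:

```python
import tempfile

def detect_encoding_fallback(file_path, candidates=("utf-8", "gbk", "latin-1")):
    """Return the first candidate encoding that decodes the file cleanly."""
    with open(file_path, "rb") as f:
        data = f.read(10000)
    for enc in candidates:
        try:
            data.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return None

# A GBK-encoded file is rejected by UTF-8 but accepted by GBK:
path = tempfile.mkstemp(suffix=".csv")[1]
with open(path, "wb") as f:
    f.write("宋体,表格".encode("gbk"))
enc_detected = detect_encoding_fallback(path)
print(enc_detected)  # gbk
```

Trying UTF-8 first matters: GBK will happily "decode" many byte sequences, so the stricter encoding has to get the first chance.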
Passing data through the API
The API workflow is: create the knowledge base → retrieve every document_id in it → iterate over the document_ids and delete them → write the new file into the knowledge base. One pass through these steps gives us both the update and the automatic upload.
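The steps after creation can be sketched as a single refresh function. To keep the sketch testable without network access, the HTTP call is injected as a `post(url, payload)` callable, and the upload endpoint is passed in as `create_url` because it is redacted in this post; field names mirror the request bodies shown below but should be checked against the current Coze docs.

```python
def refresh_knowledge_base(post, dataset_id, doc_name, file_base64, create_url):
    """List, delete, then re-upload -- the update workflow as one function.

    `post` is any callable (url, payload) -> parsed JSON dict, so the flow
    can run on requests in production or on a stub in tests.
    """
    # 1) List the documents currently in the dataset.
    listing = post("https://api.coze.cn/open_api/knowledge/document/list",
                   {"dataset_id": dataset_id})
    doc_ids = [d["document_id"] for d in listing.get("document_infos", [])]

    # 2) Delete them all in one call.
    if doc_ids:
        post("https://api.coze.cn/open_api/knowledge/document/delete",
             {"document_ids": doc_ids})

    # 3) Upload the new file as base64.
    return post(create_url,
                {"dataset_id": dataset_id,
                 "document_bases": [{"name": doc_name,
                                     "source_info": {"file_base64": file_base64,
                                                     "file_type": "docx"}}]})

# Exercise the flow with a stub instead of the real API:
calls = []
def fake_post(url, payload):
    calls.append((url, payload))
    if url.endswith("/list"):
        return {"document_infos": [{"document_id": "111"}, {"document_id": "222"}]}
    return {"code": 0}

result = refresh_knowledge_base(fake_post, "ds1", "new.docx", "QUJD",
                                create_url="https://example.invalid/create")
print(result)  # {'code': 0}
```

Injecting the transport keeps the delete-then-upload logic verifiable offline; in production you would pass a thin wrapper around `requests.post`.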
Creating the knowledge base
```python
import requests
import json

# Create the knowledge base
url = "https://api.coze.cn/v1/datasets"
headers = {
    'Authorization': '*********************',  # Replace with a real token
    'Content-Type': 'application/json'
}
data = {
    'name': "guanlizhe_test",
    'space_id': '****************',
    'format_type': 0
}
response = requests.post(url, headers=headers, data=json.dumps(data))
print(response.status_code)
print("Create response:", response.json())
```
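Note that `requests.post` can return HTTP 200 even when the API itself rejects the call, so the body is worth checking too. A minimal sketch of such a check, assuming the response carries `code`/`msg` fields (verify the field names against the actual bodies printed above):

```python
def check_coze_response(resp_json):
    """Raise if the API-level result code signals failure.

    The 'code'/'msg' field names are assumptions for this sketch.
    """
    if resp_json.get("code", 0) != 0:
        raise RuntimeError(f"API error {resp_json.get('code')}: "
                           f"{resp_json.get('msg', '')}")
    return resp_json

ok = check_coze_response({"code": 0, "data": {"dataset_id": "123"}})
print("passed:", ok["code"])  # passed: 0
```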
Retrieve all document_ids + delete each one + write the new file into the knowledge base
```python
import requests
import json
import base64

## List the documents in the knowledge base
url = "https://api.coze.cn/open_api/knowledge/document/list"
headers = {
    'Authorization': '*********************************',  # Replace with a real token
    'Content-Type': 'application/json',
    'Agw-Js-Conv': 'str'
}
data = {
    'dataset_id': "***********",
}
response = requests.post(url, headers=headers, data=json.dumps(data))
print(response.status_code)
# print("List response:", response.json())
document_list = [document_dict['document_id']
                 for document_dict in response.json()['document_infos']]
print("IDs of every document in the knowledge base:", document_list)

## Delete the existing documents
url = 'https://api.coze.cn/open_api/knowledge/document/delete'
data = {"document_ids": document_list}
response = requests.post(url, headers=headers, data=json.dumps(data))
print(response.status_code)
print("Delete response:", response.json())

## Write the new file into the knowledge base
filepath = '**************.docx'
with open(filepath, 'rb') as f1:
    base64_str = base64.b64encode(f1.read())  # bytes, e.g. b'JVBERi0xLjUNCiXi48...'
src = base64_str.decode('utf-8')              # str,   e.g. 'JVBERi0xLjUNCiXi48...'
print("src:", type(src))
url = "https://api.coze.cn/open_api/knowledge/doc*"
headers = {
    'Authorization': '*****************************************',  # Replace with a real token
    'Content-Type': 'application/json',
    'Agw-Js-Conv': 'str'
}
data = {
    'dataset_id': "***************",
    'document_bases': [{
        "name": "2222.docx",
        "source_info": {"file_base64": src, "file_type": "docx"}
    }],
    'chunk_strategy': {
        "separator": "\t\t",
        "max_tokens": 2000,
        "remove_extra_spaces": False,
        "remove_urls_emails": False,
        "chunk_type": 1
    }
}
response = requests.post(url, headers=headers, data=json.dumps(data))
print(response.status_code)
print("Write response:", response.json())
```
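The `file_base64` field carries the raw file bytes as text, which is why the encode-then-decode dance above is needed. A quick stdlib sanity check of the round trip, and of the size overhead to expect for large documents:

```python
import base64
import os

payload = os.urandom(3000)  # stand-in for the .docx bytes
src = base64.b64encode(payload).decode("utf-8")  # plain str, safe inside JSON
assert base64.b64decode(src) == payload          # round trip is lossless
print(len(payload), "->", len(src))  # 3000 -> 4000: base64 adds ~33%
```

Every 3 input bytes become 4 output characters, so a large Word file grows by a third in the request body; worth remembering if an upload ever hits a payload size limit.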
At this point I feel I've conquered my old fear of reading API-style technical documentation. After a few days of practice, I can work through all kinds of API docs without my face flushing or my heart racing. It turns out you really do have to push yourself.