From LightRAG's prompt to an optimized implementation based on OpenAI Structured Outputs

LightRAG is a configuration and management class for building the core components of a RAG system. It integrates document processing, storage, vectorization, graph construction, and LLM interaction; you customize the RAG system's behavior by configuring the parameters of a LightRAG instance.

The current entity-relationship extraction in LightRAG is implemented as follows:
```python
PROMPTS["entity_extraction"] = """---Goal---
Given a text document that is potentially relevant to this activity and a list of entity types, identify all entities of those types from the text and all relationships among the identified entities.
Use {language} as output language.

---Steps---
1. Identify all entities. For each identified entity, extract the following information:
- entity_name: Name of the entity, use same language as input text. If English, capitalized the name.
- entity_type: One of the following types: [{entity_types}]
- entity_description: Comprehensive description of the entity's attributes and activities
Format each entity as ("entity"{tuple_delimiter}<entity_name>{tuple_delimiter}<entity_type>{tuple_delimiter}<entity_description>)

2. From the entities identified in step 1, identify all pairs of (source_entity, target_entity) that are *clearly related* to each other.
For each pair of related entities, extract the following information:
- source_entity: name of the source entity, as identified in step 1
- target_entity: name of the target entity, as identified in step 1
- relationship_description: explanation as to why you think the source entity and the target entity are related to each other
- relationship_strength: a numeric score indicating strength of the relationship between the source entity and target entity
- relationship_keywords: one or more high-level key words that summarize the overarching nature of the relationship, focusing on concepts or themes rather than specific details
Format each relationship as ("relationship"{tuple_delimiter}<source_entity>{tuple_delimiter}<target_entity>{tuple_delimiter}<relationship_description>{tuple_delimiter}<relationship_keywords>{tuple_delimiter}<relationship_strength>)

3. Identify high-level key words that summarize the main concepts, themes, or topics of the entire text. These should capture the overarching ideas present in the document.
Format the content-level key words as ("content_keywords"{tuple_delimiter}<high_level_keywords>)

4. Return output in {language} as a single list of all the entities and relationships identified in steps 1 and 2. Use **{record_delimiter}** as the list delimiter.

5. When finished, output {completion_delimiter}

######################
---Examples---
######################
{examples}

#############################
---Real Data---
######################
Entity_types: [{entity_types}]
Text:
{input_text}
######################
Output:"""
```
Pain points of the original approach:
- Custom delimiters: tuple_delimiter, record_delimiter, and completion_delimiter require the LLM to strictly obey non-standard formatting conventions, and LLMs easily get this wrong (forgetting a delimiter, using the wrong one, or adding extra text where none belongs).
- Parsing complexity: a bespoke parser must be written for this custom text format, and such parsers tend to be fragile and hard to maintain.
- Poor robustness: even a small deviation in the LLM output can break parsing.
- Readability and standardization: the output is not a standard format, so it is hard for humans to read and cannot be consumed directly by standard tooling.
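To make the brittleness concrete, here is a minimal sketch of what consuming the delimiter format requires. This is not LightRAG's actual parser; the delimiter values below are assumed defaults for illustration. Note how a single forgotten record delimiter silently fuses two records into one garbled one:

```python
# Hypothetical minimal parser for the custom delimiter format (not LightRAG's
# real implementation); the delimiter values are assumptions for illustration.
TUPLE_DELIM = "<|>"
RECORD_DELIM = "##"
COMPLETION_DELIM = "<|COMPLETE|>"

def parse_records(llm_output: str) -> list[tuple]:
    records = []
    body = llm_output.split(COMPLETION_DELIM)[0]  # drop everything after the completion marker
    for raw in body.split(RECORD_DELIM):
        raw = raw.strip()
        if not raw.startswith("(") or not raw.endswith(")"):
            continue  # anything malformed is silently dropped
        fields = raw[1:-1].split(TUPLE_DELIM)
        records.append(tuple(f.strip().strip('"') for f in fields))
    return records

good = '("entity"<|>"Einstein"<|>"person"<|>"Physicist")##("entity"<|>"Ulm"<|>"location"<|>"City")<|COMPLETE|>'
bad = '("entity"<|>"Einstein"<|>"person"<|>"Physicist")("entity"<|>"Ulm"<|>"location"<|>"City")'  # record delimiter forgotten

print(len(parse_records(good)))  # 2
print(len(parse_records(bad)))   # 1: the two records fuse into one 7-field record
```

Every failure mode listed above must be handled by hand-written code like this, with no standard tooling to fall back on.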
In modern application architectures, JSON is the de facto standard for API communication and data exchange. Engineers rely on precise JSON structures to guarantee interoperability between systems and the integrity of the data they exchange. When integrating large language models (LLMs) into such systems, however, a common pain point is ensuring that the LLM's output reliably follows a predefined JSON format. Traditional free-text output frequently causes parsing errors, mismatched fields, and inconsistent data types, forcing teams to invest heavily in fragile parsing logic, validation code, and elaborate retry mechanisms.

**Structured Outputs** were introduced to address exactly this engineering challenge. The feature lets you specify a JSON Schema for the LLM and forces the model to generate responses that strictly conform to it. This is more than a formatting convention: it establishes a clear, reliable data contract with the LLM.
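On Chat Completions-style APIs that support Structured Outputs, the schema travels in the `response_format` parameter of the request. The sketch below only constructs and sanity-checks a plausible payload for the extraction task discussed in this article; the field names mirror the structure used later, and no API call is made:

```python
import json

# Illustrative JSON Schema for the extraction task (field names follow the
# entities / relationships / content_keywords structure used in this article).
extraction_schema = {
    "type": "object",
    "properties": {
        "entities": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "entity_name": {"type": "string"},
                    "entity_type": {"type": "string"},
                    "entity_description": {"type": "string"},
                },
                "required": ["entity_name", "entity_type", "entity_description"],
                "additionalProperties": False,
            },
        },
        "relationships": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "source_entity": {"type": "string"},
                    "target_entity": {"type": "string"},
                    "relationship_description": {"type": "string"},
                    "relationship_strength": {"type": "number"},
                    "relationship_keywords": {"type": "array", "items": {"type": "string"}},
                },
                "required": ["source_entity", "target_entity", "relationship_description",
                             "relationship_strength", "relationship_keywords"],
                "additionalProperties": False,
            },
        },
        "content_keywords": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["entities", "relationships", "content_keywords"],
    "additionalProperties": False,
}

# Shape of the response_format argument for a structured-outputs request,
# to be passed to chat.completions.create on models that support it.
response_format = {
    "type": "json_schema",
    "json_schema": {"name": "entity_extraction", "strict": True, "schema": extraction_schema},
}

print(json.dumps(response_format)[:40])  # payload serializes cleanly
```

Models that only offer the weaker JSON mode (`{"type": "json_object"}`), such as the DeepSeek endpoint used below, guarantee syntactically valid JSON but not schema conformance, which is why the implementation below also describes the expected structure in the prompt itself.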
My implementation, which applies the unifying idea behind OpenAI's Structured Outputs, is as follows.
```python
import json
from openai import OpenAI  # assuming you continue to use this client library

# --- 1. Build the text prompt sent to the LLM (revised) ---
```
```python
def _fill_placeholders_for_llm_prompt(text_template: str, values: dict) -> str:
    """Fill placeholders in a text template with values from the given dict.

    Placeholders may take the form {key} or [{key}].
    """
    filled_text = text_template
    for key, value in values.items():
        # Check the [{key}] form first (used mainly for entity_types); replacing
        # the inner {key} first would leave the square brackets unmatched.
        placeholder_square_curly = "[{" + key + "}]"
        if placeholder_square_curly in filled_text:
            # If the value is a list, format it as the inside of a JSON array;
            # otherwise assume it is a comma-separated string or a single type.
            if isinstance(value, list):
                value_str = ", ".join(f'"{v}"' if isinstance(v, str) else str(v) for v in value)
            else:
                value_str = str(value)
            filled_text = filled_text.replace(placeholder_square_curly, f"[{value_str}]")
        placeholder_curly = "{" + key + "}"
        if placeholder_curly in filled_text:
            filled_text = filled_text.replace(placeholder_curly, str(value))
    return filled_text


def build_llm_prompt_for_json_output(template_data: dict, document_text: str, task_params: dict) -> str:
    """Build the full text prompt for the LLM from the JSON template, document
    text, and fixed task parameters, instructing the LLM to output JSON.

    task_params contains: language, entity_types_string, examples.
    """
    prompt_lines = []
    # entity_types_string can be passed through directly, given how
    # _fill_placeholders_for_llm_prompt handles the [{entity_types}] form.
    placeholders_to_fill = {
        **task_params,
        "input_text": document_text,
        "entity_types": task_params.get("entity_types_string", ""),
    }

    # ---Goal---
    prompt_lines.append("---Goal---")
    goal_desc = _fill_placeholders_for_llm_prompt(template_data["goal"]["description"], placeholders_to_fill)
    prompt_lines.append(goal_desc)
    prompt_lines.append(f"Use {task_params.get('language', '{language}')} as output language for any textual descriptions within the JSON.")  # {language} is a fallback if not in task_params
    prompt_lines.append("\nIMPORTANT: Your entire response MUST be a single, valid JSON object. Do not include any text or formatting outside of this JSON object (e.g., no markdown backticks like ```json ... ```).")
    prompt_lines.append("")

    # ---Output JSON structure description---
    prompt_lines.append("---Output JSON Structure---")
    prompt_lines.append('The JSON object should have the following top-level keys: "entities", "relationships", and "content_keywords".')

    # Entity structure
    entity_step = next((step for step in template_data["steps"] if step["name"] == "Identify Entities"), None)
    if entity_step and "extraction_details" in entity_step:
        prompt_lines.append('\n1. The "entities" key should contain a JSON array. Each element in the array must be a JSON object representing an entity with the following keys:')
        for detail in entity_step["extraction_details"]:
            field_key = detail["field_name"]
            desc_filled = _fill_placeholders_for_llm_prompt(detail["description"], placeholders_to_fill)
            prompt_lines.append(f'  - "{field_key}": (string/number as appropriate) {desc_filled}')

    # Relationship structure
    relationship_step = next((step for step in template_data["steps"] if step["name"] == "Identify Relationships"), None)
    if relationship_step and "extraction_details" in relationship_step:
        prompt_lines.append('\n2. The "relationships" key should contain a JSON array. Each element must be a JSON object representing a relationship with the following keys:')
        for detail in relationship_step["extraction_details"]:
            field_key = detail["field_name"]
            desc_filled = _fill_placeholders_for_llm_prompt(detail["description"], placeholders_to_fill)
            type_hint = "(string)"  # default
            if "strength" in field_key.lower():
                type_hint = "(number, e.g., 0.0 to 1.0)"
            if "keywords" in field_key.lower() and "relationship_keywords" in field_key:
                type_hint = "(string, comma-separated, or an array of strings)"
            prompt_lines.append(f'  - "{field_key}": {type_hint} {desc_filled}')

    # Content-keywords structure
    keywords_step = next((step for step in template_data["steps"] if step["name"] == "Identify Content Keywords"), None)
    if keywords_step:
        prompt_lines.append('\n3. The "content_keywords" key should contain a JSON array of strings. Each string should be a high-level keyword summarizing the main concepts, themes, or topics of the entire text.')
        prompt_lines.append(f"  The description for these keywords is: {_fill_placeholders_for_llm_prompt(keywords_step['description'], placeholders_to_fill)}")

    prompt_lines.append("\nEnsure all string values within the JSON are properly escaped.")
    prompt_lines.append("")

    # ---Examples---
    if task_params.get("examples"):
        prompt_lines.append("######################")
        prompt_lines.append("---Examples (Content Reference & Expected JSON Structure)---")  # clarified purpose of examples
        prompt_lines.append("######################")
        # Examples should either be the expected JSON as a string, or a Python
        # structure that we serialize to a JSON string here.
        examples_content = task_params.get("examples", "")
        if isinstance(examples_content, (dict, list)):
            prompt_lines.append(json.dumps(examples_content, indent=2, ensure_ascii=False))
        else:  # assume it is already a (hopefully valid) JSON string
            prompt_lines.append(_fill_placeholders_for_llm_prompt(str(examples_content), placeholders_to_fill))
        prompt_lines.append("\nNote: The above examples illustrate the type of content and the desired JSON output format. Your output MUST strictly follow this JSON structure.")
        prompt_lines.append("")

    # ---Real data---
    prompt_lines.append("#############################")
    prompt_lines.append("---Real Data---")
    prompt_lines.append("######################")
    prompt_lines.append(f"Entity types to consider: [{_fill_placeholders_for_llm_prompt(task_params.get('entity_types_string', ''), {})}]")  # simpler fill for just this
    prompt_lines.append("Text:")
    prompt_lines.append(document_text)
    prompt_lines.append("######################")
    prompt_lines.append("\nOutput JSON:")
    return "\n".join(prompt_lines)

# --- 2. Call the LLM (revised) ---
```
```python
def get_llm_response_json(
    api_key: str,
    user_prompt: str,
    system_prompt: str = "你是一个用于结构化数据抽取的助手,专门输出JSON格式。",
    model: str = "deepseek-chat",
    base_url: str = "https://api.deepseek.com/v1",
    use_json_mode: bool = True,
    stop_sequence: str | None = None,
) -> str | None:
    """Call the LLM and return its text response, preferring JSON mode."""
    client = OpenAI(api_key=api_key, base_url=base_url)
    try:
        response_params = {
            "model": model,
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt},
            ],
            "stream": False,
            "temperature": 0.0,
        }
        if use_json_mode:
            # Whether response_format is supported depends on the model/SDK; an
            # unsupported parameter surfaces as an error when the request is
            # sent and is caught by the outer except, where we would then have
            # to rely on prompt engineering alone to obtain JSON.
            response_params["response_format"] = {"type": "json_object"}
        elif stop_sequence:  # only used when JSON mode is disabled
            response_params["stop"] = [stop_sequence]
        response = client.chat.completions.create(**response_params)
        if response.choices and response.choices[0].message and response.choices[0].message.content:
            return response.choices[0].message.content.strip()
        return None
    except Exception as e:  # pylint: disable=broad-except
        print(f"Error calling the LLM: {e}")
        return None

# --- 3. Parse the LLM's JSON response (revised) ---
```
```python
def parse_llm_json_output(llm_json_string: str) -> dict:
    """Parse the JSON string returned by the LLM."""
    if not llm_json_string:
        return {"error": "LLM did not return any content."}
    processed_string = llm_json_string.strip()
    try:
        # The LLM may still wrap the JSON in markdown fences.
        if processed_string.startswith("```json"):
            processed_string = processed_string[7:]
        if processed_string.endswith("```"):
            processed_string = processed_string[:-3]
        processed_string = processed_string.strip()
        data = json.loads(processed_string)
        if not isinstance(data, dict) or \
                not all(k in data for k in ["entities", "relationships", "content_keywords"]):
            print(f"Warning: the top-level keys of the LLM JSON output do not match expectations: {processed_string}")
            return {
                "error": "LLM JSON output structure mismatch for top-level keys.",
                # Return the parsed data if it parsed but mismatched
                "raw_output": data if isinstance(data, dict) else processed_string,
            }
        if not isinstance(data.get("entities"), list) or \
                not isinstance(data.get("relationships"), list) or \
                not isinstance(data.get("content_keywords"), list):
            print(f"Warning: entities, relationships or content_keywords in the LLM JSON output is not a list: {processed_string}")
            return {"error": "LLM JSON output type mismatch for arrays.", "raw_output": data}
        return data
    except json.JSONDecodeError as e:
        print(f"Error: the LLM output is not valid JSON: {e}")
        print(f"Raw output (before parsing): {llm_json_string}")
        print(f"String that was actually parsed: {processed_string}")
        return {"error": "Invalid JSON from LLM.", "details": str(e), "raw_output": llm_json_string}
    except Exception as e:  # pylint: disable=broad-except
        print(f"Unknown error while parsing the LLM JSON output: {e}")
        return {"error": "Unknown error parsing LLM JSON output.", "details": str(e), "raw_output": llm_json_string}

# --- 4. Main coordinating function (revised) ---
```
```python
def extract_and_complete_json_direct_output(
    json_template_string: str,
    document_text: str,
    task_fixed_params: dict,
    llm_api_config: dict,
) -> dict:
    """Coordinate the whole extraction process, expecting the LLM to output JSON directly."""
    try:
        template_data = json.loads(json_template_string)
    except json.JSONDecodeError as e:
        print(f"Error: failed to decode the JSON template: {e}")
        return {"error": "Invalid JSON template", "details": str(e)}

    llm_prompt = build_llm_prompt_for_json_output(template_data, document_text, task_fixed_params)
    # (Optional) print the generated prompt for debugging
    # print("--- Generated LLM JSON prompt ---")
    # print(llm_prompt)
    # print("--- End of prompt ---")

    use_json_mode_for_llm = task_fixed_params.get("use_json_mode_for_llm", True)

    llm_json_response_str = get_llm_response_json(
        api_key=llm_api_config["api_key"],
        user_prompt=llm_prompt,
        model=llm_api_config.get("model", "deepseek-chat"),
        base_url=llm_api_config.get("base_url", "https://api.deepseek.com/v1"),
        system_prompt=task_fixed_params.get(
            "system_prompt_llm",
            "你是一个专门的助手,负责从文本中提取信息并以JSON格式返回。",
        ),
        use_json_mode=use_json_mode_for_llm,
        stop_sequence=None,  # the completion_delimiter was removed, so no stop sequence is needed
    )

    # Build a fresh result dict (carrying over metadata from the template)
    # instead of mutating the template data itself.
    output_result = {
        "prompt_name": template_data.get("prompt_name", "unknown_extraction"),
        "goal_description_from_template": template_data.get("goal", {}).get("description"),
        # ... any other metadata from template_data you want to carry over
    }
    if not llm_json_response_str:
        output_result["extraction_results"] = {"error": "Failed to get a response from the LLM."}
        output_result["llm_raw_output_debug"] = None
        return output_result

    parsed_json_data = parse_llm_json_output(llm_json_response_str)
    output_result["extraction_results"] = parsed_json_data
    output_result["llm_raw_output_debug"] = llm_json_response_str
    return output_result

# --- Template with custom delimiters removed ---
```
```python
json_template_str_input_no_delimiters = """{
  "prompt_name": "entity_extraction_json",
  "goal": {
    "description": "Given a text document that is potentially relevant to this activity and a list of entity types [{entity_types}], identify all entities of those types from the text and all relationships among the identified entities. The output must be a single, valid JSON object.",
    "output_language_variable": "{language}"
  },
  "steps": [
    {
      "step_number": 1,
      "name": "Identify Entities",
      "description": "Identify all entities. For each identified entity, extract the information as specified in the Output JSON Structure section under 'entities'.",
      "extraction_details": [
        {"field_name": "entity_name", "description": "Name of the entity, use same language as input text. If English, capitalize the name."},
        {"field_name": "entity_type", "description": "One of the types from the provided list: [{entity_types}]"},
        {"field_name": "entity_description", "description": "Comprehensive description of the entity's attributes and activities based on the input text."}
      ]
    },
    {
      "step_number": 2,
      "name": "Identify Relationships",
      "description": "From the entities identified, identify all pairs of clearly related entities. For each pair, extract the information as specified in the Output JSON Structure section under 'relationships'.",
      "extraction_details": [
        {"field_name": "source_entity", "description": "Name of the source entity, as identified in the 'entities' list."},
        {"field_name": "target_entity", "description": "Name of the target entity, as identified in the 'entities' list."},
        {"field_name": "relationship_description", "description": "Explanation as to why the source entity and the target entity are related, based on the input text."},
        {"field_name": "relationship_strength", "description": "A numeric score indicating strength of the relationship (e.g., from 0.0 for weak to 1.0 for strong)."},
        {"field_name": "relationship_keywords", "description": "One or more high-level keywords summarizing the relationship, focusing on concepts or themes from the text."}
      ]
    },
    {
      "step_number": 3,
      "name": "Identify Content Keywords",
      "description": "Identify high-level keywords that summarize the main concepts, themes, or topics of the entire input text. This should be a JSON array of strings under the 'content_keywords' key in the output."
    }
  ],
  "examples_section": {"placeholder": "{examples}"},
  "real_data_section": {"entity_types_variable": "[{entity_types}]", "input_text_variable": "{input_text}"},
  "output_format_notes": {
    "final_output_structure": "A single, valid JSON object as described in the prompt.",
    "language_variable_for_output": "{language}"
  },
  "global_placeholders": ["{language}", "{entity_types}", "{examples}", "{input_text}"]
}"""

# --- Task parameters with custom delimiters removed ---
```
```python
task_configuration_params_json_output_no_delimiters = {
    "language": "简体中文",
    "entity_types_string": "人物, 地点, 日期, 理论, 奖项, 组织",  # used to fill [{entity_types}]
    "examples": {  # example given as a Python dict; converted to a JSON string in the prompt
        "entities": [
            {"entity_name": "阿尔伯特·爱因斯坦", "entity_type": "人物", "entity_description": "理论物理学家,创立了相对论,并因对光电效应的研究而闻名。"},
            {"entity_name": "狭义相对论", "entity_type": "理论", "entity_description": "由爱因斯坦在1905年提出的物理学理论,改变了对时间和空间的理解。"}
        ],
        "relationships": [
            {"source_entity": "阿尔伯特·爱因斯坦", "target_entity": "狭义相对论", "relationship_description": "阿尔伯特·爱因斯坦发表了狭义相对论。", "relationship_strength": 0.9, "relationship_keywords": ["发表", "创立"]}
        ],
        "content_keywords": ["物理学", "相对论", "爱因斯坦", "诺贝尔奖"]
    },
    "system_prompt_llm": "你是一个专门的助手,负责从文本中提取实体、关系和内容关键词,并且你的整个输出必须是一个单一的、有效的JSON对象。不要包含任何额外的文本或markdown标记。",
    "use_json_mode_for_llm": True,
}

# --- Example usage ---
```
```python
if __name__ == "__main__":
    # Use the template with custom delimiters removed
    json_template_to_use = json_template_str_input_no_delimiters
    document_to_analyze = (
        "爱因斯坦(Albert Einstein)于1879年3月14日出生在德国乌尔姆市一个犹太人家庭。"
        "他在1905年,即所谓的“奇迹年”,发表了四篇划时代的论文,其中包括狭义相对论的基础。"
        "后来,他因对理论物理的贡献,特别是发现了光电效应的定律,获得了1921年度的诺贝尔物理学奖。"
        "他的工作深刻影响了现代物理学,尤其是量子力学的发展。"
        "爱因斯坦在普林斯顿高等研究院度过了他的晚年,并于1955年4月18日逝世。"
    )
    # Use the task parameters with custom delimiters removed
    task_params_to_use = task_configuration_params_json_output_no_delimiters

    llm_config = {
        "api_key": "YOUR_DEEPSEEK_API_KEY",
        "base_url": "https://api.deepseek.com/v1",
        "model": "deepseek-chat",  # or another model that supports JSON mode, e.g. gpt-4o, gpt-3.5-turbo-0125
    }

    if "YOUR_DEEPSEEK_API_KEY" in llm_config["api_key"] or not llm_config["api_key"]:
        print("Error: set your real API key in llm_config to run this example.")
        print("You can obtain a key from DeepSeek (https://platform.deepseek.com/api_keys) or OpenAI.")
    else:
        result_data = extract_and_complete_json_direct_output(
            json_template_string=json_template_to_use,
            document_text=document_to_analyze,
            task_fixed_params=task_params_to_use,
            llm_api_config=llm_config,
        )
        print("\n--- Completed data (LLM outputs JSON directly, no custom delimiter configuration) ---")
        # Make sure extraction_results exists and is not an error payload
        extraction_results = result_data.get("extraction_results", {})
        if isinstance(extraction_results, dict) and "error" in extraction_results:
            print(f"An error occurred: {extraction_results['error']}")
            if "details" in extraction_results:
                print(f"Details: {extraction_results['details']}")
            if "raw_output" in extraction_results:  # raw_output may be present on parsing errors
                print(f"Raw (or partial) LLM output from the parsing error: {extraction_results['raw_output']}")
            elif "llm_raw_output_debug" in result_data:  # or one level up if get_llm_response_json failed
                print(f"Raw LLM response from the call: {result_data['llm_raw_output_debug']}")
        else:
            # Only print the full result when there was no error
            print(json.dumps(result_data, indent=2, ensure_ascii=False))
        # Even on success, print the raw LLM output for debugging
        # (unless it was already printed above as part of an error)
        if not (isinstance(extraction_results, dict) and "error" in extraction_results and "raw_output" in extraction_results):
            if "llm_raw_output_debug" in result_data and result_data["llm_raw_output_debug"]:
                print("\n--- Raw LLM response (debug) ---")
                print(result_data["llm_raw_output_debug"])
```
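A full run requires an API key, but the parsing path can be exercised offline. This standalone snippet restates the fence-stripping logic of `parse_llm_json_output` and feeds it a hypothetical markdown-wrapped response, showing why that defensive step matters:

```python
import json

def strip_markdown_fences(text: str) -> str:
    """Remove an optional markdown ```json wrapper, mirroring parse_llm_json_output."""
    s = text.strip()
    if s.startswith("```json"):
        s = s[7:]
    if s.endswith("```"):
        s = s[:-3]
    return s.strip()

# Hypothetical LLM response that wraps the JSON in markdown fences despite instructions.
wrapped = (
    "```json\n"
    '{"entities": [{"entity_name": "Einstein", "entity_type": "person", '
    '"entity_description": "Theoretical physicist"}], '
    '"relationships": [], "content_keywords": ["physics"]}\n'
    "```"
)

data = json.loads(strip_markdown_fences(wrapped))
assert all(k in data for k in ("entities", "relationships", "content_keywords"))
print(data["entities"][0]["entity_name"])  # Einstein
```

Feeding `wrapped` directly to `json.loads` would raise a `JSONDecodeError`; with the stripping step, the same response parses cleanly.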
Advantages of the improved approach (OpenAI Structured Outputs / JSON mode):
- **Standardized output:** the LLM is asked to emit a JSON object directly. JSON is a widely accepted, structured data-interchange format.
- **Built-in LLM support:** many modern LLMs (e.g. OpenAI's `gpt-3.5-turbo-0125` and later, `gpt-4o`, and the DeepSeek model used in the example) offer a mode that forces JSON output, which greatly improves the odds that the LLM produces the expected format.
- **Simpler parsing:** a standard JSON parser (e.g. Python's `json.loads()`) suffices; no custom parsing logic is needed.
- **Better robustness:** even when the LLM occasionally adds a little text around the JSON (such as markdown code fences), `parse_llm_json_output` handles it. And because the LLM is explicitly instructed to output JSON, it is more inclined to generate valid JSON in the first place.
- **Clear schema definition:** `build_llm_prompt_for_json_output` spells out the expected JSON structure in the prompt (the top-level keys, the entity array, the relationship array, and so on), giving the LLM very clear guidance.
- **Better maintainability and extensibility:** changing the output structure usually only requires updating the descriptions and examples in the JSON template, with few code-level changes.
- **Structured templates and parameters:** `json_template_str_input_no_delimiters` and `task_configuration_params_json_output_no_delimiters` make the prompt engineering itself more structured and easier to manage.
A few specific notes on the code:
- `_fill_placeholders_for_llm_prompt`: flexibly handles the different placeholder formats, in particular the `entity_types` list.
- `build_llm_prompt_for_json_output`: combines the structured template, task parameters, and dynamic text into a thorough, clear prompt aimed at obtaining JSON output. The explicit JSON structure description is crucial for the LLM.
- `get_llm_response_json`: uses the OpenAI client's `response_format={"type": "json_object"}` parameter, warns when the parameter is unsupported, and retains a `stop`-sequence fallback (although the final call passes `stop_sequence=None`, since JSON mode normally does not need one).
- `parse_llm_json_output`: checks for and handles common problems (markdown wrapping, missing top-level keys, expected arrays that are not arrays), which is very practical.
- `extract_and_complete_json_direct_output`: coordinates the whole pipeline cleanly.
- Examples: the sample input text, JSON template, and task parameters are all clear and make the code easy to understand and test.
- Removing the custom delimiters: this is the right direction; `completion_delimiter` and friends are no longer needed in JSON mode, because the LLM is expected to return one complete JSON object.
Summary:

LightRAG is described as a configuration-and-management class because it aims to provide a flexible framework: users build and customize complex RAG systems by defining components (LLM callers, parsers, data loaders) and configuring their parameters.

The Python code and JSON template based on OpenAI Structured Outputs illustrate this philosophy for one specific core component, entity-relationship extraction, through:
- Configuration (the JSON template and task parameters) to define the task details.
- Management (the Python functions that coordinate prompt building, LLM calls, and result parsing) of the whole pipeline.
- Integration (the interaction with the LLM API) of external services.

This implementation leverages modern LLMs' JSON output capability and, compared with the original custom-delimiter prompt, significantly improves the reliability and parseability of the output and the robustness of the overall scheme. It is solid engineering practice.