
ID Card OCR: Recognizing and Extracting Information

java 2025/7/18 16:51:11

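ID card OCR in Python can be built with either open-source libraries or cloud API services. The two mainstream approaches are outlined below.

Option 1: Use a third-party OCR API (Baidu or Tencent Cloud recommended)

Baidu OCR API example

Register and obtain an API key: Baidu AI Open Platform
Install the SDK: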
```
pip install baidu-aip
```

Implementation:

from aip import AipOcr

APP_ID = 'your AppID'
API_KEY = 'your API Key'
SECRET_KEY = 'your Secret Key'

client = AipOcr(APP_ID, API_KEY, SECRET_KEY)

def baidu_id_card_ocr(image_path):
    with open(image_path, "rb") as f:
        image = f.read()
    # Call the ID card recognition endpoint (side is "front" or "back")
    result = client.idcard(image, "front")
    # Return the structured data if recognition succeeded
    if "words_result" in result:
        return result["words_result"]
    return None

# Example output:
# {'姓名': {'words': '张三'}, '性别': {'words': '男'}, ...}
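The `words_result` value nests each field under a `words` key. A minimal sketch of flattening it into a plain dictionary follows. It assumes the response shape shown in the example output above, and `flatten_words_result` is a hypothetical helper name, not part of the Baidu SDK.

```python
def flatten_words_result(words_result):
    """Map each field name to its recognized text, skipping empty fields."""
    flat = {}
    for field, payload in (words_result or {}).items():
        # Assumed shape: {"姓名": {"words": "张三"}, ...}
        text = payload.get("words", "") if isinstance(payload, dict) else ""
        if text:
            flat[field] = text
    return flat


sample = {"姓名": {"words": "张三"}, "性别": {"words": "男"}, "民族": {"words": ""}}
print(flatten_words_result(sample))
```

Fields the service could not read come back empty in this assumed shape, so dropping them keeps later validation code simpler.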

Option 2: Use an open-source OCR library (PaddleOCR recommended)

Install dependencies

pip install paddlepaddle paddleocr opencv-python

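Example code

from paddleocr import PaddleOCR
import cv2

# Initialize the OCR model (lightweight Chinese/English model)
ocr = PaddleOCR(use_angle_cls=True, lang='ch')  # lang='ch' selects Chinese

def id_card_ocr(image_path):
    # Read the image
    img = cv2.imread(image_path)
    # Run OCR (PaddleOCR 2.x call style)
    result = ocr.ocr(img, cls=True)
    # Collect all recognized text; result[0] can be None when nothing is detected
    texts = [line[1][0] for line in (result[0] or [])]
    # Extract key fields with simple rule matching
    info = {}
    for text in texts:
        if "姓名" in text:
            info["name"] = text.split("姓名")[-1].strip()
        elif "性别" in text:
            info["gender"] = text.split("性别")[-1].strip()[:1]  # first character only
        elif "公民身份号码" in text:
            # Extract the 18-character ID number (the last character may be X)
            id_num = ''.join(filter(str.isalnum, text))
            info["id_number"] = id_num[-18:].upper()
        # Add more fields (birth date, address, etc.) ...
    return info

# Test the recognition
result = id_card_ocr("id_card.jpg")
print(result)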

Optimization directions

Image preprocessing (improves accuracy):

# Grayscale + binarization
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
_, binary = cv2.threshold(gray, 180, 255, cv2.THRESH_BINARY_INV)
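The fixed threshold of 180 may not suit photos taken under different lighting. One common alternative is Otsu's method, which picks the threshold from the image histogram. The pure-Python sketch below shows the idea on a flat list of grayscale values; `otsu_threshold` is an illustrative name and the pixel values are synthetic. In OpenCV the same selection is available through the `cv2.THRESH_OTSU` flag.

```python
def otsu_threshold(pixels):
    """Return the gray level (0-255) that maximizes between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels at or below the candidate threshold
    sum0 = 0  # sum of gray levels at or below the candidate threshold
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        between_var = w0 * w1 * (mu0 - mu1) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


# Synthetic example: dark text pixels on a light background
pixels = [40] * 300 + [45] * 100 + [205] * 200 + [210] * 400
t = otsu_threshold(pixels)
# Inverted binarization, analogous to THRESH_BINARY_INV: dark text becomes white
binary = [255 if p <= t else 0 for p in pixels]
```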

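Precise field matching: use regular expressions to extract the birth date, address, and similar fields.
Text region localization: locate the key regions of the ID card with contour detection or deep learning.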
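As a sketch of the regular-expression approach to field matching, the snippet below pulls the birth date and address out of concatenated OCR text. It assumes the text keeps the printed labels (`出生`, `住址`, `公民身份号码`) in card order; the function name and the sample string are made up for illustration.

```python
import re

BIRTH_RE = re.compile(r"出生\s*(\d{4})\s*年\s*(\d{1,2})\s*月\s*(\d{1,2})\s*日")
ADDRESS_RE = re.compile(r"住址\s*(.+?)\s*(?=公民身份号码|$)", re.DOTALL)


def extract_birth_and_address(text):
    """Return (birth_date, address); either is None if not found."""
    birth = None
    m = BIRTH_RE.search(text)
    if m:
        year, month, day = (int(g) for g in m.groups())
        birth = f"{year:04d}-{month:02d}-{day:02d}"

    address = None
    m = ADDRESS_RE.search(text)
    if m:
        # OCR often splits a long address over several lines; join them
        address = re.sub(r"\s+", "", m.group(1))
    return birth, address


sample = (
    "姓名张三性别男民族汉出生1949年12月31日"
    "住址北京市东城区\n示例街道1号公民身份号码11010519491231002X"
)
print(extract_birth_and_address(sample))
```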

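Handling key issues

Improving accuracy:
- Preprocess the image (denoising, contrast enhancement, perspective correction)
- Use high-resolution images (at least 1024 px recommended)
- Train a dedicated OCR model for ID cards (requires annotated data)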
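For the resolution guideline, a small helper can compute a target size before calling a resize function such as `cv2.resize`. This sketch assumes the 1024 px figure refers to the longer side of the image, which the text does not specify; `min_long_side_size` is a hypothetical name.

```python
def min_long_side_size(width, height, min_long_side=1024):
    """Return (new_width, new_height) so the longer side is at least min_long_side.

    The aspect ratio is preserved; images that are already large enough
    are returned unchanged.
    """
    long_side = max(width, height)
    if long_side >= min_long_side:
        return width, height
    scale = min_long_side / long_side
    return round(width * scale), round(height * scale)


print(min_long_side_size(512, 320))    # small photo gets scaled up
print(min_long_side_size(2000, 1260))  # large photo is left alone
```

Upscaling does not add detail, so capturing the photo at a higher resolution is preferable when that is possible.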

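Field structuring:

# Extract the ID number with a regex:
# 6-digit region code, 8-digit birth date, 3-digit sequence, 1 check character
import re
id_text = "公民身份号码 12345620000101123X"
match = re.search(r'[1-9]\d{5}(?:19|20)\d{9}[\dXx]', id_text)
id_num = match.group() if match else None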
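A regex match only confirms the shape of the number. The 18th character is a check character computed from the first 17 digits (ISO 7064 MOD 11-2, as used by GB 11643-1999), so many OCR misreads can be caught by recomputing it. A minimal sketch, with `is_valid_id_number` as an illustrative helper name:

```python
WEIGHTS = (7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2)
CHECK_CHARS = "10X98765432"


def is_valid_id_number(id_number):
    """Return True if the 18-character ID number has a consistent check character."""
    id_number = id_number.strip().upper()
    body = id_number[:17]
    if len(id_number) != 18 or not (body.isascii() and body.isdigit()):
        return False
    # Weighted sum of the first 17 digits, reduced modulo 11
    total = sum(int(d) * w for d, w in zip(body, WEIGHTS))
    return CHECK_CHARS[total % 11] == id_number[17]


print(is_valid_id_number("11010519491231002X"))  # a widely circulated sample number
print(is_valid_id_number("12345620000101123X"))  # the placeholder from the regex example
```

The placeholder number from the regex example does not pass this check, which is expected for made-up digits.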

Complete field list:
- Front: name, gender, ethnicity, date of birth, address, citizen ID number
- Back: issuing authority, validity period
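Some front-side fields are redundant with the ID number: characters 7 to 14 encode the birth date, and the 17th digit is odd for male and even for female. That allows a cross-check of the OCR output. A sketch, assuming an 18-character number has already been extracted; `fields_from_id_number` is a hypothetical helper.

```python
from datetime import date


def fields_from_id_number(id_number):
    """Derive birth date and gender from an 18-character ID number."""
    body = id_number[:17]
    if len(id_number) != 18 or not (body.isascii() and body.isdigit()):
        raise ValueError("expected 17 digits followed by a check character")
    # date() raises ValueError for impossible dates such as month 13
    birth = date(int(id_number[6:10]), int(id_number[10:12]), int(id_number[12:14]))
    gender = "男" if int(id_number[16]) % 2 == 1 else "女"
    return {"birth": birth.isoformat(), "gender": gender}


print(fields_from_id_number("11010519491231002X"))
```

If the derived values disagree with the OCR'd birth date or gender, at least one of the fields was probably misread.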

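| Tool | Pros | Cons |
| --- | --- | --- |
| PaddleOCR | Free, works offline, customizable | Needs parameter tuning |
| Baidu OCR | High accuracy, works out of the box | Paid (limited free quota) |
| Tesseract | Open source, general purpose | Lower accuracy on Chinese ID cards |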

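Advanced suggestions

Key-region cropping: locate the ID card region with an object detection model (such as YOLO) first, then run OCR.
Custom training: train your own ID card model with PP-OCR (requires an annotated dataset).
Liveness detection: combine with face recognition to verify the authenticity of the ID card.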
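For the key-region cropping suggestion, a detector typically returns a bounding box for the card, and padding that box slightly before cropping helps avoid cutting off text near the edges. The sketch below only does the box arithmetic. The `(x1, y1, x2, y2)` pixel format and the 5% margin are assumptions, and no particular detector API is implied.

```python
def padded_crop_box(box, image_width, image_height, margin_ratio=0.05):
    """Expand (x1, y1, x2, y2) by margin_ratio per side, clamped to the image."""
    x1, y1, x2, y2 = box
    pad_x = round((x2 - x1) * margin_ratio)
    pad_y = round((y2 - y1) * margin_ratio)
    return (
        max(0, x1 - pad_x),
        max(0, y1 - pad_y),
        min(image_width, x2 + pad_x),
        min(image_height, y2 + pad_y),
    )


x1, y1, x2, y2 = padded_crop_box((100, 200, 900, 700), 1000, 720)
# With an OpenCV/NumPy image this would be: card = img[y1:y2, x1:x2]
print((x1, y1, x2, y2))
```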

With the approaches above, ID card information can be recognized automatically and extracted into structured fields.

Python 3.11.0 and PaddleOCR 3.0.0 in practice

Official documentation

Example code:

import json
import os
import re

import cv2
from paddleocr import PaddleOCR


class IDCardOCR:
    def __init__(self):
        # Initialize the OCR pipeline with document orientation classification,
        # document unwarping, and text-line orientation classification disabled
        self.ocr = PaddleOCR(
            use_doc_orientation_classify=False,
            use_doc_unwarping=False,
            use_textline_orientation=False,
        )
        # Regex templates for the fields on the front of the card
        self.templates = {
            'name': r'姓名(?:：|:|)(.+?)(?=性别)',
            'gender': r'性别(?:：|:|)(.+?)(?=民族)',
            'ethnicity': r'民族(?:：|:|)(.+?)(?=出生)',
            'birth_date': r'出生(?:：|:|)(.+?)(?=住址)',
            'address': r'住址(?:：|:|)(.+?)(?=公民身份号码)',
            'id_number': r'公民身份号码(?:：|:|)(.+)',
        }
        # Pattern for reformatting a birth date recognized as YYYYMMDD
        self.birth_date_pattern = r'(\d{4})(\d{2})(\d{2})'

    def recognize_id_card(self, image_path):
        # Read the image to confirm it can be decoded
        img = cv2.imread(image_path)
        if img is None:
            return {"error": f"Cannot read image: {image_path}"}
        # Run text recognition with PaddleOCR
        result = self.ocr.predict(input=image_path)
        # Concatenate all recognized text lines
        all_text = "".join(result[0]['rec_texts'])
        # Extract the ID card fields with the regex templates
        return self._extract_info_from_text(all_text)

    def _extract_info_from_text(self, text):
        info = {}
        # Remove characters that interfere with matching
        cleaned_text = text.replace(' ', '').replace('"', '')
        # Extract each field
        for field, pattern in self.templates.items():
            match = re.search(pattern, cleaned_text, re.DOTALL)
            if match:
                value = match.group(1).strip()
                # Normalize the birth date format
                if field == 'birth_date':
                    value = re.sub(self.birth_date_pattern, r'\1年\2月\3日', value)
                info[field] = value
        return info

    def visualize_results(self, result, output_dir='./output'):
        # `result` is the list returned by self.ocr.predict();
        # PaddleOCR 3.x result objects can save their own visualization
        for res in result:
            res.save_to_img(output_dir)
        print(f"Visualization saved to: {output_dir}")

    def save_to_json(self, info, output_path='./id_card_info.json'):
        with open(output_path, 'w', encoding='utf-8') as f:
            json.dump(info, f, ensure_ascii=False, indent=4)
        print(f"Recognition result saved to: {output_path}")


if __name__ == "__main__":
    # Usage example
    ocr = IDCardOCR()
    # Replace with the path to your ID card image
    image_path = 'C:\\Users\\FanZhou\\PycharmProjects\\test\\resources\\test_002.png'
    # Check that the image file exists
    if not os.path.exists(image_path):
        print(f"Error: image file does not exist: {image_path}")
    else:
        # Recognize the ID card information
        result_info = ocr.recognize_id_card(image_path)
        # Print the recognition result
        print("\nID card recognition result:")
        for key, value in result_info.items():
            print(f"{key}: {value}")
        # Save the result to a JSON file
        ocr.save_to_json(result_info)
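The templates in the class cover only the front of the card. The back side can be handled the same way. The sketch below assumes the concatenated OCR text contains the labels `签发机关` and `有效期限` in that order, and that the validity period is printed as `YYYY.MM.DD-YYYY.MM.DD` or ends with `长期` for long-term cards. The names `BACK_TEMPLATES` and `extract_back_side` and the sample string are made up for illustration.

```python
import re

BACK_TEMPLATES = {
    "issuing_authority": r"签发机关(?:：|:)?(.+?)(?=有效期限)",
    "valid_period": r"有效期限(?:：|:)?(\d{4}\.\d{2}\.\d{2})[-－](\d{4}\.\d{2}\.\d{2}|长期)",
}


def extract_back_side(text):
    """Extract issuing authority and validity period from back-side OCR text."""
    cleaned = re.sub(r"\s+", "", text)
    info = {}
    m = re.search(BACK_TEMPLATES["issuing_authority"], cleaned)
    if m:
        info["issuing_authority"] = m.group(1)
    m = re.search(BACK_TEMPLATES["valid_period"], cleaned)
    if m:
        info["valid_from"], info["valid_until"] = m.group(1), m.group(2)
    return info


sample = "中华人民共和国 居民身份证 签发机关 某某市公安局 有效期限 2015.01.01-2035.01.01"
print(extract_back_side(sample))
```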