
Using jieba word segmentation and set intersection to pick out identical ID cards

java 2025/8/23 5:06:52

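Out of curiosity, and because I had some material at hand, I wrote a program that extracts the ID card photos from a set of PDF files totalling several hundred GB, then extracts the text from those photos (the information printed on the front of each card) and stores it in a SQL Server 2017 database. The table has about 50,000 rows. For convenience, all the text from one ID card is converted to a single string and written into one row.

This creates a problem: the files contain many duplicate ID cards, so the 50,000 rows may cover only 10,000 to 20,000 distinct cards. How can the identical cards be picked out? I thought of jieba word segmentation, which led to the code below.

Each row in the sfz table has an id, the recognized text (sfz), and an sfzid column. The sfzid column exists for the loop that searches the table for identical ID cards: it is filled with the id of the first row found to hold the same card.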

# -*- coding: utf-8 -*-
"""
Created on Mon Apr 28 14:59:39 2025

@author: Yang
"""
import pymssql
import jieba

if __name__ == '__main__':
    # Connect to the SQL Server database
    conn = pymssql.connect(server='127.0.0.1', user='sa', password='1',
                           database='gr', charset='utf8')
    cursor = conn.cursor()
    # Fetch every row that has not yet been assigned to a group
    cursor.execute("select id,sfz from sfz where sfzid is null")
    rows = cursor.fetchall()
    # print(len(rows))  # total number of rows
    for row in rows:
        sfzid = row[0]
        sfzxx = row[1]
        # Skip this row if an earlier iteration already marked it as a duplicate
        cursor.execute(
            "select count(id) from sfz where id = %s and sfzid is not null",
            (sfzid,))
        row0 = cursor.fetchone()
        print(row0[0], sfzid)
        if row0[0] == 0:
            sfzjb = set(jieba.lcut(sfzxx or ""))
            if not sfzjb:
                continue  # empty text: nothing to compare, avoids division by zero
            cursor.execute(
                "select id,sfz from sfz where id <> %s and sfzid is null",
                (sfzid,))
            rows1 = cursor.fetchall()
            for row1 in rows1:
                sfzid1 = row1[0]
                sfzxx1 = row1[1]
                sfzjb1 = set(jieba.lcut(sfzxx1 or ""))
                if not sfzjb1:
                    continue
                jt = sfzjb1 & sfzjb  # tokens shared by both cards
                # Take the larger of the two ratios, i.e. the share of the
                # smaller token set that also appears in the other one
                jjl = max(len(jt) / len(sfzjb), len(jt) / len(sfzjb1))
                if jjl > 0.9:
                    print(jjl, sfzid, sfzid1)
                    print(sfzxx, sfzxx1)
                    sql = "UPDATE sfz SET sfzid = %s WHERE id = %s"
                    cursor.execute(sql, (sfzid, sfzid1))
                    conn.commit()
    cursor.close()
    conn.close()

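The principle: the text comes from OCR, so two scans of the same card may differ slightly. Segmenting each string with jieba and intersecting the resulting token sets finds records that share a high proportion of their text, which is enough to judge whether two ID cards are the same.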
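The ratio described above can be isolated into a small function. This is only a minimal sketch: the name `overlap_ratio` and the sample strings are made up for illustration, and the default tokenizer splits the text into single characters so the snippet runs with the standard library alone. That is an assumption, not what the script does; pass `tokenize=jieba.lcut` to get the script's segmentation.

```python
from typing import Callable, Iterable


def overlap_ratio(text_a: str, text_b: str,
                  tokenize: Callable[[str], Iterable[str]] = list) -> float:
    """Shared tokens divided by the size of the smaller token set.

    `tokenize` defaults to `list` (one token per character), which is an
    assumption made to keep this sketch dependency-free.
    """
    tokens_a = set(tokenize(text_a))
    tokens_b = set(tokenize(text_b))
    smaller = min(len(tokens_a), len(tokens_b))
    if smaller == 0:
        return 0.0  # an empty string cannot be matched to anything
    return len(tokens_a & tokens_b) / smaller


# Hypothetical OCR output: the second string has one misread character.
clean = "姓名张三性别男民族汉"
noisy = "姓名张三性别男民旅汉"
print(overlap_ratio(clean, noisy))
```

Dividing by the smaller set gives the same value as taking the larger of the two ratios in the script. With character tokens the fixed field labels on every card inflate the ratio, so the 0.9 threshold would probably need retuning if jieba is not used.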

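In the for loop, checking whether the current row already has an sfzid avoids comparing cards that have already been matched, which improves efficiency.

Even so, the program seems to take a very long time to run. For every unmatched row it issues a new query and re-segments all the remaining rows, so the number of jieba calls grows roughly with the square of the row count.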
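The same skip logic can also be applied in memory, which allows each row to be tokenized only once. The sketch below is one possible variant, not the script above: `group_duplicates`, its parameters, and the sample rows are hypothetical, and the default character-level tokenizer is again an assumption that stands in for `jieba.lcut`.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def group_duplicates(rows: List[Tuple[int, str]],
                     threshold: float = 0.9,
                     tokenize: Callable[[str], Iterable[str]] = list
                     ) -> Dict[int, int]:
    """Map each duplicate row id to the id of the first matching row."""
    # Tokenize every row exactly once instead of once per comparison.
    token_sets = [(row_id, set(tokenize(text))) for row_id, text in rows]
    assigned: Dict[int, int] = {}
    for i, (id_a, set_a) in enumerate(token_sets):
        if id_a in assigned or not set_a:
            continue  # already matched to an earlier row, or empty text
        for id_b, set_b in token_sets[i + 1:]:
            if id_b in assigned or not set_b:
                continue
            ratio = len(set_a & set_b) / min(len(set_a), len(set_b))
            if ratio > threshold:
                assigned[id_b] = id_a
    return assigned


# Hypothetical (id, text) rows; rows 2 and 4 repeat the card in row 1.
rows = [
    (1, "姓名张三性别男民族汉住址某市某区"),
    (2, "姓名张三性别男民族汉住址某市某区"),
    (3, "姓名李四性别女民族回住址某县某镇"),
    (4, "姓名张三性别男民族汉住址某市菜区"),
]
print(group_duplicates(rows))
```

Because the ratio is symmetric, each row only needs to be compared with the rows after it. The returned mapping could then be written back with the same `UPDATE sfz SET sfzid = %s WHERE id = %s` statement, ideally in one batch rather than one commit per match.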
