常见字符串相似度算法详解
目录
引言
一、Levenshtein距离(编辑距离)
1.1 算法原理
1.2 Java实现
1.3 springboot中实现
二、Jaro-Winkler相似度
2.1 算法特点
2.2 Java实现
三、余弦相似度(向量空间模型)
3.1 实现步骤
3.2 Java实现
3.3 简化版
四、Jaccard系数
4.1 算法原理
4.2 Java实现
4.3 简化版
五、N-gram相似度
5.1 实现示例(Bigram)
六、算法对比与选型建议
七、性能优化技巧
八、完整测试示例
结语
引言
在文本处理、搜索引擎、数据清洗等场景中,字符串相似度计算是核心技术之一。本文将深入讲解5种经典算法,并提供可直接运行的Java代码实现。
一、Levenshtein距离(编辑距离)
1.1 算法原理
计算将字符串A转换为字符串B所需的最少单字符编辑(插入、删除、替换)次数
动态规划公式:
dp[i][j] = min(dp[i-1][j]+1, dp[i][j-1]+1, dp[i-1][j-1]+cost)
1.2 Java实现
public static int levenshtein(String a, String b) {int[][] dp = new int[a.length()+1][b.length()+1];for (int i=0; i<=a.length(); i++) dp[i][0] = i;for (int j=0; j<=b.length(); j++) dp[0][j] = j;for (int i=1; i<=a.length(); i++) {for (int j=1; j<=b.length(); j++) {int cost = (a.charAt(i-1) == b.charAt(j-1)) ? 0 : 1;dp[i][j] = Math.min(Math.min(dp[i-1][j] + 1, // 删除dp[i][j-1] + 1), // 插入dp[i-1][j-1] + cost // 替换);}}return dp[a.length()][b.length()];
}
相似度计算:
public static double levenshteinSimilarity(String a, String b) {int maxLen = Math.max(a.length(), b.length());return 1.0 - (double)levenshtein(a, b) / maxLen;
}
1.3 springboot中实现
<!-- pom.xml 中添加依赖 -->
<dependency><groupId>org.apache.commons</groupId><artifactId>commons-text</artifactId><version>1.9</version>
</dependency>
public static double levenshteinSimilarity(String str1, String str2) {LevenshteinDistance distance = new LevenshteinDistance();int maxLength = Math.max(str1.length(), str2.length());if (maxLength == 0) return 1.0;int editDistance = distance.apply(str1, str2);return 1 - (double) editDistance / maxLength;}
二、Jaro-Winkler相似度
2.1 算法特点
对前缀相同的字符串更友好(如人名匹配)
适合短字符串比较
时间复杂度:O(n)
2.2 Java实现
public static double jaroWinkler(String s1, String s2) {int maxLen = Math.max(s1.length(), s2.length());if (maxLen == 0) return 1.0;// 匹配窗口大小int matchWindow = Math.max(maxLen/2 - 1, 0);// 匹配字符统计boolean[] s1Matches = new boolean[s1.length()];boolean[] s2Matches = new boolean[s2.length()];int matches = 0, transpositions = 0;// 统计匹配字符for (int i=0; i<s1.length(); i++) {int start = Math.max(0, i-matchWindow);int end = Math.min(i+matchWindow+1, s2.length());for (int j=start; j<end; j++) {if (!s2Matches[j] && s1.charAt(i) == s2.charAt(j)) {s1Matches[i] = true;s2Matches[j] = true;matches++;break;}}}if (matches == 0) return 0.0;// 计算换位次数int k = 0;for (int i=0; i<s1.length(); i++) {if (!s1Matches[i]) continue;while (!s2Matches[k]) k++;if (s1.charAt(i) != s2.charAt(k)) transpositions++;k++;}double jaro = ((matches/(double)s1.length()) + (matches/(double)s2.length()) + ((matches - transpositions/2.0)/matches)) / 3.0;// Winkler调整前缀权重int prefixLen = 0;for (int i=0; i<Math.min(4, Math.min(s1.length(), s2.length())); i++) {if (s1.charAt(i) == s2.charAt(i)) prefixLen++;else break;}return jaro + prefixLen * 0.1 * (1 - jaro);
}
三、余弦相似度(向量空间模型)
3.1 实现步骤
将字符串转换为词频向量
计算向量夹角余弦值
3.2 Java实现
public static double cosineSimilarity(String a, String b) {Map<String, Integer> vec1 = getTermFrequency(a);Map<String, Integer> vec2 = getTermFrequency(b);// 计算点积double dotProduct = 0;for (String term : vec1.keySet()) {if (vec2.containsKey(term)) {dotProduct += vec1.get(term) * vec2.get(term);}}// 计算模长double normA = vec1.values().stream().mapToDouble(v->v*v).sum();double normB = vec2.values().stream().mapToDouble(v->v*v).sum();return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
}// 词频统计(按单词分割)
private static Map<String, Integer> getTermFrequency(String text) {return Arrays.stream(text.split("\\s+")).collect(Collectors.toMap(word -> word, word -> 1, Integer::sum));
}
3.3 简化版
// 余弦相似度(简化版)public static double cosineSimilarity(String str1, String str2) {Map<Character, int[]> vector = new HashMap<>();for (char c : (str1 + str2).toCharArray()) {if (!vector.containsKey(c)) vector.put(c, new int[2]);}for (char c : str1.toCharArray()) vector.get(c)[0]++;for (char c : str2.toCharArray()) vector.get(c)[1]++;double dotProduct = 0, normA = 0, normB = 0;for (int[] counts : vector.values()) {dotProduct += counts[0] * counts[1];normA += Math.pow(counts[0], 2);normB += Math.pow(counts[1], 2);}return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));}
四、Jaccard系数
4.1 算法原理
计算两个集合的交集与并集的比例
4.2 Java实现
public static double jaccardSimilarity(String a, String b) {Set<String> setA = new HashSet<>(Arrays.asList(a.split("")));Set<String> setB = new HashSet<>(Arrays.asList(b.split("")));int intersection = 0;for (String s : setA) {if (setB.contains(s)) intersection++;}int union = setA.size() + setB.size() - intersection;return (double) intersection / union;
}
4.3 简化版
public static double jaccardSimilarity(String str1, String str2) {Set<Character> set1 = new HashSet<>();Set<Character> set2 = new HashSet<>();for (char c : str1.toCharArray()) set1.add(c);for (char c : str2.toCharArray()) set2.add(c);Set<Character> intersection = new HashSet<>(set1);intersection.retainAll(set2);Set<Character> union = new HashSet<>(set1);union.addAll(set2);return union.isEmpty() ? 1.0 : (double) intersection.size() / union.size();}
五、N-gram相似度
5.1 实现示例(Bigram)
public static double nGramSimilarity(String a, String b, int n) {List<String> gramsA = generateNGrams(a, n);List<String> gramsB = generateNGrams(b, n);Set<String> union = new HashSet<>(gramsA);union.addAll(gramsB);int intersection = 0;for (String gram : gramsA) {if (gramsB.contains(gram)) intersection++;}return (double) intersection / union.size();
}// 生成N-gram
private static List<String> generateNGrams(String text, int n) {List<String> grams = new ArrayList<>();for (int i=0; i<=text.length()-n; i++) {grams.add(text.substring(i, i+n));}return grams;
}
六、算法对比与选型建议
算法 | 适用场景 | 时间复杂度 | 特点 |
---|---|---|---|
Levenshtein | 拼写纠错、DNA分析 | O(n²) | 精确但较慢 |
Jaro-Winkler | 人名匹配、短文本 | O(n) | 前缀权重优化 |
余弦相似度 | 长文档、搜索引擎 | O(n) | 基于词频统计 |
Jaccard | 集合相似性、去重 | O(n) | 简单快速 |
N-gram | 模糊搜索、推荐系统 | O(n) | 可调节粒度 |
七、性能优化技巧
-
预处理优化
-
统一转换为小写
-
去除停用词和标点符号
-
-
缓存机制
-
对频繁计算的字符串缓存结果
-
-
近似算法
-
使用SimHash等快速近似算法
-
-
并行计算
List<Double> results = strings.parallelStream().map(s -> cosineSimilarity(baseStr, s)).collect(Collectors.toList());
八、完整测试示例
public static void main(String[] args) {String s1 = "apple";String s2 = "appel";System.out.println("Levenshtein相似度: " + levenshteinSimilarity(s1, s2));System.out.println("Jaro-Winkler: " + jaroWinkler(s1, s2));System.out.println("Bigram相似度: " + nGramSimilarity(s1, s2, 2));
}
输出结果:
Levenshtein相似度: 0.6
Jaro-Winkler: 0.9066666666666667
Bigram相似度: 0.6666666666666666
结语
根据实际需求选择合适的算法:
精确匹配优先Levenshtein
人名匹配首选Jaro-Winkler
长文本使用余弦相似度
快速去重推荐Jaccard系数