Getting Started with the Open-Source Knowledge Graph Library AmpliGraph (Part 2)
The previous post, [Getting Started with the Open-Source Knowledge Graph Library pykeen (Part 1)], covered pykeen. Today I tried AmpliGraph, and it feels much better than pykeen; I'd definitely recommend it.
This project has been around for quite a while, and another user already wrote a fairly detailed post on it: [Ampligraph: a TensorFlow-based Python library for knowledge graph embeddings and link prediction]. That post's code is dated, though: it predates the library's migration from TF 1.x to TF 2.x, and the current release has changed in several places, so here I'll walk through the library again in a bit more detail.
AmpliGraph documentation
GitHub repository
Installation:
```
pip install "tensorflow==2.9.0"
pip install ampligraph
```
Table of Contents
- 1 Differences Between TransE, DistMult, ComplEx, HolE, and RotatE
- 1.1 TransE
- 1.2 DistMult
- 1.3 ComplEx
- 1.4 HolE
- 1.5 RotatE
- 2 Official Code Examples
- 2.1 Training and Evaluating a ComplEx Embedding Model
- 2.2 The Core ScoringBasedEmbeddingModel Class
- 2.3 TransE Example Code
- 2.4 Retrieving Entity Embeddings Directly
- 3 My Own Example
- 3.1 Training a RotatE Model
- 3.2 Looking Up Entity IDs with model.get_indexes
- 3.3 Triple Prediction
- 3.4 Entity and Relation Prediction with query_topn
- 3.5 Saving and Restoring Models
- 3.6 Entity Clustering + a 2D PCA Plot
1 Differences Between TransE, DistMult, ComplEx, HolE, and RotatE
Model | Scoring Function | Computation Principle | Score Direction | Meaning of the Sign |
---|---|---|---|---|
TransE | $f(h,r,t) = -\Vert h + r - t \Vert_p$ | smaller distance → smaller absolute score (i.e. higher score) | higher is better (always negative) | only relative magnitude is compared |
DistMult | $f(h,r,t) = h^T \mathrm{diag}(r)\, t$ | larger dot product → higher similarity | higher is better (positive or negative) | positive suggests a plausible triple |
ComplEx | $f(h,r,t) = \mathrm{Re}(h \odot r \odot \bar{t})$ | similarity measure in complex space | higher is better (positive or negative) | positive suggests a plausible triple |
HolE | $f(h,r,t) = r^T (h \star t)$ | similarity after circular correlation | higher is better (positive or negative) | positive suggests a plausible triple |
RotatE | $f(h,r,t) = -\Vert h \odot r - t \Vert_2$ | Euclidean distance after rotation; smaller distance → smaller absolute score | higher is better (always negative) | only relative magnitude is compared |
Score direction summary: for every model a higher score means a more plausible triple, but note:
- TransE and RotatE scores are negative distances: the smaller the absolute value (the closer to 0), the better.
- DistMult, ComplEx, and HolE scores express similarity directly: positive values are more credible.
Model | Key Strength | Limitation | Typical Scenario |
---|---|---|---|
TransE | simple and efficient, good for one-to-one relations | cannot handle many-to-many or symmetric relations | simple graph completion |
DistMult | computationally efficient, good for symmetric relations | cannot model asymmetric relations | symmetric relation reasoning |
ComplEx | supports asymmetric relations via the complex space | higher computational complexity | complex relation reasoning |
HolE | compressed computation, good for many-to-many relations | expressiveness limited by circular correlation | large-scale graphs |
RotatE | supports symmetric/inverse relations, flexible rotation operation | complex-valued vectors are costlier to train | complex logical reasoning (e.g. family relations) |
1. No comparability across models: scoring functions and value ranges differ widely between models (e.g. TransE's negative distances vs. DistMult's dot products), so scores can only be compared within a single model (a NumPy sketch of the five functions follows below).
2. Training objective: all of these models are optimized with negative sampling, maximizing the margin between positive and negative scores (margin-based loss).
3. Practical advice: choose the model by task (e.g. TransE for simple relations, ComplEx for asymmetric ones) and pay attention to the score ranking within one model.
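To make the sign conventions concrete, here is a minimal NumPy sketch of the five scoring functions. This is my own illustration, not AmpliGraph internals; the random vectors are placeholders, so only the sign behavior matters.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 4
h, r, t = rng.normal(size=(3, k))                                    # real embeddings
hc, rc, tc = rng.normal(size=(3, k)) + 1j * rng.normal(size=(3, k))  # complex embeddings

def transe(h, r, t, p=1):
    return -np.linalg.norm(h + r - t, ord=p)    # negative distance: always <= 0

def distmult(h, r, t):
    return np.sum(h * r * t)                    # trilinear product: any sign

def complex_score(h, r, t):
    return np.real(np.sum(h * r * np.conj(t)))  # real part: any sign

def hole(h, r, t):
    # circular correlation of h and t, computed via FFT
    corr = np.real(np.fft.ifft(np.conj(np.fft.fft(h)) * np.fft.fft(t)))
    return np.dot(r, corr)

def rotate(h, phase, t):
    return -np.linalg.norm(h * np.exp(1j * phase) - t)  # negative distance: always <= 0

print(transe(h, r, t), distmult(h, r, t), hole(h, r, t))
print(complex_score(hc, rc, tc), rotate(hc, rng.uniform(0, 2 * np.pi, k), tc))
```

Note how the TransE and RotatE scores can never exceed zero, while the other three can take either sign.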
1.1 TransE
- Principle: maps entities and relations to low-dimensional vectors, assuming the head vector plus the relation vector lands close to the tail vector, i.e. $h + r \approx t$. The scoring function is a negative distance: $f(h,r,t) = -\Vert h + r - t \Vert_p$.
- Strengths: handles simple one-to-one relations (e.g. parent-child) with high computational efficiency.
1.2 DistMult
- Principle: models triples via tensor factorization, representing each relation as a diagonal matrix. The scoring function is $f(h,r,t) = h^T \mathrm{diag}(r)\, t$, a trilinear product of head, relation, and tail.
- Strengths: captures symmetric relations (e.g. "friend of") but cannot handle asymmetric ones (e.g. "manages").
1.3 ComplEx
- Principle: extends DistMult into the complex space. The scoring function is the real part of a complex trilinear product, $f(h,r,t) = \mathrm{Re}(h \odot r \odot \bar{t})$, which makes asymmetric relations modelable.
- Strengths: handles asymmetric relations (e.g. "purchases") and supports richer semantic reasoning; the sketch below shows the symmetry difference from DistMult.
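A quick way to see the difference, as a toy sketch rather than library code: DistMult scores (h, r, t) and (t, r, h) identically because its trilinear product is commutative, while ComplEx generally does not.

```python
import numpy as np

rng = np.random.default_rng(1)
h, r, t = rng.normal(size=(3, 8))                                    # real embeddings
hc, rc, tc = rng.normal(size=(3, 8)) + 1j * rng.normal(size=(3, 8))  # complex embeddings

def distmult(h, r, t):
    return np.sum(h * r * t)

def complex_score(h, r, t):
    return np.real(np.sum(h * r * np.conj(t)))

print(distmult(h, r, t) == distmult(t, r, h))     # True: symmetric by construction
print(np.isclose(complex_score(hc, rc, tc),
                 complex_score(tc, rc, hc)))      # False in general: asymmetric
```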
1.4 HolE
- Principle: compresses the tensor via a circular correlation operation. The scoring function is $f(h,r,t) = r^T (h \star t)$, combining local interaction features between the two entities.
- Strengths: many-to-many relations (e.g. author-paper) at low computational complexity.
1.5 RotatE
- Principle: models each relation as a rotation in the complex space, i.e. $h \circ r \approx t$. The scoring function is a negative Euclidean distance: $f(h,r,t) = -\Vert h \circ r - t \Vert_2$.
- Strengths: supports symmetric and inverse relations (e.g. "spouse", antonyms), suited for complex logical reasoning; a small sketch follows.
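Here is a small sketch of the rotation idea. It is my own illustration, assuming unit-modulus relation entries as in the RotatE paper, not AmpliGraph internals.

```python
import numpy as np

rng = np.random.default_rng(2)
k = 8
h = rng.normal(size=k) + 1j * rng.normal(size=k)
r = np.exp(1j * rng.uniform(0, 2 * np.pi, size=k))  # |r_i| = 1: a pure per-dimension rotation

t = h * r                                  # define t as "h rotated by r"
print(-np.linalg.norm(h * r - t))          # 0.0: a perfect triple gets the maximum score

# The inverse relation is simply the conjugate rotation
print(np.allclose(t * np.conj(r), h))      # True
```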
2 Official Code Examples
The example code comes from: Examples
First, some terminology. In a knowledge graph, S, P, and O are the three components of a triple (a toy array follows the list):
- S (Subject): the subject, the core entity or object being described. For example, "apple".
- P (Predicate): the predicate or relation connecting subject and object. For example, "is a".
- O (Object): the object, the entity or attribute value associated with the subject. For example, "fruit".
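In AmpliGraph, a dataset is simply an N×3 NumPy array with one (S, P, O) triple per row; a minimal illustration with made-up labels:

```python
import numpy as np

# One (subject, predicate, object) triple per row
X = np.array([["Apple", "is_a", "Fruit"],
              ["Fruit", "contains", "Vitamin_C"]])
```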
2.1 Training and Evaluating a ComplEx Embedding Model
```python
import numpy as np
import tensorflow as tf
from ampligraph.datasets import load_wn18
from ampligraph.latent_features import ScoringBasedEmbeddingModel
from ampligraph.evaluation import mrr_score, hits_at_n_score
from ampligraph.latent_features.loss_functions import get as get_loss
from ampligraph.latent_features.regularizers import get as get_regularizer

# Load the WordNet18 dataset:
X = load_wn18()

# Initialize a ComplEx neural embedding model: the embedding size is k,
# eta specifies the number of corruptions to generate per each positive,
# scoring_type determines the scoring function of the embedding model.
model = ScoringBasedEmbeddingModel(k=150, eta=10, scoring_type='ComplEx')

# Optimizer, loss and regularizer definition
optim = tf.keras.optimizers.Adam(learning_rate=1e-3)
loss = get_loss('pairwise', {'margin': 0.5})
regularizer = get_regularizer('LP', {'p': 2, 'lambda': 1e-5})

# Compilation of the model
model.compile(optimizer=optim, loss=loss, entity_relation_regularizer=regularizer)

# For evaluation, we can use a filter which would be used to filter out
# positive statements created by the corruption procedure.
# Here we define the filter set by concatenating all the positives.
filter = {'test': np.concatenate((X['train'], X['valid'], X['test']))}

# Early stopping callback
checkpoint = tf.keras.callbacks.EarlyStopping(monitor='val_{}'.format('hits10'),
                                              min_delta=0,
                                              patience=5,
                                              verbose=1,
                                              mode='max',
                                              restore_best_weights=True)

# Fit the model on the training and validation sets
model.fit(X['train'],
          batch_size=int(X['train'].shape[0] / 10),
          epochs=20,                    # number of training epochs
          validation_freq=20,           # epochs between successive validations
          validation_burn_in=100,       # epoch at which validation starts
          validation_data=X['valid'],   # validation data
          validation_filter=filter,     # filter positives from validation corruptions
          callbacks=[checkpoint],       # early stopping (more tf.keras.callbacks are supported)
          verbose=True)                 # enable stdout messages

# Run the evaluation procedure on the test set (with filtering).
# To disable filtering, pass use_filter=None.
# Usually, we corrupt subject and object sides separately and compute ranks.
ranks = model.evaluate(X['test'],
                       use_filter=filter,
                       corrupt_side='s,o')

# Compute and print metrics:
mrr = mrr_score(ranks)
hits_10 = hits_at_n_score(ranks, n=10)
print("MRR: %f, Hits@10: %f" % (mrr, hits_10))
# Output: MRR: 0.884418, Hits@10: 0.935500
```
2.2 The Core ScoringBasedEmbeddingModel Class
The core entry point is ScoringBasedEmbeddingModel(k=5, eta=1, scoring_type='RotatE').
The scoring_type argument selects which of the following models is loaded:
- TransE: the translating embedding scoring function will be used
- DistMult: the DistMult embedding scoring function will be used
- ComplEx: the ComplEx embedding scoring function will be used
- HolE: the holographic embedding scoring function will be used
- RotatE: the RotatE embedding scoring function will be used
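For example (a minimal sketch), the same class covers all five models; only the scoring_type string changes:

```python
from ampligraph.latent_features import ScoringBasedEmbeddingModel

models = {name: ScoringBasedEmbeddingModel(k=150, eta=10, scoring_type=name)
          for name in ['TransE', 'DistMult', 'ComplEx', 'HolE', 'RotatE']}
```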
2.3 TransE Example Code
```python
# A toy model: early stopping example
from ampligraph.latent_features import ScoringBasedEmbeddingModel
from ampligraph.datasets import load_fb15k_237
import tensorflow as tf

dataset = load_fb15k_237()
model = ScoringBasedEmbeddingModel(eta=1, k=10, scoring_type='TransE')
model.compile(optimizer='adam', loss='multiclass_nll')

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_mrr",          # which metric to monitor
    patience=3,                 # stop if the metric doesn't improve for this many checks
    verbose=1,                  # verbosity
    mode="max",                 # how to compare the monitored metric; "max" means higher is better
    restore_best_weights=True)  # restore the weights with the best value

# The early stopping instance needs to be passed as a callback to fit
model.fit(dataset['train'],
          batch_size=10000,
          epochs=5,
          validation_freq=1,                        # validation frequency
          validation_batch_size=100,                # validation batch size
          validation_burn_in=3,                     # burn-in time
          validation_corrupt_side='s,o',            # which side to corrupt
          validation_data=dataset['valid'][::100],  # validation data
          callbacks=[early_stop])                   # pass the early stopping object as a callback
```
2.4 Retrieving Entity Embeddings Directly
```python
import numpy as np
from ampligraph.latent_features import ScoringBasedEmbeddingModel

model = ScoringBasedEmbeddingModel(k=5, eta=1, scoring_type='TransE')
model.compile(optimizer='adam', loss='nll')

X = np.array([['a', 'y', 'b'],
              ['b', 'y', 'a'],
              ['a', 'y', 'c'],
              ['c', 'y', 'a'],
              ['a', 'y', 'd'],
              ['c', 'y', 'd'],
              ['b', 'y', 'c'],
              ['f', 'y', 'e']])
model.fit(X, epochs=5)

model.get_embeddings(['f', 'e'], embedding_type='e')
# Output
# [[ 0.5677353   0.65208733  0.66626084  0.7323714   0.43467668]
#  [-0.7102897   0.59935296  0.17629518  0.5096843  -0.53681636]]
```
This returns the vectors for the two entities.
The embedding_type parameter accepts two values (from the docstring; a usage sketch for relations follows below):
- If 'e' or 'entities', the entities argument will be considered as a list of knowledge graph entities (i.e. nodes).
- If set to 'r' or 'relation', they will be treated as relation types instead (i.e. predicates).
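For instance, continuing the toy model above (a sketch; the actual numbers differ from run to run), relation vectors come back with embedding_type='r':

```python
# Entity embeddings (nodes)
ent_emb = model.get_embeddings(['f', 'e'], embedding_type='e')

# Relation embeddings (predicates)
rel_emb = model.get_embeddings(['y'], embedding_type='r')
print(ent_emb.shape, rel_emb.shape)  # e.g. (2, 5) and (1, 5) for k=5
```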
3 My Own Example
3.1 Training a RotatE Model
Here I rebuild the example I wrote for the pykeen post. For training I picked RotatE:
```python
import numpy as np
from ampligraph.latent_features import ScoringBasedEmbeddingModel
from ampligraph.utils import save_model, restore_model

# Use the RotatE model
model = ScoringBasedEmbeddingModel(k=5, eta=1, scoring_type='RotatE')
model.compile(optimizer='adam', loss='nll')

# A simple toy dataset
triples = [
    # Consumers linked to their membership privileges
    ("Consumer_ZhangWei", "uses_privilege", "Gold_Member_Discount"),
    ("Consumer_LiNa", "uses_privilege", "Platinum_Free_Shipping"),
    ("Consumer_WangQiang", "uses_privilege", "Silver_Coupon_100"),
    ("Consumer_ChenXin", "uses_privilege", "Diamond_Exclusive_Gift"),
    # Consumers buying at stores (combined with privileges)
    ("Consumer_ZhangWei", "purchased_at", "Store_Beijing_Chaoyang"),
    ("Store_Beijing_Chaoyang", "sells_with_privilege", "Gold_Member_Discount"),
    ("Consumer_ZhangWei", "buys", "Smartphone_X"),
    ("Consumer_LiNa", "purchased_at", "Store_Shanghai_Pudong"),
    ("Store_Shanghai_Pudong", "sells_with_privilege", "Platinum_Free_Shipping"),
    ("Consumer_LiNa", "buys", "Laptop_Pro15"),
    ("Consumer_WangQiang", "purchased_at", "Store_Guangzhou_Tianhe"),
    ("Store_Guangzhou_Tianhe", "sells_with_privilege", "Silver_Coupon_100"),
    ("Consumer_WangQiang", "buys", "Wireless_Earbuds"),
    ("Consumer_ChenXin", "purchased_at", "Store_Beijing_Chaoyang"),
    ("Store_Beijing_Chaoyang", "sells_with_privilege", "Diamond_Exclusive_Gift"),
    ("Consumer_ChenXin", "buys", "4K_Monitor_27inch"),
    # Additional attribute triples
    ("Smartphone_X", "category", "Electronics"),
    ("Laptop_Pro15", "price_range", "High_End"),
    ("Wireless_Earbuds", "brand", "SoundMaster"),
    ("4K_Monitor_27inch", "stock_status", "In_Stock"),
]
X = np.array(triples)
model.fit(X, epochs=5)
```
This example uses RotatE, which in my runs was more accurate than ComplEx.
3.2 Looking Up Entity IDs with model.get_indexes
You can use model.get_indexes to look up the internal IDs of entities and relations:

```python
model.get_indexes(['Consumer_ZhangWei', 'Consumer_ChenXin'], 'e', 'raw2ind')
```

get_indexes supports the following query directions (a round-trip sketch follows the docstring below):
- raw2ind: label to internal ID
- ind2raw: internal ID back to label

```python
'''
get_indexes(self, X, type_of="t", order="raw2ind")
type_of:
    - triples   (type_of='t')
    - entities  (type_of='e')
    - relations (type_of='r')
order:
    - data    (order='raw2ind')
    - indexes (order='ind2raw')
'''
```
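A hedged round-trip sketch (the actual IDs depend on how the model indexed the training data):

```python
# Label -> internal ID
ids = model.get_indexes(['Consumer_ZhangWei', 'Consumer_ChenXin'], 'e', 'raw2ind')
print(ids)  # e.g. [0, 3]

# Internal ID -> label
labels = model.get_indexes(ids, 'e', 'ind2raw')
print(labels)  # ['Consumer_ZhangWei', 'Consumer_ChenXin']
```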
3.3 Triple Prediction
Scoring every training triple with predict:

```python
y_pred_before = model.predict(X)
print(y_pred_before)
# [-2.50892   -2.2172801 -2.9533007 -2.7729113 -2.0113053 -1.6192987
#  -2.1668677 -2.4653506 -1.5966015 -2.9358544 -1.3368915 -3.2936296
#  -1.8905625 -2.3131676 -1.6629038 -2.3241253 -2.6047113 -1.9199837
#  -1.7808825 -2.453228 ]
```
All RotatE scores are negative; the higher the score (the closer to 0), the better.
RotatE's core idea is to model each relation as a rotation in complex space. Its scoring function is:
$$f(h, r, t) = - \Vert h \odot r - t \Vert_2$$
The score is a negative Euclidean distance: the smaller its absolute value (the closer to 0), the closer the rotated head entity lies to the tail entity, and the more plausible the triple.
Let's make up a test case: ("Consumer_LiNa", "buys", "Laptop_Pro15") is in the original triples, while ("Consumer_LiNa", "buys", "Store_Shanghai_Pudong") is not.

```python
y_pred_before = model.predict(np.array([("Consumer_LiNa", "buys", "Store_Shanghai_Pudong"),
                                        ("Consumer_LiNa", "buys", "Laptop_Pro15")]))  # one false, one true
print(y_pred_before)
# [-1.3484827 -2.9358544]
```
The scores look scrambled here: the false triple actually gets the better (closer-to-0) score. What's going on?
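My guess is that five epochs on a roughly 20-triple toy graph is simply too little training, so the scores are close to noise. A quick check (a sketch; exact numbers vary from run to run, and 200 epochs is an arbitrary choice) is to train longer and score again:

```python
# Retrain the same architecture for longer on the tiny graph
model = ScoringBasedEmbeddingModel(k=5, eta=1, scoring_type='RotatE')
model.compile(optimizer='adam', loss='nll')
model.fit(X, epochs=200)

print(model.predict(np.array([
    ("Consumer_LiNa", "buys", "Store_Shanghai_Pudong"),  # false triple
    ("Consumer_LiNa", "buys", "Laptop_Pro15"),           # true triple
])))
# Expectation: the true triple should now receive the higher (closer-to-0) score
```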
3.4 Entity and Relation Prediction with query_topn
This follows the official notebook: notebooks/AmpliGraph-Tutorials/Discovery 4 - Query TopN.ipynb

```python
from ampligraph.discovery import query_topn

# Predict the top-3 tails for (Consumer_ZhangWei, purchased_at, ?)
query_topn(model, top_n=3,
           head='Consumer_ZhangWei', relation='purchased_at', tail=None,
           ents_to_consider=None, rels_to_consider=None)

# You cannot pass all three of head, relation and tail:
# query_topn(model, top_n=3,
#            head='Consumer_ZhangWei', relation='purchased_at', tail='Store_Shanghai_Pudong',
#            ents_to_consider=None, rels_to_consider=None)
# Raises: ValueError: Exactly one of `head`, `relation` or `tail` arguments must be None.

# Predict the relation between a given head and tail
query_topn(model, top_n=3,
           head='Consumer_ZhangWei', tail='Store_Shanghai_Pudong',
           ents_to_consider=None, rels_to_consider=None)

# Predict the head for a given relation and tail
query_topn(model, top_n=3,
           relation='purchased_at', tail='Store_Shanghai_Pudong',
           ents_to_consider=None, rels_to_consider=None)

# Restrict the candidate entities with ents_to_consider
query_topn(model, top_n=3,
           head='Consumer_ZhangWei', relation='purchased_at',
           ents_to_consider=['Consumer_ZhangWei', 'Store_Beijing_Chaoyang', 'Gold_Member_Discount'],
           rels_to_consider=None)
```
query_topn predicts missing relations or entities.
Exactly two of head, relation, and tail must be provided: you cannot pass all three, nor only one.
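As the test code below also shows, query_topn returns two arrays, the completed top-n triples and their scores, which can be unpacked directly (a sketch on the toy graph from section 3.1):

```python
from ampligraph.discovery import query_topn

# Which tails best complete (Consumer_ZhangWei, purchased_at, ?)
triples, scores = query_topn(model, top_n=3,
                             head='Consumer_ZhangWei', relation='purchased_at')
for triple, score in zip(triples, scores):
    print(triple, score)
```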
The library's test file test_discovery.py also contains some examples:
```python
import numpy as np
import pytest
from ampligraph.discovery import query_topn
from ampligraph.latent_features import ScoringBasedEmbeddingModel

def test_query_topn():
    X = np.array([['a', 'y', 'b'],
                  ['b', 'y', 'a'],
                  ['a', 'y', 'c'],
                  ['c', 'y', 'a'],
                  ['a', 'y', 'd'],
                  ['c', 'x', 'd'],
                  ['b', 'y', 'c'],
                  ['f', 'y', 'e'],
                  ['a', 'z', 'f'],
                  ['c', 'z', 'f'],
                  ['b', 'z', 'f']])
    model = ScoringBasedEmbeddingModel(eta=5, k=10, scoring_type='ComplEx')
    model.compile(optimizer='adam', loss='multiclass_nll')
    model.fit(X, batch_size=2, epochs=10)

    with pytest.raises(ValueError):  # Model not fitted
        query_topn(model, top_n=2)
    with pytest.raises(ValueError):
        query_topn(model, top_n=2)
    with pytest.raises(ValueError):
        query_topn(model, top_n=2, head='a')
    with pytest.raises(ValueError):
        query_topn(model, top_n=2, relation='y')
    with pytest.raises(ValueError):
        query_topn(model, top_n=2, tail='e')
    with pytest.raises(ValueError):
        query_topn(model, top_n=2, head='a', relation='y', tail='e')
    with pytest.raises(ValueError):
        query_topn(model, top_n=2, head='xx', relation='y')
    with pytest.raises(ValueError):
        query_topn(model, top_n=2, head='a', relation='yakkety')
    with pytest.raises(ValueError):
        query_topn(model, top_n=2, head='a', tail='sax')
    with pytest.raises(ValueError):
        query_topn(model, top_n=2, head='a', relation='x', rels_to_consider=['y', 'z'])
    with pytest.raises(ValueError):
        query_topn(model, top_n=2, head='a', tail='f', rels_to_consider=['y', 'z', 'error'])
    with pytest.raises(ValueError):
        query_topn(model, top_n=2, head='a', tail='e', rels_to_consider='y')
    with pytest.raises(ValueError):
        query_topn(model, top_n=2, head='a', relation='x', ents_to_consider=['zz', 'top'])
    with pytest.raises(ValueError):
        query_topn(model, top_n=2, head='a', tail='e', ents_to_consider=['a', 'b'])

    subj, pred, obj, top_n = 'a', 'x', 'e', 3

    Y, S = query_topn(model, top_n=top_n, head=subj, relation=pred)
    assert len(Y) == len(S)
    assert len(Y) == top_n
    assert np.all(Y[:, 0] == subj)
    assert np.all(Y[:, 1] == pred)

    Y, S = query_topn(model, top_n=top_n, relation=pred, tail=obj)
    assert np.all(Y[:, 1] == pred)
    assert np.all(Y[:, 2] == obj)

    ents_to_con = ['a', 'b', 'c', 'd']
    Y, S = query_topn(model, top_n=top_n, relation=pred, tail=obj, ents_to_consider=ents_to_con)
    assert np.all([x in ents_to_con for x in Y[:, 0]])

    rels_to_con = ['y', 'x']
    Y, S = query_topn(model, top_n=100, head=subj, tail=obj, rels_to_consider=rels_to_con)
    assert np.all([x in rels_to_con for x in Y[:, 1]])

    Y, S = query_topn(model, top_n=100, relation=pred, tail=obj)
    assert all(S[i] >= S[i + 1] for i in range(len(S) - 1))
```
3.5 Saving and Restoring Models
```python
# Save the model
example_name = "helloworld.pkl"
save_model(model, model_name_path=example_name)

# Restore the model
restored_model = restore_model(model_name_path=example_name)

# Use the restored model to predict
y_pred_after = restored_model.predict(
    np.array([("Consumer_LiNa", "buys", "Store_Shanghai_Pudong"),
              ("Smartphone_X", "buys", "Laptop_Pro15")])
)
print(y_pred_after)
# [ 0.1416718 -0.0070735]
```
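A quick sanity check worth adding (a sketch, assuming the saved model is still in memory as model): the restored model should reproduce the original scores exactly.

```python
# The in-memory model and the restored model should agree
y_pred_original = model.predict(
    np.array([("Consumer_LiNa", "buys", "Store_Shanghai_Pudong"),
              ("Smartphone_X", "buys", "Laptop_Pro15")]))
print(np.allclose(y_pred_original, y_pred_after))  # expected: True
```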
3.6 Entity Clustering + a 2D PCA Plot
You will need to install:
```
pip install adjustText
pip install pycountry_convert
```
```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import seaborn as sns
from adjustText import adjust_text
import pycountry_convert as pc  # only needed for the commented-out continent column
from ampligraph.discovery import find_clusters

# Get the subject entities and their corresponding embeddings
triples_df = pd.DataFrame(X, columns=['s', 'p', 'o'])
teams = triples_df.s.unique()
team_embeddings = dict(zip(teams, model.get_embeddings(teams)))
team_embeddings_array = np.array([i for i in team_embeddings.values()])

# Project embeddings into 2D space via PCA in order to plot them
embeddings_2d = PCA(n_components=2).fit_transform(team_embeddings_array)

# Cluster the embeddings (in the original space)
clustering_algorithm = KMeans(n_clusters=3, n_init=100, max_iter=500, random_state=0)
clusters = find_clusters(teams, model, clustering_algorithm, mode='e')

plot_df = pd.DataFrame({"teams": teams,
                        "embedding1": embeddings_2d[:, 0],
                        "embedding2": embeddings_2d[:, 1],
                        # "continent": pd.Series(teams).apply(cn_to_ctn),
                        "cluster": "cluster" + pd.Series(clusters).astype(str)})
plot_df  # inspect the dataframe

np.random.seed(0)

# Plot the 2D embeddings with entity labels
def plot_clusters(hue):
    plt.figure(figsize=(12, 12))
    plt.title("{} embeddings".format(hue).capitalize())
    ax = sns.scatterplot(data=plot_df,
                         x="embedding1", y="embedding2",
                         hue=hue)  # hue: the pandas column used to colour the points
    texts = []
    for i, point in plot_df.iterrows():
        texts.append(plt.text(point['embedding1'] + 0.02,
                              point['embedding2'] + 0.01,
                              str(point["teams"])))
    adjust_text(texts)
    # plt.savefig(hue + '_cluster_ex.png')

# plot_clusters("continent")
plot_clusters("cluster")
```
The logic is simple:
- Get the embeddings of all entities with model.get_embeddings(teams).
- Reduce those vectors to two dimensions with PCA(n_components=2).fit_transform(team_embeddings_array).
The resulting plot looks like this: