Different Types of Semantic Similarity Loss Functions (SentenceTransformer Losses)
Table of Contents
- Losses for different input types
- Input type: [(anchor, positive/negative, label 1/0), ...] — label 1 means the pair should be close, label 0 means far apart
  - ContrastiveLoss
  - OnlineContrastiveLoss
- Input type: [(sentence1, label1), (sentence2, label2), ...] — sentences with the same label should be close
  - BatchAllTripletLoss
  - BatchHardSoftMarginTripletLoss
  - BatchHardTripletLoss
- Input type: [(sentence1, sentence2, score), ...] — regress the sentence-pair score (between 0 and 1)
  - CosineSimilarityLoss (similarity regression)
  - CoSENTLoss (similarity regression and ranking)
- Input type: [(sentence1, sentence2, label), ...] — multi-class classification of sentence pairs
  - SoftmaxLoss
- Input type: [(anchor, positive, negative), ...] — triplet input
  - TripletLoss
  - MultipleNegativesRankingLoss / InfoNCELoss
  - CachedMultipleNegativesRankingLoss
- Input type: [(anchor, positive), ...] — positive pairs only
- Input type: [sentence1, sentence2, ...] — unlabeled input
Losses for different input types
Choose the loss according to the task and the type of training data; see the loss overview in the official Sentence Transformers documentation for details.
Input type: [(anchor, positive/negative, label 1/0), ...] — label 1 means the pair should be close, label 0 means far apart
ContrastiveLoss
For a sentence pair A and B:
- positive pairs (label 1) should be as close as possible;
- negative pairs (label 0) should be as far apart as possible; only negative pairs whose distance is smaller than the margin are penalized, and once the distance exceeds this threshold they no longer contribute to the loss.
`distance_metric` defaults to cosine distance and `margin` defaults to 0.5. For a pair with label y, the loss is 0.5 * (y * d(A, B)^2 + (1 - y) * max(margin - d(A, B), 0)^2), matching the implementation below.
```python
def forward(self, sentence_features: Iterable[dict[str, Tensor]], labels: Tensor) -> Tensor:
    reps = [self.model(sentence_feature)["sentence_embedding"] for sentence_feature in sentence_features]
    assert len(reps) == 2
    rep_anchor, rep_other = reps
    distances = self.distance_metric(rep_anchor, rep_other)
    # positive pairs (label 1): squared distance; negative pairs (label 0): squared hinge on (margin - distance)
    losses = 0.5 * (
        labels.float() * distances.pow(2) + (1 - labels).float() * F.relu(self.margin - distances).pow(2)
    )
    return losses.mean() if self.size_average else losses.sum()
```
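As a minimal usage sketch (assuming the Sentence Transformers v3 `SentenceTransformerTrainer` API; the model name and example pairs are placeholders), the training data only needs the two sentence columns plus the 0/1 label:

```python
from datasets import Dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses

# Hypothetical labeled pairs: label 1 = should be close, label 0 = should be far apart
train_dataset = Dataset.from_dict({
    "sentence1": ["A man is eating food.", "A man is eating food."],
    "sentence2": ["A man is eating a piece of bread.", "A plane is taking off."],
    "label": [1, 0],
})

model = SentenceTransformer("microsoft/mpnet-base")
loss = losses.ContrastiveLoss(model)  # cosine distance and margin=0.5 by default

trainer = SentenceTransformerTrainer(model=model, train_dataset=train_dataset, loss=loss)
trainer.train()
```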
OnlineContrastiveLoss
Essentially the same as `ContrastiveLoss`, except that the loss is computed only on the hard pairs within a batch; it usually performs better than plain contrastive loss.
Hard-pair selection: keep negative pairs whose distance is smaller than the largest positive-pair distance, and positive pairs whose distance is larger than the smallest negative-pair distance. Easy pairs — negatives already farther than the farthest positive, and positives already closer than the closest negative — are ignored.
```python
def forward(self, sentence_features: Iterable[dict[str, Tensor]], labels: Tensor, size_average=False) -> Tensor:
    embeddings = [self.model(sentence_feature)["sentence_embedding"] for sentence_feature in sentence_features]
    distance_matrix = self.distance_metric(embeddings[0], embeddings[1])
    negs = distance_matrix[labels == 0]
    poss = distance_matrix[labels == 1]

    # select hard positive and hard negative pairs
    negative_pairs = negs[negs < (poss.max() if len(poss) > 1 else negs.mean())]
    positive_pairs = poss[poss > (negs.min() if len(negs) > 1 else poss.mean())]

    positive_loss = positive_pairs.pow(2).sum()
    negative_loss = F.relu(self.margin - negative_pairs).pow(2).sum()
    loss = positive_loss + negative_loss
    return loss
```
Input type: [(sentence1, label1), (sentence2, label2), ...] — sentences with the same label should be close
BatchAllTripletLoss
The loss encourages:
- sentences in the batch that share a label (same class) to be close to each other;
- sentences in the batch with different labels (different classes) to be far apart;
- for any anchor, its distance to samples with the same label to be smaller than its distance to samples with a different label.
For example, given four samples [(a, label1), (b, label1), (c, label2), (d, label2)], the pairwise distance matrix pairwise_dist is [[aa, ab, ac, ad], ..., [da, db, dc, dd]]. With a as the anchor, ab is a positive-pair distance and ac a negative-pair distance, and one term of the loss is ab - ac + margin.
The larger the positive-pair distance and the smaller the negative-pair distance, the larger the loss. Triplets whose negative distance already exceeds the positive distance by more than the margin, i.e. ab - ac + margin < 0, are ignored: such triplets are easy to separate and barely affect the loss. A usage sketch follows the implementation below.
```python
def batch_all_triplet_loss(self, labels: Tensor, embeddings: Tensor) -> Tensor:
    # Get the pairwise distance matrix
    pairwise_dist = self.distance_metric(embeddings)

    anchor_positive_dist = pairwise_dist.unsqueeze(2)
    anchor_negative_dist = pairwise_dist.unsqueeze(1)

    # Compute a 3D tensor of size (batch_size, batch_size, batch_size)
    # triplet_loss[i, j, k] will contain the triplet loss of anchor=i, positive=j, negative=k
    # Uses broadcasting where the 1st argument has shape (batch_size, batch_size, 1)
    # and the 2nd (batch_size, 1, batch_size)
    triplet_loss = anchor_positive_dist - anchor_negative_dist + self.triplet_margin

    # Put to zero the invalid triplets
    # (where label(a) != label(p) or label(n) == label(a) or a == p)
    mask = BatchHardTripletLoss.get_triplet_mask(labels)
    triplet_loss = mask.float() * triplet_loss

    # Remove negative losses (i.e. the easy triplets)
    triplet_loss[triplet_loss < 0] = 0

    # Count number of positive triplets (where triplet_loss > 0)
    valid_triplets = triplet_loss[triplet_loss > 1e-16]
    num_positive_triplets = valid_triplets.size(0)

    # num_valid_triplets = mask.sum()
    # fraction_positive_triplets = num_positive_triplets / (num_valid_triplets.float() + 1e-16)

    # Get final mean triplet loss over the positive valid triplets
    triplet_loss = triplet_loss.sum() / (num_positive_triplets + 1e-16)

    return triplet_loss
```
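A rough usage sketch (placeholder sentences and labels); note that a batch needs at least two sentences per label for valid triplets to exist:

```python
from datasets import Dataset
from sentence_transformers import SentenceTransformer, losses

# Hypothetical class-labeled sentences: the integer label marks the class
train_dataset = Dataset.from_dict({
    "sentence": [
        "My package never arrived",
        "Where is my order?",
        "How do I reset my password?",
        "I can't log into my account",
    ],
    "label": [0, 0, 1, 1],
})

model = SentenceTransformer("microsoft/mpnet-base")
loss = losses.BatchAllTripletLoss(model)  # forms all valid (anchor, positive, negative) triplets per batch
```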
BatchHardSoftMarginTripletLoss
For any anchor in the batch, even its largest distance to a same-label sample should be smaller than its smallest distance to a different-label sample: same-class samples, even distant ones, must stay closer than any sample from another class.
A soft margin is used: loss = log(1 + exp(d(a, p) - d(a, n))). This gives a smooth, always-differentiable penalty that grows when the positive-pair distance exceeds the negative-pair distance and tends to 0 when the positive pair is much closer than the negative pair.
```python
def batch_hard_triplet_soft_margin_loss(self, labels: Tensor, embeddings: Tensor) -> Tensor:
    # Get the pairwise distance matrix
    pairwise_dist = self.distance_metric(embeddings)

    # For each anchor, get the hardest positive
    # First, we need to get a mask for every valid positive (they should have same label)
    mask_anchor_positive = BatchHardTripletLoss.get_anchor_positive_triplet_mask(labels).float()

    # We put to 0 any element where (a, p) is not valid (valid if a != p and label(a) == label(p))
    anchor_positive_dist = mask_anchor_positive * pairwise_dist

    # shape (batch_size, 1)
    hardest_positive_dist, _ = anchor_positive_dist.max(1, keepdim=True)

    # For each anchor, get the hardest negative
    # First, we need to get a mask for every valid negative (they should have different labels)
    mask_anchor_negative = BatchHardTripletLoss.get_anchor_negative_triplet_mask(labels).float()

    # We add the maximum value in each row to the invalid negatives (label(a) == label(n))
    max_anchor_negative_dist, _ = pairwise_dist.max(1, keepdim=True)
    anchor_negative_dist = pairwise_dist + max_anchor_negative_dist * (1.0 - mask_anchor_negative)

    # shape (batch_size,)
    hardest_negative_dist, _ = anchor_negative_dist.min(1, keepdim=True)

    # Combine biggest d(a, p) and smallest d(a, n) into final triplet loss with soft margin
    # tl = hardest_positive_dist - hardest_negative_dist + margin
    # tl[tl < 0] = 0
    tl = torch.log1p(torch.exp(hardest_positive_dist - hardest_negative_dist))
    triplet_loss = tl.mean()

    return triplet_loss
```
BatchHardTripletLoss
Unlike `BatchHardSoftMarginTripletLoss`, the margin is set explicitly: loss = d(a, p) - d(a, n) + margin with loss[loss < 0] = 0, so anchors whose hardest negative is already farther than the hardest positive by more than the margin are ignored.
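As a sketch of how the hard-margin variant differs from the soft-margin code above, the snippet below re-implements only the final step on already-computed hardest distances (the `margin` argument and function name are hypothetical; this is not the library's exact code):

```python
import torch.nn.functional as F
from torch import Tensor

def hard_margin_triplet_loss(hardest_positive_dist: Tensor,
                             hardest_negative_dist: Tensor,
                             margin: float = 5.0) -> Tensor:
    # Penalize only anchors whose hardest negative is not at least `margin`
    # farther from the anchor than their hardest positive; easy anchors contribute 0.
    tl = F.relu(hardest_positive_dist - hardest_negative_dist + margin)
    return tl.mean()
```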
Input type: [(sentence1, sentence2, score), ...] — regress the sentence-pair score (between 0 and 1)
CosineSimilarityLoss (similarity regression)
Compute the cosine similarity of each sentence pair and take the MSE against the gold score. `cos_score_transformation` is the identity (does nothing) by default, and `loss_fct` defaults to MSE loss.
```python
def forward(self, sentence_features: Iterable[dict[str, Tensor]], labels: Tensor) -> Tensor:
    embeddings = [self.model(sentence_feature)["sentence_embedding"] for sentence_feature in sentence_features]
    output = self.cos_score_transformation(torch.cosine_similarity(embeddings[0], embeddings[1]))
    return self.loss_fct(output, labels.float().view(-1))
```
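A small usage sketch (placeholder model name and scores; the scores are floats in [0, 1]):

```python
from datasets import Dataset
from sentence_transformers import SentenceTransformer, losses

# Hypothetical pairs with similarity scores between 0 and 1
train_dataset = Dataset.from_dict({
    "sentence1": ["A plane is taking off.", "A man is playing a flute."],
    "sentence2": ["An air plane is taking off.", "A man is playing the violin."],
    "score": [1.0, 0.3],
})

model = SentenceTransformer("microsoft/mpnet-base")
loss = losses.CosineSimilarityLoss(model)  # MSE between cosine similarity and the gold score
```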
CoSENTLoss (similarity regression and ranking)
Cosine Sentence (CoSENT) loss; for the underlying idea see the Scientific Spaces (科学空间) post "CoSENT(一):比Sentence-BERT更有效的句向量方案".
Loss: for sentence pairs (i, j) and (k, l), if label[i,j] < label[k,l], the model's predicted similarity should satisfy scores[i,j] < scores[k,l]. The loss is
loss = log(1 + Σ_{label[i,j] < label[k,l]} exp(λ · (s[i,j] − s[k,l]))),
i.e. every pair (i, j) whose gold score is lower than that of (k, l) is pushed to have a lower predicted similarity (λ is the `scale` factor in the code).
Similarity measure: the cosine similarity score, where 1 means similar and 0 means dissimilar. Note this is a similarity score, not a distance. After training, the cosine similarity between embeddings reflects semantic similarity, so this loss suits sentence-similarity regression and ranking tasks.
For example, with 3 sentence pairs in a batch, numbered 1, 2 and 3, and gold labels (0.1, 0.7, 0.9), the combinations (1, 2), (1, 3) and (2, 3) enter the loss. If the predicted scores are (0.3, 0.4, 0.2), the score differences are (-0.1, 0.1, 0.2); a positive difference (a lower-labelled pair predicted as more similar) increases the loss.
```python
def forward(self, sentence_features: Iterable[dict[str, Tensor]], labels: Tensor) -> Tensor:
    embeddings = [self.model(sentence_feature)["sentence_embedding"] for sentence_feature in sentence_features]

    scores = self.similarity_fct(embeddings[0], embeddings[1])
    scores = scores * self.scale
    scores = scores[:, None] - scores[None, :]

    # label matrix indicating which pairs are relevant
    labels = labels[:, None] < labels[None, :]
    labels = labels.float()

    # mask out irrelevant pairs so they are negligible after exp()
    scores = scores - (1 - labels) * 1e12

    # append a zero as e^0 = 1
    scores = torch.cat((torch.zeros(1).to(scores.device), scores.view(-1)), dim=0)
    loss = torch.logsumexp(scores, dim=0)

    return loss
```
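To make the worked example above concrete, here is a standalone re-implementation of the same logsumexp computation in plain PyTorch (scale is set to 1 so the differences match the numbers in the example; the library default is 20):

```python
import torch

scores = torch.tensor([0.3, 0.4, 0.2])  # predicted similarities for pairs 1, 2, 3
labels = torch.tensor([0.1, 0.7, 0.9])  # gold scores
scale = 1.0                             # library default is 20; 1 keeps the numbers readable

diff = (scores[:, None] - scores[None, :]) * scale   # s_i - s_j for every (i, j)
keep = (labels[:, None] < labels[None, :]).float()   # only pairs with label_i < label_j matter
diff = diff - (1 - keep) * 1e12                      # irrelevant pairs vanish after exp()

loss = torch.logsumexp(torch.cat([torch.zeros(1), diff.view(-1)]), dim=0)
print(loss)  # roughly 1.44, i.e. log(1 + e^-0.1 + e^0.1 + e^0.2)
```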
Input type: [(sentence1, sentence2, label), ...] — multi-class classification of sentence pairs
SoftmaxLoss
A siamese network that performs multi-class classification on sentence pairs.
```python
from datasets import Dataset
from sentence_transformers import SentenceTransformer, losses

model = SentenceTransformer("microsoft/mpnet-base")
train_dataset = Dataset.from_dict({
    "sentence1": [
        "A person on a horse jumps over a broken down airplane.",
        "A person on a horse jumps over a broken down airplane.",
        "A person on a horse jumps over a broken down airplane.",
        "Children smiling and waving at camera",
    ],
    "sentence2": [
        "A person is training his horse for a competition.",
        "A person is at a diner, ordering an omelette.",
        "A person is outdoors, on a horse.",
        "There are children present.",
    ],
    "label": [1, 2, 0, 0],
})
loss = losses.SoftmaxLoss(model, model.get_sentence_embedding_dimension(), num_labels=3)
```
```python
def forward(
    self, sentence_features: Iterable[dict[str, Tensor]], labels: Tensor
) -> Tensor | tuple[Tensor, Tensor]:
    reps = [self.model(sentence_feature)["sentence_embedding"] for sentence_feature in sentence_features]
    rep_a, rep_b = reps

    # Build the classifier input; by default (u, v, |u - v|) is concatenated
    vectors_concat = []
    if self.concatenation_sent_rep:
        vectors_concat.append(rep_a)
        vectors_concat.append(rep_b)

    if self.concatenation_sent_difference:
        vectors_concat.append(torch.abs(rep_a - rep_b))

    if self.concatenation_sent_multiplication:
        vectors_concat.append(rep_a * rep_b)

    features = torch.cat(vectors_concat, 1)

    output = self.classifier(features)

    if labels is not None:
        loss = self.loss_fct(output, labels.view(-1))
        return loss
    else:
        return reps, output
```
Input type: [(anchor, positive, negative), ...] — triplet input
TripletLoss
The negative should be farther from the anchor than the positive by at least the margin; in other words, triplets with d(anchor, negative) - d(anchor, positive) < margin are penalized. The default `distance_metric` is Euclidean distance and the default `margin` is 5.
```python
def forward(self, sentence_features: Iterable[dict[str, Tensor]], labels: Tensor) -> Tensor:
    reps = [self.model(sentence_feature)["sentence_embedding"] for sentence_feature in sentence_features]

    rep_anchor, rep_pos, rep_neg = reps
    distance_pos = self.distance_metric(rep_anchor, rep_pos)
    distance_neg = self.distance_metric(rep_anchor, rep_neg)

    losses = F.relu(distance_pos - distance_neg + self.triplet_margin)
    return losses.mean()
```
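A usage sketch with a placeholder triplet dataset:

```python
from datasets import Dataset
from sentence_transformers import SentenceTransformer, losses

# Hypothetical (anchor, positive, negative) triplets
train_dataset = Dataset.from_dict({
    "anchor":   ["How do I bake bread?"],
    "positive": ["What is a simple bread recipe?"],
    "negative": ["How do I change a car tire?"],
})

model = SentenceTransformer("microsoft/mpnet-base")
loss = losses.TripletLoss(model)  # Euclidean distance and triplet_margin=5 by default
```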
MultipleNegativesRankingLoss / InfoNCELoss
Each anchor comes with one positive and several negatives. The similarity between the anchor and all candidates (its positive plus the negatives) is computed and treated as a softmax classification problem: the similarity to the positive is pushed up and the similarity to the negatives is pushed down.
This is equivalent to the InfoNCE loss, which applies temperature scaling to the scores before the softmax; in `MultipleNegativesRankingLoss` the temperature corresponds to the `scale` parameter, and `scale=1` reduces it to a plain cross-entropy loss.
```python
def forward(self, sentence_features: Iterable[dict[str, Tensor]], labels: Tensor) -> Tensor:
    # Compute the embeddings and distribute them to anchor and candidates (positive and optionally negatives)
    embeddings = [self.model(sentence_feature)["sentence_embedding"] for sentence_feature in sentence_features]
    anchors = embeddings[0]  # (batch_size, embedding_dim)
    candidates = torch.cat(embeddings[1:])  # (batch_size * (1 + num_negatives), embedding_dim)

    # For every anchor, we compute the similarity to all other candidates (positives and negatives),
    # also from other anchors. This gives us a lot of in-batch negatives.
    scores = self.similarity_fct(anchors, candidates) * self.scale
    # (batch_size, batch_size * (1 + num_negatives))

    # anchor[i] should be most similar to candidates[i], as that is the paired positive,
    # so the label for anchor[i] is i
    range_labels = torch.arange(0, scores.size(0), device=scores.device)

    return self.cross_entropy_loss(scores, range_labels)
```
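A usage sketch with explicit hard negatives (placeholder data); without the `negative` column the same loss also works on plain (anchor, positive) pairs, as described in the section further below:

```python
from datasets import Dataset
from sentence_transformers import SentenceTransformer, losses

# Hypothetical (anchor, positive, hard-negative) triplets; the positives of the
# other anchors in the batch additionally act as in-batch negatives
train_dataset = Dataset.from_dict({
    "anchor":   ["How do I bake bread?", "What is the capital of France?"],
    "positive": ["What is a simple bread recipe?", "Paris is the capital of France."],
    "negative": ["How do I change a car tire?", "Berlin is the capital of Germany."],
})

model = SentenceTransformer("microsoft/mpnet-base")
loss = losses.MultipleNegativesRankingLoss(model)  # scale=20 and cosine similarity by default
```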
CachedMultipleNegativesRankingLoss
An optimized version of `MultipleNegativesRankingLoss`: the batch is split into several mini-batches and gradients are cached, so large effective batch sizes can be used without running out of GPU memory.
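A minimal sketch (the mini-batch size below is just an example value):

```python
from sentence_transformers import SentenceTransformer, losses

model = SentenceTransformer("microsoft/mpnet-base")
# The large logical batch is processed in mini-batches of 32 with cached gradients,
# so the number of in-batch negatives can grow without exhausting GPU memory.
loss = losses.CachedMultipleNegativesRankingLoss(model, mini_batch_size=32)
```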
Input type: [(anchor, positive), ...] — positive pairs only
`MultipleNegativesRankingLoss` can be used here as well: the positives of the other pairs in the batch serve as negatives for each anchor, and a softmax classification is performed. The more samples in the batch, the harder the classification and the better the expected result.
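A sketch with positive pairs only (placeholder data):

```python
from datasets import Dataset
from sentence_transformers import SentenceTransformer, losses

# Hypothetical (anchor, positive) pairs; for each anchor, every other positive
# in the batch is used as an in-batch negative
train_dataset = Dataset.from_dict({
    "anchor":   ["How do I bake bread?", "What is the capital of France?"],
    "positive": ["What is a simple bread recipe?", "Paris is the capital of France."],
})

model = SentenceTransformer("microsoft/mpnet-base")
loss = losses.MultipleNegativesRankingLoss(model)
```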
Input type: [sentence1, sentence2, ...] — unlabeled input
For unlabeled input, `ContrastiveTensionLossInBatchNegatives` can be used: the same sentence goes through two forward passes (the network contains stochastic operations such as dropout), and the objective pulls the two encodings of the same sentence together while pushing encodings of different sentences apart.
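A rough sketch, assuming the data is prepared as in the library's unsupervised contrastive-tension example, i.e. each sentence paired with itself; the corpus and column names below are placeholders:

```python
from datasets import Dataset
from sentence_transformers import SentenceTransformer, losses

# Hypothetical unlabeled corpus; each sentence is paired with itself, and the
# other sentences in the batch act as negatives
sentences = ["The weather is lovely today.", "He drove to the stadium.", "Cats sleep a lot."]
train_dataset = Dataset.from_dict({"sentence1": sentences, "sentence2": sentences})

model = SentenceTransformer("microsoft/mpnet-base")
loss = losses.ContrastiveTensionLossInBatchNegatives(model)
```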