
Different Types of Semantic Similarity Loss Functions (SentenceTransformer Losses)

Table of Contents

  • Losses for different input types
  • Input type: [(anchor, positive/negative, label 1/0), ...], label 1 means small distance, label 0 means large distance
    • ContrastiveLoss (contrastive loss)
    • OnlineContrastiveLoss
  • Input type: [(sentence1, label1), (sentence2, label2), ...], same label means small distance
    • BatchAllTripletLoss
    • BatchHardSoftMarginTripletLoss
    • BatchHardTripletLoss
  • Input type: [(sentence1, sentence2, score), ...], regress the score of a sentence pair (between 0 and 1)
    • CosineSimilarityLoss (similarity regression)
    • CoSENTLoss (similarity regression and ranking)
  • Input type: [(sentence1, sentence2, label), ...], multi-class classification of sentence pairs
    • SoftmaxLoss
  • Input type: [(anchor, positive, negative), ...], triplet input
    • TripletLoss
    • MultipleNegativesRankingLoss / InfoNCELoss
    • CachedMultipleNegativesRankingLoss
  • Input type: [(anchor, positive), ...], positive pairs only
  • Input type: [sentence1, sentence2, ...], unlabeled input


Losses for different input types

Choose a loss that matches the data type of your task; see here for details.


Input type: [(anchor, positive/negative, label 1/0), ...], label 1 means small distance, label 0 means large distance

ContrastiveLoss (contrastive loss)

For a sample pair A and B:

  • Positive pairs (label 1): the distance between them should be as small as possible;
  • Negative pairs (label 0): the distance between them should be as large as possible; only negative pairs whose distance is smaller than the margin are penalized, and pairs beyond that threshold contribute no loss;

distance_metric defaults to cosine distance and margin defaults to 0.5. The loss is loss = 0.5 * (y * d(a, b)^2 + (1 - y) * max(margin - d(a, b), 0)^2), where y is the pair label and d(a, b) is the distance between the two embeddings.

def forward(self, sentence_features: Iterable[dict[str, Tensor]], labels: Tensor) -> Tensor:
    reps = [self.model(sentence_feature)["sentence_embedding"] for sentence_feature in sentence_features]
    assert len(reps) == 2
    rep_anchor, rep_other = reps
    distances = self.distance_metric(rep_anchor, rep_other)
    losses = 0.5 * (
        labels.float() * distances.pow(2) + (1 - labels).float() * F.relu(self.margin - distances).pow(2)
    )
    return losses.mean() if self.size_average else losses.sum()
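A minimal usage sketch for this input type (the sentence pairs and column names below are made up for illustration; the construction follows the same Dataset.from_dict pattern as the SoftmaxLoss example later in this post):

from datasets import Dataset
from sentence_transformers import SentenceTransformer, losses

model = SentenceTransformer("microsoft/mpnet-base")
# label 1: similar pair, pulled together; label 0: dissimilar pair, pushed apart up to the margin
train_dataset = Dataset.from_dict({
    "sentence1": ["A man is eating food.", "A man is eating food."],
    "sentence2": ["A man is eating a piece of bread.", "A plane is taking off."],
    "label": [1, 0],
})
loss = losses.ContrastiveLoss(model, margin=0.5)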

OnlineContrastiveLoss

Essentially the same as ContrastiveLoss, except that the loss is computed only on the hard examples within a batch, which usually works better than plain contrastive loss.

Loss: select negative pairs whose distance is smaller than the largest positive-pair distance, and positive pairs whose distance is larger than the smallest negative-pair distance. Easy instances (negative pairs already farther than every positive pair, and positive pairs already closer than every negative pair) are ignored. A small worked example follows the code below.

def forward(self, sentence_features: Iterable[dict[str, Tensor]], labels: Tensor, size_average=False) -> Tensor:
    embeddings = [self.model(sentence_feature)["sentence_embedding"] for sentence_feature in sentence_features]
    distance_matrix = self.distance_metric(embeddings[0], embeddings[1])
    negs = distance_matrix[labels == 0]
    poss = distance_matrix[labels == 1]

    # select hard positive and hard negative pairs
    negative_pairs = negs[negs < (poss.max() if len(poss) > 1 else negs.mean())]
    positive_pairs = poss[poss > (negs.min() if len(negs) > 1 else poss.mean())]

    positive_loss = positive_pairs.pow(2).sum()
    negative_loss = F.relu(self.margin - negative_pairs).pow(2).sum()
    loss = positive_loss + negative_loss
    return loss
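To make the hard-pair selection concrete, here is a small sketch with made-up distances (standing in for distance_matrix) that reproduces the selection logic above:

import torch

# toy batch of 6 sentence pairs: label 1 = positive pair, label 0 = negative pair
distances = torch.tensor([0.2, 0.9, 0.4, 0.3, 0.5, 0.8])
labels = torch.tensor([1, 1, 1, 0, 0, 0])

poss = distances[labels == 1]  # positive-pair distances: 0.2, 0.9, 0.4
negs = distances[labels == 0]  # negative-pair distances: 0.3, 0.5, 0.8

hard_negatives = negs[negs < poss.max()]  # negatives closer than the farthest positive: 0.3, 0.5, 0.8
hard_positives = poss[poss > negs.min()]  # positives farther than the closest negative: 0.9, 0.4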

Input type: [(sentence1, label1), (sentence2, label2), ...], same label means small distance

BatchAllTripletLoss

Loss criteria:

  • Sentences in the batch with the same label belong to the same class and should be close to each other;
  • Sentences in the batch with different labels belong to different classes and should be far apart;
  • For any anchor sample, its distance to samples with the same label should be smaller than its distance to samples with a different label;

For example, for four samples [(a, label1), (b, label1), (c, label2), (d, label2)], pairwise_dist is [[aa, ab, ac, ad], ..., [da, db, dc, dd]]. With a as the anchor, ab is a positive-pair distance and ac is a negative-pair distance, and one term of the loss is ab - ac + margin.

The loss grows as positive-pair distances grow and negative-pair distances shrink. Triplets whose negative/positive distance gap already exceeds the margin, i.e. ab - ac + margin < 0, are ignored: such triplets are easy to separate and would contribute little to the loss.

def batch_all_triplet_loss(self, labels: Tensor, embeddings: Tensor) -> Tensor:
    # Get the pairwise distance matrix
    pairwise_dist = self.distance_metric(embeddings)

    anchor_positive_dist = pairwise_dist.unsqueeze(2)
    anchor_negative_dist = pairwise_dist.unsqueeze(1)

    # Compute a 3D tensor of size (batch_size, batch_size, batch_size)
    # triplet_loss[i, j, k] will contain the triplet loss of anchor=i, positive=j, negative=k
    # Uses broadcasting where the 1st argument has shape (batch_size, batch_size, 1)
    # and the 2nd (batch_size, 1, batch_size)
    triplet_loss = anchor_positive_dist - anchor_negative_dist + self.triplet_margin

    # Put to zero the invalid triplets
    # (where label(a) != label(p) or label(n) == label(a) or a == p)
    mask = BatchHardTripletLoss.get_triplet_mask(labels)
    triplet_loss = mask.float() * triplet_loss

    # Remove negative losses (i.e. the easy triplets)
    triplet_loss[triplet_loss < 0] = 0

    # Count number of positive triplets (where triplet_loss > 0)
    valid_triplets = triplet_loss[triplet_loss > 1e-16]
    num_positive_triplets = valid_triplets.size(0)
    # num_valid_triplets = mask.sum()
    # fraction_positive_triplets = num_positive_triplets / (num_valid_triplets.float() + 1e-16)

    # Get final mean triplet loss over the positive valid triplets
    triplet_loss = triplet_loss.sum() / (num_positive_triplets + 1e-16)
    return triplet_loss
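A minimal usage sketch for this input type (one sentence column plus an integer class label; the sentences and labels are illustrative). The same dataset format also works for the two variants below:

from datasets import Dataset
from sentence_transformers import SentenceTransformer, losses

model = SentenceTransformer("microsoft/mpnet-base")
# sentences with the same label should end up close, sentences with different labels far apart
train_dataset = Dataset.from_dict({
    "sentence": [
        "How do I reset my password?",
        "I forgot my password, what now?",
        "What is the shipping cost?",
        "How much does delivery cost?",
    ],
    "label": [0, 0, 1, 1],
})
loss = losses.BatchAllTripletLoss(model)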

BatchHardSoftMarginTripletLoss

For every anchor in the batch, even the largest distance to a sample with the same label should be smaller than the smallest distance to a sample with a different label: the farthest same-class sample must still be closer than the nearest out-of-class sample.

A soft margin is used: loss = log(1 + exp(d(a, p) - d(a, n))). The loss changes fastest when the positive and negative distances are close, which makes optimization easy; when the positive-pair distance is much smaller than the negative-pair distance, the loss approaches 0.

def batch_hard_triplet_soft_margin_loss(self, labels: Tensor, embeddings: Tensor) -> Tensor:
    # Get the pairwise distance matrix
    pairwise_dist = self.distance_metric(embeddings)

    # For each anchor, get the hardest positive
    # First, we need to get a mask for every valid positive (they should have same label)
    mask_anchor_positive = BatchHardTripletLoss.get_anchor_positive_triplet_mask(labels).float()

    # We put to 0 any element where (a, p) is not valid (valid if a != p and label(a) == label(p))
    anchor_positive_dist = mask_anchor_positive * pairwise_dist

    # shape (batch_size, 1)
    hardest_positive_dist, _ = anchor_positive_dist.max(1, keepdim=True)

    # For each anchor, get the hardest negative
    # First, we need to get a mask for every valid negative (they should have different labels)
    mask_anchor_negative = BatchHardTripletLoss.get_anchor_negative_triplet_mask(labels).float()

    # We add the maximum value in each row to the invalid negatives (label(a) == label(n))
    max_anchor_negative_dist, _ = pairwise_dist.max(1, keepdim=True)
    anchor_negative_dist = pairwise_dist + max_anchor_negative_dist * (1.0 - mask_anchor_negative)

    # shape (batch_size,)
    hardest_negative_dist, _ = anchor_negative_dist.min(1, keepdim=True)

    # Combine biggest d(a, p) and smallest d(a, n) into final triplet loss with soft margin
    # tl = hardest_positive_dist - hardest_negative_dist + margin
    # tl[tl < 0] = 0
    tl = torch.log1p(torch.exp(hardest_positive_dist - hardest_negative_dist))
    triplet_loss = tl.mean()
    return triplet_loss

BatchHardTripletLoss

Unlike BatchHardSoftMarginTripletLoss, the margin is set manually: loss = d(a, p) - d(a, n) + margin with loss[loss < 0] = 0, so triplets where the negative distance already exceeds the positive distance by more than the margin are ignored. A sketch of the differing lines is given below.
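For reference, a sketch of the lines that differ from the soft-margin code above (hardest_positive_dist and hardest_negative_dist are computed exactly as before; the default triplet_margin in sentence-transformers is 5):

# same hardest_positive_dist / hardest_negative_dist as in the soft-margin version,
# but with an explicit margin instead of log1p(exp(...))
tl = hardest_positive_dist - hardest_negative_dist + self.triplet_margin
tl = F.relu(tl)  # equivalent to tl[tl < 0] = 0: drop the easy triplets
triplet_loss = tl.mean()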


Input type: [(sentence1, sentence2, score), ...], regress the score of a sentence pair (between 0 and 1)

CosineSimilarityLoss (similarity regression)

Computes the cosine similarity between the two sentences of each pair and applies an MSE loss against the label score. cos_score_transformation is a no-op by default, and loss_fct defaults to MSE.

def forward(self, sentence_features: Iterable[dict[str, Tensor]], labels: Tensor) -> Tensor:
    embeddings = [self.model(sentence_feature)["sentence_embedding"] for sentence_feature in sentence_features]
    output = self.cos_score_transformation(torch.cosine_similarity(embeddings[0], embeddings[1]))
    return self.loss_fct(output, labels.float().view(-1))
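A minimal usage sketch for this input type (sentence pairs and scores are illustrative; the float score in [0, 1] is regressed directly against the cosine similarity):

from datasets import Dataset
from sentence_transformers import SentenceTransformer, losses

model = SentenceTransformer("microsoft/mpnet-base")
train_dataset = Dataset.from_dict({
    "sentence1": ["A man is eating food.", "A man is eating food."],
    "sentence2": ["A man is eating a piece of bread.", "A plane is taking off."],
    "score": [0.9, 0.1],
})
loss = losses.CosineSimilarityLoss(model)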

CoSENTLoss (similarity regression and ranking tasks)

Cosine Sentence Loss. For the underlying idea, see the Kexue.fm (科学空间) post "CoSENT(一):比Sentence-BERT更有效的句向量方案" ("CoSENT (I): a more effective sentence-embedding scheme than Sentence-BERT").

Loss: for sentence pairs (i, j) and (k, l), if label[i,j] < label[k,l], the model's predicted similarities should satisfy scores[i,j] < scores[k,l]. The loss is defined as loss = log(1 + Σ exp(s[i,j] - s[k,l])), where the sum runs over all pair combinations with label[i,j] < label[k,l]; in other words, the similarity score of (i, j) is pushed below that of (k, l).

Similarity metric: a cosine similarity score, where 1 means similar and 0 means dissimilar. Note that this is a similarity score rather than a distance. After training, the cosine similarity between vectors reflects semantic similarity, which makes the loss suitable for sentence-similarity regression and ranking tasks.

For example, with 3 sample pairs in a batch, numbered 1, 2 and 3, and ground-truth labels (0.1, 0.7, 0.9), the pair combinations (1, 2), (1, 3), (2, 3) enter the loss. If the predicted scores are (0.3, 0.4, 0.2), the score differences are (-0.1, 0.1, 0.2); a positive difference means the lower-labeled pair was scored higher, which increases the loss.

def forward(self, sentence_features: Iterable[dict[str, Tensor]], labels: Tensor) -> Tensor:
    embeddings = [self.model(sentence_feature)["sentence_embedding"] for sentence_feature in sentence_features]
    scores = self.similarity_fct(embeddings[0], embeddings[1])
    scores = scores * self.scale
    scores = scores[:, None] - scores[None, :]

    # label matrix indicating which pairs are relevant
    labels = labels[:, None] < labels[None, :]
    labels = labels.float()

    # mask out irrelevant pairs so they are negligible after exp()
    scores = scores - (1 - labels) * 1e12

    # append a zero as e^0 = 1
    scores = torch.cat((torch.zeros(1).to(scores.device), scores.view(-1)), dim=0)
    loss = torch.logsumexp(scores, dim=0)
    return loss
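Plugging the toy batch above into the same computation (with precomputed similarity scores in place of embeddings, and the default scale of 20) gives a quick sanity check; the loss is dominated by the (2, 3) violation:

import torch

scale = 20.0                                     # default scale in CoSENTLoss
labels = torch.tensor([0.1, 0.7, 0.9])           # ground-truth scores of pairs 1, 2, 3
scores = torch.tensor([0.3, 0.4, 0.2]) * scale   # predicted cosine similarities, already scaled

diff = scores[:, None] - scores[None, :]                   # s_i - s_j for every combination of pairs
relevant = (labels[:, None] < labels[None, :]).float()     # keep only combinations where label_i < label_j
diff = diff - (1 - relevant) * 1e12                        # mask out irrelevant combinations
diff = torch.cat((torch.zeros(1), diff.view(-1)), dim=0)   # prepend 0 so the "+1" term is included
print(torch.logsumexp(diff, dim=0))                        # ~4.1, mostly from s2 > s3 despite label2 < label3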

Input type: [(sentence1, sentence2, label), ...], multi-class classification of sentence pairs

SoftmaxLoss

A Siamese network performing multi-class classification over sentence pairs.

model = SentenceTransformer("microsoft/mpnet-base")
train_dataset = Dataset.from_dict({
    "sentence1": [
        "A person on a horse jumps over a broken down airplane.",
        "A person on a horse jumps over a broken down airplane.",
        "A person on a horse jumps over a broken down airplane.",
        "Children smiling and waving at camera",
    ],
    "sentence2": [
        "A person is training his horse for a competition.",
        "A person is at a diner, ordering an omelette.",
        "A person is outdoors, on a horse.",
        "There are children present.",
    ],
    "label": [1, 2, 0, 0],
})
loss = losses.SoftmaxLoss(model, model.get_sentence_embedding_dimension(), num_labels=3)
def forward(self, sentence_features: Iterable[dict[str, Tensor]], labels: Tensor) -> Tensor | tuple[Tensor, Tensor]:
    reps = [self.model(sentence_feature)["sentence_embedding"] for sentence_feature in sentence_features]
    rep_a, rep_b = reps

    vectors_concat = []
    if self.concatenation_sent_rep:
        vectors_concat.append(rep_a)
        vectors_concat.append(rep_b)

    if self.concatenation_sent_difference:
        vectors_concat.append(torch.abs(rep_a - rep_b))

    if self.concatenation_sent_multiplication:
        vectors_concat.append(rep_a * rep_b)

    features = torch.cat(vectors_concat, 1)
    output = self.classifier(features)

    if labels is not None:
        loss = self.loss_fct(output, labels.view(-1))
        return loss
    else:
        return reps, output

Input type: [(anchor, positive, negative), ...], triplet input

TripletLoss

The anchor-negative distance should exceed the anchor-positive distance by at least margin; that is, triplets with dist(anchor, neg) - dist(anchor, pos) < margin are penalized. distance_metric defaults to Euclidean distance, and margin defaults to 5.

def forward(self, sentence_features: Iterable[dict[str, Tensor]], labels: Tensor) -> Tensor:
    reps = [self.model(sentence_feature)["sentence_embedding"] for sentence_feature in sentence_features]
    rep_anchor, rep_pos, rep_neg = reps
    distance_pos = self.distance_metric(rep_anchor, rep_pos)
    distance_neg = self.distance_metric(rep_anchor, rep_neg)
    losses = F.relu(distance_pos - distance_neg + self.triplet_margin)
    return losses.mean()
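A minimal usage sketch for this input type (column names follow the anchor/positive/negative convention; the sentences are illustrative):

from datasets import Dataset
from sentence_transformers import SentenceTransformer, losses

model = SentenceTransformer("microsoft/mpnet-base")
train_dataset = Dataset.from_dict({
    "anchor": ["How do I reset my password?"],
    "positive": ["I forgot my password, what now?"],
    "negative": ["What is the shipping cost?"],
})
loss = losses.TripletLoss(model)  # Euclidean distance and triplet_margin=5 by default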

MultipleNegativesRankingLoss / InfoNCELoss

Each anchor sample comes with one positive and several negatives. The similarities between the anchor and the positive and negatives are computed and fed into a softmax multi-class classification, which increases the anchor-positive similarity and decreases the anchor-negative similarities.

Equivalent to the InfoNCE loss, which applies temperature scaling to the scores before the softmax. In MultipleNegativesRankingLoss, the temperature corresponds to the scale parameter; with scale = 1 it reduces to plain cross-entropy over the in-batch labels.

def forward(self, sentence_features: Iterable[dict[str, Tensor]], labels: Tensor) -> Tensor:
    # Compute the embeddings and distribute them to anchor and candidates (positive and optionally negatives)
    embeddings = [self.model(sentence_feature)["sentence_embedding"] for sentence_feature in sentence_features]
    anchors = embeddings[0]  # (batch_size, embedding_dim)
    candidates = torch.cat(embeddings[1:])  # (batch_size * (1 + num_negatives), embedding_dim)

    # For every anchor, we compute the similarity to all other candidates (positives and negatives),
    # also from other anchors. This gives us a lot of in-batch negatives.
    scores = self.similarity_fct(anchors, candidates) * self.scale
    # (batch_size, batch_size * (1 + num_negatives))

    # anchor[i] should be most similar to candidates[i], as that is the paired positive,
    # so the label for anchor[i] is i
    range_labels = torch.arange(0, scores.size(0), device=scores.device)
    return self.cross_entropy_loss(scores, range_labels)
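To see the in-batch-negative construction in isolation, here is a small sketch with random vectors standing in for the anchor and positive embeddings of a batch of 3 pairs:

import torch
import torch.nn.functional as F
from sentence_transformers import util

# 3 anchors and their 3 paired positives (random vectors stand in for real embeddings)
anchors = F.normalize(torch.randn(3, 8), dim=-1)
candidates = F.normalize(torch.randn(3, 8), dim=-1)

scale = 20.0                                         # default scale; 1/scale plays the role of the temperature
scores = util.cos_sim(anchors, candidates) * scale   # (3, 3): every other positive acts as an in-batch negative
labels = torch.arange(scores.size(0))                # anchor[i] is paired with candidates[i]
loss = F.cross_entropy(scores, labels)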

CachedMultipleNegativesRankingLoss

An optimized version of MultipleNegativesRankingLoss: the batch is split into several mini-batches and gradients are cached, which avoids OOM while keeping the large effective batch size.
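A minimal usage sketch (mini_batch_size controls the chunk size; the value below is illustrative and only affects memory, not the result):

from sentence_transformers import SentenceTransformer, losses

model = SentenceTransformer("microsoft/mpnet-base")
# behaves like MultipleNegativesRankingLoss, but processes the batch in chunks of 32
# with cached gradients, so much larger batch sizes fit in memory
loss = losses.CachedMultipleNegativesRankingLoss(model, mini_batch_size=32)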


Input type: [(anchor, positive), ...], positive pairs only

MultipleNegativesRankingLoss can be used here: the positives of the other pairs in the batch act as the anchor's negatives, and a softmax classification is performed. The more samples there are in the batch, the harder the classification and the better the expected result. See the sketch below.
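A minimal usage sketch for the pairs-only case (sentences are illustrative; no negative column is needed because the other in-batch positives serve as negatives):

from datasets import Dataset
from sentence_transformers import SentenceTransformer, losses

model = SentenceTransformer("microsoft/mpnet-base")
train_dataset = Dataset.from_dict({
    "anchor": ["How do I reset my password?", "What is the shipping cost?"],
    "positive": ["I forgot my password, what now?", "How much does delivery cost?"],
})
loss = losses.MultipleNegativesRankingLoss(model)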


Input type: [sentence1, sentence2, ...], unlabeled input

No labels are needed.

The ContrastiveTensionLossInBatchNegatives loss can be used: each sentence is passed through the model twice (the network contains random operations such as dropout), and the objective is to bring the two forward passes of the same sentence close while keeping different sentences apart.
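A minimal usage sketch, assuming the loss takes duplicated (sentence, sentence) pairs as input as described above (sentences are illustrative):

from datasets import Dataset
from sentence_transformers import SentenceTransformer, losses

sentences = ["The cat sits on the mat.", "A plane is taking off."]
# each example feeds the same sentence twice; dropout makes the two embeddings differ slightly
train_dataset = Dataset.from_dict({"sentence1": sentences, "sentence2": sentences})

model = SentenceTransformer("microsoft/mpnet-base")
loss = losses.ContrastiveTensionLossInBatchNegatives(model)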
