
【论文速递】2025年09周 (Robotics/Embodied AI/LLM)

目录

  • LLM-Microscope:揭示标点符号在Transformer上下文记忆中的隐藏作用
    • 英文摘要
    • 中文摘要
  • SurveyX:通过大型语言模型实现学术调查自动化
    • 英文摘要
    • 中文摘要
  • 数学推理的自我奖励校正
    • 英文摘要
    • 中文摘要
  • VideoGrain:调节时空注意力实现多粒度视频编辑
    • 英文摘要
    • 中文摘要
  • SWE-RL:基于开源软件演化的强化学习提升LLM推理能力
    • 英文摘要
    • 中文摘要
  • OmniAlign-V:增强MLLM与人类偏好的对齐
    • 英文摘要
    • 中文摘要
  • 长语境大型语言模型研究
    • 英文摘要
    • 中文摘要
  • Slamming:在一天内使用单个GPU训练语音语言模型
    • 英文摘要
    • 中文摘要
  • GHOST 2.0:生成式高保真单次头部迁移
    • 英文摘要
    • 中文摘要
  • Kanana:计算高效的双语语言模型
    • 英文摘要
    • 中文摘要
  • MedVLM-R1:通过强化学习激励视觉语言模型(VLM)的医学推理能力
    • 英文摘要
    • 中文摘要
  • SpargeAttn:可加速任意模型推理的精确稀疏注意力
    • 英文摘要
    • 中文摘要
  • DICEPTION:视觉感知任务的通用扩散模型
    • 英文摘要
    • 中文摘要
  • 定理解释代理:面向大语言模型定理理解的多模态解释
    • 英文摘要
    • 中文摘要
  • 迈向AI联合科学家
    • 英文摘要
    • 中文摘要
  • R2-T2:多模态专家混合模型的测试时动态路由
    • 英文摘要
    • 中文摘要
  • Mol-LLaMA:基于大模型的分子通用理解框架
    • 英文摘要
    • 中文摘要
  • PhotoDoodle:从少数成对数据中学习艺术图像编辑
    • 英文摘要
    • 中文摘要
  • MaskGWM:基于视频掩码重建的通用驾驶世界模型
    • 英文摘要
    • 中文摘要
  • NeoBERT:下一代 BERT
    • 英文摘要
    • 中文摘要
  • LongRoPE2:近乎无损的LLM上下文窗口扩展
    • 英文摘要
    • 中文摘要
  • Audio-FLAN:初步版本
    • 英文摘要
    • 中文摘要
  • ART:可变多层透明图像生成的匿名区域Transformer
    • 英文摘要
    • 中文摘要
  • KV-Edit:面向精确背景保留的免训练图像编辑
    • 英文摘要
    • 中文摘要
  • Plutus:低资源希腊金融中的大型语言模型的基准测试
    • 英文摘要
    • 中文摘要
  • 语言模型的事实性取决于提问语言
    • 英文摘要
    • 中文摘要
  • SIFT:通过情境贴纸夯实大语言模型的推理基础
    • 英文摘要
    • 中文摘要

LLM-Microscope:揭示标点符号在Transformer上下文记忆中的隐藏作用

  • 标题: LLM-Microscope: Uncovering the Hidden Role of Punctuation in Context Memory of Transformers

  • 作者: Anton Razzhigaev, Matvey Mikhalchuk, Temurbek Rahmatullaev, Elizaveta Goncharova, Polina Druzhinina, Ivan Oseledets, Andrey Kuznetsov

  • 日期: 2025-02-20

  • ArXiv主页: https://arxiv.org/abs/2502.15007

  • 论文链接: https://arxiv.org/pdf/2502.15007

  • gitHub仓库: https://github.com/AIRI-Institute/LLM-Microscope

英文摘要

We introduce methods to quantify how Large Language Models (LLMs) encode and store contextual information, revealing that tokens often seen as minor (e.g., determiners, punctuation) carry surprisingly high context. Notably, removing these tokens – especially stopwords, articles, and commas – consistently degrades performance on MMLU and BABILong-4k, even if removing only irrelevant tokens. Our analysis also shows a strong correlation between contextualization and linearity, where linearity measures how closely the transformation from one layer’s embeddings to the next can be approximated by a single linear mapping. These findings underscore the hidden importance of filler tokens in maintaining context. For further exploration, we present LLM-Microscope, an open-source toolkit that assesses token-level nonlinearity, evaluates contextual memory, visualizes intermediate layer contributions (via an adapted Logit Lens), and measures the intrinsic dimensionality of representations. This toolkit illuminates how seemingly trivial tokens can be critical for long-range understanding.

中文摘要

我们提出了一套量化大型语言模型(LLM)如何编码和存储上下文信息的方法,结果表明那些通常被视为次要的词元(例如限定词、标点符号)承载着出乎意料多的上下文信息。值得注意的是,删除这些词元(尤其是停用词、冠词和逗号)会持续降低模型在MMLU和BABILong-4k上的表现,即使只删除无关词元也是如此。我们的分析还显示上下文化程度与线性度之间存在很强的相关性,其中线性度衡量从某一层嵌入到下一层嵌入的变换能在多大程度上被单个线性映射近似。这些发现凸显了填充词元在维持上下文方面被忽视的重要性。为便于进一步探索,我们发布了开源工具包LLM-Microscope,它可以评估词元级非线性、评价上下文记忆、可视化中间层贡献(基于改造的Logit Lens),并测量表示的内在维度。该工具包揭示了看似无关紧要的词元为何对长程理解至关重要。
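为了更直观地理解摘要中的“线性度”,即相邻两层嵌入之间的变换能在多大程度上被单个线性映射近似,下面给出一个示意性的Python片段;函数与随机数据均为假设性示例,并非LLM-Microscope工具包的官方实现:

```python
import numpy as np

def linearity_score(h_prev: np.ndarray, h_next: np.ndarray) -> float:
    """估计从上一层嵌入 h_prev 到下一层嵌入 h_next 的变换
    能被单个线性映射近似的程度(示意性实现)。
    h_prev, h_next: 形状为 (num_tokens, hidden_dim) 的矩阵;返回值越接近 1 越“线性”。"""
    # 最小二乘求解线性映射 W,使得 h_prev @ W ≈ h_next
    W, *_ = np.linalg.lstsq(h_prev, h_next, rcond=None)
    resid = np.linalg.norm(h_next - h_prev @ W)
    total = np.linalg.norm(h_next - h_next.mean(axis=0))
    return 1.0 - resid / total   # 1 减去相对残差

# 随机数据仅用于演示接口
h1 = np.random.randn(128, 768)
h2 = h1 @ np.random.randn(768, 768) + 0.01 * np.random.randn(128, 768)
print(round(linearity_score(h1, h2), 3))   # 接近 1 表示该层变换近似线性
```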


SurveyX:通过大型语言模型实现学术调查自动化

  • 标题: SurveyX: Academic Survey Automation via Large Language Models

  • 作者: Xun Liang, Jiawei Yang, Yezhaohui Wang, Chen Tang, Zifan Zheng, Simin Niu, Shichao Song, Hanyu Wang, Bo Tang, Feiyu Xiong, Keming Mao, Zhiyu li

  • 日期: 2025-02-20

  • ArXiv主页: https://arxiv.org/abs/2502.14776

  • 论文链接: https://arxiv.org/pdf/2502.14776

  • gitHub仓库: https://github.com/IAAR-Shanghai/SurveyX

英文摘要

Large Language Models (LLMs) have demonstrated exceptional comprehension capabilities and a vast knowledge base, suggesting that LLMs can serve as efficient tools for automated survey generation. However, recent research related to automated survey generation remains constrained by some critical limitations like finite context window, lack of in-depth content discussion, and absence of systematic evaluation frameworks. Inspired by human writing processes, we propose SurveyX, an efficient and organized system for automated survey generation that decomposes the survey composing process into two phases: the Preparation and Generation phases. By innovatively introducing online reference retrieval, a pre-processing method called Attribute Tree, and a re-polishing process, SurveyX significantly enhances the efficacy of survey composition. Experimental evaluation results show that SurveyX outperforms existing automated survey generation systems in content quality (0.259 improvement) and citation quality (1.76 enhancement), approaching human expert performance across multiple evaluation dimensions. Examples of surveys generated by SurveyX are available on www.surveyx.cn

中文摘要

大型语言模型(LLM)展现出卓越的理解能力和庞大的知识储备,这表明LLM可以作为自动生成学术综述的高效工具。然而,近期与自动综述生成相关的研究仍受制于一些关键局限,例如上下文窗口有限、缺乏深入的内容讨论,以及缺少系统化的评估框架。受人类写作流程的启发,我们提出SurveyX,一个高效且组织良好的自动综述生成系统,它将综述撰写过程分解为准备和生成两个阶段。通过创新性地引入在线文献检索、一种称为属性树(Attribute Tree)的预处理方法以及重新润色流程,SurveyX显著提升了综述撰写的效果。实验评估结果表明,SurveyX在内容质量(提升0.259)和引用质量(提升1.76)上均优于现有的自动综述生成系统,并在多个评估维度上接近人类专家水平。SurveyX生成的综述示例可在 www.surveyx.cn 查看。


数学推理的自我奖励校正

  • 标题: Self-rewarding correction for mathematical reasoning
  • 作者: Wei Xiong, Hanning Zhang, Chenlu Ye, Lichang Chen, Nan Jiang, Tong Zhang
  • 日期: 2025-02-26
  • ArXiv主页: https://arxiv.org/abs/2502.19613
  • 论文链接: https://arxiv.org/pdf/2502.19613

英文摘要

We study self-rewarding reasoning large language models (LLMs), which can simultaneously generate step-by-step reasoning and evaluate the correctness of their outputs during the inference time-without external feedback. This integrated approach allows a single model to independently guide its reasoning process, offering computational advantages for model deployment. We particularly focus on the representative task of self-correction, where models autonomously detect errors in their responses, revise outputs, and decide when to terminate iterative refinement loops. To enable this, we propose a two-staged algorithmic framework for constructing self-rewarding reasoning models using only self-generated data. In the first stage, we employ sequential rejection sampling to synthesize long chain-of-thought trajectories that incorporate both self-rewarding and self-correction mechanisms. Fine-tuning models on these curated data allows them to learn the patterns of self-rewarding and self-correction. In the second stage, we further enhance the models’ ability to assess response accuracy and refine outputs through reinforcement learning with rule-based signals. Experiments with Llama-3 and Qwen-2.5 demonstrate that our approach surpasses intrinsic self-correction capabilities and achieves performance comparable to systems that rely on external reward models.

中文摘要

我们研究具备自我奖励推理能力的大语言模型(LLM):它们能够在推理阶段一边生成逐步推理、一边评估自身输出的正确性,而无需外部反馈。这种一体化的方式让单个模型即可独立引导其推理过程,为模型部署带来计算上的优势。我们特别关注具有代表性的自我纠正任务,即模型自主检测回答中的错误、修改输出,并决定何时终止迭代改进循环。为此,我们提出了一个两阶段的算法框架,仅使用模型自生成的数据来构建自我奖励推理模型。第一阶段,我们采用顺序拒绝采样来合成同时包含自我奖励与自我纠正机制的长思维链轨迹;在这些精选数据上微调,使模型学会自我奖励与自我纠正的模式。第二阶段,我们通过基于规则信号的强化学习,进一步增强模型评估回答准确性和改进输出的能力。在Llama-3和Qwen-2.5上的实验表明,我们的方法超越了模型固有的自我纠正能力,并达到了与依赖外部奖励模型的系统相当的性能。
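下面用一个极简的Python片段示意摘要所述的推理流程:模型生成答案、自我评估其正确性、在判定出错时自我修正,并决定何时终止迭代。其中 generate_solution、self_evaluate、revise 均为假设性的桩函数,仅用于展示控制流,并非论文实现:

```python
def generate_solution(question: str) -> str:
    """桩函数:代表模型生成的逐步推理与最终答案。"""
    return "逐步推理…… 最终答案: 42"

def self_evaluate(question: str, solution: str) -> bool:
    """桩函数:代表模型对自身答案的自我奖励式评估(True 表示判定为正确)。"""
    return solution.endswith("42")

def revise(question: str, solution: str) -> str:
    """桩函数:代表模型在判定出错后对答案进行修正。"""
    return solution + "(修正后)"

def self_rewarding_correction(question: str, max_rounds: int = 3) -> str:
    solution = generate_solution(question)
    for _ in range(max_rounds):
        if self_evaluate(question, solution):   # 模型自评为正确,终止迭代
            break
        solution = revise(question, solution)   # 否则自我修正,进入下一轮
    return solution

print(self_rewarding_correction("6 * 7 = ?"))
```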


VideoGrain:调节时空注意力实现多粒度视频编辑

  • 标题: VideoGrain: Modulating Space-Time Attention for Multi-grained Video Editing
  • 作者: Xiangpeng Yang, Linchao Zhu, Hehe Fan, Yi Yang
  • 日期: 2025-02-24
  • ArXiv主页: https://arxiv.org/abs/2502.17258
  • 论文链接: https://arxiv.org/pdf/2502.17258
  • 项目链接: https://knightyxp.github.io/VideoGrain_project_page
  • gitHub仓库: https://github.com/knightyxp/VideoGrain

英文摘要

Recent advancements in diffusion models have significantly improved video generation and editing capabilities. However, multi-grained video editing, which encompasses class-level, instance-level, and part-level modifications, remains a formidable challenge. The major difficulties in multi-grained editing include semantic misalignment of text-to-region control and feature coupling within the diffusion model. To address these difficulties, we present VideoGrain, a zero-shot approach that modulates space-time (cross- and self-) attention mechanisms to achieve fine-grained control over video content. We enhance text-to-region control by amplifying each local prompt’s attention to its corresponding spatial-disentangled region while minimizing interactions with irrelevant areas in cross-attention. Additionally, we improve feature separation by increasing intra-region awareness and reducing inter-region interference in self-attention. Extensive experiments demonstrate our method achieves state-of-the-art performance in real-world scenarios. Our code, data, and demos are available at https://knightyxp.github.io/VideoGrain_project_page/

中文摘要

扩散模型的最新进展大幅提升了视频生成与编辑能力。然而,涵盖类别级、实例级和部件级修改的多粒度视频编辑仍是一项艰巨挑战。多粒度编辑的主要困难在于文本到区域控制的语义错位,以及扩散模型内部的特征耦合。为了解决这些困难,我们提出VideoGrain,一种零样本方法,通过调节时空(交叉与自)注意力机制来实现对视频内容的细粒度控制。我们在交叉注意力中放大每个局部提示词对其对应空间解耦区域的注意力,同时尽量减少与无关区域的交互,从而增强文本到区域的控制;此外,我们在自注意力中提高区域内感知并抑制区域间干扰,以改进特征分离。大量实验表明,我们的方法在真实场景中达到了最先进的性能。代码、数据和演示见 https://knightyxp.github.io/VideoGrain_project_page/
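下面的示意性片段粗略说明“放大局部提示词对其对应区域的交叉注意力、抑制与无关区域的交互”这一思路;区域划分方式与偏置数值均为假设,实际机制以论文为准:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def modulate_cross_attention(logits, region_of_pixel, region_of_token, pos=3.0, neg=-3.0):
    """logits: (num_pixels, num_text_tokens) 的交叉注意力打分;
    region_of_pixel[i] / region_of_token[j] 给出像素 / 文本词元所属的区域编号。
    同一区域的 (像素, 词元) 对加正偏置,其余加负偏置,再做 softmax。"""
    same = region_of_pixel[:, None] == region_of_token[None, :]
    bias = np.where(same, pos, neg)
    return softmax(logits + bias, axis=-1)

# 演示:4 个像素、3 个文本词元、2 个区域
logits = np.random.randn(4, 3)
attn = modulate_cross_attention(logits, np.array([0, 0, 1, 1]), np.array([0, 1, 1]))
print(attn.round(2))
```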


SWE-RL:基于开源软件演化的强化学习提升LLM推理能力

  • 标题: SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution

  • 作者: Yuxiang Wei, Olivier Duchenne, Jade Copet, Quentin Carbonneaux, Lingming Zhang, Daniel Fried, Gabriel Synnaeve, Rishabh Singh, Sida I. Wang

  • 日期: 2025-02-25

  • ArXiv主页: https://arxiv.org/abs/2502.18449

  • 论文链接: https://arxiv.org/pdf/2502.18449

  • gitHub仓库: https://github.com/facebookresearch/swe-rl

英文摘要

The recent DeepSeek-R1 release has demonstrated the immense potential of reinforcement learning (RL) in enhancing the general reasoning capabilities of large language models (LLMs). While DeepSeek-R1 and other follow-up work primarily focus on applying RL to competitive coding and math problems, this paper introduces SWE-RL, the first approach to scale RL-based LLM reasoning for real-world software engineering. Leveraging a lightweight rule-based reward (e.g., the similarity score between ground-truth and LLM-generated solutions), SWE-RL enables LLMs to autonomously recover a developer’s reasoning processes and solutions by learning from extensive open-source software evolution data – the record of a software’s entire lifecycle, including its code snapshots, code changes, and events such as issues and pull requests. Trained on top of Llama 3, our resulting reasoning model, Llama3-SWE-RL-70B, achieves a 41.0% solve rate on SWE-bench Verified – a human-verified collection of real-world GitHub issues. To our knowledge, this is the best performance reported for medium-sized (<100B) LLMs to date, even comparable to leading proprietary LLMs like GPT-4o. Surprisingly, despite performing RL solely on software evolution data, Llama3-SWE-RL has even emerged with generalized reasoning skills. For example, it shows improved results on five out-of-domain tasks, namely, function coding, library use, code reasoning, mathematics, and general language understanding, whereas a supervised-finetuning baseline even leads to performance degradation on average. Overall, SWE-RL opens up a new direction to improve the reasoning capabilities of LLMs through reinforcement learning on massive software engineering data.

中文摘要

最近发布的 DeepSeek-R1 展示了强化学习(Reinforcement Learning, RL)在提升大语言模型(LLMs)通用推理能力方面的巨大潜力。虽然 DeepSeek-R1 及其后续工作主要集中在将强化学习应用于编程竞赛和数学问题,但本文提出了 SWE-RL —— 首个将基于强化学习的语言模型推理方法扩展到真实世界软件工程任务的框架。SWE-RL 利用一种轻量级的基于规则的奖励机制(例如,真实解法与模型生成解法之间的相似度评分),使大语言模型能够从大规模开源软件演化数据中自主学习开发者的推理过程与解决方案。这些演化数据记录了一个软件的完整生命周期,包括代码快照、代码变更,以及诸如 issue 和 pull request 等事件。我们在 Llama 3 的基础上进行训练,得到了最终的推理模型 Llama3-SWE-RL-70B,在 SWE-bench Verified 上达到了 41.0% 的解决率 —— 这是一个由人工验证的真实 GitHub 问题集合。据我们所知,这是迄今为止中等规模(<100B 参数)大语言模型中表现最好的结果,甚至可以媲美当前领先的闭源模型如 GPT-4o。令人惊讶的是,尽管 SWE-RL 仅在软件演化数据上进行了强化学习训练,Llama3-SWE-RL 却展现出了泛化的推理能力。例如,在五个跨领域的任务上,包括函数编程、库使用、代码推理、数学问题以及通用语言理解任务,该模型的表现均有提升;而相比之下,监督微调基线模型在平均表现上反而有所下降。总体而言,SWE-RL 开辟了一条新路径:即通过在海量软件工程数据上应用强化学习来持续提升大语言模型的推理能力。
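摘要中“轻量级规则奖励(如生成解与真实解之间的相似度得分)”可以用下面的示意代码来理解;这里用 difflib 的序列相似度代替论文中的具体相似度度量,仅作演示:

```python
import difflib

def rule_based_reward(generated_patch: str, oracle_patch: str) -> float:
    """用生成补丁与真实补丁之间的文本相似度作为奖励(0~1)。
    具体的相似度计算方式以论文实现为准,此处仅为示意。"""
    return difflib.SequenceMatcher(None, generated_patch, oracle_patch).ratio()

oracle = "-    return a + b\n+    return a - b\n"
candidate = "-    return a + b\n+    return a - b  # fix sign\n"
print(round(rule_based_reward(candidate, oracle), 3))
```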


OmniAlign-V:增强MLLM与人类偏好的对齐

  • 标题: OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference
  • 作者: Xiangyu Zhao, Shengyuan Ding, Zicheng Zhang, Haian Huang, Maosong Cao, Weiyun Wang, Jiaqi Wang, Xinyu Fang, Wenhai Wang, Guangtao Zhai, Haodong Duan, Hua Yang, Kai Chen
  • 日期: 2025-02-25
  • ArXiv主页: https://arxiv.org/abs/2502.18411
  • 论文链接: https://arxiv.org/pdf/2502.18411
  • 项目链接: https://phoenixz810.github.io/OmniAlign-V/
  • gitHub仓库: https://github.com/open-compass/VLMEvalKit

英文摘要

Recent advancements in open-source multi-modal large language models (MLLMs) have primarily focused on enhancing foundational capabilities, leaving a significant gap in human preference alignment. This paper introduces OmniAlign-V, a comprehensive dataset of 200K high-quality training samples featuring diverse images, complex questions, and varied response formats to improve MLLMs’ alignment with human preferences. We also present MM-AlignBench, a human-annotated benchmark specifically designed to evaluate MLLMs’ alignment with human values. Experimental results show that finetuning MLLMs with OmniAlign-V, using Supervised Fine-Tuning (SFT) or Direct Preference Optimization (DPO), significantly enhances human preference alignment while maintaining or enhancing performance on standard VQA benchmarks, preserving their fundamental capabilities. Our datasets, benchmark, code and checkpoints have been released at https://github.com/PhoenixZ810/OmniAlign-V.

中文摘要

开源多模态大语言模型(MLLM)的最新进展主要集中在增强基础能力上,在与人类偏好的对齐方面仍存在明显差距。本文提出OmniAlign-V,这是一个包含20万条高质量训练样本的综合数据集,涵盖多样的图像、复杂的问题和多种回答格式,用于改进MLLM与人类偏好的对齐。我们还提出了MM-AlignBench,一个专门用于评估MLLM与人类价值观对齐程度的人工标注基准。实验结果表明,无论采用监督微调(SFT)还是直接偏好优化(DPO),使用OmniAlign-V微调MLLM都能显著提升其与人类偏好的对齐程度,同时在标准VQA基准上保持甚至提升性能,保留其基础能力。我们的数据集、基准、代码和模型检查点已发布于 https://github.com/PhoenixZ810/OmniAlign-V。


长语境大型语言模型研究

  • 标题: Thus Spake Long-Context Large Language Model

  • 作者: Xiaoran Liu, Ruixiao Li, Mianqiu Huang, Zhigeng Liu, Yuerong Song, Qipeng Guo, Siyang He, Qiqi Wang, Linlin Li, Qun Liu, Yaqian Zhou, Xuanjing Huang, Xipeng Qiu

  • 日期: 2025-02-24

  • ArXiv主页: https://arxiv.org/abs/2502.17129

  • 论文链接: https://arxiv.org/pdf/2502.17129

  • gitHub仓库: https://github.com/OpenMOSS/Thus-Spake-Long-Context-LLM

英文摘要

Long context is an important topic in Natural Language Processing (NLP), running through the development of NLP architectures, and offers immense opportunities for Large Language Models (LLMs) giving LLMs the lifelong learning potential akin to humans. Unfortunately, the pursuit of a long context is accompanied by numerous obstacles. Nevertheless, long context remains a core competitive advantage for LLMs. In the past two years, the context length of LLMs has achieved a breakthrough extension to millions of tokens. Moreover, the research on long-context LLMs has expanded from length extrapolation to a comprehensive focus on architecture, infrastructure, training, and evaluation technologies. Inspired by the symphonic poem, Thus Spake Zarathustra, we draw an analogy between the journey of extending the context of LLM and the attempts of humans to transcend its mortality. In this survey, We will illustrate how LLM struggles between the tremendous need for a longer context and its equal need to accept the fact that it is ultimately finite. To achieve this, we give a global picture of the lifecycle of long-context LLMs from four perspectives: architecture, infrastructure, training, and evaluation, showcasing the full spectrum of long-context technologies. At the end of this survey, we will present 10 unanswered questions currently faced by long-context LLMs. We hope this survey can serve as a systematic introduction to the research on long-context LLMs.

中文摘要

长上下文是自然语言处理(NLP)中的重要课题,贯穿NLP体系结构的发展历程,并为大语言模型(LLM)带来巨大机遇,使其具备类似人类的终身学习潜力。遗憾的是,对长上下文的追求伴随着重重障碍;尽管如此,长上下文仍是LLM的核心竞争优势。过去两年里,LLM的上下文长度实现了突破性扩展,达到了数百万词元。此外,长上下文LLM的研究也已从长度外推扩展到对架构、基础设施、训练和评估技术的全面关注。受交响诗《查拉图斯特拉如是说》的启发,我们将扩展LLM上下文的历程类比为人类试图超越自身有限性的努力。在这篇综述中,我们将阐述LLM如何在对更长上下文的巨大需求与接受上下文终究有限这一事实之间权衡。为此,我们从架构、基础设施、训练和评估四个视角给出长上下文LLM全生命周期的全景图,展示长上下文技术的全貌。在综述的最后,我们列出了长上下文LLM当前面临的10个尚未解决的问题。希望这篇综述能够成为长上下文LLM研究的系统性导引。


Slamming:在一天内使用单个GPU训练语音语言模型

  • 标题: Slamming: Training a Speech Language Model on One GPU in a Day
  • 作者: Gallil Maimon, Avishai Elmakies, Yossi Adi
  • 日期: 2025-02-19
  • ArXiv主页: https://arxiv.org/abs/2502.15814
  • 论文链接: https://arxiv.org/pdf/2502.15814
  • 项目链接: https://pages.cs.huji.ac.il/adiyoss-lab/slamming/
  • gitHub仓库: https://github.com/slp-rl/slamkit

英文摘要

We introduce Slam, a recipe for training high-quality Speech Language Models (SLMs) on a single academic GPU in 24 hours. We do so through empirical analysis of model initialisation and architecture, synthetic training data, preference optimisation with synthetic data and tweaking all other components. We empirically demonstrate that this training recipe also scales well with more compute getting results on par with leading SLMs in a fraction of the compute cost. We hope these insights will make SLM training and research more accessible. In the context of SLM scaling laws, our results far outperform predicted compute optimal performance, giving an optimistic view to SLM feasibility. See code, data, models, samples at - https://pages.cs.huji.ac.il/adiyoss-lab/slamming .

中文摘要

我们提出Slam,一套在24小时内用单张学术级GPU训练高质量语音语言模型(SLM)的方案。我们通过对模型初始化与架构、合成训练数据、基于合成数据的偏好优化以及其他各组件调优的实证分析来实现这一目标。实验表明,这一训练方案在增加算力时也能良好扩展,只用一小部分计算成本即可获得与领先SLM相当的结果。希望这些见解能让SLM的训练与研究更加触手可及。在SLM缩放定律的背景下,我们的结果远超预测的计算最优性能,为SLM的可行性提供了乐观的前景。代码、数据、模型和样例见 https://pages.cs.huji.ac.il/adiyoss-lab/slamming 。


GHOST 2.0:生成式高保真单次头部迁移

  • 标题: GHOST 2.0: generative high-fidelity one shot transfer of heads
  • 作者: Alexander Groshev, Anastasiia Iashchenko, Pavel Paramonov, Denis Dimitrov, Andrey Kuznetsov
  • 日期: 2025-02-25
  • ArXiv主页: https://arxiv.org/abs/2502.18417
  • 论文链接: https://arxiv.org/pdf/2502.18417

英文摘要

While the task of face swapping has recently gained attention in the research community, a related problem of head swapping remains largely unexplored. In addition to skin color transfer, head swap poses extra challenges, such as the need to preserve structural information of the whole head during synthesis and inpaint gaps between swapped head and background. In this paper, we address these concerns with GHOST 2.0, which consists of two problem-specific modules. First, we introduce enhanced Aligner model for head reenactment, which preserves identity information at multiple scales and is robust to extreme pose variations. Secondly, we use a Blender module that seamlessly integrates the reenacted head into the target background by transferring skin color and inpainting mismatched regions. Both modules outperform the baselines on the corresponding tasks, allowing to achieve state of the art results in head swapping. We also tackle complex cases, such as large difference in hair styles of source and target. Code is available at https://github.com/ai-forever/ghost-2.0

中文摘要

尽管换脸任务近来在研究界受到关注,但与之相关的换头问题在很大程度上仍未被探索。除了肤色迁移之外,换头还带来额外挑战,例如在合成过程中需要保留整个头部的结构信息,并对换入头部与背景之间的空缺进行修补(inpainting)。本文提出GHOST 2.0来解决这些问题,它由两个针对具体子问题的模块组成。首先,我们引入增强的Aligner模型用于头部重演,它能在多个尺度上保留身份信息,并对极端姿态变化保持鲁棒。其次,我们使用Blender模块,通过迁移肤色并修补不匹配区域,将重演后的头部无缝融入目标背景。两个模块在各自任务上均优于基线,从而在换头任务上取得了最先进的结果。我们还处理了源图与目标图发型差异很大等复杂情形。代码见 https://github.com/ai-forever/ghost-2.0


Kanana:计算高效的双语语言模型

  • 标题: Kanana: Compute-efficient Bilingual Language Models

  • 作者: Kanana LLM Team, Yunju Bak, Hojin Lee, Minho Ryu, Jiyeon Ham, Seungjae Jung, Daniel Wontae Nam, Taegyeong Eo, Donghun Lee, Doohae Jung, Boseop Kim, Nayeon Kim, Jaesun Park, Hyunho Kim, Hyunwoong Ko, Changmin Lee, Kyoung-Woon On, Seulye Baeg, Junrae Cho, Sunghee Jung, Jieun Kang, EungGyun Kim, Eunhwa Kim, Byeongil Ko, Daniel Lee, Minchul Lee, Miok Lee, Shinbok Lee, Gaeun Seo

  • 日期: 2025-02-26

  • ArXiv主页: https://arxiv.org/abs/2502.18934

  • 论文链接: https://arxiv.org/pdf/2502.18934

  • gitHub仓库: https://github.com/kakao/kanana

英文摘要

We introduce Kanana, a series of bilingual language models that demonstrate exceeding performance in Korean and competitive performance in English. The computational cost of Kanana is significantly lower than that of state-of-the-art models of similar size. The report details the techniques employed during pre-training to achieve compute-efficient yet competitive models, including high quality data filtering, staged pre-training, depth up-scaling, and pruning and distillation. Furthermore, the report outlines the methodologies utilized during the post-training of the Kanana models, encompassing supervised fine-tuning and preference optimization, aimed at enhancing their capability for seamless interaction with users. Lastly, the report elaborates on plausible approaches used for language model adaptation to specific scenarios, such as embedding, retrieval augmented generation, and function calling. The Kanana model series spans from 2.1B to 32.5B parameters with 2.1B models (base, instruct, embedding) publicly released to promote research on Korean language models.

中文摘要

我们介绍Kanana,一系列双语语言模型,其在韩语上表现卓越,在英语上也具有竞争力。Kanana的计算成本显著低于同等规模的最先进模型。报告详细介绍了预训练阶段为获得计算高效且有竞争力的模型所采用的技术,包括高质量数据过滤、分阶段预训练、深度扩展(depth up-scaling)以及剪枝与蒸馏。此外,报告还概述了Kanana模型后训练阶段所用的方法,包括监督微调与偏好优化,旨在增强其与用户无缝交互的能力。最后,报告阐述了将语言模型适配到特定场景的可行方案,例如嵌入、检索增强生成和函数调用。Kanana系列模型的参数规模从2.1B到32.5B不等,其中2.1B规模的模型(base、instruct、embedding)已公开发布,以促进韩语语言模型研究。


MedVLM-R1:通过强化学习激励视觉语言模型(VLM)的医学推理能力

  • 标题: MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language Models (VLMs) via Reinforcement Learning
  • 作者: Jiazhen Pan, Che Liu, Junde Wu, Fenglin Liu, Jiayuan Zhu, Hongwei Bran Li, Chen Chen, Cheng Ouyang, Daniel Rueckert
  • 日期: 2025-02-26
  • ArXiv主页: https://arxiv.org/abs/2502.19634
  • 论文链接: https://arxiv.org/pdf/2502.19634

英文摘要

Reasoning is a critical frontier for advancing medical image analysis, where transparency and trustworthiness play a central role in both clinician trust and regulatory approval. Although Medical Visual Language Models (VLMs) show promise for radiological tasks, most existing VLMs merely produce final answers without revealing the underlying reasoning. To address this gap, we introduce MedVLM-R1, a medical VLM that explicitly generates natural language reasoning to enhance transparency and trustworthiness. Instead of relying on supervised fine-tuning (SFT), which often suffers from overfitting to training distributions and fails to foster genuine reasoning, MedVLM-R1 employs a reinforcement learning framework that incentivizes the model to discover human-interpretable reasoning paths without using any reasoning references. Despite limited training data (600 visual question answering samples) and model parameters (2B), MedVLM-R1 boosts accuracy from 55.11% to 78.22% across MRI, CT, and X-ray benchmarks, outperforming larger models trained on over a million samples. It also demonstrates robust domain generalization under out-of-distribution tasks. By unifying medical image analysis with explicit reasoning, MedVLM-R1 marks a pivotal step toward trustworthy and interpretable AI in clinical practice.

中文摘要

推理是推进医学图像分析的关键前沿,其透明性与可信性对临床医生的信任和监管审批都至关重要。尽管医学视觉语言模型(VLM)在放射学任务上展现出潜力,但大多数现有VLM只给出最终答案,而不揭示背后的推理过程。为填补这一空白,我们提出MedVLM-R1,一个显式生成自然语言推理以提升透明性与可信性的医学VLM。MedVLM-R1没有依赖监督微调(SFT),因为SFT往往过拟合训练分布、难以培养真正的推理能力;它转而采用强化学习框架,在不使用任何推理参考的情况下激励模型发现人类可理解的推理路径。尽管训练数据有限(600个视觉问答样本)、模型参数仅2B,MedVLM-R1仍将MRI、CT和X光基准上的准确率从55.11%提升至78.22%,超过了在超过一百万样本上训练的更大模型。它在分布外任务上也表现出稳健的领域泛化能力。通过将医学图像分析与显式推理相统一,MedVLM-R1朝着临床实践中可信、可解释的AI迈出了关键一步。
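摘要没有给出奖励函数的具体形式;下面按照常见的R1风格做法给出一个纯属假设的规则奖励示意(格式奖励加答案奖励),仅用于帮助理解“不依赖推理参考、只用规则信号做强化学习”的含义:

```python
import re

def format_reward(output: str) -> float:
    """若输出包含 <think>…</think> 与 <answer>…</answer> 结构则记 1 分(假设的格式奖励)。"""
    return 1.0 if re.search(r"<think>.+</think>\s*<answer>.+</answer>", output, re.S) else 0.0

def accuracy_reward(output: str, gold: str) -> float:
    """从 <answer> 标签中取出选项并与标准答案比对(假设任务为选择题)。"""
    m = re.search(r"<answer>\s*(.+?)\s*</answer>", output, re.S)
    return 1.0 if m and m.group(1).strip().upper() == gold.upper() else 0.0

out = "<think>左肺下叶可见实变影……</think><answer>B</answer>"
print(format_reward(out) + accuracy_reward(out, "B"))   # 总奖励
```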


SpargeAttn:可加速任意模型推理的精确稀疏注意力

  • 标题: SpargeAttn: Accurate Sparse Attention Accelerating Any Model Inference

  • 作者: Jintao Zhang, Chendong Xiang, Haofeng Huang, Jia Wei, Haocheng Xi, Jun Zhu, Jianfei Chen

  • 日期: 2025-02-25

  • ArXiv主页: https://arxiv.org/abs/2502.18137

  • 论文链接: https://arxiv.org/pdf/2502.18137

  • gitHub仓库: https://github.com/thu-ml/SpargeAttn

英文摘要

An efficient attention implementation is essential for large models due to its quadratic time complexity. Fortunately, attention commonly exhibits sparsity, i.e., many values in the attention map are near zero, allowing for the omission of corresponding computations. Many studies have utilized the sparse pattern to accelerate attention. However, most existing works focus on optimizing attention within specific models by exploiting certain sparse patterns of the attention map. A universal sparse attention that guarantees both the speedup and end-to-end performance of diverse models remains elusive. In this paper, we propose SpargeAttn, a universal sparse and quantized attention for any model. Our method uses a two-stage online filter: in the first stage, we rapidly and accurately predict the attention map, enabling the skip of some matrix multiplications in attention. In the second stage, we design an online softmax-aware filter that incurs no extra overhead and further skips some matrix multiplications. Experiments show that our method significantly accelerates diverse models, including language, image, and video generation, without sacrificing end-to-end metrics. The codes are available at https://github.com/thu-ml/SpargeAttn.

中文摘要

由于注意力的时间复杂度是二次的,高效的注意力实现对大模型至关重要。幸运的是,注意力通常具有稀疏性,即注意力图中的许多值接近于零,从而可以省略相应的计算。已有许多研究利用这种稀疏模式来加速注意力。然而,现有工作大多针对特定模型,利用其注意力图的特定稀疏模式进行优化;一种既能保证加速、又能保证各类模型端到端性能的通用稀疏注意力仍然缺失。本文提出SpargeAttn,一种适用于任意模型的通用稀疏量化注意力。我们的方法使用两阶段在线过滤器:第一阶段快速而准确地预测注意力图,从而跳过注意力中的部分矩阵乘法;第二阶段设计了一个不引入额外开销的在线softmax感知过滤器,进一步跳过部分矩阵乘法。实验表明,我们的方法在不牺牲端到端指标的前提下,显著加速了语言、图像和视频生成等多种模型。代码见 https://github.com/thu-ml/SpargeAttn。
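为帮助理解“先预测注意力图、再跳过部分矩阵乘法”的两阶段思路,下面给出一个极简的块稀疏注意力示意;块划分与阈值均为假设,并非SpargeAttn的实际实现:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def block_sparse_attention(Q, K, V, block=16, keep_thresh=0.02):
    """先用块内均值粗略预测注意力分布,再只对质量足够大的 K/V 块做精确计算。"""
    n, d = Q.shape
    nb = n // block
    Kb = K[: nb * block].reshape(nb, block, d).mean(axis=1)        # 块级 K 近似
    block_probs = softmax(Q @ Kb.T / np.sqrt(d), axis=-1)          # (n, nb) 预测的块级注意力
    out = np.zeros_like(Q)
    for i in range(n):
        keep = np.where(block_probs[i] >= keep_thresh)[0]          # 只保留重要的块
        idx = np.concatenate([np.arange(b * block, (b + 1) * block) for b in keep])
        w = softmax(Q[i] @ K[idx].T / np.sqrt(d))
        out[i] = w @ V[idx]
    return out

Q, K, V = (np.random.randn(64, 32) for _ in range(3))
print(block_sparse_attention(Q, K, V).shape)
```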


DICEPTION:视觉感知任务的通用扩散模型

  • 标题: DICEPTION: A Generalist Diffusion Model for Visual Perceptual Tasks
  • 作者: Canyu Zhao, Mingyu Liu, Huanyi Zheng, Muzhi Zhu, Zhiyue Zhao, Hao Chen, Tong He, Chunhua Shen
  • 日期: 2025-02-24
  • ArXiv主页: https://arxiv.org/abs/2502.17157
  • 论文链接: https://arxiv.org/pdf/2502.17157
  • 项目链接: https://aim-uofa.github.io/Diception/

英文摘要

Our primary goal here is to create a good, generalist perception model that can tackle multiple tasks, within limits on computational resources and training data. To achieve this, we resort to text-to-image diffusion models pre-trained on billions of images. Our exhaustive evaluation metrics demonstrate that DICEPTION effectively tackles multiple perception tasks, achieving performance on par with state-of-the-art models. We achieve results on par with SAM-vit-h using only 0.06% of their data (e.g., 600K vs. 1B pixel-level annotated images). Inspired by Wang et al., DICEPTION formulates the outputs of various perception tasks using color encoding; and we show that the strategy of assigning random colors to different instances is highly effective in both entity segmentation and semantic segmentation. Unifying various perception tasks as conditional image generation enables us to fully leverage pre-trained text-to-image models. Thus, DICEPTION can be efficiently trained at a cost of orders of magnitude lower, compared to conventional models that were trained from scratch. When adapting our model to other tasks, it only requires fine-tuning on as few as 50 images and 1% of its parameters. DICEPTION provides valuable insights and a more promising solution for visual generalist models.

中文摘要

我们的主要目标是在算力和训练数据受限的情况下,构建一个能够处理多种任务的优秀通用感知模型。为此,我们利用在数十亿图像上预训练的文本到图像扩散模型。详尽的评估表明,DICEPTION能有效处理多种感知任务,性能可与最先进模型相媲美。我们仅用SAM-vit-h所需数据的0.06%(约60万张像素级标注图像对比10亿张)就取得了与其相当的结果。受Wang等人的启发,DICEPTION用颜色编码来表示各种感知任务的输出;我们展示了为不同实例分配随机颜色的策略,在实体分割和语义分割中都非常有效。将各种感知任务统一为条件图像生成,使我们能够充分利用预训练的文本到图像模型。因此,与从头训练的传统模型相比,DICEPTION的训练成本低了几个数量级。在将模型适配到其他任务时,只需在最少50张图像上、对约1%的参数进行微调。DICEPTION为视觉通用模型提供了宝贵的见解和更有希望的解决方案。
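摘要中“为不同实例分配随机颜色、以颜色编码表示感知任务输出”的做法可以用下面的小例子直观理解;颜色分配细节为假设:

```python
import numpy as np

def encode_masks_as_rgb(instance_mask: np.ndarray, seed: int = 0) -> np.ndarray:
    """把 (H, W) 的实例分割掩码编码成 RGB 图像:每个实例 id 分配一个随机颜色。"""
    rng = np.random.default_rng(seed)
    rgb = np.zeros((*instance_mask.shape, 3), dtype=np.uint8)
    for i in np.unique(instance_mask):
        rgb[instance_mask == i] = rng.integers(0, 256, size=3, dtype=np.uint8)
    return rgb

mask = np.array([[0, 0, 1],
                 [2, 1, 1]])
print(encode_masks_as_rgb(mask)[0, 2])   # 实例 1 被赋予的颜色
```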


定理解释代理:面向大语言模型定理理解的多模态解释

  • 标题: TheoremExplainAgent: Towards Multimodal Explanations for LLM Theorem Understanding
  • 作者: Max Ku, Thomas Chong, Jonathan Leung, Krish Shah, Alvin Yu, Wenhu Chen
  • 日期: 2025-02-26
  • ArXiv主页: https://arxiv.org/abs/2502.19400
  • 论文链接: https://arxiv.org/pdf/2502.19400
  • 项目链接: https://tiger-ai-lab.github.io/TheoremExplainAgent/
  • gitHub仓库: https://github.com/TIGER-AI-Lab/TheoremExplainAgent

英文摘要

Understanding domain-specific theorems often requires more than just text-based reasoning; effective communication through structured visual explanations is crucial for deeper comprehension. While large language models (LLMs) demonstrate strong performance in text-based theorem reasoning, their ability to generate coherent and pedagogically meaningful visual explanations remains an open challenge. In this work, we introduce TheoremExplainAgent, an agentic approach for generating long-form theorem explanation videos (over 5 minutes) using Manim animations. To systematically evaluate multimodal theorem explanations, we propose TheoremExplainBench, a benchmark covering 240 theorems across multiple STEM disciplines, along with 5 automated evaluation metrics. Our results reveal that agentic planning is essential for generating detailed long-form videos, and the o3-mini agent achieves a success rate of 93.8% and an overall score of 0.77. However, our quantitative and qualitative studies show that most of the videos produced exhibit minor issues with visual element layout. Furthermore, multimodal explanations expose deeper reasoning flaws that text-based explanations fail to reveal, highlighting the importance of multimodal explanations.

中文摘要

理解特定领域的定理通常不仅需要基于文本的推理;通过结构化视觉解释进行有效沟通对于深入理解至关重要。虽然大型语言模型(LLMs)在基于文本的定理推理中表现出色,但其生成连贯且具有教学意义的视觉解释的能力仍是一个待解决的挑战。在这项工作中,我们提出了定理解释代理(TheoremExplainAgent),一种基于代理的方法,用于生成包含曼尼姆动画的长篇幅定理解释视频(时长超过5分钟)。为系统评估多模态定理解释,我们提出了定理解释基准(TheoremExplainBench),该基准涵盖多个STEM学科的240个定理,并设计了5项自动化评估指标。实验结果表明,代理式规划对生成详细的长篇幅视频至关重要,其中o3-mini代理的成功率为93.8%,总体得分为0.77。然而,定量与定性分析表明,大多数生成的视频存在视觉元素布局的小问题。此外,多模态解释暴露了文本解释未能揭示的深层推理缺陷,这进一步凸显了多模态解释的重要性。


迈向AI联合科学家

  • 标题: Towards an AI co-scientist
  • 作者: Juraj Gottweis, Wei-Hung Weng, Alexander Daryin, Tao Tu, Anil Palepu, Petar Sirkovic, Artiom Myaskovsky, Felix Weissenberger, Keran Rong, Ryutaro Tanno, Khaled Saab, Dan Popovici, Jacob Blum, Fan Zhang, Katherine Chou, Avinatan Hassidim, Burak Gokturk, Amin Vahdat, Pushmeet Kohli, Yossi Matias, Andrew Carroll, Kavita Kulkarni, Nenad Tomasev, Yuan Guan, Vikram Dhillon, Eeshit Dhaval Vaishnav, Byron Lee, Tiago R D Costa, José R Penadés, Gary Peltz, Yunhan Xu, Annalisa Pawlosky, Alan Karthikesalingam, Vivek Natarajan
  • 日期: 2025-02-26
  • ArXiv主页: https://arxiv.org/abs/2502.18864
  • 论文链接: https://arxiv.org/pdf/2502.18864

英文摘要

Scientific discovery relies on scientists generating novel hypotheses that undergo rigorous experimental validation. To augment this process, we introduce an AI co-scientist, a multi-agent system built on Gemini 2.0. The AI co-scientist is intended to help uncover new, original knowledge and to formulate demonstrably novel research hypotheses and proposals, building upon prior evidence and aligned to scientist-provided research objectives and guidance. The system’s design incorporates a generate, debate, and evolve approach to hypothesis generation, inspired by the scientific method and accelerated by scaling test-time compute. Key contributions include: (1) a multi-agent architecture with an asynchronous task execution framework for flexible compute scaling; (2) a tournament evolution process for self-improving hypotheses generation. Automated evaluations show continued benefits of test-time compute, improving hypothesis quality. While general purpose, we focus development and validation in three biomedical areas: drug repurposing, novel target discovery, and explaining mechanisms of bacterial evolution and anti-microbial resistance. For drug repurposing, the system proposes candidates with promising validation findings, including candidates for acute myeloid leukemia that show tumor inhibition in vitro at clinically applicable concentrations. For novel target discovery, the AI co-scientist proposed new epigenetic targets for liver fibrosis, validated by anti-fibrotic activity and liver cell regeneration in human hepatic organoids. Finally, the AI co-scientist recapitulated unpublished experimental results via a parallel in silico discovery of a novel gene transfer mechanism in bacterial evolution. These results, detailed in separate, co-timed reports, demonstrate the potential to augment biomedical and scientific discovery and usher an era of AI empowered scientists.

中文摘要

科学发现依赖科学家提出新颖假设并经过严格的实验验证。为了辅助这一过程,我们提出AI联合科学家(AI co-scientist),一个基于Gemini 2.0构建的多智能体系统。AI联合科学家旨在帮助发掘新的原创知识,并在已有证据的基础上、围绕科学家给定的研究目标与指导,提出可证实为新颖的研究假设和方案。该系统的设计借鉴科学方法,采用“生成、辩论、演化”的假设生成方式,并通过扩展测试时计算加以加速。主要贡献包括:(1)具有异步任务执行框架的多智能体架构,支持灵活的算力扩展;(2)用于自我改进假设生成的锦标赛式演化过程。自动评估显示,测试时计算的增加持续带来收益,假设质量不断提升。该系统虽为通用设计,我们重点在三个生物医学领域进行开发与验证:药物重定位、新靶点发现,以及细菌进化与抗微生物耐药机制的解释。在药物重定位方面,系统提出的候选药物获得了有希望的验证结果,包括在临床可用浓度下于体外表现出肿瘤抑制作用的急性髓系白血病候选药物。在新靶点发现方面,AI联合科学家提出了肝纤维化的新表观遗传靶点,并在人肝类器官中通过抗纤维化活性和肝细胞再生得到验证。最后,AI联合科学家通过并行的计算机模拟(in silico)发现,重现了细菌进化中一种新型基因转移机制的未发表实验结果。这些结果在另行同步发布的报告中有详细阐述,展示了其增强生物医学与科学发现的潜力,有望开启AI赋能科学家的时代。


R2-T2:多模态专家混合模型的测试时动态路由

  • 标题: R2-T2: Re-Routing in Test-Time for Multimodal Mixture-of-Experts

  • 作者: Zhongyang Li, Ziyue Li, Tianyi Zhou

  • 日期: 2025-02-27

  • ArXiv主页: https://arxiv.org/abs/2502.20395

  • 论文链接: https://arxiv.org/pdf/2502.20395

  • gitHub仓库: https://github.com/tianyi-lab/R2-T2

英文摘要

In large multimodal models (LMMs), the perception of non-language modalities (e.g., visual representations) is usually not on par with the large language models (LLMs)’ powerful reasoning capabilities, deterring LMMs’ performance on challenging downstream tasks. This weakness has been recently mitigated by replacing the vision encoder with a mixture-of-experts (MoE), which provides rich, multi-granularity, and diverse representations required by diverse downstream tasks. The performance of multimodal MoE largely depends on its router, which reweights and mixes the representations of different experts for each input. However, we find that the end-to-end trained router does not always produce the optimal routing weights for every test sample. To bridge the gap, we propose a novel and efficient method "Re-Routing in Test-Time(R2-T2) that locally optimizes the vector of routing weights in test-time by moving it toward those vectors of the correctly predicted samples in a neighborhood of the test sample. We propose three R2-T2 strategies with different optimization objectives and neighbor-search spaces. R2-T2 consistently and greatly improves state-of-the-art LMMs’ performance on challenging benchmarks of diverse tasks, without training any base-model parameters.

中文摘要

在大型多模态模型(LMM)中,对非语言模态(如视觉表示)的感知能力通常跟不上大语言模型(LLM)强大的推理能力,制约了LMM在高难度下游任务上的表现。近来,通过用专家混合(MoE)替换视觉编码器,这一弱点得到了缓解:MoE能提供多样下游任务所需的丰富、多粒度且多样化的表示。多模态MoE的性能在很大程度上取决于其路由器,它为每个输入对不同专家的表示进行重新加权和混合。然而,我们发现端到端训练得到的路由器并不总是能为每个测试样本给出最优的路由权重。为弥补这一差距,我们提出一种新颖而高效的方法,即测试时重路由(R2-T2):在测试时对路由权重向量进行局部优化,使其朝测试样本邻域中被正确预测样本的路由权重向量移动。我们提出了具有不同优化目标和邻域搜索空间的三种R2-T2策略。在多种任务的高难度基准上,R2-T2在不训练任何基座模型参数的情况下,持续且大幅地提升了最先进LMM的性能。
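下面用一个假设性的小例子示意“测试时重路由”的基本想法:把测试样本的路由权重向其邻域中被正确预测样本的路由权重移动。三种具体的优化目标与邻域搜索策略请以论文为准:

```python
import numpy as np

def rerouting(test_emb, test_route, ref_embs, ref_routes, ref_correct, k=5, step=0.5):
    """在参考集中找出与测试样本最近、且被正确预测的 k 个样本,
    以距离加权平均其路由权重作为目标,把测试样本的路由权重向其移动(示意实现)。"""
    ok = np.where(ref_correct)[0]
    dist = np.linalg.norm(ref_embs[ok] - test_emb, axis=1)
    nn = ok[np.argsort(dist)[:k]]                                  # 邻域内的正确样本
    w = np.exp(-np.linalg.norm(ref_embs[nn] - test_emb, axis=1))   # 核权重
    target = (w[:, None] * ref_routes[nn]).sum(axis=0) / w.sum()
    new_route = (1 - step) * test_route + step * target
    return new_route / new_route.sum()                             # 保持各专家权重归一化

rng = np.random.default_rng(0)
route = rerouting(rng.normal(size=8), np.full(4, 0.25),
                  rng.normal(size=(20, 8)), rng.dirichlet(np.ones(4), size=20),
                  rng.integers(0, 2, size=20).astype(bool))
print(route.round(3))
```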


Mol-LLaMA:基于大模型的分子通用理解框架

  • 标题: Mol-LLaMA: Towards General Understanding of Molecules in Large Molecular Language Model
  • 作者: Dongki Kim, Wonbin Lee, Sung Ju Hwang
  • 日期: 2025-02-19
  • ArXiv主页: https://arxiv.org/abs/2502.13449
  • 论文链接: https://arxiv.org/pdf/2502.13449
  • 项目链接: https://mol-llama.github.io/

英文摘要

Understanding molecules is key to understanding organisms and driving advances in drug discovery, requiring interdisciplinary knowledge across chemistry and biology. Although large molecular language models have achieved notable success in interpreting molecular structures, their instruction datasets are limited to the specific knowledge from task-oriented datasets and do not fully cover the fundamental characteristics of molecules, hindering their abilities as general-purpose molecular assistants. To address this issue, we propose Mol-LLaMA, a large molecular language model that grasps the general knowledge centered on molecules via multi-modal instruction tuning. To this end, we design key data types that encompass the fundamental features of molecules, incorporating essential knowledge from molecular structures. In addition, to improve understanding of molecular features, we introduce a module that integrates complementary information from different molecular encoders, leveraging the distinct advantages of different molecular representations. Our experimental results demonstrate that Mol-LLaMA is capable of comprehending the general features of molecules and generating relevant responses to users’ queries with detailed explanations, implying its potential as a general-purpose assistant for molecular analysis.

中文摘要

理解分子是理解生物体、推动药物发现进展的关键,这需要横跨化学与生物学的跨学科知识。尽管大型分子语言模型在解释分子结构方面取得了显著成功,但它们的指令数据集局限于面向任务的数据集中的特定知识,未能完整覆盖分子的基本特性,限制了其作为通用分子助手的能力。为了解决这个问题,我们提出Mol-LLaMA,一个通过多模态指令微调掌握以分子为中心的通用知识的大型分子语言模型。为此,我们设计了涵盖分子基本特征的关键数据类型,并融入了来自分子结构的核心知识。此外,为了提升对分子特征的理解,我们引入了一个整合不同分子编码器互补信息的模块,以利用不同分子表示各自的优势。实验结果表明,Mol-LLaMA能够理解分子的一般特征,并对用户的查询给出带有详细解释的相关回答,显示出其作为分子分析通用助手的潜力。


PhotoDoodle:从少数成对数据中学习艺术图像编辑

  • 标题: PhotoDoodle: Learning Artistic Image Editing from Few-Shot Pairwise Data

  • 作者: Shijie Huang, Yiren Song, Yuxuan Zhang, Hailong Guo, Xueyin Wang, Mike Zheng Shou, Jiaming Liu

  • 日期: 2025-02-20

  • ArXiv主页: https://arxiv.org/abs/2502.14397

  • 论文链接: https://arxiv.org/pdf/2502.14397

  • gitHub仓库: https://github.com/showlab/PhotoDoodle

英文摘要

We introduce PhotoDoodle, a novel image editing framework designed to facilitate photo doodling by enabling artists to overlay decorative elements onto photographs. Photo doodling is challenging because the inserted elements must appear seamlessly integrated with the background, requiring realistic blending, perspective alignment, and contextual coherence. Additionally, the background must be preserved without distortion, and the artist’s unique style must be captured efficiently from limited training data. These requirements are not addressed by previous methods that primarily focus on global style transfer or regional inpainting. The proposed method, PhotoDoodle, employs a two-stage training strategy. Initially, we train a general-purpose image editing model, OmniEditor, using large-scale data. Subsequently, we fine-tune this model with EditLoRA using a small, artist-curated dataset of before-and-after image pairs to capture distinct editing styles and techniques. To enhance consistency in the generated results, we introduce a positional encoding reuse mechanism. Additionally, we release a PhotoDoodle dataset featuring six high-quality styles. Extensive experiments demonstrate the advanced performance and robustness of our method in customized image editing, opening new possibilities for artistic creation.

中文摘要

我们提出PhotoDoodle,一个新颖的图像编辑框架,旨在让艺术家能够在照片上叠加装饰元素,从而实现照片涂鸦。照片涂鸦具有挑战性:插入的元素必须与背景无缝融合,需要逼真的混合、透视对齐和上下文连贯;同时背景必须不失真地保留,而且要能从有限的训练数据中高效捕捉艺术家的独特风格。以往主要关注全局风格迁移或区域修补的方法无法满足这些要求。我们提出的PhotoDoodle采用两阶段训练策略:首先用大规模数据训练一个通用图像编辑模型OmniEditor;随后使用EditLoRA,在艺术家整理的少量前后图像对数据集上对该模型进行微调,以捕捉独特的编辑风格与技巧。为了提升生成结果的一致性,我们引入了位置编码复用机制。此外,我们还发布了包含六种高质量风格的PhotoDoodle数据集。大量实验证明了我们的方法在定制化图像编辑中的优越性能与鲁棒性,为艺术创作开辟了新的可能。


MaskGWM:基于视频掩码重建的通用驾驶世界模型

  • 标题: MaskGWM: A Generalizable Driving World Model with Video Mask Reconstruction
  • 作者: Jingcheng Ni, Yuxin Guo, Yichen Liu, Rui Chen, Lewei Lu, Zehuan Wu
  • 日期: 2025-02-17
  • ArXiv主页: https://arxiv.org/abs/2502.11663
  • 论文链接: https://arxiv.org/pdf/2502.11663
  • 项目链接: https://sensetime-fvg.github.io/MaskGWM
  • gitHub仓库: https://github.com/SenseTime-FVG/OpenDWM

英文摘要

World models that forecast environmental changes from actions are vital for autonomous driving models with strong generalization. The prevailing driving world model mainly build on video prediction model. Although these models can produce high-fidelity video sequences with advanced diffusion-based generator, they are constrained by their predictive duration and overall generalization capabilities. In this paper, we explore to solve this problem by combining generation loss with MAE-style feature-level context learning. In particular, we instantiate this target with three key design: (1) A more scalable Diffusion Transformer (DiT) structure trained with extra mask construction task. (2) we devise diffusion-related mask tokens to deal with the fuzzy relations between mask reconstruction and generative diffusion process. (3) we extend mask construction task to spatial-temporal domain by utilizing row-wise mask for shifted self-attention rather than masked self-attention in MAE. Then, we adopt a row-wise cross-view module to align with this mask design. Based on above improvement, we propose MaskGWM: a Generalizable driving World Model embodied with Video Mask reconstruction. Our model contains two variants: MaskGWM-long, focusing on long-horizon prediction, and MaskGWM-mview, dedicated to multi-view generation. Comprehensive experiments on standard benchmarks validate the effectiveness of the proposed method, which contain normal validation of Nuscene dataset, long-horizon rollout of OpenDV-2K dataset and zero-shot validation of Waymo dataset. Quantitative metrics on these datasets show our method notably improving state-of-the-art driving world model.

中文摘要

能够根据动作预测环境变化的世界模型,对于具备强泛化能力的自动驾驶模型至关重要。目前主流的驾驶世界模型主要建立在视频预测模型之上。尽管这些模型借助先进的基于扩散的生成器可以产生高保真视频序列,但其可预测时长和整体泛化能力仍然受限。本文探索将生成损失与MAE风格的特征级上下文学习相结合来解决这一问题。具体而言,我们通过三个关键设计来实现这一目标:(1)一种更具可扩展性的扩散Transformer(DiT)结构,并附加掩码重建任务进行训练;(2)设计与扩散过程相关的掩码词元,以处理掩码重建与生成式扩散过程之间的模糊关系;(3)利用行级掩码配合移位自注意力(而非MAE中的掩码自注意力),将掩码重建任务扩展到时空域,并采用行级跨视角模块与该掩码设计对齐。基于上述改进,我们提出MaskGWM:一个融合视频掩码重建的可泛化驾驶世界模型。我们的模型包含两个变体:专注长时域预测的MaskGWM-long,以及面向多视角生成的MaskGWM-mview。在标准基准上的综合实验验证了所提方法的有效性,包括Nuscene数据集的常规验证、OpenDV-2K数据集的长时域推演以及Waymo数据集的零样本验证。这些数据集上的定量指标表明,我们的方法显著提升了最先进的驾驶世界模型。


NeoBERT:下一代 BERT

  • 标题: NeoBERT: A Next-Generation BERT
  • 作者: Lola Le Breton, Quentin Fournier, Mariam El Mezouar, Sarath Chandar
  • 日期: 2025-02-26
  • ArXiv主页: https://arxiv.org/abs/2502.19587
  • 论文链接: https://arxiv.org/pdf/2502.19587

英文摘要

Recent innovations in architecture, pre-training, and fine-tuning have led to the remarkable in-context learning and reasoning abilities of large auto-regressive language models such as LLaMA and DeepSeek. In contrast, encoders like BERT and RoBERTa have not seen the same level of progress despite being foundational for many downstream NLP applications. To bridge this gap, we introduce NeoBERT, a next-generation encoder that redefines the capabilities of bidirectional models by integrating state-of-the-art advancements in architecture, modern data, and optimized pre-training methodologies. NeoBERT is designed for seamless adoption: it serves as a plug-and-play replacement for existing base models, relies on an optimal depth-to-width ratio, and leverages an extended context length of 4,096 tokens. Despite its compact 250M parameter footprint, it achieves state-of-the-art results on the massive MTEB benchmark, outperforming BERT large, RoBERTa large, NomicBERT, and ModernBERT under identical fine-tuning conditions. In addition, we rigorously evaluate the impact of each modification on GLUE and design a uniform fine-tuning and evaluation framework for MTEB. We release all code, data, checkpoints, and training scripts to accelerate research and real-world adoption.

中文摘要

架构、预训练和微调方面的最新创新,造就了LLaMA、DeepSeek等大型自回归语言模型出色的上下文学习与推理能力。相比之下,尽管BERT和RoBERTa等编码器是众多下游NLP应用的基石,它们却没有获得同等程度的进步。为弥合这一差距,我们提出NeoBERT,一个新一代编码器,通过整合架构上的最新进展、现代数据和优化的预训练方法,重新定义双向模型的能力。NeoBERT为无缝替换而设计:它可作为现有基座模型的即插即用替代,采用最优的深宽比,并支持4096词元的扩展上下文长度。尽管参数量仅有紧凑的2.5亿,它在大规模MTEB基准上取得了最先进的结果,在相同微调条件下超越了BERT-large、RoBERTa-large、NomicBERT和ModernBERT。此外,我们严格评估了每项修改对GLUE的影响,并为MTEB设计了统一的微调与评估框架。我们发布了全部代码、数据、检查点和训练脚本,以加速研究和实际应用。


LongRoPE2:近乎无损的LLM上下文窗口扩展

  • 标题: LongRoPE2: Near-Lossless LLM Context Window Scaling
  • 作者: Ning Shang, Li Lyna Zhang, Siyuan Wang, Gaokai Zhang, Gilsinia Lopez, Fan Yang, Weizhu Chen, Mao Yang
  • 日期: 2025-02-27
  • ArXiv主页: https://arxiv.org/abs/2502.20082
  • 论文链接: https://arxiv.org/pdf/2502.20082

英文摘要

LongRoPE2 is a novel approach that extends the effective context window of pre-trained large language models (LLMs) to the target length, while preserving the performance on the original shorter context window. This is achieved by three contributions: (1) a hypothesis that insufficient training in higher RoPE dimensions contributes to the persistent out-of-distribution (OOD) issues observed in existing methods; (2) an effective RoPE rescaling algorithm that adopts evolutionary search guided by “needle-driven” perplexity to address the insufficient training problem; (3) a mixed context window training approach that fine-tunes model weights to adopt rescaled RoPE for long-context sequences while preserving the short-context performance with the original RoPE. Extensive experiments on LLaMA3-8B and Phi3-mini-3.8B across various benchmarks validate the hypothesis and demonstrate the effectiveness of LongRoPE2. Remarkably, LongRoPE2 extends LLaMA3-8B to achieve a 128K effective context length while retaining over 98.5% of short-context performance, using only 10B tokens – 80x fewer than Meta’s approach, which fails to reach the target effective context length. Code will be available at https://github.com/microsoft/LongRoPE.

中文摘要

LongRoPE2是一种新方法,它将预训练大语言模型(LLM)的有效上下文窗口扩展到目标长度,同时保留模型在原有较短上下文窗口上的性能。这一目标通过三项贡献实现:(1)提出一个假设,即RoPE较高维度上的训练不足,是现有方法中持续出现的分布外(OOD)问题的成因之一;(2)提出一种有效的RoPE重缩放算法,以“针驱动”的困惑度为指导进行进化搜索,来解决训练不足的问题;(3)提出混合上下文窗口训练方法,微调模型权重,使其在长上下文序列上采用重缩放后的RoPE,同时用原始RoPE保持短上下文性能。在LLaMA3-8B和Phi3-mini-3.8B上针对多种基准的大量实验验证了该假设,并证明了LongRoPE2的有效性。值得注意的是,LongRoPE2仅用100亿词元(比Meta的方案少80倍,且后者未能达到目标有效上下文长度)就将LLaMA3-8B扩展到128K有效上下文长度,同时保留了超过98.5%的短上下文性能。代码将发布在 https://github.com/microsoft/LongRoPE。
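下面是一个RoPE按维度重缩放的示意片段,用来说明“对不同频率维度施加不同缩放因子以外推上下文窗口”的思路;示例中的因子为随意取值,论文中的因子由针驱动困惑度引导的进化搜索得到:

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0, scale=None):
    """计算每个位置在各 RoPE 频率维度上的旋转角。
    scale 是逐维度的重缩放因子(scale=None 即原始 RoPE);
    频率除以因子相当于拉长对应维度的周期,便于外推到更长上下文。"""
    freqs = base ** (-np.arange(0, dim, 2) / dim)
    if scale is not None:
        freqs = freqs / np.asarray(scale)
    return np.outer(positions, freqs)          # (num_pos, dim/2) 的角度矩阵

pos = np.arange(4096)
orig = rope_angles(pos, dim=64)
factors = np.linspace(1.0, 8.0, orig.shape[1])   # 假设:高维(低频)通道用更大的因子
scaled = rope_angles(pos, dim=64, scale=factors)
print(orig.shape, scaled.shape)
```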


Audio-FLAN:初步版本

  • 标题: Audio-FLAN: A Preliminary Release

  • 作者: Liumeng Xue, Ziya Zhou, Jiahao Pan, Zixuan Li, Shuai Fan, Yinghao Ma, Sitong Cheng, Dongchao Yang, Haohan Guo, Yujia Xiao, Xinsheng Wang, Zixuan Shen, Chuanbo Zhu, Xinshen Zhang, Tianchi Liu, Ruibin Yuan, Zeyue Tian, Haohe Liu, Emmanouil Benetos, Ge Zhang, Yike Guo, Wei Xue

  • 日期: 2025-02-23

  • ArXiv主页: https://arxiv.org/abs/2502.16584

  • 论文链接: https://arxiv.org/pdf/2502.16584

  • gitHub仓库: https://github.com/lmxue/Audio-FLAN

英文摘要

Recent advancements in audio tokenization have significantly enhanced the integration of audio capabilities into large language models (LLMs). However, audio understanding and generation are often treated as distinct tasks, hindering the development of truly unified audio-language models. While instruction tuning has demonstrated remarkable success in improving generalization and zero-shot learning across text and vision, its application to audio remains largely unexplored. A major obstacle is the lack of comprehensive datasets that unify audio understanding and generation. To address this, we introduce Audio-FLAN, a large-scale instruction-tuning dataset covering 80 diverse tasks across speech, music, and sound domains, with over 100 million instances. Audio-FLAN lays the foundation for unified audio-language models that can seamlessly handle both understanding (e.g., transcription, comprehension) and generation (e.g., speech, music, sound) tasks across a wide range of audio domains in a zero-shot manner. The Audio-FLAN dataset is available on HuggingFace and GitHub and will be continuously updated.

中文摘要

音频分词(tokenization)的最新进展显著增强了将音频能力整合进大语言模型(LLM)的效果。然而,音频理解与音频生成通常被当作彼此独立的任务,阻碍了真正统一的音频-语言模型的发展。尽管指令微调在提升文本和视觉领域的泛化与零样本学习方面取得了显著成功,但其在音频领域的应用在很大程度上仍未被探索。一个主要障碍是缺乏统一音频理解与生成的综合数据集。为此,我们提出Audio-FLAN,一个大规模指令微调数据集,涵盖语音、音乐和声音三大领域的80项多样任务,包含超过1亿个实例。Audio-FLAN为统一的音频-语言模型奠定了基础,使其能够以零样本方式无缝处理广泛音频领域中的理解(如转写、理解)与生成(如语音、音乐、声音)任务。Audio-FLAN数据集已在HuggingFace和GitHub上提供,并将持续更新。


ART:可变多层透明图像生成的匿名区域Transformer

  • 标题: ART: Anonymous Region Transformer for Variable Multi-Layer Transparent Image Generation
  • 作者: Yifan Pu, Yiming Zhao, Zhicong Tang, Ruihong Yin, Haoxing Ye, Yuhui Yuan, Dong Chen, Jianmin Bao, Sirui Zhang, Yanbin Wang, Lin Liang, Lijuan Wang, Ji Li, Xiu Li, Zhouhui Lian, Gao Huang, Baining Guo
  • 日期: 2025-02-25
  • ArXiv主页: https://arxiv.org/abs/2502.18364
  • 论文链接: https://arxiv.org/pdf/2502.18364

英文摘要

Multi-layer image generation is a fundamental task that enables users to isolate, select, and edit specific image layers, thereby revolutionizing interactions with generative models. In this paper, we introduce the Anonymous Region Transformer (ART), which facilitates the direct generation of variable multi-layer transparent images based on a global text prompt and an anonymous region layout. Inspired by Schema theory suggests that knowledge is organized in frameworks (schemas) that enable people to interpret and learn from new information by linking it to prior knowledge.}, this anonymous region layout allows the generative model to autonomously determine which set of visual tokens should align with which text tokens, which is in contrast to the previously dominant semantic layout for the image generation task. In addition, the layer-wise region crop mechanism, which only selects the visual tokens belonging to each anonymous region, significantly reduces attention computation costs and enables the efficient generation of images with numerous distinct layers (e.g., 50+). When compared to the full attention approach, our method is over 12 times faster and exhibits fewer layer conflicts. Furthermore, we propose a high-quality multi-layer transparent image autoencoder that supports the direct encoding and decoding of the transparency of variable multi-layer images in a joint manner. By enabling precise control and scalable layer generation, ART establishes a new paradigm for interactive content creation.

中文摘要

多层图像生成是一项基础任务,它使用户能够分离、选择和编辑特定的图像图层,从而彻底改变与生成模型的交互方式。本文提出匿名区域Transformer(ART),它能基于全局文本提示和匿名区域布局,直接生成层数可变的多层透明图像。受图式理论(Schema theory,认为知识以框架/图式的形式组织,人们通过将新信息与已有知识关联来理解和学习)的启发,这种匿名区域布局让生成模型自主决定哪些视觉词元应与哪些文本词元对齐,这与此前图像生成任务中占主导地位的语义布局形成对比。此外,逐层的区域裁剪机制只选取属于各匿名区域的视觉词元,显著降低了注意力计算成本,并能高效生成包含大量不同图层(如50层以上)的图像。与全注意力方法相比,我们的方法快12倍以上,且图层冲突更少。我们还提出了一个高质量的多层透明图像自动编码器,支持以联合方式对可变多层图像的透明度直接进行编码和解码。通过实现精确控制和可扩展的图层生成,ART为交互式内容创作确立了新的范式。


KV-Edit:面向精确背景保留的免训练图像编辑

  • 标题: KV-Edit: Training-Free Image Editing for Precise Background Preservation
  • 作者: Tianrui Zhu, Shiyi Zhang, Jiawei Shao, Yansong Tang
  • 日期: 2025-02-24
  • ArXiv主页: https://arxiv.org/abs/2502.17363
  • 论文链接: https://arxiv.org/pdf/2502.17363
  • 项目链接: https://xilluill.github.io/projectpages/KV-Edit/
  • gitHub仓库: https://github.com/Xilluill

英文摘要

Background consistency remains a significant challenge in image editing tasks. Despite extensive developments, existing works still face a trade-off between maintaining similarity to the original image and generating content that aligns with the target. Here, we propose KV-Edit, a training-free approach that uses KV cache in DiTs to maintain background consistency, where background tokens are preserved rather than regenerated, eliminating the need for complex mechanisms or expensive training, ultimately generating new content that seamlessly integrates with the background within user-provided regions. We further explore the memory consumption of the KV cache during editing and optimize the space complexity to O(1) using an inversion-free method. Our approach is compatible with any DiT-based generative model without additional training. Experiments demonstrate that KV-Edit significantly outperforms existing approaches in terms of both background and image quality, even surpassing training-based methods. Project webpage is available at https://xilluill.github.io/projectpages/KV-Edit

中文摘要

背景一致性仍是图像编辑任务中的重大挑战。尽管已有大量进展,现有工作仍需在保持与原图相似和生成符合目标的内容之间取舍。为此,我们提出KV-Edit,一种无需训练的方法:利用DiT中的KV缓存来保持背景一致性,对背景词元进行保留而非重新生成,从而无需复杂机制或昂贵训练,最终在用户给定的区域内生成与背景无缝衔接的新内容。我们进一步研究了编辑过程中KV缓存的显存消耗,并通过免反演(inversion-free)方法将空间复杂度优化到O(1)。我们的方法与任何基于DiT的生成模型兼容,无需额外训练。实验表明,KV-Edit在背景与图像质量上都显著优于现有方法,甚至超过了基于训练的方法。项目主页: https://xilluill.github.io/projectpages/KV-Edit
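下面用一个简化的单步注意力示意说明“背景词元的KV直接复用缓存、只重新生成编辑区域”的思路;投影矩阵与掩码均为演示用的假设数据,并非KV-Edit的实际实现:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def edit_step(latent, fg_mask, cached_k, cached_v, w_q, w_k, w_v):
    """单步去噪中的注意力:背景词元的 K/V 复用缓存(来自原图),
    只有前景(编辑区域)词元被重新计算并更新,背景潜变量保持不变。"""
    q, k, v = latent @ w_q, latent @ w_k, latent @ w_v
    k[~fg_mask], v[~fg_mask] = cached_k[~fg_mask], cached_v[~fg_mask]   # 背景 KV 用缓存值
    out = softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v
    new_latent = latent.copy()
    new_latent[fg_mask] = out[fg_mask]          # 只更新前景词元
    return new_latent

n, d = 16, 8
rng = np.random.default_rng(0)
latent = rng.normal(size=(n, d))
mask = np.zeros(n, dtype=bool); mask[:4] = True   # 前 4 个词元属于编辑区域
new = edit_step(latent, mask, rng.normal(size=(n, d)), rng.normal(size=(n, d)),
                rng.normal(size=(d, d)), rng.normal(size=(d, d)), rng.normal(size=(d, d)))
print(np.allclose(new[~mask], latent[~mask]))     # True:背景未被改动
```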


Plutus:低资源希腊金融中的大型语言模型的基准测试

  • 标题: Plutus: Benchmarking Large Language Models in Low-Resource Greek Finance
  • 作者: Xueqing Peng, Triantafillos Papadopoulos, Efstathia Soufleri, Polydoros Giannouris, Ruoyu Xiang, Yan Wang, Lingfei Qian, Jimin Huang, Qianqian Xie, Sophia Ananiadou
  • 日期: 2025-02-26
  • ArXiv主页: https://arxiv.org/abs/2502.18772
  • 论文链接: https://arxiv.org/pdf/2502.18772

英文摘要

Despite Greece’s pivotal role in the global economy, large language models (LLMs) remain underexplored for Greek financial context due to the linguistic complexity of Greek and the scarcity of domain-specific datasets. Previous efforts in multilingual financial natural language processing (NLP) have exposed considerable performance disparities, yet no dedicated Greek financial benchmarks or Greek-specific financial LLMs have been developed until now. To bridge this gap, we introduce Plutus-ben, the first Greek Financial Evaluation Benchmark, and Plutus-8B, the pioneering Greek Financial LLM, fine-tuned with Greek domain-specific data. Plutus-ben addresses five core financial NLP tasks in Greek: numeric and textual named entity recognition, question answering, abstractive summarization, and topic classification, thereby facilitating systematic and reproducible LLM assessments. To underpin these tasks, we present three novel, high-quality Greek financial datasets, thoroughly annotated by expert native Greek speakers, augmented by two existing resources. Our comprehensive evaluation of 22 LLMs on Plutus-ben reveals that Greek financial NLP remains challenging due to linguistic complexity, domain-specific terminology, and financial reasoning gaps. These findings underscore the limitations of cross-lingual transfer, the necessity for financial expertise in Greek-trained models, and the challenges of adapting financial LLMs to Greek text. We release Plutus-ben, Plutus-8B, and all associated datasets publicly to promote reproducible research and advance Greek financial NLP, fostering broader multilingual inclusivity in finance.

中文摘要

尽管希腊在全球经济中具有关键地位,但由于希腊语的语言复杂性和领域数据集的稀缺,大语言模型(LLM)在希腊金融场景下仍缺乏充分研究。此前的多语言金融自然语言处理(NLP)工作暴露了相当大的性能差距,但迄今为止还没有专门的希腊金融基准,也没有针对希腊语的金融LLM。为弥补这一空白,我们提出了首个希腊金融评估基准Plutus-ben,以及首个希腊金融LLM Plutus-8B(使用希腊领域数据微调)。Plutus-ben涵盖希腊语中的五项核心金融NLP任务:数值命名实体识别、文本命名实体识别、问答、生成式摘要和主题分类,从而支持系统化、可复现的LLM评估。为支撑这些任务,我们构建了三个全新的高质量希腊金融数据集,由以希腊语为母语的专家进行详尽标注,并辅以两个现有资源。我们在Plutus-ben上对22个LLM的全面评估表明,由于语言复杂性、领域术语和金融推理方面的差距,希腊金融NLP仍然充满挑战。这些发现凸显了跨语言迁移的局限、在希腊语上训练的模型对金融专业知识的需求,以及将金融LLM适配到希腊语文本的难度。我们公开发布Plutus-ben、Plutus-8B及所有相关数据集,以促进可复现研究、推进希腊金融NLP,并促进金融领域更广泛的多语言包容性。


语言模型的事实性取决于提问语言

  • 标题: Language Models’ Factuality Depends on the Language of Inquiry

  • 作者: Tushar Aggarwal, Kumar Tanmay, Ayush Agrawal, Kumar Ayush, Hamid Palangi, Paul Pu Liang

  • 日期: 2025-02-25

  • ArXiv主页: https://arxiv.org/abs/2502.17955

  • 论文链接: https://arxiv.org/pdf/2502.17955

  • gitHub仓库: https://github.com/kmrtanmay/X_FaKT

英文摘要

Multilingual language models (LMs) are expected to recall factual knowledge consistently across languages, yet they often fail to transfer knowledge between languages even when they possess the correct information in one of the languages. For example, we find that an LM may correctly identify Rashed Al Shashai as being from Saudi Arabia when asked in Arabic, but consistently fails to do so when asked in English or Swahili. To systematically investigate this limitation, we introduce a benchmark of 10,000 country-related facts across 13 languages and propose three novel metrics: Factual Recall Score, Knowledge Transferability Score, and Cross-Lingual Factual Knowledge Transferability Score-to quantify factual recall and knowledge transferability in LMs across different languages. Our results reveal fundamental weaknesses in today’s state-of-the-art LMs, particularly in cross-lingual generalization where models fail to transfer knowledge effectively across different languages, leading to inconsistent performance sensitive to the language used. Our findings emphasize the need for LMs to recognize language-specific factual reliability and leverage the most trustworthy information across languages. We release our benchmark and evaluation framework to drive future research in multilingual knowledge transfer.

中文摘要

人们期望多语言语言模型(LM)能够跨语言一致地回忆事实知识,然而即便模型在某一种语言中掌握了正确信息,它们也常常无法在语言之间迁移这些知识。例如,我们发现一个LM在用阿拉伯语提问时能正确识别Rashed Al Shashai来自沙特阿拉伯,但在用英语或斯瓦希里语提问时却始终无法做到。为系统地研究这一局限,我们构建了一个覆盖13种语言、包含10,000条国家相关事实的基准,并提出三个新指标:事实回忆得分(Factual Recall Score)、知识可迁移性得分(Knowledge Transferability Score)和跨语言事实知识可迁移性得分(Cross-Lingual Factual Knowledge Transferability Score),用于量化LM在不同语言间的事实回忆与知识迁移能力。结果揭示了当今最先进LM的根本弱点,尤其是在跨语言泛化方面:模型无法在不同语言间有效迁移知识,导致性能随提问语言不同而不一致。我们的发现强调LM需要识别各语言下事实可靠性的差异,并利用跨语言中最可信的信息。我们公开了基准和评估框架,以推动多语言知识迁移的未来研究。
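摘要提出的三个指标在此未给出公式;下面是一个假设性的简化示意:事实回忆得分取某语言下答对事实的比例,可迁移性用“源语言答对的事实中目标语言也答对的比例”来近似,仅供理解指标的大致含义,具体定义以论文为准:

```python
import numpy as np

def factual_recall_score(correct: np.ndarray) -> float:
    """某一语言下被正确回忆的事实比例(correct 为 0/1 数组)。"""
    return float(correct.mean())

def transferability(correct_src: np.ndarray, correct_tgt: np.ndarray) -> float:
    """源语言答对的事实中,目标语言也答对的比例(朴素的迁移度量)。"""
    src_ok = correct_src.astype(bool)
    return float(correct_tgt[src_ok].mean()) if src_ok.any() else 0.0

en = np.array([1, 1, 0, 1, 0, 1])   # 英文提问时每条事实是否答对
sw = np.array([0, 1, 0, 0, 0, 1])   # 斯瓦希里语提问时的结果
print(factual_recall_score(en), factual_recall_score(sw), transferability(en, sw))
```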


SIFT:通过情境贴纸夯实大语言模型的推理基础

  • 标题: SIFT: Grounding LLM Reasoning in Contexts via Stickers

  • 作者: Zihao Zeng, Xuyao Huang, Boxiu Li, Zhijie Deng

  • 日期: 2025-02-19

  • ArXiv主页: https://arxiv.org/abs/2502.14922

  • 论文链接: https://arxiv.org/pdf/2502.14922

  • gitHub仓库: https://github.com/zhijie-group/SIFT

英文摘要

This paper identifies the misinterpretation of the context can be a significant issue during the reasoning process of large language models, spanning from smaller models like Llama3.2-3B-Instruct to cutting-edge ones like DeepSeek-R1. For example, in the phrase “10 dollars per kilo,” LLMs might not recognize that “per” means “for each,” leading to calculation errors. We introduce a novel, post-training approach called Stick to the Facts (SIFT) to tackle this. SIFT leverages increasing inference-time compute to ground LLM reasoning in contexts. At the core of SIFT lies the Sticker, which is generated by the model itself to explicitly emphasize the key information within the context. Given the curated Sticker, SIFT generates two predictions – one from the original query and one from the query augmented with the Sticker. If they differ, the Sticker is sequentially refined via forward optimization (to better align the extracted facts with the query) and inverse generation (to conform with the model’s inherent tendencies) for more faithful reasoning outcomes. Studies across diverse models (from 3B to 100B+) and benchmarks (e.g., GSM8K, MATH-500) reveal consistent performance improvements. Notably, SIFT improves the pass@1 accuracy of DeepSeek-R1 on AIME2024 from 78.33% to 85.67%, establishing a new state-of-the-art in the open-source community. The code is available at https://github.com/zhijie-group/SIFT.

中文摘要

本文发现,大语言模型(从Llama3.2-3B-Instruct等较小模型到DeepSeek-R1等前沿模型)在推理过程中可能出现严重的上下文误读问题。以“10美元/公斤”为例,模型可能无法正确理解“per”表示“每单位”,从而导致计算错误。为解决该问题,我们提出一种名为“事实锚定”(SIFT, Stick to the Facts)的后训练方法。该方法通过增加推理时计算,将大模型的推理过程锚定于上下文。其核心是由模型自行生成的“贴纸”(Sticker),用于显式强调上下文中的关键信息。给定当前的Sticker,SIFT会生成两个预测:一个来自原始查询,另一个来自附加了Sticker的查询。若两者不一致,则依次通过前向优化(使提取的事实与查询更贴合)和逆向生成(使其符合模型固有倾向)来修正Sticker,以获得更忠实的推理结果。我们在从3B到100B+的多种模型及GSM8K、MATH-500等基准上的实验显示出一致的性能提升。特别地,SIFT将DeepSeek-R1在AIME2024上的pass@1准确率从78.33%提升至85.67%,创下开源社区新纪录。代码见 https://github.com/zhijie-group/SIFT。
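下面用带桩函数的Python片段示意SIFT的基本流程(生成Sticker、两路预测、不一致时迭代修正);提示词与 llm 接口均为假设,实际实现见官方仓库:

```python
def sift_answer(query: str, llm, max_refine: int = 2) -> str:
    """SIFT 推理流程示意:先让模型生成强调关键事实的 Sticker,
    再分别基于原问题与“问题 + Sticker”作答,两个答案不一致时迭代修正 Sticker。"""
    sticker = llm(f"提取问题中的关键事实: {query}")
    for _ in range(max_refine + 1):
        ans_plain = llm(f"回答: {query}")
        ans_sticker = llm(f"回答: {query}\n关键事实: {sticker}")
        if ans_plain == ans_sticker:            # 两路预测一致,直接返回
            return ans_sticker
        sticker = llm(f"修正关键事实,使其更贴合问题: {query}\n当前: {sticker}")
    return ans_sticker                          # 仍不一致时返回基于 Sticker 的答案

# 用一个简单的桩函数演示接口
print(sift_answer("每公斤 10 美元,买 3 公斤要多少钱?", llm=lambda prompt: "30 美元"))
```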

