
【论文速递】2025年08周 (Robotics/Embodied AI/LLM)

目录

  • MLGym:推进AI研究代理的新框架和基准
    • 英文摘要
    • 中文摘要
  • Qwen2.5-VL技术报告
    • 英文摘要
    • 中文摘要
  • 原生稀疏注意力:硬件对齐且原生可训练的稀疏注意力机制
    • 英文摘要
    • 中文摘要
  • SigLIP 2:具有改进的语义理解、定位和密集特征的多语言视觉-语言编码器
    • 英文摘要
    • 中文摘要
  • 大语言扩散模型
    • 英文摘要
    • 中文摘要
  • SuperGPQA:跨285个研究生学科的LLM评估
    • 英文摘要
    • 中文摘要
  • 在不影响大语言模型性能的前提下,你能向LoRA适配器中注入多少知识?
    • 英文摘要
    • 中文摘要
  • Soundwave:在大语言模型中更少即是更好——语音与文本对齐的新方法
    • 英文摘要
    • 中文摘要
  • 将1568个Tokens 压缩到一个向量中,再还原回来:探索嵌入空间容量的极限
    • 英文摘要
    • 中文摘要
  • S*:代码生成的测试时扩展
    • 英文摘要
    • 中文摘要
  • Phantom:通过跨模态对齐实现主体一致的视频生成
    • 英文摘要
    • 中文摘要
  • Magma:多模态AI代理的基础模型
    • 英文摘要
    • 中文摘要
  • 过度思考的危险:检查代理任务中的推理行动困境
    • 英文摘要
    • 中文摘要
  • Step-Video-T2V技术报告:视频基础模型的实践,挑战和未来
    • 英文摘要
    • 中文摘要
  • 语言建模的连续扩散模型
    • 英文摘要
    • 中文摘要
  • 扩散Transformers的区域自适应采样
    • 英文摘要
    • 中文摘要
  • Logic-RL:通过基于规则的强化学习释放LLM推理
    • 英文摘要
    • 中文摘要
  • 关于生成基础模型的可信赖性:指南,评估和观点
    • 英文摘要
    • 中文摘要
  • SWE-Lancer:前沿LLM能否从真实世界的自由职业软件工程中赚到100万美元?
    • 英文摘要
    • 中文摘要
  • ZeroBench:当代大型多模态模型的一个不可能的视觉基准测试
    • 英文摘要
    • 中文摘要
  • SongGen:用于文本到歌曲生成的单阶段自回归Transformer
    • 英文摘要
    • 中文摘要
  • 学习现实世界人形机器人的站立恢复策略
    • 英文摘要
    • 中文摘要
  • RAD:通过基于大规模3DGS的强化学习训练端到端驾驶策略
    • 英文摘要
    • 中文摘要
  • 多模态Mamba:通过二次到线性蒸馏实现的仅解码器多模态状态空间模型
    • 英文摘要
    • 中文摘要
  • 通过主成分分析重新思考多样化的人类偏好学习
    • 英文摘要
    • 中文摘要
  • 你并未充分利用Transformer的表示能力
    • 英文摘要
    • 中文摘要

MLGym:推进AI研究代理的新框架和基准

  • 标题: MLGym: A New Framework and Benchmark for Advancing AI Research Agents

  • 作者: Deepak Nathani, Lovish Madaan, Nicholas Roberts, Nikolay Bashlykov, Ajay Menon, Vincent Moens, Amar Budhiraja, Despoina Magka, Vladislav Vorotilov, Gaurav Chaurasia, Dieuwke Hupkes, Ricardo Silveira Cabral, Tatiana Shavrina, Jakob Foerster, Yoram Bachrach, William Yang Wang, Roberta Raileanu

  • 日期: 2025-02-20

  • ArXiv主页: https://arxiv.org/abs/2502.14499

  • 论文链接: https://arxiv.org/pdf/2502.14499

  • GitHub仓库: https://github.com/facebookresearch/MLGym

英文摘要

We introduce Meta MLGym and MLGym-Bench, a new framework and benchmark for evaluating and developing LLM agents on AI research tasks. This is the first Gym environment for machine learning (ML) tasks, enabling research on reinforcement learning (RL) algorithms for training such agents. MLGym-bench consists of 13 diverse and open-ended AI research tasks from diverse domains such as computer vision, natural language processing, reinforcement learning, and game theory. Solving these tasks requires real-world AI research skills such as generating new ideas and hypotheses, creating and processing data, implementing ML methods, training models, running experiments, analyzing the results, and iterating through this process to improve on a given task. We evaluate a number of frontier large language models (LLMs) on our benchmarks such as Claude-3.5-Sonnet, Llama-3.1 405B, GPT-4o, o1-preview, and Gemini-1.5 Pro. Our MLGym framework makes it easy to add new tasks, integrate and evaluate models or agents, generate synthetic data at scale, as well as develop new learning algorithms for training agents on AI research tasks. We find that current frontier models can improve on the given baselines, usually by finding better hyperparameters, but do not generate novel hypotheses, algorithms, architectures, or substantial improvements. We open-source our framework and benchmark to facilitate future research in advancing the AI research capabilities of LLM agents.

中文摘要

我们提出Meta MLGym与MLGym-Bench,这是一个用于在AI研究任务上评估和开发LLM智能体的新框架与基准。它是首个面向机器学习(ML)任务的Gym环境,使研究用于训练此类智能体的强化学习(RL)算法成为可能。MLGym-Bench包含13个多样化、开放式的AI研究任务,涵盖计算机视觉、自然语言处理、强化学习和博弈论等领域。解决这些任务需要真实的AI研究技能,例如提出新想法与假设、构建与处理数据、实现ML方法、训练模型、运行实验、分析结果,并通过迭代不断改进。我们在该基准上评估了多款前沿大语言模型(LLM),包括Claude-3.5-Sonnet、Llama-3.1 405B、GPT-4o、o1-preview和Gemini-1.5 Pro。MLGym框架便于添加新任务、集成与评估模型或智能体、大规模生成合成数据,以及开发用于在AI研究任务上训练智能体的新学习算法。我们发现,当前前沿模型通常能通过找到更好的超参数来改进给定基线,但无法提出新颖的假设、算法、架构或带来实质性改进。我们开源了该框架与基准,以促进提升LLM智能体AI研究能力的后续工作。


Qwen2.5-VL技术报告

  • 标题: Qwen2.5-VL Technical Report

  • 作者: Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, Junyang Lin

  • 日期: 2025-02-19

  • ArXiv主页: https://arxiv.org/abs/2502.13923

  • 论文链接: https://arxiv.org/pdf/2502.13923

  • GitHub仓库: https://github.com/QwenLM/Qwen2.5-VL

英文摘要

We introduce Qwen2.5-VL, the latest flagship model of Qwen vision-language series, which demonstrates significant advancements in both foundational capabilities and innovative functionalities. Qwen2.5-VL achieves a major leap forward in understanding and interacting with the world through enhanced visual recognition, precise object localization, robust document parsing, and long-video comprehension. A standout feature of Qwen2.5-VL is its ability to localize objects using bounding boxes or points accurately. It provides robust structured data extraction from invoices, forms, and tables, as well as detailed analysis of charts, diagrams, and layouts. To handle complex inputs, Qwen2.5-VL introduces dynamic resolution processing and absolute time encoding, enabling it to process images of varying sizes and videos of extended durations (up to hours) with second-level event localization. This allows the model to natively perceive spatial scales and temporal dynamics without relying on traditional normalization techniques. By training a native dynamic-resolution Vision Transformer (ViT) from scratch and incorporating Window Attention, we reduce computational overhead while maintaining native resolution. As a result, Qwen2.5-VL excels not only in static image and document understanding but also as an interactive visual agent capable of reasoning, tool usage, and task execution in real-world scenarios such as operating computers and mobile devices. Qwen2.5-VL is available in three sizes, addressing diverse use cases from edge AI to high-performance computing. The flagship Qwen2.5-VL-72B model matches state-of-the-art models like GPT-4o and Claude 3.5 Sonnet, particularly excelling in document and diagram understanding. Additionally, Qwen2.5-VL maintains robust linguistic performance, preserving the core language competencies of the Qwen2.5 LLM.

中文摘要

我们介绍Qwen2.5-VL,这是Qwen视觉-语言系列的最新旗舰模型,在基础能力与创新功能方面均取得显著进步。Qwen2.5-VL通过增强的视觉识别、精确的目标定位、稳健的文档解析以及长视频理解,在理解世界并与之交互方面实现了重大飞跃。其突出特点是能够使用边界框或点精确定位物体,并能从发票、表单和表格中稳健地抽取结构化数据,同时对图表、示意图和版面进行细致分析。为处理复杂输入,Qwen2.5-VL引入了动态分辨率处理与绝对时间编码,使其能够处理不同尺寸的图像以及长达数小时的视频,并实现秒级的事件定位。这使模型无需依赖传统的归一化技巧即可原生感知空间尺度与时间动态。通过从头训练原生动态分辨率的Vision Transformer(ViT)并引入窗口注意力(Window Attention),我们在保持原生分辨率的同时降低了计算开销。因此,Qwen2.5-VL不仅擅长静态图像与文档理解,还可作为交互式视觉智能体,在操作电脑和移动设备等真实场景中完成推理、工具使用与任务执行。Qwen2.5-VL提供三种规模,覆盖从边缘AI到高性能计算的多样化场景。旗舰模型Qwen2.5-VL-72B可与GPT-4o和Claude 3.5 Sonnet等最先进模型相媲美,尤其在文档与图表理解上表现突出。此外,Qwen2.5-VL保持了强劲的语言表现,保留了Qwen2.5 LLM的核心语言能力。


原生稀疏注意力:硬件对齐且原生可训练的稀疏注意力机制

  • 标题: Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention
  • 作者: Jingyang Yuan, Huazuo Gao, Damai Dai, Junyu Luo, Liang Zhao, Zhengyan Zhang, Zhenda Xie, Y. X. Wei, Lean Wang, Zhiping Xiao, Yuqing Wang, Chong Ruan, Ming Zhang, Wenfeng Liang, Wangding Zeng
  • 日期: 2025-02-16
  • ArXiv主页: https://arxiv.org/abs/2502.11089
  • 论文链接: https://arxiv.org/pdf/2502.11089

英文摘要

Long-context modeling is crucial for next-generation language models, yet the high computational cost of standard attention mechanisms poses significant computational challenges. Sparse attention offers a promising direction for improving efficiency while maintaining model capabilities. We present NSA, a Natively trainable Sparse Attention mechanism that integrates algorithmic innovations with hardware-aligned optimizations to achieve efficient long-context modeling. NSA employs a dynamic hierarchical sparse strategy, combining coarse-grained token compression with fine-grained token selection to preserve both global context awareness and local precision. Our approach advances sparse attention design with two key innovations: (1) We achieve substantial speedups through arithmetic intensity-balanced algorithm design, with implementation optimizations for modern hardware. (2) We enable end-to-end training, reducing pretraining computation without sacrificing model performance. As shown in Figure 1, experiments show the model pretrained with NSA maintains or exceeds Full Attention models across general benchmarks, long-context tasks, and instruction-based reasoning. Meanwhile, NSA achieves substantial speedups over Full Attention on 64k-length sequences across decoding, forward propagation, and backward propagation, validating its efficiency throughout the model lifecycle.

中文摘要

长上下文建模对下一代语言模型至关重要,但标准注意力机制的高计算成本带来了巨大的计算挑战。稀疏注意力为在保持模型能力的同时提升效率提供了一个有前景的方向。我们提出NSA,一种原生可训练的稀疏注意力机制,将算法创新与硬件对齐的优化相结合,以实现高效的长上下文建模。NSA采用动态分层稀疏策略,将粗粒度的token压缩与细粒度的token选择相结合,兼顾全局上下文感知与局部精度。我们的方法通过两项关键创新推进了稀疏注意力设计:(1)通过算术强度均衡的算法设计并针对现代硬件进行实现优化,获得显著加速;(2)支持端到端训练,在不牺牲模型性能的前提下减少预训练计算量。如图1所示,实验表明,使用NSA预训练的模型在通用基准、长上下文任务和基于指令的推理上均达到或超过全注意力(Full Attention)模型。同时,在64k长度序列的解码、前向传播与反向传播中,NSA相对全注意力均取得显著加速,验证了其在整个模型生命周期中的效率。

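为了更直观地理解“粗粒度token压缩 + 细粒度token选择”这一分层稀疏思路,下面给出一个极简的玩具草图(仅为示意性实现,并非NSA官方代码;块大小、选块数量等均为假设值,且省略了因果掩码、滑动窗口分支和硬件层面的优化):

```python
# 玩具版分层稀疏注意力:先用块级"摘要键"粗筛,再只在选中的块内做细粒度注意力
import torch
import torch.nn.functional as F

def toy_hierarchical_sparse_attention(q, k, v, block_size=16, top_blocks=4):
    """q, k, v: [T, d],单头、单样本的简化情形。"""
    T, d = k.shape
    n_blocks = T // block_size
    k_blocks = k[: n_blocks * block_size].reshape(n_blocks, block_size, d)
    v_blocks = v[: n_blocks * block_size].reshape(n_blocks, block_size, d)

    # 1) 粗粒度:把每个块压缩成一个摘要键
    k_coarse = k_blocks.mean(dim=1)                              # [n_blocks, d]

    # 2) 用粗粒度分数为每个查询选出得分最高的若干块
    coarse_scores = q @ k_coarse.T / d ** 0.5                    # [T, n_blocks]
    sel = coarse_scores.topk(min(top_blocks, n_blocks), dim=-1).indices

    out = torch.zeros_like(q)
    for t in range(T):
        ks = k_blocks[sel[t]].reshape(-1, d)                     # 仅收集被选中块内的键/值
        vs = v_blocks[sel[t]].reshape(-1, d)
        attn = F.softmax(q[t] @ ks.T / d ** 0.5, dim=-1)
        out[t] = attn @ vs
    return out

q = torch.randn(128, 64); k = torch.randn(128, 64); v = torch.randn(128, 64)
print(toy_hierarchical_sparse_attention(q, k, v).shape)          # torch.Size([128, 64])
```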

SigLIP 2:具有改进的语义理解、定位和密集特征的多语言视觉-语言编码器

  • 标题: SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features
  • 作者: Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier Hénaff, Jeremiah Harmsen, Andreas Steiner, Xiaohua Zhai
  • 日期: 2025-02-20
  • ArXiv主页: https://arxiv.org/abs/2502.14786
  • 论文链接: https://arxiv.org/pdf/2502.14786
  • 项目链接: https://github.com/google-research/big_vision/blob/main/big_vision/configs/proj/image_text/README_siglip2.md
  • GitHub仓库: https://github.com/google-research/big_vision

英文摘要

We introduce SigLIP 2, a family of new multilingual vision-language encoders that build on the success of the original SigLIP. In this second iteration, we extend the original image-text training objective with several prior, independently developed techniques into a unified recipe – this includes captioning-based pretraining, self-supervised losses (self-distillation, masked prediction) and online data curation. With these changes, SigLIP 2 models outperform their SigLIP counterparts at all model scales in core capabilities, including zero-shot classification, image-text retrieval, and transfer performance when extracting visual representations for Vision-Language Models (VLMs). Furthermore, the new training recipe leads to significant improvements on localization and dense prediction tasks. We also train variants which support multiple resolutions and preserve the input’s native aspect ratio. Finally, we train on a more diverse data-mixture that includes de-biasing techniques, leading to much better multilingual understanding and improved fairness. To allow users to trade off inference cost with performance, we release model checkpoints at four sizes: ViT-B (86M), L (303M), So400m (400M), and g (1B).

中文摘要

我们介绍SigLIP 2,这是一系列在原始SigLIP成功基础上构建的新型多语言视觉-语言编码器。在这一迭代中,我们将原有的图文训练目标与若干此前独立提出的技术整合为统一的训练配方,包括基于字幕的预训练、自监督损失(自蒸馏、掩码预测)以及在线数据筛选。经过这些改进,SigLIP 2模型在所有规模上的核心能力均优于对应的SigLIP模型,包括零样本分类、图文检索,以及在为视觉-语言模型(VLM)提取视觉表示时的迁移性能。此外,新的训练配方在定位与密集预测任务上带来了显著提升。我们还训练了支持多种分辨率并保留输入原始宽高比的变体。最后,我们在更加多样化、包含去偏技术的数据混合上进行训练,从而获得更好的多语言理解能力和更高的公平性。为了让用户在推理成本与性能之间进行权衡,我们发布了四种规模的模型检查点:ViT-B(86M)、L(303M)、So400m(400M)和g(1B)。


大语言扩散模型

  • 标题: Large Language Diffusion Models
  • 作者: Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, Chongxuan Li
  • 日期: 2025-02-14
  • ArXiv主页: https://arxiv.org/abs/2502.09992
  • 论文链接: https://arxiv.org/pdf/2502.09992
  • 项目链接: https://ml-gsai.github.io/LLaDA-demo/

英文摘要

Autoregressive models (ARMs) are widely regarded as the cornerstone of large language models (LLMs). We challenge this notion by introducing LLaDA, a diffusion model trained from scratch under the pre-training and supervised fine-tuning (SFT) paradigm. LLaDA models distributions through a forward data masking process and a reverse process, parameterized by a vanilla Transformer to predict masked tokens. By optimizing a likelihood bound, it provides a principled generative approach for probabilistic inference. Across extensive benchmarks, LLaDA demonstrates strong scalability, outperforming our self-constructed ARM baselines. Remarkably, LLaDA 8B is competitive with strong LLMs like LLaMA3 8B in in-context learning and, after SFT, exhibits impressive instruction-following abilities in case studies such as multi-turn dialogue. Moreover, LLaDA addresses the reversal curse, surpassing GPT-4o in a reversal poem completion task. Our findings establish diffusion models as a viable and promising alternative to ARMs, challenging the assumption that key LLM capabilities discussed above are inherently tied to ARMs.

中文摘要

自回归模型(ARM)被普遍视为大语言模型(LLM)的基石。我们通过提出LLaDA来挑战这一观念,它是一种在预训练与监督微调(SFT)范式下从头训练的扩散模型。LLaDA通过一个前向数据掩码过程和一个反向过程来建模分布,反向过程由一个原版(vanilla)Transformer参数化,用于预测被掩码的token。通过优化似然界,它为概率推断提供了一种有原则的生成方法。在大量基准上,LLaDA展现出很强的可扩展性,优于我们自行构建的ARM基线。值得注意的是,LLaDA 8B在上下文学习中可与LLaMA3 8B等强大的LLM相竞争,并且在SFT之后,在多轮对话等案例研究中表现出令人印象深刻的指令遵循能力。此外,LLaDA解决了逆转诅咒,在诗句逆向补全任务中超越了GPT-4o。我们的发现确立了扩散模型作为ARM的可行且有前景的替代方案,并对“上述关键LLM能力必然与ARM绑定”这一假设提出了挑战。

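下面用一个极简草图示意“前向掩码 + 反向预测被掩码token”的扩散式语言建模训练目标(仅为示意性实现,并非LLaDA官方代码;模型规模、损失加权形式等均为简化假设):

```python
# 玩具版基于掩码的扩散式语言建模训练目标
import torch
import torch.nn as nn
import torch.nn.functional as F

MASK_ID = 0                         # 假设词表中0号token用作[MASK]
VOCAB, DIM, T = 1000, 128, 32

model = nn.TransformerEncoder(      # 用一个双向Transformer近似"vanilla Transformer掩码预测器"
    nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True), num_layers=2)
embed = nn.Embedding(VOCAB, DIM)
head = nn.Linear(DIM, VOCAB)

def diffusion_lm_loss(x0):
    """x0: [B, T] 干净token序列。前向过程按比例t独立掩码;反向模型预测被掩码位置的原token;
    损失按1/t加权(似然界的常见形式,此处为简化示意)。"""
    B = x0.size(0)
    t = torch.rand(B, 1).clamp(min=1e-3)        # 每个样本一个掩码比例
    mask = torch.rand(B, T) < t                 # True 表示被掩码
    xt = torch.where(mask, torch.full_like(x0, MASK_ID), x0)
    logits = head(model(embed(xt)))             # [B, T, VOCAB]
    ce = F.cross_entropy(logits.transpose(1, 2), x0, reduction="none")  # [B, T]
    return ((ce * mask) / t).sum() / mask.sum().clamp(min=1)

x0 = torch.randint(1, VOCAB, (4, T))
print(diffusion_lm_loss(x0))
```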

SuperGPQA:跨285个研究生学科的LLM评估

  • 标题: SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines
  • 作者: M-A-P Team, Xinrun Du, Yifan Yao, Kaijing Ma, Bingli Wang, Tianyu Zheng, Kang Zhu, Minghao Liu, Yiming Liang, Xiaolong Jin, Zhenlin Wei, Chujie Zheng, Kaixing Deng, Shuyue Guo, Shian Jia, Sichao Jiang, Yiyan Liao, Rui Li, Qinrui Li, Sirun Li, Yizhi Li, Yunwen Li, Dehua Ma, Yuansheng Ni, Haoran Que, Qiyao Wang, Zhoufutu Wen, Siwei Wu, Tianshun Xing, Ming Xu, Zhenzhu Yang, Zekun Moore Wang, Junting Zhou, Yuelin Bai, Xingyuan Bu, Chenglin Cai, Liang Chen, Yifan Chen, Chengtuo Cheng, Tianhao Cheng, Keyi Ding, Siming Huang, Yun Huang, Yaoru Li, Yizhe Li, Zhaoqun Li, Tianhao Liang, Chengdong Lin, Hongquan Lin, Yinghao Ma, Zhongyuan Peng, Zifan Peng, Qige Qi, Shi Qiu, Xingwei Qu, Yizhou Tan, Zili Wang, Chenqing Wang, Hao Wang, Yiya Wang, Yubo Wang, Jiajun Xu, Kexin Yang, Ruibin Yuan, Yuanhao Yue, Tianyang Zhan, Chun Zhang, Jingyang Zhang, Xiyue Zhang, Xingjian Zhang, Yue Zhang, Yongchi Zhao, Xiangyu Zheng, Chenghua Zhong, Yang Gao, Zhoujun Li, Dayiheng Liu, Qian Liu, Tianyu Liu, Shiwen Ni, Junran Peng, Yujia Qin, Wenbo Su, Guoyin Wang, Shi Wang, Jian Yang, Min Yang, Meng Cao, Xiang Yue, Zhaoxiang Zhang, Wangchunshu Zhou, Jiaheng Liu, Qunshu Lin, Wenhao Huang, Ge Zhang
  • 日期: 2025-02-20
  • ArXiv主页: https://arxiv.org/abs/2502.14739
  • 论文链接: https://arxiv.org/pdf/2502.14739
  • 项目链接: https://supergpqa.github.io/

英文摘要

Large language models (LLMs) have demonstrated remarkable proficiency in mainstream academic disciplines such as mathematics, physics, and computer science. However, human knowledge encompasses over 200 specialized disciplines, far exceeding the scope of existing benchmarks. The capabilities of LLMs in many of these specialized fields-particularly in light industry, agriculture, and service-oriented disciplines-remain inadequately evaluated. To address this gap, we present SuperGPQA, a comprehensive benchmark that evaluates graduate-level knowledge and reasoning capabilities across 285 disciplines. Our benchmark employs a novel Human-LLM collaborative filtering mechanism to eliminate trivial or ambiguous questions through iterative refinement based on both LLM responses and expert feedback. Our experimental results reveal significant room for improvement in the performance of current state-of-the-art LLMs across diverse knowledge domains (e.g., the reasoning-focused model DeepSeek-R1 achieved the highest accuracy of 61.82% on SuperGPQA), highlighting the considerable gap between current model capabilities and artificial general intelligence. Additionally, we present comprehensive insights from our management of a large-scale annotation process, involving over 80 expert annotators and an interactive Human-LLM collaborative system, offering valuable methodological guidance for future research initiatives of comparable scope.

中文摘要

大语言模型(LLM)在数学、物理和计算机科学等主流学科中已展现出卓越的能力。然而,人类知识涵盖200多个专业学科,远超现有基准的覆盖范围。LLM在轻工业、农业和面向服务的学科等许多专业领域的能力仍缺乏充分评估。为填补这一空白,我们提出SuperGPQA,一个评估285个学科研究生水平知识与推理能力的综合基准。该基准采用一种新颖的人与LLM协同过滤机制,基于LLM的作答与专家反馈进行迭代筛选,剔除琐碎或含义不清的问题。实验结果显示,当前最先进的LLM在不同知识领域仍有显著提升空间(例如,以推理为核心的DeepSeek-R1在SuperGPQA上取得了61.82%的最高准确率),凸显了当前模型能力与通用人工智能之间的巨大差距。此外,我们分享了管理大规模标注流程(涉及80余名专家标注者和一个交互式人与LLM协同系统)所获得的全面经验,为未来类似规模的研究工作提供有价值的方法学指导。


在不影响大语言模型性能的前提下,你能向LoRA适配器中注入多少知识?

  • 标题: How Much Knowledge Can You Pack into a LoRA Adapter without Harming LLM?

  • 作者: Sergey Pletenev, Maria Marina, Daniil Moskovskiy, Vasily Konovalov, Pavel Braslavski, Alexander Panchenko, Mikhail Salnikov

  • 日期: 2025-02-20

  • ArXiv主页: https://arxiv.org/abs/2502.14502

  • 论文链接: https://arxiv.org/pdf/2502.14502

  • GitHub仓库: https://github.com/AIRI-Institute/knowledge-packing

英文摘要

The performance of Large Language Models (LLMs) on many tasks is greatly limited by the knowledge learned during pre-training and stored in the model’s parameters. Low-rank adaptation (LoRA) is a popular and efficient training technique for updating or domain-specific adaptation of LLMs. In this study, we investigate how new facts can be incorporated into the LLM using LoRA without compromising the previously learned knowledge. We fine-tuned Llama-3.1-8B-instruct using LoRA with varying amounts of new knowledge. Our experiments have shown that the best results are obtained when the training data contains a mixture of known and new facts. However, this approach is still potentially harmful because the model’s performance on external question-answering benchmarks declines after such fine-tuning. When the training data is biased towards certain entities, the model tends to regress to few overrepresented answers. In addition, we found that the model becomes more confident and refuses to provide an answer in only few cases. These findings highlight the potential pitfalls of LoRA-based LLM updates and underscore the importance of training data composition and tuning parameters to balance new knowledge integration and general model capabilities.

中文摘要

大语言模型(LLM)在许多任务上的表现,很大程度上受限于预训练期间学到并存储在模型参数中的知识。低秩适配(LoRA)是一种流行且高效的训练技术,用于LLM的更新或领域适配。本研究探讨如何使用LoRA将新事实注入LLM,而不损害其已有知识。我们使用LoRA对Llama-3.1-8B-instruct进行了不同新知识量的微调。实验表明,当训练数据同时包含已知事实和新事实的混合时,效果最佳。然而,这种做法仍可能带来损害:经过此类微调后,模型在外部问答基准上的表现会下降。当训练数据偏向某些实体时,模型倾向于退化到少数过度出现的答案。此外,我们发现模型会变得更加自信,只在极少数情况下拒绝作答。这些发现揭示了基于LoRA的LLM更新的潜在陷阱,并强调了训练数据构成与调参对于平衡新知识注入和模型通用能力的重要性。

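论文关注的是向LoRA适配器注入新知识的效果;作为背景,下面给出LoRA机制本身的极简草图(假设性示例,秩r、alpha等均为示意值;实际实验通常借助peft等库,并按论文结论在训练数据中混合已知与新事实):

```python
# 极简LoRA线性层:冻结原权重W,仅训练低秩增量BA
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                     # 冻结预训练权重
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # B初始化为0,初始增量为0
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

# 用法:把模型中的某个线性层替换为LoRA版本,之后只训练A、B
layer = LoRALinear(nn.Linear(512, 512))
x = torch.randn(2, 512)
print(layer(x).shape)                                   # torch.Size([2, 512])
```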

Soundwave:在大语言模型中更少即是更好——语音与文本对齐的新方法

  • 标题: Soundwave: Less is More for Speech-Text Alignment in LLMs

  • 作者: Yuhao Zhang, Zhiheng Liu, Fan Bu, Ruiyu Zhang, Benyou Wang, Haizhou Li

  • 日期: 2025-02-18

  • ArXiv主页: https://arxiv.org/abs/2502.12900

  • 论文链接: https://arxiv.org/pdf/2502.12900

  • GitHub仓库: https://github.com/FreedomIntelligence/Soundwave

英文摘要

Existing end-to-end speech large language models (LLMs) usually rely on large-scale annotated data for training, while data-efficient training has not been discussed in depth. We focus on two fundamental problems between speech and text: the representation space gap and sequence length inconsistency. We propose Soundwave, which utilizes an efficient training strategy and a novel architecture to address these issues. Results show that Soundwave outperforms the advanced Qwen2-Audio in speech translation and AIR-Bench speech tasks, using only one-fiftieth of the training data. Further analysis shows that Soundwave still retains its intelligence during conversation. The project is available at https://github.com/FreedomIntelligence/Soundwave.

中文摘要

现有的端到端语音大语言模型(LLM)通常依赖大规模标注数据进行训练,而数据高效训练尚未得到深入讨论。我们聚焦语音与文本之间的两个基本问题:表示空间差距与序列长度不一致。我们提出Soundwave,它利用高效的训练策略与新颖的架构来解决这些问题。结果表明,Soundwave仅使用五十分之一的训练数据,就在语音翻译和AIR-Bench语音任务上超越了先进的Qwen2-Audio。进一步分析表明,Soundwave在对话中仍保持其智能。项目地址:https://github.com/FreedomIntelligence/Soundwave。


将1568个Tokens 压缩到一个向量中,再还原回来:探索嵌入空间容量的极限

  • 标题: Cramming 1568 Tokens into a Single Vector and Back Again: Exploring the Limits of Embedding Space Capacity
  • 作者: Yuri Kuratov, Mikhail Arkhipov, Aydar Bulatov, Mikhail Burtsev
  • 日期: 2025-02-18
  • ArXiv主页: https://arxiv.org/abs/2502.13063
  • 论文链接: https://arxiv.org/pdf/2502.13063

英文摘要

A range of recent works addresses the problem of compression of sequence of tokens into a shorter sequence of real-valued vectors to be used as inputs instead of token embeddings or key-value cache. These approaches allow to reduce the amount of compute in existing language models. Despite relying on powerful models as encoders, the maximum attainable lossless compression ratio is typically not higher than x10. This fact is highly intriguing because, in theory, the maximum information capacity of large real-valued vectors is far beyond the presented rates even for 16-bit precision and a modest vector size. In this work, we explore the limits of compression by replacing the encoder with a per-sample optimization procedure. We show that vectors with compression ratios up to x1500 exist, which highlights two orders of magnitude gap between existing and practically attainable solutions. Furthermore, we empirically show that the compression limits are determined not by the length of the input but by the amount of uncertainty to be reduced, namely, the cross-entropy loss on this sequence without any conditioning. The obtained limits highlight the substantial gap between the theoretical capacity of input embeddings and their practical utilization, suggesting significant room for optimization in model design.

中文摘要

近期一系列工作研究了将token序列压缩为更短的实值向量序列的问题,这些向量可替代token嵌入或键值缓存作为输入,从而降低现有语言模型的计算量。尽管这些方法依赖强大的模型作为编码器,其可达到的最大无损压缩比通常不超过10倍。这一事实颇为耐人寻味:理论上,即使在16位精度和适中的向量维度下,大型实值向量的最大信息容量也远超已有结果。在这项工作中,我们用逐样本(per-sample)优化过程替代编码器,以探索压缩的极限。我们证明存在压缩比高达1500倍的向量,凸显了现有方案与实际可达上限之间两个数量级的差距。此外,我们通过实验证明,压缩极限并非由输入长度决定,而是由需要消除的不确定性总量决定,即该序列在无任何条件信息下的交叉熵损失。所得极限揭示了输入嵌入的理论容量与其实际利用率之间的巨大差距,表明模型设计仍有很大的优化空间。

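下面给出“逐样本优化”思路的一个极简草图:冻结一个预训练语言模型,只优化前置的可训练记忆向量,使模型在教师强制下能以接近零的交叉熵还原整段token序列(示意性实现,并非论文官方代码;选用gpt2、步数与学习率等均为假设值):

```python
# 冻结LM,仅优化前置的"记忆向量",用交叉熵衡量能否无损还原序列
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tok = GPT2TokenizerFast.from_pretrained("gpt2")
for p in model.parameters():
    p.requires_grad = False

text = "Sparse attention offers a promising direction for improving efficiency."
ids = tok(text, return_tensors="pt").input_ids              # [1, L]
tok_embeds = model.transformer.wte(ids)                      # [1, L, d]

m = 1                                                        # 记忆向量个数(论文探索压缩到单个向量)
mem = torch.nn.Parameter(torch.randn(1, m, model.config.n_embd) * 0.02)
opt = torch.optim.Adam([mem], lr=1e-2)

labels = torch.cat([torch.full((1, m), -100, dtype=torch.long), ids], dim=1)  # 记忆位置不计损失
for step in range(500):
    opt.zero_grad()
    out = model(inputs_embeds=torch.cat([mem, tok_embeds], dim=1), labels=labels)
    out.loss.backward()                                       # 梯度只流向mem
    opt.step()
    if out.loss.item() < 1e-3:                                # 交叉熵足够低即视为可无损还原
        break
print(step, out.loss.item())
```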

S*:代码生成的测试时扩展

  • 标题: S*: Test Time Scaling for Code Generation
  • 作者: Dacheng Li, Shiyi Cao, Chengkun Cao, Xiuyu Li, Shangyin Tan, Kurt Keutzer, Jiarong Xing, Joseph E. Gonzalez, Ion Stoica
  • 日期: 2025-02-20
  • ArXiv主页: https://arxiv.org/abs/2502.14382
  • 论文链接: https://arxiv.org/pdf/2502.14382
  • 项目链接: https://novasky-ai.github.io/posts/S*/
  • GitHub仓库: https://github.com/NovaSky-AI/SkyThought

英文摘要

Increasing test-time compute for LLMs shows promise across domains but remains underexplored in code generation, despite extensive study in math. In this paper, we propose S*, the first hybrid test-time scaling framework that substantially improves the coverage and selection accuracy of generated code. S* extends the existing parallel scaling paradigm with sequential scaling to push performance boundaries. It further leverages a novel selection mechanism that adaptively generates distinguishing inputs for pairwise comparison, combined with execution-grounded information to robustly identify correct solutions. We evaluate across 12 Large Language Models and Large Reasoning Model and show: (1) S* consistently improves performance across model families and sizes, enabling a 3B model to outperform GPT-4o-mini; (2) S* enables non-reasoning models to surpass reasoning models - GPT-4o-mini with S* outperforms o1-preview by 3.7% on LiveCodeBench; (3) S* further boosts state-of-the-art reasoning models - DeepSeek-R1-Distill-Qwen-32B with S* achieves 85.7% on LiveCodeBench, approaching o1 (high) at 88.5%. Code will be available under https://github.com/NovaSky-AI/SkyThought.

中文摘要

增加LLM的测试时计算量在多个领域展现出潜力,但尽管在数学上已有大量研究,其在代码生成中的探索仍然不足。本文提出S*,这是首个混合式测试时扩展框架,可显著提升生成代码的覆盖率与选择准确率。S*在现有并行扩展范式的基础上引入顺序扩展,以进一步突破性能上限。它还利用一种新颖的选择机制:自适应地生成能区分候选方案的输入进行成对比较,并结合程序执行信息来可靠地识别正确解。我们在12个大语言模型与大型推理模型上进行评估,结果显示:(1)S*在不同模型家族和规模上均能稳定提升性能,使一个3B模型超越GPT-4o-mini;(2)S*使非推理模型超越推理模型:带S*的GPT-4o-mini在LiveCodeBench上比o1-preview高出3.7%;(3)S*还能进一步提升最先进的推理模型:带S*的DeepSeek-R1-Distill-Qwen-32B在LiveCodeBench上达到85.7%,接近o1(high)的88.5%。代码将在https://github.com/NovaSky-AI/SkyThought提供。

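下面的草图示意摘要所述选择机制的两个环节:先用公开测试过滤并行采样得到的候选程序,再用能区分剩余候选的输入做成对比较(示意性实现;真实系统中候选代码与区分输入由LLM生成,并由模型结合执行结果做最终选择,这里用手写代码与固定输入代替):

```python
# 候选过滤 + 区分输入的成对比较(玩具示例)
def run(code: str, func_name: str, arg):
    env = {}
    exec(code, env)                  # 注意:真实系统应在沙箱中执行不可信代码
    return env[func_name](arg)

candidates = [
    "def f(xs):\n    return sorted(xs)[-1]",   # 候选1:正确实现(求最大值)
    "def f(xs):\n    return xs[0]",            # 候选2:错误实现,但恰好通过公开测试
]
public_tests = [([3, 1, 2], 3)]

# 1) 并行扩展:对每个候选执行公开测试
alive = [c for c in candidates if all(run(c, "f", x) == y for x, y in public_tests)]

# 2) 成对比较:构造一个能区分剩余候选的输入(此处手工给定),观察输出差异以辅助选择
if len(alive) > 1:
    distinguishing_input = [1, 5, 2]
    outputs = [run(c, "f", distinguishing_input) for c in alive]
    print("不同候选在区分输入上的输出:", outputs)   # [5, 1],两者可被区分

print("通过公开测试的候选数:", len(alive))
```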

Phantom:通过跨模态对齐实现主体一致的视频生成

  • 标题: Phantom: Subject-consistent video generation via cross-modal alignment
  • 作者: Lijie Liu, Tianxiang Ma, Bingchuan Li, Zhuowei Chen, Jiawei Liu, Qian He, Xinglong Wu
  • 日期: 2025-02-16
  • ArXiv主页: https://arxiv.org/abs/2502.11079
  • 论文链接: https://arxiv.org/pdf/2502.11079
  • 项目链接: https://phantom-video.github.io/Phantom/
  • GitHub仓库: https://github.com/Phantom-video/Phantom

英文摘要

The continuous development of foundational models for video generation is evolving into various applications, with subject-consistent video generation still in the exploratory stage. We refer to this as Subject-to-Video, which extracts subject elements from reference images and generates subject-consistent video through textual instructions. We believe that the essence of subject-to-video lies in balancing the dual-modal prompts of text and image, thereby deeply and simultaneously aligning both text and visual content. To this end, we propose Phantom, a unified video generation framework for both single and multi-subject references. Building on existing text-to-video and image-to-video architectures, we redesign the joint text-image injection model and drive it to learn cross-modal alignment via text-image-video triplet data. In particular, we emphasize subject consistency in human generation, covering existing ID-preserving video generation while offering enhanced advantages. The project homepage is here https://phantom-video.github.io/Phantom/.

中文摘要

视频生成基础模型的持续发展正在演化出各类应用,而主体一致的视频生成仍处于探索阶段。我们将其称为主体到视频(Subject-to-Video):从参考图像中提取主体元素,并通过文本指令生成主体一致的视频。我们认为,主体到视频的本质在于平衡文本与图像这两种模态的提示,从而同时深入地对齐文本与视觉内容。为此,我们提出Phantom,一个同时支持单主体与多主体参考的统一视频生成框架。在现有文本到视频和图像到视频架构的基础上,我们重新设计了联合的文本-图像注入模型,并利用文本-图像-视频三元组数据驱动其学习跨模态对齐。我们特别强调人物生成中的主体一致性,在覆盖现有ID保持视频生成能力的同时提供更强的优势。项目主页:https://phantom-video.github.io/Phantom/。


Magma:多模态AI代理的基础模型

  • 标题: Magma: A Foundation Model for Multimodal AI Agents
  • 作者: Jianwei Yang, Reuben Tan, Qianhui Wu, Ruijie Zheng, Baolin Peng, Yongyuan Liang, Yu Gu, Mu Cai, Seonghyeon Ye, Joel Jang, Yuquan Deng, Lars Liden, Jianfeng Gao
  • 日期: 2025-02-18
  • ArXiv主页: https://arxiv.org/abs/2502.13130
  • 论文链接: https://arxiv.org/pdf/2502.13130
  • 项目链接: https://microsoft.github.io/Magma

英文摘要

We present Magma, a foundation model that serves multimodal AI agentic tasks in both the digital and physical worlds. Magma is a significant extension of vision-language (VL) models in that it not only retains the VL understanding ability (verbal intelligence) of the latter, but is also equipped with the ability to plan and act in the visual-spatial world (spatial-temporal intelligence) and complete agentic tasks ranging from UI navigation to robot manipulation. To endow the agentic capabilities, Magma is pretrained on large amounts of heterogeneous datasets spanning from images, videos to robotics data, where the actionable visual objects (e.g., clickable buttons in GUI) in images are labeled by Set-of-Mark (SoM) for action grounding, and the object movements (e.g., the trace of human hands or robotic arms) in videos are labeled by Trace-of-Mark (ToM) for action planning. Extensive experiments show that SoM and ToM reach great synergy and facilitate the acquisition of spatial-temporal intelligence for our Magma model, which is fundamental to a wide range of tasks as shown in Fig.1. In particular, Magma creates new state-of-the-art results on UI navigation and robotic manipulation tasks, outperforming previous models that are specifically tailored to these tasks. On image and video-related multimodal tasks, Magma also compares favorably to popular large multimodal models that are trained on much larger datasets. We make our model and code public for reproducibility at https://microsoft.github.io/Magma.

中文摘要

我们提出Magma,一个可在数字世界与物理世界中执行多模态AI智能体任务的基础模型。Magma是视觉-语言(VL)模型的重要扩展:它不仅保留了后者的VL理解能力(言语智能),还具备在视觉-空间世界中规划与行动的能力(时空智能),可完成从UI导航到机器人操作等各类智能体任务。为赋予模型这些智能体能力,Magma在涵盖图像、视频到机器人数据的大规模异构数据集上进行预训练:图像中的可操作视觉对象(例如GUI中可点击的按钮)用Set-of-Mark(SoM)标注以实现动作定位,视频中的物体运动(例如人手或机械臂的轨迹)用Trace-of-Mark(ToM)标注以实现动作规划。大量实验表明,SoM与ToM产生了良好的协同作用,促进了Magma模型时空智能的获取,而这正是图1所示多种任务的基础。特别地,Magma在UI导航与机器人操作任务上取得了新的最先进结果,超越了为这些任务专门定制的既有模型。在图像和视频相关的多模态任务上,Magma与在远大于其训练数据规模的数据集上训练的流行大型多模态模型相比也具有竞争力。我们公开了模型与代码以便复现:https://microsoft.github.io/Magma。


过度思考的危险:检查代理任务中的推理行动困境

  • 标题: The Danger of Overthinking: Examining the Reasoning-Action Dilemma in Agentic Tasks
  • 作者: Alejandro Cuadron, Dacheng Li, Wenjie Ma, Xingyao Wang, Yichuan Wang, Siyuan Zhuang, Shu Liu, Luis Gaspar Schroeder, Tian Xia, Huanzhi Mao, Nicholas Thumiger, Aditya Desai, Ion Stoica, Ana Klimovic, Graham Neubig, Joseph E. Gonzalez
  • 日期: 2025-02-12
  • ArXiv主页: https://arxiv.org/abs/2502.08235
  • 论文链接: https://arxiv.org/pdf/2502.08235

英文摘要

Large Reasoning Models (LRMs) represent a breakthrough in AI problem-solving capabilities, but their effectiveness in interactive environments can be limited. This paper introduces and analyzes overthinking in LRMs. A phenomenon where models favor extended internal reasoning chains over environmental interaction. Through experiments on software engineering tasks using SWE Bench Verified, we observe three recurring patterns: Analysis Paralysis, Rogue Actions, and Premature Disengagement. We propose a framework to study these behaviors, which correlates with human expert assessments, and analyze 4018 trajectories. We observe that higher overthinking scores correlate with decreased performance, with reasoning models exhibiting stronger tendencies toward overthinking compared to non-reasoning models. Our analysis reveals that simple efforts to mitigate overthinking in agentic environments, such as selecting the solution with the lower overthinking score, can improve model performance by almost 30% while reducing computational costs by 43%. These results suggest that mitigating overthinking has strong practical implications. We suggest that by leveraging native function-calling capabilities and selective reinforcement learning overthinking tendencies could be mitigated. We also open-source our evaluation framework and dataset to facilitate research in this direction at https://github.com/AlexCuadron/Overthinking.

中文摘要

大型推理模型(LRM)代表了AI解决问题能力的突破,但它们在交互式环境中的有效性可能受限。本文提出并分析了LRM中的过度思考(overthinking)现象:模型偏好延长内部推理链,而不是与环境交互。通过在SWE-bench Verified上的软件工程任务实验,我们观察到三种反复出现的模式:分析瘫痪、越轨行动和过早脱离。我们提出了一个研究这些行为的框架,其评分与人类专家评估相关,并分析了4018条轨迹。我们观察到,过度思考得分越高,性能越差;与非推理模型相比,推理模型表现出更强的过度思考倾向。我们的分析表明,在智能体环境中缓解过度思考的简单做法(例如选择过度思考得分较低的解决方案)可以将模型性能提升近30%,同时将计算成本降低43%。这些结果表明,缓解过度思考具有重要的实用意义。我们认为,利用原生的函数调用能力与选择性强化学习可以缓解过度思考倾向。我们还开源了评估框架与数据集,以促进该方向的研究:https://github.com/AlexCuadron/Overthinking。


Step-Video-T2V技术报告:视频基础模型的实践,挑战和未来

  • 标题: Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model
  • 作者: Guoqing Ma, Haoyang Huang, Kun Yan, Liangyu Chen, Nan Duan, Shengming Yin, Changyi Wan, Ranchen Ming, Xiaoniu Song, Xing Chen, Yu Zhou, Deshan Sun, Deyu Zhou, Jian Zhou, Kaijun Tan, Kang An, Mei Chen, Wei Ji, Qiling Wu, Wen Sun, Xin Han, Yanan Wei, Zheng Ge, Aojie Li, Bin Wang, Bizhu Huang, Bo Wang, Brian Li, Changxing Miao, Chen Xu, Chenfei Wu, Chenguang Yu, Dapeng Shi, Dingyuan Hu, Enle Liu, Gang Yu, Ge Yang, Guanzhe Huang, Gulin Yan, Haiyang Feng, Hao Nie, Haonan Jia, Hanpeng Hu, Hanqi Chen, Haolong Yan, Heng Wang, Hongcheng Guo, Huilin Xiong, Huixin Xiong, Jiahao Gong, Jianchang Wu, Jiaoren Wu, Jie Wu, Jie Yang, Jiashuai Liu, Jiashuo Li, Jingyang Zhang, Junjing Guo, Junzhe Lin, Kaixiang Li, Lei Liu, Lei Xia, Liang Zhao, Liguo Tan, Liwen Huang, Liying Shi, Ming Li, Mingliang Li, Muhua Cheng, Na Wang, Qiaohui Chen, Qinglin He, Qiuyan Liang, Quan Sun, Ran Sun, Rui Wang, Shaoliang Pang, Shiliang Yang, Sitong Liu, Siqi Liu, Shuli Gao, Tiancheng Cao, Tianyu Wang, Weipeng Ming, Wenqing He, Xu Zhao, Xuelin Zhang, Xianfang Zeng, Xiaojia Liu, Xuan Yang, Yaqi Dai, Yanbo Yu, Yang Li, Yineng Deng, Yingming Wang, Yilei Wang, Yuanwei Lu, Yu Chen, Yu Luo, Yuchu Luo, Yuhe Yin, Yuheng Feng, Yuxiang Yang, Zecheng Tang, Zekai Zhang, Zidong Yang, Binxing Jiao, Jiansheng Chen, Jing Li, Shuchang Zhou, Xiangyu Zhang, Xinhao Zhang, Yibo Zhu, Heung-Yeung Shum, Daxin Jiang
  • 日期: 2025-02-14
  • ArXiv主页: https://arxiv.org/abs/2502.10248
  • 论文链接: https://arxiv.org/pdf/2502.10248

英文摘要

We present Step-Video-T2V, a state-of-the-art text-to-video pre-trained model with 30B parameters and the ability to generate videos up to 204 frames in length. A deep compression Variational Autoencoder, Video-VAE, is designed for video generation tasks, achieving 16x16 spatial and 8x temporal compression ratios, while maintaining exceptional video reconstruction quality. User prompts are encoded using two bilingual text encoders to handle both English and Chinese. A DiT with 3D full attention is trained using Flow Matching and is employed to denoise input noise into latent frames. A video-based DPO approach, Video-DPO, is applied to reduce artifacts and improve the visual quality of the generated videos. We also detail our training strategies and share key observations and insights. Step-Video-T2V’s performance is evaluated on a novel video generation benchmark, Step-Video-T2V-Eval, demonstrating its state-of-the-art text-to-video quality when compared with both open-source and commercial engines. Additionally, we discuss the limitations of current diffusion-based model paradigm and outline future directions for video foundation models. We make both Step-Video-T2V and Step-Video-T2V-Eval available at https://github.com/stepfun-ai/Step-Video-T2V. The online version can be accessed from https://yuewen.cn/videos as well. Our goal is to accelerate the innovation of video foundation models and empower video content creators.

中文摘要

我们提出Step-Video-T2V,一个拥有300亿参数、可生成最长204帧视频的最先进文本到视频预训练模型。我们为视频生成任务设计了深度压缩的变分自编码器Video-VAE,实现了16x16的空间压缩比和8x的时间压缩比,同时保持出色的视频重建质量。用户提示词由两个双语文本编码器编码,以同时支持英文与中文。我们使用流匹配(Flow Matching)训练了一个带3D全注意力的DiT,用于将输入噪声去噪为潜变量帧。我们还采用基于视频的DPO方法Video-DPO,以减少伪影并提升生成视频的视觉质量。我们详细介绍了训练策略,并分享了关键的观察与见解。Step-Video-T2V的性能在新的视频生成基准Step-Video-T2V-Eval上进行了评估,与开源和商业引擎相比均展现出最先进的文本到视频质量。此外,我们讨论了当前基于扩散的模型范式的局限,并展望了视频基础模型的未来方向。Step-Video-T2V与Step-Video-T2V-Eval均可在https://github.com/stepfun-ai/Step-Video-T2V获取,在线版本亦可通过https://yuewen.cn/videos访问。我们的目标是加速视频基础模型的创新,并赋能视频内容创作者。


语言建模的连续扩散模型

  • 标题: Continuous Diffusion Model for Language Modeling
  • 作者: Jaehyeong Jo, Sung Ju Hwang
  • 日期: 2025-02-17
  • ArXiv主页: https://arxiv.org/abs/2502.11564
  • 论文链接: https://arxiv.org/pdf/2502.11564

英文摘要

Diffusion models have emerged as a promising alternative to autoregressive models in modeling discrete categorical data. Yet diffusion models that directly work on discrete data space do not fully exploit the power of iterative refinement, as the signals are lost during the transition between discrete states. Existing continuous diffusion models for discrete data have limited performance compared to discrete approaches, and the unclear link between them restricts the development of diffusion models for discrete data. In this work, we propose a continuous diffusion model for language modeling that incorporates the geometry of the underlying categorical distribution. We establish a connection between the discrete diffusion and continuous flow on the statistical manifold, and building on the analogy, we introduce a simple design for the diffusion process that generalizes previous discrete diffusion models. We further propose a simulation-free training framework based on radial symmetry and a simple technique to address the high dimensionality of the manifold. Comprehensive experiments on language modeling benchmarks and other modalities show that our method outperforms existing discrete diffusion models and approaches the performance of autoregressive models. Codes available at https://github.com/harryjo97/RDLM.

中文摘要

在建模离散类别数据方面,扩散模型已成为自回归模型一个有前景的替代方案。然而,直接在离散数据空间上工作的扩散模型并未充分发挥迭代细化的威力,因为信号在离散状态之间的转移过程中会丢失。与离散方法相比,现有针对离散数据的连续扩散模型性能有限,而两者之间联系的不明确也限制了面向离散数据的扩散模型的发展。在这项工作中,我们提出一种结合底层类别分布几何结构的连续扩散语言模型。我们建立了离散扩散与统计流形上连续流之间的联系,并基于这一类比为扩散过程引入了一个简单的设计,它对以往的离散扩散模型进行了推广。我们还提出了基于径向对称性的免模拟训练框架,以及一种应对流形高维性的简单技术。在语言建模基准和其他模态上的综合实验表明,我们的方法优于现有的离散扩散模型,并接近自回归模型的性能。代码见https://github.com/harryjo97/RDLM。


扩散Transformers的区域自适应采样

  • 标题: Region-Adaptive Sampling for Diffusion Transformers
  • 作者: Ziming Liu, Yifan Yang, Chengruidong Zhang, Yiqi Zhang, Lili Qiu, Yang You, Yuqing Yang
  • 日期: 2025-02-14
  • ArXiv主页: https://arxiv.org/abs/2502.10389
  • 论文链接: https://arxiv.org/pdf/2502.10389

英文摘要

Diffusion models (DMs) have become the leading choice for generative tasks across diverse domains. However, their reliance on multiple sequential forward passes significantly limits real-time performance. Previous acceleration methods have primarily focused on reducing the number of sampling steps or reusing intermediate results, failing to leverage variations across spatial regions within the image due to the constraints of convolutional U-Net structures. By harnessing the flexibility of Diffusion Transformers (DiTs) in handling variable number of tokens, we introduce RAS, a novel, training-free sampling strategy that dynamically assigns different sampling ratios to regions within an image based on the focus of the DiT model. Our key observation is that during each sampling step, the model concentrates on semantically meaningful regions, and these areas of focus exhibit strong continuity across consecutive steps. Leveraging this insight, RAS updates only the regions currently in focus, while other regions are updated using cached noise from the previous step. The model’s focus is determined based on the output from the preceding step, capitalizing on the temporal consistency we observed. We evaluate RAS on Stable Diffusion 3 and Lumina-Next-T2I, achieving speedups up to 2.36x and 2.51x, respectively, with minimal degradation in generation quality. Additionally, a user study reveals that RAS delivers comparable qualities under human evaluation while achieving a 1.6x speedup. Our approach makes a significant step towards more efficient diffusion transformers, enhancing their potential for real-time applications.

中文摘要

扩散模型(DM)已成为各领域生成任务的主流选择,但其对多次串行前向传播的依赖显著限制了实时性能。以往的加速方法主要集中在减少采样步数或复用中间结果,由于卷积U-Net结构的限制,未能利用图像内不同空间区域之间的差异。借助扩散Transformer(DiT)处理可变数量token的灵活性,我们提出RAS,一种新颖的、无需训练的采样策略,它根据DiT模型的关注焦点,为图像中的不同区域动态分配不同的采样比例。我们的关键观察是:在每个采样步中,模型会集中关注语义上有意义的区域,且这些关注区域在相邻步骤之间具有很强的连续性。利用这一洞察,RAS仅更新当前处于焦点的区域,其他区域则使用上一步缓存的噪声进行更新;模型的关注焦点由前一步的输出确定,从而利用了我们观察到的时间一致性。我们在Stable Diffusion 3和Lumina-Next-T2I上评估RAS,分别取得最高2.36倍和2.51倍的加速,而生成质量几乎没有下降。此外,一项用户研究表明,RAS在获得1.6倍加速的同时,人工评估质量与原始方法相当。我们的方法朝着更高效的扩散Transformer迈出了重要一步,增强了其在实时应用中的潜力。

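下面给出一个高度简化的示意草图,说明“每步只更新焦点区域、其余区域复用上一步结果”的思想(示意性实现,并非RAS官方代码;这里用上一步的更新幅度粗略代替DiT的关注焦点,也未实现论文中的噪声缓存细节):

```python
# 区域自适应更新的玩具示意:只对"焦点"token做新的去噪更新
import torch

def toy_region_adaptive_step(latent_tokens, denoise_fn, focus_ratio=0.3, cache=None):
    """latent_tokens: [N, D],图像被切成N个patch/token的潜变量;
    denoise_fn: 对全部token做一次去噪更新的函数(此处用简单函数代替DiT)。"""
    updated = denoise_fn(latent_tokens)                     # [N, D]
    if cache is None:                                        # 第一步:全量更新并记录变化量
        return updated, updated - latent_tokens
    change = cache.norm(dim=-1)                              # 上一步各token的变化幅度
    k = max(1, int(focus_ratio * latent_tokens.size(0)))
    focus = change.topk(k).indices                           # 变化最大的token视为焦点
    out = latent_tokens.clone()
    out[focus] = updated[focus]                              # 焦点区域采用新结果
    return out, out - latent_tokens                          # 非焦点区域保持上一步的值

x = torch.randn(64, 16)
x, delta = toy_region_adaptive_step(x, lambda z: z * 0.9)
x, delta = toy_region_adaptive_step(x, lambda z: z * 0.9, cache=delta)
print(x.shape)                                               # torch.Size([64, 16])
```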

Logic-RL:通过基于规则的强化学习释放LLM推理

  • 标题: Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning
  • 作者: Tian Xie, Zitian Gao, Qingnan Ren, Haoming Luo, Yuqian Hong, Bryan Dai, Joey Zhou, Kai Qiu, Zhirong Wu, Chong Luo
  • 日期: 2025-02-20
  • ArXiv主页: https://arxiv.org/abs/2502.14768
  • 论文链接: https://arxiv.org/pdf/2502.14768

英文摘要

Inspired by the success of DeepSeek-R1, we explore the potential of rule-based reinforcement learning (RL) in large reasoning models. To analyze reasoning dynamics, we use synthetic logic puzzles as training data due to their controllable complexity and straightforward answer verification. We make some key technical contributions that lead to effective and stable RL training: a system prompt that emphasizes the thinking and answering process, a stringent format reward function that penalizes outputs for taking shortcuts, and a straightforward training recipe that achieves stable convergence. Our 7B model develops advanced reasoning skills-such as reflection, verification, and summarization-that are absent from the logic corpus. Remarkably, after training on just 5K logic problems, it demonstrates generalization abilities to the challenging math benchmarks AIME and AMC.

中文摘要

受DeepSeek-R1成功的启发,我们探索基于规则的强化学习(RL)在大型推理模型中的潜力。为了分析推理动态,我们选用合成逻辑谜题作为训练数据,因为其复杂度可控且答案易于验证。我们做出了若干关键的技术贡献,使RL训练有效且稳定:一个强调思考与作答过程的系统提示词,一个对走捷径的输出进行惩罚的严格格式奖励函数,以及一个能够实现稳定收敛的简洁训练方案。我们的7B模型发展出了逻辑语料中并不存在的高级推理技能,例如反思、验证与总结。值得注意的是,仅在5K条逻辑问题上训练后,该模型就在具有挑战性的数学基准AIME和AMC上展现出泛化能力。

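摘要中提到的格式奖励与答案奖励可以用很小的规则函数来示意。下面是一个假设性的草图(标签约定与分值均为示意,并非论文使用的具体设定):

```python
# 基于规则的奖励:格式不完整(走捷径)扣分,答案正确给高奖励
import re

def rule_based_reward(response: str, gold_answer: str) -> float:
    """要求输出形如 <think>...</think><answer>...</answer> 的完整推理格式。"""
    m = re.fullmatch(r"\s*<think>(.+?)</think>\s*<answer>(.+?)</answer>\s*",
                     response, flags=re.S)
    if m is None:
        return -1.0                       # 格式惩罚:缺少思考过程或标签不完整
    answer = m.group(2).strip()
    return 2.0 if answer == gold_answer.strip() else -0.5

print(rule_based_reward("<think>若A说真话则B说谎…</think><answer>A是骑士</answer>", "A是骑士"))  # 2.0
print(rule_based_reward("<answer>A是骑士</answer>", "A是骑士"))                                  # -1.0
```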

关于生成基础模型的可信赖性:指南,评估和观点

  • 标题: On the Trustworthiness of Generative Foundation Models: Guideline, Assessment, and Perspective
  • 作者: Yue Huang, Chujie Gao, Siyuan Wu, Haoran Wang, Xiangqi Wang, Yujun Zhou, Yanbo Wang, Jiayi Ye, Jiawen Shi, Qihui Zhang, Yuan Li, Han Bao, Zhaoyi Liu, Tianrui Guan, Dongping Chen, Ruoxi Chen, Kehan Guo, Andy Zou, Bryan Hooi Kuen-Yew, Caiming Xiong, Elias Stengel-Eskin, Hongyang Zhang, Hongzhi Yin, Huan Zhang, Huaxiu Yao, Jaehong Yoon, Jieyu Zhang, Kai Shu, Kaijie Zhu, Ranjay Krishna, Swabha Swayamdipta, Taiwei Shi, Weijia Shi, Xiang Li, Yiwei Li, Yuexing Hao, Yuexing Hao, Zhihao Jia, Zhize Li, Xiuying Chen, Zhengzhong Tu, Xiyang Hu, Tianyi Zhou, Jieyu Zhao, Lichao Sun, Furong Huang, Or Cohen Sasson, Prasanna Sattigeri, Anka Reuel, Max Lamparth, Yue Zhao, Nouha Dziri, Yu Su, Huan Sun, Heng Ji, Chaowei Xiao, Mohit Bansal, Nitesh V. Chawla, Jian Pei, Jianfeng Gao, Michael Backes, Philip S. Yu, Neil Zhenqiang Gong, Pin-Yu Chen, Bo Li, Xiangliang Zhang
  • 日期: 2025-02-20
  • ArXiv主页: https://arxiv.org/abs/2502.14296
  • 论文链接: https://arxiv.org/pdf/2502.14296
  • 项目链接: https://trustgen.github.io/
  • GitHub仓库: https://github.com/TrustGen/TrustEval-toolkit

英文摘要

Generative Foundation Models (GenFMs) have emerged as transformative tools. However, their widespread adoption raises critical concerns regarding trustworthiness across dimensions. This paper presents a comprehensive framework to address these challenges through three key contributions. First, we systematically review global AI governance laws and policies from governments and regulatory bodies, as well as industry practices and standards. Based on this analysis, we propose a set of guiding principles for GenFMs, developed through extensive multidisciplinary collaboration that integrates technical, ethical, legal, and societal perspectives. Second, we introduce TrustGen, the first dynamic benchmarking platform designed to evaluate trustworthiness across multiple dimensions and model types, including text-to-image, large language, and vision-language models. TrustGen leverages modular components–metadata curation, test case generation, and contextual variation–to enable adaptive and iterative assessments, overcoming the limitations of static evaluation methods. Using TrustGen, we reveal significant progress in trustworthiness while identifying persistent challenges. Finally, we provide an in-depth discussion of the challenges and future directions for trustworthy GenFMs, which reveals the complex, evolving nature of trustworthiness, highlighting the nuanced trade-offs between utility and trustworthiness, and consideration for various downstream applications, identifying persistent challenges and providing a strategic roadmap for future research. This work establishes a holistic framework for advancing trustworthiness in GenAI, paving the way for safer and more responsible integration of GenFMs into critical applications. To facilitate advancement in the community, we release the toolkit for dynamic evaluation.

中文摘要

生成式基础模型(GenFM)已成为变革性工具,但其广泛应用也引发了对各个维度可信性的重要关切。本文通过三项关键贡献提出一个应对这些挑战的综合框架。首先,我们系统梳理了各国政府和监管机构的AI治理法律与政策,以及行业实践与标准;在此基础上,我们提出了一套GenFM指导原则,这些原则通过广泛的跨学科合作制定,融合了技术、伦理、法律与社会视角。其次,我们提出TrustGen,这是首个旨在跨多个维度和模型类型(包括文生图模型、大语言模型和视觉-语言模型)评估可信性的动态基准平台。TrustGen利用元数据整理、测试用例生成与上下文变化等模块化组件,实现自适应、可迭代的评估,克服了静态评估方法的局限。借助TrustGen,我们在揭示可信性显著进展的同时,也识别出持续存在的挑战。最后,我们深入讨论了可信GenFM面临的挑战与未来方向,揭示了可信性复杂且不断演变的本质,强调了效用与可信性之间微妙的权衡,并考虑各类下游应用,识别持续存在的挑战,为未来研究提供战略路线图。这项工作为提升生成式AI的可信性建立了一个整体框架,为GenFM更安全、更负责任地融入关键应用铺平了道路。为促进社区进步,我们发布了用于动态评估的工具包。


SWE-Lancer:前沿LLM能否从真实世界的自由职业软件工程中赚到100万美元?

  • 标题: SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?

  • 作者: Samuel Miserendino, Michele Wang, Tejal Patwardhan, Johannes Heidecke

  • 日期: 2025-02-17

  • ArXiv主页: https://arxiv.org/abs/2502.12115

  • 论文链接: https://arxiv.org/pdf/2502.12115

  • GitHub仓库: https://github.com/openai/SWELancer-Benchmark

英文摘要

We introduce SWE-Lancer, a benchmark of over 1,400 freelance software engineering tasks from Upwork, valued at $1 million USD total in real-world payouts. SWE-Lancer encompasses both independent engineering tasks–ranging from $50 bug fixes to $32,000 feature implementations–and managerial tasks, where models choose between technical implementation proposals. Independent tasks are graded with end-to-end tests triple-verified by experienced software engineers, while managerial decisions are assessed against the choices of the original hired engineering managers. We evaluate model performance and find that frontier models are still unable to solve the majority of tasks. To facilitate future research, we open-source a unified Docker image and a public evaluation split, SWE-Lancer Diamond (https://github.com/openai/SWELancer-Benchmark). By mapping model performance to monetary value, we hope SWE-Lancer enables greater research into the economic impact of AI model development.

中文摘要

我们提出SWE-Lancer,一个包含来自Upwork的1,400多个自由职业软件工程任务的基准,这些任务在真实世界中的总报酬价值达100万美元。SWE-Lancer既包含独立工程任务(从50美元的缺陷修复到32,000美元的功能实现),也包含管理类任务,即模型需在多个技术实现方案之间做出选择。独立任务通过由资深软件工程师三重校验的端到端测试进行评分,而管理决策则与最初受雇的工程经理的选择进行对照评估。我们评估了多个模型的表现,发现前沿模型仍无法解决大多数任务。为便于后续研究,我们开源了统一的Docker镜像和公开评估子集SWE-Lancer Diamond(https://github.com/openai/SWELancer-Benchmark)。通过将模型表现映射到货币价值,我们希望SWE-Lancer能够推动对AI模型发展经济影响的更多研究。


ZeroBench:当代大型多模态模型的一个不可能的视觉基准测试

  • 标题: ZeroBench: An Impossible Visual Benchmark for Contemporary Large Multimodal Models
  • 作者: Jonathan Roberts, Mohammad Reza Taesiri, Ansh Sharma, Akash Gupta, Samuel Roberts, Ioana Croitoru, Simion-Vlad Bogolin, Jialu Tang, Florian Langer, Vyas Raina, Vatsal Raina, Hanyi Xiong, Vishaal Udandarao, Jingyi Lu, Shiyang Chen, Sam Purkis, Tianshuo Yan, Wenye Lin, Gyungin Shin, Qiaochu Yang, Anh Totti Nguyen, Kai Han, Samuel Albanie
  • 日期: 2025-02-13
  • ArXiv主页: https://arxiv.org/abs/2502.09696
  • 论文链接: https://arxiv.org/pdf/2502.09696
  • 项目链接: https://zerobench.github.io/
  • GitHub仓库: https://github.com/jonathan-roberts1/zerobench

英文摘要

Large Multimodal Models (LMMs) exhibit major shortfalls when interpreting images and, by some measures, have poorer spatial cognition than small children or animals. Despite this, they attain high scores on many popular visual benchmarks, with headroom rapidly eroded by an ongoing surge of model progress. To address this, there is a pressing need for difficult benchmarks that remain relevant for longer. We take this idea to its limit by introducing ZeroBench-a lightweight visual reasoning benchmark that is entirely impossible for contemporary frontier LMMs. Our benchmark consists of 100 manually curated questions and 334 less difficult subquestions. We evaluate 20 LMMs on ZeroBench, all of which score 0.0%, and rigorously analyse the errors. To encourage progress in visual understanding, we publicly release ZeroBench.

中文摘要

大型多模态模型(LMM)在解读图像时存在明显短板,在某些指标上其空间认知甚至不如幼儿或动物。尽管如此,它们仍在许多流行的视觉基准上取得高分,而随着模型的快速进步,这些基准的提升空间正被迅速耗尽。为此,我们迫切需要能在更长时间内保持区分度的高难度基准。我们将这一想法推向极限,提出ZeroBench,这是一个对当前前沿LMM而言完全无法完成的轻量级视觉推理基准。该基准包含100道人工精心设计的问题以及334道难度较低的子问题。我们在ZeroBench上评估了20个LMM,它们的得分全部为0.0%,并对错误进行了严格分析。为了推动视觉理解的进步,我们公开发布了ZeroBench。


SongGen:用于文本到歌曲生成的单阶段自回归Transformer

  • 标题: SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation
  • 作者: Zihan Liu, Shuangrui Ding, Zhixiong Zhang, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Dahua Lin, Jiaqi Wang
  • 日期: 2025-02-18
  • ArXiv主页: https://arxiv.org/abs/2502.13128
  • 论文链接: https://arxiv.org/pdf/2502.13128
  • 项目链接: https://liuzh-19.github.io/SongGen/
  • GitHub仓库: https://github.com/LiuZH-19/SongGen

英文摘要

Text-to-song generation, the task of creating vocals and accompaniment from textual inputs, poses significant challenges due to domain complexity and data scarcity. Existing approaches often employ multi-stage generation procedures, resulting in cumbersome training and inference pipelines. In this paper, we propose SongGen, a fully open-source, single-stage auto-regressive transformer designed for controllable song generation. The proposed model facilitates fine-grained control over diverse musical attributes, including lyrics and textual descriptions of instrumentation, genre, mood, and timbre, while also offering an optional three-second reference clip for voice cloning. Within a unified auto-regressive framework, SongGen supports two output modes: mixed mode, which generates a mixture of vocals and accompaniment directly, and dual-track mode, which synthesizes them separately for greater flexibility in downstream applications. We explore diverse token pattern strategies for each mode, leading to notable improvements and valuable insights. Furthermore, we design an automated data preprocessing pipeline with effective quality control. To foster community engagement and future research, we will release our model weights, training code, annotated data, and preprocessing pipeline. The generated samples are showcased on our project page at https://liuzh-19.github.io/SongGen/ , and the code will be available at https://github.com/LiuZH-19/SongGen .

中文摘要

文本到歌曲生成是根据文本输入创作人声与伴奏的任务,由于领域复杂性和数据稀缺,这一任务具有显著挑战。现有方法通常采用多阶段生成流程,导致训练与推理管线繁琐。本文提出SongGen,一个完全开源、面向可控歌曲生成的单阶段自回归Transformer。该模型支持对多种音乐属性的细粒度控制,包括歌词以及对乐器配置、曲风、情绪和音色的文本描述,同时还提供可选的三秒参考片段用于人声克隆。在统一的自回归框架内,SongGen支持两种输出模式:混合模式直接生成人声与伴奏的混音;双轨模式将两者分别合成,以便在下游应用中获得更大灵活性。我们为每种模式探索了多种token排布策略,带来显著改进与有价值的洞见。此外,我们设计了带有效质量控制的自动化数据预处理管线。为促进社区参与和后续研究,我们将发布模型权重、训练代码、标注数据与预处理管线。生成样例展示于项目主页https://liuzh-19.github.io/SongGen/,代码将发布于https://github.com/LiuZH-19/SongGen。


学习现实世界人形机器人的站立恢复策略

  • 标题: Learning Getting-Up Policies for Real-World Humanoid Robots
  • 作者: Xialin He, Runpei Dong, Zixuan Chen, Saurabh Gupta
  • 日期: 2025-02-17
  • ArXiv主页: https://arxiv.org/abs/2502.12152
  • 论文链接: https://arxiv.org/pdf/2502.12152
  • 项目链接: https://humanoid-getup.github.io
  • GitHub仓库: https://github.com/RunpeiDong/humanup

英文摘要

Automatic fall recovery is a crucial prerequisite before humanoid robots can be reliably deployed. Hand-designing controllers for getting up is difficult because of the varied configurations a humanoid can end up in after a fall and the challenging terrains humanoid robots are expected to operate on. This paper develops a learning framework to produce controllers that enable humanoid robots to get up from varying configurations on varying terrains. Unlike previous successful applications of humanoid locomotion learning, the getting-up task involves complex contact patterns, which necessitates accurately modeling the collision geometry and sparser rewards. We address these challenges through a two-phase approach that follows a curriculum. The first stage focuses on discovering a good getting-up trajectory under minimal constraints on smoothness or speed / torque limits. The second stage then refines the discovered motions into deployable (i.e. smooth and slow) motions that are robust to variations in initial configuration and terrains. We find these innovations enable a real-world G1 humanoid robot to get up from two main situations that we considered: a) lying face up and b) lying face down, both tested on flat, deformable, slippery surfaces and slopes (e.g., sloppy grass and snowfield). To the best of our knowledge, this is the first successful demonstration of learned getting-up policies for human-sized humanoid robots in the real world. Project page: https://humanoid-getup.github.io/

中文摘要

在人形机器人能够可靠部署之前,自动跌倒恢复是一个关键前提。由于人形机器人摔倒后可能处于各种不同的姿态,且其预期工作的地形颇具挑战性,手工设计起身控制器十分困难。本文提出一个学习框架,用于产生使人形机器人能够在不同地形上从不同姿态起身的控制器。与以往人形运动学习的成功应用不同,起身任务涉及复杂的接触模式,这要求精确建模碰撞几何,并应对更稀疏的奖励。我们通过遵循课程学习的两阶段方法应对这些挑战:第一阶段专注于在对平滑度或速度/力矩限制约束最少的条件下发现良好的起身轨迹;第二阶段再将发现的动作细化为可部署(即平滑且缓慢)的动作,使其对初始姿态和地形的变化具有鲁棒性。我们发现,这些创新使真实世界中的G1人形机器人能够从我们考虑的两种主要情形中起身:a)仰卧和b)俯卧,两者均在平坦、可变形、湿滑的表面以及斜坡(如泥泞草地和雪地)上进行了测试。据我们所知,这是真实世界中人类尺寸人形机器人学习起身策略的首次成功演示。项目页面:https://humanoid-getup.github.io/


RAD:通过基于大规模3DGS的强化学习训练端到端驾驶策略

  • 标题: RAD: Training an End-to-End Driving Policy via Large-Scale 3DGS-based Reinforcement Learning
  • 作者: Hao Gao, Shaoyu Chen, Bo Jiang, Bencheng Liao, Yiang Shi, Xiaoyang Guo, Yuechuan Pu, Haoran Yin, Xiangyu Li, Xinbang Zhang, Ying Zhang, Wenyu Liu, Qian Zhang, Xinggang Wang
  • 日期: 2025-02-18
  • ArXiv主页: https://arxiv.org/abs/2502.13144
  • 论文链接: https://arxiv.org/pdf/2502.13144
  • 项目链接: https://hgao-cv.github.io/RAD/

英文摘要

Existing end-to-end autonomous driving (AD) algorithms typically follow the Imitation Learning (IL) paradigm, which faces challenges such as causal confusion and the open-loop gap. In this work, we establish a 3DGS-based closed-loop Reinforcement Learning (RL) training paradigm. By leveraging 3DGS techniques, we construct a photorealistic digital replica of the real physical world, enabling the AD policy to extensively explore the state space and learn to handle out-of-distribution scenarios through large-scale trial and error. To enhance safety, we design specialized rewards that guide the policy to effectively respond to safety-critical events and understand real-world causal relationships. For better alignment with human driving behavior, IL is incorporated into RL training as a regularization term. We introduce a closed-loop evaluation benchmark consisting of diverse, previously unseen 3DGS environments. Compared to IL-based methods, RAD achieves stronger performance in most closed-loop metrics, especially 3x lower collision rate. Abundant closed-loop results are presented at https://hgao-cv.github.io/RAD.

中文摘要

现有的端到端自动驾驶(AD)算法通常遵循模仿学习(IL)范式,面临因果混淆和开环差距等挑战。在这项工作中,我们建立了一个基于3DGS的闭环强化学习(RL)训练范式。借助3DGS技术,我们构建了真实物理世界的照片级数字副本,使自动驾驶策略能够广泛探索状态空间,并通过大规模试错学会处理分布外场景。为提高安全性,我们设计了专门的奖励,引导策略有效应对安全关键事件并理解真实世界的因果关系。为了更好地与人类驾驶行为对齐,IL被作为正则化项纳入RL训练。我们引入了一个由多样化、此前未见过的3DGS环境组成的闭环评估基准。与基于IL的方法相比,RAD在大多数闭环指标上取得更强的表现,尤其是碰撞率降低为原来的三分之一。更多闭环结果见https://hgao-cv.github.io/RAD。


多模态Mamba:通过二次到线性蒸馏实现的仅解码器多模态状态空间模型

  • 标题: Multimodal Mamba: Decoder-only Multimodal State Space Model via Quadratic to Linear Distillation

  • 作者: Bencheng Liao, Hongyuan Tao, Qian Zhang, Tianheng Cheng, Yingyue Li, Haoran Yin, Wenyu Liu, Xinggang Wang

  • 日期: 2025-02-18

  • ArXiv主页: https://arxiv.org/abs/2502.13145

  • 论文链接: https://arxiv.org/pdf/2502.13145

  • GitHub仓库: https://github.com/hustvl/mmMamba

英文摘要

Recent Multimodal Large Language Models (MLLMs) have achieved remarkable performance but face deployment challenges due to their quadratic computational complexity, growing Key-Value cache requirements, and reliance on separate vision encoders. We propose mmMamba, a framework for developing linear-complexity native multimodal state space models through progressive distillation from existing MLLMs using moderate academic computational resources. Our approach enables the direct conversion of trained decoder-only MLLMs to linear-complexity architectures without requiring pre-trained RNN-based LLM or vision encoders. We propose an seeding strategy to carve Mamba from trained Transformer and a three-stage distillation recipe, which can effectively transfer the knowledge from Transformer to Mamba while preserving multimodal capabilities. Our method also supports flexible hybrid architectures that combine Transformer and Mamba layers for customizable efficiency-performance trade-offs. Distilled from the Transformer-based decoder-only HoVLE, mmMamba-linear achieves competitive performance against existing linear and quadratic-complexity VLMs, while mmMamba-hybrid further improves performance significantly, approaching HoVLE’s capabilities. At 103K tokens, mmMamba-linear demonstrates 20.6times speedup and 75.8% GPU memory reduction compared to HoVLE, while mmMamba-hybrid achieves 13.5times speedup and 60.2% memory savings. Code and models are released at https://github.com/hustvl/mmMamba

中文摘要

近期的多模态大语言模型(MLLM)取得了出色的性能,但由于其二次方的计算复杂度、不断增长的键值缓存需求以及对独立视觉编码器的依赖,部署面临挑战。我们提出mmMamba,一个利用适度的学术计算资源、通过对现有MLLM进行渐进式蒸馏来构建线性复杂度原生多模态状态空间模型的框架。我们的方法能够将训练好的仅解码器MLLM直接转换为线性复杂度架构,而无需预训练的基于RNN的LLM或视觉编码器。我们提出了一种从训练好的Transformer中初始化Mamba的播种(seeding)策略,以及一个三阶段蒸馏方案,能够在保留多模态能力的同时将知识从Transformer有效迁移到Mamba。我们的方法还支持灵活的混合架构,将Transformer层与Mamba层结合,实现可定制的效率-性能权衡。从基于Transformer的仅解码器模型HoVLE蒸馏而来,mmMamba-linear在与现有线性及二次复杂度VLM的对比中取得有竞争力的性能,而mmMamba-hybrid进一步显著提升性能,接近HoVLE的能力。在103K token长度下,与HoVLE相比,mmMamba-linear实现了20.6倍加速并减少75.8%的GPU显存占用,mmMamba-hybrid则实现了13.5倍加速并节省60.2%的显存。代码与模型已发布于https://github.com/hustvl/mmMamba


通过主成分分析重新思考多样化的人类偏好学习

  • 标题: Rethinking Diverse Human Preference Learning through Principal Component Analysis
  • 作者: Feng Luo, Rui Yang, Hao Sun, Chunyuan Deng, Jiarui Yao, Jingyan Shen, Huan Zhang, Hanjie Chen
  • 日期: 2025-02-18
  • ArXiv主页: https://arxiv.org/abs/2502.13131
  • 论文链接: https://arxiv.org/pdf/2502.13131

英文摘要

Understanding human preferences is crucial for improving foundation models and building personalized AI systems. However, preferences are inherently diverse and complex, making it difficult for traditional reward models to capture their full range. While fine-grained preference data can help, collecting it is expensive and hard to scale. In this paper, we introduce Decomposed Reward Models (DRMs), a novel approach that extracts diverse human preferences from binary comparisons without requiring fine-grained annotations. Our key insight is to represent human preferences as vectors and analyze them using Principal Component Analysis (PCA). By constructing a dataset of embedding differences between preferred and rejected responses, DRMs identify orthogonal basis vectors that capture distinct aspects of preference. These decomposed rewards can be flexibly combined to align with different user needs, offering an interpretable and scalable alternative to traditional reward models. We demonstrate that DRMs effectively extract meaningful preference dimensions (e.g., helpfulness, safety, humor) and adapt to new users without additional training. Our results highlight DRMs as a powerful framework for personalized and interpretable LLM alignment.

中文摘要

理解人类偏好对于改进基础模型和构建个性化AI系统至关重要。然而,偏好本质上是多样而复杂的,传统奖励模型难以刻画其全貌。细粒度的偏好数据虽有帮助,但采集成本高且难以扩展。本文提出分解奖励模型(DRMs),这是一种无需细粒度标注、即可从二元比较中提取多样化人类偏好的新方法。我们的核心思路是将人类偏好表示为向量,并用主成分分析(PCA)对其进行分析。通过构建“被偏好回复”与“被拒绝回复”嵌入差值的数据集,DRMs识别出能够刻画偏好不同侧面的正交基向量。这些分解后的奖励可以灵活组合,以匹配不同用户的需求,为传统奖励模型提供了一种可解释且可扩展的替代方案。我们证明DRMs能够有效提取有意义的偏好维度(例如有用性、安全性、幽默感),并且无需额外训练即可适配新用户。我们的结果表明,DRMs是实现个性化、可解释的LLM对齐的有力框架。

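下面给出DRMs核心步骤的一个极简草图:对“被偏好回复与被拒绝回复的嵌入差”做PCA,得到一组正交的偏好方向,再按用户权重组合成标量奖励(示意性实现;嵌入此处用随机向量代替,分量数量与权重均为假设值):

```python
# 对偏好比较的嵌入差做PCA,提取正交的"偏好方向"并组合成奖励
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
d, n = 64, 500
# 假设已有 n 对回复的嵌入(真实系统中来自某个嵌入模型),这里用随机向量代替
emb_chosen = rng.normal(size=(n, d))
emb_rejected = rng.normal(size=(n, d))

diffs = emb_chosen - emb_rejected            # 每一行是一条二元偏好比较的"偏好方向"
pca = PCA(n_components=8)
pca.fit(diffs)
basis = pca.components_                      # [8, d]:相互正交的候选偏好维度

def decomposed_rewards(resp_emb: np.ndarray) -> np.ndarray:
    """把一条回复的嵌入投影到各个偏好基向量上,得到分解后的奖励分量。"""
    return basis @ resp_emb                  # [8]

# 针对某个用户,用一组权重把分量组合成标量奖励(权重可按该用户的少量反馈拟合)
user_weights = np.array([1.0, 0.5, 0, 0, 0, 0, 0, 0])
print(float(user_weights @ decomposed_rewards(rng.normal(size=d))))
```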

你并未充分利用Transformer的表示能力

  • 标题: You Do Not Fully Utilize Transformer’s Representation Capacity

  • 作者: Gleb Gerasimov, Yaroslav Aksenov, Nikita Balagansky, Viacheslav Sinii, Daniil Gavrilov

  • 日期: 2025-02-13

  • ArXiv主页: https://arxiv.org/abs/2502.09245

  • 论文链接: https://arxiv.org/pdf/2502.09245

  • gitHub仓库: https://github.com/corl-team/lime

英文摘要

In contrast to RNNs, which compress previous tokens into a single hidden state, Transformers can attend to all previous tokens directly. However, standard Transformers only use representations from the immediately preceding layer. In this paper, we show that this design choice causes representation collapse and leads to suboptimal performance. To address this issue, we introduce Layer-Integrated Memory (LIMe), a simple yet powerful approach that preserves the model’s overall memory footprint while expanding its representational capacity by allowing access to hidden states from earlier layers. Through extensive experiments across various architectures and different lookup mechanisms, we demonstrate consistent performance improvements on a wide range of tasks. Moreover, our analysis of the learned representation dynamics and our exploration of depthwise circuits reveal how LIMe integrates information across layers, pointing to promising directions for future research.

中文摘要

与将此前token压缩为单个隐藏状态的RNN不同,Transformer可以直接关注所有先前的token。然而,标准Transformer只使用紧邻前一层的表示。本文表明,这一设计选择会导致表示坍缩,从而带来次优性能。为解决该问题,我们提出层集成记忆(Layer-Integrated Memory,LIMe),一种简单而强大的方法:它保持模型的整体内存占用不变,同时通过允许访问更早层的隐藏状态来扩展其表示容量。通过在多种架构与不同查找机制上的大量实验,我们在广泛任务上观察到一致的性能提升。此外,我们对学到的表示动态的分析以及对逐深度回路的探索,揭示了LIMe如何跨层整合信息,并指出了有前景的未来研究方向。

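下面用一个玩具模块示意“允许注意力访问更早各层隐藏状态”的思路(示意性草图,并非LIMe官方实现;论文强调整体内存占用保持不变,这里只演示跨层混合K/V来源的机制,不保证该效率特性):

```python
# 让当前层的注意力以可学习权重混合更早各层的隐藏状态作为K/V来源
import torch
import torch.nn as nn

class MultiLayerKVAttention(nn.Module):
    def __init__(self, dim, n_heads, max_prev_layers):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.mix_logits = nn.Parameter(torch.zeros(max_prev_layers))   # 对各历史层的混合权重

    def forward(self, x, prev_hiddens):
        """x: 当前层输入 [B, T, D];prev_hiddens: 此前各层的隐藏状态列表,每个 [B, T, D]。"""
        stacked = torch.stack(prev_hiddens, dim=0)                      # [L, B, T, D]
        w = torch.softmax(self.mix_logits[: len(prev_hiddens)], dim=0)
        source = (w.view(-1, 1, 1, 1) * stacked).sum(dim=0)             # 跨层混合出的K/V来源
        out, _ = self.attn(query=x, key=source, value=source)
        return out

layer = MultiLayerKVAttention(dim=64, n_heads=4, max_prev_layers=6)
hiddens = [torch.randn(2, 10, 64) for _ in range(3)]                    # 假设已有3个更早层的输出
print(layer(torch.randn(2, 10, 64), hiddens).shape)                      # torch.Size([2, 10, 64])
```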
