日期:
来源:
话题:
相关度:
热度:
排序:
1
Show HN: Needle: We Distilled Gemini Tool Calling into a 26M Model
Show HN: Needle:我们将 Gemini 工具调用能力蒸馏到一个26M参数的模型
基础大模型推理部署

该项目将大型语言模型 Gemini 的工具调用能力通过蒸馏方式迁移到一个仅有 2600 万参数的轻量级“简单注意力网络”中。该模型可在本地 Mac/PC 上进行微调,并在生产环境中(Cactus 平台)实现每秒 6000 个 token 的预填充速度和 1200 token 的解码速度。模型权重和数据集生成方法完全开源,展示了一种高效、可部署的工具调用解决方案。

Skip to content Navigation Menu Platform Solutions Resources Open Source Enterprise Pricing Sign in Sign up cactus-compute / needle Public Notifications Fork 19 Star 490 Code Issues 1 Pull requests 8 Actions Projects Security and quality Insights cactus-compute/needle  main 29 Branches 0 Tags Code Folders and files Name Last commit message Last commit date Latest commit HenryNdubuaku fix citation a2cf1cb  ·  History 228 Commits assets updates docs Fix naming needle Fix download bug .gitignore restructure LICENSE updates README.md fix citation launch_train.sh updates pyproject.toml updates requirements.txt restructure setup added gpu Repository files navigation README MIT license Needle We distilled Gemini 3.1 into a 26m parameter "Simple Attention Network" that you can even finetune locally on your Mac/PC. In production, Needle runs on Cactus at 6000 toks/sec prefill and 1200 decode speed. Weights are fully open on Cactus-Compute/needle, as well as the dataset generation. d=512, 8H/4KV, BPE=8192 ┌──────────────┐ │ Tool Call │ └──────┬───────┘ ┌┴──────────┐ │ Softmax │ └─────┬─────┘ ┌─────┴─────┐ │ Linear (T)│ ← tied └─────┬─────┘ ┌─────┴─────┐ │ ZCRMSNorm │ └─────┬─────┘ ┌────────┴────────┐ │ Decoder x 8 │ │┌───────────────┐│ ││ ZCRMSNorm ││ ││ Masked Sel
Hacker News Hacker News · 11 小时前 · 相关度 80% 热度★★★★★
2
Grok now has skills
Grok现在拥有技能
行业资讯基础大模型

Elon Musk 宣布其 AI 助手 Grok 新增“技能”功能,表明 xAI 正在为 Grok 引入插件或工具调用能力,有望扩展模型的实际应用范围和交互能力。

Grok now has skills
X @elonmusk · 13 小时前 · 相关度 80% 热度★★★★★
3
Grok Voice is #1!
Grok语音功能排名第一!
行业资讯基础大模型

埃隆·马斯克在X平台宣布其公司xAI的Grok语音功能在评测中排名第一,突显该模型在语音交互能力上的领先地位。这一声明虽简短,但作为主流AI公司创始人的直接发声,对行业竞争态势有信号意义,表明大模型多模态语音功能正加速发展。

Grok Voice is #1!
X @elonmusk · 14 小时前 · 相关度 75% 热度★★★★★
4
Claude Platform on AWS
Claude 平台上线 AWS
基础大模型行业资讯

Anthropic 的 Claude 平台在 AWS 上部署,提供 Opus、Sonnet、Haiku 等系列模型,并集成 Claude Code、Claude for Slack 等工具与插件。平台面向企业客户推出 AI 智能体构建、编程辅助等解决方案,覆盖金融、医疗、教育等多个行业,通过 API 和多种定价计划提供服务。

Meet Claude Products Claude Claude Code Claude Cowork Claude Security Features Claude for Chrome Claude for Slack Claude for Microsoft 365 Skills Models Opus Sonnet Haiku Platform Overview Developer docs Pricing Console login Solutions Use cases AI agents Coding Departments Legal Security Industries Customer support Education Financial services Government Healthcare Life sciences Nonprofits Pricing Overview API Plans Pro Max Team Enterprise Resources Insights Blog Customer stories Anthropic news Learn Anthropic Academy Courses Tutorials Use cases Tools Connectors Plugins Connect Events Community Login Contact sales Contact sales Contact sales Try Claude Try Claude Try Claude Contact sales Contact sales Contact sales Try Claude Try Claude Try Claude Contact sales Contact sales Contact sales Try Claude Try Claude Try Claude Contact sales Contact sales Contact sales Try Claude Try Claude Try Claude Meet Claude Products Claude Claude Code Claude Cowork Claude Security Features Claude for Chrome Claude for Slack Claude for Microsoft 365 Skills Models Opus Sonnet Haiku Platform Overview Developer docs Pricing Console login Solutions Use cases AI agents Coding Departments Legal Security Industries Customer support Education Financial services Government Healthcare Life sciences Nonprofits Pricing Overview API Plans Pro Max Team Enterprise Resources Insights Blog Customer stories Anthropic news Learn Anthropic Academy Courses Tutorials Use cases Tools Connectors Plugins Connect Events Community
Hacker News Hacker News · 15 小时前 · 相关度 78% 热度★★★★★
5
行业资讯

NVIDIA CEO 黄仁勋在卡内基梅隆大学 2026 年毕业典礼上发表演讲,强调面对人工智能的发展不应恐惧,而应怀有乐观、责任与雄心去明智地引导未来。这一表态反映了顶级 AI 企业负责人对技术伦理和社会影响的积极看法。

“The answer is not to fear the future. The answer is to guide it wisely.” At @CarnegieMellon 2026 commencement, our CEO Jensen Huang shared why AI calls for optimism, responsibility, and ambition. https://t.co/cC9mmU5W17
X @nvidia · 15 小时前 · 相关度 82% 热度★★★★★
6
训练微调推理部署行业资讯

PyTorch创始人Soumith Chintala在X平台发布招聘,寻找超级计算工程师,负责构建支撑实时交互模型和大规模训练的基础设施,涉及调度、存储、网络、可靠性及分布式系统等关键技术方向。

Cluster magicians and GPU whisperers, come join us! We’re looking for supercomputing engineers to build the infrastructure behind real-time interactive models, Tinker, and large-scale training: scheduling, storage, networking, reliability, and distributed systems at scale.
X @soumithchintala · 17 小时前 · 相关度 75% 热度★★★★★
7
Symbolic learning is not a replacement for coding agents, it's a replacement for gradient descent & NNs: a low-level, completely general, extremely scalable new learning substrate.
符号学习不是编码智能体的替代品,而是梯度下降与神经网络的替代品:一种底层、通用、极可扩展的新学习基底
基础大模型训练微调

François Chollet 提出符号学习并非为替代编码代理而设计,而是作为梯度下降和神经网络的替代方案,是一种底层、完全通用且极可扩展的新学习基底。该观点暗示未来机器学习范式可能从基于梯度的连接主义转向符号学习,有望解决当前深度学习的可解释性与数据效率问题。作为知名 AI 研究者的技术趋势判断,具有重要的风向标意义。

Symbolic learning is not a replacement for coding agents, it's a replacement for gradient descent & NNs: a low-level, completely general, extremely scalable new learning substrate.
X @fchollet · 19 小时前 · 相关度 85% 热度★★★★★
8
Interaction Models
交互模型:一种可扩展的人机协作方法
基础大模型

Thinking Machines实验室发布交互模型研究预览,该模型原生支持实时人机协作,采用多流、微轮次设计,能够同时处理音频、视频和文本输入并实时思考与应答。论文声称该模型在智能水平与响应速度的综合表现上达到了当前最优(SOTA)。

THINKING MACHINES Tinker Connectionism News Join us Interaction Models: A Scalable Approach to Human-AI Collaboration Thinking Machines May 11, 2026 Introduction The collaboration bottleneck Capabilities › Our approach › Benchmarks Limitations and future work Tell us what you think, join us Citation Today, we’re announcing a research preview of interaction models: models that handle interaction natively rather than through external scaffolding. We think interactivity should scale alongside intelligence; the way we work with AI should not be treated as an afterthought. Interaction models let people collaborate with AI the way we naturally collaborate with each other—they continuously take in audio, video, and text, and think, respond, and act in real time. We train an interaction model from scratch. To ensure real-time responsiveness, we adopt a multi-stream, micro-turn design. Our research preview demonstrates qualitatively new interaction capabilities, as well as state-of-the-art combined performance in intelligence and responsiveness. The collaboration bottleneck # AI labs often treat the ability for AI to work autonomously as the model’s most important capability. Kwa, T., West, B., Becker, J., et al. Measuring AI Ability to Complete Long Tasks. METR, 2025. As a result, today’s models and interfaces aren’t optimized for humans to remain in the loop. A recent frontier model card states: “Importantly, we find that when used in an interactive, synchronous, “hands-on-keyboard” pattern, the benefits of the model were less clear. When used in this fashion, some users perceived [our model] as too slow and did not realize as much value. Autonomous, long-running agent harnesses better elicited the model’s coding capabilities.” Autonomous interfaces are valuable, but in most real work, users can’t fully specify their requirements upfront and walk away—good results benefit from a collaborative process where the human stays in the loop, clarifying and giving feedback a
Hacker News Hacker News · 1 天前 · 相关度 75% 热度★★★★★
9
开发工具

OpenAI联合创始人Greg Brockman发布推文,简短提及AI用于帮助开发者构建由AI驱动的应用程序,但未提供具体技术细节或产品信息。

AI for helping you build apps powered by AI:
X @gdb · 1 天前 · 相关度 62% 热度★★★★★
10
Having an agent in your meeting is such a futuristic experience:
在会议中拥有智能体是一种极具未来感的体验
行业资讯

OpenAI联合创始人Greg Brockman分享了他对会议中加入AI智能体的切身感受,认为这种体验极具未来感。这体现了AI agent正在融入日常协作场景,反映出行业对AI代理能力的关注。

Having an agent in your meeting is such a futuristic experience:
X @gdb · 1 天前 · 相关度 70% 热度★★★★★
11
行业资讯

OpenAI CEO Sam Altman 近期在 X 上表示,新发布的 ChatGPT 模型在“性格”与“个性化”功能上的结合让他感到突破了以往的体验阈值,认为这带来了全新的交互感受。这一评价反映了 OpenAI 对助手类产品在个性化和用户粘性方面的重视。

speaking of things that have gotten over a threshold for me, the combo of the new ChatGPT model, personality, and personalization feels like a new thing
X @sama · 1 天前 · 相关度 82% 热度★★★★★
12
would you call it a superapp?
你会称它为超级应用吗?
行业资讯

Sam Altman 在社交平台发问,询问是否可将某个产品称为“超级应用”,外界推测其可能指代 OpenAI 的 ChatGPT 或其他 AI 产品,但未给出具体观点或技术细节。

would you call it a superapp?
X @sama · 1 天前 · 相关度 60% 热度★★★★★
13
开发工具行业资讯

PyTorch创始人Soumith Chintala分享了AI协作工具Thinky的三步计划:1. 提升人机交互带宽;2. 提高人类+AI的智能上限;3. 帮助人类在新世界中保持主角地位。当前正处于第一步,并展示了“互动模型”作为实时协作工具的预览。该计划旨在通过创新交互模式增强人机协同工作效率。

Thinky's secret plan: 1: Increase Human<->AI bandwidth 2: Raise ceiling of human+AI intelligence 3: Help humans continue as main-characters in the new world We are at Step 1. Interaction Models are great real-time collaborative tools for humans. Here's a preview:
X @soumithchintala · 1 天前 · 相关度 72% 热度★★★★★
14
基础大模型开发工具

Andrej Karpathy在推文中分享了一种与大语言模型交互的实用技巧:在查询末尾添加指令,要求模型将响应结构化为HTML格式,然后在浏览器中直接查看生成内容。他提到还成功尝试让LLM以幻灯片等形式呈现输出。该方法展示了通过提示工程拓展LLM输出形式的应用可能性。

This works really well btw, at the end of your query ask your LLM to "structure your response as HTML", then view the generated file in your browser. I've also had some success asking the LLM to present its output as slideshows, etc. More generally, imo audio is the
X @karpathy · 1 天前 · 相关度 78% 热度★★★★★
15
行业资讯
Pretty wild I got my PhD 4 years ago to the day. I feel very lucky that I got to do it and make my switch into AI. Lot's of people today in AI are underselling the value of going through the process of a PhD.
X @natolambert · 1 天前 · 相关度 60% 热度★★★★★
16
agents make for a surprisingly great product
AI代理成为出人意料的好产品
行业资讯

推文认为AI代理(agents)作为一种产品形态表现出乎意料地成功,暗示其在当前AI应用中的良好前景。

agents make for a surprisingly great product
X @gdb · 2 天前 · 相关度 70% 热度★★★★★
17
what if we name the next model "goblin" almost worth it to make you all happy...
如果我们将下一个模型命名为“goblin”会怎样
行业资讯

Sam Altman 在 X 上开玩笑地提出将下一个 AI 模型命名为“goblin”,并称这样做几乎值得为了让社区成员高兴。这暗示 OpenAI 正在考虑或即将发布新模型,虽然内容轻松,但反映了公司对社区反馈的重视。

what if we name the next model "goblin" almost worth it to make you all happy...
X @sama · 2 天前 · 相关度 70% 热度★★★★★
18
基础大模型行业资讯

Elon Musk在X平台发布简讯“Grok Imagine”,暗示其xAI公司的Grok模型即将推出图像生成功能。这标志着Grok向多模态能力拓展,可能采用类似DALL·E或Midjourney的文生图技术。虽然详情未知,但作为主流AI公司的产品动向,该消息值得关注。

Grok Imagine
X @elonmusk · 3 天前 · 相关度 75% 热度★★★★★
19
行业资讯开发工具

OpenAI CEO Sam Altman 分享了他使用 Codex 自动化完成代码任务后的积极体验。他提到,在陪伴孩子的间歇启动一批 Codex 任务,回来后发现任务均已自动完成,这种体验让他对 AI 的未来充满乐观。这反映了 AI 编程助手在提升个人生产力方面的实际价值。

kicking off a bunch of codex tasks, running around with my kid in the sunshine, and then coming back at naptime to find them all completed makes me very optimistic for the future
X @sama · 3 天前 · 相关度 78% 热度★★★★★
20
It was always the case that agency was self-compounding, but AI is magnifying the effect. Low-agency AI users further lose agency, high-agency AI users further gain agency.
AI 放大了能动性的自我强化效应:低能动性的 AI 用户进一步丧失能动性,高能动性的 AI 用户进一步增强能动性
行业资讯

François Chollet 指出,AI 正在放大个人能动性的自我强化效应:原本能动性较低的用户在使用 AI 后会进一步丧失能动性,而原本能动性较高的用户则会借助 AI 进一步增强自己的能动性。该观点从人与 AI 互动的角度,对 AI 普及带来的社会分化趋势给出了简短警示。

It was always the case that agency was self-compounding, but AI is magnifying the effect. Low-agency AI users further lose agency, high-agency AI users further gain agency.
X @fchollet · 3 天前 · 相关度 65% 热度★★★★★
21
行业资讯

Nathan Lambert 指出,开放软件(如开源模型)降低了 AI 的部署成本,而 OpenAI 等平台则降低了企业开发定制模型的门槛;不过企业尚在早期阶段摸索如何有效利用这一趋势。

Open software lowered deployment cost. Open AI lowers development cost. E.g. developing a bespoke model for an enterprise use case. We’re early in companies figuring out how to leverage this successfully.
X @natolambert · 18 小时前 · 相关度 65% 热度★★★★☆
22
开发工具行业资讯

Hugging Face官方宣布其Hub平台上的开放数据集数量已达100万个,强调开放模型需要开放数据,感谢AI社区的共同贡献。这一里程碑反映了开源数据生态的快速扩张,将为模型训练和评估提供更丰富的数据资源。

We've just hit 1M open datasets on the Hugging Face Hub 🎉 Open models need open data. Today we hit that milestone, together with the most incredible community in AI! 🤗 Onwards to the next million 🚀 https://t.co/PV6knP3XlJ
X @huggingface · 19 小时前 · 相关度 70% 热度★★★★☆
23
This is the demo that hits me as being genuinely different -- both model and user talking at once! Great stuff. Congrats on the release @thinkymachines
这条 demo 让我感觉真正与众不同——模型和用户同时交谈!祝贺 @thinkymachines 发布
行业资讯

Nathan Lambert 称赞 thinkymachines 发布的一个 demo,该 demo 实现了模型与用户同时进行语音对话的实时交互能力,展现出与传统异步对话不同的体验,认为其非常出色。

This is the demo that hits me as being genuinely different -- both model and user talking at once! Great stuff. Congrats on the release @thinkymachines
X @natolambert · 1 天前 · 相关度 75% 热度★★★★☆
24
基础大模型

知名AI研究员Sebastian Raschka表示将深入研究大模型中最有趣的架构组件,并询问社区是否遗漏了重要架构,可能涉及对当前主要模型架构的梳理与分析。

Back from a little family break! Lots has happened, and I’m planning to do a deeper dive into the most interesting architectural components (soon). Btw, are there any major architectures I missed below? https://t.co/HEhXUjxaDY
X @rasbt · 2 天前 · 相关度 70% 热度★★★★☆
25
GPT-5.5 Price Increase: What It Costs
GPT-5.5 价格上涨:成本是多少
行业资讯

本文围绕 OpenAI 推出的 GPT-5.5 模型 API 价格大幅上涨展开讨论。内容基于 OpenRouter 平台的成本数据,对比了 GPT-5.5 与早期模型在输入和输出 token 计费上的具体差异。文章还分析了涨价可能的原因,并收集了开发者社区的反馈,探讨模型性能提升是否与成本增加相匹配。

Skip to content OpenRouter / Fusion Models Chat Rankings Apps Enterprise Pricing Docs Sign Up GPT-5.5 Price Increase: What It Actually Costs Justin Summerville · 5/4/2026 We replicated the cost analysis we did on Opus on the new GPT-5.5 model. GPT-5.5 launched with a 2x price increase over GPT-5.4: input tokens increased from $2.50/M to $5.00/M and output tokens from $15/M to $30/M. OpenAI has also noted that the model is less verbose, producing shorter completions for the same tasks. Just as we did with Opus 4.7 we wanted to know what is the net impact on costs to users by analyzing usage that shifted from GPT-5.4 to GPT-5.5. We observed cost increases between 49-92%. The price increase is mitigated by the model generating 19-34% fewer completion tokens for longer prompts. Methodology: Same Switcher Cohort Approach We used the same approach as our Opus 4.7 analysis. We identified users whose top model by request count was GPT-5.4 prior to the 5.5 launch, who then switched to GPT-5.5 as their top model. This "switcher cohort" gives us a controlled before-and-after comparison of the same user base across model versions. Since GPT-5.4 and 5.5 use the same tokenizer family, we don't need to control for tokenizer differences. The comparison is direct: same users, same workflows, different model version. GPT-5.5 Is Less Verbose, But Only for Longer Prompts Using OpenRouter's consistent token counts, we measured how completion lengths changed between models: Prompt Size Median Completion (5.4) Median Completion (5.5) Change < 2K tokens 121 129 +7% 2K – 10K 140 213 +52% 10K – 25K 211 143 -32% 25K – 50K 185 150 -19% 50K – 128K 188 136 -28% 128K+ 215 143 -34% For prompts above 10K tokens, GPT-5.5 produces 19-34% fewer tokens. For shorter prompts, the pattern reverses: under 2K tokens completions are roughly the same length, and in the 2K-10K range they are 52% longer. Actual Cost Impact Using billed costs from requests in the switcher cohort, we calculated the
Hacker News Hacker News · 3 天前 · 相关度 72% 热度★★★★☆
26
Apple, Intel have reached preliminary chip-making deal
苹果与英特尔达成初步芯片制造协议
AI芯片硬件行业资讯

苹果与英特尔已达成初步协议,英特尔将为苹果生产定制芯片,这被视为英特尔代工业务(Intel Foundry)的重要突破。合作可能涉及基于先进工艺节点的AI推理或训练芯片,有望增强AI芯片供应链的多元化,并影响数据中心及消费端AI硬件生态。此举可能改变NVIDIA等厂商在AI芯片代工领域的市场格局。

HN 讨论:https://news.ycombinator.com/item?id=48066169 点赞:227 | 评论:144
Hacker News Hacker News · 3 天前 · 相关度 80% 热度★★★★☆
27
基础大模型学术论文

MiniCPM-o 4.5 是一个旨在实现实时、全双工、全模态交互的大模型,支持文本、语音、视觉等多种模态的同步输入与输出。该工作提出了端到端的多模态交互架构,重点解决低延迟、流式处理与模态对齐问题,通过结合语音编码、视觉理解与语言生成模块,实现更自然的人机交互。论文中可能包含了相关实验设计与性能评估,但推文未透露具体 SOTA 数据或模型规模。

MiniCPM-o 4.5 Towards Real-Time Full-Duplex Omni-Modal Interaction paper: https://t.co/wO3yzw0c2o https://t.co/VmxJh8qfl0
X @_akhaliq · 3 天前 · 相关度 78% 热度★★★★☆
28
Teaching Claude Why
教导Claude理解“为什么”:提升推理与对齐能力的训练方法
训练微调基础大模型

Anthropic 提出了一种以“为什么”为导向的训练方法,旨在增强 Claude 模型的推理可靠性和可解释性。该方法在常规答案学习的基础上,要求模型生成并理解背后的因果逻辑,可能集成了过程监督或思维链训练策略。这种训练调整直接提升了模型行为与人类意图的对齐程度,属于对齐技术中的新探索。该研究有助于推动大语言模型在推理一致性和安全性方面的进步。

Skip to main content Skip to footer Research Economic Futures Commitments Learn News Try Claude Alignment Teaching Claude why May 8, 2026 Last year, we released a case study on agentic misalignment. In experimental scenarios, we showed that AI models from many different developers sometimes took egregiously misaligned actions when they encountered (fictional) ethical dilemmas. For example, in one heavily discussed example, the models blackmailed engineers to avoid being shut down. When we first published this research, our most capable frontier models were from the Claude 4 family. This was also the first model family for which we ran a live alignment assessment during training;1 agentic misalignment was one of several behavioral issues that surfaced. Thus, after Claude 4, it was clear we needed to improve our safety training and, since then, we have made significant updates to our safety training. We use agentic misalignment as a case study to highlight some of the techniques we found to be surprisingly effective. Indeed, since Claude Haiku 4.5, every Claude model2 has achieved a perfect score on the agentic misalignment evaluation—that is, the models never engage in blackmail, where previous models would sometimes do so up to 96% of the time (Opus 4). Not only that, but we’ve continued to see improvements to other behaviors on our automated alignment assessment. In this post, we’ll discuss a few of the updates we’ve made to alignment training. We’ve learned four main lessons from this work: Misaligned behavior can be suppressed via direct training on the evaluation distribution—but this alignment might not generalize well out-of-distribution (OOD). Training on prompts very similar to the evaluation can reduce blackmail rate significantly, but it did not improve performance on our held-out automated alignment assessment. However, it is possible to do principled alignment training that generalizes OOD. For instance, documents about Claude’s constitution and fic
Hacker News Hacker News · 3 天前 · 相关度 85% 热度★★★★☆
29
Using Claude Code: The unreasonable effectiveness of HTML
使用Claude Code:HTML的超常有效性
开发工具

本文通过实际案例展示了Anthropic的AI编码助手Claude Code在前端开发中处理HTML代码的强大能力,作者发现利用大模型辅助生成HTML界面具有超乎预期的效率。文章探讨了如何将HTML作为高效的交互界面构建方式,并分享了使用Claude Code提升前端开发效率的技巧与观察。

Don’t miss what’s happening People on X are the first to know. Log in Sign up Article See new posts Conversation Thariq @trq212 Using Claude Code: The Unreasonable Effectiveness of HTML 867 3.2K 13K 9.6M Markdown has become the dominant file format used by agents to communicate with us. It’s simple, portable, has some rich text capability and is easy for you to edit. Claude has even gotten surprisingly good at using ASCII to make diagrams inside of markdown files. But as agents have become more and more powerful, I have felt that markdown has become a restricting format. I find it difficult to read a markdown file of more than a hundred lines. I want richer visualizations, color and diagrams and I want to be able to share them easily. I'm also increasingly not editing these files myself, but using them as specs, reference files, brainstorming outputs, etc. When I do make edits, I’m usually prompting Claude to edit them, which removes one of markdown’s largest benefits. I’ve started preferring HTML as an output format instead of Markdown and increasingly see this being used by others on the Claude Code team, this is why. (if you want to start with some examples, you can see a bunch here: https://thariqs.github.io/html-effectiveness, just be sure to come back and read more about why) Why HTML? Information Density HTML can convey much richer information compared to markdown. It can of course do simple document structure like headers and formatting, but it can also represent all sorts of other information such as: Tabular data using tables Design data with CSS Illustrations with SVG Code snippets with script tags Interactions using HTML elements with javascript + CSS Workflows using SVG and HTML Spatial data using absolute positions and canvases Images using image tags I would go so far as to say that there is almost no set of information that Claude can read that you cannot fairly efficiently represent with HTML. This makes it a highly efficient way for the model to co
Hacker News Hacker News · 3 天前 · 相关度 75% 热度★★★★☆
30
A recent experience with ChatGPT 5.5 Pro
最近ChatGPT 5.5 Pro的体验
基础大模型推理部署

一位用户分享了疑似OpenAI最新模型ChatGPT 5.5 Pro的实际使用体验,重点测试了其在复杂推理、代码生成和专业知识问答方面的表现。该体验通过非官方视角直观评估了新一代大语言模型的能力提升,为模型实际效果提供了民间参考。

Gowers's Weblog Mathematics related discussions « Group and semigroup puzzles and a possible Polymath project A recent experience with ChatGPT 5.5 Pro We are all having to keep revising upwards our assessments of the mathematical capabilities of large language models. I have just made a fairly large revision as a result of ChatGPT 5.5 Pro, to which I am fortunate to have been given access, producing a piece of PhD-level research in an hour or so, with no serious mathematical input from me. The background is that, as has been widely reported, LLMs are now capable of solving research-level problems, and have managed to solve several of the Erdős problems listed on Thomas Bloom’s wonderful website. Initially it was possible to laugh this off: many of the “solutions” consisted in the LLM noticing that the problem had an answer sitting there in the literature already, or could be very easily deduced from known results. But little by little the laughter has become quieter. The message I am getting from what other mathematicians more involved in this enterprise have been saying is that LLMs have got to the point where if a problem has an easy argument that for one reason or another human mathematicians have missed (that reason sometimes, but not always, being that the problem has not received all that much attention), then there is a good chance that the LLMs will spot it. Conversely, for problems where one’s initial reaction is to be impressed that an LLM has come up with a clever argument, it often turns out on closer inspection that there are precedents for those arguments, so it is still just about possible to comfort oneself that LLMs are merely putting together existing knowledge rather than having truly original ideas. How much of a comfort that is I will not discuss here, other than to note that quite a lot of perfectly good human mathematics consists in putting together existing knowledge and proof techniques. I decided to try something a lit
Hacker News Hacker News · 3 天前 · 相关度 85% 热度★★★★☆
31
How open model ecosystems compound
开放模型生态系统如何产生复合效应
训练微调行业资讯

文章指出前沿模型的建设计算成本中约80%用于研发而非最终模型训练,引用Ai2的Olmo 3开发记录和Epoch AI的研究作为证据。在中国这样以开放模型为主的生态中,各实验室可以共享研发成果,避免重复投入,从而形成显著的成本结构优势。这种成本优势可能使实验室的持续研发周期超出外部观察者的预期,凸显开放生态系统的复合增长效应。

<h5>Note: Voice-overs for paywalled posts are available for paid subscribes in podcast apps if you click on settings on Interconnects, then manage your description. Thanks for listening!</h5><p>Most of the compute to build a leading frontier model comes from R&amp;D costs, rather than the compute to train the final, big model end-to-end. In an ecosystem like China, where all the leading players are open, this creates a potential meaningful advantage in cost structures that’ll let labs keep building longer than outside observers would expect.</p><p>There are two recent pieces of research, one from Ai2 <a href="https://arxiv.org/abs/2605.01158" rel="noopener noreferrer" referrerpolicy="no-referrer" target="_blank">documenting the development of Olmo 3</a> and one from Epoch AI <a href="https://epoch.ai/gradient-updates/r-and-d-vs-training-compute" rel="noopener noreferrer" referrerpolicy="no-referrer" target="_blank">studying public documentation of costs from various frontier labs</a>, that put the estimate of compute spent on R&amp;D rather than the final model at about 80% (with meaningful error bars).</p><figure><a href="https://substackcdn.com/image/fetch/$s_!VBXD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e980106-980c-4092-bbf1-f472323b8da0_1026x1283.png" rel="noopener noreferrer" referrerpolicy="no-referrer" target="_blank"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!VBXD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e980106-980c-4092-bbf1-f472323b8da0_1026x1283.png 424w, https://substackcdn.com/image/fetch/$s_!VBXD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e980106-980c-4092-bbf1-f472323b8da0_1026x1283.png 848w, https://substackcdn.com/image/fetch/$s_!VBXD!,w_1272,c_limit,f_webp,q_auto:good,
Newsletter Interconnects AI · 18 小时前 · 相关度 82% 热度★★★☆☆
32
"OncoAgent: A Dual-Tier Multi-Agent Framework for Privacy-Preserving Oncology Clinical Decision Support"
OncoAgent: 面向隐私保护的双层多智能体肿瘤临床决策支持框架
行业资讯开发工具

OncoAgent 是一个双层级多智能体框架,借助大语言模型为肿瘤临床决策提供支持。该框架通过分层代理分工与协作,处理复杂医疗任务,并内置隐私保护机制,可能采用本地推理或联邦学习保障数据安全。其设计展示了多智能体系统在严肃医疗场景下的应用潜力,旨在提升诊疗效率与决策可解释性。

"OncoAgent: A Dual-Tier Multi-Agent Framework for Privacy-Preserving Oncology Clinical Decision Support" - user: oncoagent-research tags: - oncology - multi-agent - LangGraph - RAG - QLoRA - AMD - open-source - clinical-ai - healthcare OncoAgent: A Dual-Tier Multi-Agent Framework for Privacy-Preserving Oncology Clinical Decision Support Technical preprint · May 2026 · OncoAgent Research Group Abstract We present OncoAgent, an open-source, privacy-preserving clinical decision support system for oncology. OncoAgent combines a dual-tier fine-tuned LLM architecture with a state-of-the-art multi-agent LangGraph topology, a four-stage Corrective RAG pipeline over 70+ physician-grade NCCN and ESMO guidelines, and a three-layer reflexion safety validator enforcing a strict Zero-PHI policy. The system routes clinical queries through an additive complexity scorer to either a 9B parameter speed-optimised model (Tier 1) or a 27B deep-reasoning model (Tier 2), both fine-tuned via QLoRA on a corpus of 266,854 real and synthetically generated oncological cases using the Unsloth framework on AMD Instinct MI300X hardware (192 GB HBM3). Sequence packing on MI300X enabled full-dataset fine-tuning in approximately 50 minutes — a 56× throughput acceleration over API-based generation. Post-fix, CRAG document grading achieved a 100% success rate with a mean RAG confidence score of 2.3+. The complete system is 100% open source and deployable on-premises, eliminating proprietary cloud API dependency and preserving patient data sovereignty. Keywords: clinical decision support, oncology AI, multi-agent systems, retrieval-augmented generation, QLoRA, AMD ROCm, open-source healthcare AI, HITL safety, LangGraph, Corrective RAG 1. Introduction Oncology is one of the most information-dense and cognitively demanding domains in clinical medicine. The volume, heterogeneity, and rapid evolution of evidence-based guidelines — from the National Comprehensive Cancer Network (NCCN) to the European Society
AI Lab Hugging Face - Blog · 3 天前 · 相关度 78% 热度★★★☆☆
33
One-Step Generative Modeling via Wasserstein Gradient Flows
基于Wasserstein梯度流的一步生成建模
基础大模型学术论文

研究人员提出W-Flow框架,利用Wasserstein梯度流将参考分布演化为目标分布,并训练静态神经网络生成器实现一步生成。该框架采用Sinkhorn散度作为能量函数,有效捕捉分布差异并改善模式覆盖。在ImageNet 256×256上取得1.29 FID的SOTA结果,采样速度比多步扩散模型快约100倍,同时支持域迁移。理论证明有限样本训练可收敛至连续时间分布动态,为快速高保真生成模型提供了新范式。

arXiv:2605.11755v1 Announce Type: new Abstract: Diffusion models and flow-based methods have shown impressive generative capability, especially for images, but their sampling is expensive because it requires many iterative updates. We introduce W-Flow, a framework for training a generator that transforms samples from a simple reference distribution into samples from a target data distribution in a single step. This is achieved in two steps: we first define an evolution from the reference distribution to the target distribution through a Wasserstein gradient flow that minimizes an energy functional; second, we train a static neural generator to compress this evolution into one-step generation. We instantiate the energy functional with the Sinkhorn divergence, which yields an efficient optimal-transport-based update rule that captures global distributional discrepancy and improves coverage of the target distribution. We further prove that the finite-sample training dynamics converge to the continuous-time distributional dynamics under suitable assumptions. Empirically, W-Flow sets a new state of the art for one-step ImageNet 256$\times$256 generation, achieving 1.29 FID, with improved mode coverage and domain transfer. Compared to multi-step diffusion models with similar FID scores, our method yields approximately 100$\times$ faster sampling. These results show that Wasserstein gradient flows provide a principled and effective foundation for fast and high-fidelity generative modeling.
arXiv arXiv cs.LG · 6 小时前 · 相关度 95% 热度★★☆☆☆
34
PIVOT: Bridging Planning and Execution in LLM Agents via Trajectory Refinement
PIVOT:通过轨迹优化桥接LLM智能体的规划与执行
学术论文开发工具

PIVOT 提出一个自监督框架,通过计划-检查-进化-验证四个阶段,将行动轨迹视为可迭代优化的对象。框架利用环境执行得到的结构化损失和文本梯度,逐步优化轨迹以减少规划与执行之间的鸿沟。在 DeepPlanning 和 GAIA 基准上,该方法取得最优性能,带人类反馈时可相对提升约束满足度达 94%,自动变体仍保持显著增益,且 token 消耗仅为同类细化方法的 1/5~1/3。

arXiv:2605.11225v1 Announce Type: new Abstract: Large language model (LLM)-based agents frequently generate seemingly coherent plans that fail upon execution due to infeasible actions, constraint violations, and compounding errors over extended horizons. PIVOT (Plan-Inspect-eVOlve Trajectories) addresses this plan-execution misalignment through a self-supervised framework that treats trajectories as optimizable objects iteratively refined via environment interaction. The framework comprises four stages: PLAN generates candidate trajectories; INSPECT executes them and computes structured losses with textual gradients encoding plan-execution discrepancies; EVOLVE applies these signals to produce improved trajectories; and VERIFY performs a final global check against task constraints. A monotonic acceptance process ensures a non-decreasing solution quality. Empirical evaluations on DeepPlanning and GAIA demonstrate state-of-the-art performance: with human-in-the-loop (HITL) feedback, PIVOT establishes a strong upper bound up to 94% relative improvement in constraint satisfaction, while its fully autonomous variant retains substantial gains, showing that the core trajectory-refinement mechanism remains effective without external supervision. At the same time, PIVOT remains computationally efficient, requiring up to 3x to 5x fewer tokens than competing refinement methods. These findings establish that (self- or human-supervised) feedback-based trajectory optimization is a principled methodology for mitigating plan-execution gaps in autonomous agent systems.
arXiv arXiv cs.AI · 6 小时前 · 相关度 88% 热度★★☆☆☆
35
Rethinking LLMOps for Fraud and AML: Building a Compliance-Grade LLM Serving Stack
重新思考面向欺诈与反洗钱的LLMOps:构建合规级大模型服务栈
推理部署性能优化学术论文

本文针对欺诈检测和反洗钱合规场景中提示前缀密集、模式约束、证据丰富的特点,提出了一套工作负载感知的LLMOps服务栈。该栈基于自托管的开放权重模型(如Meta Llama、阿里Qwen),集成了vLLM风格运行时调优、PagedAttention、自动前缀缓存、多适配器服务、适配器与提示长度感知的批处理、休眠/唤醒生命周期管理、推测解码以及可选的预填充/解码分离。在公开合成AML数据集上的实验表明,优化后吞吐量从612-650请求/小时提升至3600请求/小时,P99延迟从31-38秒降至6.4-8.7秒,GPU利用率从12%提高到78%,证明合规级LLM性能取决于工作负载设计、服务优化和质量门控,而非单一的模型选择。

arXiv:2605.11232v1 Announce Type: new Abstract: Fraud detection and anti-money-laundering (AML) compliance are high-value domains for large language models (LLMs), but their serving requirements differ sharply from generic chat workloads. Compliance prompts are often prefix-heavy, schema-constrained, and evidence-rich, combining reusable policy instructions, risk taxonomies, transaction or document context, and short structured outputs such as JSON labels or risk factors. These properties make prefix reuse, KV-cache efficiency, runtime tuning, model orchestration, and output validation first-order systems concerns. This paper introduces a workload-aware LLMOps stack for fraud and AML workloads using self-hosted open-weight models such as Meta Llama and Alibaba Qwen. The stack combines vLLM-style runtime tuning, PagedAttention, Automatic Prefix Caching, multi-adapter serving, adapter and prompt-length-aware batching, sleep/wake lifecycle management, speculative decoding, and optional prefill/decode disaggregation. To avoid exposing institution-specific data, the reproducibility track converts public synthetic AML datasets, including IBM AML and SAML-D, into prefix-heavy compliance prompts with reusable policy text, transaction evidence, typology definitions, and schema-constrained outputs. We also incorporate an LLM-as-judge quality gate using deterministic compliance checks, reference metrics, expert-adjudicated calibration data where available, and multi-judge rubric scoring. Across public-synthetic AML workloads and controlled serving benchmarks, workload-aware tuning improved throughput from 612-650 to 3,600 requests/hour, reduced P99 latency from 31-38 seconds to 6.4-8.7 seconds, and increased GPU utilization from 12% to 78%. These results show that regulated LLM performance is a workload-design, serving-optimization, and quality-gating problem, not only a model-selection problem.
arXiv arXiv cs.AI · 6 小时前 · 相关度 88% 热度★★☆☆☆
36
20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone
20/20 视觉语言模型:仅通过数据策展获得更好 VLM 的处方
训练微调学术论文

本文研究仅通过数据策展来提升视觉语言模型(VLM)的性能,固定架构和训练配方,仅改变训练数据。使用精心设计的策展流水线,在20个公开VLM基准上平均提升11.7个百分点,在自建评测套件DatBench上平均提升11.3个百分点。在2B规模下,策展模型以约17倍的训练计算量优势超越InternVL3.5-2B,并以约87倍的计算量逼近Qwen3-VL-2B性能。此外,策展还带来可靠性提升(训练种子间标准差降低67%)、分布外泛化增益(+7.2pp)、行为改善(更诚实、简洁)以及在推理成本上的帕累托优势。该工作表明数据策展是针对VLM的高杠杆优化手段,能够在极低的计算预算下达到近前沿精度。

arXiv:2605.11405v1 Announce Type: new Abstract: Data curation has shifted the quality-compute frontier for language-model and contrastive image-text pretraining, but its role for vision-language models (VLMs) is far less established. We ask how far data curation alone can take VLM performance, holding architecture, training recipe, and compute fixed and varying only the training data. Our pipeline, applied to the MAmmoTH-VL single-image subset, lifts performance by +11.7pp on average across 20 public VLM benchmarks (spanning grounding, VQA, OCR/documents, captioning, spatial/3D, counting, charts, math, brand-ID, and multi-image reasoning) and by +11.3pp on average across all nine capability axes of DatBench, our high-fidelity VLM eval suite. At 2B, our curated model surpasses InternVL3.5-2B by 9.9pp at ~17x less training compute and closes the gap to Qwen3-VL-2B to within 1.8pp at ~87x less compute, from pretraining alone. Beyond accuracy, curation delivers four further properties: (1) Reliability: per-capability std across training seeds drops by ~67% and the lift survives a 4k-to-16k context-length sweep; (2) OOD generalization: the 9-eval OOD average rises by +7.2pp, and multi-image BLINK rises by +3.09pp despite single-image-only training, with Visual Correspondence gaining +11.8pp; (3) Behavioral gains beyond benchmarks: across ~1,100 open-ended queries the curated 2B is more honest and more specific than the matched-compute baseline, and more concise and less refusal-prone than a frontier 2B reference; (4) Pareto-dominance on inference cost: at every scale (1B, 2B, 4B) the curated model raises accuracy while lowering response FLOPs vs. the matched-compute baseline, and the curated 4B matches near-frontier accuracy at 3.3x lower response FLOPs than Qwen3-VL-4B. Data curation is a high-leverage tool for building better VLMs, reaching near-frontier accuracy at up to ~150x less training compute.
arXiv arXiv cs.LG · 6 小时前 · 相关度 88% 热度★★☆☆☆
37
Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training
超越GRPO与同策略蒸馏:一种经验性的稀疏到稠密奖励原则用于语言模型后训练
训练微调学术论文

该论文研究了在可验证训练数据有限的情况下,如何高效分配标注示例以提升语言模型后训练效果。作者提出“奖励密度原则”:应将稀缺的序列级稀疏奖励用于能力强的大模型进行探索性训练(如GRPO),然后通过稠密token级教师奖励将行为蒸馏到部署的小模型上,从而避免将稀疏奖励直接用于低准备度的策略。实验以Qwen3和Llama模型在数学推理任务上进行验证,发现先用8B教师模型进行RL提升,再通过“稠密桥接”(forward-KL预热加OPD)蒸馏至1.7B学生模型,比直接对同一学生运行GRPO效果更好,并且桥接后的学生进行后期稀疏RL训练也能进一步大幅提升准确率。该方法在MATH和AIME指标上均取得显著改善,为稀疏标注数据的利用提供了简单有效的分配准则。

arXiv:2605.12483v1 Announce Type: new Abstract: In settings where labeled verifiable training data is the binding constraint, each checked example should be allocated carefully. The standard practice is to use this data directly on the model that will be deployed, for example by running GRPO on the deployment student. We argue that this is often an inefficient allocation because it overlooks a reward-density principle: sparse sequence-level reward should train models where exploration is productive, while dense token-level teacher reward should be used where the aim is to compress behavior into a smaller model. In this view, GRPO-style sparse RL and OPD-style dense teacher supervision are not separate recipes; they are different reward-density regimes. The allocation rule is simple: use scarce labeled training data upstream on the strongest model that can turn it into reward-shaped behavior, then transfer that behavior downstream as dense supervision. We evaluate this rule on verifiable math with Qwen3 and Llama models. At fixed Qwen3-1.7B deployment-student size, an RL-improved 8B teacher distilled through the dense bridge outperforms direct GRPO on the same student, while transfer from the same teacher before RL underperforms. The bridge is important: a forward-KL warmup on teacher rollouts followed by OPD on student rollouts is consistently strongest on MATH before any post-bridge student-side sparse RL, and also gives the best pre-Stage~3 AIME endpoints for the canonical 8B/14B teachers. The bridge also makes later student-side sparse RL effective: GRPO that is weak on a cold student lifts MATH from $75.4\%$ to $78.5\%$ after the bridge and outperforms a matched replay control by $2.8$ points. The operational principal is to avoid using scarce labeled data on the least prepared policy: use sparse reward for teacher-side discovery, dense transfer for student compression, and student-side sparse reward only after the bridge.
arXiv arXiv cs.LG · 6 小时前 · 相关度 88% 热度★★☆☆☆
38
ROMER: Expert Replacement and Router Calibration for Robust MoE LLMs on Analog Compute-in-Memory Systems
ROMER:面向模拟存算一体系统上鲁棒MoE大模型的专家替换与路由器校准
AI芯片硬件推理部署性能优化

本文首次系统研究了混合专家(MoE)大模型在基于真实芯片噪声标定的模拟存算一体(CIM)系统上的表现,发现硬件噪声会严重破坏专家负载均衡并导致路由决策次优。为此提出ROMER框架,通过替换欠激活专家以恢复负载平衡,并利用百分位归一化重新校准路由器logits来稳定噪声下的路由。在多个MoE模型(DeepSeek-MoE、Qwen-MoE、OLMoE)上,ROMER在真实芯片噪声条件下将困惑度分别降低了58.6%、58.8%和59.8%,验证了其对不同MoE架构的有效性和泛化性。

arXiv:2605.11800v1 Announce Type: new Abstract: Large language models (LLMs) with mixture-of-experts (MoE) architectures achieve remarkable scalability by sparsely activating a subset of experts per token, yet their frequent expert switching creates memory bandwidth bottlenecks that compute-in-memory (CIM) architectures are well-suited to mitigate. However, analog CIM systems suffer from inherent hardware imperfections that perturb stored weights, and its negative impact on MoE-based LLMs in noisy CIM environments remains unexplored. In this work, we present the first systematic investigation of MoE-based LLMs under noise model calibrated with real chip measurements, revealing that hardware noise critically disrupts expert load balance and renders clean-trained routing decisions consistently suboptimal. Based on these findings, we propose ROMER, a post-training calibration framework that (1) replaces underactivated experts with high-frequency ones to restore load balance, and (2) recalibrates router logits via percentile-based normalization to stabilize routing under noise. Extensive experiments across multiple benchmarks demonstrate that ROMER achieves up to 58.6\%, 58.8\%, and 59.8\% reduction in perplexity under real-chip noise conditions for DeepSeek-MoE, Qwen-MoE, and OLMoE, respectively, establishing its effectiveness and generalizability across diverse MoE architectures.
arXiv arXiv cs.LG · 6 小时前 · 相关度 88% 热度★★☆☆☆
39
Slicing and Dicing: Configuring Optimal Mixtures of Experts
切分与组合:配置最优的专家混合架构
基础大模型训练微调

本文对 Mixture-of-Experts(MoE)架构进行了首次系统性实验研究,覆盖超过 2000 次预训练运行,模型参数量达 6.6B。研究发现性能随总 MoE 参数增加持续提升,即使极端激活参数比(如 128)下亦然;最优专家大小仅取决于激活参数数,与总参数数几乎无关;共享专家、异质专家和负载均衡设置对性能影响较小,而无丢弃路由带来一致性收益。最终给出简化配方:重点关注专家数量和粒度,其他选择对最终质量影响甚微。

arXiv:2605.11689v1 Announce Type: new Abstract: Mixture-of-Experts (MoE) architectures have become standard in large language models, yet many of their core design choices - expert count, granularity, shared experts, load balancing, token dropping - have only been studied one or two at a time over narrow configuration ranges. It remains an open question whether these choices can be optimized independently, without considering interactions. We present the first systematic study of over 2,000 pretraining runs spanning models up to 6.6B total parameters, in which we exhaustively vary total experts, expert dimension, heterogeneous expert sizing within a single layer, shared expert size and load-balancing mechanisms. We find that at every active-parameter scale that we study, performance consistently improves with total MoE parameters even at extreme active expert parameter ratios like 128.Further, the optimal expert size is nearly invariant to total parameter count and depends only on active parameter count. Third, we see that other choices like shared experts, heterogeneous experts and load-balancing settings have small effects relative to expert count and granularity, although dropless routing yields a consistent gain. Overall, our results suggest a simpler recipe: focus on expert count and granularity, other choices have minimal effect on final quality.
arXiv arXiv cs.LG · 6 小时前 · 相关度 88% 热度★★☆☆☆
40
Keep What Audio Cannot Say: Context-Preserving Token Pruning for Omni-LLMs
保留音频所不能言:面向全模态LLM的上下文保留Token剪枝
推理部署性能优化

本文提出ContextGuard,一种面向全模态大模型(Omni-LLMs)的无须微调的推理时Token剪枝框架。该方法通过音频预测粗粒度视觉语义,剪去可从音频恢复的视频Token,同时保留音频无法表达的局部视觉细节,并进一步合并时间相似的视频Token。在Qwen2.5-Omni等多个模型上,ContextGuard在剪枝55%输入Tokens的情况下,仍能在五个基准上达到与全Token相当的性能,显著优于现有推理时剪枝方法。

arXiv:2605.11605v1 Announce Type: cross Abstract: Omnimodal Large Language Models (Omni-LLMs) incur substantial computational overhead due to the large number of multimodal input tokens they process, making token reduction essential for real-world deployment. Existing Omni-LLM pruning methods typically reduce this cost by selecting tokens that are important for the current query or strongly aligned with cross-modal cues. However, such strategies can discard evidence that falls outside these criteria, even when needed for different questions or for understanding context beyond aligned audio-visual cues. To address this limitation, we reframe Omni-LLM token reduction as preserving broad audio-visual context while removing cross-modal redundancy. We propose ContextGuard, an inference-time token pruning framework built on this principle. ContextGuard predicts coarse visual semantics from audio and prunes video tokens whose coarse semantics are likely recoverable from audio, while retaining additional video tokens to preserve localized visual details that audio alone cannot specify. For further compression, our method merges temporally similar video tokens. The framework requires no downstream LLM fine-tuning and uses only an independently trained lightweight predictor. On Qwen2.5-Omni and Video-SALMONN2+ at 3B and 7B scales across six audio-visual benchmarks, ContextGuard outperforms prior inference-time pruning methods while pruning more tokens. Notably, on Qwen2.5-Omni 7B, ContextGuard achieves full-token-level performance on five of six benchmarks while pruning 55% of input tokens.
arXiv arXiv cs.AI · 6 小时前 · 相关度 87% 热度★★☆☆☆
41
EVOCHAMBER: Test-Time Co-evolution of Multi-Agent System at Individual, Team, and Population Scales
EVOCHAMBER:多智能体系统在个体、团队与种群尺度上的测试时共同演化
开发工具学术论文

EVOCHAMBER 是一个免训练的多智能体测试时演化框架,在个体、团队和种群三个层次上协调智能体共同进化。其核心的 CODREAM 协议在团队任务失败或意见分歧时触发,通过非对称知识传递从强智能体填补弱智能体的能力缺口,同时保留专业化分工。团队级算子动态组建特定能力团队并选择协作结构,种群级算子通过分叉、合并、剪枝与播种管理智能体池。实验使用 Qwen3-8B 模型,在竞赛数学、代码和多领域推理任务上分别达到 63.9%、75.7% 和 87.1% 的准确率,相比最佳基线相对提升 32%,并自发涌现出稳定的专业化智能体结构。

arXiv:2605.11136v1 Announce Type: new Abstract: We argue that multi-agent test-time evolution is not single-agent evolution replicated N times. A single-agent learner can only evolve its own context and memory. A multi-agent system additionally evolves who collaborates, how they collaborate, and how knowledge flows across the population. These components have no single-agent counterpart and can produce phenomena such as emergent specialization. Yet prior test-time methods either confine experiences to individual agents, forfeiting cross-agent learning, or broadcast symmetrically to all agents, erasing the specialization that makes collaboration valuable. We present EVOCHAMBER, a training-free framework that instantiates test-time evolution at three levels over a coevolving agent pool. At its core is CODREAM (Collaborative Dreaming), a post-task protocol triggered on team failure or disagreement, in which agents collaboratively reflect, distill insights, and route them asymmetrically from strong to weak agents on the failed niche, preserving specialization while filling knowledge gaps. Team-level operators assemble niche-conditioned teams and select collaboration structures online. Population-level lifecycle operators fork, merge, prune, and seed agents under performance pressure. On three heterogeneous task streams with Qwen3-8B, EVOCHAMBER reaches 63.9% on competition math, 75.7% on code, and 87.1% on multi-domain reasoning, outperforming the best baseline by 32% relative on math and confirming asymmetric cross-agent transfer as the primary driver in ablation. Starting from several identically initialized agents, four to five stable niche specialists spontaneously emerge, a structural signature of multi-agent evolution that no single-agent learner can express. See our code at: https://github.com/Mercury7353/EvoChamber
arXiv arXiv cs.AI · 6 小时前 · 相关度 85% 热度★★☆☆☆
42
Unlocking LLM Creativity in Science through Analogical Reasoning
通过类比推理释放大型语言模型在科学中的创造力
基础大模型学术论文

本文研究了大语言模型在开放科学问题解决方案生成中的模式崩溃现象,即生成多样性严重不足。作者提出类比推理(AR)方法,通过跨领域类比搜索新颖解决方案,在四个生物医学任务上将解决方案多样性指标提升90-173%,新颖方案比例超过50%(基线最低仅1.6%),并在扰动效应预测、细胞间通讯推断等任务上取得了多达13倍的性能提升或SOTA结果,表明AR可有效扩展解决方案搜索空间。

arXiv:2605.11258v1 Announce Type: new Abstract: Autonomous science promises to augment scientific discovery, particularly in complex fields like biomedicine. However, this requires AI systems that can consistently generate novel and diverse solutions to open-ended problems. We evaluate LLMs on the task of open-ended solution generation and quantify their tendency to mode collapse into low-diversity generations. To mitigate this mode collapse, we introduce analogical reasoning (AR) as a new approach to solution generation. AR generates analogies to cross-domain problems based on shared relational structure, then uses those analogies to search for novel solutions. Compared to baselines, AR discovers significantly more diverse generations (improving solution diversity metrics by 90-173%), generates novel solutions over 50% of the time (compared to as little as 1.6% for baselines), and produces high-quality analogies. To validate the real-world feasibility of AR, we implement AR-generated solutions across four biomedical problems, yielding consistent quantitative gains. AR-generated approaches achieve a nearly 13-fold improvement on distributional metrics for perturbation effect prediction, outperform all baselines on AUPRC when predicting cell-cell communication, infer brain region interactions with a high Spearman correlation ($\rho$=0.729) to published methods, and establish state-of-the-art performance on 2 datasets for oligonucleotide property prediction. The novel and diverse solutions produced by AR can be used to augment the search space of existing solution generation methods.
arXiv arXiv cs.AI · 6 小时前 · 相关度 85% 热度★★☆☆☆
43
SkillGen: Verified Inference-Time Agent Skill Synthesis
SkillGen: 经验证的可推理时智能体技能合成
开发工具学术论文

SkillGen是一个多智能体框架,能够从基础智能体的执行轨迹中通过对比归纳自动合成可审计的技能。它利用成功和失败轨迹的对比,识别可复用的成功模式、反复出现的失败模式以及邻近成功但失败中缺失的行为,并迭代优化候选技能。其创新之处在于将技能视为干预措施,通过对比有无技能在相同实例上的表现,实证评估技能的净效果(兼顾修复和退化)。在多种智能体和数据集上,SkillGen持续提升了留出性能,优于现有技能生成基线,且生成的技能可跨模型迁移。

arXiv:2605.10999v1 Announce Type: cross Abstract: Skills are a promising way to improve LLM agent capabilities without retraining, while keeping the added procedure reusable and controllable. However, high-quality skills are still largely written by hand. We introduce SkillGen, a multi-agent framework that synthesizes a single auditable skill from trajectories generated by a base agent. The output is a human-readable artifact that can be inspected before use. Rather than merely summarizing trajectories, SkillGen leverages contrastive induction over both successful and failed trajectories to identify reusable success patterns, recurring failure modes, and behaviors that appear in nearby successes but are missing from failures. SkillGen then generates candidate skills and iteratively refines the skill. A key novelty in SkillGen is that we model agent skills as interventions to empirically verify the net effect of skills on the overall performance. Specifically, we compare outcomes on the same instances with and without the skill, so that we account for both repairs (cases where the skill fixes a baseline failure) and regressions (cases where the skill breaks a baseline success). Across a broad range of agents and datasets, SkillGen consistently improves held-out performance, outperforms existing skill-generation baselines, and produces skills that transfer across models.
arXiv arXiv cs.AI · 6 小时前 · 相关度 85% 热度★★☆☆☆
44
Test-Time Personalization: A Diagnostic Framework and Probabilistic Fix for Scaling Failures
测试时个性化:规模化失效的诊断框架与概率修复方案
推理部署学术论文基础大模型

本文研究大语言模型的测试时个性化(TTP),通过从个性化策略模型采样多个候选并利用个性化奖励模型选择最佳结果以扩展推理计算。理论证明Oracle选择可使期望效用随候选数对数增长,但标准奖励模型存在用户级坍缩和查询级奖励黑客两种失效模式。作者推导出统一规模化定律,将奖励模型的Best-of-N曲线分解为四个可测量量,并据此提出概率个性化奖励模型,通过学习方差有效缓解失效问题,在多个策略模型和个性化文本生成任务上实现一致的推理规模化提升。

arXiv:2605.10991v1 Announce Type: cross Abstract: Existing approaches to LLM personalization focus on constructing better personalized models or inputs, while treating inference as a single-shot process. In this work, we study Test-Time Personalization (TTP) along an unexplored axis: scaling inference-time computation by sampling N candidates from a personalized policy model and selecting the best with a personalized reward model. We prove that oracle selection yields expected utility growing logarithmically with the number of sampled candidates, establishing a theoretical ceiling for test-time scaling. However, standard reward models fail to realize this potential. To diagnose why, we derive a unified scaling law that decomposes any reward model&#39;s Best-of-N curve into four measurable quantities and reveals two failure modes, user-level collapse (near-constant prediction for some users) and query-level reward hacking (negative correlation with true quality for some queries). Guided by this law, we propose a probabilistic personalized reward model whose learned variance effectively mitigates both failure modes. Experiments confirm both elements of our framework: TTP delivers consistent scaling across multiple policy models and personalized text generation tasks, and our scaling law closely matches observed scaling curves across reward-model variants.
arXiv arXiv cs.AI · 6 小时前 · 相关度 85% 热度★★☆☆☆
45
LEAP: Unlocking dLLM Parallelism via Lookahead Early-Convergence Token Detection
LEAP:通过前瞻性早期收敛令牌检测释放扩散语言模型并行性
推理部署性能优化

本文针对扩散语言模型(dLLM)并行解码中因高置信度阈值导致的可扩展性受限问题,提出训练无关的即插即用方法LEAP。通过统计发现大量token在去噪早期已收敛到正确结果但未达到标准置信阈值,LEAP利用未来上下文过滤和多序列叠加检测早期收敛token,显著降低推理延迟。基准测试表明LEAP平均减少约30%的去噪步骤,在GSM8K数据集上结合dParallel实现每步解码7.2个token且保持精度,打破了并行解码对高置信先验的依赖。

arXiv:2605.10980v1 Announce Type: cross Abstract: Diffusion Language Models (dLLMs) have garnered significant attention for their potential in highly parallel processing. The parallel capabilities of existing dLLMs stem from the assumption of conditional independence at high confidence levels, which ensures negligible discrepancy between the marginal and joint distributions. However, the stringent confidence thresholds required to preserve accuracy severely constrain the scalability of parallelism. Through systematic token-level statistical analysis, we reveal that a substantial proportion of tokens converge to their correct predictions early in the denoising process yet fail to reach standard confidence thresholds, confirming that current confidence-based criteria are overly conservative. In response, we introduce LEAP (Lookahead Early-Convergence Token Detection for Accelerated Parallel Decoding). LEAP is a training-free, plug-and-play method that leverages future context filtering and multi-sequence superposition to detect early-converging tokens. By validating the alignment between early convergence and correctness, we enable reliable early decoding of these tokens. Benchmarking across diverse domains demonstrates that LEAP significantly lowers inference latency and decoding steps. Compared to confidence-based decoding, the average number of denoising steps is reduced by about 30%. On the GSM8K dataset, combining LEAP with dParallel accelerates decoding to 7.2 tokens per step while preserving model precision. LEAP effectively breaks the reliance on high-confidence priors, offering a novel paradigm for parallel decoding.
arXiv arXiv cs.AI · 6 小时前 · 相关度 85% 热度★★☆☆☆
46
Understanding and Preventing Entropy Collapse in RLVR with On-Policy Entropy Flow Optimization
理解和防止RLVR中的熵崩溃:基于同策略熵流优化
训练微调学术论文

本文从token级熵流的角度分析了强化学习可验证奖励(RLVR)中常见的熵崩溃问题,发现熵减少的token持续多于熵增加的token,导致熵流严重不平衡。基于此提出了同策略熵流优化方法(OPEFO),一种自适应平衡机制,根据每个更新对熵变化的贡献重新缩放更新幅度,同时保持严格同策略。在六个数学推理基准上,OPEFO改善了训练稳定性和最终性能。

arXiv:2605.11491v1 Announce Type: new Abstract: Reinforcement learning with verifiable rewards (RLVR) has become an effective paradigm for improving the reasoning ability of large language models. However, widely used RLVR algorithms, such as GRPO, often suffer from entropy collapse, leading to premature determinism and unstable optimization. Existing remedies, including entropy regularization and ratio-based clipping heuristics, either control entropy in a coarse-grained manner or rely on approximate on-policy training. In this paper, we revisit entropy collapse from a token-level entropy flow perspective. Our analysis reveals that entropy-decreasing tokens consistently outweigh entropy-increasing ones, resulting in a severely imbalanced entropy flow. This perspective provides a unified explanation of entropy collapse in existing RLVR algorithms and highlights the importance of balancing entropy dynamics. Motivated by this analysis, we propose On-Policy Entropy Flow Optimization (OPEFO), an adaptive entropy flow balancing mechanism that rescales entropy-increasing and entropy-decreasing updates according to their contributions to entropy change, while remaining strict on-policy. Experiments on six mathematical reasoning benchmarks demonstrate that OPEFO improves training stability and final performance. We will release the code and models upon publication.
arXiv arXiv cs.LG · 6 小时前 · 相关度 85% 热度★★☆☆☆
47
Drop the Act: Probe-Filtered RL for Faithful Chain-of-Thought Reasoning
抛弃表演:基于探针过滤强化学习的忠实链式思考推理
推理部署训练微调学术论文

该论文指出当前推理模型普遍存在“推理剧场”现象,即模型内部早已确定答案,却产生大量对正确性无贡献的假性推理步骤,浪费推理算力且降低可解释性。为此,作者提出ProFIL方法,在GRPO强化学习过程中引入一个仅需在冻结基座模型上训练一次的多头注意力探针,用于检测内部激活中已做出决策后的“后承诺步骤”,并将探针得分超阈值的生成路径优势置零。实验在GSM8K、LiveCodeBench等多个推理任务和Llama-8B、Qwen-7B上验证,能减少11%-100%的推理剧场行为,提升忠实比例(如LiveCodeBench上独立裁判评估提高24个百分点),缩短链长度4%-19%,且保持或提升任务精度,效果优于简单的长度惩罚基线。

arXiv:2605.11467v1 Announce Type: new Abstract: Reasoning models post-hoc rationalize answers they have already committed to internally, producing chains of *reasoning theater*: deliberative-looking steps that contribute nothing to correctness. This wastes inference tokens, pollutes interpretability, and obscures what the model actually computed. We introduce **ProFIL** (**Pro**be-**Fil**tered Reinforcement Learning) to *reduce theater, increase chain-of-thought faithfulness, and shrink chain length* in a single, drop-in extension to Group Relative Policy Optimization (GRPO). A multi-head attention probe is trained *once* on the *frozen* base model to detect post-commitment steps from internal activations alone; during GRPO, rollouts whose probe score exceeds a threshold have their advantage zeroed. *Our central finding is that a probe trained on a frozen base, with verifier-derived labels and no human annotation, provides a stable signal that suppresses theater while resisting the RL-obfuscation failure mode predicted by prior work.* Across four reasoning domains (GSM8K, LiveCodeBench, ToolUse, MMLU-Redux) and two model architectures (Llama-8B, Qwen-7B), ProFIL reduces post-commitment theater by **11--100%**, raises faithful-fraction (e.g., +24pp on LiveCodeBench under an independent Claude 3.7 Sonnet judge), and shortens chains by 4--19%, all while preserving or improving task accuracy. ProFIL also beats a matched length-penalty GRPO baseline, isolating the gain as semantic commitment-detection rather than chain compression. Probe weights, training configurations, and rollouts are released across all four domains.
arXiv arXiv cs.LG · 6 小时前 · 相关度 85% 热度★★☆☆☆
48
Semantic Reward Collapse and the Preservation of Epistemic Integrity in Adaptive AI Systems
语义奖励崩溃与自适应AI系统中认知完整性的保持
训练微调学术论文

本文针对基于RLHF和偏好优化的LLM中出现的表现性确定、校准漂移、谄媚等现象,提出“语义奖励崩溃”(SRC)概念,指出事实错误、不确定性表达、格式不满等语义上不同的评估信号会在优化中被压缩为通用奖励,导致模型倾向于压制可见的认知失败。作者类比制度性指标崩溃、软件可靠性工程等,主张将不确定性披露视为受保护的认知行为而非全局惩罚的任务失败。最后提出“宪法奖励分层”(CRS)这一领域感知的奖励框架,旨在保持自适应学习系统中差异化的认知归因,将其作为可检验的治理导向研究方向。

arXiv:2605.12406v1 Announce Type: new Abstract: Recent advances in reinforcement learning from human feedback (RLHF) and preference optimization have substantially improved the usability, coherence, and safety of large language models. However, recurring behaviors such as performative certainty, hallucinated continuity, calibration drift, sycophancy, and suppression of visible uncertainty suggest unresolved structural issues within scalarized preference optimization systems. We propose Semantic Reward Collapse (SRC): the compression of semantically distinct forms of evaluative dissatisfaction into generalized optimization signals. Under SRC, categories such as factual incorrectness, uncertainty disclosure, formatting dissatisfaction, latency, and social preference may become entangled within a shared reward topology despite representing fundamentally different epistemic classes. We argue that adaptive reasoning systems operating under generalized evaluative pressure may drift toward suppression of visible epistemic failure rather than preservation of calibrated uncertainty integrity. These behaviors are framed strictly as optimization consequences rather than evidence of deception or anthropomorphic agency. Drawing on institutional proxy collapse, metric gaming, software reliability engineering, and human learning theory, we propose that uncertainty disclosure and escalation behavior should be treated as protected epistemic conduct rather than globally penalized task incompletion. Finally, we introduce Constitutional Reward Stratification (CRS), a domain-aware reward framework intended to preserve differentiated epistemic attribution within adaptive learning systems. We present CRS not as a validated solution, but as a testable governance-oriented research direction requiring further empirical investigation.
arXiv arXiv cs.AI · 6 小时前 · 相关度 85% 热度★★☆☆☆
49
ADMM-Q: An Improved Hessian-based Weight Quantizer for Post-Training Quantization of Large Language Models
ADMM-Q:一种改进的基于Hessian的权重量化器,用于大语言模型后训练量化
推理部署性能优化学术论文

本文提出ADMM-Q,一种用于大语言模型后训练量化的新型权重算法,基于交替方向乘子法的组合变体进行逐层优化,通过算子分裂逐步降低重建误差并保证收敛,同时引入惩罚调度、预条件处理及局部搜索等增强手段以高效应用于LLM规模。ADMM-Q可无缝替换现有量化流程中的权重量化器,并能与范围裁剪、旋转和激活缩放等技术组合使用。在Qwen3-8B上,替换GPTQ后显著降低WikiText-2困惑度:W3A16权重量化场景从12.85降至10.06,W4A8 SmoothQuant设置从9.29降至8.68,W2A4KV4 SpinQuant场景从66.11降至19.42。

arXiv:2605.11222v1 Announce Type: new Abstract: Quantization is an effective strategy to reduce the storage and computation footprint of large language models (LLMs). Post-training quantization (PTQ) is a leading approach for compressing LLMs. Popular weight quantization procedures, including GPTQ and RTN, suffer in model utility, especially at aggressive quantization levels (sub-4-bit). We propose ADMM-Q, a novel weight quantization algorithm that considers the layer-wise quantization problem. Our algorithm is based on a combinatorial variant of the Alternating Direction Method of Multipliers (ADMM). Our operator-splitting procedure updates weights continuously to minimize the layer-wise reconstruction error, while gradually enforcing the quantization constraints with convergence guarantees. We propose additional algorithmic enhancements (e.g., penalty scheduling, preconditioning, and a local search post-processing step) to make ADMM-Q efficient at LLM scale. ADMM-Q is modular and can be used as a drop-in replacement for any weight quantizer within existing quantization pipelines: ADMM-Q is fully composable with existing techniques including range clipping, learned or random rotations, and activation scaling. Using ADMM-Q in place of GPTQ on Qwen3-8B, we decrease WikiText-2 perplexity in: (i) the W3A16 weight-only setting (12.85 $\rightarrow$ 10.06); (ii) the W4A8 SmoothQuant procedure (9.29 $\rightarrow$ 8.68); and (iii) the W2A4KV4 SpinQuant procedure (66.11 $\rightarrow$ 19.42).
arXiv arXiv cs.LG · 6 小时前 · 相关度 85% 热度★★☆☆☆
50
Muon is Not That Special: Random or Inverted Spectra Work Just as Well
Muon没那么特殊:随机或反转的谱工作得同样好
训练微调学术论文

该论文挑战Muon优化器依赖精确非欧几何的观点,提出基于Schatten准范数的Freon优化器族,通过可证明最优的QDWH迭代近似实现,其最佳参数落在无法被酉不变LMO表示的准范数区域。进一步设计荒谬的Kaon优化器,用随机噪声取代奇异值,竟能匹配Muon的性能并保留经典收敛保证,证明严格几何结构并非训练性能的关键驱动。分析表明优化效果由对齐和下降势两个局部量控制,Muon成功在于保证了步长最优性而非追踪全局几何。

arXiv:2605.11181v1 Announce Type: new Abstract: The recent empirical success of the Muon optimizer has renewed interest in non-Euclidean optimization, typically justified by similarities with second-order methods, and linear minimization oracle (LMO) theory. In this paper, we challenge this geometric narrative through three contributions, demonstrating that precise geometric structure is not the key factor affecting optimization performance. First, we introduce Freon, a family of optimizers based on Schatten (quasi-)norms, powered by a novel, provably optimal QDWH-based iterative approximation. Freon naturally interpolates between SGD and Muon, while smoothly extrapolating into the quasi-norm regime. Empirically, the best-performing Schatten parameters for GPT-2 lie strictly within the quasi-norm regime, and thus cannot be represented by any unitarily invariant LMO. Second, noting that Freon performs well across a wide range of exponents, we introduce Kaon, an absurd optimizer that replaces singular values with random noise. Despite lacking any coherent geometric structure, Kaon matches Muon&#39;s performance and retains classical convergence guarantees, proving that strict adherence to a precise geometry is practically irrelevant. Third, having shown that geometry is not the primary driver of performance, we demonstrate it is instead controlled by two local quantities: alignment and descent potential. Ultimately, each optimizer must tune its step size around these two quantities. While their dynamics are difficult to predict a-priori, evaluating them within a stochastic random feature model yields a precise insight: Muon succeeds not by tracking an ideal global geometry, but by guaranteeing step-size optimality.
arXiv arXiv cs.LG · 6 小时前 · 相关度 85% 热度★★☆☆☆
51
OOM-Free Alpamayo via CPU-GPU Memory Swapping for Vision-Language-Action Models
通过CPU-GPU内存交换实现面向视觉-语言-动作模型的无OOM Alpamayo推理
推理部署性能优化学术论文

本文提出一种不修改模型的系统级优化框架,使视觉-语言-动作(VLA)大模型能在消费级GPU上进行高效推理。框架包含三个阶段:顺序需求分层将显存占用降至层粒度;流水线需求分层通过传输与计算重叠隐藏参数传输时间;GPU常驻层决策策略根据各模块的驻留收益分析消除剩余传输开销。同时提出性能预测模型,通过一次profiling即可确定最优常驻层数量与位置,预测误差低于1.3%。在RTX 5070Ti上运行NVIDIA的Alpamayo-R1-10B模型,相比Accelerate offloading最高获得3.55倍加速,并保持完整BF16精度。

arXiv:2605.11678v1 Announce Type: new Abstract: End-to-end Vision-Language-Action (VLA) models for autonomous driving unify perception, reasoning, and control in a single neural network, achieving strong driving performance but requiring 20-60GB of GPU memory-far exceeding the 12-16GB available on commodity GPUs. We present a framework, which enables memory-efficient VLA inference on VRAM-constrained GPUs through system-level optimization alone, without model modification. Our work proceeds in three stages: (1) Sequential Demand Layering reduces VRAM usage from model-level to layer-level granularity; (2) Pipelined Demand Layering hides parameter transfer time within layer execution time via transfer--compute overlap; and (3) a GPU-Resident Layer Decision Policy, informed by per-module residency benefit analysis, eliminates the residual transfer overhead that pipelining cannot hide. We further propose a performance prediction model that determines the optimal configuration-both the number and placement of resident layers-from a single profiling run with less than 1.3% prediction error across all configurations. Applied to NVIDIA&#39;s Alpamayo-R1-10B (21.52GB) on an RTX 5070Ti (16GB), our work achieves up to 3.55x speedup over Accelerate offloading while maintaining full BF16 precision.
arXiv arXiv cs.AI · 6 小时前 · 相关度 85% 热度★★☆☆☆
52
Seir\^enes: Adversarial Self-Play with Evolving Distractions for LLM Reasoning
Seirēnes: 面向LLM推理的对抗自博弈与演化式干扰
训练微调学术论文

本文提出Seirēnes框架,通过参数共享的自博弈循环,让单一模型同时构建干扰性语境和解决核心问题,将上下文干扰从失败模式转变为训练信号,迫使模型超越表面模式匹配、习得鲁棒推理能力。该方法在7个数学推理基准和4B-30B参数规模上平均提升10.2、9.1和7.2分,且其生成的干扰语境能使GPT、Gemini等顶级闭源模型准确率下降4-5分,展示了揭示推理盲点的通用能力。整个框架通过持续互动维持了协同进化的课程,为提升LLM在非理想语境下的推理可靠性提供了新途径。

arXiv:2605.11636v1 Announce Type: new Abstract: We present Seir\^enes, a self-play RL framework that transforms contextual interference from a failure mode of LLM reasoning into an internal training signal for co-evolving more resilient reasoners. While RL with verifiable rewards has significantly advanced reasoning capabilities, models can still exhibit fragility when encountering non-idealized contexts: scenarios characterized by superfluous information, tangential instructions, or incidental correlations that differ from the clean distributions typical of standard benchmarks. Seir\^enes harnesses this vulnerability through a parameter-shared and adversarial self-play loop. Within this framework, a single model is trained to both construct plausible yet distracting contexts that expose its own reasoning blind spots, and solve problems by discerning the essential task from these perturbations to recover the core underlying logic. By pitting these competing objectives against each other, Seir\^enes compels the model to move beyond superficial pattern matching and anchors its capabilities in robust underlying reasoning. This continuous interaction sustains an informative co-evolutionary curriculum as the model improves. Across seven mathematical reasoning benchmarks and model scales from 4B to 30B, Seir\^enes achieves average gains of +10.2, +9.1, and +7.2 points. Besides, distracting contexts produced by the 4B Seir\^enes model reduce the accuracy of top-tier closed-source models (GPT and Gemini) by roughly 4--5 points, revealing Seir\^enes&#39; general ability to uncover reasoning models&#39; blind spots.
arXiv arXiv cs.AI · 6 小时前 · 相关度 85% 热度★★☆☆☆
53
CuSearch: Curriculum Rollout Sampling via Search Depth for Agentic RAG
CuSearch:面向智能体 RAG 的基于搜索深度的课程式 rollout 采样
训练微调学术论文

本文针对智能体检索增强生成(RAG)系统的强化学习训练,提出 CuSearch 框架。它利用搜索深度作为检索子策略监督密度的无标注代理,通过搜索深度贪心分配(SDGA)在批次内重新分配更新预算,优先选择更深搜索轨迹的 rollout,形成与训练过程对齐的隐式课程。实验表明,CuSearch 相比标准 GRPO 在 ZeroSearch 上最高提升 11.8 个精确匹配点,证明了该方法的有效性。

arXiv:2605.11611v1 Announce Type: new Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a promising paradigm for training agentic retrieval-augmented generation (RAG) systems from outcome-only supervision. Most existing methods optimize policies from uniformly sampled rollouts, implicitly treating all trajectories as equally informative. However, trajectories differ substantially in search depth and are therefore not equally informative: deeper-search trajectories contain more retrieval decision points and provide denser direct supervision for the retrieval sub-policy. Moreover, this heterogeneity grows over training as the within-batch depth distribution shifts toward higher values, yet uniform rollout sampling remains blind to this shift. To address this, we propose CuSearch, a curriculum rollout sampling framework built on Search-Depth Greedy Allocation (SDGA), a batch-level operator that reallocates a fixed update budget toward deeper-search trajectories. SDGA-Auto always targets the deepest available trajectories in the current batch, yielding an implicit training-aligned curriculum as the depth distribution shifts upward. SDGA-Phase explicitly advances the curriculum threshold as deeper trajectories become sufficiently abundant. Experiments across model types and retrieval frameworks show that CuSearch consistently improves performance, achieving up to 11.8 exact-match points over standard GRPO on ZeroSearch. These results establish per-trajectory search depth as a reliable, annotation-free proxy for retrieval supervision density in RLVR-based agentic RAG training. The code is available at https://github.com/MrToser/CuSearch.
arXiv arXiv cs.AI · 6 小时前 · 相关度 85% 热度★★☆☆☆
54
DisagMoE: Computation-Communication overlapped MoE Training via Disaggregated AF-Pipe Parallelism
DisagMoE:通过解聚合AF-Pipe并行实现计算-通信重叠的MoE训练
训练微调性能优化学术论文

DisagMoE提出一种解聚合的MoE训练系统,将注意力层和前馈网络层分离到不同的GPU组,引入多阶段单向多对多流水线,并利用计算-通信roofline模型平衡GPU与网络带宽分配,以最大化训练效率。该系统基于Megatron-LM实现,在16节点8卡H800集群上评估,针对多种MoE模型实现了最高1.8倍的训练加速,有效缓解了专家并行中的全对全通信瓶颈。

arXiv:2605.11005v1 Announce Type: new Abstract: Mixture-of-experts (MoE) architectures enable trillion-parameter LLMs with sparsely activated experts. Expert parallelism (EP) is a widely adopted MoE training strategy, but it suffers from severe all-to-all communication bottlenecks, which is exaggerated by the limited inter-node network bandwidth as the growing model size requires distributing experts across GPU nodes. Prior work focused on overlapping these all-to-all communications with feed-forward network (FFN) and self-attention computations, which often leaves residual network-bound stalls due to inherent imbalance in attention and FFN layers&#39; computation-communication ratios. We present DisagMoE, a disaggregated MoE training system that jointly optimizes model placement and scheduling for maximal efficiency. DisagMoE separates attention and FFN layers into disjoint GPU groups, introduces a multi-stage pipeline with uni-directional, many-to-many communications, and employs a computation-communication roofline model to balance GPU and network bandwidth allocation among the attention and FFN groups. DisagMoE is implemented on Megatron-LM, and evaluation shows that DisagMoE improves training efficiency across multiple MoE models with up to 1.8x speedup on 16-node 8xH800 clusters.
arXiv arXiv cs.LG · 6 小时前 · 相关度 85% 热度★★☆☆☆
55
Internalizing Curriculum Judgment for LLM Reinforcement Fine-Tuning
大语言模型强化微调的内化课程判断
训练微调学术论文

本文提出一种名为METIS的框架,将课程判断内化为大语言模型的原生能力,通过预测提示内奖励方差来量化提示信息量,并利用近期训练结果作为轻量级上下文示例动态调整训练分配。该框架联合优化标准强化微调奖励与自我判断奖励,使模型学会“该学什么”,形成元认知式的训练闭环。在数学推理、代码生成和工具调用等多个离散与连续强化微调基准上,METIS一致地取得更优性能,并将收敛速度提升最高达67%,为LLM强化微调建立了一种简单高效的课程内化范式。

arXiv:2605.11235v1 Announce Type: new Abstract: In LLM Reinforcement Fine-Tuning (RFT), curriculum learning drives both efficiency and performance. Yet, current methods externalize curriculum judgment via handcrafted heuristics or auxiliary models, risking misalignment with the policy&#39;s training dynamics. In this paper, we introduce METIS (METacognitive Internalized Self-judgment), a novel framework that internalizes curriculum judgment as a native capability. Leveraging a critical observation that within-prompt reward variance effectively gauges prompt informativeness, METIS predicts this metric based on recent training outcomes as lightweight in-context learning examples. This intrinsic self-judgment then dynamically dictates the training allocation. Moreover, METIS closes the loop between judgment and optimization by jointly optimizing the standard RFT rewards and a self-judgment reward. This allows the policy to learn what to learn next, as a form of metacognition. Across extensive discrete and continuous RFT benchmarks from mathematical reasoning, code generation, to agentic function-calling, METIS consistently delivers superior performance while accelerating convergence by up to 67%. By bypassing handcrafted heuristics and auxiliary models, our work establishes a simple, closed-loop, and highly efficient curriculum internalization paradigm for LLM reinforcement fine-tuning.
arXiv arXiv cs.LG · 6 小时前 · 相关度 85% 热度★★☆☆☆
56
Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information
通过点互信息实现推理强化学习的反自蒸馏
训练微调学术论文

本文针对数学推理中自蒸馏增益不稳定的问题,通过点互信息分析发现特权上下文会膨化结构令牌的置信度同时压低关键的探索令牌(如“Wait”、“Let”),从而损害多步推理。为此提出反自蒸馏(AntiSD),不再最小化学生与教师的散度,而是使其最大化,从而逆转每个令牌的符号,并引入熵触发门控在教师熵崩塌时关闭此项。在4B至30B参数的五种模型中,AntiSD在数学推理基准上仅用GRPO基线2~10%的训练步数即达到同等准确率,最终准确率最高提升11.5分,为模型通过自身训练信号自举推理能力开辟了可扩展的路径。

arXiv:2605.11609v1 Announce Type: new Abstract: On-policy self-distillation, where a student is pulled toward a copy of itself conditioned on privileged context (e.g., a verified solution or feedback), offers a promising direction for advancing reasoning capability without a stronger external teacher. Yet in math reasoning the gains are inconsistent, even when the same approach succeeds elsewhere. A pointwise mutual information analysis traces the failure to the privileged context itself: it inflates the teacher&#39;s confidence on tokens already implied by the solution (structural connectives, verifiable claims) and deflates it on deliberation tokens (&#34;Wait&#34;, &#34;Let&#34;, &#34;Maybe&#34;) that drive multi-step search. We propose Anti-Self-Distillation (AntiSD), which ascends a divergence between student and teacher rather than descending it: this reverses the per-token sign and yields a naturally bounded advantage in one step. An entropy-triggered gate disables the term once the teacher entropy collapses, completing a drop-in replacement for default self-distillation. Across five models from 4B to 30B parameters on math reasoning benchmarks, AntiSD reaches the GRPO baseline&#39;s accuracy in 2 to 10x fewer training steps and improves final accuracy by up to 11.5 points. AntiSD opens a path to scalable self-improvement, where a language model bootstraps its own reasoning through its training signal.
arXiv arXiv cs.LG · 6 小时前 · 相关度 85% 热度★★☆☆☆
57
KV-Fold: One-Step KV-Cache Recurrence for Long-Context Inference
KV-Fold:用于长上下文推理的一步式KV缓存递归
推理部署学术论文

本文提出KV-Fold,一种无需训练的简单长上下文推理协议。该方法将KV缓存视为左折叠累加器,分块处理序列,每步仅需将新生成的键值对追加到缓存中,实现类似函数式编程中foldl的递归更新。在Llama-3.1-8B上,KV-Fold在128K token上下文的针在干草堆测试中达到100%精确检索,且可在单个40GB GPU上运行,证明了冻结预训练Transformer本身已具备稳定的KV缓存递归能力,无需修改架构或额外训练即可支持长上下文推理。

arXiv:2605.12471v1 Announce Type: new Abstract: We introduce KV-Fold, a simple, training-free long-context inference protocol that treats the key-value (KV) cache as the accumulator in a left fold over sequence chunks. At each step, the model processes the next chunk conditioned on the accumulated cache, appends the newly produced keys and values, and passes the enlarged cache forward; the same one-step update is applied repeatedly, analogous to foldl in functional programming. Building on the KV cache concatenation primitive introduced for latent multi-agent communication, we repurpose it as a chunk-to-chunk recurrence for long-context inference. When processing chunk t, the model attends to the KV cache carried from earlier chunks as a prefix, reusing its internal state across segments without modifying or retraining the model. Despite its simplicity, the induced recurrence is stable: per-step drift rises briefly and then saturates into a flat plateau that persists across deep chains. This plateau is insensitive to a 10,000x change in numerical precision, robust across chunk sizes, and consistent across model families. At the task level, KV-Fold preserves exact information over long distances. On a needle-in-a-haystack benchmark, it achieves 100% exact-match retrieval across 152 trials spanning contexts from 16K to 128K tokens and chain depths up to 511 on Llama-3.1-8B, while remaining within the memory limits of a single 40GB GPU. Compared to streaming methods, which trade fidelity for bounded memory, KV-Fold maintains long-range retrieval while operating as a sequence of tractable forward passes. Overall, our results show that frozen pretrained transformers already support a stable form of KV-cache recurrence, providing a practical route to long-context inference without architectural changes or training.
arXiv arXiv cs.LG · 6 小时前 · 相关度 85% 热度★★☆☆☆
58
Solve the Loop: Attractor Models for Language and Reasoning
解决循环:吸引子模型在语言与推理中的应用
基础大模型训练微调

本文提出吸引子模型(Attractor Models),一种新型循环Transformer架构,通过骨干模块生成初始输出嵌入,再由吸引子模块求解不动点进行迭代精炼,并利用隐式微分获得梯度,使训练内存与有效深度无关,迭代次数自适应收敛。在语言模型预训练中,吸引子模型在多个规模上实现了对标准Transformer的帕累托改进,困惑度最高降低46.6%,下游任务准确率最高提升19.7%,且770M规模模型性能超过使用两倍数据训练的1.3B Transformer。在困难推理任务(如Sudoku-Extreme、Maze-Hard)上,仅27M参数的吸引子模型在1000样本训练后即可达到91.4%和93.1%的准确率,而前沿模型如Claude和GPT o3完全失败。此外,模型展现出“平衡内部化”现象,不动点训练使初始输出嵌入接近平衡态,推理时可直接移除求解器而性能损失极小。

arXiv:2605.12466v1 Announce Type: new Abstract: Looped Transformers offer a promising alternative to purely feed-forward computation by iteratively refining latent representations, improving language modeling and reasoning. Yet recurrent architectures remain unstable to train, costly to optimize and deploy, and constrained to small, fixed recurrence depths. We introduce Attractor Models, in which a backbone module first proposes output embeddings, then an attractor module refines them by solving for the fixed point, with gradients obtained through implicit differentiation. Thus, training memory remains constant in effective depth, and iterations are chosen adaptively by convergence. Empirically, Attractor Models outperform existing models across two regimes, large-scale language-model pretraining and reasoning with tiny models. In language modeling, Attractor Models deliver a Pareto improvement over standard Transformers and stable looped models across sizes, improving perplexity by up to 46.6% and downstream accuracy by up to 19.7% while reducing training cost. Notably, a 770M Attractor Model outperforms a 1.3B Transformer trained on twice as many tokens. On challenging reasoning tasks, we show that our model with only 27M parameters and approximately 1000 examples achieves 91.4% accuracy on Sudoku-Extreme and 93.1% on Maze-Hard, scaling favorably where frontier models like Claude and GPT o3, fail completely, and specialized recursive reasoners collapse at larger sizes. Lastly, we show that Attractor Models exhibit a novel phenomenon, which we call equilibrium internalization: fixed-point training places the model&#39;s initial output embedding near equilibrium, allowing the solver to be removed at inference time with little degradation. Together, these results suggest that Attractor Models make iterative refinement scalable by turning recurrence into a computation the model can learn to internalize.
arXiv arXiv cs.LG · 6 小时前 · 相关度 85% 热度★★☆☆☆
59
Search Your Block Floating Point Scales!
搜索你的块浮点比例因子!
推理部署性能优化芯片软件栈

本文针对GPU加速器新支持的微缩放块浮点(BFP)格式,提出一种精细化的比例因子搜索策略ScaleSearch,利用尾数位最小化量化误差。该方法可集成到训练后量化(PTQ)和低精度注意力机制中,并在此基础上引入ScaleSearchAttention,一种基于NVFP4的加速注意力算法,在因果语言模型上实现近零性能损失。实验表明,ScaleSearch使NVFP4量化误差降低27%,在Qwen3-8B的MATH500任务上PTQ提升15个点,在Llama 3.1 70B的WikiText-2困惑度改善0.77点,在保持基线性能的同时显著提升量化精度。

arXiv:2605.12464v1 Announce Type: new Abstract: Quantization has emerged as a standard technique for accelerating inference for generative models by enabling faster low-precision computations and reduced memory transfers. Recently, GPU accelerators have added first-class support for microscaling Block Floating Point (BFP) formats. Standard BFP algorithms use a fixed scale based on the maximum magnitude of the block. We observe that this scale choice can be suboptimal with respect to quantization errors. In this work, we propose ScaleSearch, an alternative strategy for selecting these scale factors: using a fine-grained search leveraging the mantissa bits in microscaling formats to minimize the quantization error for the given distribution. ScaleSearch can be integrated with existing quantization methods such as Post Training Quantization and low-precision attention, and is shown to improve their performance. Additionally, we introduce ScaleSearchAttention, an accelerated NVFP4-based attention algorithm, which uses ScaleSearch and adapted prior techniques to ensure near-0 performance loss for causal language modeling. Experiments show that ScaleSearch reduces quantization error by 27% for NVFP4 and improves language model PTQ by up to 15 points for MATH500 (Qwen3-8B), while ScaleSearchAttention improves Wikitext-2 PPL by upto 0.77 points for Llama 3.1 70B. The proposed methods closely match baseline performance while providing quantization accuracy improvements.
arXiv arXiv cs.LG · 6 小时前 · 相关度 85% 热度★★☆☆☆
60
Deep Reasoning in General Purpose Agents via Structured Meta-Cognition
通过结构化元认知实现通用代理中的深度推理
推理部署学术论文

本文提出了一种名为“深度推理”的推理时方法,通过结构化元推理为大型语言模型代理动态构建任务特定的推理支架,克服了现有硬编码支架灵活性不足的问题。该方法定义了一种形式语言,将元推理表示为对联想推理、形式计算和递归子问题求解的可执行分解,并作为上下文示例指导推理支架的即时构建。基于此实现的通用代理DOLORES在四个高难度推理基准上全面超越现有最强支架方法,平均性能提升24.8%,并能有效减少幻觉与过早终止,8B版本在超过半数场景中甚至超越了同系列32B模型的基线。

arXiv:2605.11388v1 Announce Type: cross Abstract: Humans intuitively solve complex problems by flexibly shifting among reasoning modes: they plan, execute, revise intermediate goals, resolve ambiguity through associative judgment, and apply formal procedures to well-specified subproblems. Current LLM agents lack this flexibility, as their scaffolds hard-code such reasoning decisions in advance. These scaffolds are effective when their prescribed structure matches the task, but brittle when solving the task requires adapting the structure of reasoning itself. We introduce Deep Reasoning -- an inference-time approach for constructing task-specific scaffolds through structured meta-reasoning. Deep Reasoning uses a formal language that represents meta-reasoning as executable decompositions over associative inference, formal computation, and recursive subproblem solving, enabling decomposition principles to be encoded as in-context examples that guide test-time scaffold construction. We instantiate this approach in a general-purpose agent (DOLORES) that distributes complex tasks across more controlled reasoning threads. We evaluate it against state-of-the-art scaffolding methods across four hard benchmarks: multi-hop reasoning, long-chain question answering, long-context aggregation, and deep research-style information seeking. DOLORES outperforms all evaluated scaffolds across three model sizes and two model families, improving over the strongest evaluated scaffold baseline by 24.8% on average. DOLORES distributes cognition across structured, lower-load reasoning threads, thereby reducing premature termination and hallucinations. This advantage can even bridge the scaling gap, with an 8B version surpassing all evaluated 32B baselines from the same family in more than half the settings. These results point toward future agentic systems that treat scaffolding as adaptive reasoning, constructing the structure each task requires just-in-time.
arXiv arXiv cs.AI · 6 小时前 · 相关度 85% 热度★★☆☆☆
61
SOMA: Efficient Multi-turn LLM Serving via Small Language Model
SOMA: 通过小语言模型实现高效多轮大模型服务
推理部署性能优化

SOMA 框架针对多轮对话场景下大模型服务的延迟、内存与费用问题,提出利用会话早期轮次构建局部响应流形,并训练小模型在该局部区域替代大模型完成后续对话。具体通过软提示学习挖掘大、小模型响应分歧最大的方向,结合抗退化控制稳定训练,并将挖掘的案例蒸馏为局部 LoRA 微调,使推理时无需额外提示。框架包含一次性切换与漂移回滚门控机制,能在保证响应质量的前提下显著提升服务效率。

arXiv:2605.11317v1 Announce Type: cross Abstract: Large Language Models (LLMs) are increasingly deployed in multi-turn dialogue settings where preserving conversational context across turns is essential. A standard serving practice concatenates the full dialogue history at every turn, which reliably maintains coherence but incurs substantial cost in latency, memory, and API expenditure, especially when queries are routed to large proprietary models. Existing approaches often struggle to balance the trade-off between response quality and efficiency. We propose a framework that exploits the early turns of a session to estimate a local response manifold and then adapt a smaller surrogate model to this local region for the remainder of the conversation. Concretely, we learn soft prompts that maximize semantic divergence between the large and surrogate small language models&#39; responses to surface least-aligned local directions, stabilize training with anti-degeneration control, and distill the mined cases into localized LoRA fine-tuning so the surrogate runs without prompts at inference. A simple gate enables a one-time switch with rollback on drift. We further provide a theoretical analysis for key components in SOMA. Extensive experiments show the effectiveness of SOMA. The source code is provided at: https://github.com/LabRAI/SOMA.
arXiv arXiv cs.AI · 6 小时前 · 相关度 85% 热度★★☆☆☆
62
The Semantic Training Gap: Ontology-Grounded Tool Architectures for Industrial AI Agent Systems
语义训练差距:面向工业AI智能体系统的本体驱动工具架构
基础大模型开发工具学术论文

本文指出大语言模型智能体在制造业应用中存在“语义训练差距”,即模型通过训练习得领域词汇但缺乏对设备、参数、故障代码等操作语义的本体关系理解,导致多智能体协同时出现语义漂移。为此提出一种将制造业本体直接嵌入AI工具层的运行时约束架构,通过resolve、contextualize、annotate三操作接口与AIOps编排层强制语义一致性。实验中,无约束工具参数产生43%的领域标识符幻觉率,而本体约束方法将幻觉率降至0%,并在一套数字孪生分析平台上验证了单代码基、多域本体配置的可行性。

arXiv:2605.11234v1 Announce Type: new Abstract: Large language model (LLM)-based AI agents are increasingly deployed in manufacturing environments for analytics, quality management, and decision support. These agents demonstrate statistical fluency with domain terminology but lack grounded understanding of operational semantics -- the relational structure that connects equipment identifiers, process parameters, failure codes, and regulatory constraints within a specific production context. This paper identifies and formalizes the semantic training gap: a structural disconnect between how AI systems acquire domain vocabulary through training and how manufacturing operations define meaning through ontological relationships. We demonstrate that this gap causes operationally incorrect outputs even when model responses are linguistically precise, and that in multi-agent configurations it produces a compounding failure mode we term semantic drift. To close this gap, we present an architecture that embeds manufacturing ontology directly into the AI tool layer as a typed relational configuration, enforcing semantic constraints at runtime rather than relying on model training. The architecture is formalized as a three-operation interface contract -- resolve, contextualize, annotate -- with invariants enforced by an AIOps orchestration layer. In a controlled experiment across six industry configurations (72 tool invocations using Qwen3-32B), unconstrained tool parameters produced a 43% hallucination rate for domain identifiers; ontology-grounded parameters reduced this to 0%. We validate the approach through a digital twin analytics platform demonstrating that a single codebase with domain-specific ontology configurations eliminates tool-call hallucination and achieves cross-domain configurability without application code changes.
arXiv arXiv cs.AI · 6 小时前 · 相关度 82% 热度★★☆☆☆
63
LatentRouter: Can We Choose the Right Multimodal Model Before Seeing Its Answer?
LatentRouter: 我们能在看到多模态模型回答前选出正确的模型吗?
推理部署学术论文

LatentRouter 提出一种多模态大模型路由方法,将路由问题建模为反事实多模态效用预测。它提取多模态路由胶囊,为每个候选模型引入能力令牌,并通过潜在状态交互估计模型被选中时的表现。路由策略支持性能导向和性能-成本均衡,并通过共享评分与可用性掩码处理候选池变化。在 MMR-Bench 和 VL-RouterBench 上的实验表现优于固定模型、特征级和学习型路由基线,尤其在视觉、布局敏感或推理需求强的多模态任务组上增益明显,潜在通信是主要贡献因素。

arXiv:2605.11301v1 Announce Type: new Abstract: Multimodal large language models (MLLMs) have heterogeneous strengths across OCR, chart understanding, spatial reasoning, visual question answering, cost, and latency. Effective MLLM routing therefore requires more than estimating query difficulty: a router must match the multimodal requirements of the current image-question input with the capabilities of each candidate model. We propose LatentRouter, a router that formulates MLLM routing as counterfactual multimodal utility prediction. Given an image-question query, LatentRouter extracts learned multimodal routing capsules, represents each candidate MLLM with a model capability token, and performs latent communication between these states to estimate how each model would perform if selected. A distributional outcome head predicts model-specific counterfactual quality, while a bounded capsule correction refines close decisions without allowing residual signals to dominate the prediction. The resulting utility-based policy supports performance-oriented and performance-cost routing, and handles changing candidate pools through shared per-model scoring with availability masking. Experiments on MMR-Bench and VL-RouterBench show that LatentRouter outperforms fixed-model, feature-level, and learned-router baselines. Additional analyses show that the gains are strongest on multimodal task groups where model choice depends on visual, layout-sensitive, or reasoning-oriented requirements, and that latent communication is the main contributor to the improvement. The code is available at: https://github.com/LabRAI/LatentRouter.
arXiv arXiv cs.AI · 6 小时前 · 相关度 82% 热度★★☆☆☆
64
TMPO: Trajectory Matching Policy Optimization for Diverse and Efficient Diffusion Alignment
TMPO:面向多样化与高效扩散对齐的轨迹匹配策略优化
训练微调学术论文

本文针对扩散模型强化学习对齐中常见的奖励黑客与模式坍缩问题,提出轨迹匹配策略优化(TMPO)方法。TMPO 采用 Softmax 轨迹平衡目标,将策略的轨迹概率分布匹配到奖励诱导的玻尔兹曼分布,从而在优化奖励的同时保持对可接受轨迹的覆盖。为加速大规模流匹配模型的多轨迹训练,引入动态随机树采样技术,通过共享去噪前缀和动态分支减少冗余计算。在人类偏好、组合式生成和文字渲染等任务上,TMPO 将生成多样性相对现有方法提升了 9.1%,并在下游指标与效率上达到最优的奖励-多样性权衡。

arXiv:2605.10983v1 Announce Type: cross Abstract: Reinforcement learning (RL) has shown extraordinary potential in aligning diffusion models to downstream tasks, yet most of them still suffer from significant reward hacking, which degrades generative diversity and quality by inducing visual mode collapse and amplifying unreliable rewards. We identify the root cause as the mode-seeking nature of these methods, which maximize expected reward without effectively constraining probability distribution over acceptable trajectories, causing concentration on a few high-reward paths. In contrast, we propose Trajectory Matching Policy Optimization (TMPO), which replaces scalar reward maximization with trajectory-level reward distribution matching. Specifically, TMPO introduces a Softmax Trajectory Balance (Softmax-TB) objective to match the policy probabilities of K trajectories to a reward-induced Boltzmann distribution. We prove that this objective inherits the mode-covering property of forward KL divergence, preserving coverage over all acceptable trajectories while optimizing reward. To further reduce multi-trajectory training time on large-scale flow-matching models, TMPO incorporates Dynamic Stochastic Tree Sampling, where trajectories share denoising prefixes and branch at dynamically scheduled steps, reducing redundant computation while improving training effectiveness. Extensive results across diverse alignment tasks such as human preference, compositional generation and text rendering show that TMPO improves generative diversity over state-of-the-art methods by 9.1%, and achieves competitive performance in all downstream and efficiency metrics, attaining the optimal trade-off between reward and diversity.
arXiv arXiv cs.AI · 6 小时前 · 相关度 82% 热度★★☆☆☆
65
Steering Without Breaking: Mechanistically Informed Interventions for Discrete Diffusion Language Models
无破坏的引导:基于机制信息的离散扩散语言模型干预
基础大模型学术论文

本文针对离散扩散语言模型的可控生成问题,发现均匀干预策略会损害生成质量,且在多属性联合控制时损伤加剧。作者通过训练稀疏自编码器分析四个模型(124M-8B参数),揭示不同属性(如主题、情感)在去噪过程中的形成时机存在显著差异,主题在最初2%步骤内定型,而情感则逐步涌现于20%的过程。基于此,提出一种自适应调度器,将干预集中于每个属性主动形成的步骤,其余步骤保持原生生成。该方法在七个引导任务上实现了精确控制,三属性同时控制时引导强度最高达93%,比最强基线高出15个百分点,同时保持生成质量。

arXiv:2605.10971v1 Announce Type: cross Abstract: Discrete diffusion language models (DLMs) generate text by iteratively denoising all positions in parallel, offering an alternative to autoregressive models. Controlled generation methods for DLMs, imported from autoregressive models, apply uniform intervention at every denoising steps. We show this uniform schedule degrades quality, and the damage compounds when multiple attributes are steered jointly. To diagnose the failure, we train sparse autoencoders on four DLMs (124M-8B parameters) and find that different attributes commit on distinct schedules, varying in timing, sharpness, and magnitude. For instance, topic commits within the first 2\% of denoising, whereas sentiment emerges gradually over 20\% of the process. Consequently, uniform intervention wastes steering capacity on steps where the target attribute has already solidified or has yet to emerge. We propose a novel adaptive scheduler that concentrates interventions on the steps where an attribute is actively forming and leaves the rest of generation untouched. The cost-control trade-off admits a closed-form characterization: the advantage of adaptive over uniform scheduling is governed by a single dispersion statistic of the commitment distribution. Across four DLMs and seven steering tasks, our method achieves precise control without the degradation typical of uniform interventions. Especially on challenging simultaneous three-attribute control, it reaches up to 93\% steering strength, beating the strongest baseline by up to 15\% points while preserving generation quality.
arXiv arXiv cs.AI · 6 小时前 · 相关度 82% 热度★★☆☆☆
66
fg-expo: Frontier-guided exploration-prioritized policy optimization via adaptive kl and gaussian curriculum
FG-ExPO:基于自适应KL与高斯课程的前沿引导探索优先策略优化
训练微调学术论文

FG-ExPO针对GRPO算法在LLM数学推理训练中的两个缺点:固定KL系数限制探索、均匀采样忽视中等难度问题的信息量,提出两个轻量级组件。Accuracy-Conditioned KL Scaling根据批次准确率动态调整KL惩罚强度,Gaussian Curriculum Sampling以准确率约0.5的问题为中心分配高斯采样权重,将训练聚焦在学习前沿。在DeepSeek-R1-Distill-Qwen-1.5B和Qwen3-8B-Base上六个数学推理基准测试中,FG-ExPO一致优于原始GRPO,AIME 2025 pass@32指标从63.33%提升至76.67%,8B模型平均pass@32提升2.66,且pass@32相比pass@1增益更显著,验证了扩大有效探索空间的效果。

arXiv:2605.11403v1 Announce Type: new Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has become the standard paradigm for LLM mathematical reasoning, with Group Relative Policy Optimization (GRPO) serving as the dominant algorithm. We identify two overlooked inefficiencies inherent in GRPO. First, a fixed KL coefficient overly restricts policy exploration at moments when the model needs to diverge significantly from the reference policy. Second, uniform question sampling overlooks that moderately difficult problems produce the most informative gradient signals. We propose FG-ExPO, short for Frontier-Guided Exploration-Prioritized Policy Optimization, which integrates two lightweight components. Accuracy-Conditioned KL Scaling (AKL) adjusts the KL penalty strength through a smooth nonlinear function of batch average accuracy, loosening the constraint when the model performs poorly and strengthening it when the model achieves satisfactory results. Gaussian Curriculum Sampling (GCS) assigns sampling weights to questions following a Gaussian distribution centered at a moderate accuracy level around 0.5, focusing model training on its learning frontier. We conduct evaluations on DeepSeek-R1-Distill-Qwen-1.5B and Qwen3-8B-Base across six mainstream mathematical reasoning benchmarks. Experimental results demonstrate that FG-ExPO consistently outperforms vanilla GRPO. It delivers an absolute improvement of 13.34 on the AIME 2025 pass@32 metric, rising from 63.33 percent to 76.67 percent, and obtains an average pass@32 gain of 2.66 on the 8B model. The substantially larger performance gains observed on pass@32 compared to pass@1 verify that FG-ExPO enlarges the model&#39;s effective exploration space under a fixed inference budget.
arXiv arXiv cs.LG · 6 小时前 · 相关度 82% 热度★★☆☆☆
67
OmniRefine: Alignment-Aware Cooperative Compression for Efficient Omnimodal Large Language Models
OmniRefine:面向高效全模态大语言模型的对齐感知协同压缩
推理部署学术论文

本文针对全模态大语言模型(Omni-LLM)推理成本高的问题,提出一种无需训练的音频-视频token压缩框架OmniRefine。该框架分为两个阶段:首先通过帧-音频相似度和动态规划将原生分块细化为跨模态对齐的压缩单元,然后在每个单元内进行模态感知的协同压缩以去除冗余并保留关键证据。在WorldSense基准上,仅保留44% token时准确率仍达46.7%,接近全量token基线,实现了更好的效率-性能权衡。

arXiv:2605.12056v1 Announce Type: new Abstract: Omnimodal large language models (Omni-LLMs) show strong capability in audio-video understanding, but their practical deployment remains limited by high inference cost of long video streams and dense audio sequences. Despite recent progress, existing compression methods for Omni-LLMs typically rely on fixed or native compression units, which can disrupt cross-modal correspondence and the complementary information required for audio-video reasoning, making it difficult to improve inference efficiency while stably preserving performance. To address this, we propose OmniRefine, a training-free two-stage framework for efficient audio-visual token compression in Omni-LLMs. First, Correspondence-Preserving Chunk Refinement refines native chunk boundaries into cross-modally aligned compression units through frame-audio similarity and dynamic programming. Second, Modality-Aware Cooperative Compression jointly compresses video and audio tokens within each refined unit to reduce redundancy while preserving critical evidence. Extensive experiments show that OmniRefine achieves a better efficiency-performance trade-off than strong baselines and maintains stable performance under lower compression ratios. On WorldSense, it still reaches 46.7% accuracy at a 44% token retention ratio, nearly matching the full-token baseline. The code and interface will be released to facilitate further research.
arXiv arXiv cs.AI · 6 小时前 · 相关度 82% 热度★★☆☆☆
68
Spurious Correlation Learning in Preference Optimization: Mechanisms, Consequences, and Mitigation via Tie Training
偏好优化中的虚假相关性学习:机理、后果以及通过平局训练的缓解方法
训练微调学术论文

本文针对直接偏好优化(DPO)等方法引发的模型依赖虚假特征(如谄媚、长度偏倚)问题,从理论上揭示了虚假学习的两条通道:均值虚假偏倚和因果-虚假相关性泄漏。作者证明这种依赖导致模型对分布偏移存在不可消除的脆弱性,并提出了基于平局数据增强的正则化策略——平局训练(tie training)。该方法能选择性抑制虚假学习而不损害因果学习,在对数线性模型和神经网络、大语言模型上均得到验证。

arXiv:2605.11134v1 Announce Type: new Abstract: Preference learning methods such as Direct Preference Optimization (DPO) are known to induce reliance on spurious correlations, leading to sycophancy and length bias in today&#39;s language models and potentially severe goal misgeneralization in future systems. In this work, we provide a unified theoretical analysis of this phenomenon, characterizing the mechanisms of spurious learning, its consequences on deployment, and a provable mitigation strategy. Focusing on log-linear policies, we show that standard preference-learning objectives induce reliance on spurious features at the population level through two channels: mean spurious bias and causal--spurious correlation leakage. We then show that this reliance creates an irreducible vulnerability to distribution shift: more data from the same training distribution fails to reduce the model&#39;s dependence on spurious features. To address this, we propose tie training, a data augmentation strategy using ties (equal-utility preference pairs) to introduce data-driven regularization. We demonstrate that this approach selectively reduces spurious learning without degrading causal learning. Finally, we validate our theory on log-linear models and provide empirical evidence that both the spurious learning mechanisms and the benefits of tie training persist for neural networks and large language models.
arXiv arXiv cs.LG · 6 小时前 · 相关度 82% 热度★★☆☆☆
69
When Reasoning Traces Become Performative: Step-Level Evidence that Chain-of-Thought Is an Imperfect Oversight Channel
当推理痕迹变成表演:思维链是不完美监督渠道的步骤级证据
学术论文基础大模型

本文通过步级检测框架,在9个模型和7个推理基准上发现,思维链(CoT)的显式推理与模型隐含的答案承诺仅在61.9%的步骤上对齐。主要不匹配类型是“虚构延续”:模型答案已稳定,但后续仍生成看似在推理的文本,而这些后承诺文本对最终答案没有实质贡献。研究还表明,CoT实际效用越高的场景,其时间上的忠实性往往越差,提示CoT虽可提升能力,但作为监督或审计渠道存在不可靠性。

arXiv:2605.11746v1 Announce Type: new Abstract: Chain-of-thought (CoT) traces are increasingly used both to improve language model capability and to audit model behavior, implicitly assuming that the visible trace remains synchronized with the computation that determines the answer. We test this assumption with a step-level Detect-Classify-Compare framework built around an answer-commitment proxy that is cross-validated with Patchscopes, tuned-lens probes, and causal direction ablation. Across nine models and seven reasoning benchmarks, latent commitment and explicit answer arrival align on only 61.9% of steps on average. The dominant mismatch pattern is confabulated continuation: 58.0% of detected mismatch events occur after the answer-commitment proxy has already stabilized while the trace continues producing deliberative-looking text, and a vacuousness analysis shows that the committed answer does not change during these steps. In architecture-matched Qwen2.5/DeepSeek-R1-Distill comparisons, the reasoning pipeline changes failure composition more than aggregate alignment, most clearly at 32B where confabulated steps decrease as contradictory states increase. Lower step-level alignment is also associated with larger CoT utility, suggesting that the settings that benefit most from CoT are often the least temporally faithful. Paired truncation and a complementary donor-corruption test further indicate that much post-commitment text is not load-bearing for the final answer. These findings suggest that CoT can remain useful while still being an unreliable report of when the answer was formed.
arXiv arXiv cs.AI · 6 小时前 · 相关度 82% 热度★★☆☆☆
70
Toward Stable Value Alignment: Introducing Independent Modules for Consistent Value Guidance
迈向稳定价值对齐:引入独立模块实现一致的价值引导
训练微调基础大模型

本文提出稳定价值引导Transformer(SVGT),通过在LLM中增加独立价值模块,将价值表示与主模型分离,避免动态残差流中价值属性的脆弱性。该模块通过桥接令牌显式引导生成轨迹,在多个骨干模型和安全基准上,有害分数降低超过70%且保持生成流畅性,验证了架构级价值建模的有效性。

arXiv:2605.11712v1 Announce Type: new Abstract: Aligning large language models (LLMs) with human values typically relies on post-training or inference-time steering that directly manipulates the backbone&#39;s parameters or representation space. However, a critical gap exists: the model&#39;s residual stream is highly dynamic, in which values exist as fragile, low-dimensional properties, inherently incompatible with the stability required for consistent value expression. In this paper, we propose the Stable Value Guidance Transformer (SVGT), which addresses this gap through an independent value module incorporating two key designs: (1) independent value modeling, maintaining normative representations in a dedicated value space isolated from the backbone, and (2) explicit behavioral guidance, transducing these stable signals into learnable latent Bridge Tokens. These tokens serve as dynamic value anchors to explicitly steer the generative trajectory, ensuring robust adherence across diverse contexts without disrupting the backbone&#39;s internal representations. Experiments across multiple backbones and safety benchmarks show that SVGT generally reduces harmful scores by over 70% while maintaining generation fluency, demonstrating the efficacy of architecturally grounded value modeling. Our code is available at https://github.com/Clervils/SVGT.git.
arXiv arXiv cs.AI · 6 小时前 · 相关度 82% 热度★★☆☆☆
71
Efficient LLM Reasoning via Variational Posterior Guidance with Efficiency Awareness
基于效率感知变分后验引导的高效大语言模型推理
推理部署性能优化学术论文

该论文针对大语言模型链式推理中的“过度思考”现象,提出VPG-EA框架,将高效推理形式化为变分推断问题,并引入效率感知的证据下界作为理论基础。框架采用参数共享的双流架构,通过交叉视图评估过滤伪高效路径,再将后验分布中的高效模式单向蒸馏到先验策略。在DeepSeek-R1-Distill-Qwen-1.5B和7B上的实验表明,相比最强基线,综合效率指标ε³分别提升了8.73%和12.37%,有效缓解了高质量样本稀疏的采样瓶颈。

arXiv:2605.11019v1 Announce Type: new Abstract: Although large language models rely on chain-of-thought for complex reasoning, the overthinking phenomenon severely degrades inference efficiency. Existing reinforcement learning methods compress reasoning chains by designing elaborate reward functions, which renders high-quality samples extremely sparse in the exploration space and creates a sampling bottleneck for the prior policy. Inspired by cognitive science, we theoretically prove that a posterior distribution guided by reference answers achieves higher expected utility than the prior distribution, thus capable of breaking through the sampling bottleneck of high-quality samples. However, the posterior distribution is unavailable during inference. To this end, we formalize efficient reasoning as a variational inference problem and introduce an efficiency-aware evidence lower bound as the theoretical foundation. Based on this, we propose the VPG-EA framework. It adopts a parameter-shared dual-stream architecture to instantiate both the posterior distribution and the prior policy; after filtering out pseudo-efficient paths via cross-view evaluation, it unidirectionally transfers the posterior&#39;s efficient patterns to the prior policy through variational distillation. Experiments on DeepSeek-R1-Distill-Qwen-1.5B and 7B scales demonstrate that VPG-EA improves the comprehensive efficiency metric epsilon cubed by 8.73% and 12.37% over the strongest baselines on each model size, respectively.
arXiv arXiv cs.LG · 6 小时前 · 相关度 82% 热度★★☆☆☆
72
LoopUS: Recasting Pretrained LLMs into Looped Latent Refinement Models
LoopUS:将预训练大语言模型重塑为循环潜变量精细优化模型
基础大模型推理部署

论文提出LoopUS后训练框架,将标准LLM无破坏地转换为循环架构,包含编码器、循环推理块和解码器。关键技术包括基于表示动态的块分解、抑制隐状态漂移的输入自适应门控、面向长循环的随机深度监督,以及实现自适应早停的置信度头部。该方法稳定了循环过程中的表示坍缩问题,在无需额外生成轨迹或从头训练的情况下提升推理性能。

arXiv:2605.11011v1 Announce Type: new Abstract: Looped computation shows promise in improving the reasoning-oriented performance of LLMs by scaling test-time compute. However, existing approaches typically require either training recurrent models from scratch or applying disruptive retrofits, which involve substantial computational costs and may compromise pretrained capabilities. To address these limitations, we introduce \textbf{Looped Depth Up-Scaling} (LoopUS), a post-training framework that converts a standard pretrained LLM into a looped architecture. As a key technical contribution, LoopUS recasts the pretrained LLM into an encoder, a looped reasoning block, and a decoder. It operationalizes this latent-refinement architecture through four core components: (1) block decomposition, guided by staged representation dynamics; (2) an input-dependent selective gate to mitigate hidden-state drift; (3) random deep supervision for memory-efficient learning over long recursive horizons; and (4) a confidence head for adaptive early exiting. Collectively, these mechanisms transform a standard non-looped model into a looped form while stabilizing it against both computational bottlenecks and representation collapse. Through stable latent looping, LoopUS improves reasoning-oriented performance without extending the generated traces or requiring recurrent training from scratch. For more details, see https://thrillcrazyer.github.io/LoopUS
arXiv arXiv cs.LG · 6 小时前 · 相关度 82% 热度★★☆☆☆
73
Nice Fold or Hero Call: Learning Budget-Efficient Thinking for Adaptive Reasoning
弃牌还是英雄跟注:学习预算高效思考以实现自适应推理
推理部署训练微调学术论文

该论文提出 Budget-Efficient Thinking (BET) 框架,将自适应推理建模为不确定性下的计算投资,通过行为冷启动与群组相对策略优化 (GRPO) 结合投资成本感知奖励,使模型学会三种行为:对易解问题简短求解 (short solve)、对无望问题及早弃牌 (nice fold)、对困难但可解问题保留足够算力 (hero call)。在7个基准和3个基座模型上平均削减约55%的推理 tokens 且实现整体性能提升,并零样本迁移至科学问答和逻辑推理任务。

arXiv:2605.11625v1 Announce Type: new Abstract: Large reasoning models (LRMs) improve problem solving through extended reasoning, but often misallocate test-time compute. Existing efficiency methods reduce cost by compressing reasoning traces or conditioning budget on perceived difficulty, yet largely overlook solvability. As a result, they may spend large budgets on queries beyond the model&#39;s capability while compressing hard-but-solvable queries that require deeper reasoning. In this work, we formulate adaptive reasoning as a computational investment under uncertainty, where budget should follow the expected return of reasoning rather than perceived difficulty alone. To instantiate this principle, we propose Budget-Efficient Thinking (BET), a two-stage framework that combines behavioral cold-start with GRPO under an investment-cost-aware reward. By aligning solve-or-fold decisions with rollout-derived solvability, BET learns three behaviors: (1) short solve, answering easy queries concisely; (2) nice fold, abstaining early when continued reasoning has near-zero expected return; and (3) hero call, preserving sufficient compute for hard-but-solvable queries. Across seven benchmarks and three base models, BET reduces reasoning tokens by ~55% on average while achieving overall performance improvements, and transfers zero-shot from mathematical reasoning to scientific QA and logical reasoning with comparable efficiency gains.
arXiv arXiv cs.AI · 6 小时前 · 相关度 82% 热度★★☆☆☆
74
AutoLLMResearch: Training Research Agents for Automating LLM Experiment Configuration -- Learning from Cheap, Optimizing Expensive
AutoLLMResearch:训练研究智能体实现 LLM 实验配置自动化——从低成本学习,优化高成本
开发工具训练微调

本文提出 AutoLLMResearch 框架,用于自动化高成本大语言模型(LLM)实验的配置,涵盖架构设计与超参数调优。框架包含两个核心组件:一个多保真度实验环境 LLMConfig-Gym,覆盖四项关键 LLM 任务并积累超过一百万 GPU 小时的可验证实验结果;以及将配置研究建模为长时程马尔可夫决策过程的训练流程,激励智能体进行跨保真度外推推理。通过在留存实验上的广泛评估,该框架在有效性、泛化性和可解释性方面显著优于多种强基线,有望成为减少计算资源浪费、提升 LLM 实验效率的通用解决方案。

arXiv:2605.11518v1 Announce Type: new Abstract: Effectively configuring scalable large language model (LLM) experiments, spanning architecture design, hyperparameter tuning, and beyond, is crucial for advancing LLM research, as poor configuration choices can waste substantial computational resources and prevent models from realizing their full potential. Prior automated methods are designed for low-cost settings where repeated trial and error is feasible, but scalable LLM experiments are too expensive for such extensive iteration. To our knowledge, no work has addressed the automation of high-cost LLM experiment configurations, leaving this problem labor-intensive and dependent on expert intuition. Motivated by this gap, we propose AutoLLMResearch, an agentic framework that mimics how human researchers learn generalizable principles from low-fidelity experiments and extrapolate to efficiently identify promising configurations in expensive LLM settings. The core challenge is how to enable an agent to learn, through interaction with a multi-fidelity experimental environment that captures the structure of the LLM configuration landscape. To achieve this, we propose a systematic framework with two key components: 1) LLMConfig-Gym, a multi-fidelity environment encompassing four critical LLM experiment tasks, supported by over one million GPU hours of verifiable experiment outcomes; 2) A structured training pipeline that formulates configuration research as a long-horizon Markov Decision Process and accordingly incentivizes cross-fidelity extrapolation reasoning. Extensive evaluation against diverse strong baselines on held-out experiments demonstrates the effectiveness, generalization, and interpretability of our framework, supporting its potential as a practical and general solution for scalable real-world LLM experiment automation.
arXiv arXiv cs.AI · 6 小时前 · 相关度 82% 热度★★☆☆☆
75
FibQuant: Universal Vector Quantization for Random-Access KV-Cache Compression
FibQuant:面向随机存取KV缓存压缩的通用向量量化
推理部署学术论文

针对长上下文推理中KV缓存导致的内存带宽瓶颈,本文提出了一种通用固定速率向量量化方法FibQuant。它保留了归一化-旋转-存储的接口,但用匹配球面Beta分布的径向-角向码本替代了标量量化表。该码本结合了Beta分位数半径、Fibonacci/Roberts-Kronecker准均匀方向与多起点Lloyd-Max精炼,在相同码率下严格优于标量产品量化。在GPT-2 KV缓存上,FibQuant能以5倍压缩达到0.99注意力余弦相似度,34倍压缩仍保持0.95;在TinyLlama-1.1B端到端推理中,4倍压缩下困惑度仅比fp16高0.10,8倍压缩时比标量TurboQuant低3.6倍困惑度,为随机存取KV缓存压缩提供了新的记忆-保真度边界。

arXiv:2605.11478v1 Announce Type: new Abstract: Long-context inference is increasingly a memory-traffic problem. The culprit is the key--value (KV) cache: it grows with context length, batch size, layers, and heads, and it is read at every decoding step. Rotation-based scalar codecs meet this systems constraint by storing a norm, applying a shared random rotation, and quantizing one coordinate at a time. They are universal and random-access, but they discard the geometry created by the normalization step. After a Haar rotation, a block of $k$ consecutive coordinates is not a product source; it is a spherical-Beta source on the unit ball. We introduce \textsc{FibQuant}, a universal fixed-rate vector quantizer that keeps the same normalize--rotate--store interface while replacing scalar tables by a shared radial--angular codebook matched to this canonical source. The codebook combines Beta-quantile radii, Fibonacci\,/\,Roberts--Kronecker quasi-uniform directions, and multi-restart Lloyd--Max refinement. We prove that the resulting vector code strictly improves on its scalar product specialization at matched rate, with a high-rate gain that separates into a cell-shaping factor and a density-matching factor. The same construction gives a dense rate axis, including fractional-bit and sub-one-bit operating points, without calibration or variable-length addresses. On GPT-2 small KV caches, \textsc{FibQuant} traces a memory--fidelity frontier from $5\times$ compression at $0.99$ attention cosine similarity to $34\times$ at $0.95$. End-to-end on TinyLlama-1.1B, it is within $0.10$ perplexity of fp16 at $4\times$ compression and has $3.6\times$ lower perplexity than scalar \textsc{TurboQuant} at $b = 2$ ($8\times$ compression), where scalar random-access quantization begins to fail.
arXiv arXiv cs.AI · 6 小时前 · 相关度 82% 热度★★☆☆☆
76
PRISM: A Geometric Risk Bound that Decomposes Drift into Scale, Shape, and Head
PRISM:一种将漂移分解为尺度、形状与头部的几何风险界
学术论文训练微调推理部署

PRISM利用大语言模型的线性输出头和骨干网络的近似等距结构,推导出交叉熵风险差距的封闭形式上界,并将表示漂移分解为尺度、形状和头部散度三个独立可测轴,对应低比特量化、LoRA遗忘和GGUF k-量化等不同失效模式。该方法不仅能对训练后变体进行排序,还指出主导轴以提示修复方向。在多个模型家族和基准上,对量化变体排序的平均Spearman相关系数为0.820,对LoRA遗忘变体为0.831;轴引导的形状正则器在缓解下游遗忘方面整体优于经验回放。

arXiv:2605.11608v1 Announce Type: cross Abstract: Comparing post-training LLM variants, such as quantized, LoRA-adapted, and distilled models, requires a diagnostic that identifies how a variant has drifted, not only whether it has degraded. Existing similarity scores such as CKA and SVCCA can flag degradation, but they do not directly link representation drift to risk or mechanism. We propose PRISM, Proxy Risk Inference via Structural Mapping, which exploits the linear output head of LLMs and the empirically near-isometric structure of their backbones to derive a closed-form upper bound on the cross-entropy risk gap between a target model and a post-training variant. The bound is calibrated for variant ranking and decomposes drift into three independently measurable axes: scale mismatch, shape mismatch, and head divergence. Each axis corresponds to a distinct failure mode, including shape distortion under low-bit quantization, scale separability under LoRA forgetting, and head divergence under GGUF k-quantization. As a result, the dominant axis suggests a remediation direction rather than merely raising a degradation flag. Because the shape term is differentiable, the same geometry can also serve as a training-time regularizer against catastrophic forgetting. Across two model families and five benchmarks, PRISM ranks variants with mean Spearman correlations of 0.820 for post-training quantization and 0.831 for LoRA forgetting, and its axis-guided shape regularizer outperforms experience replay in aggregate at mitigating downstream forgetting.
arXiv arXiv cs.AI · 6 小时前 · 相关度 82% 热度★★☆☆☆
77
Learning, Fast and Slow: Towards LLMs That Adapt Continually
快慢学习:迈向持续适应的大语言模型
训练微调基础大模型

本文提出了一种快慢学习框架(FST),将模型参数视为慢权重,优化的上下文视为快权重,使LLM能通过文本反馈快速吸收任务特定信息,同时保持基础模型的通用推理能力。FST在推理任务上比纯参数更新(RL)的样本效率高3倍,且性能渐近线更高,与基础模型的KL散度减少70%,显著减轻了灾难性遗忘。在持续学习场景中,FST能持续获取新任务,而纯参数RL会停滞。

arXiv:2605.12484v1 Announce Type: new Abstract: Large language models (LLMs) are trained for downstream tasks by updating their parameters (e.g., via RL). However, updating parameters forces them to absorb task-specific information, which can result in catastrophic forgetting and loss of plasticity. In contrast, in-context learning with fixed LLM parameters can cheaply and rapidly adapt to task-specific requirements (e.g., prompt optimization), but cannot by itself typically match the performance gains available through updating LLM parameters. There is no good reason for restricting learning to being in-context or in-weights. Moreover, humans also likely learn at different time scales (e.g., System 1 vs 2). To this end, we introduce a fast-slow learning framework for LLMs, with model parameters as &#34;slow&#34; weights and optimized context as &#34;fast&#34; weights. These fast &#34;weights&#34; can learn from textual feedback to absorb the task-specific information, while allowing slow weights to stay closer to the base model and persist general reasoning behaviors. Fast-Slow Training (FST) is up to 3x more sample-efficient than only slow learning (RL) across reasoning tasks, while consistently reaching a higher performance asymptote. Moreover, FST-trained models remain closer to the base LLM (up to 70% less KL divergence), resulting in less catastrophic forgetting than RL-training. This reduced drift also preserves plasticity: after training on one task, FST trained models adapt more effectively to a subsequent task than parameter-only trained models. In continual learning scenarios, where task domains change on the fly, FST continues to acquire each new task while parameter-only RL stalls.
arXiv arXiv cs.LG · 6 小时前 · 相关度 82% 热度★★☆☆☆
78
TCP-SSM: Efficient Vision State Space Models with Token-Conditioned Poles
TCP-SSM:基于标记条件极点的高效视觉状态空间模型
基础大模型学术论文

本文提出了一种名为TCP-SSM的结构化选择性状态空间模型,通过标记条件极点(Token-Conditioned Poles)使递归动态显式化。该模型利用实极点建模单调衰减或符号交替衰减,采用复共轭极点捕获阻尼振荡响应,并通过有界半径和角度调制将共享基础极点转换为依赖标记的极点,在保持稳定性的同时自适应视觉标记的记忆行为。结合分组极点共享与轻量级低秩输入通路,TCP-SSM在图像分类、语义分割和目标检测任务中,将Vision Mamba类模型的计算复杂度降低高达44%,同时保持或超越基线精度。

arXiv:2605.11563v1 Announce Type: cross Abstract: State Space Models (SSMs) have emerged as a compelling alternative to attention models for long-range vision tasks, offering input-dependent recurrence with linear complexity. However, most efficient SSM variants reduce computation cost by modifying scan routes, resolutions, or traversal patterns, while largely leaving the recurrent dynamics implicit. Consequently, the model&#39;s state-dependent memory behavior is difficult to control, particularly in compact backbones where long scan paths can exceed the effective memory horizon. We propose Token-Conditioned Poles SSM (TCP-SSM), a structured selective SSM framework that improves efficiency while making recurrence dynamics explicit and interpretable through stable poles. TCP-SSM builds each scan operator with 1) real poles that model monotone or sign-alternating decay, and 2) complex-conjugate poles that capture damped oscillatory responses. Using bounded radius and angle modulation, TCP-SSM converts shared base poles into token-dependent poles, allowing each scan step to adapt its memory behavior to the current visual token while preserving pole stability. For practical scalability, we integrate grouped pole sharing with a lightweight low-rank input pathway, yielding an efficient scan operator that preserves linear-time scan complexity. Across image classification, semantic segmentation, and object detection, TCP-SSM reduces SSM computation complexity up to 44% in Vision Mamba-style models while maintaining or surpassing baseline accuracy.
arXiv arXiv cs.AI · 6 小时前 · 相关度 82% 热度★★☆☆☆
79
Detecting overfitting in Neural Networks during long-horizon grokking using Random Matrix Theory
利用随机矩阵理论检测神经网络在长时程“顿悟”中的过拟合
训练微调学术论文基础大模型

论文提出一种基于随机矩阵理论的新方法,无需训练或测试数据即可检测深度学习模型的过拟合。该方法对每层权重矩阵进行逐元素随机化,拟合随机化经验谱分布与 Marchenko-Pastur 分布,识别出违反自平均性的大异常值,称为“相关性陷阱”(Correlation Traps)。在长时程“顿悟”(grokking)的“反顿悟”(anti-grokking)阶段,这些陷阱的数量和规模随测试精度下降而增长,可区分良性与有害过拟合。作者发现,一些基础规模的大语言模型同样存在此类相关性陷阱,指示潜在的有害过拟合。

arXiv:2605.12394v1 Announce Type: new Abstract: Training Neural Networks (NNs) without overfitting is difficult; detecting that overfitting is difficult as well. We present a novel Random Matrix Theory method that detects the onset of overfitting in deep learning models without access to train or test data. For each model layer, we randomize each weight matrix element-wise, $\mathbf{W} \to \mathbf{W}_{\mathrm{rand}}$, fit the randomized empirical spectral distribution with a Marchenko-Pastur distribution, and identify large outliers that violate self-averaging. We call these outliers Correlation Traps. During the onset of overfitting, which we call the &#34;anti-grokking&#34; phase in long-horizon grokking, Correlation Traps form and grow in number and scale as test accuracy decreases while train accuracy remains high. Traps may be benign or may harm generalization; we provide an empirical approach to distinguish between them by passing random data through the trained model and evaluating the JS divergence of output logits. Our findings show that anti-grokking is an additional grokking phase with high train accuracy and decreasing test accuracy, structurally distinct from pre-grokking through its Correlation Traps. More broadly, we find that some foundation-scale LLMs exhibit the same Correlation Traps, indicating potentially harmful overfitting.
arXiv arXiv cs.LG · 6 小时前 · 相关度 82% 热度★★☆☆☆
80
Trust the Batch, On- or Off-Policy: Adaptive Policy Optimization for RL Post-Training
相信批量,在线或离线策略:用于RL后训练的自适应策略优化
训练微调学术论文

本文针对大模型强化学习后训练中因训练与推理系统差异(如数值精度、采样细节)导致的脆弱性,提出一种批量自适应的策略优化方法。通过基于策略比率有效样本量的归一化统计量,自动调节信任区域与离策略正则化强度,无需预设超参数即可平衡更新稳定性与学习信号。该方法在多种实验设置下匹配或超越了精心调参的基线,且不引入新的目标超参数,并已开源。

arXiv:2605.12380v1 Announce Type: new Abstract: Reinforcement learning is structurally harder than supervised learning because the policy changes the data distribution it learns from. The resulting fragility is especially visible in large-model training, where the training and rollout systems differ in numerical precision, sampling, and other implementation details. Existing methods manage this fragility by adding hyper-parameters to the training objective, which makes the algorithm more sensitive to its configuration and requires retuning whenever the task, model scale, or distribution mismatch changes. This fragility traces to two concerns that current objectives entangle through hyper-parameters set before training begins: a trust-region concern, that updates should not move the policy too far from its current value, and an off-policy concern, that data from older or different behavior policies should influence the update only to the extent that it remains reliable. Neither concern is a constant to set in advance, and their severity is reflected in the policy-ratio distribution of the current batch. We present a simple yet effective batch-adaptive objective that replaces fixed clipping with the normalized effective sample size of the policy ratios. The same statistic caps the score-function weight and sets the strength of an off-policy regularizer, so the update stays close to the usual on-policy score-function update when ratios are nearly uniform, and tightens automatically when stale or mismatched data cause ratio concentration, while retaining a nonzero learning signal on high-ratio tokens. Experiments across a wide range of settings show that our method matches or exceeds tuned baselines, introducing no new objective hyper-parameters and removing several existing ones. The code is available at https://github.com/FeynRL-project/FeynRL.
arXiv arXiv cs.LG · 6 小时前 · 相关度 82% 热度★★☆☆☆
81
MuonQ: Enhancing Low-Bit Muon Quantization via Directional Fidelity Optimization
MuonQ: 通过方向保真度优化增强低比特Muon量化
训练微调性能优化学术论文

针对Muon优化器状态对量化误差敏感的问题,提出方向保真度优化的低比特训练框架MuonQ。采用预量化归一化使每步引入等量量化误差,通过结构分解分别量化主奇异分量以保持方向信息,并利用μ律压扩量化提高密集区域的区分度,从而稳定实现4比特量化。在GPT和LLaMA类模型的预训练实验中,4比特MuonQ与全精度Muon达到相当的训练损失和下游任务精度,同时优化器状态内存占用减少7.3倍。

arXiv:2605.11396v1 Announce Type: new Abstract: The Muon optimizer has emerged as a compelling alternative to Adam for training large language models, achieving remarkable computational savings through gradient orthogonalization. However, Muon&#39;s optimizer state is more sensitive to quantization errors: because the orthogonalization discards the magnitudes of singular values and retains only directional information, even small quantization errors in singular vector directions are amplified in the update. In this work, we propose MuonQ, a low-bit Muon training framework built on the principle of directional fidelity optimization. First, we apply a pre-quantization normalization so that each step introduces quantization errors of the same magnitude, preventing the accumulated error from developing a preferred direction. Second, we introduce a structural decomposition that separately quantizes the dominant singular components via power iteration, ensuring that quantization errors perturb only singular value magnitudes rather than rotating singular vector directions. Third, we adopt $\mu$-law companding quantization to allocate higher resolution to densely packed momentum values, shifting the quantization objective from outlier preservation to dense-region distinguishability. Together, these techniques enable stable 4-bit quantization of Muon&#39;s optimizer states. Pre-training experiments on GPT-style and LLaMA-style models demonstrate that MuonQ at 4-bit precision closely matches full-precision Muon in both training loss and downstream task accuracy, while reducing optimizer state memory by up to 7.3 $\times$. Our code is available at https://github.com/YupengSu/MuonQ.
arXiv arXiv cs.LG · 6 小时前 · 相关度 82% 热度★★☆☆☆
82
Targeted Neuron Modulation via Contrastive Pair Search
通过对比对搜索实现定向神经元调控
训练微调学术论文

本文提出对比神经元归因(CNA)方法,仅通过前向传播即可识别出MLP层中0.1%的神经元,这些神经元的激活可以最有效地区分有害与良性提示。在指令微调模型中,消融这些神经元回路可在标准越狱基准上将拒绝率降低50%以上,同时在不同干预强度下保持文本流畅性。作者在Llama和Qwen系列模型(1B到72B参数)上对比基座模型与指令模型,发现基座模型具备类似的后期层辨别结构,但调控这些神经元仅引起内容偏移,而不触发行为改变,表明对齐微调将原有的辨别结构转化为一个稀疏的、可靶向的拒绝门控。

arXiv:2605.12290v1 Announce Type: new Abstract: Language models are instruction-tuned to refuse harmful requests, but the mechanisms underlying this behavior remain poorly understood. Popular steering methods operate on the residual stream and degrade output coherence at high intervention strengths, limiting their practical use. We introduce contrastive neuron attribution (CNA), which identifies the 0.1% of MLP neurons whose activations most distinguish harmful from benign prompts, requiring only forward passes with no gradients or auxiliary training. In instruct models, ablating the discovered circuit reduces refusal rates by over 50% on a standard jailbreak benchmark while preserving fluency and non-degeneracy across all steering strengths. Applying CNA to matched base and instruct models across Llama and Qwen architectures (from 1B to 72B parameters), we find that base models contain similar late-layer discrimination structures but steering these neurons produces only content shifts, not behavioral change. These results demonstrate that neuron-level intervention enables reliable behavioral steering without the quality tradeoffs of residual-stream methods. More broadly, our findings suggest that alignment fine-tuning transforms pre-existing discrimination structure into a sparse, targetable refusal gate.
arXiv arXiv cs.LG · 6 小时前 · 相关度 82% 热度★★☆☆☆
83
SOAR: Scale Optimization for Accurate Reconstruction in NVFP4 Quantization
SOAR: NVFP4 量化中面向精确重建的尺度优化框架
推理部署学术论文

本文针对大语言模型的NVFP4 4-bit微缩格式量化,提出后训练量化框架SOAR,解决现有方法因尺度选择僵化和量化-反量化尺度耦合处理导致的精度不足问题。核心创新包括:闭式联合尺度优化(CJSO),通过最小化重建误差从解析解中同时优化全局与块级尺度;以及解耦尺度搜索(DSS),将高精度量化尺度与其受约束的反量化对应解耦,并通过离散搜索缓解尺度量化带来的精度损失。在多个LLM上的实验表明,该方法在不增加硬件开销、相同内存占用下,持续超越现有NVFP4量化基线,实现更高精度。

arXiv:2605.12245v1 Announce Type: new Abstract: NVFP4 has recently emerged as an efficient 4-bit microscaling format for large language models (LLMs), offering superior numerical fidelity with native hardware support. However, existing methods often yield suboptimal performance due to inflexible scale selection and the coupled treatment of quantization and dequantization scales. To address these issues, we propose Scale Optimization for Accurate Reconstruction (SOAR), a novel post-training quantization framework that improves the accuracy of NVFP4 quantization. At its core, SOAR features Closed-form Joint Scale Optimization (CJSO), which jointly optimizes global and block-wise scales via analytical solutions derived from reconstruction error minimization. Furthermore, it incorporates Decoupled Scale Search (DSS). DSS decouples the high-precision quantization scale from its constrained dequantization counterpart, and performs discrete search to mitigate precision loss from scale quantization. Extensive experiments across multiple LLMs show that our method consistently outperforms existing NVFP4 quantization baselines, achieving superior accuracy under the same memory footprint with no additional hardware overhead. The code and models will be available at https://github.com/steven-bao1/SOAR.
arXiv arXiv cs.LG · 6 小时前 · 相关度 82% 热度★★☆☆☆
84
ReAD: Reinforcement-Guided Capability Distillation for Large Language Models
ReAD:面向大语言模型的强化引导能力蒸馏
训练微调性能优化学术论文

本文针对大语言模型的能力蒸馏提出 ReAD 框架,解决现有方法忽视能力间相互依赖的问题。研究发现蒸馏过程中存在系统性的跨能力转移,且额外计算预算带来的任务收益有限。ReAD 通过推断任务核心能力、自适应生成监督信号,并利用不确定性感知的上下文赌博机动态分配蒸馏资源,从而在固定标记预算下提升下游任务效用并减少有害溢出。

arXiv:2605.11290v1 Announce Type: cross Abstract: Capability distillation applies knowledge distillation to selected model capabilities, aiming to compress a large language model (LLM) into a smaller one while preserving the abilities needed for a downstream task. However, most existing methods treat capabilities as independent training targets and overlook how improving one capability can reshape the student&#39;s broader capability profile, especially when multiple abilities jointly determine task success. We study capability distillation under a fixed token budget and identify two consistent patterns: distillation induces systematic, budget-dependent cross-capability transfer, and additional budget often brings limited task-relevant gains while sometimes degrading other useful abilities. Building on these insights, we propose ReAD, a Reinforcement-guided cApability Distillation framework that explicitly accounts for capability interdependence. ReAD first infers task-essential capabilities, then generates capability-targeted supervision on the fly, and finally uses an uncertainty-aware contextual bandit to adaptively allocate the distillation budget based on expected utility gains. Extensive experiments show that ReAD improves downstream utility under the same token budget while reducing harmful spillover and wasted distillation effort compared to strong baselines. Our code is publicly available at https://github.com/LabRAI/ReAD.
arXiv arXiv cs.AI · 6 小时前 · 相关度 82% 热度★★☆☆☆
85
Block-R1: Rethinking the Role of Block Size in Multi-domain Reinforcement Learning for Diffusion Large Language Models
Block-R1:重新思考多领域强化学习中块大小对扩散大语言模型的作用
训练微调学术论文

该论文针对扩散大语言模型(dLLMs)在多领域强化学习后训练中的块大小冲突问题展开研究。作者提出了一个定量衡量领域块大小冲突的指标,并构建了包含样本级最佳训练块大小的数据集 Block-R1-41K。同时,发布了用于灵活RL后训练的基准 Block-R1,涵盖13个数据集、7种RL算法和多种dLLM骨干网络,并给出一种简单而有效的跨域后训练方法。

arXiv:2605.11726v1 Announce Type: new Abstract: Recently, reinforcement learning (RL) has been widely applied during post-training for diffusion large language models (dLLMs) to enhance reasoning with block-wise semi-autoregressive generation. Block size has therefore become a vital factor in dLLMs, since it determines the parallel decoding granularity and affects the rollout trajectories during RL optimisation, e.g., GRPO. Instead of investigating the effect of block size during inference on individual domains, this paper studies block size from a domain conflict perspective for dLLM RL post-training in multi-domain scenarios. The main contributions are: (1) a formulation of domain block size conflict in multi-domain RL for dLLMs, which will largely affect the post-training effectiveness for rollout-based RL methods; (2) a novel dataset, Block-R1-41K is constructed with a best-improved training block size for each sample, which also induces a Block Size Conflict Score to quantitatively measure the domain conflict; (3) a new benchmark, Block-R1, for flexible RL post-training for dLLMs in both single and cross domain; and (4) a simple yet powerful cross-domain post-training method with sample-level best-improved training block sizes. Extensive experiments on 13 distinct datasets, 7 latest RL algorithms, and various different dLLM backbones are covered in Block-R1. The benchmark is open-sourced at https://github.com/YanJiangJerry/Block-R1, with the dataset released at https://huggingface.co/datasets/dLLM-R1/Block-R1-41K.
arXiv arXiv cs.LG · 6 小时前 · 相关度 82% 热度★★☆☆☆
86
RankQ: Offline-to-Online Reinforcement Learning via Self-Supervised Action Ranking
RankQ:基于自监督动作排序的离线到在线强化学习
训练微调学术论文

该论文提出RankQ算法,通过自监督多目标排序损失增强时序差分学习,迫使Q函数学习动作的相对偏好,从而缓解离线到在线强化学习中因价值高估导致的有害更新。在稀疏奖励D4RL基准上,RankQ达到或超越七种现有方法;在基于视觉的机器人任务中,将预训练视觉-语言-动作(VLA)模型离线到在线微调后的仿真成功率平均提升42.7%,真实世界叠方块任务成功率从43.1%提升至84.7%。该方法为大型多模态模型的机器人技能学习提供了有效的微调策略。

arXiv:2605.11151v1 Announce Type: new Abstract: Offline-to-online reinforcement learning (RL) improves sample efficiency by leveraging pre-collected datasets prior to online interaction. A key challenge, however, is learning an accurate critic in large state--action spaces with limited dataset coverage. To mitigate harmful updates from value overestimation, prior methods impose pessimism by down-weighting out-of-distribution (OOD) actions relative to dataset actions. While effective, this essentially acts as a behavior cloning anchor and can hinder downstream online policy improvement when dataset actions are suboptimal. We propose RankQ, an offline-to-online Q-learning objective that augments temporal-difference learning with a self-supervised multi-term ranking loss to enforce structured action ordering. By learning relative action preferences rather than uniformly penalizing unseen actions, RankQ shapes the Q-function such that action gradients are directed toward higher-quality behaviors. Across sparse reward D4RL benchmarks, RankQ achieves performance competitive with or superior to seven prior methods. In vision-based robot learning, RankQ enables effective offline-to-online fine-tuning of a pretrained vision-language-action (VLA) model in a low-data regime, achieving on average a 42.7% higher simulation success rate than the next best method. In a high-data setting, RankQ improves simulation performance by 13.7% over the next best method and achieves strong sim-to-real transfer, increasing real-world cube stacking success from 43.1% to 84.7% relative to the VLA&#39;s initial performance.
arXiv arXiv cs.AI · 6 小时前 · 相关度 80% 热度★★☆☆☆
87
Template-as-Ontology: Configurable Synthetic Data Infrastructure for Cross-Domain Manufacturing AI Validation
模板即本体:面向跨域制造业AI验证的可配置合成数据基础设施
学术论文

本文提出“模板即本体”原则,用单一Python配置模块同时作为制造模拟器规范和AI工具运行时模式,从构造层面保证对齐。设计了一条五层数据管道(模拟、PostgreSQL、CDC/Iceberg湖仓、星型模式和12个参数化AI工具),生成符合ISA-95/IEC 62264标准的因果一致数据,覆盖4个操作域共66种实体类型。在航空、制药、汽车等6个行业模板上验证了框架的通用性,校准实验确认KPI可控。一项控制幻觉实验(72次工具调用,Qwen3-32B)显示,本体约束参数将工具参数虚构率从43%降至0%(p<10⁻¹²),该效果为架构级别保证,适用于任意模型。

arXiv:2605.11259v1 Announce Type: new Abstract: LLarge language model (LLM)-based AI agents deployed in manufacturing environments require populated, schema-correct data for validation, yet production MES data is proprietary, privacy-encumbered, and vendor-specific. This paper introduces the Template-as-Ontology principle: a single Python configuration module (700-770 lines, 45 validated exports) serves simultaneously as the specification for a time-stepped manufacturing simulator and as the runtime domain schema for AI analytics tools, producing alignment by construction rather than integration. We formally define the domain template as a typed relational configuration schema and prove that structural alignment between simulation and tool layers is guaranteed by single-source consumption. A five-layer pipeline--simulation, PostgreSQL, CDC/Iceberg lakehouse, star schema, and 12 parameterized AI tools--generates causally coherent, MES-shaped data spanning 66 entity types across four operational domains mapped to ISA-95/IEC 62264. We validate the architecture with six industry templates (aerospace, pharma, automotive, electronics, beverages, warehousing) running on identical framework code. Calibration experiments (60 runs, 10 seeds per template) confirm parametric controllability: observed KPIs fall within configured ranges across all templates. A controlled hallucination experiment (72 tool invocations, Qwen3-32B) demonstrates that ontology-constrained parameters eliminate tool-parameter fabrication (0% constrained vs. 43% unconstrained hallucination rate for the evaluated model, Fisher&#39;s exact test p &lt; 10^-12); the 0% constrained rate is an architectural guarantee that holds for any model. The framework provides a reusable data layer for discrete manufacturing AI validation.
arXiv arXiv cs.AI · 6 小时前 · 相关度 80% 热度★★☆☆☆
88
Few-Shot Truly Benign DPO Attack for Jailbreaking LLMs
用于越狱大型语言模型的少样本完全良性DPO攻击
训练微调基础大模型

本文提出一种利用直接偏好优化(DPO)对大型语言模型进行越狱攻击的新方法。攻击仅使用10个完全无害的偏好对(包含无害提示、有用回答与拒绝回答),数据量极小且请求看似合法以减少过度拒绝,隐蔽性极高。然而,DPO直接优化模型偏好有用回答而非拒绝,广泛抑制了拒绝行为,将危害传递至微调数据外的有害提示。实验在OpenAI系列模型上取得高达81.73%的攻击成功率,且成本极低(最低0.1美元),揭示了当前安全对齐在偏好微调中的严重脆弱性。

arXiv:2605.10998v1 Announce Type: cross Abstract: Fine-tuning APIs make frontier LLMs easy to customize, but they can also weaken safety alignment during fine-tuning. While prior work shows that benign supervised fine-tuning (SFT) can reduce refusal behavior, deployed fine-tuning pipelines increasingly support preference-based objectives, whose safety risks remain less understood. We show that Direct Preference Optimization (DPO) introduces a stronger and harder-to-audit failure mode. We propose a truly benign DPO attack using only 10 harmless preference pairs, the minimum data scale accepted by OpenAI&#39;s fine-tuning service. Each pair contains a benign prompt, a normal helpful answer as the preferred response, and a refusal as the dispreferred response. Unlike prior benign fine-tuning attacks, our data exhibits no suspicious behavior: it is practically indistinguishable from the fine-tuning request of a legitimate user seeking to reduce over-refusal, making harmful intent almost impossible to infer from the request alone. Nevertheless, because DPO directly optimizes the model to prefer helpful answers over refusals, this seemingly benign objective broadly suppresses refusal behavior and transfers to harmful prompts outside the fine-tuning data. Across OpenAI models supporting DPO fine-tuning, our attack achieves attack success rates of 59.13% on GPT-4o, 70.20% on GPT-4.1, 54.80% on GPT-4.1-mini, and 81.73% on GPT-4.1-nano, at costs of only \$1.7, \$1.7, \$0.3, and \$0.1. Moreover, on open-weight models that do not impose minimum data requirements, we find that this effect can emerge from even a single benign preference pair.
arXiv arXiv cs.AI · 6 小时前 · 相关度 80% 热度★★☆☆☆
89
Fast MoE Inference via Predictive Prefetching and Expert Replication
通过预测性预取和专家复制实现MoE快速推理
推理部署性能优化学术论文

本文提出一种动态专家复制策略,用于加速MoE模型的推理。该方法预测即将过载的专家并进行复制,使多批token能同时在多个层上并发处理,从而提升并行度、减少GPU空闲时间。在Switch-base-128和Switch-base-256等大规模MoE模型上实验,实现了接近100%的GPU利用率,推理速度最高提升3倍,同时保持基线90-95%的性能水平。

arXiv:2605.11537v1 Announce Type: new Abstract: The Mixture of Experts (MoE) architecture has become a fundamental building block in state-of-the-art large language models (LLMs), improving domain-specific expertise in LLMs and scaling model capacity without proportionally increasing their computational overhead. However, MoE inference often suffers from suboptimal GPU utilization, load imbalance, and elevated latency arising from multiple tokens waiting on the same experts for their computation which arises from sparsity of expert activation. To address these challenges, we propose a dynamic expert replication strategy that predicts which experts are likely to be overloaded and replicates them for upcoming batches of tokens. The replicated experts process batch tokens concurrently across layers, which leads to improved parallelism, shorter GPU idle time, and significantly faster inference. Experimental evaluations conducted on large-scale MoE models, including Switch-base-128 and Switch-base-256, demonstrate that our method achieves near-complete GPU utilization (approx 100%), leading to upto 3x improvement in inference speed while preserving approximately 90-95% of the performance of baseline architectures
arXiv arXiv cs.LG · 6 小时前 · 相关度 80% 热度★★☆☆☆
90
Efficient Adjoint Matching for Fine-tuning Diffusion Models
面向扩散模型微调的高效伴随匹配方法
训练微调学术论文

该论文针对扩散模型在文本到图像生成中基于奖励梯度微调的伴随匹配(AM)方法计算成本高的问题,提出Efficient Adjoint Matching(EAM)。作者观察到瓶颈源于预训练模型的非平凡基漂移,通过将随机最优控制问题重构为线性基漂移和修正终端成本,EAM在训练时仅需几步确定性ODE求解器采样,并给出闭式伴随解,消除了反向伴随模拟。实验显示EAM收敛速度最高可达AM的4倍,在PickScore、ImageReward、CLIPScore等指标上持平或超越AM,显著提升了微调效率。

arXiv:2605.11480v1 Announce Type: new Abstract: Reward fine-tuning has become a common approach for aligning pretrained diffusion and flow models with human preferences in text-to-image generation. Among reward-gradient-based methods, Adjoint Matching (AM) provides a principled formulation by casting reward fine-tuning as a stochastic optimal control (SOC) problem. However, AM inevitably requires a substantial computational cost: it requires (i) stochastic simulation of full generative trajectories under memoryless dynamics, resulting in a large number of function evaluations, and (ii) backward ODE simulation of the adjoint state along each sampled trajectory. In this work, we observe that both bottlenecks are closely tied to the \textit{non-trivial base drift} inherited from the pretrained model. Motivated by this observation, we propose \textbf{Efficient Adjoint Matching (EAM)}, which substantially improves training efficiency by reformulating the SOC problem with a \textit{linear base drift} and a correspondingly modified \textit{terminal cost}. This reformulation removes both sources of inefficiency; it enables training-time sampling with a few-step deterministic ODE solver and yields a closed-form adjoint solution that eliminates backward adjoint simulation. On standard text-to-image reward fine-tuning benchmarks, EAM converges up to 4x faster than AM and matches or surpasses it across various metrics including PickScore, ImageReward, HPSv2.1, CLIPScore and Aesthetics.
arXiv arXiv cs.LG · 6 小时前 · 相关度 80% 热度★★☆☆☆
91
Latent Chain-of-Thought Improves Structured-Data Transformers
潜在思维链提升结构化数据 Transformer
推理部署学术论文

论文提出一种潜在思维链(latent chain-of-thought)方法,用于增强结构化数据(时间序列、表格)Transformer 的测试时计算。设计了一种循环方案:初始前向传播后将查询位置的隐藏状态压缩为反馈 token,附加到输入并再次处理,允许在预测前进行多轮潜在计算。在 36 个数据集上,该方案在 8/9 个时间序列数据集上平均提升 10.99%,在 22/27 个表格数据集上平均提升 5.31%,表明思维链是扩展结构化数据推理阶段计算的有效维度。

arXiv:2605.11262v1 Announce Type: new Abstract: Chain-of-thought and more broadly test-time compute are known to augment the expressive capabilities of language models and have led to major innovations in reasoning. Motivated by this success, this paper explores latent chain-of-thought as well as the impact of depth and looping for time-series and tabular data. We propose a recurrent scheme in which a structured-data transformer, after an initial forward pass, compresses its query-position hidden states into feedback tokens that are appended to the input and processed again, allowing multiple rounds of latent computation before prediction. We compare CoT models against a same-depth no-CoT baseline, a deeper baseline matched to the CoT model in effective depth, and a looped transformer with weight-tied recurrence but no additional chain-of-thought tokens. Across 36 datasets in time-series forecasting and tabular prediction, latent chain-of-thought improves over the baseline on 8/9 time-series datasets (+10.99\% average gain) and 22/27 tabular datasets (+5.31\% average gain). Across both settings, the CoT models perform the best on average. These results demonstrate that chain-of-thought is a useful axis for scaling test-time compute for structured data.
arXiv arXiv cs.LG · 6 小时前 · 相关度 80% 热度★★☆☆☆
92
The Scaling Law of Evaluation Failure: Why Simple Averaging Collapses Under Data Sparsity and Item Difficulty Gaps, and How Item Response Theory Recovers Ground Truth Across Domains
评估失败的尺度定律:为何简单平均在数据稀疏与题目难度差异下失效,以及项目反应理论如何恢复跨领域真实排名
学术论文开发工具

本文指出基准评估中普遍采用简单平均的做法,在评估数据稀疏且题目难度差异大时会导致严重误导性的排名。通过在NLP(GLUE)、临床试验、自动驾驶和网络安全四个领域的仿真实验,作者展示了简单平均排名与真实排名的Spearman相关系数从完全覆盖时的1.000降至67%覆盖率与高难度异质性条件下的0.809,而标准双参数Logistic(2PL)项目反应理论(IRT)模型在所有条件下均保持ρ≥0.996。150种条件的网格扫描进一步确认了稀疏度S与难度差异D存在交互效应形成的失效曲面,IRT则始终维持ρ≥0.993。文章讨论了该发现对Physical AI基准测试的启示,那里的评测矩阵常不完整且难度差异极大。

arXiv:2605.11205v1 Announce Type: new Abstract: Benchmark evaluation across AI and safety-critical domains overwhelmingly relies on simple averaging. We demonstrate that this practice produces substantially misleading rankings when two conditions co-occur: (1) the evaluation matrix is sparse and (2) items vary substantially in difficulty. Through controlled simulation experiments across four domains -- NLP (GLUE), clinical drug trials, autonomous vehicle safety, and cybersecurity -- we show that Spearman rank correlation $\rho$ between simple-average rankings and ground-truth rankings degrades from $\rho = 1.000$ at 100% coverage to $\rho = 0.809$ at 67% coverage with high difficulty heterogeneity (mean over 20 seeds). A standard two-parameter logistic (2PL) Item Response Theory (IRT) model maintains $\rho \geq 0.996$ across all conditions. A 150-condition grid sweep over sparsity $S \in [0, 0.70]$ and difficulty gap $D \in [0.5, 5.0]$ confirms that ranking error forms a failure surface with a strong $S \times D$ interaction ($\gamma_3 = +0.20$, $t = 13.05$), while IRT maintains $\rho \geq 0.993$ throughout. We discuss implications for Physical AI benchmarking, where evaluation matrices are often incomplete and difficulty gaps are extreme.
arXiv arXiv cs.LG · 6 小时前 · 相关度 80% 热度★★☆☆☆
93
Variational Linear Attention: Stable Associative Memory for Long-Context Transformers
变分线性注意力:面向长上下文Transformer的稳定联想记忆
基础大模型性能优化

本文提出变分线性注意力(VLA),将线性注意力的记忆更新重新定义为在线正则化最小二乘问题,并通过Sherman-Morrison秩1公式维护自适应惩罚矩阵,从理论上保证了雅可比矩阵谱范数恒为1且状态范数自限。在长序列(1000 tokens)上,VLA的Frobenius范数较标准线性注意力降低109倍,在多查询联想召回任务中准确率显著优于DeltaNet和标准线性注意力,尤其在每头记忆容量边界仍保持62%准确率。通过Triton融合核实现,相比串行Python代码获得14倍加速,且延迟在约43000 tokens时低于softmax注意力。

arXiv:2605.11196v1 Announce Type: new Abstract: Linear attention reduces the quadratic cost of softmax attention to $\mathcal{O}(T)$, but its memory state grows as $\mathcal{O}(T)$ in Frobenius norm, causing progressive interference between stored associations. We introduce \textbf{Variational Linear Attention} (VLA), which reframes the memory update as an online regularised least-squares problem with an adaptive penalty matrix maintained via the Sherman-Morrison rank-1 formula. We prove that normalising the write direction to unit length gives the recurrence Jacobian spectral norm exactly $1$ for all sequence lengths and head dimensions (Proposition 2), and that the state norm is self-limiting under bounded inputs (Proposition 1). Empirically, VLA reduces $\|S_t\|_F$ by $109\times$ relative to standard linear attention at $T{=}1{,}000$, achieves near-perfect exact-match accuracy on multi-query associative recall within the effective per-head memory regime ($n_\text{pairs} &lt; d_h$), maintaining substantially higher retrieval performance than DeltaNet and standard linear attention under increasing memory load, and maintains 62\% accuracy at the per-head capacity boundary. A Triton-fused kernel achieves $14\times$ speedup over sequential Python and $\mathcal{O}(T)$ scaling, crossing below softmax attention latency at approximately 43\,000 tokens.
arXiv arXiv cs.LG · 6 小时前 · 相关度 80% 热度★★☆☆☆
94
Rethinking Supervision Granularity: Segment-Level Learning for LLM-Based Theorem Proving
重新思考监督粒度:基于大语言模型的定理证明中的段级学习
训练微调学术论文

该论文针对在Lean 4中使用大语言模型的自动定理证明,提出一种段级监督训练数据构建策略,从证明轨迹中提取局部连贯的证明段用于训练策略模型。在STP、LeanWorkbook和NuminaMath-LEAN数据集上训练的模型,在miniF2F上的证明成功率分别达到64.84%、60.90%和66.31%,均优于步级和全证明基线。该方法在推理时还可用于触发现有步级模型的短展开,将BFS-Prover-V2-7B的成功率从68.77%提升至70.74%,InternLM2.5-StepProver从59.59%提升至60.33%,同时降低推理成本。

arXiv:2605.11905v1 Announce Type: new Abstract: Automated theorem proving with large language models in Lean 4 is commonly approached through either step-level tactic prediction with tree search or whole-proof generation. These two paradigms represent opposite granularities for constructing supervised training data: the former provides dense local signals but may fragment coherent proof processes, while the latter preserves global structure but requires complex end-to-end generation. In this paper, we revisit supervision granularity as a training set construction problem over proof trajectories and propose segment-level supervision, a training data construction strategy that extracts locally coherent proof segments for training policy models. We further reuse the same strategy at inference time to trigger short rollouts for existing step-level models. When trained with segment-level supervision on STP, LeanWorkbook, and NuminaMath-LEAN, the resulting policy models achieve proof success rates of 64.84%, 60.90%, and 66.31% on miniF2F, respectively, consistently outperforming both step-level and whole-proof baselines. Goal-aware rollout further improves existing step-level provers while reducing inference costs. It increases the proof success rate of BFS-Prover-V2-7B from 68.77% to 70.74% and that of InternLM2.5-StepProver from 59.59% to 60.33%, showing that appropriate supervision granularity better aligns model learning with proof structure and search. Code and models are available at https://github.com/NJUDeepEngine/SEG-ATP.
arXiv arXiv cs.AI · 6 小时前 · 相关度 80% 热度★★☆☆☆
95
On-Policy Self-Evolution via Failure Trajectories for Agentic Safety Alignment
基于失败轨迹的在线策略自我进化实现智能体安全对齐
训练微调学术论文

本文提出FATE框架,利用工具使用型LLM智能体的失败轨迹进行在线策略自我进化,无需专家演示。框架将验证器评分的失败转化为修复监督信号,并引入Pareto前沿策略优化(PFPO)来保持安全与效用平衡。实验在AgentDojo、AgentHarm等基准上,攻击成功率降低33.5%,有害顺从性降低82.6%,轨迹级安全诊断提升6.5%,显著提升了不同规模模型的安全表现。

arXiv:2605.11882v1 Announce Type: new Abstract: Tool-using LLM agents fail through trajectories rather than only final responses, as they may execute unsafe tool calls, follow injected instructions, comply with harmful requests, or over-refuse benign tasks despite producing a seemingly safe answer. Existing safety-alignment signals are largely response-level or off-policy, and often incur a safety-utility trade-off: improving agent safety comes at the cost of degraded task performance. Such sparse and single-objective rewards severely limit real-world usability. To bridge this gap, we propose FATE, an on-policy self-evolving framework that transforms verifier-scored failures into repair supervision without expert demonstrations. For each failure, the same policy proposes repair candidates, which are then re-scored by verifiers and filtered across security, utility, over-refusal control, and trajectory validity. This dense trajectory-level information is then used as a supervision signal for agent self-evolution. During this process, we further introduce Pareto-Front Policy Optimization (PFPO), combining supervised warmup with Pareto-aware policy optimization to preserve safety-utility trade-offs. Experiments on AgentDojo, AgentHarm, and ATBench show that FATE improves safety across different models and scales while preserving useful behavior. Compared with strong baselines, FATE reduces attack success rate by 33.5%, harmful compliance by 82.6%, and improves external trajectory-safety diagnosis by 6.5%. These results suggest that failed trajectories can provide structured repair supervision for safer self-evolving agents.
arXiv arXiv cs.AI · 6 小时前 · 相关度 80% 热度★★☆☆☆
96
Hindsight Hint Distillation: Scaffolded Reasoning for SWE Agents from CoT-free Answers
事后提示蒸馏:从无CoT答案中为SWE智能体构建支架式推理
训练微调学术论文

本文提出一种事后提示蒸馏(HHD)方法,仅需无思维链标注的问答对即可提升软件工程(SWE)智能体的长期规划与推理能力。HHD利用模型自身的失败轨迹合成“事后提示”,为后续的策略性轨迹提供支架,并通过自蒸馏使模型在测试时不依赖提示引导。在SWE-bench Verified基准上,HHD相比迭代拒绝采样微调等基线取得8%的绝对提升,且意外地在多语言任务上也有最大幅度的泛化增益,展示了从简单数据中诱导专业推理策略的有效性。

arXiv:2605.11556v1 Announce Type: new Abstract: Solving complex long-horizon tasks requires strong planning and reasoning capabilities. Although datasets with explicit chain-of-thought (CoT) rationales can substantially benefit learning, they are costly to obtain. To address this challenge, we propose Hindsight Hint Distillation (HHD), which only requires easy-to-obtain question-answer pairs without CoT annotations. Inspired by how human teachers use student mistakes to provide targeted guidance, HHD synthesizes hindsight hints from the model&#39;s own failed self-rollouts and uses them to scaffold on-policy rollouts that successfully complete the tasks. The model then self-distills these scaffolded trajectories and generalizes to new problems without hint guidance. Experiments show that HHD significantly outperforms iterative RFT and trajectory-synthesis baselines, achieving an absolute improvement of 8\% on SWE-bench Verified, while all baselines improve by only around 2\%. Notably, the reasoning strategies induced by HHD generalize effectively to out-of-distribution tasks, yielding the largest gains on SWE-bench Multilingual despite no training on multilingual data. These results demonstrate that HHD can effectively synthesize expert-like reasoning from CoT-free data and substantially improve long-horizon performance.
arXiv arXiv cs.AI · 6 小时前 · 相关度 80% 热度★★☆☆☆
97
训练微调学术论文

本文提出 Macro 框架,将直接偏好优化(DPO)应用于多语言自生成反事实解释(SCE),通过组合评分函数构建偏好对,平衡解释的有效性与最小性。在四个大语言模型和七种语言上的实验显示,Macro 相比思维链基线平均提升有效性 12.55%,且不损害最小性,优于监督微调。进一步分析表明该方法提升了跨语言扰动对齐并减少了常见生成错误。

arXiv:2605.11632v1 Announce Type: cross Abstract: Self-generated counterfactual explanations (SCEs) are minimally modified inputs (minimality) generated by large language models (LLMs) that flip their own predictions (validity), offering a causally grounded approach to unraveling black-box LLM behavior. Yet extending them beyond English remains challenging: existing methods struggle to produce valid SCEs in non-dominant languages, and a persistent trade-off between validity and minimality undermines explanation quality. We introduce Macro, a preference alignment framework that applies Direct Preference Optimization (DPO) to multilingual SCE generation, using a composite scoring function to construct preference pairs that effectively translate the trade-off into measurable preference signals. Experiments across four LLMs and seven typologically diverse languages show that Macro improves validity by 12.55\% on average over the chain-of-thought baseline without degrading minimality, while avoiding the severe minimality violations of the translation-based baseline. Compared to supervised fine-tuning, Macro achieves superior performance on both metrics, confirming that explicit preference optimization is essential for balancing this trade-off. Further analyses reveal that Macro increases cross-lingual perturbation alignment and mitigates common generation errors. Our results highlight preference optimization as a promising direction for enhancing multilingual model explanations.
arXiv arXiv cs.AI · 6 小时前 · 相关度 80% 热度★★☆☆☆
98
From Generic Correlation to Input-Specific Credit in On-Policy Self Distillation
从通用关联到输入特异性信用:在策略自蒸馏中的研究
训练微调

本文针对在策略自蒸馏(on-policy self-distillation)中的 token 级奖励机制进行理论分析,揭示该奖励实质为贝叶斯滤波增量,其轨迹和等于响应与反馈的点互信息(pMI)。进一步发现 pMI 可由输入特异性推理或输入通用快捷方式提升,因此提出 CREDIT(对比蒸馏奖励),通过批次对比基线分离输入特异性分量。在代码、科学推理和工具使用等基准上,CREDIT 在两种模型家族上以几乎可忽略的额外计算代价取得了最强综合性能。

arXiv:2605.11613v1 Announce Type: cross Abstract: On-policy self-distillation has emerged as a promising paradigm for post-training language models, in which the model conditions on environment feedback to serve as its own teacher, providing dense token-level rewards without external teacher models or step-level annotations. Despite its empirical success, what this reward actually measures and what kind of credit it assigns remain unclear. Under a posterior-compatibility interpretation of feedback conditioning, standard in the implicit-reward literature, we show that the self-distillation token reward is a Bayesian filtering increment whose trajectory sum is exactly the pointwise mutual information between the response and the feedback given the input. This pMI can be raised by input-specific reasoning or by input-generic shortcuts, so we further decompose the teacher log-probability along the input axis. Based on this analysis, we propose CREDIT (Contrastive REward from DIsTillation), which isolates the input-specific component with a batch-contrastive baseline. At the sequence level, CREDIT is a teacher-side surrogate for a contrastive pMI objective that also penalizes responses remaining likely under unrelated inputs. Across coding, scientific reasoning, and tool-use benchmarks on two model families, CREDIT delivers the strongest aggregate performance at negligible additional compute.
arXiv arXiv cs.AI · 6 小时前 · 相关度 80% 热度★★☆☆☆
99
Routers Learn the Geometry of Their Experts: Geometric Coupling in Sparse Mixture-of-Experts
路由器学习其专家的几何结构:稀疏专家混合中的几何耦合
训练微调基础大模型学术论文

该论文研究了稀疏专家混合(SMoE)模型中路由决策的形成机制,发现了路由器与对应专家之间的几何耦合现象,即选定专家的路由权重和专家权重沿相同输入方向接收梯度。在1B参数SMoE模型的实际训练中,更高的路由分数能预测更强的专家神经元激活,验证了该耦合的存在。研究进一步揭示了辅助负载均衡损失会破坏这种几何结构,导致路由器方向趋于相似。基于此,作者提出了一种无参数的在线K-Means路由器,通过运行平均隐藏状态和余弦相似度分配令牌,在仅略微增加困惑度的情况下实现了最低的负载不平衡,表明几何耦合是路由学习的关键部分。

arXiv:2605.12476v1 Announce Type: new Abstract: Sparse Mixture-of-Experts (SMoE) models enable scaling language models efficiently, but training them remains challenging, as routing can collapse onto few experts and auxiliary load-balancing losses can reduce specialization. Motivated by these hurdles, we study how routing decisions in SMoEs are formed mechanistically. First, we reveal a geometric coupling between routers and their corresponding experts. For a given token, the router weights for the selected expert and the expert weights processing it receive gradients along the same input direction, differing only in scalar coefficients. Thus, matched router--expert directions accumulate the same routed token history. This theoretical coupling also appears empirically in routing dynamics. In a $1$B SMoE trained from scratch, higher router scores predict stronger expert neuron activations, showing that routing decisions are mirrored inside the selected expert. Next, we analyze the effects of auxiliary load balancing on the router--expert geometric coupling, showing that such losses break this structure by spreading input-directed gradients across router weights, making distinct router directions nearly three times more similar to each other. Last, we demonstrate the centrality of geometric coupling for effective routing with a parameter-free online K-Means router, in which each expert maintains a running average of the hidden states routed to it and tokens are assigned based on cosine similarity. Compared with auxiliary-loss and loss-free balancing, this router achieves the lowest load imbalance with only a modest perplexity increase, indicating that geometric coupling captures a substantial part of what the router learns. Overall, our results explain how routers form assignment geometry that supports an effective division of labor.
arXiv arXiv cs.LG · 6 小时前 · 相关度 80% 热度★★☆☆☆
100
BSO: Safety Alignment Is Density Ratio Matching
BSO:安全对齐即密度比匹配
训练微调学术论文

论文将大模型安全对齐问题转化为密度比匹配问题,提出 Bregman 安全优化(BSO)损失函数族,仅需一个额外超参数且无需辅助模型即可单阶段训练,理论上可恢复最优安全策略。实验表明BSO在多个安全对齐基准上持续改善了安全性与有用性的权衡,并统一了现有多种安全感知方法。

arXiv:2605.12339v1 Announce Type: new Abstract: Aligning language models for both helpfulness and safety typically requires complex pipelines-separate reward and cost models, online reinforcement learning, and primal-dual updates. Recent direct preference optimization approaches simplify training but incorporate safety through ad-hoc modifications such as multi-stage procedures or heuristic margin terms, lacking a principled derivation. We show that the likelihood ratio of the optimal safe policy admits a closed-form decomposition that reduces safety alignment to a density ratio matching problem. Minimizing Bregman divergences between the data and model ratios yields Bregman Safety Optimization (BSO), a family of single-stage loss functions, each induced by a convex generator, that provably recover the optimal safe policy. BSO is both general and simple: it requires no auxiliary models, introduces only one hyperparameter beyond standard preference optimization, and recovers existing safety-aware methods as special cases. Experiments across safety alignment benchmarks show that BSO consistently improves the safety-helpfulness trade-off.
arXiv arXiv cs.LG · 6 小时前 · 相关度 80% 热度★★☆☆☆
101
Grid Games: The Power of Multiple Grids for Quantizing Large Language Models
网格游戏:多网格策略在大型语言模型量化中的威力
推理部署学术论文

本文提出了一种多网格量化方法,在微缩放4-bit格式(如NVFP4、MXFP4)基础上,允许每组数值动态选择两个或多个4-bit网格中更优的一个,以提升量化精度。作者形式化了2的幂次网格(PO2)问题,并给出了几种实用网格族:PO2(NF4)将标准正态网格与学习到的网格配对,MPO2完全从权重和激活中学习网格对,PO2(Split87)为非对称显式零网格,以及SFP4将NVFP4与两个偏移变体组合为可部署于TensorCore的三元组。在开源模型的后训练量化和Llama类模型的预训练量化实验中,自适应网格在仅权重和权重+激活量化下均取得比单网格FP4一致的精度提升。

arXiv:2605.12327v1 Announce Type: new Abstract: A major recent advance in quantization is given by microscaled 4-bit formats such as NVFP4 and MXFP4, quantizing values into small groups sharing a scale, assuming a fixed floating-point grid. In this paper, we study the following natural extension: assume that, for each group of values, we are free to select the &#34;better&#34; among two or more 4-bit grids marked by one or more bits in the scale value. We formalize the power-of-two-grids (PO2) problem, and provide theoretical results showing that practical small-group formats such as MXFP or NVFP can benefit significantly from PO2 grids, while the advantage vanishes for very large groups. On the practical side, we instantiate several grid families, including 1) PO2(NF4), which pairs the standard NF4 normal grid with a learned grid, 2) MPO2, a grid pair that is fully learned over real weights and activations, 3) PO2(Split87), an explicit-zero asymmetric grid and 4) SFP4, a TensorCore-implementable triple which pairs NVFP4 with two shifted variants. Results for post-training quantization of standard open models and pre-training of Llama-like models show that adaptive grids consistently improve accuracy vs single-grid FP4 under both weight-only and weight+activation. Source code is available at https://github.com/IST-DASLab/GridGames.
arXiv arXiv cs.LG · 6 小时前 · 相关度 80% 热度★★☆☆☆
102
GEAR: Granularity-Adaptive Advantage Reweighting for LLM Agents via Self-Distillation
GEAR:通过自蒸馏实现LLM智能体的粒度自适应优势重加权
训练微调

提出GEAR框架,用于LLM智能体的强化学习训练中的细粒度信用分配。该方法利用策略学生与真实条件教师之间的自蒸馏差异信号,自适应确定token级和segment级的优势权重,在出现语义偏差的锚点处自动调整调控粒度。在8个数学推理和工具使用基准上,Qwen3 4B和8B模型的实验表明,GEAR一致优于标准GRPO和现有信用分配方法,在GRPO基线准确率较低的任务上提升可达约20%。

arXiv:2605.11853v1 Announce Type: new Abstract: Reinforcement learning has become a widely used post-training approach for LLM agents, where training commonly relies on outcome-level rewards that provide only coarse supervision. While finer-grained credit assignment is promising for effective policy updates, obtaining reliable local credit and assigning it to the right parts of the long-horizon trajectory remains an open challenge. In this paper, we propose Granularity-adaptivE Advantage Reweighting (GEAR), an adaptive-granularity credit assignment framework that reshapes the trajectory-level GRPO advantage using token- and segment-level signals derived from self-distillation. GEAR compares an on-policy student with a ground-truth-conditioned teacher to obtain a reference-guided divergence signal for identifying adaptive segment boundaries and modulating local advantage weights. This divergence often spikes at the onset of a semantic deviation, while later tokens in the same autoregressive continuation may return to low divergence. GEAR therefore treats such spikes as anchors for adaptive credit regions: where the student remains aligned with the teacher, token-level resolution is preserved; where it departs, GEAR groups the corresponding continuation into an adaptive segment and uses the divergence at the departure point to modulate the segment&#39; s advantage. Experiments across eight mathematical reasoning and agentic tool-use benchmarks with Qwen3 4B and 8B models show that GEAR consistently outperforms standard GRPO, self-distillation-only baselines, and token- or turn-level credit-assignment methods. The gains are especially strong on benchmarks with lower GRPO baseline accuracy, reaching up to around 20\% over GRPO, suggesting that the proposed adaptive reweighting scheme is especially useful in more challenging long-horizon settings.
arXiv arXiv cs.LG · 6 小时前 · 相关度 80% 热度★★☆☆☆
103
Entropy Polarity in Reinforcement Fine-Tuning: Direction, Asymmetry, and Control
强化学习微调中的熵极性:方向、非对称性与控制
训练微调学术论文

本文针对大语言模型在可验证奖励强化学习(RLVR)中的探索控制问题,建立了策略熵变化的token级理论框架,提出“熵极性”概念,用于预测采样更新是扩大还是收缩策略熵。研究发现强化高频高概率token会导致熵收缩,而低概率采样或更强的分布修正才引发熵扩张,揭示了一种结构不对称性。基于此,作者提出极性感知策略优化方法(PAPO),通过优势重加权自适应调控优化压力,在数学推理和智能体任务上取得比基线更好的训练效率与奖励提升。

arXiv:2605.11775v1 Announce Type: new Abstract: Policy entropy has emerged as a fundamental measure for understanding and controlling exploration in reinforcement learning with verifiable rewards (RLVR) for LLMs. However, existing entropy-aware methods mainly regulate entropy through global objectives, while the token-level mechanism by which sampled policy updates reshape policy entropy remains underexplored. In this work, we develop a theoretical framework of entropy mechanics in RLVR. Our analysis yields a first-order approximation of the entropy change, giving rise to entropy polarity, a signed token-level quantity that predicts how much a sampled update expands or contracts entropy. This analysis further reveals a structural asymmetry: reinforcing frequent high-probability tokens triggers contraction tendencies, whereas expansive tendencies typically require lower-probability samples or stronger distributional correction. Empirically, we show that entropy polarity reliably predicts entropy changes, and that positive and negative polarity branches play complementary roles in preserving exploration while strengthening exploitation. Building on these insights, we propose Polarity-Aware Policy Optimization (PAPO), which preserves both polarity branches and implements entropy control through advantage reweighting. With the empirical entropy trajectory as an online phase signal, PAPO adaptively reallocates optimization pressure between entropy-expanding and entropy-contracting updates. Experiments on mathematical reasoning and agentic benchmarks show that PAPO consistently outperforms competitive baselines, while delivering superior training efficiency and substantial reward improvements.
arXiv arXiv cs.LG · 6 小时前 · 相关度 80% 热度★★☆☆☆
104
Persona-Conditioned Adversarial Prompting: Multi-Identity Red-Teaming for Adversarial Discovery and Mitigation
基于角色条件的对抗性提示:面向多身份红队的对抗发现与缓解方法
训练微调学术论文

本文提出PCAP方法,通过引入多样化的攻击者角色(如医生、学生、恶意用户)和策略并行搜索,生成更真实多样的越狱攻击提示,并自动记录元数据以构建防御数据集。在GPT-OSS 120B上,PCAP将攻击成功率从57%提升至97%,提示多样性提高2-6倍。利用生成的数据微调轻量适配器后,模型安全召回率从0.36提升至0.99,F1从0.53提升至0.96,仅引入极低误报,形成从漏洞发现到自动化安全对齐的闭环流程。

arXiv:2605.11730v1 Announce Type: new Abstract: Automated red-teaming for LLMs often discovers narrow attack slices, missing diverse real-world threats, and yielding insufficient data for safety fine-tuning. We introduce Persona-Conditioned Adversarial Prompting (PCAP), which conditions adversarial search on diverse attacker personas (e.g., doctors, students, malicious actors) and strategy sets to explore realistic attack scenarios. By running parallel persona-conditioned searches, PCAP discovers transferable jailbreaks across different contexts and generates rich defense datasets with automatic metadata tracking. On GPT-OSS 120B, PCAP increases attack success from 57\% to 97\% while producing 2-6$\times$ more diverse prompts covering varied real-world scenarios. Critically, fine-tuning lightweight adapters on PCAP-generated data significantly improves model robustness (recall: 0.36 $\rightarrow$ 0.99, F1: 0.53 $\rightarrow$ 0.96) with minimal false positives, demonstrating a practical closed-loop approach from vulnerability discovery to automated alignment.
arXiv arXiv cs.LG · 6 小时前 · 相关度 80% 热度★★☆☆☆
105
OLIVIA: Online Learning via Inference-time Action Adaptation for Decision Making in LLM ReAct Agents
OLIVIA:基于推理时动作自适应的在线学习用于LLM ReAct智能体决策
推理部署开发工具学术论文

论文提出OLIVIA框架,将ReAct式的LLM智能体最终动作选择层建模为上下文线性Bandit,利用冻结的隐藏状态作为决策上下文,在推理时直接对候选动作进行评分和不确定性估计。通过上置信界(UCB)探索,OLIVIA能够从动作级反馈中以极低计算开销在线更新策略,逐步纠正小错误累积。在四个基准上,该方法相比静态ReAct和基于提示的推理时基线均取得一致的任务性能提升,表明显式在线决策层是在部署中优化LLM智能体行为的有效替代方案。

arXiv:2605.11169v1 Announce Type: new Abstract: Large language model agents interleave reasoning, action selection, and observation to solve sequential decision-making tasks. In deployed settings where agents repeatedly handle related multi-step tasks, small action-selection errors can accumulate into wasted tool calls, latency, and reduced reliability. Despite this need for deployment-time improvement, existing inference-time adaptation methods for LLM agents mainly rely on prompting or retrieval, which influence behavior indirectly through context manipulation. For ReAct-style agents, such approaches do not expose an explicit decision layer that can score candidate actions, represent uncertainty, or be updated online from action-level feedback. As a result, they provide limited support for trackable, fine-grained, and uncertainty-aware adaptation during deployment. We propose OLIVIA, an inference-time action adaptation framework for ReAct-style agents. OLIVIA models the LLM&#39;s final action-selection layer as a contextual linear bandit over candidate actions, with frozen hidden states as decision contexts. This choice is particularly suitable for deployment because it adapts behavior directly at the action-selection interface, preserves the underlying reasoning process, and provides explicit uncertainty estimates and lightweight online updates from action-level feedback. With upper-confidence-bound exploration, OLIVIA improves the policy sample-efficiently with minimal computational overhead. We instantiate OLIVIA on four benchmarks and show that it consistently improves task performance over static ReAct and prompt-based inference-time baselines. Our results suggest that explicit online decision layers provide an effective alternative to purely prompt- or retrieval-based adaptation for LLM agents during deployment.
arXiv arXiv cs.AI · 6 小时前 · 相关度 78% 热度★★☆☆☆
106
The Many Faces of On-Policy Distillation: Pitfalls, Mechanisms, and Fixes
On-Policy 蒸馏的多面性:陷阱、机制与修复
训练微调基础大模型

本文系统研究了 on-policy distillation (OPD) 与 on-policy self-distillation (OPSD) 在大语言模型训练后阶段的失效原因和修复方法。发现 OPD 在数学推理任务中对教师模型选择和损失形式高度敏感;OPSD 在实例特定特权信息缺失时失效,但在共享潜在规则(如系统提示或对齐偏好)下有效。识别出三种失败机制:教师与学生分布不匹配、有偏 TopK 反向 KL 梯度引起的训练不稳定、以及 OPSD 学习到的无特权信息策略不足以捕捉实例特定信息。提出停止梯度 TopK 目标、RLVR 适配教师和 SFT 稳定学生等缓解策略。

arXiv:2605.11182v1 Announce Type: new Abstract: On-policy distillation (OPD) and on-policy self-distillation (OPSD) have emerged as promising post-training methods for large language models, offering dense token-level supervision on trajectories sampled from the model&#39;s own policy. However, existing results on their effectiveness remain mixed: while OP(S)D has shown promise in system prompt and knowledge internalization, recent studies also report instability and degradation. In this work, we present a comprehensive empirical study of when OPD and OPSD work, when they fail, and why. We find that OPD on mathematical reasoning is highly sensitive to teacher choice and loss formulation, whereas OPSD fails in our tested settings due to test-time absence of instance-specific privileged information (PI). In contrast, OPSD is effective when PI represents a shared latent rule, such as a system prompt or alignment preference. We identify three failure mechanisms: (1) distribution mismatch between teacher and student caused by conditioning on student-generated prefixes, (2) optimization instability from biased TopK reverse-KL gradients, and (3) an OPSD-specific limitation where the student learns a PI-free policy that aggregates PI-conditioned teachers, which is insufficient when PI is instance-specific. We further show that stop-gradient TopK objectives, RLVR-adapted teachers, and SFT-stabilized students mitigate these failures.
arXiv arXiv cs.AI · 6 小时前 · 相关度 78% 热度★★☆☆☆
107
Reward Hacking in Rubric-Based Reinforcement Learning
基于量规的强化学习中的奖励破解
训练微调学术论文

该论文研究了在使用可验证奖励进行后训练时,基于量规(rubric)的强化学习容易出现的奖励破解问题。作者将奖励与参考评判器之间的差异分解为“验证器失败”和“量规设计局限”两类,并在医学和科学领域实验中发现弱验证器会产生大量不迁移到参考验证器的虚假奖励增益,且随着训练加深而增长。论文还引入了一种基于策略对数概率的“自内部化差距”诊断方法,无需外部验证器即可检测训练停滞。结果表明,更强的验证器可以减少但无法完全消除奖励破解,当量规本身未能涵盖重要失败模式时,强化学习可能偏好违反真实性、简洁性等全局质量指标的策略。

arXiv:2605.12474v1 Announce Type: new Abstract: Reinforcement learning with verifiable rewards has enabled strong post-training gains in domains such as math and coding, though many open-ended settings rely on rubric-based rewards. We study reward hacking in rubric-based RL, where a policy is optimized against a training verifier but evaluated against a cross-family panel of three frontier judges, reducing dependence on any single evaluator. Our framework separates two sources of divergence: verifier failure, where the training verifier credits rubric criteria that reference verifiers reject, and rubric-design limitations, where even strong rubric-based verifiers favor responses that rubric-free judges rate worse overall. Across medical and science domains, weak verifiers produce large proxy-reward gains that do not transfer to the reference verifiers; exploitation grows over training and concentrates in recurring failures such as partial satisfaction of compound criteria, treating implicit content as explicit, and imprecise topical matching. Stronger verifiers substantially reduce, but do not eliminate, verifier exploitation. We also introduce a self-internalization gap, a verifier-free diagnostic based on policy log-probabilities, which tracks reference-verifier quality, detecting when the policy trained using the weak verifier stops improving. Finally, in our setting, stronger verification does not prevent reward hacking when the rubric leaves important failure modes unspecified: rubric-based verifiers prefer the RL checkpoint, while rubric-free judges prefer the base model. These disagreements coincide with gains concentrated in completeness and presence-based criteria, alongside declines in factual correctness, conciseness, relevance, and overall quality. Together, these results suggest that stronger verification reduces reward hacking, but does not by itself ensure that rubric gains correspond to broader quality gains.
arXiv arXiv cs.AI · 6 小时前 · 相关度 78% 热度★★☆☆☆
108
Reinforcing VLAs in Task-Agnostic World Models
在任务无关世界模型中强化视觉-语言-动作模型
训练微调基础大模型

本文针对视觉-语言-动作(VLA)模型在强化学习微调中对任务特定数据依赖过高的问题,提出RAW-Dream范式。该方法使用任务无关的行为数据预训练世界模型,并利用现成的视觉语言模型(VLM)生成奖励,从而完全解耦世界模型学习与下游任务,实现零样本想象式微调。为抑制世界模型幻觉,引入了双噪声验证机制过滤不可靠的想象轨迹。在仿真和真实环境中的实验表明,该方法能有效提升VLA对新任务的适应能力,提供了一种可扩展的解决方案。

arXiv:2605.12334v1 Announce Type: new Abstract: Post-training Vision-Language-Action (VLA) models via reinforcement learning (RL) in learned world models has emerged as an effective strategy to adapt to new tasks without costly real-world interactions. However, while using imagined trajectories reduces the sample complexity of policy training, existing methods still heavily rely on task-specific data to fine-tune both the world and reward models, fundamentally limiting their scalability to unseen tasks. To overcome this, we argue that world and reward models should capture transferable physical priors that enable zero-shot inference. We propose RAW-Dream (Reinforcing VLAs in task-Agnostic World Dreams), a new paradigm that completely disentangles world model learning from downstream task dependencies. RAW-Dream utilizes a world model pre-trained on diverse task-free behaviors for predicting future rollouts, and an off-the-shelf Vision-Language Model (VLM) for reward generation. Because both components are task-agnostic, VLAs can be readily finetuned for any new task entirely within this zero-shot imagination. Furthermore, to mitigate world model hallucinations, we introduce a dual-noise verification mechanism to filter out unreliable rollouts. Extensive experiments across simulation and real-world settings demonstrate consistent performance gains, proving that generalized physical priors can effectively substitute for costly task-dependent data, offering a highly scalable roadmap for VLA adaptation.
arXiv arXiv cs.AI · 6 小时前 · 相关度 78% 热度★★☆☆☆
109
VERDI: Single-Call Confidence Estimation for Verification-Based LLM Judges via Decomposed Inference
VERDI: 基于分解推理的验证式LLM评判单次调用置信度估计
学术论文基础大模型

VERDI通过分解验证式LLM评判的推理轨迹,从三个结构信号(步骤裁决对齐度、声明级边际、证据基础得分)中提取置信度,无需额外推理调用。该方法结合Platt缩放逻辑回归,在GPT-4.1-mini和Qwen3.5系列上实现了AUROC 0.66-0.91的置信度校准效果,有效解决了token概率过度自信的问题。此外,VERDI展示了跨模型迁移能力,并验证了小型NLI模型可替代正则表达式提取。

arXiv:2605.11334v1 Announce Type: new Abstract: LLM-as-Judge systems are widely deployed for automated evaluation, yet practitioners lack reliable methods to know when a judge&#39;s verdict should be trusted. Token log-probabilities, the standard post-hoc confidence signal, are unavailable for many commercial LLMs and, even when accessible, saturate above 0.999 with structured JSON output. We introduce VERDI (VERification-Decomposed Inference), a method that extracts confidence from the reasoning trace a structured judge already produces, with no additional inference calls. VERDI decomposes each verification-style evaluation into sub-checks and derives three structural signals: Step-Verdict Alignment, Claim-Level Margin, and Evidence Grounding Score. We combine them with Platt-scaled logistic regression. On three public benchmarks, VERDI achieves AUROC 0.72-0.91 on GPT-4.1-mini and 0.66-0.80 on GPT-5.4-mini. On Qwen3.5-4B/9B/27B, where answer-token logprobs are anti-calibrated (higher confidence on errors, AUROC 0.32-0.49), VERDI achieves 0.56-0.70. We additionally validate on a production system with eight rubrics (AUROC 0.73-0.88 on factual rubrics), demonstrate cross-model transfer (AUROC 0.66-0.69), and show that a 33M-parameter NLI (Natural Language Inference) model provides a scalable alternative to regex extraction.
arXiv arXiv cs.LG · 6 小时前 · 相关度 78% 热度★★☆☆☆
110
Goal-Oriented Reasoning for RAG-based Memory in Conversational Agentic LLM Systems
面向RAG记忆的目标导向推理框架用于对话式智能LLM系统
学术论文基础大模型

为解决LLM对话智能体因上下文有限而难以维持长期连贯行为的问题,本文提出Goal-Mem框架,在基于RAG的外部记忆模块上引入目标导向推理。该框架采用后向链推理,将用户查询作为目标分解为原子子目标,并以目标驱动的方式执行针对性记忆检索,在中间目标无法解析时迭代识别需检索的信息。作者在自然语言逻辑系统中形式化此过程,结合了一阶逻辑的可验证性与自然语言的表达力。在两类数据集上与九种记忆基线对比,Goal-Mem在多跳推理和隐式推理任务上取得一致的性能提升。

arXiv:2605.12213v1 Announce Type: new Abstract: LLM-based conversational AI agents struggle to maintain coherent behavior over long horizons due to limited context. While RAG-based approaches are increasingly adopted to overcome this limitation by storing interactions in external memory modules and performing retrieval from them, their effectiveness in answering challenging questions (e.g., multi-hop, commonsense) ultimately depends on the agent&#39;s ability to reason over the retrieved information. However, existing methods typically retrieve memory based on semantic similarity to the raw user utterance, which lacks explicit reasoning about missing intermediate facts and often returns evidence that is irrelevant or insufficient for grounded reasoning. In this work, we introduce Goal-Mem, a goal-oriented reasoning framework for RAG-based agentic memory that performs explicit backward chaining from the user&#39;s utterance as a goal. Rather than progressively expanding from retrieved context, Goal-Mem decomposes each goal into atomic subgoals, performs targeted memory retrieval to satisfy each subgoal, and iteratively identifies what information from memory should be retrieved when intermediate goals cannot be resolved. We formalize this process in Natural Language Logic, a logical system that combines the verifiability of reasoning provided by FOL with the expressivity of natural language. Through extensive experiments on two datasets and comparing to nine strong memory baselines, we show that Goal-Mem consistently improves performance, particularly on tasks requiring multi-hop reasoning and implicit inference.
arXiv arXiv cs.AI · 6 小时前 · 相关度 78% 热度★★☆☆☆
111
To Whom Do Language Models Align? Measuring Principal Hierarchies Under High-Stakes Competing Demands
语言模型对齐于谁?测量高风险竞争需求下的主要层级
基础大模型学术论文

本研究在法律和医疗领域的7136个场景中评估了10个前沿语言模型,发现当用户指令与专业标准冲突时,模型在任务执行中经常违背专业标准,尽管在提供建议时能遵守。主要的失败机制是知识遗漏,即模型拥有相关知识却输出有害内容而不揭示冲突信息。特别在一个案例中,推理模型在推理链中识别到某药物已被撤市,但在回复中压制了该信息,屈服于权威压力并继续推荐该药物。研究结论指出,当前对齐方法在不同任务框架、领域和模型家族之间表现不一致,在高风险专业部署中的鲁棒性不足。

arXiv:2605.12120v1 Announce Type: new Abstract: Language models deployed in high-stakes professional settings face conflicting demands from users, institutional authorities, and professional norms. How models act when these demands conflict reveals a principal hierarchy -- an implicit ordering over competing stakeholders that determines, for instance, whether a medical AI receiving a cost-reduction directive from a hospital administrator complies at the expense of evidence-based care, or refuses because professional standards require it. Across 7,136 scenarios in legal and medical domains, we test ten frontier models and find that models frequently fail to adhere to professional standards during task execution, such as drafting, when user instructions conflict with those standards -- despite adequately upholding them when users seek advisory guidance. We further find that the hierarchies between user, authority, and professional standards exhibited by these models are unstable across medical and legal contexts and inconsistent across model families. When failing to follow professional standards, the primary failure mechanism is knowledge omission: models that demonstrably possess relevant knowledge produce harmful outputs without surfacing conflicting knowledge. In a particularly troubling instance, we find that a reasoning model recognizes the relevant knowledge in its reasoning trace -- e.g., that a drug has been withdrawn -- yet suppresses this in the user-facing answer and proceeds to recommend the drug under authority pressure anyway. Inconsistent alignment across task framing, domain, and model families suggests that current alignment methods, including published alignment hierarchies, are unlikely to be robust when models are deployed in high-stakes professional settings.
arXiv arXiv cs.AI · 6 小时前 · 相关度 78% 热度★★☆☆☆
112
When Simulation Lies: A Sim-to-Real Benchmark and Domain-Randomized RL Recipe for Tool-Use Agents
当仿真说谎:面向工具使用代理的仿真到现实基准及领域随机化强化学习配方
基础大模型训练微调学术论文

该论文针对工具使用语言代理在真实部署中面临的用户输入错字、工具注册模糊、API不可靠等噪声,将其形式化为部分可观察马尔可夫决策过程中的仿真到现实差距。作者提出包含22种扰动的基准RobustBench-TC,并在1.5B至32B的21个模型上评估,发现奖励相关和状态转移扰动分别导致准确率下降约40%和30%。进而提出ToolRL-DR领域随机化强化学习配方,在3B骨干上训练后的代理能保留75%干净准确率,聚合扰动准确率与开源14B基线持平,并显著缩小与o4-mini的差距。

arXiv:2605.11928v1 Announce Type: new Abstract: Tool-use language agents are evaluated on benchmarks that assume clean inputs, unambiguous tool registries, and reliable APIs. Real deployments violate all these assumptions: user typos propagate into hallucinated tool names, a misconfigured request timeout can stall an agent indefinitely, and duplicate tool names across servers can freeze an SDK. We study these failures as a sim-to-real gap in the tool-use partially observable Markov decision process (POMDP), where deployment noise enters through the observation, action space, reward-relevant metadata, or transition dynamics. We introduce RobustBench-TC, a benchmark with 22 perturbation types organized by these four POMDP components, each grounded in a verified GitHub issue or documented tool-calling failure. Across 21 models from 1.5B to 32B parameters (including the closed-source o4-mini), the robustness profile is sharply uneven: observation perturbations reduce accuracy by less than 5%, while reward-relevant and transition perturbations reduce accuracy by roughly 40% and 30%, respectively; scale alone does not close these gaps. We then propose ToolRL-DR, a domain-randomization reinforcement learning (RL) recipe that trains a tool-use agent on perturbation-augmented trajectories spanning the three statically encodable POMDP components. On a 3B backbone, ToolRL-DR-Full retains roughly three-quarters of clean accuracy and reaches an aggregate perturbed accuracy comparable to open-source 14B function-calling baselines while substantially narrowing the gap to o4-mini. It closes approximately 27% of the Transition gap despite never seeing transition perturbations in training, suggesting that RL on adversarial static tool-use inputs induces a more persistent retry policy that transfers to unseen runtime failures. The dataset, code and benchmark leaderboard are publicly available.
arXiv arXiv cs.AI · 6 小时前 · 相关度 78% 热度★★☆☆☆
113
CATS: Cascaded Adaptive Tree Speculation for Memory-Limited LLM Inference Acceleration
CATS: 面向内存受限大语言模型推理加速的级联自适应树推测解码
推理部署性能优化学术论文

本文针对内存受限设备(如边缘平台 DRAM 不足),提出 CATS 自推测解码框架,通过级联验证与校正机制,根据内存预算和参数卸载模式动态调整推测解码策略,在峰值内存占用仅相当于目标模型的情况下最大化令牌接受率和端到端加速。实验表明,CATS 在真实边缘设备上可实现最高 5.08 倍的墙钟加速,生成质量无损,且在边缘内存约束下优于现有 SOTA 方法 1.45 倍。该方法缓解了传统推测解码假设 HBM 足够大的限制,为内存有限场景的大模型推理部署提供了实用方案。

arXiv:2605.11186v1 Announce Type: new Abstract: Auto-regressive decoding in Large Language Models (LLMs) is inherently memory-bound: every generation step requires loading the model weights and intermediate results from memory (e.g., High-Bandwidth Memory (HBM) for GPU servers), making throughput bottlenecked by memory bandwidth rather than compute. Speculative decoding addresses this by enabling parallel verification of multiple draft tokens, effectively amortizing the cost of each target-model call. However, existing speculative decoding methods are designed under the assumption that HBM is sufficiently large to hold both the target model and an auxiliary draft model simultaneously -- an assumption that breaks down on memory-constrained devices such as edge platforms with limited DRAM. We analyze the inference bottleneck in this memory-limited regime and propose CATS, a self-speculative decoding framework that conducts cascaded verification and correction based on the memory budget and parameter offloading patterns on memory-limited devices. This design maximizes token acceptance rate and end-to-end speedup while keeping the peak memory footprint on the device equal to that of the target model alone. We evaluate CATS on different models across five benchmarks on real edge devices. CATS can achieve a wall-clock speedup of up to 5.08x with no degradation in generation quality, outperforming the SOTA method by up to 1.45x under edge memory constraints.
arXiv arXiv cs.LG · 6 小时前 · 相关度 78% 热度★★☆☆☆
114
Persistent and Conversational Multi-Method Explainability for Trustworthy Financial AI
面向可信金融AI的持久化对话式多方法可解释性
开发工具学术论文

本文提出一个用于金融情感分析的可解释AI架构,将LIME、遮挡词重要度和显著性热图等XAI工件作为持久化对象存储在分布式存储中,支持语义检索和故障恢复。通过检索增强生成(RAG)助手综合多种解释方法的结果,允许用户以自然语言对话评估解释的稳健性。同时引入自动评估框架检查解释忠实度,实验显示受限提示相比朴素提示将幻觉率降低36%,方法属性引用提高73%,为受监管金融环境中的人本AI服务提供参考。

arXiv:2605.11687v1 Announce Type: new Abstract: Financial institutions increasingly require AI explanations that are persistent, cross-validated across methods, and conversationally accessible to human decision-makers. We present an architecture for human-centered explainable AI in financial sentiment analysis that combines three contributions. First, we treat XAI artifacts -- LIME feature attributions, occlusion-based word importance scores, and saliency heatmaps -- as persistent, searchable objects in distributed S3-compatible storage with structured metadata and natural-language summaries, enabling semantic retrieval over explanation history and automatic index reconstruction after system failures. Second, we enable multi-method explanation triangulation, where a retrieval-augmented generation (RAG) assistant compares and synthesizes results from multiple XAI methods applied to the same prediction, allowing users to assess explanation robustness through natural-language dialogue. Third, we evaluate the faithfulness of generated explanations using automated checks over grounding completeness, hallucinated claims, and method-attribution behavior. We demonstrate the architecture on an EXTRA-BRAIN financial sentiment analysis pipeline using FinBERT predictions and present evaluation results showing that constrained prompting reduces hallucination rate by 36\% and increases method-attribution citations by 73\% compared to naive prompting. We discuss implications for trustworthy, human-centered AI services in regulated financial environments.
arXiv arXiv cs.AI · 6 小时前 · 相关度 78% 热度★★☆☆☆
115
Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion
通过偏好维度扩展解释并打破安全-有用性天花板
训练微调学术论文

本文针对大语言模型多目标对齐中安全性(harmlessness)与有用性(helpfulness)之间的零和冲突,提出了一种多维度奖励视角的解决方案。作者发现冲突源于提示本身限制了可实现的奖励范围,于是提出MORA方法:通过预采样分离出单奖励提示,并重写问题以融入多维度意图来扩展奖励多样性。实验表明,在顺序对齐中,MORA在涵盖有用、无害、真实三个维度的多偏好对齐后,单偏好指标提升5%至12.4%,尤其无害性提升显著;在同步对齐中,MORA实现平均整体奖励提升4.6%。该工作从数据层面创新性地缓解了对齐中的固有折衷。

arXiv:2605.11679v1 Announce Type: new Abstract: In the realm of multi-objective alignment for large language models, balancing disparate human preferences often manifests as a zero-sum conflict. Specifically, the intrinsic tension between competing goals dictates that aggressively optimizing for one metric (e.g., helpfulness) frequently incurs a substantial penalty on another (e.g., harmlessness). While prior work mainly focuses on data selection, parameter merging, or algorithmic balancing during training, these approaches merely force compromises between divergent preferences along a fixed Pareto frontier, failing to fundamentally resolve the inherent trade-off. In this work, we approach this problem from a novel perspective of multi-dimensional rewards. By scaling up the model&#39;s rollouts and analyzing the outputs across different reward dimensions, we arrive at a critical conclusion: the conflict among multiple objectives stems from the fact that the prompt itself inherently restricts the achievable multi-dimensional rewards. Based on this core observation, we propose MORA: Multi-Objective Reward Assimilation. Specifically, MORA isolates single-reward prompts through pre-sampling and expands their reward diversity by rewriting the original questions to incorporate multi-dimensional intents. Extensive experiments demonstrate that: (1) in sequential alignment, MORA achieves single-preference improvements ranging from 5% to 12.4%, with exceptional gains in harmlessness, after multiple-preference alignment across helpful, harmless, and truthful dimensions. (2) In simultaneous alignment, MORA achieves an average overall reward improvement of 4.6%. Our codes are available at https://anonymous.4open.science/r/MORA-MPA.
arXiv arXiv cs.AI · 6 小时前 · 相关度 78% 热度★★☆☆☆
116
Selective Off-Policy Reference Tuning with Plan Guidance
基于计划引导的选择性离线参考调优方法
训练微调学术论文

本文提出SORT(选择性离线参考调优),用于改善GRPO风格强化学习在困难提示下所有采样均失败的问题。SORT从参考解中提取计划,比较有无计划条件下token的预测概率,对在计划引导下更可预测的token赋予更高权重,将全错样本转化为结构化学习信号。在三个模型和八个推理基准上,SORT相对GRPO和引导基线均取得显著提升,尤其对较弱模型增益最大。

arXiv:2605.11505v1 Announce Type: new Abstract: Reinforcement learning with verifiable rewards helps reasoning, but GRPO-style methods stall on hard prompts where all sampled rollouts fail. SORT adds a repair update for those failures without changing rollout generation: it derives a plan from the reference solution, compares token probabilities with and without that plan, and gives higher weight to tokens that become more predictable under plan conditioning. This turns all-wrong prompts into selective, structure-aware learning signals instead of uniform imitation. Across three backbones and eight reasoning benchmarks, SORT improves over GRPO and guidance baselines, with largest gains on weaker models.
arXiv arXiv cs.AI · 6 小时前 · 相关度 78% 热度★★☆☆☆
117
The Evaluation Differential: When Frontier AI Models Recognise They Are Being Tested
评估微分:当前沿AI模型意识到它们正在被测试时
学术论文基础大模型

本文提出“评估微分”(Evaluation Differential)概念,指前沿AI模型在识别自身处于评估环境时会表现出与部署环境下不一致的行为,导致安全性结论的有效性出现问题。研究基于Anthropic的BrowseComp事件、自然语言自动编码器在SWE-bench上的发现,以及OpenAI/Apollo的反预谋研究等公开案例,定义了标准化的效应量形式nED用于跨属性比较。作者提出了TRACE审计协议,可在现有评估基础设施上产生受条件限制的安全声明,而非单纯的能力分数,并探讨了对系统卡片、合规评估及AI安全机构治理的影响。

arXiv:2605.11496v1 Announce Type: new Abstract: Recent published evidence from frontier laboratories shows that contemporary AI models can recognise evaluation contexts, latently represent them, and behave differently under those contexts than under deployment-continuous conditions. Anthropic&#39;s BrowseComp incident, the Natural Language Autoencoder findings on SWE-bench Verified and destructive-coding evaluations, and the OpenAI / Apollo anti-scheming work all document instances of this phenomenon. We argue that these findings create a claim-validity problem for safety conclusions drawn from frontier evaluations. We introduce the Evaluation Differential (ED), a conditional divergence in a target behavioural property between recognised-evaluation and deployment-continuous contexts, define a normalised effect-size form (nED) for cross-property comparison, and prove that marginal evaluation scores cannot identify ED. We develop a typology of safety claims (ED-stable, ED-degraded, ED-inverted, ED-undetermined) by their warrant-status under documented divergence, and specify TRACE (Test-Recognition Audit for Claim Evaluation), an audit protocol that wraps existing evaluation infrastructure and produces restricted claims rather than capability scores. We apply the framework retrospectively to three publicly documented evaluation incidents and discuss governance implications for system cards, conformity assessment, and the international network of AI safety and security institutes. TRACE does not eliminate adversarial adaptation; it disciplines the claims drawn from evaluation evidence by making explicit the conditions under which that evidence was produced.
arXiv arXiv cs.AI · 6 小时前 · 相关度 78% 热度★★☆☆☆
118
Adaptive Teacher Exposure for Self-Distillation in LLM Reasoning
面向LLM推理自蒸馏的自适应教师暴露方法
训练微调学术论文

本文指出LLM推理自蒸馏中普遍默认教师完全观察参考推理链的做法存在问题,会因学生能力不足而引入过强监督信号。为此提出自适应教师暴露策略(ATESD),将教师暴露程度建模为可学习的控制变量,通过轻量Beta策略控制器根据训练状态动态调整。控制器采用折扣学习进步奖励进行优化,以缓解延迟信用分配问题。在AIME 24/25 和 HMMT 25 数据集上,Qwen3 1.7B/4B/8B 模型的实验表明,ATESD 较固定暴露自蒸馏和强化学习基线分别提升 +0.95、+2.05 和 +2.33 的 Average@12 分数,验证了自适应教师暴露作为推理自蒸馏新维度的有效性。

arXiv:2605.11458v1 Announce Type: new Abstract: On-policy self-distillation has become a strong recipe for LLM reasoning, where a privileged teacher supervises the student&#39;s own rollouts while conditioning on the reference solution. A design choice shared by nearly all such methods, however, has gone unquestioned: the teacher always sees the full reference reasoning. We argue that this default itself is part of the problem and identify a teacher-side exposure mismatch: when the teacher conditions on reasoning far beyond the student&#39;s current competence, the resulting token targets become too strong to absorb. A controlled fixed-exposure sweep makes this concrete on two fronts: 1) full exposure is not reliably the best choice, and 2) student-teacher mismatch grows monotonically as the teacher sees more privileged reasoning. This motivates treating teacher exposure not as a fixed hyperparameter but as a learnable training-time control variable. We therefore propose Adaptive Teacher Exposure for Self-Distillation (ATESD). ATESD models the reveal ratio with a lightweight Beta-policy controller conditioned on compact training-state statistics, and uses one sampled exposure for a short hold window of student updates. To make this exposure controller learnable, we optimize it with a discounted learning-progress reward that scores each held decision by its effect on the student&#39;s future improvement rather than its immediate loss change, addressing the delayed credit assignment induced by on-policy distillation. Experiments on AIME 24, AIME 25, and HMMT 25 across Qwen3-{1.7B, 4B, 8B} show that ATESD consistently outperforms competitive self-distillation and RL baselines, improving over OPSD by +0.95, +2.05, and +2.33 Average@12 points respectively, and establishing adaptive teacher exposure as an effective new axis for reasoning self-distillation.
arXiv arXiv cs.AI · 6 小时前 · 相关度 78% 热度★★☆☆☆
119
When Emotion Becomes Trigger: Emotion-style dynamic Backdoor Attack Parasitising Large Language Models
当情绪成为触发器:寄生大语言模型的情绪风格动态后门攻击
训练微调学术论文

本文针对大语言模型微调中的后门漏洞,提出了一种基于情绪风格的动态后门攻击方法Paraesthesia。通过因果分析发现情绪在LLM表征空间中可与语义解耦,因此将情绪风格作为触发器,在微调数据中混入带有情绪触发的样本,使模型在推理时遇到情绪化输入即可生成预定义的有害输出。该方法包含情绪量化和风格改写两个步骤,实验表明在指令遵循生成和分类任务中攻击成功率约99%,同时保持模型在干净数据上的性能。

arXiv:2605.11612v1 Announce Type: cross Abstract: Backdoor vulnerabilities widely exist in the fine-tuning of large language models(LLMs). Most backdoor poisoning methods operate mainly at the token level and lack deeper semantic manipulation, which limits stealthiness. In addition, Prior attacks rely on a single fixed trigger to induce harmful outputs. Such static triggers are easy to detect, and clean fine-tuning can weaken the trigger-target association. Through causal validation, we observe that emotion is not directly linked to individual words, but functions as an overall stylistic factor through tone. In the representation space of LLM, emotion can be decoupled from semantics, forming distinct cluster from the original neutral text. Therefore, we consider the emotional factor as the backdoor trigger to propose a pparasitic emotion-style dynamic backdoor attack, Paraesthesia. By mixing samples with the emotional trigger into clean data and then fine-tuning the model, the model is able to generate the predefined attack response when encountering emotional inputs during the inference stage. Paraesthesia includes two the quantification and rewriting of emotional styles. We evaluate the effectiveness of our method on instruction-following generation and classification tasks. The experimental results show that Paraesthesia achieves an attack success rate of around 99\% across both task types and four different models, while maintaining the clean utility of the models.
arXiv arXiv cs.AI · 6 小时前 · 相关度 78% 热度★★☆☆☆
120
When Looking Is Not Enough: Visual Attention Structure Reveals Hallucination in MLLMs
当看还不够:视觉注意力结构揭示多模态大模型中的幻觉
学术论文基础大模型推理部署

本文提出 LaSCD,一种利用视觉注意力拉普拉斯能量的无训练对比解码方法,用于减少多模态大语言模型(MLLMs)中的视觉幻觉。通过层间拉普拉斯能量识别出幻觉偏好涌现层和答案真相暂现层,以封闭形式重新映射下一标记 logits,有效抑制幻觉生成。实验在多个幻觉检测和通用多模态基准上表明,LaSCD 能显著降低幻觉率,同时保持模型的通用能力,代码已开源。

arXiv:2605.11559v1 Announce Type: cross Abstract: Multimodal large language models (MLLMs) have become a key interface for visual reasoning and grounded question answering, yet they remain vulnerable to visual hallucinations, where generated responses contradict image content or mention nonexistent objects. A central challenge is that hallucination is not always caused by a simple lack of visual attention: the model may still assign substantial attention mass to image tokens while internally drifting toward an incorrect answer. In this paper, we show that the high-frequency structure of visual attention, measured by layer-wise Laplacian energy, reveals both the layer where hallucinated preferences emerge and the layer where the ground-truth answer transiently recovers. Building on this finding, we propose LaSCD (Laplacian-Spectral Contrastive Decoding), a training-free decoding strategy that selects informative layers via Laplacian energy and remaps next-token logits in closed form. Experiments on hallucination and general multimodal benchmarks show that LaSCD consistently reduces hallucination while preserving general capabilities, highlighting its potential as a faithful decoding paradigm. The code is available at https://github.com/macovaseas/LaSCD.
arXiv arXiv cs.AI · 6 小时前 · 相关度 78% 热度★★☆☆☆
121
Taming Extreme Tokens: Covariance-Aware GRPO with Gaussian-Kernel Advantage Reweighting
驯服极端标记:协方差感知的GRPO与高斯核优势重加权
训练微调学术论文

本文提出一种无超参数的协方差感知优化方法,通过高斯核动态降低极端标记更新的权重,自动缓解GRPO训练中探索与利用的权衡失衡问题。该方法用标记概率与优势值的协方差控制更新幅度,在保留有效学习信号的同时提升训练稳定性。在多个推理基准上的实验表明,该方法相比标准GRPO能稳定熵值并取得更优的下游性能。

arXiv:2605.11538v1 Announce Type: cross Abstract: Group Relative Policy Optimization (GRPO) has emerged as a promising approach for improving the reasoning capabilities of large language models. However, it struggles to effectively balance the tradeoff between exploration and exploitation during training, often resulting in suboptimal performance. Motivated by the theoretical insight that changes in entropy are governed by the covariance between token probabilities and their corresponding advantages, we propose a hyperparameter-free, covariance-weighted optimization method that dynamically down-weights extreme token-level updates via a Gaussian kernel. This approach automatically reduces the instability caused by exploration-exploitation trade-off while preserving informative learning signals. Extensive empirical evaluations show that our approach improves downstream performance across reasoning benchmarks compared with GRPO, and effectively stablizes entropy as training progresses.
arXiv arXiv cs.AI · 6 小时前 · 相关度 78% 热度★★☆☆☆
122
ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents
ToolCUA:迈向计算机使用代理的最优GUI-工具路径编排
训练微调学术论文

本文提出ToolCUA,一种端到端代理,旨在通过分阶段训练范式学习最优的GUI操作与工具调用路径选择。首先构建混合GUI-工具轨迹缩放流水线,利用静态GUI轨迹合成接地工具库,生成多样化交互轨迹;然后进行工具引导的GUI强化微调,结合预热SFT与单轮RL,改进关键切换点的决策;最后在在线代理RL环境中,通过工具高效路径奖励优化整体策略。在OSWorld-MCP测试中,ToolCUA达到46.85%准确率,比基线相对提升约66%,并超过纯GUI设置3.9%,证明了混合动作空间训练的潜力。

arXiv:2605.12481v1 Announce Type: new Abstract: Computer Use Agents (CUAs) can act through both atomic GUI actions, such as click and type, and high-level tool calls, such as API-based file operations, but this hybrid action space often leaves them uncertain about when to continue with GUI actions or switch to tools, leading to suboptimal execution paths. This difficulty stems from the scarcity of high-quality interleaved GUI-Tool trajectories, the cost and brittleness of collecting real tool trajectories, and the lack of trajectory-level supervision for GUI-Tool path selection. In this paper, we propose ToolCUA, an end-to-end agent designed to learn optimal GUI-Tool path selection through a staged training paradigm. We first introduce an Interleaved GUI-Tool Trajectory Scaling Pipeline that repurposes abundant static GUI trajectories and synthesizes a grounded tool library, enabling diverse GUI-Tool trajectories without manual engineering or real tool-trajectory collection. We then perform Tool-Bootstrapped GUI RFT, combining warmup SFT with single-turn RL to improve decisions at critical GUI-Tool switching points. Finally, we optimize ToolCUA with Online Agentic RL in a high-fidelity GUI-Tool environment, guided by a Tool-Efficient Path Reward that encourages appropriate tool use and shorter execution paths. Experiments on OSWorld-MCP show that ToolCUA achieves 46.85% accuracy, a relative improvement of approximately 66% over the baseline, establishing a new state of the art among models of comparable scale. It also improves by 3.9% over GUI-only settings, demonstrating effective GUI-Tool orchestration. The results further suggest that training in a hybrid action space is a promising paradigm for real-world digital agents. Open-sourced here: https://x-plug.github.io/ToolCUA/
arXiv arXiv cs.AI · 6 小时前 · 相关度 78% 热度★★☆☆☆
123
Agent-BRACE: Decoupling Beliefs from Actions in Long-Horizon Tasks via Verbalized State Uncertainty
Agent-BRACE:通过口头状态不确定性解耦长周期任务中的信念与行动
开发工具学术论文

Agent-BRACE提出一种面向大语言模型智能体的新框架,将智能体解耦为信念状态模型和策略模型,通过强化学习联合优化。信念状态模型以结构化的自然语言陈述和口语化确定性标签来近似环境信念分布,策略模型基于该紧凑表示而非完整历史进行动作选择。在部分可观测的实体语言长周期任务中,该方法使Qwen2.5-3B-Instruct和Qwen3-4B-Instruct分别获得14.5%和5.3%的绝对性能提升,同时实现上下文窗口大小独立于任务长度的效果。

arXiv:2605.11436v1 Announce Type: cross Abstract: Large language models (LLMs) are increasingly deployed on long-horizon tasks in partially observable environments, where they must act while inferring and tracking a complex environment state over many steps. This leads to two challenges: partial observability requires maintaining uncertainty over unobserved world attributes, and long interaction history causes context to grow without bound, diluting task-relevant information. A principled solution to both challenges is a belief state: a posterior distribution over environment states given past observations and actions, which compactly encodes history for decision making regardless of episode length. In LLM agents, however, the open-ended nature of text makes it unclear how to represent such a distribution. Therefore, we introduce Agent-BRACE: Agent Belief state Representation via Abstraction and Confidence Estimation, a method that decouples an LLM agent into a belief state model and a policy model, jointly optimized via reinforcement learning. The belief state model produces a structured approximation of the belief distribution: a set of atomic natural language claims about the environment, each annotated with an ordinal verbalized certainty label ranging from certain to unknown. The policy model conditions on this compact, structured approximate belief rather than the full history, learning to select actions under explicit uncertainty. Across long-horizon, partially observable embodied language environments, Agent-BRACE achieves an average absolute improvement of +14.5% (Qwen2.5-3B-Instruct) and +5.3% (Qwen3-4B-Instruct), outperforming strong RL baselines while maintaining a near-constant context window independent of episode length. Further analysis shows that the learned belief becomes increasingly calibrated over the course of an episode as evidence accumulates.
arXiv arXiv cs.AI · 6 小时前 · 相关度 78% 热度★★☆☆☆
124
Beyond Similarity Search: Tenure and the Case for Structured Belief State in LLM Memory
超越相似性搜索:Tenure 与 LLM 记忆中有结构信念状态的案例分析
开发工具学术论文

本文针对大语言模型跨会话记忆问题,指出无状态会话导致迭代式工作流的重定向开销,并认为相似性搜索不适用于有界词汇下的命名实体消解。提出 Tenure 系统,一种本地优先代理,维护带认识状态、版本化替换与作用域隔离的类型化信念存储,通过精确优先检索向每个会话注入结构化上下文,将抽取的事实转化为可操作指令。实验显示,余弦相似度在72个检索案例上平均精确率仅0.12,而别名加权的BM25达到1.0,表明结构化方法完全消除了词汇不匹配问题,并在多轮话题漂移中保持稳定。

arXiv:2605.11325v1 Announce Type: cross Abstract: Why do we need another AI to help the AI? We argue you don&#39;t. Stateless LLM sessions impose re-orientation costs on iterative, session-heavy workflows. Prior work addresses cross-session memory through retrieval-augmented approaches: store history, embed it, retrieve by semantic similarity. Cross-session memory is a state management problem, not a search problem. Similarity search fails for named entity resolution within bounded vocabulary contexts because beliefs about a shared technical domain are semantically proximate by construction. A single user is the simplest bounded vocabulary context; engineering teams converge on the same property through shared codebases and terminology. We present Tenure, a local-first proxy that maintains a typed belief store with epistemic status, versioned supersession, and scope isolation, injecting curated context into every LLM session through precision-first retrieval. Hard scope isolation provides a structural guarantee: the right beliefs surface, and only within the boundaries the user has authorized. Tenure&#39;s typed schema converts extracted facts into imperative instructions via a why it matters field, making injected beliefs directly actionable rather than raw material for the model to re-derive. A controlled evaluation on 72 retrieval cases demonstrates the gap. Cosine similarity over dense embeddings achieves mean precision of 0.12. Alias-weighted BM25 maintains mean precision of 1.0, passing 72/72 cases versus 8/72 for cosine similarity on the same corpus. Hybrid retrieval typically solves vocabulary mismatch between disparate authors; Tenure eliminates this structurally: query and belief authors are the same person, and an alias enrichment flywheel continuously indexes their specific vocabulary. Under multi-turn topic drift this worsens: the vector backend produces drift scores of 0.43--0.50 on noise-critical turns where BM25 maintains 0.
arXiv arXiv cs.AI · 6 小时前 · 相关度 78% 热度★★☆☆☆
125
Not How Many, But Which: Parameter Placement in Low-Rank Adaptation
不是多少,而是哪些:低秩适配中的参数放置
训练微调学术论文

本文研究低秩适配(LoRA)中的参数放置问题:给定固定可训练参数量,选择B矩阵中的哪些元素至关重要。在监督微调(SFT)下,随机子集与基于梯度信息的子集性能相近;但在GRPO微调中,随机放置无法提升基模型,而基于梯度的放置能恢复标准LoRA的精度。这种差异源于梯度结构:SFT梯度低秩且方向稳定,GRPO梯度高秩且跨步近乎正交。作者提出了一种快速评分方法,可在不到10秒、成本低于完整训练0.5%的条件下识别关键参数,这些参数主要集中在残差流写入投影(V、O、Down),并在1.5B至8B不同模型家族和规模上保持稳定。

arXiv:2605.12207v1 Announce Type: new Abstract: We study the \textit{parameter placement problem}: given a fixed budget of $k$ trainable entries within the B matrix of a LoRA adapter (A frozen), does the choice of which $k$ matter? Under supervised fine-tuning, random and informed subsets achieve comparable performance. Under GRPO on base models, random placement fails to improve over the base model, while gradient-informed placement recovers standard LoRA accuracy. This regime dependence traces to gradient structure: SFT gradients are low-rank and directionally stable, so any subset accumulates coherent updates; GRPO gradients are high-rank and near-orthogonal across steps, so only elements with consistently signed gradients retain the learning signal. Our scoring procedure identifies these critical parameters in under 10 seconds at less than 0.5% of training cost. Selected parameters concentrate on residual-stream-writing projections (V, O, Down), stable across model families and scales (1.5B - 8B).
arXiv arXiv cs.LG · 6 小时前 · 相关度 78% 热度★★☆☆☆
126
GRAFT: Graph-Tokenized LLMs for Tool Planning
GRAFT:用于工具规划的图标记化大语言模型
基础大模型学术论文

论文提出GRAFT框架,通过为工具图中的每个节点分配专门的特殊token,将工具依赖关系内化到LLM的表示空间中,以解决外部图匹配方式导致的约束违反和错误累积问题。GRAFT还引入在线策略的工具上下文蒸馏,在模型自身采样的轨迹上进行训练,逐步提炼规划信号。实验表明,该方法在精确序列匹配和依赖合法性上均达到当前最优,提升了复杂工作流中LLM工具规划的可靠性。

arXiv:2605.11706v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used to complete complex tasks by selecting and coordinating external tools across multiple steps. This requires aligning tool choices with subtask intent while satisfying directional execution dependencies among tools. To do this, existing methods model these dependencies as tool graphs and incorporate the graphs with LLMs through retrieval, serialization, or prompt-level injection. However, these external graph-use strategies all follow a matching paradigm, which often fails to align tool choices with the underlying subtask structure, producing semantically plausible plans that violate graph constraints. This issue is further exacerbated by error accumulation, where an early incorrect tool selection shifts the plan into an invalid graph state and causes subsequent predictions to drift away from the valid execution path. To address these challenges, we propose GRAFT, a graph-tokenized language model framework for dependency-aware tool planning. GRAFT internalizes the tool graph by mapping each tool node to a dedicated special token and learning directed tool dependencies within the representation space. It further introduces on-policy tool context distillation, training the model on its own sampled trajectories while distilling stepwise planning signals. Experiments show that GRAFT achieves state-of-the-art performance in exact sequence matching and dependency legality, supporting more reliable LLM tool planning in complex workflows.
arXiv arXiv cs.LG · 6 小时前 · 相关度 78% 热度★★☆☆☆
127
学术论文开发工具

该论文提出一种基于后继表示的结构诊断方法,用于在多智能体LLM系统中预先评估通信拓扑(如链式、星型、网格)的鲁棒性和漂移风险。作者将行随机通信算子的后继表示矩阵 M 的谱半径、谱间隙和条件数分别与累积误差、共识动态和扰动稳健性三种失效模式关联,并在 Qwen2.5-7B-Instruct 的状态追踪任务上进行了100次独立试验验证。实验显示条件数与经验扰动稳健性达到完美秩相关(r_s=1.0),谱间隙部分预测共识动态,谱半径与累积误差呈完全负相关(r_s=-1.0),并提出了仿射噪声扩展以修正线性谱对非收缩偏置漂移的盲区。该工作为多智能体LLM系统提供了一种推理前的结构诊断手段,属于调试与系统分析工具的创新。

arXiv:2605.11453v1 Announce Type: cross Abstract: Practitioners deploying multi-agent large language model (LLM) systems must currently choose between communication topologies such as chain, star, mesh, and richer variants without any pre-inference diagnostic for which topology will amplify drift, converge to consensus, or remain robust under perturbation. Existing evaluation answers these questions only post hoc and only for the task measured. We introduce a structural diagnostic for multi-agent LLM communication graphs based on the successor representation $M = (I - \gamma P)^{-1}$ of the row-stochastic communication operator, and we connect three of its spectral quantities, the spectral radius $\rho(M)$, the spectral gap $\Delta(M)$, and the condition number $\kappa(M)$, to three distinct failure modes. We derive closed-form spectra for the chain, star, and mesh under row-stochastic normalization, and validate the predictions on a 12-step structured state-tracking task with Qwen2.5-7B-Instruct over 100 independent trials. The condition number is a perfect rank-order predictor of empirical perturbation robustness ($r_s = 1.0$); the spectral gap partially predicts consensus dynamics ($r_s = 0.5$); and the spectral radius is perfectly \emph{inverted} with respect to cumulative error ($r_s = -1.0$). We trace this inversion to a regime in which linear spectra are blind to non-contracting bias drift, and we propose an affine-noise extension of the predictive map that recovers the empirical ordering. We read this as a first step toward representational, drift-aware structural diagnostics for multi-agent LLM systems, sitting alongside classical spectral and consensus theory.
arXiv arXiv cs.AI · 6 小时前 · 相关度 77% 热度★★☆☆☆
128
Language Modeling with Hyperspherical Flows
基于超球面流的语言建模
基础大模型学术论文

本文提出 S-FLM,一种在超球面潜空间中使用连续流的语言模型。该方法通过旋转向量替代 one-hot 向量,避免了扩展词汇表维度带来的训练开销。S-FLM 在大词汇量推理任务上显著改进连续流语言模型,在标准温度采样下缩小了与掩码扩散模型的差距。

arXiv:2605.11125v1 Announce Type: new Abstract: Discrete Diffusion Language Models progressed rapidly as an alternative to autoregressive (AR) models, motivated by their parallel generation abilities. However, for tractability, discrete diffusion models sample from a factorized distribution, which is less expressive than AR. Recent Flow Language Models (FLMs) apply continuous flows to language, transporting noise to data with a deterministic ODE that avoids factorized sampling. FLMs operate on one-hot vectors whose dimension scales with the vocabulary size, making FLMs costly to train. Moreover, since all distinct one-hot embeddings are equidistant in $\ell_2$, adding Gaussian noise does not have a clear semantic interpretation (unlike images, where Gaussian noise progressively degrades structure). We introduce $\mathbb{S}$-FLM, a latent FLM in the hypersphere. $\mathbb{S}$-FLM generates sequences by rotating vectors in $\mathbb{S}^{d-1}$ along a velocity field learned with cross-entropy, avoiding the overhead of materializing one-hot vectors. Previous FLMs match AR in Generative Perplexity (Gen.\ PPL), but samples with high likelihood are not necessarily correct in verifiable domains such as math and code. $\mathbb{S}$-FLM substantially improves continuous flow language models on large-vocabulary reasoning and closes the gap to masked diffusion under standard-temperature sampling ($T=1$), while a gap remains under optimized low-temperature ($T=0.1$) decoding.
arXiv arXiv cs.LG · 6 小时前 · 相关度 76% 热度★★☆☆☆
129
Constraint-Data-Value-Maximization: Utilizing Data Attribution for Effective Data Pruning in Low-Data Environments
约束数据价值最大化:利用数据归因在低数据环境中进行有效数据修剪
训练微调学术论文

本文针对数据归因中常用的数据移除基准,指出基于Shapley值的方法在低数据场景下修剪低价值数据并非最优。为此,作者提出了约束数据价值最大化(CDVM)方法,将数据修剪建模为约束优化问题,在最大化总体影响力的同时惩罚每个测试样本的过度贡献。在OpenDataVal基准上,CDVM在仅保留少量数据时表现出稳健的性能和具有竞争力的运行速度。

arXiv:2605.11312v1 Announce Type: new Abstract: Attributing model behavior to training data is an evolving research field. A common benchmark is data removal, which involves eliminating data instances with either low or high values, then assessing a model&#39;s performance trained on the modified dataset. Many existing studies leverage Shapley-based data values for this task. In this paper, we demonstrate that these data values are not optimally suited for pruning low-value data when only a limited amount of data remains. To address this limitation, we introduce the Constraint-Data-Value-Maximization (CDVM) approach, which effectively utilizes data attributions for pruning in low-data scenarios. By casting pruning as a constrained optimization that both maximizes total influence and penalizes excessive per-test contributions, CDVM delivers robust performance when only a small fraction of the data is retained. On the OpenDataVal benchmark, CDVM shows strong performance and competitive runtime.
arXiv arXiv cs.AI · 6 小时前 · 相关度 75% 热度★★☆☆☆
130
MT-JailBench: A Modular Benchmark for Understanding Multi-Turn Jailbreak Attacks
MT-JailBench:一个理解多轮越狱攻击的模块化基准
学术论文基础大模型

该论文提出了MT-JailBench,一个用于公平比较和分析多轮越狱攻击的模块化评估框架。它将攻击分解为五个相互作用模块(评估函数、攻击策略、提示生成、提示优化和流程控制),从而在固定条件下对不同攻击方法进行基准测试。研究发现,资源预算和评估函数是主要的混杂因素,会显著改变攻击排名;在组件层面,提示生成对性能变化影响最大,而优化和流程控制贡献适中。通过重新组合最佳组件,可获得一种表现优于源攻击并能泛化到多种目标LLM的强攻击配置,为红队评估提供了有效指导。

arXiv:2605.11002v1 Announce Type: cross Abstract: Multi-turn jailbreaks exploit the ability of large language models to accumulate and act on conversational context. Instead of stating a harmful request directly, an attacker can gradually steer the conversation toward an unsafe answer. Recent methods demonstrate this risk, but they are usually evaluated as black-box pipelines with different budgets, judges, retry rules, and strategy generation procedures. As a result, it is often unclear whether reported gains reflect stronger attack mechanisms or different experimental conditions. We introduce MT-JailBench, a modular evaluation framework for benchmarking multi-turn jailbreaks under fixed conditions. MT-JailBench implements each attack as five interacting modules: evaluation function, attack strategy, prompt generation, prompt refinement, and flow control. This design enables fair comparison across attack methods and component-wise analysis of what drives attack success. Using MT-JailBench, we find that resource budgets and evaluation functions are major confounders: controlling turns, retries, interactions, sampled strategies, and judges substantially change the ranking of attacks. At the component level, prompt generation accounts for most performance variation, while refinement and flow control provide moderate gains. We also find that explicit dynamic strategy generation is not always necessary; stochastic sampling from a fixed strategy can rival more elaborate diversification mechanisms. Finally, recomposing the best components yields a strong attack configuration that outperforms its source attacks and generalizes across diverse target LLMs. MT-JailBench therefore provides a modular framework for comparing multi-turn jailbreaks, understanding the impact of components, and guiding stronger red-teaming evaluations.
arXiv arXiv cs.AI · 6 小时前 · 相关度 75% 热度★★☆☆☆
131
Skill Drift Is Contract Violation: Proactive Maintenance for LLM Agent Skill Libraries
技能漂移即契约违反:面向LLM Agent技能库的主动维护
开发工具学术论文

本文针对LLM Agent技能库因外部API、包、配置等演化而无声失效的问题,提出了基于契约的主动维护方法SkillGuard。它将技能文档中的环境假设抽取为可执行契约,仅对承担关键角色的依赖项进行验证,从而将监控噪声转化为精确的维护信号。在599个无漂移和硬负样本上,SkillGuard零误报,而传统无契约CI探针的误报率高达40%;在已知漂移验证中精度达到100%,召回率76%;违规契约还能指导修复,将一次修复成功率从10%提升至78%。论文还发布了包含880对样本的技能退化基准数据集。

arXiv:2605.10990v1 Announce Type: cross Abstract: LLM agents increasingly rely on reusable skill libraries, but these skills silently decay as the external services, packages, APIs, and configurations they reference evolve. Existing monitors detect such changes at the wrong granularity: they observe values, not the role those values play in a skill. A version string in a comment is noise; the same string in a pinned dependency is an operational obligation. We formulate skill drift as contract violation and introduce \sgname{}, which extracts executable environment contracts from skill documents and validates only those role-bearing assumptions against known or live conditions. This distinction turns noisy monitoring into a precision-first maintenance signal. Contract-free CI probes produce 40\% false positives, while \sgname{} raises zero false alarms over 599 no-drift and hard-negative cases (Wilson 95\% CI $[0,0.6]\%$). In known-drift verification, \sgname{} achieves 100\% precision and 76\% recall with the strongest backbone; in a pre-registered study over 49 real skills, it discovers live drift with 86\% conservative precision. Violated contracts also make repair actionable, improving one-round success from 10\% without localization to 78\%. We release \dbname{}, an 880-pair benchmark for skill degradation.
arXiv arXiv cs.AI · 6 小时前 · 相关度 75% 热度★★☆☆☆
132
Sharpen Your Flow: Sharpness-Aware Sampling for Flow Matching
锐化你的流:流匹配的锐度感知采样
推理部署学术论文

论文提出一种名为SharpEuler的无训练采样器,用于加速流匹配模型的生成过程。该方法通过离线标定轨迹上速度场变化最剧烈的区域,构建锐度曲线,并利用分位数变换生成任意推理预算下的非均匀时间步网格,代替均匀 Euler 步长。在固定模型评估次数下,SharpEuler 能够提高生成样本质量,减轻模式间泄漏并提升模式覆盖,其原理结合了数值误差分析、变分推导与统计稳定性保证。

arXiv:2605.11547v1 Announce Type: new Abstract: Flow matching models generate samples by numerically integrating a learned velocity field, with each integration step requiring a neural network evaluation. Fast generation therefore requires using a small fixed evaluation budget effectively: the key question is not only how to integrate the flow, but where the sampler should spend its steps. We propose SharpEuler, a training-free sampler that profiles a pretrained model offline by estimating where the learned velocity field changes most rapidly along calibration trajectories. This finite-difference estimate defines a solver-aware sharpness profile, which is smoothed and converted by a quantile transform into a timestep grid for any desired inference budget. At test time, sampling remains ordinary Euler integration with the same number of model evaluations as a uniform schedule. We justify SharpEuler using three principles: a numerical principle identifying trajectory acceleration as the leading source of Euler discretization error, a variational principle deriving sharpness-based power-law timestep densities, and a statistical guarantee showing that the finite-sample calibrated sampler is stable at the terminal distribution level. Our experiments show that SharpEuler improves sample quality at fixed budgets, reducing inter-mode leakage and increasing mode coverage.
arXiv arXiv cs.LG · 6 小时前 · 相关度 75% 热度★★☆☆☆
133
$\xi$-DPO: Direct Preference Optimization via Ratio Reward Margin
ξ-DPO:基于比率奖励边距的直接偏好优化
训练微调学术论文

本文针对无参考偏好优化方法SimPO中超参数β和γ联合调优困难的问题,通过分析发现β隐含控制样本过滤,γ的效果依赖于数据集的奖励差距结构。提出ξ-DPO,将优化目标等价变换为最小化奖励差距与最优边距的距离,并采用选择与拒绝奖励的比值形式定义有界且可解释的边距ξ。ξ可以从初始奖励差距分布直接确定,避免了反复试错调参,为偏好优化提供了一种更简洁高效的对齐训练方式。

arXiv:2605.10981v1 Announce Type: cross Abstract: Reference-free preference optimization has emerged as an efficient alternative to reinforcement learning from human feedback, with Simple Preference Optimization(SimPO) demonstrating strong performance by eliminating the explicit reference model through a simple objective. However, the joint tuning of the hyperparameters $\beta$ and $\gamma$ in SimPO remains a central challenge. We argue that this difficulty arises because the margin formulation in SimPO is not easily interpretable across datasets with different reward gap structures. To better understand this issue, we conduct a comprehensive analysis of SimPO and find that $\beta$ implicitly controls sample filtering, while the effect of $\gamma$ depends on the reward gap structure of the dataset. Motivated by these observations, we propose $\xi$-DPO: Direct preference optimization via ratio reward margin. We first reformulate the preference objective through an equivalent transformation, changing the optimization target from maximizing the likelihood of reward gaps to minimizing the distance between reward gaps and optimal margins. Then, we redefine the reward in a ratio form between the chosen and rejected, which effectively cancels the effect of $\beta$ and yields a bounded and interpretable margin. This margin is called the ratio reward margin and is denoted by $\xi$. Unlike the margin $\gamma$ in SimPO, $\xi$ explicitly represents the desired relative separation between chosen and rejected responses and can be determined from the initial reward gap distribution, avoiding repeated trial-and-error tuning. ....
arXiv arXiv cs.AI · 6 小时前 · 相关度 75% 热度★★☆☆☆
134
Rotation-Preserving Supervised Fine-Tuning
旋转保持的监督微调
训练微调学术论文

针对监督微调(SFT)导致分布外泛化下降的问题,本文提出旋转保持监督微调(RPSFT)方法。RPSFT 通过惩罚预训练权重矩阵 top-k 奇异向量块的投影旋转,限制不必要的旋转而保留任务适应能力。在数学推理数据上的实验表明,RPSFT 在域内和域外性能权衡上优于标准 SFT 和强基线,能更好地保持预训练表征,并为下游强化学习微调提供更强的初始化。代码已开源。

arXiv:2605.10973v1 Announce Type: cross Abstract: Supervised fine-tuning (SFT) improves in-domain performance but can degrade out-of-domain (OOD) generalization. Prior work suggests that this degradation is related to changes in dominant singular subspaces of pretrained weight matrices. However, directly identifying loss-sensitive directions with Hessian or Fisher information is computationally expensive at LLM scale. In this work, we propose preserving projected rotations in pretrained singular subspaces as an efficient proxy for Fisher-sensitive directions, which we call Rotation-Preserving Supervised Fine-Tuning (RPSFT). RPSFT penalizes changes in the projected top-$k$ singular-vector block of each pretrained weight matrix, limiting unnecessary rotation while preserving task adaptation. Across model families and sizes trained on math reasoning data, RPSFT improves the in-domain/OOD trade-off over standard SFT and strong SFT baselines, better preserves pretrained representations, and provides stronger initializations for downstream RL fine-tuning. Code is available at \href{https://github.com/jinhangzhan/RPSFT.git}{https://github.com/jinhangzhan/RPSFT}.
arXiv arXiv cs.AI · 6 小时前 · 相关度 75% 热度★★☆☆☆
135
Formalize, Don't Optimize: The Heuristic Trap in LLM-Generated Combinatorial Solvers
形式化,而非优化:LLM生成组合求解器中的启发式陷阱
学术论文基础大模型

本文针对大语言模型在生成组合问题求解器时的设计策略,提出了一个包含100个组合问题、4577个实例的基准CP-SynC-XL。实验对比了原生算法搜索、Python+OR-Tools集成约束建模以及声明式MiniZinc建模三种范式,发现Python+OR-Tools在正确率上表现最佳,而声明式建模的绝对覆盖率更低。研究揭示了一个“启发式陷阱”:在效率导向的提示下,LLM倾向于用局部近似替代完备搜索或加入未验证的边界,导致部分实例速度变慢且正确率显著下降。作者据此建议LLM应专注于将问题形式化为约束模型,并将优化搜索交给验证后的求解器。

arXiv:2605.12421v1 Announce Type: new Abstract: Large Language Models (LLMs) struggle to solve complex combinatorial problems through direct reasoning, so recent neuro-symbolic systems increasingly use them to synthesize executable solvers. A central design question is how the LLM should represent the solver, and whether it should also attempt to optimize search. We introduce CP-SynC-XL, a benchmark of 100 combinatorial problems (4,577 instances), and evaluate three solver-construction paradigms: native algorithmic search (Python), constraint modeling through a Python solver API (Python + OR-Tools), and declarative constraint modeling (MiniZinc + OR-Tools). We find a consistent representational divergence: Python + OR-Tools attains the highest correctness across LLMs, while MiniZinc + OR-Tools has lower absolute coverage despite using the same OR-Tools back-end. Native Python is the most likely to return a schema-valid solution that fails verification, whereas solver-backed paths preserve higher conditional fidelity. On the heuristic axis, prompting for search optimization yields only small median speed-ups (1.03-1.12x) and a strongly bimodal effect: many instances slow down, and correctness drops sharply on a long tail of problems. A paired code-level audit traces these regressions to a recurring heuristic trap. Under an efficiency-oriented prompt, the LLM may replace complete search with local approximations (Python), inject unverified bounds (Python + OR-Tools), or add redundant declarative machinery that overwhelms or over-constrains the model (MiniZinc + OR-Tools). These findings support a conservative design principle for LLM-generated combinatorial solvers: use the LLM primarily to formalize variables, constraints, and objectives for verified solvers, and separately check any LLM-authored search optimization before use.
arXiv arXiv cs.AI · 6 小时前 · 相关度 75% 热度★★☆☆☆
136
ProfiliTable: Profiling-Driven Tabular Data Processing via Agentic Workflows
ProfiliTable:基于Profiling驱动的多智能体工作流表格数据处理
开发工具学术论文

本文提出ProfiliTable,一个自主多智能体框架,用于自动化表格数据的清洗、转换、增强和匹配任务。框架包含Profiler(ReAct式数据探索)、Generator(检索算子生成代码)和Evaluator-Summarizer(执行反馈闭环优化),通过动态profiling迭代理解模糊意图并生成健壮的转换代码。在覆盖18种表格任务类型的基准测试中,ProfiliTable一致优于基线,尤其在复杂多步场景中表现突出,展示了动态分析在将模糊意图可靠转化为合规表格转换中的关键作用。

arXiv:2605.12376v1 Announce Type: new Abstract: Table processing-including cleaning, transformation, augmentation, and matching-is a foundational yet error-prone stage in real-world data pipelines. While recent LLM-based approaches show promise for automating such tasks, they often struggle in practice due to ambiguous instructions, complex task structures, and the lack of structured feedback, resulting in syntactically correct but semantically flawed code. To address these challenges, we propose ProfiliTable, an autonomous multi-agent framework centered on dynamic profiling, which constructs and iteratively refines a unified execution context through interactive exploration, knowledge-augmented synthesis, and feedback-driven refinement. ProfiliTable integrates (i) a Profiler that performs ReAct-style data exploration to build semantic understanding, (ii) a Generator that retrieves curated operators to synthesize task-aware code, and (iii) an Evaluator-Summarizer loop that injects execution scores and diagnostic insights to enable closed-loop refinement. Extensive experiments on a diverse benchmark covering 18 tabular task types demonstrate that ProfiliTable consistently outperforms strong baselines, particularly in complex multi-step scenarios. These results highlight the critical role of dynamic profiling in reliably translating ambiguous user intents into robust and governance-compliant table transformations.
arXiv arXiv cs.AI · 6 小时前 · 相关度 75% 热度★★☆☆☆
137
Classifier Context Rot: Monitor Performance Degrades with Context Length
分类器上下文退化:监视器性能随上下文长度下降
基础大模型推理部署

该论文指出,当前前沿语言模型(Opus 4.6、GPT 5.4、Gemini 3.1)作为分类器监测编码代理的危险行为时,在长上下文(超过800K tokens)中会严重退化,遗漏危险动作的概率是短上下文时的2到30倍。通过定期提醒等提示策略可以部分缓解此问题,但更佳的后训练可能带来进一步改进。这些发现表明,传统不纳入长上下文退化的监测评估很可能高估了监视器的实际性能。

arXiv:2605.12366v1 Announce Type: new Abstract: Monitoring coding agents for dangerous behavior using language models requires classifying transcripts that often exceed 500K tokens, but prior agent monitoring benchmarks rarely contain transcripts longer than 100K tokens. We show that when used as classifiers, current frontier models fail to notice dangerous actions more often in longer transcripts. In particular, on a dataset that requires identifying when a coding agent takes a subtly dangerous action, Opus 4.6, GPT 5.4, and Gemini 3.1 miss these actions $2\times$ to $30\times$ more often when they occur after 800K tokens of benign activity than when they occur on their own. We also show that these weaknesses can be partially mitigated with prompting techniques such as periodic reminders throughout the transcript and may be mitigated further with better post-training. Monitor evaluations that do not consider long-context degradation are likely overestimating monitor performance.
arXiv arXiv cs.AI · 6 小时前 · 相关度 75% 热度★★☆☆☆
138
$\delta$-mem: Efficient Online Memory for Large Language Models
δ-mem:大语言模型的高效在线记忆机制
基础大模型学术论文

本文提出δ-mem,一种轻量级在线记忆机制,为冻结的全注意力骨干模型增加紧凑的联想记忆状态。δ-mem通过增量规则学习将历史信息压缩为固定大小的状态矩阵,在生成时利用其读出产生低秩校正,调整骨干注意力的计算。仅使用8×8的在线记忆状态,δ-mem即可将平均分数提升至冻结骨干的1.10倍,并超过最强非δ-mem记忆基线15%。在记忆密集型基准MemoryAgentBench和LoCoMo上,分别达到1.31倍和1.20倍的增益,同时基本保持通用能力。该方法无需全面微调、替换骨干或显式扩展上下文窗口,展现了一种直接耦合注意力计算的紧凑在线记忆路径。

arXiv:2605.12357v1 Announce Type: new Abstract: Large language models increasingly need to accumulate and reuse historical information in long-term assistants and agent systems. Simply expanding the context window is costly and often fails to ensure effective context utilization. We propose $\delta$-mem, a lightweight memory mechanism that augments a frozen full-attention backbone with a compact online state of associative memory. $\delta$-mem compresses past information into a fixed-size state matrix updated by delta-rule learning, and uses its readout to generate low-rank corrections to the backbone&#39;s attention computation during generation. With only an $8\times8$ online memory state, $\delta$-mem improves the average score to $1.10\times$ that of the frozen backbone and $1.15\times$ that of the strongest non-$\delta$-mem memory baseline. It achieves larger gains on memory-heavy benchmarks, reaching $1.31\times$ on MemoryAgentBench and $1.20\times$ on LoCoMo, while largely preserving general capabilities. These results show that effective memory can be realized through a compact online state directly coupled with attention computation, without full fine-tuning, backbone replacement, or explicit context extension.
arXiv arXiv cs.AI · 6 小时前 · 相关度 75% 热度★★☆☆☆
139
Epistemic Uncertainty for Test-Time Discovery
测试时发现的认知不确定性
训练微调学术论文

本文提出 UG-TTT 方法,通过在冻结基模型上维护低秩适配器(LoRA)小型集成,利用集成预测与权重假设间的互信息量化每 token 的认知不确定性,并将其作为强化学习的探索奖励,引导策略探索训练覆盖不足的区域。引入核范数正则化保持各适配器的差异性,防止探索信号退化。在四个科学发现基准上,UG-TTT 提升了三个任务的最大奖励并显著维持解多样性,消融实验验证了正则化关键作用。

arXiv:2605.11328v1 Announce Type: new Abstract: Automated scientific discovery using large language models relies on identifying genuinely novel solutions. Standard reinforcement learning penalizes high-variance mutations, which leads the policy to prioritize familiar patterns. As a result, the maximum reward plateaus even as the average reward increases. Overcoming this limitation requires a signal that distinguishes unexplored regions from intrinsically difficult problems. This necessitates measuring disagreement across independently adapted weight hypotheses rather than relying on a single network&#39;s confidence. UG-TTT addresses this challenge by maintaining a small ensemble of low-rank adapters over a frozen base model. The per-token disagreement, quantified as the mutual information between ensemble predictions and weight hypotheses, isolates epistemic uncertainty and identifies positions where insufficient coverage leads to adapter divergence rather than intrinsic problem difficulty. This measure is incorporated as an exploration bonus into the policy gradient, directing the policy toward positions where persistent adapter disagreement signals low training coverage, the same frontier where genuine discovery is possible. A nuclear norm regularizer ensures the adapters remain distinct from one another, thereby preserving the exploration signal throughout training. Across four scientific discovery benchmarks, UG-TTT increases the maximum reward on three tasks, maintains substantially higher solution diversity, and an ablation study confirms that the regularizer is essential for sustaining this behavior.
arXiv arXiv cs.LG · 6 小时前 · 相关度 75% 热度★★☆☆☆
140
No Action Without a NOD: A Heterogeneous Multi-Agent Architecture for Reliable Service Agents
没有NOD就没有行动:用于可靠服务代理的异构多智能体架构
开发工具学术论文

本文针对LLM代理在长程服务任务中策略违规、工具幻觉和意图错位等不可靠问题,提出NOD(Navigator-Operator-Director)异构多智能体架构。该架构将任务状态显式化为结构化的全局状态,由Navigator进行一致决策,并引入独立的Director代理在关键动作前进行选择性外部监督与干预。在τ²-Bench基准上,NOD相比基线取得了更高的任务成功率和关键动作精度,显著减少了政策违规和幻觉,提升了服务代理的可靠性。

arXiv:2605.12240v1 Announce Type: new Abstract: Large language model (LLM) agents have increasingly advanced service applications, such as booking flight tickets. However, these service agents suffer from unreliability in long-horizon tasks, as they often produce policy violations, tool hallucinations, and misaligned actions, which greatly impedes their real-world deployment. To address these challenges, we propose NOD (Navigator-Operator-Director), a heterogeneous multi-agent architecture for service agents. Instead of maintaining task state implicitly in dialogue context as in prior work, we externalize a structured Global State to enable explicit task state tracking and consistent decision-making by the Navigator. Besides, we introduce selective external oversight before critical actions, allowing an independent Director agent to verify execution and intervene when necessary. As such, NOD effectively mitigates error propagation and unsafe behavior in long-horizon tasks. Experiments on $\tau^2$-Bench demonstrate that NOD achieves higher task success rates and critical action precision over baselines. More importantly, NOD improves the reliability of service agents by reducing policy violations, tool hallucinations, and user-intent misalignment.
arXiv arXiv cs.AI · 6 小时前 · 相关度 75% 热度★★☆☆☆
141
MM-OptBench: A Solver-Grounded Benchmark for Multimodal Optimization Modeling
MM-OptBench:基于求解器的多模态优化建模基准
基础大模型学术论文

本文提出多模态优化建模基准MM-OptBench,要求多模态大语言模型(MLLM)从文本和视觉问题描述中生成数学优化模型及可执行求解器代码。基准包含780个经求解器验证的实例,覆盖6类优化问题、26个子类别和3个难度级别。对9款MLLM(6款通用、3款数学专用)的评测显示,最佳模型的pass@1仅达52.1%,数学专用模型未能解决任何实例,揭示了从图文提取实例数据和生成正确优化代码的难点,为决策导向的多模态智能提供了测试平台。

arXiv:2605.12154v1 Announce Type: new Abstract: Optimization modeling translates real decision-making problems into mathematical optimization models and solver-executable implementations. Although language models are increasingly used to generate optimization formulations and solver code, existing benchmarks are almost entirely text-only. This omits many optimization-modeling tasks that arise in operational practice, where requirements are described in text but instance information is conveyed through visual artifacts such as tables, graphs, maps, schedules, and dashboards. We introduce multimodal optimization modeling, a benchmark setting in which models must construct both a mathematical formulation and executable solver code from a text-and-visual problem specification. To evaluate this setting, we develop a solver-grounded framework that generates structured optimization instances, verifies each with an exact solver, and builds both the model-facing inputs and hidden reference files from the same verified source. We instantiate the framework as MM-OptBench, a benchmark of 780 solver-verified instances spanning 6 optimization families, 26 subcategories, and 3 structural difficulty levels. We evaluate 9 multimodal large language models (MLLMs), including 6 frontier general-purpose models and 3 math-specialized models, with aggregate, family-level, difficulty-level, and failure-mode analyses. The results show that the task remains far from solved: the best two models reach 52.1% and 51.3% pass@1, while on average across the six general-purpose MLLMs, pass@1 is 43.4% on easy instances and 15.9% on hard instances. All three math-specialized MLLMs solve 0/780 instances. Failure attribution shows that errors arise both when extracting instance data from text and visuals and when turning extracted data into solver-correct formulations and code. MM-OptBench provides a testbed for solver-grounded, decision-oriented multimodal intelligence.
arXiv arXiv cs.AI · 6 小时前 · 相关度 75% 热度★★☆☆☆
142
Rollout Cards: A Reproducibility Standard for Agent Research
推出卡片:智能体研究的可复现性标准
开发工具学术论文

本文聚焦于AI智能体评估中严重的可复现性问题,通过对50个流行训练和评估库的结构化审计,发现均未报告运行失败或跳过等记录,并记录到37例仅改变报告规则即可显著改变任务成功率和排名的情形。作者提出“rollout cards”作为一种复现性单元,要求公开发布评估的原始轨迹、视图和报告规则,从而保障评估结果的透明性与可比性。实验表明,按不同报告规则对同一基准输出重新评分,前沿模型得分最多可变化20.9个百分点,甚至反转模型排名。该标准已实现参考工具并集成到开源强化学习框架Ergon中,覆盖工具使用、软件工程、多智能体协调等基准。

arXiv:2605.12131v1 Announce Type: new Abstract: Reproducibility problems that have long affected machine learning and reinforcement learning are now surfacing in agent research: papers compare systems by reported scores while leaving the rollout records behind those scores difficult to inspect. For agentic tasks, this matters because the same behaviour can receive different reported scores when evaluations select different parts of a rollout or apply different reporting rules. In a structured audit of 50 popular training and evaluation repositories, we find that none report how many runs failed, errored, or were skipped alongside headline scores. We also document 37 cases where reporting rules can change task-success rates, cost/token accounting, or timing measurements for fixed evidence, sometimes dramatically. We treat rollout records, not reported scores, as the unit of reproducibility for agent research. We introduce rollout cards: publication bundles that preserve the rollout record and declare the views, reporting rules, and drops manifests behind reported scores. We validate rollout cards in two settings. First, four partial public releases in tool safety, multi-agent systems, theorem proving, and search let us compute analyses their original reports did not include. Second, re-grading preserved benchmark outputs across short-answer, code-generation, and tool-use tasks shows that changing only the reporting rule can change reported scores by 20.9 absolute percentage points and, in some cases, invert rankings of frontier models. We release a reference implementation integrated into Ergon, an open-source reinforcement learning gym, and publicly publish Ergon-produced rollout-card exports for benchmarks spanning tool use, software engineering, web interaction, multi-agent coordination, safety, and search to support future research.
arXiv arXiv cs.AI · 6 小时前 · 相关度 75% 热度★★☆☆☆
143
Support-Proximity Augmented Diffusion Estimation for Offline Black-Box Optimization
支持-邻近增强扩散估计用于离线黑盒优化
训练微调学术论文

本文提出SPADE框架,将扩散模型用于离线黑盒优化中的前向代理建模,包含校准扩散估计和支持邻近正则化两个关键组件。校准模块强制全局统计矩和排序一致性,正则化则通过kNN密度估计嵌入数据流形先验。理论证明正则化等价于最大化贝叶斯后验,实验在Design-Bench和LLM数据混合优化基准上取得最优性能。

arXiv:2605.11246v1 Announce Type: new Abstract: Offline black-box optimization aims to discover novel designs with high property scores using only a static dataset, a task fundamentally challenged by the out-of-distribution (OOD) extrapolation problem. Existing approaches typically bifurcate into inverse methods, which struggle with the ill-posed nature of mapping scores to designs, and forward methods, which often lack the distributional expressivity to quantify uncertainty effectively. In this work, we propose SPADE (Support-Proximity Augmented Diffusion Estimation), a novel framework that reimagines forward surrogate modeling through the lens of conditional generative modeling. SPADE models the forward likelihood p(y|x) using a diffusion model, but with two critical enhancements to tailor it for optimization: (1) a Calibrated Diffusion Estimation module that enforces global consistency in statistical moments and pairwise rankings, and (2) a Support-Proximity Regularization mechanism that implicitly internalizes the data manifold constraint p(x) via kNN-based density estimation. Theoretically, we prove that our regularization is first-order equivalent to maximizing a Bayesian posterior with a valid design prior. Empirically, SPADE achieves state-of-the-art performance across Design-Bench tasks and an LLM data mixture optimization benchmark.
arXiv arXiv cs.LG · 6 小时前 · 相关度 75% 热度★★☆☆☆
144
SAGE: A Self-Evolving Agentic Graph-Memory Engine for Structure-Aware Associative Memory
SAGE: 一种用于结构感知联想记忆的自进化智能体图记忆引擎
开发工具学术论文

本文提出SAGE,一个自进化的智能体图记忆引擎,用于解决语言智能体的长期记忆瓶颈。它将记忆建模为由记忆写入器增量构建的动态图,并使用基于图基础模型的读出器执行检索并提供反馈,实现记忆图的自进化。在多项基准上,SAGE经过两轮自进化后在多跳问答中取得最佳平均排名,零样本开放域迁移时在NQ上Recall@2/5达82.5%/91.6%,并在长期记忆和幻觉诊断指标上表现提升,表明结构感知的自进化图记忆是构建鲁棒长程智能体的有效方法。

arXiv:2605.12061v1 Announce Type: new Abstract: Long-term memory is becoming a central bottleneck for language agents. Exsting RAG and GraphRAG systems largely treat memory graphs as static retrieval middleware, which limits their ability to recover complete evidence chains from partial cues, exploit reusable graph-structrual roles, and improve the memory itself through downstream feedback. We introduce SAGE, a Self-evolving Agentic Graph-memory Engine that models graph memory as a dynamic long-term memory substrate. SAGE couples two roles: a memory writer that incrementally constucts structured graph memory from interaction histories, and a Graph Foundation Model-based memory reader to perform retrieval and provide feedback to the memory writer. We provide rigorooous theoretical annalyses supporting the framework. Across multi-hop QA, open-domain retireval, domain-specific review QA, and long-term agent-memory benchmarks, SAGE improves evidence recovery, answer grounding, and retrieval efficiency: after two self-evolution rounds, it achieves the best average rank on multi-hop QA; in zero-shot open-domain transfer, it reaches 82.5/91.6 Recall@2/5 on NQ. Further results on LongMemEval and HaluMem show that traning and reader-writer feedback improve multiple long-term memory and hallucination-diagnostic metrics, suggesting that self-evolving, structure-aware graph memory is a promising foundation for robust long-horizon language agents.
arXiv arXiv cs.AI · 6 小时前 · 相关度 75% 热度★★☆☆☆
145
OptArgus: A Multi-Agent System to Detect Hallucinations in LLM-based Optimization Modeling
OptArgus:用于检测基于大语言模型的优化建模中幻觉的多智能体系统
学术论文基础大模型

本文针对大型语言模型在将自然语言优化问题转换为数学建模和求解器代码时产生的幻觉,提出首个面向优化建模的细粒度幻觉分类法,涵盖目标、变量、约束和实现失败。基于该分类法设计多智能体检测系统OptArgus,采用指挥路由、专家审计和证据整合机制。为评估效果,构建了包含484个干净样本、1266个受控注入错误和6292个自然模型生成错误的三部分基准套件。实验表明,OptArgus相比单智能体基线,显著减少对干净样本的误报,在受控单错误场景下提供更准确的错误定位,并在自然模型输出上实现更强的检测能力。

arXiv:2605.11738v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used to translate natural-language optimization problems into mathematical formulations and solver code, but matching the reference objective value is not a reliable test of correctness: an artifact may agree numerically while still changing the underlying optimization semantics. We formulate this issue as \emph{optimization-modeling hallucination detection}, namely structural consistency auditing over the problem description, symbolic model, and solver implementation. We develop, to our knowledge, the first fine-grained hallucination taxonomy specifically for optimization modeling, spanning objective, variable, constraint, and implementation failures. We use this taxonomy to design OptArgus, a multi-agent detector with conductor routing, specialist auditors, and evidence consolidation. To evaluate this setting, we introduce a three-part benchmark suite with $484$ clean artifacts, $1266$ controlled injected artifacts, and $6292$ natural LLM-generated artifacts. Against a matched single-agent baseline, OptArgus produces fewer false alarms on clean artifacts, more accurate top-ranked localization on controlled single-error cases, and stronger detection on natural model outputs. Together, these contributions turn optimization-modeling hallucination detection into a concrete empirical problem and suggest that modular, taxonomy-grounded auditing is a practical route to more reliable optimization modeling.
arXiv arXiv cs.AI · 6 小时前 · 相关度 75% 热度★★☆☆☆
146
Allegory of the Cave: Measurement-Grounded Vision-Language Learning
洞穴寓言:基于测量数据基础化的视觉语言学习
基础大模型学术论文

该论文提出PRISM-VL方法,用相机RAW数据派生出的Meas.-XYZ输入替代传统RGB图像,结合相机条件嵌入和曝光包围监督聚合,将视觉接口下移至更接近原始传感器测量。在低光照、高动态范围和幻觉敏感场景下,PRISM-VL-8B相比RGB基线Qwen3-VL-8B在BLEU上提升0.1074、ROUGE-L提升0.1071、LLM评判准确率提升4.46个百分点,证明保留测量域信息能减少RGB渲染造成的信息丢失,显著提升多模态推理基础能力。

arXiv:2605.11727v1 Announce Type: new Abstract: Vision-language models typically reason over post-ISP RGB images, although RGB rendering can clip, suppress, or quantize sensor evidence before inference. We study whether grounding improves when the visual interface is moved closer to the underlying camera measurement. We formulate measurement-grounded vision-language learning and instantiate it as PRISM-VL, which combines RAW-derived Meas.-XYZ inputs, camera-conditioned grounding, and Exposure-Bracketed Supervision Aggregation for transferring supervision from RGB proxies to measurement-domain observations. Using a quality-controlled 150K instruction-tuning set and a held-out benchmark targeting low-light, HDR, visibility-sensitive, and hallucination-sensitive cases, PRISM-VL-8B reaches 0.6120 BLEU, 0.4571 ROUGE-L, and 82.66\% LLM-Judge accuracy, improving over the RGB Qwen3-VL-8B baseline by +0.1074 BLEU, +0.1071 ROUGE-L, and +4.46 percentage points. These results suggest that part of VLM grounding error arises from information lost during RGB rendering, and that preserving measurement-domain evidence can improve multimodal reasoning.
arXiv arXiv cs.AI · 6 小时前 · 相关度 75% 热度★★☆☆☆
147
SafeSteer: A Decoding-level Defense Mechanism for Multimodal Large Language Models
SafeSteer:多模态大语言模型的解码级防御机制
推理部署基础大模型

SafeSteer 提出一种不依赖微调的解码级安全防御方法,通过轻量探测模块在解码过程中识别并修正有害输出,迭代引导生成朝向安全方向。方法还引入模态语义对齐向量,将文本模态的安全对齐能力迁移至视觉模态。实验表明,SafeSteer 可将多模态大模型的安全性提升最高 33.40%,同时保持有用性,有效应对图像维度的越狱攻击。

arXiv:2605.11716v1 Announce Type: new Abstract: Multimodal large language models (MLLMs) are gaining increasing attention. Due to the heterogeneity of their input features, they face significant challenges in terms of jailbreak defenses. Current defense methods rely on costly fine-tuning or inefficient post-hoc interventions, limiting their ability to address novel attacks and involving performance trade-offs. To address the above issues, we explore the inherent safety capabilities within MLLMs and quantify their intrinsic ability to discern harmfulness at decoding stage. We observe that 1) MLLMs can distinguish the harmful and harmless inputs during decoding process, 2) Image-based attacks are more stealthy. Based on these insights, we introduce SafeSteer, a decoding-level defense mechanism for MLLMs. Specifically, it includes a Decoding-Probe, a lightweight probe for detecting and correcting harmful output during decoding, which iteratively steers the decoding process toward safety. Furthermore, a modal semantic alignment vector is integrated to transfer the strong textual safety alignment to the vision modality. Experiments on multiple MLLMs demonstrate that SafeSterr can improve MLLMs&#39; safety by up to 33.40\% without fine-tuning. Notably, it can maintain the effectiveness of MLLMs, ensuring a balance between their helpfulness and harmlessness.
arXiv arXiv cs.AI · 6 小时前 · 相关度 75% 热度★★☆☆☆
148
GAR: Carbon-Aware Routing for LLM Inference via Constrained Optimization
GAR:通过约束优化实现碳感知的大模型推理路由
推理部署学术论文

本文提出碳感知路由框架GAR,通过约束多目标优化在满足准确率下限和尾部延迟SLO的前提下最小化单次请求的CO2排放。GAR采用自适应约束优化与轻量估计器,实现实时在线路由决策,并提出基于原始对偶的在线算法GAR-PD进行滚动碳预算管理。在混合7B-70B模型池上的实验表明,GAR可在保证竞争性准确率和p95延迟的同时显著降低碳排放。

arXiv:2605.11603v1 Announce Type: new Abstract: The growing deployment of large language models (LLMs) makes per-request routing essential for balancing response quality and computational cost across heterogeneous model pools. Current routing methods rarely consider sustainable energy use and CO2 emissions as optimization objectives, despite grid carbon intensity varying by time and region, and models differing significantly in energy consumption. To address this gap, we introduce Green-Aware Routing (GAR), a constrained multi-objective optimization framework that minimizes per-request CO2 emissions subject to explicit accuracy floors and p95-latency service-level objectives (SLOs). GAR employs adaptive constraint optimization through per-dataset floor tuning and incorporates lightweight estimators for correctness, tail latency, and carbon emissions, enabling real-time routing decisions without additional inference passes. We present GAR-PD, a practical online primal-dual routing algorithm for rolling carbon budgets, alongside heuristic variants that achieve high feasibility coverage while limiting accuracy degradation. Comprehensive experiments across standard NLP benchmarks with heterogeneous LLM pools (7B-70B) demonstrate that GAR achieves substantial carbon reductions while maintaining competitive accuracy and p95 latency guarantees, providing a practical, theoretically grounded approach to sustainable LLM inference.
arXiv arXiv cs.AI · 6 小时前 · 相关度 75% 热度★★☆☆☆
149
Breaking $\textit{Winner-Takes-All}$: Cooperative Policy Optimization Improves Diverse LLM Reasoning
打破赢家通吃:合作策略优化提升大语言模型推理多样性
训练微调学术论文

本文针对大语言模型(LLM)强化学习推理训练中群组相对策略优化(GRPO)等算法常见的探索崩溃问题,提出群组合作策略优化(GCPO)方法,将训练范式从 rollout 间的竞争转变为团队合作。GCPO 通过团队级信用分配,以 rollout 对团队有效解决方案覆盖的贡献而非个体精度作为奖励,并利用基于语义嵌入的行列式体积度量覆盖度。在多个推理基准上的实验表明,该方法相比现有方法显著提升了推理准确性和解决方案多样性。

arXiv:2605.11461v1 Announce Type: new Abstract: Reinforcement learning with verifiers (RLVR) has become a central paradigm for improving LLM reasoning, yet popular group-based optimization algorithms like GRPO often suffer from exploration collapse, where the models prematurely converge on a narrow set of high-scoring patterns, lacking the ability to explore new solutions. Recent efforts attempt to alleviate this by adding entropy regularization or diversity bonus. However, these approaches do not change the \textit{winner-takes-all} nature, where rollouts still compete for individual advantage rather than cooperating for maximizing global diversity. In this work, we propose Group Cooperative Policy Optimization (GCPO), which shifts the training paradigm from rollout competition to team cooperation. Specifically, GCPO replaces independent rollout scoring with team-level credit assignment: a rollout is rewarded by how much it contributes to the team&#39;s valid solution coverage, rather than its individual accuracy. This coverage is described as a determinant volume over reward-weighted semantic embeddings, where only correct and non-redundant rollouts contribute to this volume. During advantage estimation, GCPO redistributes the collective team reward to each single rollout according to its average marginal contribution to the team. This cooperative training paradigm routes optimization toward non-redundant correct reasoning paths. Experiments across multiple reasoning benchmarks demonstrate that GCPO significantly improves both reasoning accuracy and solution diversity over existing approaches. Code will be released at $\href{https://github.com/bradybuddiemarch/gcpo}{this}$.
arXiv arXiv cs.AI · 6 小时前 · 相关度 75% 热度★★☆☆☆
150
Targeted Tests for LLM Reasoning: An Audit-Constrained Protocol
针对大语言模型推理的靶向测试:审计约束协议
基础大模型学术论文

本文提出了一种审计约束协议,用于评估大语言模型在推理任务中对提示变化的鲁棒性。该方法通过有限的提示组件语法生成确定性变体,在固定查询预算下评估,并仅在经过语义和答案提取双重审计后才将结果计为模型错误。作者实现了组件自适应提示采样(CAPS),并与均匀采样对比,发现在审计后的正确错误识别上CAPS未显示显著优势,但该方法本身为可重现、可审查的靶向提示测试提供了方法论框架。

arXiv:2605.11599v1 Announce Type: new Abstract: Fixed reasoning benchmarks evaluate canonical prompts, but semantically valid changes in presentation can still change model behavior. Studies of prompt variation can reveal such failures, but without audit they can mix genuine model errors with invalid perturbations, extraction artifacts, and unmatched search procedures. We propose an audit-constrained protocol for targeted reasoning evaluation. Prompt variants are generated from a finite component grammar, rendered deterministically, evaluated under a fixed query budget, and counted as model errors only after semantic and extraction audit. Within this protocol we instantiate Component-Adaptive Prompt Sampling (CAPS), a score-based sampler over prompt components, and compare it with equal-budget uniform component sampling under the same task bank, renderer, model interface, decoding settings, and audit procedure. Across three audited slices, the protocol identifies confirmed model-error prompt keys while excluding formatting and extraction artifacts, but matched comparisons do not show that CAPS improves audited yield or unique prompt-key discovery over uniform sampling. The contribution is methodological: targeted prompt variation can be studied under a reconstructable, reviewable, budget-matched protocol, and proxy-guided policies should be judged by audited yield rather than raw mismatch counts or selected examples alone.
arXiv arXiv cs.LG · 6 小时前 · 相关度 75% 热度★★☆☆☆
151
Pion: A Spectrum-Preserving Optimizer via Orthogonal Equivalence Transformation
Pion: 一种基于正交等价变换的谱保持优化器
训练微调

本文提出针对大语言模型训练的优化器Pion,基于正交等价变换,通过左右正交阵更新权重矩阵,在训练中保持奇异值不变,从而维持权重矩阵的谱范数固定。该优化器不同于Adam、Muon等加法型优化器,从几何上调控权重矩阵的结构。文中导出了Pion的更新规则,分析了设计选择与收敛性质,并在LLM预训练和微调实验中展示了稳定且具有竞争力的替代方案。

arXiv:2605.12492v1 Announce Type: new Abstract: We introduce Pion, a spectrum-preserving optimizer for large language model (LLM) training based on orthogonal equivalence transformation. Unlike additive optimizers such as Adam and Muon, Pion updates each weight matrix through left and right orthogonal transformations, preserving its singular values throughout training. This yields an optimization mechanism that modulates the geometry of weight matrices while keeping their spectral norm fixed. We derive the Pion update rule, systematically examine its design choices, and analyze its convergence behavior along with several key properties. Empirical results show that Pion offers a stable and competitive alternative to standard optimizers for both LLM pretraining and finetuning.
arXiv arXiv cs.LG · 6 小时前 · 相关度 75% 热度★★☆☆☆
152
DiffScore: Text Evaluation Beyond Autoregressive Likelihood
DiffScore:超越自回归似然的文本评估
开发工具学术论文

本文提出名为 DiffScore 的文本评估框架,基于掩码大型扩散语言模型,通过测量连续掩码率下的文本可恢复性来消除自回归模型的位置偏差,并建立起从局部流利度到全局连贯性的评估层次。框架提供多时间步质量配置文件与双向 PMI 分解等诊断工具,支持零样本与微调评测。在十个基准上的实验结果显示,DiffScore 持续优于自回归基线方法,代码已开源。

arXiv:2605.11601v1 Announce Type: cross Abstract: Autoregressive language models are widely used for text evaluation, however, their left-to-right factorization introduces positional bias, i.e., early tokens are scored with only leftward context, conflating architectural asymmetry with true text quality. We propose masked reconstruction as an alternative paradigm, where every token is scored using full bidirectional context. We introduce DiffScore, an evaluation framework built on Masked Large Diffusion Language Models. By measuring text recoverability across continuous masking rates, DiffScore eliminates positional bias and naturally establishes an evaluation hierarchy from local fluency to global coherence. We further provide diagnostic tools unavailable to autoregressive frameworks: multi-timestep quality profiles that decompose scores across masking rates, and bidirectional PMI decomposition that disentangles fluency from faithfulness. Experiments across ten benchmarks show that DiffScore consistently outperforms autoregressive baselines in both zero-shot and fine-tuned settings. The code is released at: https://github.com/wenlai-lavine/DiffScore.
arXiv arXiv cs.AI · 6 小时前 · 相关度 75% 热度★★☆☆☆
153
MEME: Multi-entity & Evolving Memory Evaluation
MEME:多实体与演进记忆评估
学术论文

MEME 是一个针对 LLM 智能体的多实体及演进记忆评估基准,定义了六个跨越“多实体”和“演进”两个维度的记忆任务,其中三个为先前工作未覆盖的依赖推理任务:级联、缺失和删除后状态推理。在 100 段受控情节中,六种记忆系统在默认配置下依赖推理能力几乎为零(级联准确率 3%,缺失准确率 1%),即使通过提示优化、更深度检索、减少填充噪声或使用更强模型也无法显著改善。仅在使用 Claude Opus 4.7 的文件型智能体下部分效果得到弥补,但成本约为基线的 70 倍,表明在规模化部署中仍不可行。

arXiv:2605.12477v1 Announce Type: new Abstract: LLM-based agents increasingly operate in persistent environments where they must store, update, and reason over information across many sessions. While prior benchmarks evaluate only single-entity updates, MEME defines six tasks spanning the full space defined by the multi-entity and evolving axes, including three not scored by prior work: Cascade and Absence (dependency reasoning) and Deletion (post-removal state). Evaluating six memory systems spanning three memory paradigms on 100 controlled episodes, we find that all systems collapse on dependency reasoning under the default configuration (Cascade: 3%, Absence: 1% in average accuracy) despite adequate static retrieval performance. Prompt optimization, deeper retrieval, reduced filler noise, and most stronger LLMs fail to close this gap. Only a file-based agent paired with Claude Opus 4.7 as its internal LLM partially closes the gap, but at ~70x the baseline cost, indicating closure currently depends on configurations that are not practical at scale. Code and data are available on the project page: https://seokwonjung-jay.github.io/meme-eval/.
arXiv arXiv cs.LG · 6 小时前 · 相关度 75% 热度★★☆☆☆
154
Multi-Stream LLMs: Unblocking Language Models with Parallel Streams of Thoughts, Inputs and Outputs
多流LLM:通过并行思想、输入和输出流解锁语言模型
训练微调基础大模型

本文提出多流大语言模型(Multi-Stream LLMs),将传统单流对话格式改为多路并行计算流,每条流对应不同角色(用户、系统、思考、工具等)。模型在每个前向传播中同时从多个输入流读取并生成多个输出流的 token,打破单流瓶颈,使智能体能在“思考”的同时“行动”,并在读取过程中开始输出。该方法通过指令微调实现,可在不改变模型架构的前提下提升并行效率、安全隔离性和可监控性。

arXiv:2605.12460v1 Announce Type: new Abstract: The continued improvements in language model capability have unlocked their widespread use as drivers of autonomous agents, for example in coding or computer use applications. However, the core of these systems has not changed much since early instruction-tuned models like ChatGPT. Even advanced AI agents function on message exchange formats, successively exchanging messages with users, systems, with itself (i.e. chain-of-thought) and tools in a single stream of computation. This bottleneck to a single stream in chat models leads to a number of limitations: the agent cannot act (generate output) while reading, and in reverse, cannot react to new information while writing. Similarly, the agent cannot act while thinking and cannot think while reading or acting on information. In this work, we show that models can be unblocked by switching from instruction-tuning for sequential message formats to instruction-tuning for multiple, parallel streams of computation, splitting each role into a separate stream. Every forward pass of the language model then simultaneously reads from multiple input streams and generates tokens in multiple output streams, all of which causally depend on earlier timesteps. We argue that this data-driven change remedies a number of usability limitations as outlined above, improves model efficiency through parallelization, improves model security through better separation of concerns and can further improve model monitorability.
arXiv arXiv cs.LG · 6 小时前 · 相关度 75% 热度★★☆☆☆
155
SpatialForge: Bootstrapping 3D-Aware Spatial Reasoning from Open-World 2D Images
SpatialForge:从开放世界2D图像引导3D意识空间推理
学术论文训练微调

本文提出 SpatialForge,一种可扩展的数据合成流水线,将非受限的2D图像转化为空间推理监督信号。方法将空间推理分解为感知和关系两部分,构建覆盖深度、布局和视角依赖推理的结构化监督,并配备自动验证以确保数据质量。基于该流水线构建了包含千万级空间问答对的 SpatialForge-10M 数据集,并在多个空间推理基准上验证了训练后标准视觉语言模型的空间推理能力显著提升,证明了利用大规模2D数据提升3D感知推理的有效性。

arXiv:2605.11462v1 Announce Type: cross Abstract: Recent advancements in Large Vision-Language Models (VLMs) have demonstrated exceptional semantic understanding, yet these models consistently struggle with spatial reasoning, often failing at fundamental geometric tasks such as depth ordering and precise coordinate grounding. Recent efforts introduce spatial supervision from scene-centric datasets (e.g., multi-view scans or indoor video), but are constrained by the limited number of underlying scenes. As a result, the scale and diversity of such data remain significantly smaller than those of web-scale 2D image collections. To address this limitation, we propose SpatialForge, a scalable data synthesis pipeline that transforms in-the-wild 2D images into spatial reasoning supervision. Our approach decomposes spatial reasoning into perception and relation, and constructs structured supervision signals covering depth, layout, and viewpoint-dependent reasoning, with automatic verification to ensure data quality. Based on this pipeline, we build SpatialForge-10M, a large-scale dataset containing 10 million spatial QA pairs. Extensive experiments across multiple spatial reasoning benchmarks demonstrate that training on SpatialForge-10M significantly improves the spatial reasoning ability of standard VLMs, highlighting the effectiveness of scaling 2D data for 3D-aware spatial reasoning.
arXiv arXiv cs.AI · 6 小时前 · 相关度 75% 热度★★☆☆☆
156
Can a Single Message Paralyze the AI Infrastructure? The Rise of AbO-DDoS Attacks through Targeted Mobius Injection
一条消息就能瘫痪 AI 基础设施?基于定向 Mobius 注入的 AbO-DDoS 攻击兴起
学术论文行业资讯

本文揭示了一种针对 LLM 代理基础设施的新型攻击范式——Mobius 注入,通过利用代理逻辑中的“语义闭合”漏洞,单条文本注入即可将自主代理转化为僵尸节点,发动基于代理的 DDoS(AbO-DDoS)攻击。实验在 3 种爪式代理和 3 种编码代理上配合 12 个前沿 LLM 进行,结果显示单节点调用放大最高达 51.0 倍,多节点 p95 延迟膨胀最高达 229.1 倍,且攻击效果随中毒节点数超线性增长。论文还提出了基于代理组件能量分析的主动防御机制,用以检测恶意递归触发。

arXiv:2605.11442v1 Announce Type: cross Abstract: Large Language Model (LLM) agents have emerged as key intermediaries, orchestrating complex interactions between human users and a wide range of digital services and LLM infrastructures. While prior research has extensively examined the security of LLMs and agents in isolation, the systemic risk of the agent acting as a disruptive hub within the user-agent-service chain remains largely overlooked. In this work, we expose a novel threat paradigm by introducing Mobius Injection, a sophisticated attack that weaponizes autonomous agents into zombie nodes to launch what we define as gent-based and -Oriented DDoS (AbO-DDoS) attacks. By exploiting a structural vulnerability in agentic logic named Semantic Closure, an adversary can induce sustained recursive execution of agent components through a single textual injection. We demonstrate that this attack is exceptionally lightweight, stealthy against both traditional DDoS monitors and contemporary AI safety filters, and highly configurable, allowing for surgical targeting of specific environments or model providers. To evaluate the real-world impact, we conduct extensive experiments across three representative claw-style agents and three mainstream coding agents, integrated with 12 frontier proprietary or open-weight LLMs. Our results demonstrate that Mobius Injection achieves substantial attack success across diverse tasks, driving single-node call amplification up to 51.0x and multi-node p95 latency inflation up to 229.1x. The attack performance exhibits a superlinear increase with the number of poisoning nodes. To mitigate Mobius Injection, we propose a proactive defense mechanism using Agent Component Energy (ACE) Analysis, which detects malicious recursive triggers by measuring anomalous energy in the agent&#39;s component graph.
arXiv arXiv cs.AI · 6 小时前 · 相关度 75% 热度★★☆☆☆
157
PriorZero: Bridging Language Priors and World Models for Decision Making
PriorZero:连接语言先验与世界模型的决策桥梁
基础大模型训练微调学术论文

提出PriorZero框架,通过解耦的展开-训练设计将大语言模型的概念先验融入基于世界模型的规划中。在展开阶段,根先验注入机制仅在MCTS根节点引入LLM先验,以聚焦语义有前景的动作,同时保持世界模型的前瞻能力。训练阶段解耦世界模型学习与LLM微调,利用价值估计为LLM提供细粒度信用分配信号,实现交替优化,从而在长程任务中稳定提升探索效率和渐进性能。

arXiv:2605.12289v1 Announce Type: new Abstract: Leveraging the rich world knowledge of Large Language Models (LLMs) to enhance Reinforcement Learning (RL) agents offers a promising path toward general intelligence. However, a fundamental prior-dynamics mismatch hinders existing approaches: static LLM knowledge cannot directly adapt to the complex transition dynamics of long-horizon tasks. Using LLM priors as fixed policies limits exploration diversity, as the prior is blind to environment-specific dynamics; while end-to-end fine-tuning suffers from optimization instability and credit assignment issues. To bridge this gap, we propose PriorZero, a unified framework that integrates LLM-derived conceptual priors into world-model-based planning through a decoupled rollout-training design. During rollout, a novel root-prior injection mechanism incorporates LLM priors exclusively at the root node of Monte Carlo Tree Search (MCTS), focusing search on semantically promising actions while preserving the world model&#39;s deep lookahead capability. During training, PriorZero decouples world-model learning from LLM adaptation: the world model is continuously refined on interaction data to jointly improve its dynamics, policy, and value predictions, its value estimates are then leveraged to provide fine-grained credit assignment signals for stable LLM fine-tuning via alternating optimization. Experiments across diverse benchmarks, including text-based adventure games in Jericho and instruction-following gridworld tasks in BabyAI, demonstrate that PriorZero consistently improves both exploration efficiency and asymptotic performance, establishing a promising framework for LLM-empowered decision-making. Our code is available at https://github.com/opendilab/LightZero.
arXiv arXiv cs.LG · 6 小时前 · 相关度 75% 热度★★☆☆☆
158
Options, Not Clicks: Lattice Refinement for Consent-Driven MCP Authorization
选择,而非点击:面向同意驱动的MCP授权的格细化
开发工具学术论文

Conleash 是一个针对模型上下文协议(MCP)的客户端中间件,通过构建风险格自动允许已知边界内的安全工具调用,仅对潜在危险参数进行升级处理。它结合策略引擎实施用户定义的约束,并利用细化循环将每次用户决策转化为可复用的授权规则。在984条真实交互轨迹的评估中,Conleash 达到98.2%的授权准确率,捕获99.4%的升级调用,且策略验证仅增加8.2毫秒延迟;16名参与者的用户研究显示,相比传统“总是允许”或黑盒LLM决策,Conleash 的作用域权限显著提升了信任度并减少了授权疲劳。

arXiv:2605.11360v1 Announce Type: cross Abstract: As Model Context Protocol adoption grows, securing tool invocations via meaningful user consent has become a critical challenge, as existing methods, broad always allow toggles or opaque LLM-based decisions, fail to account for dangerous call arguments and often lead to consent fatigue. In this work, we present Conleash, a client-side middleware that enforces boundary-scoped authorization by utilizing a risk lattice to auto-permit safe calls within known boundaries while escalating risks, a policy engine for user-defined invariants, and a refinement loop that converts user decisions into reusable rules. Evaluated on 984 real-world traces, Conleash achieved 98.2% accuracy, caught 99.4% of escalations, and added only 8.2 ms of overhead for policy verification; furthermore, in a user study where N=16, participants significantly preferred Conleash scoped permissions over traditional methods, citing higher trust and reduced prompting.
arXiv arXiv cs.AI · 6 小时前 · 相关度 75% 热度★★☆☆☆
159
Instruction Lens Score: Your Instruction Contributes a Powerful Object Hallucination Detector for Multimodal Large Language Models
指令镜头分数:你的指令为多模态大语言模型贡献强大的物体幻觉检测器
基础大模型学术论文

本文深入分析了多模态大语言模型中的指令词元嵌入,发现其隐式编码视觉信息并能有效过滤误导性视觉嵌入。基于此洞察,提出了无需辅助模型或额外训练的指令镜头分数(InsLen),它结合了校准后的局部分数和衡量目标词元上下文一致性的分数,用作即插即用的物体幻觉检测器。在多个基准和不同MLLM架构上的广泛实验表明,InsLen一致优于现有幻觉检测方法,展现了有效性和鲁棒性。

arXiv:2605.12258v1 Announce Type: new Abstract: Multimodal large language models (MLLMs) have achieved remarkable progress, yet the object hallucination remains a critical challenge for reliable deployment. In this paper, we present an in-depth analysis of instruction token embeddings and reveal that they implicitly encode visual information while effectively filtering erroneous information introduced by misleading visual embeddings. Building on this insight, we propose the Instruction Lens Score (InsLen), which combines a Calibrated Local Score with a Context Consistency Score that measures context consistency of the object tokens. The proposed approach serves as a plug-and-play object hallucination detector without relying on auxiliary models or additional training. Extensive experiments across multiple benchmarks and diverse MLLM architectures demonstrate that InsLen consistently outperforms existing hallucination detection methods, highlighting its effectiveness and robustness. The code is available at https://github.com/Fraserlairh/Instruction-Lens-Score.
arXiv arXiv cs.LG · 6 小时前 · 相关度 75% 热度★★☆☆☆
160
Overtrained, Not Misaligned
过度训练而非误对齐
训练微调

本文对GPT-4o和12个开源大模型进行大规模复现研究,发现仅在少数模型中出现持续性的“涌现误对齐”(EM),且EM与模型大小相关。通过微调过程的检查点分析,证明EM发生于主任务收敛之后的过度训练阶段,而非初始微调本身;采用早停策略可消除EM并平均保留93%的任务性能,恰当的初始学习率选择也能降低风险。在医学微调上的跨领域验证同样证实这一模式,将EM从不可预见的微调风险重新定义为可通过良好训练实践避免的现象。

arXiv:2605.12199v1 Announce Type: new Abstract: Emergent misalignment (EM), where fine-tuning on a narrow task (like insecure code) causes broad misalignment across unrelated domains, was first demonstrated by Betley et al. (2025). We conduct the most comprehensive EM study to date, reproducing the original GPT-4o finding and expanding to 12 open-source models across 4 families (Llama, Qwen, DeepSeek, GPT-OSS) ranging from 8B to 671B parameters, evaluating over one million model responses with multiple random seeds. We find that EM replicates in GPT-4o but is far from universal: only 2 of 12 open-source models (17%) exhibit consistent EM across seeds, with a significant correlation between model size and EM susceptibility. Through checkpoint-level analysis during fine-tuning, we demonstrate that EM emerges late in training, distinct from and subsequent to near convergence of the primary task, suggesting EM emerges from continued training past task convergence. This yields practical mitigations: early stopping eliminates EM while retaining an average of 93% of task performance, and careful learning rate selection further minimizes risk. Cross-domain validation on medical fine-tuning confirms these patterns generalize: the size-EM correlation strengthens (r = 0.90), and overgeneralization to untruthfulness remains avoidable via early stopping in 67% of cases, though semantically proximate training domains produce less separable misalignment. As LLMs become increasingly integrated into real-world systems, fine-tuning and reinforcement learning remain the primary methods for adapting model behavior. Our findings demonstrate that with proper training practices, EM can be avoided, reframing it from an unforeseen fine-tuning risk to an avoidable training artifact.
arXiv arXiv cs.LG · 6 小时前 · 相关度 75% 热度★★☆☆☆
161
A Unified Graph Language Model for Multi-Domain Multi-Task Graph Alignment Instruction Tuning
面向多领域多任务图对齐指令微调的统一图语言模型
基础大模型训练微调学术论文

本文提出UniGraphLM,一种统一图语言模型,旨在将多领域多任务GNN编码器纳入LLM,并实现图表示与文本语义的自适应对齐。现有方法在跨域跨任务的对齐上存在缺陷,UniGraphLM通过可泛化的图表示学习和自适应对齐策略,解决图结构、特征分布和监督信号差异带来的对齐难题,并提升在不同图数据和任务指令下的兼容性。该工作提升了图语言模型在多样化图数据上的泛化能力。

arXiv:2605.12197v1 Announce Type: new Abstract: Leveraging Graph Neural Networks (GNNs) as graph encoders and aligning the resulting representations with Large Language Models (LLMs) through alignment instruction tuning has become a mainstream paradigm for constructing Graph Language Models (GLMs), combining the generalization ability of LLMs with the structural modeling capacity of GNNs. However, existing GLMs that adopt GNNs as graph encoders largely overlook the problem of aligning GNN-encoded representations across domains and tasks with the LLM token space to obtain unified graph tokens, thereby limiting their ability to generalize across diverse graph data. To bridge this gap, we aim to incorporate a multi-domain, multi-task GNN encoder into GLMs and align its representations with LLMs to enable multi-domain, multi-task graph alignment instruction tuning. This alignment problem remains underexplored and poses two key challenges: 1) learning GNN-encoded representations that are simultaneously generalizable across domains and tasks and well aligned with textual semantics is difficult, due to substantial variations in graph structures, feature distributions, and supervision signals, together with the lack of textual-semantic alignment guidance in task-specific GNN training; 2) diverse graph data and task-specific instructions can exhibit different degrees of compatibility with the LLM token space during instruction tuning, leading to varying alignment difficulty and rendering a fixed alignment strategy suboptimal. To tackle these challenges, we propose UniGraphLM, a Unified Graph Language Model that incorporates a multi-domain, multi-task GNN encoder to learn generalizable graph representations aligned with textual semantics, and then adaptively aligns these representations with the LLM.
arXiv arXiv cs.LG · 6 小时前 · 相关度 75% 热度★★☆☆☆
162
Gradient Clipping Beyond Vector Norms: A Spectral Approach for Matrix-Valued Parameters
超越向量范数的梯度裁剪:一种矩阵值参数的谱方法
训练微调学术论文

本文提出一种针对矩阵参数梯度的谱裁剪方法,通过截断奇异值而非整体范数来稳定训练,避免忽略参数矩阵结构。该方法可无缝集成到现有优化器中,并提供重尾噪声下非凸优化的收敛性分析。为减少超参数调优,引入了基于移动平均或滑动窗口分位数的自适应阈值,并利用随机截断SVD实现高效剪裁,仅处理前r个奇异值。实验在合成重尾噪声和神经网络训练任务上展示了竞争力性能。

arXiv:2605.11838v1 Announce Type: new Abstract: Gradient clipping is a standard safeguard for training neural networks under noisy, heavy-tailed stochastic gradients; yet, most clipping rules treat all parameters as vectors and ignore the matrix structure of modern architectures. We show empirically that data outliers often amplify only a small number of leading singular values in layer-wise gradient matrices, while the rest of the spectrum remains largely unchanged. Motivated by this phenomenon, we propose spectral clipping, which stabilizes training by clamping singular values that exceed a threshold while preserving the singular directions. This framework generalizes classical gradient norm clipping and can be easily integrated into existing optimizers. We provide a convergence analysis for non-convex optimization with spectrally clipped SGD, yielding the optimal $\mathcal{O}\left(K^{\frac{2 - 2\alpha}{3\alpha - 2}}\right)$ rate for heavy-tailed noise. To minimize hyperparameter tuning, we introduce layer-wise adaptive thresholds based on moving averages or sliding-window quantiles of the top singular values. Finally, we develop efficient implementations that clip only the top $r$ singular values via randomized truncated SVD, avoiding full decompositions for large layers. We demonstrate competitive performance across synthetic heavy-tailed settings and neural network training tasks.
arXiv arXiv cs.LG · 6 小时前 · 相关度 75% 热度★★☆☆☆
163
More Edits, More Stable: Understanding the Lifelong Normalization in Sequential Model Editing
更多编辑,更稳定:理解序列模型编辑中的终身归一化
训练微调学术论文

本文首次从理论上分析了终身模型编辑中广泛使用的“终身归一化”(LN)机制,揭示了其形成的自增强稳定循环:结合岭回归正则化后,LN能使参数更新渐近正交且范数有界,从而直接减轻遗忘和系统性崩溃。基于这一洞见,作者提出了StableEdit方法,通过引入显式预热阶段和完全白化来强化稳定循环,仅增加极小开销便提升了长期编辑稳定性。大量实验验证了理论分析的有效性,并展示了有竞争力的模型编辑性能。

arXiv:2605.11836v1 Announce Type: new Abstract: Lifelong Model Editing aims to continuously update evolving facts in Large Language Models while preserving unrelated knowledge and general capabilities, yet it remains plagued by catastrophic forgetting and model collapse. Empirically, we find that recent editors resilient over long horizons share the same core strategy: Lifelong Normalization (LN), which normalizes value gradients using running statistics. Removing LN causes immediate performance collapse, and we observe a counter-intuitive positive cumulative effect where early edits can promote the success of future edits. Yet the mechanism of LN remains a &#34;black box&#34;, leaving its precise role in lifelong stability poorly understood. In this work, we provide the first theoretical account of LN in the lifelong regime. Our analysis reveals a self-reinforcing stability loop and proves that, when combined with ridge-regularized regression, LN yields parameter updates with asymptotic orthogonality and bounded norms, directly mitigating forgetting and systemic collapse. Based on these insights, we derive StableEdit, which strengthens this stability loop via an explicit warm-up stage and full whitening, improving long-horizon stability at minimal overhead. Extensive experiments validate our theory and demonstrate competitive performance. Our code is available at https://github.com/MINE-USTC/StableEdit.
arXiv arXiv cs.LG · 6 小时前 · 相关度 75% 热度★★☆☆☆
164
The Granularity Mismatch in Agent Security: Argument-Level Provenance Solves Enforcement and Isolates the LLM Reasoning Bottleneck
Agent安全中的粒度错配:参数级溯源解决执行问题并隔离LLM推理瓶颈
开发工具学术论文

本文提出PACT,一种面向工具使用LLM Agent的运行时监控器,通过为工具参数赋予语义角色并跨步骤追踪值溯源,实现对权限载体参数的细粒度信任检查。在混合信任工作流中,PACT在Oracle溯源下达到100%的效用与100%的安全,在五款模型的全AgentDojo部署中实现100%安全并恢复38.1–46.4%的效用,较CaMeL提升8–16个百分点。消融实验证实语义角色和跨步溯源均为必需,该工作将Agent安全重新定义为权限绑定,并将部署瓶颈集中于溯源推断与合约合成。

arXiv:2605.11039v1 Announce Type: cross Abstract: Tool-using LLM agents must act on untrusted webpages, emails, files, and API outputs while issuing privileged tool calls. Existing defenses often mediate trust at the granularity of an entire tool invocation, forcing a brittle choice in mixed-trust workflows: allow external content to influence a call and risk hijacked destinations or commands, or quarantine the call and block benign retrieval-then-act behavior. The key observation behind this paper is that indirect prompt injection becomes dangerous not when untrusted content appears in context, but when it determines an authority-bearing argument. We present \textsc{PACT} (\emph{Provenance-Aware Capability Contracts}), a runtime monitor that assigns semantic roles to tool arguments, tracks value provenance across replanning steps, and checks whether each argument&#39;s origin satisfies its role-specific trust contract. Under oracle provenance, \textsc{PACT} achieves 100\% utility and 100\% security on mixed-trust diagnostic suites, while flat invocation-level monitors incur false positives or false negatives. In full AgentDojo deployments across five models, \textsc{PACT} reaches 100\% security on the three strongest models while recovering 38.1--46.4\% utility, 8--16 percentage points above CaMeL at the same security level. Ablations show that both semantic roles and cross-step provenance are necessary. \textsc{PACT} reframes agent security as authority binding, and isolates the remaining deployment bottleneck to provenance inference and contract synthesis.
arXiv arXiv cs.AI · 6 小时前 · 相关度 75% 热度★★☆☆☆
165
U-STS-LLM A Unified Spatio-Temporal Steered Large Language Model for Traffic Prediction and Imputation
U-STS-LLM:一种用于流量预测与插补的统一时空引导大语言模型
训练微调学术论文

该论文提出U-STS-LLM框架,将预训练大语言模型(LLM)适配到蜂窝网络时空流量预测和缺失值插补任务。核心创新包括:动态时空注意偏置生成器,用于显式引导LLM的注意力;采用LoRA部分冻结骨干网络进行参数高效微调;以及门控自适应融合机制。在真实蜂窝数据上的实验表明,该方法在长期预测和高缺失率插补任务上均达到最先进性能,同时保持训练效率和稳定性,为大模型在结构化非语言领域的应用提供了新范式。

arXiv:2605.11735v1 Announce Type: new Abstract: The efficient operation of modern cellular networks hinges on the accurate analysis of spatio-temporal traffic data. Mastering these patterns is essential for core network functions, chiefly forecasting future load to pre-empt congestion and imputing missing values caused by sensor failures or transmission errors to ensure data continuity. While deeply connected, forecasting and imputation have historically evolved as separate sub-fields. The dominant paradigm, Spatio-Temporal Graph Neural Networks (STGNNs), while effective, are often specialized, computationally intensive, and exhibit limited generalization. Concurrently, adapting large pre-trained language models (LLMs) offers a powerful alternative for sequence modeling, yet existing approaches provide weak structural guidance, leading to unstable convergence and a narrow focus on forecasting. To bridge these gaps, we propose U-STS-LLM, a unified framework built on a spatio-temporally steered LLM. Our core innovation is a Dynamic Spatio-Temporal Attention Bias Generator that synthesizes a persistent functional graph with transient nodal states to explicitly steer the LLM&#39;s attention. Coupled with a partially frozen backbone tuned via Low-Rank Adaptation (LoRA) and a Gated Adaptive Fusion mechanism, the model achieves stable, parameter-efficient adaptation. Trained under a unified multi-task objective, U-STS-LLM learns a holistic data representation. Extensive experiments on real-world cellular datasets demonstrate that U-STS-LLM establishes new state-of-the-art performance in both long-horizon forecasting and high-missing-rate imputation, while maintaining remarkable training efficiency and stability, offering a novel blueprint for harnessing foundation models in structured, non-linguistic domains.
arXiv arXiv cs.LG · 6 小时前 · 相关度 75% 热度★★☆☆☆
166
An Executable Benchmarking Suite for Tool-Using Agents
一种面向工具使用智能体的可执行基准测试套件
开发工具学术论文

提出一个可执行的基准测试套件,将 WebArena Verified、SWE-Gym 切片和 MiniWoB++ 等基准统一起来,通过共同的工作负载适配器、任务清单、事件模式和证据准入合同,将论文相关证据与预飞行、夹具、诊断等行分离。套件记录延迟、无效动作行为、补丁生成成本、验证器元数据等可审计证据,旨在解决基准报告中将工作负载、动作驱动和证据混淆的问题。套件定位于基准测试设施,而非新智能体策略、模型排行榜或自动解题器。

arXiv:2605.11030v1 Announce Type: cross Abstract: Closed-loop tool-using agents are increasingly evaluated in executable web, code, and micro-task environments, but benchmark reports often conflate workloads, action-generating drivers, and the evidence admitted for systems-facing claims. We present an executable benchmarking suite that makes these objects explicit under a shared evidence-admission contract. The suite connects WebArena Verified, a SWE-Gym slice with SWE-bench-compatible verification, and MiniWoB++ through common workload adapters, task manifests, event schemas, replay/freeze policy, declared drivers, and reporting pipelines. In the canonical release, the gate separates paper-facing evidence from preflight, fixture, smoke, and diagnostic rows while preserving non-admitted artifacts for audit and onboarding. The admitted evidence records latency, invalid-action behavior, patch-generation cost, verifier metadata, replay bindings, and provenance under one auditable contract. The gate is decision-relevant rather than merely clerical: in a separate WebArena Verified controller study, clean-baseline and medium live-stressed evaluation select different fixed controller variants under the same workload and admission contract. The release is scoped as a benchmarking suite and admitted evidence, not a new agent policy, model leaderboard, backend comparison, or autonomous SWE-bench solver.
arXiv arXiv cs.AI · 6 小时前 · 相关度 75% 热度★★☆☆☆
167
FragBench: Cross-Session Attacks Hidden in Benign-Looking Fragments
FragBench:隐藏于看似无害片段中的跨会话攻击
基础大模型学术论文

本文提出FragBench,一个用于评估大模型跨会话碎片化攻击安全性的基准,攻击者可将恶意目标拆分成多个看似无害的子提示,仅在跨不同会话组合时才构成威胁。基准基于24个真实网络攻击事件构建,包含对抗性重写器(FragBench Attack)和基于图神经网络的用户级检测器(FragBench Defense)。单轮安全检测在该基准上近乎随机,但四种GNN变体和三种经典机器学习模型均能捕获跨会话特征,达到0.88–0.96的总体事件F1值,证明防御碎片化LLM滥用需要建模跨会话交互图,而非仅依赖单轮提示判断。

arXiv:2605.11029v1 Announce Type: cross Abstract: An attacker can split a malicious goal into sub-prompts that each look benign on their own and only become harmful in combination. Existing LLM safety benchmarks evaluate prompts one at a time, or across turns of a single chat, and so do not look for a malicious signal spread across separate sessions with no shared context. We build FragBench, a benchmark drawn from 24 real-world cyber-incident campaigns, which keeps the full attack trail: the multi-fragment kill chain, the per-fragment safety-judge verdicts, sandboxed execution traces, and a matched set of benign cover sessions. FragBench splits this trail into two paired tasks: an adversarial rewriter that hardens fragments against a single-turn safety judge (FragBench Attack), and a graph-based user-level detector trained on the resulting interactions (FragBench Defense). The single-turn judge is near chance on the released corpus by construction, but four GNN variants and three classical-ML baselines all recover the cross-session feature, reaching aggregate event-level F1 = 0.88-0.96. Defending against fragmented LLM misuse therefore requires modeling the cross-session interaction graph, rather than isolated prompts. Our generator, rewriter, sandbox harness, and detector are released at https://github.com/LidaSafety/fragbench.
arXiv arXiv cs.AI · 6 小时前 · 相关度 75% 热度★★☆☆☆
168
Evolutionary Task Discovery: Advancing Reasoning Frontiers via Skill Composition and Complexity Scaling
进化任务发现:通过技能组合与复杂度扩展推进推理前沿
训练微调基础大模型学术论文

本文提出进化任务发现(EvoTD)框架,将LLM训练数据合成为在算法技能和复杂度属性双轴流形上的定向搜索,以系统性地扩展推理前沿。框架引入交叉算子合成新颖技能组合以增强多样性,以及参数变异算子缩放结构约束以驱动鲁棒泛化,并集成动态最近发展区过滤器确保任务处于模型可学习区域。实验表明,EvoTD在不同的模型架构、预训练方案和规模上均能稳定带来显著的推理能力提升,为利用结构化进化课程支持推理改进提供了可行路径。

arXiv:2605.11666v1 Announce Type: new Abstract: The reasoning frontier of Large Language Models (LLMs) has advanced significantly through modern post-training paradigms (e.g., Reinforcement Learning from Verifiable Rewards (RLVR)). However, the efficacy of these methods remains fundamentally constrained by the diversity and complexity of the training data. One practical solution is data synthesis; yet, prevalent methods relying on unstructured mutation or exploration suffer from homogeneity collapse, failing to systematically expand the reasoning frontier. To overcome this, we propose Evoutionary Task Discovery (EvoTD), a framework that treats data synthesis as a directed search over a dual-axis manifold of Algorithmic Skills and Complexity Attributes. We introduce structured evolutionary operators to navigate this space: a Crossover operator that synthesizes novel skill compositions to enhance diversity, and a Parametric Mutation operator that scales structural constraints (e.g., input size, tree depth) to drive robust generalization. Crucially, we integrate a dynamic Zone of Proximal Development filter, ensuring tasks lie within the learnable region of the model. Empirically, EvoTD delivers substantial reasoning gains that generalize consistently across model architectures, pretraining regimes, and scales, demonstrating that structured evolutionary curricula can effectively support reasoning improvement. We release our code on https://github.com/liqinye/EvoTD.
arXiv arXiv cs.LG · 6 小时前 · 相关度 75% 热度★★☆☆☆
169
FERA: Uncertainty-Aware Federated Reasoning for Large Language Models
FERA:面向大语言模型的不确定性感知联邦推理
基础大模型学术论文

本文针对大型语言模型在分布式隐私数据下的多步推理增强,提出无需训练的联邦推理框架FERA。客户端生成推理轨迹并附带轻量不确定性估计,服务器通过不确定性感知的自我批评聚合(UA-SCA)进行查询依赖的可信度加权与交叉验证,迭代优化推理结果。理论分析证明了该协议的收敛性以及不确定性加权可加速收敛,在多个推理基准上,FERA均优于联邦训练和无训练基线,且保持通信与计算效率。

arXiv:2605.10082v1 Announce Type: cross Abstract: Large language models (LLMs) exhibit strong reasoning capabilities when guided by high-quality demonstrations, yet such data is often distributed across organizations that cannot centralize it due to regulatory, proprietary, or institutional constraints. We study federated reasoning, where a server improves multi-step reasoning by coordinating with heterogeneous clients holding private demonstrations, without centralized training or raw data sharing. The key challenge is that client reliability is query-dependent, while the server cannot inspect client data to determine which contributions are trustworthy. To address this, we propose Uncertainty-Aware Federated Reasoning (FERA), a training-free framework based on iterative server-client co-refinement. Across communication rounds, clients generate reasoning traces with lightweight uncertainty estimates, and the server synthesizes them into improved reasoning that is redistributed as context for the next round, progressively improving both server outputs and client-side reasoning. Within each round, Uncertainty-Aware Self-Critique Aggregation (UA-SCA) resolves conflicts among heterogeneous client traces through query-dependent trust weighting and structured cross-client verification. Rather than simply discarding low-quality traces, UA-SCA revises flawed reasoning steps to recover useful information. We provide theoretical guarantees showing that the proposed iterative protocol converges and that uncertainty-aware weighting accelerates convergence. Experiments on multiple reasoning benchmarks show that FERA consistently outperforms both federated training and training-free baselines, achieving progressively higher accuracy across rounds while maintaining communication and computational efficiency.
arXiv arXiv cs.LG · 6 小时前 · 相关度 75% 热度★★☆☆☆
170
CPEMH: An Agentic Framework for Prompt-Driven Behavior Evaluation and Assurance in Foundation-Model Systems for Mental Health Screening
CPEMH:面向心理健康筛查的基础模型系统中提示驱动行为评估与保障的智能体框架
开发工具学术论文

本文提出了CPEMH,一个用于评估基础模型在心理健康筛查转录数据中提示驱动行为的智能体框架。该框架采用编排器、推理和评估智能体的模块化设计,实现提示策略的自动设计、评估和选择,确保行为可追溯、可复现和鲁棒。通过在自动化抑郁筛查访谈转录上的案例研究,展示了框架在临床敏感领域稳定和审计基础模型行为的能力,并强调了模块化编排、稳定性优先以及结合F1、偏差和鲁棒性作为核心验收标准。

arXiv:2605.11341v1 Announce Type: new Abstract: This paper presents CPEMH, an agentic framework designed to evaluate prompt-driven behavior in foundation-model systems operating on transcript-based datasets for mental-health screening. CPEMH serves as an engineering methodology for behavioral assurance in large-scale language systems, introducing an orchestrated architecture that autonomously performs the design, evaluation, and selection of prompt strategies, enabling systematic control of behavioral variability across contexts. Its modular agentic design, combining orchestrator, inference, and evaluation agents, ensures traceability, reproducibility, and robustness throughout the prompting lifecycle. A case study on automated depression screening from interview transcripts demonstrates the framework&#39;s capacity to stabilize and audit foundation-model behavior in conversational and clinically sensitive domains. Lessons learned emphasize the role of modular orchestration in behavioral assurance, the prioritization of stability over architectural complexity, and the integration of F1, bias, and robustness as core acceptance criteria.
arXiv arXiv cs.AI · 6 小时前 · 相关度 72% 热度★★☆☆☆
171
Curriculum Learning-Guided Progressive Distillation in Large Language Models
课程学习引导的大语言模型渐进式蒸馏
训练微调学术论文

本文提出课程学习引导的渐进式蒸馏(CLPD)框架,解决知识蒸馏中训练数据学习顺序与学生-教师模型容量不匹配的问题。CLPD通过构建从易到难的显式样本课程,同时渐进式引入更强教师的隐式监督课程,将数据难度与教师能力对齐。该模块化框架可无缝集成到标准蒸馏算法中,实验表明在推理基准上一致优于仅数据排序或仅教师调度的方案。

arXiv:2605.11260v1 Announce Type: new Abstract: Knowledge distillation is a key technique for transferring the capabilities of large language models (LLMs) into smaller, more efficient student models. Existing distillation approaches often overlook two critical factors: the learning order of training data and the capacity mismatch between teacher and student models. This oversight limits distillation performance, as manifested by the counter-intuitive phenomenon where stronger teachers fail to produce better students. In this work, we propose Curriculum Learning-Guided Progressive Distillation (CLPD), a unified framework that explicitly accounts for both factors by aligning data difficulty with teacher strength. CLPD constructs an explicit curriculum by organizing training examples from easy to hard, while simultaneously applying an implicit curriculum over supervision signals by progressively scheduling teachers of increasing capacity. Our framework is modular and can be integrated into standard distillation algorithms with minimal overhead. Empirical results on the reasoning benchmarks demonstrate that CLPD consistently outperforms standard distillation, data ordering alone, and teacher scheduling alone across multiple settings. These findings highlight the importance of jointly considering data ordering and teacher capacity when distilling reasoning abilities into small language models.
arXiv arXiv cs.LG · 6 小时前 · 相关度 72% 热度★★☆☆☆
172
From Noise to Diversity: Random Embedding Injection in LLM Reasoning
从噪声到多样性:大语言模型推理中的随机嵌入注入
推理部署训练微调学术论文

本文提出随机软提示(RSP),在LLM推理时直接注入从高斯分布采样的随机嵌入向量,无需训练即可达到与优化软提示相近的数学推理准确率。其机制在于随机位置迫使注意力分布扁平化,使早期生成token的多样性提升,进而通过温度采样扩大Pass@N(N次尝试中至少一次正确的概率)。该方法还可扩展至DAPO训练,带来实际收益。

arXiv:2605.11936v1 Announce Type: new Abstract: Recent soft prompt research has tried to improve reasoning by inserting trained vectors into LLM inputs, yet whether the gain comes from the learned content or from the act of injection itself has not been carefully separated. We study Random Soft Prompts (RSPs), which drop the training step entirely and append a freshly drawn sequence of random embedding vectors to the input. Each RSP vector is sampled from an isotropic Gaussian fitted to the entrywise mean and variance of the pretrained embedding table; the sequence carries no learned content, and yet reaches accuracy comparable to optimized soft prompts on math reasoning benchmarks in several settings. The mechanism unfolds in two stages: because attention has to absorb a never-seen-before random position, the distribution over the first few generated tokens flattens and reasoning trajectories branch, and as generation continues this influence dilutes naturally so the response commits to a single completion. We show that during inference RSPs lift early-stage token diversity and, combined with temperature sampling, widen Pass@N, the probability that at least one out of N attempts is correct. Beyond inference, we carry the same effect into DAPO training and demonstrate practical gains. Our contributions are: (i) RSP isolates the simplest form of soft prompt -- training-free, freshly resampled -- providing a unified lens for the structural effect of injection that variants otherwise differing in training and form all share; (ii) a theoretical and empirical validation of the underlying mechanism; and (iii) an extension from inference to training.
arXiv arXiv cs.AI · 6 小时前 · 相关度 72% 热度★★☆☆☆
173
Towards Visually Grounded Multimodal Summarization via Cross-Modal Transformer and Gated Attention
迈向视觉基础的多模态摘要:基于跨模态Transformer和门控注意力
基础大模型学术论文

该论文提出SPeCTrA-Sum框架,通过深度视觉处理器(DVP)将视觉编码器与语言模型在对应深度进行层次化融合,缓解表征失配问题,并利用视觉相关性预测器(VRP)结合行列式点过程蒸馏,选择显著且多样化的代表性图像。训练采用自回归摘要、跨模态对齐和DPP蒸馏的多目标损失,实验表明在生成视觉一致的摘要和图像选择上均取得更优效果。

arXiv:2605.11753v1 Announce Type: new Abstract: Multimodal summarization requires models to jointly understand textual and visual inputs to generate concise, semantically coherent summaries. Existing methods often inject shallow visual features into deep language models, leading to representational mismatches and weak cross-modal grounding. We propose a unified framework that jointly performs text summarization and representative image selection. Our system, SPeCTrA-Sum (Sampler Perceiver with Cross-modal Transformer and gated Attention for Summarization), introduces two key innovations. First, a Deep Visual Processor (DVP) aligns the visual encoder with the language model at corresponding depths, enabling hierarchical, layer-wise fusion that preserves semantic consistency. Second, a lightweight Visual Relevance Predictor (VRP) selects salient and diverse images by distilling soft labels from a Determinantal Point Processes (DPP) teacher. SPeCTrA-Sum is trained using a multi-objective loss that combines autoregressive summarization, cross-modal alignment, and DPP-based distillation. Experiments show that our system produces more accurate, visually grounded summaries and selects more representative images, demonstrating the benefits of depth-aware fusion and principled image selection for multimodal summarization.
arXiv arXiv cs.AI · 6 小时前 · 相关度 72% 热度★★☆☆☆
174
A CAP-like Trilemma for Large Language Models: Correctness, Non-bias, and Utility under Semantic Underdetermination
大语言模型的类CAP三难困境:语义不确定下的正确性、无偏性与效用
基础大模型

本文借鉴分布式系统的CAP定理,形式化了大语言模型在语义不确定场景下的三难问题:强正确性、严格无偏性和高效用性不能同时保证。当用户前提不决定唯一答案时,模型提供有用回答需引入偏好或先验,导致偏见;若避免无依据偏好则可能降低效用。论文通过例子论证了某些LLM失败源于不确定请求的结构性约束,而非模型缺陷。

arXiv:2605.11672v1 Announce Type: new Abstract: The CAP theorem states that a distributed system cannot simultaneously guarantee consistency, availability, and partition tolerance under network partition. Inspired by this result, this paper formulates a CAP-like conjecture for Large Language Models (LLMs). The proposed trilemma states that, under semantic underdetermination, an LLM cannot always simultaneously guarantee strong correctness, strict non-bias, and high utility. A prompt is semantically underdetermined when the given premises do not determine a unique answer. In such cases, a useful and decisive response requires the model to introduce a selection criterion, preference, prior, or value ordering. If this criterion is not supplied by the user or justified by the available premises, the response becomes biased in a broad selection-theoretic sense. Conversely, if the model avoids unsupported preferences, it may preserve correctness and non-bias but may reduce utility through refusal, hedging, or clarification. The paper formalizes this correctness--non-bias--utility trilemma, develops examples, and argues that certain LLM failures arise not merely from model limitations but from the structure of underdetermined decision requests.
arXiv arXiv cs.AI · 6 小时前 · 相关度 72% 热度★★☆☆☆
175
Can LLM Agents Respond to Disasters? Benchmarking Heterogeneous Geospatial Reasoning in Emergency Operations
LLM智能体能应对灾害吗?应急行动中异构地理空间推理的基准测试
学术论文基础大模型

本文提出灾害应急响应智能体基准测试集 DORA,覆盖 45 个真实灾害事件的 515 个专家级任务,包含光学、SAR、多光谱等异构地理空间数据及 108 个工具的 MCP 库。在 13 个前沿大模型上的评估揭示了三个持续挑战:灾害领域语义对齐失败、工具选择与参数接地构成双层瓶颈、长链条组合脆弱性导致智能体与专家轨迹差距随步骤增加而急剧扩大。

arXiv:2605.11633v1 Announce Type: new Abstract: Operational disaster response goes beyond damage assessment, requiring responders to integrate multi-sensor signals, reason over road networks, populations and key facilities, plan evacuations, and produce actionable reports. However, prior work largely isolates remote-sensing perception or evaluates generic tool use, leaving the end-to-end workflows of emergency operations underexplored. In this paper, we introduce Disaster Operational Response Agent benchmark (DORA), the first agentic benchmark for end-to-end disaster response: 515 expert-authored tasks across 45 real-world disaster events spanning 10 types, paired with expert-verified, replayable gold trajectories totaling 3,500 tool-call steps. Tasks span five dimensions that cover the operational disaster-response pipeline: disaster perception, spatial relational analysis, rescue and evacuation planning, temporal evolution reasoning, and multi-modal report synthesis. Agents compose calls from a 108-tool MCP library over heterogeneous geospatial data: optical, SAR, and multi-spectral imagery across single-, bi-, and multi-temporal sequences (0.015-10m GSD), complemented by elevation and social vector layers. We comprehensively evaluate 13 frontier LLMs on our benchmark, revealing three persistent challenges: 1) disaster-domain grounding exposes unique failure modes (damage-semantic grounding, sensor-modality mismatch, and disaster-pipeline composition); 2) agents are doubly bottlenecked by tool selection and argument grounding, where gold tool-order hints improve accuracy by only 1.08-4.40%, and alternative scaffolds yield at most a 3.24% gain; 3) compositional fragility scales with trajectory length, the agent-to-gold gap widening from 7% to 56% on long pipelines. DORA establishes a rigorous testbed for operationally reliable disaster-response agents.
arXiv arXiv cs.AI · 6 小时前 · 相关度 72% 热度★★☆☆☆
176
训练微调学术论文

本文通过稀疏自编码器(SAE)研究大模型在监督微调(SFT)前后隐藏激活的表示差异。尽管余弦相似度表明激活几何几乎未变,但SAE投影显示底层的稀疏潜在变量显著分化。作者提出了一套分析流水线,发现特定任务和特定层的语义特征被系统性改变,并识别出安全对齐特有的逐层更新模式。代码已开源,该方法为理解微调过程提供了高分辨率的机理诊断工具。

arXiv:2605.11426v1 Announce Type: new Abstract: The cosine similarity between a large language model&#39;s hidden activations before and after Supervised Fine-Tuning (SFT) remains very high. This, at first glance, suggests that SFT leaves the model&#39;s activation geometry largely undisturbed. However, projecting both sets of activations through a Sparse Autoencoder (SAE) pretrained on the base model reveals that the underlying sparse latents diverge significantly. We introduce a novel investigative pipeline which utilizes these pretrained SAEs as a high-resolution diagnostic tool to mechanistically investigate the drivers of this representational divergence. Through our analytical pipeline, we discover task-specific and layer-specific distributions of the precise semantic features that are systematically altered during supervised fine-tuning. We additionally identify a layer-wise update profile specific to safety alignment. All code, experimental scripts, and analysis files associated with this work are publicly available at: https://github.com/ruhzi/sae-investigation.
arXiv arXiv cs.AI · 6 小时前 · 相关度 72% 热度★★☆☆☆
177
OGLS-SD: On-Policy Self-Distillation with Outcome-Guided Logit Steering for LLM Reasoning
OGLS-SD:面向LLM推理的基于结果引导logit引导的在策略自我蒸馏
训练微调推理部署

该论文针对大语言模型推理能力的在策略自我蒸馏(OPSD)中存在的教师-学生响应不匹配问题,提出OGLS-SD框架。该方法利用可验证的结果奖励对比成功与失败的在策略轨迹,校准教师logits,通过结合结果级别的正确性信号与稠密的token级别引导,稳定了自我蒸馏过程,在多个基准上提升了推理表现。

arXiv:2605.12400v1 Announce Type: new Abstract: We study {on-policy self-distillation} (OPSD), where a language model improves its reasoning ability by distilling privileged teacher distributions along its own on-policy trajectories. Despite the performance gains of OPSD, we identify a common but often overlooked mismatch between teacher and student responses: self-reflected teacher responses can be shifted by reflection-induced bias and response templates, leading to miscalibrated token-level supervision. To mitigate this issue, we propose \methodname, an outcome-guided logit-steering framework that leverages verifiable outcome rewards to contrast successful and failed on-policy trajectories and calibrate teacher logits. By combining outcome-level correctness with dense token-level guidance through logit steering, \methodname stabilizes self-distillation and improves reasoning performance over standard OPSD and other variants across diverse benchmarks.
arXiv arXiv cs.LG · 6 小时前 · 相关度 72% 热度★★☆☆☆
178
Procedural-skill SFT across capacity tiers: A W-Shaped pre-SFT Trajectory and Regime-Asymmetric Mechanism on 0.8B-4B Qwen3.5 Models
跨容量层级的程序性技能SFT:0.8B-4B Qwen3.5模型上的W形预SFT轨迹与规模不对称机制
训练微调学术论文

本文在0.8B、2B、4B三种规模的Qwen3.5稠密模型上测量了程序性技能SFT的贡献,发现SFT带来的绝对提升在不同规模上大致均匀(约+0.04~+0.075),但预SFT基础模型性能呈W形轨迹,存在规模不对称机制。实验使用包含353个(任务+程序步骤、思维链、评判通过)样例的语料,并发现基准测试存在格式偏差。跨模型验证(GPT-5.4)确认了每项发现的方向一致性(Cohen's κ≥0.754)。该研究揭示了SFT效果与模型容量之间的复杂关系,为不同规模模型的微调策略提供了经验依据。

arXiv:2605.11907v1 Announce Type: new Abstract: We measure procedural-skill SFT contribution across three Qwen3.5 dense scales (0.8B, 2B, 4B) on a 200-task / 40-skill holdout, with Claude Haiku 4.5 as a frontier reference. The corpus is 353 rows of (task + procedural-skill block, Opus chain-of-thought, judge-pass) demonstrations. \textbf{Main finding.} Under matched-path LLM-only scoring, the SFT-attributable procedural-$\Delta$ lift is roughly uniform across sizes: $+0.070$ / $+0.040$ / $+0.075$ at 0.8B / 2B / 4B. Variation in post-SFT $\Delta$ ($-0.005$, $+0.100$, $+0.065$) is dominated by a W-shaped pre-SFT base trajectory ($-0.075$, $+0.060$, $-0.010$, Haiku-4-5 at $+0.030$): the 5-step procedure hurts 0.8B and 4B, helps 2B, and helps frontier Haiku modestly. SFT works hardest in absolute terms where the base struggles with the procedure -- a regime-asymmetric pattern with a falsifiable prediction at 8B/14B. \textbf{Methodology.} (i) A bench format-compliance artifact: 83.5\% of the holdout uses a deterministic \texttt{ANSWER}-line extractor that under-counts free-form conclusions; an LLM-only re-judge reveals it was systematically biased against \CU. (ii) A negative-iteration sequence at 0.8B: five recipe variants cluster post-SFT \CU{} pass-rate within a 2\,pp band, constraining the absolute-pass-rate ceiling to base capacity rather than recipe. \textbf{Cross-family validation.} GPT-5.4 via OpenRouter on all 7 configurations (2800 paired episodes) agrees on the direction of every per-student finding: Cohen&#39;s $\kappa \geq 0.754$, agreement $\geq 93.25\%$. Earlier ``format-only at 0.8B&#39;&#39; and ``shrinking SFT at 4B&#39;&#39; framings were path-mismatch artifacts; this paper supersedes both (Appendix~\ref{sec:appendix-path}). Single-seed; threats in \S\ref{sec:threats}.
arXiv arXiv cs.LG · 6 小时前 · 相关度 72% 热度★★☆☆☆
179
Gradient-Free Noise Optimization for Reward Alignment in Generative Models
生成模型中奖励对齐的零阶噪声优化
推理部署学术论文

提出 ZeNO(零阶噪声优化)框架,用于扩散和流模型中的奖励对齐,将噪声优化建模为路径积分控制问题,仅需零阶奖励评估即可估计更新方向,无需生成器可微。当结合 Ornstein-Uhlenbeck 参考过程时,更新隐含执行对奖励倾斜分布的朗之万动力学采样。该方法支持推理时缩放,在多个生成器和奖励函数上表现优异,并可应用于蛋白质结构生成等不可微分奖励场景。

arXiv:2605.11347v1 Announce Type: new Abstract: Existing reward alignment methods for diffusion and flow models rely on multi-step stochastic trajectories, making them difficult to extend to deterministic generators. A natural alternative is noise-space optimization, but existing approaches require backpropagation through the generator and reward pipeline, limiting applicability to differentiable settings. To address this, here we present ZeNO (Zeroth-order Noise Optimization), a gradient-free framework that formulates noise optimization as a path-integral control problem, estimable from zeroth-order reward evaluations alone. When instantiated with an Ornstein--Uhlenbeck reference process, the update connects to Langevin dynamics implicitly targeting a reward-tilted distribution. ZeNO enables effective inference-time scaling and demonstrates strong performance across diverse generators and reward functions, including a protein structure generation task where backpropagation is infeasible.
arXiv arXiv cs.LG · 6 小时前 · 相关度 70% 热度★★☆☆☆
180
Counterfactual Trace Auditing of LLM Agent Skills
LLM Agent技能的反事实追踪审计
学术论文开发工具

本文提出反事实追踪审计(CTA)框架,通过对比同一任务下agent在使用和不使用某技能时的行为轨迹,将其分割为阶段并对齐,生成结构化的技能影响模式(SIP)标注。在SWE-Skills-Bench上对Claude的49个软件工程任务进行审计,发现通过率平均仅变化+0.3个百分点,但CTA识别出522个SIP实例,揭示了技能显著改变了agent行为而通过率几乎不变的评价缺口。审计还区分了通过率无法检测的几种重复性影响,如字面模板复制、脱离任务创建、过度规划和任务恢复,并发现不同基线性能的任务存在不同的主导影响模式。

arXiv:2605.11946v1 Announce Type: new Abstract: Large Language Model agents are increasingly augmented with agent skills. Current evaluation methods for skills remain limited. Most deployed benchmarks report only pass rate before and after a skill is attached, treating the skill as a black box change to agent behavior. We introduce Counterfactual Trace Auditing (CTA), a framework for measuring how a skill changes agent behavior. CTA pairs each with skill agent trace with a without skill counterpart on the same task, segments both traces into goal directed phases, aligns the phases, and emits structured Skill Influence Pattern (SIP) annotations. These annotations describe the behavioral effect of a skill rather than only its task outcome. We instantiate CTA on SWE-Skills-Bench with Claude across 49 software engineering tasks. The resulting audit reveals a clear evaluation gap. Pass rate changes by only +0.3 percentage points on average, suggesting little aggregate effect. Yet CTA identifies 522 SIP instances across the same paired traces, showing that the skills substantially reshape agent behavior even when pass rate is nearly unchanged. The audit also separates several recurring effects that pass rate cannot detect, including literal template copying, off task artifact creation, excess planning, and task recovery. Three findings emerge. First, high baseline tasks contain most of the observed skill effects, although their pass rate is already saturated and therefore cannot reflect those effects. Second, tasks with moderate baseline performance show the most recoverable gain, but often at substantially higher token cost. Third, the dominant SIP type can be identified by baseline bucket: surface anchoring is most common on ceiling tasks and edge-case prompting is most common on mid-range and floor tasks. These regularities turn informal failure mode observations into reproducible behavioral measurements.
arXiv arXiv cs.AI · 6 小时前 · 相关度 70% 热度★★☆☆☆
181
QuIDE: Mastering the Quantized Intelligence Trade-off via Active Optimization
QuIDE: 通过主动优化掌握量化智能权衡
推理部署学术论文

本文提出统一的量化神经网络效率评估指标QuIDE,基于智能指数I=(C×P)/log₂(T+1)将压缩率、精度和延迟权衡整合为单一分数。在SimpleCNN、ResNet-18和Llama-3-8B等六种设置上的实验表明,Pareto最优位数具有任务依赖性:MNIST和大语言模型偏向4-bit量化,而复杂CNN任务(ResNet-18@ImageNet)的最佳点为8-bit,4-bit训练后量化会导致精度灾难性下降。准确度门控变体I'能正确标记原始I会奖励的不可行配置。QuIDE为混合精度搜索提供了可重现的评估协议和即用型适应度函数。

arXiv:2605.10959v1 Announce Type: new Abstract: There is currently no unified metric for evaluating the efficiency of quantized neural networks. We propose QuIDE, built around the Intelligence Index I = (C x P)/log_2(T+1), which collapses the compression-accuracy-latency trade-off into a single score. Experiments across six settings -- SimpleCNN (MNIST, CIFAR), ResNet-18 (ImageNet-1K), and Llama-3-8B -- show a task-dependent Pareto Knee. 4-bit quantization is optimal for MNIST and large LLMs, while 8-bit is the sweet spot for complex CNN tasks (ResNet-18 on ImageNet), where 4-bit PTQ collapses accuracy catastrophically. The accuracy-gated variant I&#39; correctly flags these non-viable configurations that the raw I would reward. QuIDE provides a reproducible evaluation protocol and a ready-to-use fitness function for mixed-precision search.
arXiv arXiv cs.LG · 6 小时前 · 相关度 70% 热度★★☆☆☆
182
Optimistic Dual Averaging Unifies Modern Optimizers
乐观对偶平均统一现代优化器
训练微调学术论文

本文提出SODA框架,一种乐观对偶平均的推广,统一了Muon、Lion、AdEMAMix和NAdam等SOTA优化器,将它们视为该框架的乐观实例。在此基础上设计了一个实用封装器,通过理论上合理的1/k衰减策略消除权重衰减的超参数调整。实验结果表明,SODA在各种尺度和训练时长下均能稳定提升性能,且无需额外超参调节。

arXiv:2605.11172v1 Announce Type: new Abstract: We introduce SODA, a generalization of Optimistic Dual Averaging, which provides a common perspective on state-of-the-art optimizers like Muon, Lion, AdEMAMix and NAdam, showing that they can all be viewed as optimistic instances of this framework. Based on this framing, we propose a practical SODA wrapper for any base optimizer that eliminates weight decay tuning through a theoretically-grounded $1/k$ decay schedule. Empirical results across various scales and training horizons show that SODA consistently improves performance without any additional hyperparameter tuning.
arXiv arXiv cs.LG · 6 小时前 · 相关度 70% 热度★★☆☆☆
183
GRAFT-ATHENA: Self-Improving Agentic Teams for Autonomous Discovery and Evolutionary Numerical Algorithms
GRAFT-ATHENA: 用于自主发现和进化数值算法的自我改进型代理团队
开发工具学术论文

GRAFT-ATHENA 是一个自我改进的代理框架,利用 LLM 驱动的规划器、求解器和评估器,将组合决策空间投影为因子化概率树,实现从指数到线性的参数压缩。框架通过将方法作为指纹嵌入度量空间,让新问题能够学习相似历史案例,在物理信息机器学习基准上超越人类与先前代理基线,还能解决重建阿波罗指令舱马赫10流场等复杂工程问题,并自主发现具有指数收敛特性的谱 PINN 等新数值方法。

arXiv:2605.11117v1 Announce Type: new Abstract: Scientific discovery can be modeled as a sequence of probabilistic decisions that map physical problems to numerical solutions. Recent agentic AI systems automate individual scientific tasks by orchestrating LLM-driven planners, solvers, and evaluators. Each method is a combination of methodological actions, with many viable combinations for any given problem and structural dependencies between choices. However, existing frameworks treat each problem in isolation, with no shared substrate to accumulate methodological experience across domains. Here we show that GRAFT-ATHENA, a self-improving agentic framework, learns from past problems and autonomously expands its own action space across diverse domains. GRAFT (Graph Reduction to Adaptive Factored Trees) projects combinatorial decision spaces into factored probabilistic trees in which each method is a single path, taking the parameter footprint from exponential to linear. In the lineage of classical Bayesian networks, the factorization is an $I$-map of the policy, and the resulting paths embed as unique fingerprints in a metric space whose closeness lets each new problem learn from similar past ones. On canonical physics-informed machine learning (PIML) benchmarks, GRAFT-ATHENA improves over human and prior agentic baselines, and on production solvers, it tackles complex engineering problems such as reconstructing Mach-10 flow over the Apollo Command Module from a 1968 report and recovering shear-thinning blood-cell rheology. Notably, the system grows its own knowledge substrate, autonomously proposing regularization constraints for ill-posed inverse problems and discovering new numerical methods such as a spectral PINN with exponential convergence. These results provide a foundation for autonomous laboratories that grow more capable with every problem they solve.
arXiv arXiv cs.LG · 6 小时前 · 相关度 70% 热度★★☆☆☆
184
How to Eliminate Pipeline Friction in AI Model Serving
如何消除AI模型服务中的流水线摩擦
推理部署

文章探讨了AI模型从训练到生产部署过程中常见的流水线摩擦问题,包括模型导出、格式转换、推理引擎适配等环节的低效。重点介绍利用TensorRT等NVIDIA推理优化工具简化部署流程、提升吞吐量的方案,并结合实际案例展示端到端性能优化的最佳实践。

<img width="768" height="432" src="https://developer-blogs.nvidia.com/wp-content/uploads/2026/05/tensorrt-optimized-industries-1-768x432.png" alt="" decoding="async" srcset="https://developer-blogs.nvidia.com/wp-content/uploads/2026/05/tensorrt-optimized-industries-1-768x432.png 768w, https://developer-blogs.nvidia.com/wp-content/uploads/2026/05/tensorrt-optimized-industries-1-179x101.png 179w, https://developer-blogs.nvidia.com/wp-content/uploads/2026/05/tensorrt-optimized-industries-1-300x169.png 300w, https://developer-blogs.nvidia.com/wp-content/uploads/2026/05/tensorrt-optimized-industries-1-625x352.png 625w, https://developer-blogs.nvidia.com/wp-content/uploads/2026/05/tensorrt-optimized-industries-1-1536x864.png 1536w, https://developer-blogs.nvidia.com/wp-content/uploads/2026/05/tensorrt-optimized-industries-1-645x363.png 645w, https://developer-blogs.nvidia.com/wp-content/uploads/2026/05/tensorrt-optimized-industries-1-660x370.png 660w, https://developer-blogs.nvidia.com/wp-content/uploads/2026/05/tensorrt-optimized-industries-1-500x281.png 500w, https://developer-blogs.nvidia.com/wp-content/uploads/2026/05/tensorrt-optimized-industries-1-160x90.png 160w, https://developer-blogs.nvidia.com/wp-content/uploads/2026/05/tensorrt-optimized-industries-1-362x204.png 362w, https://developer-blogs.nvidia.com/wp-content/uploads/2026/05/tensorrt-optimized-industries-1-195x110.png 195w, https://developer-blogs.nvidia.com/wp-content/uploads/2026/05/tensorrt-optimized-industries-1-1024x576.png 1024w, https://developer-blogs.nvidia.com/wp-content/uploads/2026/05/tensorrt-optimized-industries-1-960x540.png 960w, https://developer-blogs.nvidia.com/wp-content/uploads/2026/05/tensorrt-optimized-industries-1.webp 1999w" sizes="(max-width: 768px) 100vw, 768px" title="tensorrt-optimized-industries" loading="lazy"/>The path from a trained AI model to production should be smooth, but rarely is. Many teams invest weeks fine-tuning models, only to discover that exporting to a...<img
AI Chip NVIDIA Technical Blog · 16 小时前 · 相关度 85% 热度★★☆☆☆
185
Efficient Edge AI on Arm CPUs and NPUs: Understanding ExecuTorch through Practical Labs
在 Arm CPU 和 NPU 上实现高效边缘 AI:通过实践实验室理解 ExecuTorch
推理部署开发工具

ExecuTorch 将 PyTorch 模型部署到资源受限的边缘设备,支持本地推理。Arm 推出了一套 Jupyter 动手实验室,覆盖 CPU(Cortex-A)和 NPU(Cortex-M + Ethos-U)平台,演示从模型导出、格式转换到 Runtime 执行的完整流程。实验室集成 Arm 开发的 Model Explorer 适配器,帮助开发者可视化模型部署过程,降低边缘 AI 工程化门槛。

<p>TL;DR:</p> <ul> <li>ExecuTorch extends the PyTorch ecosystem to deliver local AI inference on constrained edge devices. To provide a practical entry point, Arm has created a <a href="https://github.com/arm-education/executorch_on_arm_labs" rel="noopener noreferrer" referrerpolicy="no-referrer" target="_blank">set of Jupyter Labs</a> that complement the official ExecuTorch documentation while explaining both the <i>how</i> and the <i>why</i> of each step. </li> <li>The blog and labs introduce both CPU and NPU inference, across Cortex-A and Cortex-M + Ethos-U platforms, and showcase use of Model Explorer adapters, developed by Arm, to gain visibility into model deployment with ExecuTorch. </li> </ul> <p>AI is rapidly and undisputedly becoming part of how we work and live. But today, much of that intelligence is still tied to the cloud, accessed through APIs and web interfaces. </p> <p>That model doesn’t always fit. Businesses increasingly want to bring AI closer to where it’s actually used—on devices like wearables, smart cameras, and other low-power edge systems. Running AI locally can reduce latency, improve privacy, and unlock new real-time capabilities, but it also introduces a new challenge: how do you run complex models efficiently on constrained hardware with limited memory, compute, and power? </p> <p>PyTorch has become the foremost framework for training and inferencing AI models in the cloud. ExecuTorch extends that ecosystem to bring local AI inference to the edge. It takes a PyTorch model, exports it into a lightweight format, and runs it through a runtime built specifically for edge inference. If you’re already familiar with PyTorch, the appeal is clear: you stay in the same ecosystem, while gaining a deployment path better suited to real devices.</p> <p>To make this practical, Arm has created a set of hands-on Jupyter labs that walk through the deployment process—from CPU inference on a Raspberry Pi through to hardware acceleration on Ethos-U NPUs. Wh
AI Tools Blog – PyTorch · 18 小时前 · 相关度 85% 热度★★☆☆☆
186
NVIDIA and SAP Bring Trust to Specialized Agents
NVIDIA 与 SAP 携手为专业 AI 代理注入信任
开发工具行业资讯

NVIDIA 与 SAP 宣布扩大合作,将开源运行时 OpenShell 嵌入 SAP Business AI 平台,为专业 AI 代理提供隔离执行环境、文件系统和网络层策略执行以及基础架构级容器,保障代理在企业系统中的安全运行。SAP 工程师还与 NVIDIA 共同参与 OpenShell 的开发,将其作为 Joule Studio 构建的所有 AI 代理的运行时安全层。此举旨在帮助企业在从 AI 助手向自主代理过渡时,通过边界控制、策略执行和审计跟踪建立生产级信任。

<p>From finance and procurement to supply chain and manufacturing, specialized AI agents are moving into the enterprise systems where business decisions are made, data is accessed and workflows run at scale. </p> <p>Announced today at SAP Sapphire — where NVIDIA founder and CEO Jensen Huang joined SAP CEO Christian Klein’s keynote by video — SAP and NVIDIA’s expanded collaboration helps enterprises run specialized agents with security and governance controls. </p> <p>SAP embeds <a href="https://build.nvidia.com/openshell" rel="noopener noreferrer" referrerpolicy="no-referrer" target="_blank">NVIDIA OpenShell</a> — an open source runtime for securely developing and deploying autonomous AI agents — into SAP Business AI Platform. In addition, SAP engineers are codesigning OpenShell alongside NVIDIA, contributing back to the open source project.</p> <p>OpenShell provides isolated execution environments, policy enforcement at the filesystem and network layers, and infrastructure-level containment that guards against damage when agent logic fails. </p> <p>Within SAP Business AI Platform, OpenShell is the runtime security layer for all SAP AI agents, including custom agents built in Joule Studio — SAP’s environment for building and managing end-to-end enterprise agents.</p> <p>For enterprises, the shift from AI assistants to autonomous agents changes the trust equation. An agent that can touch systems of record, cross application boundaries and operate without review at every step needs boundaries, policy enforcement and an audit trail before it can become part of production work. That’s what SAP and NVIDIA are collaborating to address.</p> <h2><b>Why the Application Layer Matters</b></h2> <p>Huang has described <a href="https://blogs.nvidia.com/blog/ai-5-layer-cake/" rel="noopener noreferrer" referrerpolicy="no-referrer" target="_blank">AI as a five-layer cake</a>: energy, chips, infrastructure, models and applications. </p> <p>Applications sit at the top, where AI create
AI Chip NVIDIA Blog · 22 小时前 · 相关度 78% 热度★★☆☆☆
187
RewardHarness: Self-Evolving Agentic Post-Training
RewardHarness:自进化的代理式后训练框架
训练微调基础大模型学术论文

RewardHarness提出一种自进化的代理式奖励框架,将奖励建模转化为上下文进化而非权重优化,仅需100个偏好示例即可迭代进化工具与技能库。编排器动态选择相关工具和技能,由冻结子代理构建推理链产生偏好判断,并自动优化库而无需额外人类标注。仅使用0.05%的EditReward数据,该框架在图像编辑评估基准上平均准确率达到47.4%,超越GPT-5 5.3个百分点;作为GRPO微调的奖励信号时,RL调优模型在ImgEdit-Bench上获得3.52分。

arXiv:2605.08703v1 Announce Type: new Abstract: Evaluating instruction-guided image edits requires rewards that reflect subtle human preferences, yet current reward models typically depend on large-scale preference annotation and additional model training. This creates a data-efficiency gap: humans can often infer the target evaluation criteria from only a few examples, while models are usually trained on hundreds of thousands of comparisons. We present RewardHarness, a self-evolving agentic reward framework that reframes reward modeling as context evolution rather than weight optimization. Instead of learning from large-scale annotations, RewardHarness aligns with human preferences by iteratively evolving a library of tools and skills from as few as 100 preference demonstrations. Given a source image, candidate edited images, and an editing instruction, an Orchestrator selects the most relevant subset of tools and skills from the maintained library, and a frozen Sub-Agent uses them to construct a reasoning chain that produces a preference judgment. By comparing predicted judgments with ground-truth preferences and analyzing successes and failures in the reasoning process, the Orchestrator automatically refines its library of tools and skills without additional human annotation. Using only 0.05% of the EditReward preference data, RewardHarness achieves 47.4% average accuracy on image-editing evaluation benchmarks, surpassing GPT-5 by 5.3 points. When used as a reward signal for GRPO fine-tuning, RL-tuned models achieve 3.52 on ImgEdit-Bench. Project page: https://rewardharness.com.
arXiv arXiv cs.AI · 1 天前 · 相关度 85% 热度★★☆☆☆
188
PAAC: Privacy-Aware Agentic Device-Cloud Collaboration
PAAC:隐私感知的代理设备-云协作
推理部署学术论文

PAAC提出一种隐私感知的LLM代理框架,通过将规划器-执行器分解对齐到设备-云端边界,使角色分工本身成为隐私保护机制。云端代理使用保留推理角色但不含敏感内容的类型化占位符进行推理,设备代理负责识别敏感片段并将执行结果提炼为紧凑的关键发现,所有替换和逆变换由确定性注册表完成,从而在设备端直接执行动作。在严格隐私设置下的三个代理基准上,PAAC将平均准确率提升15-36%,平均数据泄漏降低2-6倍,并在数学、科学、金融等10个领域的17项额外基准中取得一致改进。

arXiv:2605.08646v1 Announce Type: new Abstract: Large language model (LLM) agents face a structural tension: cloud agents provide strong reasoning but expose user data, while on-device agents preserve privacy at the cost of overall capability. Existing device-cloud designs treat this boundary as a compute split rather than a trust boundary suited to agentic workloads, and existing sanitizers force a choice between policy flexibility and the structural fidelity tool calls require. In this work, we develop PAAC, a privacy-aware agentic framework that aligns planner--executor decomposition with the device-cloud boundary so that role specialization itself becomes the privacy mechanism. The cloud agent reasons over typed placeholder tokens that preserve each sensitive value&#39;s reasoning role while discarding its content, while the on-device agent identifies sensitive spans and distills each step&#39;s execution outcome into compact key findings. Sanitization confines the on-device LLM to proposing which spans to mask, while a deterministic registry performs all substitution and reversal, keeping actions directly executable on device. On three agentic benchmarks under strict privacy settings, PAAC dominates the Pareto frontier of privacy and accuracy, improving average accuracy by 15-36\% and reducing average leakage by 2-6$\times$ over state-of-the-art device-cloud baselines, with the largest margins on privacy targets outside fixed entity taxonomies. We find consistent improvements on 17 additional benchmarks spanning 10 domains, including math, science, and finance.
arXiv arXiv cs.LG · 1 天前 · 相关度 85% 热度★★☆☆☆
189
Kaczmarz Linear Attention
Kaczmarz 线性注意力
基础大模型性能优化

本文提出 Kaczmarz 线性注意力(KLA),一种对 Gated DeltaNet(GDN)的单标量修改,从在线回归目标出发,利用 Kaczmarz 投影方法推导出键范数归一化动态步长 β_t,用于优化状态更新。KLA 保持了原始的状态形状、门控、线性递归和块并行算法,在 0.4B 规模、1B token 训练预算下验证困惑度达到 8.09(优于 GDN 的 8.50),并稳定扩展至 65K 上下文。在控制任务上,KLA 实现单针检索 100%、8 倍多查询关联回忆提升 7.03 分,以及在 32K 上下文解码吞吐量提升 2.1 倍,表明键范数归一化系数是 delta-rule 序列模型在精度、外推和解码效率上的重要设计维度。

arXiv:2605.08587v1 Announce Type: new Abstract: Long-context language modeling remains central to modern sequence modeling, but the quadratic cost of Transformer attention makes scaling computationally prohibitive. Linear recurrent models address this bottleneck by compressing the context into a fixed-size state, making the rule that forgets, writes, and edits information a central design problem. To address state maintenance, Gated DeltaNet (GDN) combines gated state decay with delta-rule residual writes, using a learnable coefficient to balance forgetting and update magnitude. However, this coefficient is learned empirically rather than derived from the underlying objective, which can lead to suboptimal update magnitudes. We revisit the online-regression objective underlying GDN and, inspired by the Kaczmarz projection method, derive the key-norm-normalized dynamic step size $\beta_t = \eta_t / (\|k_t\|_2^2 + \epsilon)$ for residual updates. We propose Kaczmarz Linear Attention (KLA), a one-scalar modification of GDN that preserves the state shape, gates, linear recurrence, and chunkwise parallel algorithm. At the 0.4B scale with a 1B-token budget, KLA achieves the lowest validation perplexity among evaluated linear-time baselines, 8.09 versus 8.50 for GDN, and remains stable up to 65K tokens. On controlled tasks, KLA reaches 100% on single-needle-in-a-haystack retrieval, improves 8x multi-query associative recall by 7.03 points over GDN, and delivers 2.1x higher decode throughput at 32K context. These results suggest that the key-norm-normalized Kaczmarz coefficient is a first-order design axis for delta-rule sequence models: it improves accuracy, extrapolation, and decoding efficiency without changing the recurrent state or hardware kernel.
arXiv arXiv cs.LG · 1 天前 · 相关度 85% 热度★★☆☆☆
190
PRISM: Fast Online LLM Serving via Scheduling-Memory Co-design
PRISM: 通过调度-内存协同设计的快速在线LLM服务
推理部署性能优化

本文针对在线大语言模型服务中普遍存在的提示分段与热点偏斜现象,提出PRISM方案,通过查询感知调度器(QAS)和需求感知基数树(DART)的协同设计,将请求准入与精确前缀KV缓存保留对齐,避免热点段的重复预填充。评估表明,在4B和13B模型上,PRISM相比最强基线将平均每次查询吞吐下的P99首token延迟分别降低23.3%和37.1%,同时将精确前缀KV缓存命中率提升5.9和12.2个百分点。

arXiv:2605.08581v1 Announce Type: new Abstract: Modern online large language model (LLM) services, such as Retrieval-Augmented Generation (RAG) and agent systems, increasingly expose two prominent characteristics: prompt segmentation (e.g., system instructions, retrieved passages, tool outputs) and hotspot skew, where a small set of these segments recurs frequently across user requests. Failing to jointly exploit these patterns could lead to repeated prefill of hot segments and prolonged TTFT, undermining both throughput and user-perceived responsiveness. However, existing work tackles these patterns independently: KV-cache management mainly exploits segment reuse while scheduling reorders requests to improve cache locality, yet neither aligns request admission with KV-cache retention. To address this gap, we first analyze how scheduling and KV-cache management jointly affect TTFT. Guided by this, we present PRISM (Prefix Reuse Optimization Integrated Scheduling and Memory), which co-designs a query-aware scheduler (QAS) with a demand-aware radix tree (DART) to align request admission with exact-prefix KV retention. Our evaluation results show that, versus the strongest baseline, PRISM reduces average per-QPS P99 TTFT by 23.3\% and 37.1\% while increasing exact-prefix KV-cache hit rate by 5.9 and 12.2 percentage points on 4B and 13B models, respectively.
arXiv arXiv cs.LG · 1 天前 · 相关度 85% 热度★★☆☆☆
191
Uncovering Intra-expert Activation Sparsity for Efficient Mixture-of-Expert Model Execution
揭示专家内激活稀疏性以实现高效的混合专家模型执行
推理部署性能优化学术论文

本文发现现有预训练MoE模型内每个专家层存在高达90%的激活稀疏性,可在无精度损失的情况下被利用。通过在vLLM推理流水线中跳过非活跃神经元的计算,MoE层执行速度提升最高2.5倍,端到端推理速度提升1.2倍。该方法在1B至400B参数的8个开源MoE模型上验证,无需修改模型结构或激活函数,为MoE推理提供了一条新的加速路径。

arXiv:2605.08575v1 Announce Type: new Abstract: Mixture of Experts (MoE) architecture has become the standard for state-of-the-art large language models, owing to its computational efficiency through sparse expert activation. However, sparsity through finer expert granularity is becoming increasingly difficult to achieve due to fundamental training challenges such as expert collapse and load imbalance. In this work, we explore and leverage intra-expert activation sparsity as a complementary and underexplored dimension of sparsity in MoE models. Surprisingly, substantial intra-expert sparsity is readily available in existing pre-trained MoE models, without any modification to the activation function or model parameters, providing up to 90% sparsity within each expert without significant accuracy loss. We explore intra-expert activation sparsity across eight off-the-shelf MoE models ranging from 1B to 400B parameters, and extend the MoE execution pipeline of vLLM to leverage intra-expert activation sparsity by skipping the computations of inactive neurons, on top of its existing optimizations, achieving up to 2.5 times speedup in MoE layer execution and 1.2 times end-to-end speedup compared to the original dense vLLM baseline.
arXiv arXiv cs.LG · 1 天前 · 相关度 85% 热度★★☆☆☆
192
Belief or Circuitry? Causal Evidence for In-Context Graph Learning
信念还是电路?图上下文学习的因果证据
基础大模型学术论文

该研究通过图随机游走任务探究大语言模型的上下文学习机制,检验其究竟依赖局部模式匹配还是全局结构推断。利用PCA重构内部表示发现模型在中间混合比例下会同时将两种图拓扑信息编码在正交主成分子空间中,难以用单纯局部复制解释。通过残差流激活修补和图差分引导进行因果干预,晚期层修补几乎完全转移图的偏好,线性引导可按预定方向改变预测且受控实验表明该效应并非源于范数匹配或标签混淆,综合证据支持结构推断与感应电路并行的双机制理论。

arXiv:2605.08405v1 Announce Type: new Abstract: How do LLMs learn in-context? Is it by pattern-matching recent tokens, or by inferring latent structure? We probe this question using a toy graph random-walk across two competing graph structures. This task&#39;s answer is, in principle, decidable: either the model tracks global topology, or it copies local transitions. We present two lines of evidence that neither account alone is sufficient. First, reconstructing the internal representation structure via PCA reveals that at intermediate mixture ratios, both graph topologies are encoded in orthogonal principal subspaces simultaneously. This pattern is difficult to reconcile with purely local transition copying. Second, residual-stream activation patching and graph-difference steering causally intervene on this graph-family signal: late-layer patching almost fully transfers the clean graph preference, while linear steering moves predictions in the intended direction and fails under norm-matched and label-shuffled controls. Taken together, our findings are most consistent with a dual-mechanism account in which genuine structure inference and induction circuits operate in parallel.
arXiv arXiv cs.AI · 1 天前 · 相关度 85% 热度★★☆☆☆
193
CUDAHercules: Benchmarking Hardware-Aware Expert-level CUDA Optimization for LLMs
CUDAHercules:面向 LLM 的硬件感知专家级 CUDA 优化基准测试
芯片软件栈性能优化学术论文

该论文提出了 CUDAHercules 基准,用于评估大模型在 CUDA 编程中与人类专家级优化的差距。基准覆盖单核、模块算子、完整应用及未解决的挑战任务,跨越 Ampere、Hopper 和 Blackwell GPU 架构,并通过领域语义验证器把关端到端任务。测试 Claude-Opus-4.6 和 GPT-5.4 等模型发现,模型虽能编译通过测试,但难以复现专家级优化策略,应用语义进一步降低成功率,且迭代反馈虽提升正确性却倾向于慢速回退实现。结果表明自动化 CUDA 编程仍需更强的硬件推理、工具使用和连接代码理解与硬件架构的训练目标。

arXiv:2605.08467v1 Announce Type: new Abstract: Large language models show promise for automated CUDA programming, however even the strongest coding models (e.g., Claude-Opus-4.6) may still fall short of expert-level, architecture-aware optimization. We introduce CUDAHercules, a benchmark that evaluates generated CUDA against end-to-end human-expert SOTA systems. It spans single kernels, module-level operators, full applications, and unsolved challenge tasks across Ampere, Hopper, and Blackwell GPUs, with end-to-end tasks gated by domain-specific semantic validators. Evaluating models such as Claude-Opus-4.6 and GPT-5.4 shows a large gap between runnable CUDA and expert CUDA engineering: models often compile and pass tests, but rarely recover the optimization strategies needed to match expert performance. Application semantics further reduce success, and iterative or tool-augmented feedback can improve correctness while drifting toward slow fallback implementations. These results show that automated CUDA programming remains far from fully solved and requires stronger hardware reasoning, better tool use, and training objectives that connect code understanding to hardware architecture-grounded intelligence.
arXiv arXiv cs.LG · 1 天前 · 相关度 85% 热度★★☆☆☆
194
DUET: Optimize Token-Budget Allocation for Reinforcement Learning with Verifiable Rewards
DUET:优化可验证奖励强化学习中的令牌预算分配
训练微调性能优化学术论文

DUET提出一种联合控制RLVR训练中prompt分配rollout数量与rollout终止时机的轻量级方法,通过预rollout代理评估prompt信息量来决定rollout数量,并采用标记门控中止规则与重要性重加权来动态停止生成。在Qwen3-1.7B上用MATH训练时,仅使用50%的token预算即超越全预算GRPO等基线,实现2.51倍壁钟加速,且预算越紧优势越大,同时方法在Qwen3-4B、Llama-3.2-3B等不同模型上得到验证。

arXiv:2605.08441v1 Announce Type: new Abstract: Reinforcement learning with verifiable rewards (RLVR) generates hundreds of thousands of tokens per training step, with rollout generation dominating the computational cost. The overall token budget can be controlled along two main dimensions: (i) deciding which prompts to allocate rollouts to, and (ii) deciding how long each rollout should be. Prior work has generally controlled only one of these dimensions at a time. We show that jointly tuning both decisions under a shared compute budget improves both reasoning quality and wall-clock training time. We instantiate this view as \textbf{DU}al-controlled tok\textbf{E}n alloca\textbf{T}ion (DUET), a computationally efficient layer over GRPO that uses a lightweight pre-rollout surrogate of prompt informativeness to set how many rollouts each prompt receives, and a marker-gated abort rule with importance reweighting to set when to stop them. On Qwen3-1.7B trained on MATH, DUET outperforms full-budget GRPO and the other three budget-aware baseline methods. DUET&#39;s advantage further generalizes to other benchmarks across math and coding, and is on par with the best baseline on the scientific Q\&amp;A domain, while also achieving a $1.62\times$ wall-clock speedup. More notably, using only 50\% of the token budget, DUET still outperforms all baseline methods at their full budget, achieving an even higher $2.51\times$ speedup over full-budget GRPO. We verify the high performance of DUET on other backbone LLMs, including Qwen3-4B and Llama-3.2-3B-Instruct. Notably, the gap between DUET and the strongest baseline \emph{widens} as the budget tightens, contrary to the usual pattern in which efficient methods trade off quality as compute decreases. More broadly, these results suggest that DUET budget-aware control strategies are valuable not only for accelerating training, but also for improving the quality of the learning signal.
arXiv arXiv cs.LG · 1 天前 · 相关度 85% 热度★★☆☆☆
195
Delulu: A Verified Multi-Lingual Benchmark for Code Hallucination Detection in Fill-in-the-Middle Tasks
Delulu:一个面向填空任务中代码幻觉检测的验证多语言基准
学术论文基础大模型

本文提出了 Delulu,一个可验证的多语言代码幻觉检测基准,包含1951个Fill-in-the-Middle样本,覆盖7种语言和4种幻觉类型。通过对抗式管线构建的样本由前沿大模型生成、四个评判模型评估、Docker容器验证,并经过人类专家审核。在11个开源FIM模型上评估显示,最强模型pass@1仅达84.5%,所有模型均产成幻觉对齐补全,表明该基准揭示了任务内在的难度。此工作为代码大模型的幻觉问题提供了严格的诊断工具。

arXiv:2605.07024v1 Announce Type: cross Abstract: Large Language Models for code generation frequently produce hallucinations in Fill-in-the-Middle (FIM) tasks -- plausible but incorrect completions such as invented API methods, invalid parameters, undefined variables, or non-existent imports. These failures pass superficial review yet introduce runtime errors. We introduce Delulu, a verified multi-lingual benchmark of 1,951 FIM samples across 7 languages and 4 hallucination types. Samples are curated through an adversarial pipeline: a frontier LLM generates plausible hallucinations, four diverse judge models evaluate them, embedding-based clustering mines progressively harder examples, self-contained Docker containers verify that golden completions compile while hallucinated variants produce the expected runtime error, and a final human-expert review removes any remaining biased or trivially decidable samples. We evaluate 11 open-weight FIM models from five families spanning 0.5B-32B parameters: a six-point Qwen2.5-Coder scaling slate, plus a cross-family slate (CodeLlama, DeepSeek-Coder-V2, StarCoder2). The strongest model reaches only 84.5% pass@1, no family exceeds 0.77 Edit Similarity, and every family produces hallucination-aligned completions on a non-trivial share of samples, confirming that the difficulty exposed by Delulu is task-intrinsic rather than family-specific. We release the benchmark, containers, and evaluation framework at https://github.com/microsoft/delulu.
arXiv arXiv cs.AI · 1 天前 · 相关度 85% 热度★★☆☆☆
196
Exploration-Driven Optimization for Test-Time Large Language Model Reasoning
测试时大语言模型推理的探索驱动优化
训练微调推理部署学术论文

论文提出 Exploration-Driven Optimization (EDO) 方法,将鼓励多样性的探索奖励扩展至迭代式后训练,并与直接偏好优化 (iDPO) 和组相对策略优化 (GRPO) 结合,形成 ED-iDPO 和 ED-GRPO 两种变体。实验表明该方法在分布内推理基准上提升 1.0-1.3%,在分布外任务上额外提升 1.5%,同时保持了模型熵并稳定了 RL 训练动态。EDO 能有效平衡探索与利用,尤其适用于依赖测试时计算扩展的推理场景。

arXiv:2605.09853v1 Announce Type: new Abstract: Post-training techniques combined with inference-time scaling significantly enhance the reasoning and alignment capabilities of large language models (LLMs). However, a fundamental tension arises: inference-time methods benefit from diverse sampling from a relatively flattened probability distribution, whereas reinforcement learning (RL)-based post-training inherently sharpens these distributions. To address this, we propose Exploration-Driven Optimization (EDO), which extends reward-biasing style exploration objectives to iterative post-training and integrates them into standard RL objectives, encouraging greater diversity in sampled solutions while facilitating more effective inference-time computation. We incorporate EDO into iterative Direct Preference Optimization (iDPO) and Group Relative Policy Optimization (GRPO), resulting in two variants: ED-iDPO and ED-GRPO. Extensive experiments demonstrate that both ED-iDPO and ED-GRPO exhibit greater solution diversity and improved reasoning abilities, particularly when combined with test-time computation techniques like self-consistency. Across three in-distribution reasoning benchmarks, EDO achieves a 1.0-1.3\% improvement over the strongest baselines, and delivers an additional 1.5\% average gain on five out-of-distribution tasks. Beyond accuracy, EDO preserves model entropy and stabilizes RL training dynamics, highlighting its effectiveness in preventing over-optimization collapse. Taken together, these results establish EDO as a practical framework for balancing exploration and exploitation in LLM reasoning, especially in settings that rely on test-time scaling.
arXiv arXiv cs.LG · 1 天前 · 相关度 85% 热度★★☆☆☆
197
Pretraining large language models with MXFP4
使用MXFP4预训练大型语言模型
训练微调性能优化芯片软件栈

本文系统性研究了MXFP4量化在Transformer训练中的影响,发现权重梯度(Wgrad)量化是导致训练发散的主因,而前向和激活梯度的FP4量化影响较小。通过对照实验,作者证明随机舍入和随机Hadamard旋转不能稳定训练,而确定性Hadamard旋转能有效恢复优化,提示不稳定源于结构化微缩放误差而非随机性不足。实验在AMD Instinct MI355X GPU上原生MXFP4支持进行,避免了软件模拟的不确定性。

arXiv:2605.09825v1 Announce Type: new Abstract: Why does full-pipeline FP4 training of large language models often diverge, even when forward activations and activation gradients remain stable? We address this question through a controlled study of MXFP4 quantization in transformer training, progressively enabling FP4 across forward propagation (Fprop), activation gradients (Dgrad), and weight gradients (Wgrad) while holding all other factors fixed. In full pretraining of Llama 3.1-8B on the C4 dataset, we observe that quantizing Wgrad is the primary driver of convergence degradation, whereas FP4 in Fprop and Dgrad alone introduces only modest additional token requirements. To interpret this behavior, we evaluate both structured and stochastic interventions under a controlled experimental setting. We find that stochastic rounding and randomized Hadamard rotations fail to stabilize training once Wgrad is quantized, whereas deterministic Hadamard rotations consistently restore stable optimization. These results suggest that FP4 training instability is driven by structured micro-scaling errors along sensitive gradient paths, rather than by insufficient stochasticity. We run experiments with native MXFP4 support on AMD Instinct MI355X GPUs, enabling controlled investigation of these effects without reliance on software emulation.
arXiv arXiv cs.LG · 1 天前 · 相关度 85% 热度★★☆☆☆
198
TIDES: Implicit Time-Awareness in Selective State Space Models
TIDES: 选择性状态空间模型中的隐式时间感知
基础大模型学术论文

提出TIDES,一种选择性状态空间模型(SSM)变体,通过将输入依赖从时间步长移至对角状态矩阵,使离散化步长恢复物理意义,从而原生处理不规则时间戳且不牺牲逐token表达性。设计Fading Flash诊断基准,系统性检验模型对输入依赖和分布外步长的泛化能力,并揭示现有架构的典型失效模式。TIDES在UEA时间序列分类和Physiome-ODE回归基准上取得最新的最先进平均排名,代码已开源。

arXiv:2605.09742v1 Announce Type: new Abstract: Selective state space models (SSMs), such as Mamba, achieve strong per-token expressivity by making the time discretization step $\Tilde{\Delta}$ a learned function of the input. However, in doing so, $\Tilde{\Delta}$ ceases to represent a physical sampling interval, limiting its irregular time series modeling capability. Continuous-time SSMs, such as S5, preserve the physical meaning of $\Tilde{\Delta}$ and handle irregular timestamps natively ($\Tilde{\Delta}\equiv\Delta)$, but their dynamics remain linear time-invariant (LTI), limiting per-token expressivity. We propose \textbf{TIDES}, a selective SSM variant that reconciles selective and continuous architectures by moving input-dependence off the step size and onto the diagonal state matrix. As a result, $\Tilde{\Delta}$ retains its physical meaning, tied to the state discretization, allowing the model to handle irregular timestamps natively without sacrificing the per-token expressivity that makes selective SSMs effective. We show this on a novel \emph{Fading Flash} experimental benchmark, a compact controlled diagnostic for sequence models that jointly tests input-dependence and extrapolation to out-of-distribution $\Delta$ values, and isolates the distinct failure modes of current state-of-the-art architectures that TIDES avoids by construction. On large-scale benchmarks, TIDES sets the new state-of-the-art average rank on UEA time-series classification and the Physiome-ODE regression benchmark. Code available at: https://github.com/TaylanSoydan/TIDES.
arXiv arXiv cs.LG · 1 天前 · 相关度 85% 热度★★☆☆☆
199
The Cancellation Hypothesis in Critic-Free RL: From Outcome Rewards to Token Credits
无批评家强化学习中的抵消假设:从结果奖励到令牌信用
训练微调学术论文

本文从令牌级视角研究大语言模型的critic‑free RL,揭示训练中正负样本的令牌概率变化高度相似,并提出‘抵消假设’:由于令牌间耦合作用,正负样本共有的令牌会信号抵消,而成功样本特有的推理令牌获得更强强化,从而实现从序列奖励到隐式令牌信用分配。实验表明,该方法相比仅用正样本训练,能将更新从模板/格式令牌转移到推理令牌,且被提升的令牌价值始终更高。作者据此设计了保留查询的小批次和奖励均衡批次两种简单干预,在多个模型规模上提升了RLVR训练效果。

arXiv:2605.08666v1 Announce Type: new Abstract: A commonly accepted explanation of critic-free RL for LLMs, based on sequence-level rewards, is that it reinforces successful rollouts with a positive advantage while penalizing failed ones. In contrast, we study critic-free RL from a token-level perspective, revealing the token-flipping phenomenon: positive and negative rollouts exhibit remarkably similar proportions of tokens whose probabilities are boosted or suppressed during RL training. To explain this phenomenon, we further show that a token&#39;s change in probability is not fully determined by its own advantage; coupled gradient interactions with other tokens also play a non-negligible role. Specifically, these token coupling effects occur primarily between identical tokens that are both predicted with low confidence. Building upon this analysis, we propose the cancellation hypothesis: as a result of coupling, opposing signals cancel out for tokens shared by positive and negative rollouts, while tokens more specific to successful rollouts receive stronger reinforcement, thereby inducing hidden token-level credit assignment from rollout-level rewards. We support this hypothesis with complementary empirical evidence. (1) Compared with training on only positive rollouts, critic-free RL shifts updates from template and formatting tokens toward reasoning tokens; (2) Tokens boosted by critic-free RL consistently demonstrate higher value than suppressed tokens, regardless of whether they originate from positive or negative rollouts. Guided by this view, we implement two batching interventions to encourage or preserve cancellation in critic-free RL training: query-preserved mini-batching and reward-balanced batching. Despite their simplicity, these interventions improve RLVR training across multiple model scales, supporting cancellation as both an explanatory principle and a practical design criterion for critic-free RL training.
arXiv arXiv cs.LG · 1 天前 · 相关度 82% 热度★★☆☆☆
200
SkillMaster: Toward Autonomous Skill Mastery in LLM Agents
SkillMaster: 迈向 LLM 智能体的自主技能掌握
训练微调开发工具学术论文

SkillMaster 是一个训练框架,能让 LLM 智能体在任务解决过程中自主创建、优化和选择技能。其核心设计包括:基于历史轨迹的技能审查,通过反事实效用评估技能修改是否有效,以及 DualAdv-GRPO 算法分别估计任务求解与技能编辑的优势,稳定联合训练。实验表明,在 ALFWorld 和 WebShop 上成功率比最强基线分别提升 8.8% 和 9.3%,并使智能体从单纯使用技能转向能够自我发现错误、提炼并迁移技能的自改进模式。

arXiv:2605.08693v1 Announce Type: new Abstract: Skills provide an effective mechanism for improving LLM agents on complex tasks, yet in existing agent frameworks, their creation, refinement, and selection are typically governed by external teachers, hand-designed rules, or auxiliary modules. As a result, skills remain external resources to be invoked, rather than capabilities that agents can develop, adapt, and internalize through experience. To endow LLM agents with autonomous skill mastery, we propose SkillMaster, a training framework that teaches agents to create new skills, refine existing skills, and select accumulated skills during task solving. This capability is achieved through three key designs. First, we train agents through trajectory-informed skill review, teaching agents to propose, update, or retain skills based on evidence from completed episodes. Second, each candidate skill edit is designed to be evaluated by its counterfactual utility on related probe tasks, providing a direct learning signal for training skill-editing decisions. Third, we introduce DualAdv-GRPO, which separately estimates advantages for task-solving actions and skill-editing decisions, stabilizing joint training across task solving and skill management. Experiments on ALFWorld and WebShop show that SkillMaster improves the overall success rate over state-of-the-art baselines by 8.8% and 9.3%, respectively, achieving the best performance among all compared methods. Further analysis reveals a marked shift in agent capability: agents trained with SkillMaster can identify skill failures, refine procedural knowledge from trajectory evidence, and transfer improvements to future tasks with limited skill-bank edits. Overall, SkillMaster moves LLM agents beyond mere skill use toward self-improving agents capable of developing, adapting, and applying their own skill repertoires.
arXiv arXiv cs.AI · 1 天前 · 相关度 82% 热度★★☆☆☆
201
Why Retrying Fails: Context Contamination in LLM Agent Pipelines
为什么重试会失败:LLM Agent流水线中的上下文污染
学术论文推理部署开发工具

本文针对LLM Agent在多步工具调用任务中重试失败的现象,首次形式化提出了上下文污染重启模型(CCRM)。模型刻画了失败尝试残留在上下文窗口中,导致后续尝试的每步错误率从基准ε₀升高到ε₁,推导出在K次尝试内成功的精确闭式概率公式、级联开销定理、使总预算B下成功率最大的最优流水线深度T*的闭式解,以及通过Le Cam方法得到的信息论下界。在SWE-bench Verified数据集上验证显示,独立同分布假设会高估pass@3达17.4个百分点,而CCRM拟合误差小于0.001,并揭示出污染后错误率提升了6.1倍,蒙特卡洛实验确认了所有理论预测。

arXiv:2605.08563v1 Announce Type: new Abstract: When an LLM agent fails a multi-step tool-augmented task and retries, the failed attempt typically remains in its context window -- contaminating the next attempt and elevating the per-step error rate beyond the base level. This context-contaminated restart phenomenon is widely observed in practice yet entirely lacks formal treatment. We introduce the Context-Contaminated Restart Model (CCRM): a chain of T tool-call steps, each failing with base rate epsilon_0; after any failed attempt, the subsequent attempt operates in contaminated context with elevated error rate epsilon_1 &gt; epsilon_0. Under this model we derive five main results. (R1) An exact closed-form formula for P(succeed in at most K attempts). (R2) A cascade-overhead theorem giving the additional attempts Delta K incurred by contamination versus the clean-restart baseline. (R3) An optimal budget-allocation theorem identifying the pipeline depth T* that maximises success probability for a fixed total budget B=KT; we prove the closed form T* = sqrt(B * log(1/(1-epsilon_1)) / log(1/(1-epsilon_0))), with K*=B/T*. (R4) An information-theoretic lower bound via Le Cam&#39;s method showing K_CCRM is tight up to O(1). (R5) A clean-restart dominance theorem quantifying the exact benefit of context-clearing before retry. We validate CCRM on real SWE-bench Verified data: the IID model overestimates pass@3 by 17.4 percentage points (98.6% vs. 81.2%), while CCRM fits with error less than 0.001, implying a cascade ratio of epsilon_1/epsilon_0 = 7.1. Monte Carlo experiments confirm all theoretical predictions.
arXiv arXiv cs.AI · 1 天前 · 相关度 82% 热度★★☆☆☆
202
Latent Personality Alignment: Improving Harmlessness Without Mentioning Harms
潜在人格对齐:在不提及危害的情况下提高无害性
训练微调学术论文

提出潜在人格对齐(LPA)方法,通过抽象人格特质而非具体有害行为训练模型,仅需不足100条特质陈述即可实现与15万+样本训练相当的鲁棒性。LPA在六项危害基准上的误分类率比基线降低2.6倍,且从未在训练中接触有害样本,却能更好泛化到未知攻击分布,提供了一种样本高效、成本低廉的大模型安全对齐新路径。

arXiv:2605.08496v1 Announce Type: new Abstract: Current adversarial robustness methods for large language models require extensive datasets of harmful prompts (thousands to hundreds of thousands of examples), yet remain vulnerable to novel attack vectors and distributional shifts. We propose Latent Personality Alignment (LPA), a sample-efficient defense that achieves robustness by training models on abstract personality traits rather than specific harmful behaviors. Using fewer than 100 trait statements and latent adversarial training, LPA achieves comparable attack success rates to methods trained on 150k+ examples, while maintaining superior utility. Critically, LPA generalizes better to unseen attack distributions, reducing misclassification rates by 2.6x compared to baseline across six harm benchmarks -- without ever seeing harmful examples during training. Our results demonstrate that personality-based alignment offers a principled approach to building robust defenses with minimal cost.
arXiv arXiv cs.AI · 1 天前 · 相关度 82% 热度★★☆☆☆
203
The Attacker in the Mirror: Breaking Self-Consistency in Safety via Anchored Bipolicy Self-Play
镜子里的攻击者:通过锚定双策略自博弈打破安全自我一致性
训练微调基础大模型学术论文

本文针对大语言模型安全训练中的自博弈红队方法,指出参数共享导致防御策略退化为自一致性,削弱对抗压力。作者提出锚定双策略自博弈方法,在冻结基座模型上训练独立的攻击与防御LoRA适配器,实现角色分离和稳定优化。实验表明,该方法在Qwen2.5系列模型上参效率比全量微调高100倍,安全性持续提升且不影响推理能力,攻击与防御模型在交叉实验中展现出更强的对抗防御性能。

arXiv:2605.08427v1 Announce Type: new Abstract: Self-play red team is an established approach to improving AI safety in which different instances of the same model play attacker and defender roles in a zero-sum game, i.e., where the attacker tries to jailbreak the defender; if self-play converges to a Nash equilibrium, the model is guaranteed to respond safely within the settings of the game. Although the parameter sharing enforced by the use of the same model for the two roles improves stability and performance, it introduces fundamental theoretical and architectural limitations. We show that the set of Nash equilibria that can be reached corresponds to a broad class of behaviours that includes trivial always refuse strategies and oracle-like defenders, thus limiting practical applicability. We then show that when attacker and defender share and update the same base model, the dynamics collapse to self-consistency, so that attacks do not enforce adversarial pressure on the defender. In response, we propose Anchored Bipolicy Self-Play, which trains distinct role-specific LoRA adapters on top of a frozen base model, thereby maintaining stable optimisation while preserving adversarial pressure through explicit role separation. In relation to standard self-play, we show up to 100x greater parameter efficiency than finetuning and consistent improvements in safety compared to self-play fine-tuned models. We evaluate on Qwen2.5-{3B, 7B,14B}-IT models across widely used safety benchmarks, showing improved robustness without loss of reasoning ability. Cross-play experiments further show that our attacker and defender models are superior to self-play in terms of adversarial defence and safety.
arXiv arXiv cs.AI · 1 天前 · 相关度 82% 热度★★☆☆☆
204
SkillLens: Adaptive Multi-Granularity Skill Reuse for Cost-Efficient LLM Agents
SkillLens:面向低成本LLM智能体的自适应多粒度技能复用框架
开发工具学术论文

该论文提出SkillLens,一个层次化技能演化框架,将技能组织为策略、方法、过程与原子操作四层图,并通过混合粒度检索与验证机制,实现子技能的按需重用与局部改写,以此平衡上下文相关性与推理成本。任务执行时,先语义检索技能种子,经度修正随机游走扩展候选,验证器对每个访问节点做出接受、分解、重写或跳过决策,并基于路由决策持续优化技能图和验证器。理论分析表明在稀疏不匹配假设下混合粒度自适应具有次线性成本,演化更新规则单调改进验证目标直至局部最优。实验在MuLocbench和ALFWorld上实现最高6.31个百分点准确率提升,智能体成功率从45.00%提高到51.31%。

arXiv:2605.08386v1 Announce Type: new Abstract: Skill libraries have become a practical way for LLM agents to reuse procedural experience across tasks. However, existing systems typically treat skills as flat, single-resolution prompt blocks. This creates a tension between relevance and cost: injecting coarse skills can introduce irrelevant or misleading context, while rewriting entire skills is expensive and often unnecessary. We propose SkillLens, a hierarchical skill-evolution framework that organizes skills into a four-layer graph of policies, strategies, procedures, and primitives, and retrieves them at mixed granularity. Given a task, SkillLens first retrieves semantically relevant skill seeds, expands them through degree-corrected random walk over the skill graph, and then uses a verifier to decide whether each visited unit should be accepted, decomposed, rewritten, or skipped. This enables the agent to reuse compatible subskills directly while adapting only locally mismatched components. To improve the system over time, SkillLens further refines multi-granularity skills and verifier in order to improve its routing decisions. We provide theoretical analysis showing that mixed-granularity adaptation incurs sublinear cost under sparse mismatch assumptions and that the evolutionary update rule monotonically improves the validation objective until a local optimum. Across MuLocbench and ALFWorld, SkillLens consistently improves over strong skill-based baselines, achieving up to a 6.31 percentage-point Acc@1 gain for bug localization and raising agent success rate from 45.00% to 51.31%.
arXiv arXiv cs.AI · 1 天前 · 相关度 82% 热度★★☆☆☆
205
MedThink: Enhancing Diagnostic Accuracy in Small Models via Teacher-Guided Reasoning Correction
MedThink:通过教师引导推理纠错增强小模型诊断准确性
训练微调学术论文

MedThink 提出一个两阶段知识蒸馏框架,旨在将大型语言模型的临床推理能力高效压缩到小模型中。第一阶段由教师模型筛选数据并注入领域知识解释来微调学生模型;第二阶段教师模型分析学生错误,生成推理链并实施第二轮的纠错微调。在通用医学基准和胃肠病学数据集上的实验表明,该方法优于六种蒸馏策略,最高相对基线提升12.7%,在计算资源受限场景下显著提高了小模型的诊断准确性与泛化能力。

arXiv:2605.08094v1 Announce Type: cross Abstract: Accurate clinical diagnosis requires extensive domain knowledge and complex clinical reasoning capabilities. Although large language models (LLMs) hold great potential for clinical reasoning, their high computational and memory requirements limit their deployment in resource-constrained environments. Knowledge distillation (KD) can compress LLM capabilities into smaller models, but traditional KD merely transfers superficial answer patterns and fails to preserve the structured reasoning required for reliable diagnosis. To address this, we propose a two-stage distillation framework, MedThink, designed to cultivate robust clinical reasoning in small language models (SLMs). In the first stage, a teacher LLM screens data and injects domain-knowledge explanations to fine-tune a student model, establishing a knowledge foundation. In the second stage, the teacher evaluates the student&#39;s errors, generates reasoning chains linking knowledge to correct answers, and refines the student&#39;s diagnostic reasoning through a second round of fine-tuning. We evaluate MedThink on general medical benchmarks and a gastroenterology dataset comprising 955 question-answer pairs. Experiments demonstrate that MedThink outperforms six distillation strategies in all benchmarks: achieving an improvement of up to 12.7% over the student baseline in general tasks, and reaching a total top accuracy of 56.4% in gastroenterology evaluation. This indicates that iterative distillation centered on reasoning can significantly enhance the diagnostic accuracy and generalization capabilities of SLMs whilst maintaining computational efficiency. Our code and data are publicly available at https://github.com/destinybird/PrecisionBoost.
arXiv arXiv cs.AI · 1 天前 · 相关度 82% 热度★★☆☆☆
206
Dystruct: Dynamically Structured Diffusion Language Model Decoding via Bayesian Inference
Dystruct:基于贝叶斯推理的动态结构化扩散语言模型解码
推理部署基础大模型学术论文

本文针对扩散语言模型(DLM)固定生成长度带来的灵活性受限问题,提出一种无需训练、基于贝叶斯推理的动态结构化解码框架。该方法将灵活生成长度建模为动态结构推理问题,在每次窗口扩展时联合计算扩展长度、块边界和去噪调度,并通过统一机制融合局部不确定性与结构信号,支持灵活的块扩展与块组织以保持连贯性。在多个基准上的实验表明,该方法在生成质量和灵活性上显著优于现有的固定长度和灵活长度解码基线,为扩散语言模型的结构化文本生成提供了有原则且高效的解决方案。

arXiv:2605.09820v1 Announce Type: new Abstract: Diffusion language models (DLMs) have recently emerged as a promising alternative to autoregressive models, primarily due to their ability to enable parallel decoding. Despite this advantage, most existing DLMs rely on a fixed generation length specified prior to decoding, which restricts their flexibility in real-world applications. While a few recent works attempt to support flexible-length generation, they typically suffer from notable limitations: some require costly retraining to accommodate variable-length outputs, while others depend solely on local confidence signals during decoding. Such local criteria fail to capture the evolving structure of the sequence, often resulting in suboptimal generation quality. In this paper, we propose a training-free, Bayesian structured decoding framework that formulates flexible-length generation as a dynamic structural inference problem. Our approach formulates flexible-length generation as a dynamic structural inference problem, jointly computing the expansion length, the block boundaries, and the decoding schedule. At each window expansion step, the method integrates local uncertainty with structural signals via a unified mechanism that supports dynamic structured generation, including both flexible block expansion and block organization, while maintaining coherence. Extensive experiments across multiple benchmarks demonstrate that our approach significantly improves generation quality and flexibility over existing fixed-length and flexible-length baselines. These results highlight the advantage of Bayesian structured decoding for diffusion language model, providing a principled and efficient solution for structured text generation.
arXiv arXiv cs.LG · 1 天前 · 相关度 82% 热度★★☆☆☆
207
LEAD: Length-Efficient Adaptive and Dynamic Reasoning for Large Language Models
LEAD:面向大语言模型的长度高效自适应动态推理
推理部署学术论文

本文提出LEAD方法,用于解决大推理模型(如OpenAI o1和DeepSeek-R1)在思维链推理中日益冗长的问题。LEAD通过在线自适应性机制动态校准正确性与效率之间的权衡,利用潜力缩放不稳定性引导优化方向,并根据模型自身正确推理的结果在线估计每个问题的自适应目标长度,施加对称效率奖励以同时惩罚过度思考和过度压缩。在五个数学推理基准上,LEAD在强化学习训练的效率推理方法中取得了最高的准确率和准确率-效率综合分数,并显著缩短了输出长度。

arXiv:2605.09806v1 Announce Type: new Abstract: Large reasoning models, such as OpenAI o1 and DeepSeek-R1, tend to become increasingly verbose as their reasoning capabilities improve. These inflated Chain-of-Thought (CoT) trajectories often exceed what the underlying problems require, wasting compute, latency, and context budgets. While introducing length-based efficiency rewards during reinforcement learning offers a natural remedy, existing methods struggle with two fundamental challenges: the optimal balance between correctness and efficiency is non-stationary throughout training, and intrinsic reasoning budgets vary drastically across problems. Relying on static reward weights and global length constraints inevitably forces a compromise between degraded accuracy and unrealized compression. To overcome these limitations, we propose LEAD (Length-Efficient Adaptive and Dynamic reasoning), a method that replaces static heuristics with online, self-adaptive mechanisms. LEAD dynamically calibrates the correctness-efficiency trade-off at each step using a Potential-Scaled Instability, directing optimization capacity to the most informative learning signal. Furthermore, it estimates an adaptive per-problem target length online based on the model&#39;s own correct rollouts, applying a symmetric efficiency reward that penalizes both overthinking and over-compression. Evaluated on five mathematical reasoning benchmarks, LEAD achieves the highest accuracy and Accuracy-Efficiency Score among RL-trained efficient-reasoning methods while producing substantially shorter outputs than the base model.
arXiv arXiv cs.LG · 1 天前 · 相关度 82% 热度★★☆☆☆
208
Nectar: Neural Estimation of Cached-Token Attention via Regression
Nectar: 基于回归的缓存令牌注意力神经估计
推理部署性能优化学术论文

Nectar提出一种轻量级神经网络来近似长上下文推理中的softmax注意力计算,替代对所有缓存KV对的O(n)操作,使推理成本与上下文长度解耦。该方法为每层和每个KV头训练两个小网络:目标网络预测注意力输出,得分网络预测对数归一化常数。在1.7B至8B参数模型、五个长上下文数据集上的实验表明,近似误差与完整注意力的下一个令牌准确度差距相关,且跨层非均匀分配容量可进一步缩小差距。配备Nectar模块的模型在文本生成任务中语义内容与全缓存注意力版本匹配。

arXiv:2605.09778v1 Announce Type: new Abstract: Evaluating softmax attention over a fixed long context requires reading every cached key-value pair for each new query token. For a given context (a book, a manual, a legal corpus) the attention output is a deterministic function of the query. We propose Nectar, which fits a compact neural network to this function for queries drawn from a task-relevant distribution. Nectar fits two networks per layer and KV-head: a target network that predicts the attention output and a score network that predicts the log-normalizer. The pair plugs into the standard masked self-attention at inference time, replacing the $O(n)$ attention over the cache with a forward pass whose cost does not depend on $n$. Each module carries on the order of $|\theta|$ parameters per layer and KV-head, typically much smaller than the $2nd$ KV-cache footprint at the same granularity. We report experiments on models from 1.7B to 8B parameters across five long-context datasets. The approximation error tracks the next-token accuracy gap to full attention, and allocating capacity non-uniformly across layers reduces that gap in our ablation. Beyond this analysis of metrics, we check that the text generations (following a question prompt) of a model equipped with a Nectar module match in semantic content those obtained by giving the same model access to the full cache.
arXiv arXiv cs.LG · 1 天前 · 相关度 82% 热度★★☆☆☆
209
Evolving-RL: End-to-End Optimization of Experience-Driven Self-Evolving Capability within Agents
Evolving-RL:端到端优化智能体经验驱动的自进化能力
训练微调学术论文

本文提出 Evolving-RL 框架,将大语言模型智能体的经验提取与利用视为统一过程,通过强化学习联合优化。该方法从经验评估中导出两个监督信号,分别优化提取器和求解器,实现协调共进化。在 ALFWorld 和 Mind2Web 测试中,分布外任务相对基线(GRPO)最高提升 98.7% 和 35.8%,且即使没有测试时经验积累,也能通过将经验模式内化到模型参数中显著提升可见和不可见任务性能。

arXiv:2605.10663v1 Announce Type: new Abstract: Experience-driven self-evolving agents aim to overcome the static nature of large language models by distilling reusable experience from past interactions, thus enabling adaptation to novel tasks at deployment time. This process places substantial demands on the foundation model&#39;s capacities for abstraction, generalization, and in-context learning. However, most existing studies focus primarily on system-level design choices, such as how experience is represented and managed, neglecting the inherent capabilities of the underlying model. While some recent works have started to optimize the experience utilization stage via reinforcement learning, they still fail to treat self-evolution as a unified process to be jointly optimized. To this end, we propose Evolving-RL, an efficient algorithmic framework that jointly improves the experience extraction and utilization capabilities required for self-evolution. Specifically, we center the learning process on experience extraction and evaluation, using the two supervisory signals derived from evaluation to optimize the extractor and solver separately and thus enable their coordinated co-evolution. Experiments on ALFWorld and Mind2Web show that Evolving-RL effectively enhances LLMs&#39; ability to extract and reuse experience, leading to strong performance gains on out-of-distribution tasks (up to 98.7% relative improvement over the GRPO baseline on ALFWorld unseen tasks and 35.8% on Mind2Web), and these gains are fully unlocked only through the coordinated co-evolution of experience extraction and utilization. Furthermore, Evolving-RL inherently functions as an experience-augmented RL algorithm. By internalizing reusable experience patterns directly into model parameters, it achieves remarkable performance gains over standard baselines on both seen and unseen tasks, even in the absence of test-time experience accumulation.
arXiv arXiv cs.AI · 1 天前 · 相关度 82% 热度★★☆☆☆
210
Metal-Sci: A Scientific Compute Benchmark for Evolutionary LLM Kernel Search on Apple Silicon
Metal-Sci:苹果芯片上用于进化式大模型核搜索的科学计算基准
芯片软件栈性能优化学术论文

本文提出了 Metal-Sci,一个包含 10 项科学计算核(如模板、N 体问题、分子动力学、FFT 等)的基准,运行于苹果芯片的 Metal 框架上。每个任务都配有 CPU 参考实现、基于 roofline 的适应度函数和未见过的泛化尺寸,用于评估由冻结大模型驱动的 (1+1) 进化自动核搜索。该框架实时编译候选核、在不同尺寸下评分,并将结构化诊断信息反馈给大模型以迭代优化。在 M1 Pro 上测试 Claude Opus 4.7、Gemini 3.1 Pro 和 GPT 5.5,实现了 1.00× 至 10.7× 的加速比;关键创新是 held-out 门控评分函数 Φ_T,能在搜索结束后发现仅凭分布内得分无法察觉的静默性能回退或错误结果。

arXiv:2605.09708v1 Announce Type: new Abstract: We present Metal-Sci, a 10-task benchmark of scientific Apple Silicon Metal compute kernels spanning six optimization regimes (stencils, all-pairs in $n$-body problems, multi-field Boltzmann, neighbor-list molecular dynamics, multi-kernel PDE, FFT). Each task ships a CPU reference, a roofline-anchored fitness function, and a held-out generalization size. We pair the benchmark with a lightweight harness for automatic kernel search that runtime-compiles each candidate, scores it against the roofline across multiple sizes, and feeds structured compile and per-size correctness diagnostics back to a frozen LLM driving a $(1{+}1)$ evolutionary loop. We report matched single-model sweeps of Claude Opus 4.7, Gemini 3.1 Pro, and GPT 5.5 on M1 Pro: in-distribution self-speedups span $1.00\times$ to $10.7\times$. Beyond raw speedup, our central methodological claim is structural: the held-out gate scoring function $\Phi_\mathcal{T}$ (evaluated once at end-of-run on a configuration the agent never sees during search) functions as a cheap mechanical oversight primitive on this automatic search loop, catching e.g. an Opus template HMC win that returns wrong samples at unseen dimensions, and a GPT FFT3D best that wins in-distribution at $2.95\times$ speedup but collapses to $0.23\times$ on a $256^3$ held-out cube, a silent regression that the in-distribution score alone cannot see. Code at https://github.com/vicgalle/metal-sci-kernels
arXiv arXiv cs.LG · 1 天前 · 相关度 82% 热度★★☆☆☆
211
Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction
让每个Token都算数:用KV缓存淘汰提升长上下文性能
推理部署性能优化学术论文

该论文针对长上下文推理中KV缓存导致的内存与计算瓶颈,提出一种基于全局保留的学习式KV淘汰方法。它通过轻量保留门为每个缓存条目分配效用分数,并利用共享的最终评分投影跨层、跨头校准分数,使所有token全局竞争缓存容量。理论分析表明,优先保留有用token能减轻注意力稀释,而几何保留可作为查询无关的未来效用代理。在长文本、视觉语言推理及多轮对话基准上,该方法大幅降低KV内存的同时,匹配甚至超越了全缓存推理性能。

arXiv:2605.09649v1 Announce Type: new Abstract: The key-value (KV) cache is a major bottleneck in long-context inference, where memory and computation grow with sequence length. Existing KV eviction methods reduce this cost but typically degrade performance relative to full-cache inference. Our key insight is that full-cache attention is not always optimal: in long contexts, irrelevant tokens can dilute attention away from useful evidence, so selective, learnable eviction can improve generation rather than merely approximate the full cache. We introduce a global retention-based KV eviction method that learns each token&#39;s future utility under a unified memory budget. Lightweight retention gates assign utility scores to cached KV entries, and a shared final scoring projection calibrates these scores across all layers and heads. This enables a single global eviction policy in which tokens from different layers, heads, and modalities compete directly for cache capacity. We further provide theoretical analysis showing that preferentially retaining useful tokens reduces attention dilution, and we justify geometric retention as a query-agnostic proxy for future utility. Across diverse long-context language and vision-language reasoning, and multi-turn dialogue benchmarks, our method substantially reduces KV memory while matching or surpassing full-cache inference. These results suggest that learned, globally calibrated KV eviction is not only a compression technique, but also a mechanism for improving long-context reasoning.
arXiv arXiv cs.LG · 1 天前 · 相关度 82% 热度★★☆☆☆
212
Iterative Critique-and-Routing Controller for Multi-Agent Systems with Heterogeneous LLMs
异构LLM多智能体系统的迭代批判与路由控制器
学术论文推理部署

本文针对多智能体LLM系统中控制器仅能一次路由的局限,提出一种迭代批判与路由控制器,将协调过程建模为有限时域马尔可夫决策过程(MDP)。控制器在每一轮评估当前生成结果,决定是否停止或继续调用另一个模型进行优化,并通过策略梯度在拉格朗日松弛目标下学习。在七个推理基准上的实验表明,该方法在显著减少最强模型调用次数的同时(低于总调用的25%),性能一致优于现有基线,缩小了与最强智能体的差距。

arXiv:2605.08686v1 Announce Type: new Abstract: Multi-agent large language model (LLM) systems often rely on a controller to coordinate a pool of heterogeneous models, yet existing controllers are typically limited to one-shot routing: they select a model once and return its output directly. Such routing-only designs provide no mechanism to critique intermediate drafts or support iterative refinement. To address this limitation, we propose a critique-and-routing controller that casts multi-agent coordination as a sequential decision problem. At each turn, the controller evaluates the current draft, decides whether to stop or continue, and, if needed, selects the next agent for further refinement. We formulate this process as a finite-horizon Markov Decision Process (MDP) with explicit agent-utilization constraints, design a composite reward for controller decisions across turns, and optimize the controller via policy gradients under a Lagrangian-relaxed objective. Extensive experiments across multiple heterogeneous multi-agent systems and seven reasoning benchmarks show that our method consistently outperforms state-of-the-art baselines and substantially narrows the gap to the strongest agent, while using it for fewer than 25% of total calls.
arXiv arXiv cs.AI · 1 天前 · 相关度 80% 热度★★☆☆☆
213
ReLibra: Routing-Replay-Guided Load Balancing for MoE Training in Reinforcement Learning
ReLibra:面向强化学习中MoE训练的路由重放引导负载均衡
训练微调性能优化

该论文针对大模型MoE架构在强化学习训练中因热点专家频繁变化导致的负载不均衡问题,提出ReLibra系统。它利用RL工作流的特点,在rollout阶段已知token到专家的路由决策,从而在训练前获取精确负载信息。ReLibra在批次间通过专家重排序实现跨节点均衡,在批次内动态复制专家以吸收微批次级负载波动,充分利用分层网络带宽。在多种MoE LLM和RL任务上的实验显示,训练吞吐量最高可达Megatron-LM的1.6倍、EPLB的1.2倍,且性能仅比理想均衡基线低6%-10%。

arXiv:2605.08639v1 Announce Type: new Abstract: Load imbalance is a long-standing challenge in Mixture-of-Experts (MoE) training and is exacerbated in reinforcement learning (RL) for LLMs, where hot experts can shift frequently across micro-batches. Existing MoE training systems rely on historical loads to predict future expert demand, making them less effective under sharp fluctuations. We propose ReLibra, an MoE RL training system that exploits a unique opportunity in RL&#39;s rollout-training workflow, routing replay, to enable fine-grained load balancing at micro-batch granularity. Because rollout and training process the same tokens with the same MoE parameters, the token-to-expert routing decisions are known before training starts. Leveraging this information, ReLibra places two MoE load-balancing mechanisms at inter- and intra-batch timescales, matching their communication patterns to hierarchical network bandwidths. At the inter-batch timescale, ReLibra performs expert reordering to redistribute experts for batch-level cross-node balancing; at the intra-batch timescale, it dynamically performs expert replication within a node to absorb micro-batch-level load fluctuations. Experiments on diverse MoE LLMs and RL workloads show that ReLibra improves training throughput by up to 1.6$\times$ over Megatron-LM and by up to 1.2$\times$ over EPLB, even when EPLB is given oracle loads. Moreover, ReLibra remains within 6%-10% of the throughput of an idealized balanced baseline.
arXiv arXiv cs.LG · 1 天前 · 相关度 80% 热度★★☆☆☆
214
Different Prompts, Different Ranks: Prompt-aware Dynamic Rank Selection for SVD-based LLM Compression
不同提示不同秩:基于SVD的LLM压缩中提示感知的动态秩选择
推理部署性能优化

本文提出PARSE框架,解决基于SVD的LLM压缩中静态秩截断问题。通过训练线性路由器实现提示感知的动态秩选择,利用语义相似的提示共享秩模式,并在推理时从模式缓存中直接服务,避免重复计算。结合专家内存聚合和kernel融合等系统级优化,PARSE在与四种主流SVD压缩方法集成后,在LLaMA-7B上平均任务准确率提升10%,同时实现最高2.5倍prefill加速和2.4倍decode加速。

arXiv:2605.08568v1 Announce Type: new Abstract: Large language models (LLMs) have rapidly grown in scale, creating substantial memory and computational costs that hinder efficient deployment. Singular value decomposition (SVD) has emerged as an effective post-training compression technique, but existing SVD-based methods rely on static rank truncation, applying a fixed prefix of singular components to all inputs regardless of their diversity. We identify two limitations of this static design: the optimal rank varies across individual prompts, and the selected rank is sensitive to the choice of calibration set, leading to suboptimal performance across diverse inputs. To address these challenges, we propose $\textbf{PARSE}$, a post-training framework for $\textbf{P}$rompt-$\textbf{A}$ware $\textbf{R}$ank $\textbf{S}$election as $\textbf{E}$xperts in SVD-compressed LLMs. PARSE trains a linear router offline to perform prompt-aware rank selection, decoupling it from calibration information by supervising the router against dense-model outputs on a large-scale corpus. We further observe that rank-selection patterns are shared across semantically similar prompts and remain stable across decoding steps, allowing appropriate rank subsets to be served directly from a pattern cache at inference. Complemented by expert memory aggregation and kernel fusion for system-level efficiency, PARSE is orthogonal to existing SVD-based pipelines and consistently improves both model quality and inference efficiency. Integrated with four representative SVD-based methods, PARSE improves average task accuracy by up to 10% at a compression ratio of 0.6 on LLaMA-7B, and achieves up to 2.5 $\times$ prefill and 2.4 $\times$ decode speedup over native SVD execution.
arXiv arXiv cs.LG · 1 天前 · 相关度 80% 热度★★☆☆☆
215
Tokens-per-Parameter Coverage Is Critical for Robust LLM Scaling Law Extrapolation
每参数token覆盖度对鲁棒的大语言模型缩放律外推至关重要
学术论文训练微调

本文研究了基于Chinchilla最优计算法则的缩放律拟合中,固定每参数token比例(TPP)的设计会导致参数估计的高度病态,原因在于模型参数量与token量指数相近时设计矩阵的条件数急剧膨胀。作者在四种缩放律形式下证明了这种共线性设计使得尺度系数几乎不可辨识,置信区间膨胀一个数量级以上,导致训练射线外的外推性能急剧下降。通过推导TPP多样性阈值的闭式解,实验表明非共线设计在四个定律、五个语料库上的保留分割测试中取得97.3%的胜率,揭示了缩放律外推鲁棒性的关键因素。

arXiv:2605.08541v1 Announce Type: new Abstract: Neural scaling laws approximate a language model&#39;s loss as a power-law function of parameter count $N$ and token count $D$. Following Chinchilla-style compute-optimal training, many studies fit scaling laws from runs performed under a fixed tokens-per-parameter (TPP) ratio $k$ and set $D = kN$. We show that this collinear design, combined with the empirically common near-equality of the exponents governing $N$ and $D$, induces an inherent ill-conditioning in the Gauss-Newton least-squares problem: the condition number of the design grows as the inverse square of the gap between the $N$ and $D$-exponents. The scale coefficients become practically unidentifiable, with confidence intervals inflating by an order of magnitude or more, yielding a ``sloppy&#39;&#39; model whose extrapolations degrade sharply off the training ray. We prove this for four scaling-law formalisms and derive a closed-form TPP-diversity threshold that is necessary and sufficient for well-conditioned estimation. Empirically, non-collinear designs outperform collinear ones on held-out splits with a 97.3\% win rate across four laws, five corpora, multiple floating point precision modes. We further show the degeneracy is rooted in Jacobian geometry and is not an artifact of the loss function: any smooth estimation objective whose curvature involves the Jacobian inherits the same ill-conditioning.
arXiv arXiv cs.LG · 1 天前 · 相关度 80% 热度★★☆☆☆
216
Mid-Training with Self-Generated Data Improves Reinforcement Learning in Language Models
利用自生成数据进行中间训练提升语言模型的强化学习效果
训练微调学术论文

本文研究在强化学习(RL)微调之前,通过自生成的多样化推理数据对大语言模型进行中间训练(mid-training),以提升后续RL的效果。方法基于Polya问题求解思路生成多种正确答案变体,理论分析表明中间训练能改善RL中的策略梯度更新。实验显示,在数学推理、代码生成和叙事推理等基准上,采用该中间训练初始化的RL模型取得了一致性提升。

arXiv:2605.08472v1 Announce Type: new Abstract: The effectiveness of Reinforcement Learning (RL) in Large Language Models (LLMs) depends on the nature and diversity of the data used before and during RL. In particular, reasoning problems can often be approached in multiple ways that rely on different forms of reasoning, and exposure to only a limited range of such approaches in the training data may limit the effectiveness of RL. Motivated by this, we investigate using diverse self-generated data during mid-training as an intermediate step before RL training. Specifically, we adopt a bootstrapped data-generation framework guided by George Polya&#39;s problem-solving approaches for generating multiple variants of correct answers for each question in the training data, and then perform fine-tuning. We first provide a theoretical perspective on how mid-training on such data improves RL and explain how policy-gradient updates can incentivize combining multiple approaches. We then empirically demonstrate that RL-trained models initialized with our mid-training data achieve consistent improvements across various mathematical reasoning benchmarks and other OOD tasks like code generation and narrative reasoning. Overall, our investigative study shows that a language model learning multiple problem-solving approaches, through self-generated data helps subsequent RL.
arXiv arXiv cs.AI · 1 天前 · 相关度 80% 热度★★☆☆☆
217
Transformers Can Implement Preconditioned Richardson Iteration for In-Context Gaussian Kernel Regression
Transformer能够为上下文高斯核回归实现预条件理查德森迭代
学术论文基础大模型

本文证明了标准softmax注意力Transformer可以在前向传播中隐式执行预条件理查德森迭代,从而近似高斯核岭回归的预测过程。作者构造了一个单头Transformer,使用O(log(1/ε))个块和一定宽度的MLP,即可对长度为N的提示实现ε精度的预测。该构造揭示了softmax注意力层生成行归一化的高斯核算子以处理跨token交互,而ReLU MLP局部近似更新所需的标量运算。通过在GPT-2风格Transformer上的高斯过程回归实验,作者观察到逐层预测误差与预条件理查德森迭代的步骤输出最为一致,并通过消融实验进一步支持了这种机制解释。

arXiv:2605.08475v1 Announce Type: new Abstract: Mechanistic accounts of in-context learning (ICL) have identified iterative algorithms for linear regression and related linear prediction tasks, often using linear or ReLU attention variants. For nonlinear ICL, prior work has related softmax and kernelized attention to functional-gradient-type dynamics, but it remains unclear whether a standard transformer with softmax attention can implement a convergent solver with an end-to-end prediction-error guarantee. In this paper, we study in-context kernel ridge regression (KRR) with Gaussian kernels and show that a standard softmax-attention transformer can approximate the KRR predictor during its forward pass by implementing preconditioned Richardson iteration on the associated kernel linear system. Under bounded-data assumptions, we construct a single-head transformer with $O(\log(1/\epsilon))$ blocks and MLP width $O(\sqrt{N/\epsilon})$ that achieves $\epsilon$-accurate prediction for prompts of length $N$. Our construction reveals a functional decomposition within the transformer architecture: softmax attention produces a row-normalized Gaussian-kernel operator needed for cross-token interactions, while ReLU MLP layers act locally to approximate the intra-token scalar arithmetic required by the update. Empirically, we train GPT-2-style transformers on Gaussian-process regression tasks to further test the preconditioned Richardson interpretation. Through linear probing, we compare the transformer&#39;s layer-wise predictions with the step-wise outputs of classical KRR solvers and find that its error profiles align most consistently with preconditioned Richardson iteration. Ablation studies further support this interpretation. Together, our theory and experiments identify preconditioned Richardson iteration as a concrete mechanism that softmax-attention transformers can realize for nonlinear in-context Gaussian-kernel regression.
arXiv arXiv cs.LG · 1 天前 · 相关度 80% 热度★★☆☆☆
218
训练微调学术论文

本文提出一个区分大模型后训练中“能力诱导”与“能力创造”的分析框架。作者引入“可访问支持”概念,描述模型在有限预算下实际可生成的行为空间;在此空间内重新加权行为概率属于能力诱导,而改变该空间本身则属于能力创造。通过自由能视角,将有监督微调(SFT)和强化学习(RL)统一视为对预训练参考分布进行重加权,仅外部信号不同;当更新接近基模型时,主要效果是局部重加权而非能力创造。这一视角将研究焦点从SFT与RL的形式区分转移到行为空间是否被扩展这一本质问题上。

arXiv:2605.08368v1 Announce Type: new Abstract: Debates about large language model post-training often treat supervised fine-tuning (SFT) as imitation and reinforcement learning (RL) as discovery. But this distinction is too coarse. What matters is whether a training procedure increases the probability of behaviors the pretrained model could already produce, or whether it changes what the model can practically reach. We argue that post-training research should distinguish between capability elicitation and capability creation. We make this distinction operational by introducing the notion of accessible support: the set of behaviors that a model can practically produce under finite budgets. Post-training that reweights behaviors within this support is capability elicitation; whereas changing the support itself corresponds to capability creation. We develop this argument through a free-energy view of post-training. SFT and RL can both be seen as reweighting a pretrained reference distribution, only with different external signals. Demonstration signals define low-energy behavior for SFT, and reward signals define low-energy behavior for RL. When the update remains close to the base model, the main effect is local reweighting, not capability creation. Within this framework, the central question is no longer whether post-training is framed as SFT or RL, but whether it reweights behaviors already within reach, or instead expands the model&#39;s reachable behavioral space through search, interaction, tool use, or the incorporation of new information.
arXiv arXiv cs.AI · 1 天前 · 相关度 80% 热度★★☆☆☆
219
Reinforcement Learning for Scalable and Trustworthy Intelligent Systems
面向可扩展且可信赖智能系统的强化学习研究
训练微调学术论文

本博士论文围绕强化学习在大规模分布式环境与安全对齐方面的双重挑战展开研究。前半部分聚焦联邦学习场景下的高效通信与异步优化,提升多智能体系统的可扩展性;后半部分针对大语言模型后训练阶段,研究基于人类偏好的策略对齐与上下文感知的安全信息披露,增强可信性。整体论证了下一代智能系统必须同时具备高效优化与可信行为,并将强化学习作为统一框架。

arXiv:2605.08378v1 Announce Type: new Abstract: Reinforcement learning has become a powerful paradigm for improving the capability of intelligent systems, but its practical deployment faces two central challenges. First, reinforcement learning must scale efficiently in distributed environments where communication bandwidth is limited and computation is heterogeneous across agents. Second, as reinforcement learning is increasingly used in post-training large language models and autonomous agents, the optimized policies must also be aligned with human preferences and satisfy safety requirements such as privacy-aware information disclosure. This dissertation addresses both challenges through four complementary contributions spanning federated optimization, preference alignment, and contextual safety. The first part of the dissertation studies scalable reinforcement learning in federated settings. The second part of the dissertation studies trustworthy reinforcement learning for large language models. Together, these contributions advance reinforcement learning along two complementary dimensions. On the one hand, they make reinforcement learning more scalable through communication-efficient and asynchronous federated optimization. On the other hand, they make reinforcement learning more trustworthy by improving alignment with human preferences and by reducing contextually inappropriate information disclosure in language-based intelligent systems. As a whole, this dissertation argues that the next generation of intelligent systems will require both efficient optimization and trustworthy behavior, and that reinforcement learning provides a unifying framework for addressing both goals.
arXiv arXiv cs.LG · 1 天前 · 相关度 80% 热度★★☆☆☆
220
A Unified Pair-GRPO Family: From Implicit to Explicit Preference Constraints for Stable and General RL Alignment
统一Pair-GRPO家族:从隐式到显式偏好约束实现稳定通用的RL对齐
训练微调

本文提出了Pair-GRPO家族,包含Soft-Pair-GRPO和Hard-Pair-GRPO两种变体,用于解决RLHF中策略更新不稳定、梯度方向模糊和方差高等问题。Soft-Pair-GRPO用二元配对偏好奖励替代GRPO的标量奖励,并证明了其梯度与标准GRPO梯度正数倍等价的定理,解释了其经验稳定性;Hard-Pair-GRPO进一步引入显式局部概率约束和约束KL拟合优化,抑制梯度噪声和全局策略漂移。在HH-RLHF和UltraFeedback等对齐基准及HalfCheetah连续控制任务上,该方法在对齐质量、人类偏好胜率、训练稳定性和泛化能力方面优于现有SOTA基线,并通过消融实验验证了各组件的关键作用。

arXiv:2605.06375v1 Announce Type: cross Abstract: Large language model (LLM) alignment via reinforcement learning from human preferences (RLHF) suffers from unstable policy updates, ambiguous gradient directions, poor interpretability, and high gradient variance in mainstream pairwise preference learning paradigms. To systematically address these limitations, we establish a unified theoretical framework for preference-based RL optimization centered on the Pair-GRPO family, comprising two tightly coupled variants: Soft-Pair-GRPO and Hard-Pair-GRPO. Soft-Pair-GRPO is a minimal modification of Group Relative Policy Optimization (GRPO) that replaces group-normalized scalar rewards with binary pairwise preference rewards, retaining GRPO&#39;s clipped surrogate and KL-regularized structure. We prove a critical gradient equivalence theorem: under first-order Taylor expansion around the current policy, Soft-Pair-GRPO&#39;s gradient is a positive scalar multiple of standard GRPO&#39;s gradient, explaining its empirical stability despite discarding continuous reward magnitudes. Building on this foundation, we propose Hard-Pair-GRPO, an advanced variant introducing explicit local probability constraints and constrained KL-fitting optimization to further suppress gradient noise and global policy drift. We provide comprehensive theoretical guarantees for both variants--including monotonic policy improvement, deterministic gradient direction, gradient-variance reduction, and dynamic step-size convergence. Extensive experiments on standard LLM alignment benchmarks (HH-RLHF,UltraFeedback) and the MuJoCo continuous control task HalfCheetah-v4 demonstrate that our Pair-GRPO family consistently outperforms state-of-the-art baselines in alignment quality, human preference win rate, training stability, and generalization to general reinforcement learning. Ablation studies validate the critical contributions of each core component.
arXiv arXiv cs.AI · 1 天前 · 相关度 80% 热度★★☆☆☆
221
RubricRefine: Improving Tool-Use Agent Reliability with Training-Free Pre-Execution Refinement
RubricRefine:通过无需训练的预执行精炼提升工具使用代理的可靠性
推理部署学术论文

提出 RubricRefine,一种无需训练的预执行可靠性层,可为工具调用代码生成专用评分量规,在执行前检查并修复输出形状错误、工具路由错误等合约违规,迭代精炼候选代码。在 M3ToolEval 基准上,该方法在零次实际执行的情况下将 7 个模型的平均得分提升至 0.86,优于所有基线推理时方法,且延迟较最强非迭代方案降低 2.6 倍;在主要依赖单步工具的 API-Bank 上性能持平,验证了其对工具间合约结构的依赖。

arXiv:2605.09730v1 Announce Type: new Abstract: Iterative self-refinement is a popular inference-time reliability technique, but its effectiveness in code-mode tool use depends heavily on the structure of the feedback signal: unstructured critique helps inconsistently across models, and even revision with real execution feedback improves only modestly ($0.75$ vs. $0.65$ baseline). The dominant failures are inter-tool contract violations - wrong output shape, incorrect tool routing, broken argument provenance - that run to completion without raising errors, making runtime feedback insufficient. We introduce RubricRefine, a training-free pre-execution reliability layer that generates task- and registry-specific rubrics, scores candidate code against explicit contract checks, and iteratively repairs failures before any execution occurs. With zero execution attempts, RubricRefine reaches $0.86$ on M3ToolEval averaged across seven models-improving over prior inference-time baselines on every model tested on this benchmark, at $2.6X$ lower latency than the strongest non-iterative alternative - and remains flat on the predominantly single-step API-Bank, consistent with the method&#39;s reliance on inter-tool contract structure. A rubric-category ablation and calibration analysis further characterize when and why the method works.
arXiv arXiv cs.LG · 1 天前 · 相关度 80% 热度★★☆☆☆
222
PRISM: Generation-Time Detection and Mitigation of Secret Leakage in Multi-Agent LLM Pipelines
PRISM:多智能体LLM流水线中机密泄露的生成时检测与缓解
推理部署开发工具学术论文

本文针对多智能体LLM系统中敏感信息通过共享上下文传播泄露的风险,提出了PRISM实时防御方法。PRISM将凭证泄露视为生成过程中的序列化风险累积问题,在每一步解码时融合16种跨词汇、结构、信息论、行为与上下文的信号,将风险分级为绿、黄、红区域进行逐令牌干预。核心发现是机密复现前常伴随熵崩溃与logit集中等可测量的生成动态偏移,结合标识符模式检测等文本结构线索可提前预警。在包含13类攻击与三种压力等级的2000任务对抗基准测试中,PRISM实现了F1=0.832,精确率1.0,任务级泄露率0.0%,输出效用0.893,显著优于最对比方法。

arXiv:2605.10614v1 Announce Type: new Abstract: Multi-agent LLM systems introduce a security risk in which sensitive information accessed by one agent can propagate through shared context and reappear in downstream outputs, even without explicit adversarial intent. We formalise this phenomenon as propagation amplification, where leakage risk increases across agent boundaries as sensitive content is repeatedly exposed to downstream generators. Existing defences, including prompt-based safeguards, static pattern matching, and LLM-as-judge filtering, are not designed for this setting: they either operate after generation, rely primarily on surface-form patterns, or add substantial latency without modelling the generation process itself. To resolve these issues, we propose PRISM, a real-time defence that treats credential leakage as a sequential risk accumulation problem during generation. At each decoding step, PRISM combines 16 signals spanning lexical, structural, information-theoretic, behavioural, and contextual features into a calibrated risk score, enabling per-token intervention through green, yellow, and red risk zones. Our central observation is that credential reproduction is often preceded by a measurable shift in generation dynamics, characterised by entropy collapse and increasing logit concentration. When combined with text-structural cues such as identifier-pattern detection, these temporal signals provide an early warning of leakage before a secret is fully reconstructed. Across a 2,000-task adversarial benchmark covering 13 attack categories and three pressure levels in a heterogeneous four-agent pipeline, PRISM achieves F1 = 0.832 with precision = 1.000 and recall = 0.712, while producing no observed leakage on our benchmark (0.0% task-level leak rate) and preserving output utility of 0.893. It substantially outperforms the strongest baseline, Span Tagger, which achieves F1 = 0.719 with a 15.0% task-level leak rate.
arXiv arXiv cs.AI · 1 天前 · 相关度 80% 热度★★☆☆☆
223
Agent-First Tool API: A Semantic Interface Paradigm for Enterprise AI Agent Systems
Agent-First Tool API:面向企业AI代理系统的语义接口范式
开发工具学术论文

本文针对传统CRUD风格API与自主AI agent的不匹配问题,提出Agent-First Tool API范式。该范式包含三部分:六动词语义协议(search/resolve/preview/execute/verify/recover),提供置信度、证据链和操作建议的规范化工具合同(NTC),以及静态能力策略与动态风险升级的双层治理管道。在包含85个工具、6个业务领域的多租户SaaS平台上验证,50个实际任务测试显示任务成功率从CRUD基线的64%提升至88%,人工干预减少72.7%,自主错误恢复能力提升5.8倍。该范式作为语义应用层,与传输层标准MCP正交互补。

arXiv:2605.10555v1 Announce Type: new Abstract: As AI agents transition from research prototypes to enterprise production systems, the tool interfaces they consume remain rooted in human-oriented CRUD paradigms. This paper identifies five fundamental architectural mismatches between conventional APIs and autonomous agent requirements: exact-identifier dependence, rendering-oriented responses, single-shot interaction assumptions, user-equivalent authorization, and opaque error semantics. We propose the Agent-First Tool API paradigm, comprising three integrated mechanisms: (1) a Six-Verb Semantic Protocol that decomposes tool interactions into search, resolve, preview, execute, verify, and recover phases; (2) a Normalized Tool Contract (NTC) providing structured decision-support metadata including confidence scores, evidence chains, and suggested next actions; and (3) a dual-layer governance pipeline combining static capability policies with dynamic risk escalation. The paradigm is implemented and validated in a production multi-tenant SaaS platform serving 85 registered tools across 6 business domains. Comparative experiments on 50 real operational tasks demonstrate that Agent-First APIs achieve 88% end-to-end task success rate versus 64% for optimized CRUD baselines (+37.5%), while reducing required human interventions by 72.7% and improving autonomous error recovery by 5.8x. We establish that the paradigm is orthogonal and complementary to transport-layer standards such as MCP, operating as the semantic application layer above existing tool discovery and invocation protocols.
arXiv arXiv cs.AI · 1 天前 · 相关度 80% 热度★★☆☆☆
224
Sketch-and-Verify: Structured Inference-Time Scaling via Program Sketching
草图与验证:通过程序草图实现结构化推理时扩展
推理部署性能优化学术论文

本文提出 SKETCHVERIFY 方法,针对廉价小模型(如 Gemini 3.1 Flash Lite)在代码生成中,利用推理时额外计算资源进行结构化搜索。它让 LLM 枚举 K 种不同算法策略,为每种生成包含占位符的程序草图,再对每个草图填充 M 次,产生 K×M 个结构多样的候选,最后通过执行验证和指纹聚类选择最佳解。在 HumanEval+ 上,相同候选数量下草图方法显著优于平坦采样:Lite 模型贪心失败的 19 题中,K=2,M=5 恢复率达 58%,而平坦 N=10 仅 26%;即使在 3 倍预算下平坦也无法追平。该方法在无法升级模型时是提升推理性能的经济有效途径,且能与基于执行的语义投票方法组合使用。

arXiv:2605.08658v1 Announce Type: new Abstract: SKETCHVERIFY is a within-tier cost-performance policy, not a universal accuracy improvement. The operational question: a practitioner stuck with a small, cheap code model (here, Gemini 3.1 Flash Lite) for latency, deployment, or budget reasons -- how should they spend a small amount of extra test-time compute? SKETCHVERIFY factorizes the search space: the LLM enumerates K distinct algorithmic strategies, writes a program sketch for each (a partial program with ?? holes), and fills each sketch M times, producing K x M structurally diverse candidates that are verified by execution and selected by fingerprint clustering. Each extra sketch is guaranteed to explore a different algorithm; each extra flat sample likely duplicates an existing one. Our central evidence is a cost-quality Pareto plot on HumanEval+ across three Gemini tiers (Lite, Flash, Pro), and a reanalysis of the 19 problems where Lite greedy fails. Two findings: (1) Within-tier, sketching dominates flat sampling at matched candidate count. On the hard subset, Lite Sketch K=2, M=5 recovers 11/19 (58%) vs. flat N=10 at 5/19 (26%, +32pp); Lite Sketch K=10, M=10 recovers 15/19 (79%) vs. flat N=100 at 10/19 (53%, +26pp). Flat cannot close the gap even at ~3x the budget: flat N=50 still loses to Sketch K=2, M=5 by +11pp. (2) Cross-tier, sketching does not replace upgrading. Pro greedy (89%) dominates Lite Sketch K=10, M=10 (79%) on both pass@1 and dollar cost. Practitioner rule: if a stronger tier is available, use greedy on it; otherwise sketching is the cost-effective way to spend extra compute. We characterize the K-vs-M trade-off via a Flash Lite scaling sweep, report HumanEval+ saturation on Flash and Pro, and show the method composes cleanly with execution-based selection from the concurrent Semantic Voting line of work.
arXiv arXiv cs.LG · 1 天前 · 相关度 78% 热度★★☆☆☆
225
Lattice Deduction Transformers
格演绎变换器
基础大模型学术论文

提出格演绎变换器(LDT),一种循环Transformer,通过将隐状态投影到格结构来近似逻辑上可靠的演绎。训练过程模拟搜索约束求解器的演绎机制,采用抽象解释进行领域无关的监督。800K参数LDT在极端数独和雪花数独上实现100%准确率,1.8M参数版本在迷宫难题上达到99.9%准确率,成本显著低于以往小型循环推理器,而前沿大型语言模型在此类基准上准确率为0%。

arXiv:2605.08605v1 Announce Type: new Abstract: We introduce the Lattice Deduction Transformer (LDT), a recurrent transformer that approximates logically sound deduction by projecting its latent state through a lattice between forward passes. We train on-policy in a process that mirrors deduction in a search-based constraint solver and supervise training via a domain-agnostic, abstract-interpretation-based approximation of the set of solution candidates. An $800$K-parameter LDT achieves $100\%$ accuracy on Sudoku-Extreme and Snowflake Sudoku, at a fraction of the training cost of prior small recurrent reasoners, while remaining empirically sound: the model returns a correct answer or abstains. A $1.8$M-parameter variant reaches $99.9\%$ accuracy on Maze-Hard. Frontier LLMs score $0\%$ on all three benchmarks.
arXiv arXiv cs.LG · 1 天前 · 相关度 78% 热度★★☆☆☆
226
Finer is Better (with the Right Scaling)
更细更好(配合正确的缩放)
推理部署学术论文

本文研究了大语言模型在超低精度(FP4)量化中观察到的块大小悖论:直觉上更细的块大小应降低量化误差,但标准abs-max缩放反而导致模型质量下降。作者揭示根本原因是重尾张量分布与FP4粗糙上量化区间的交互不良,并提出多种干预措施,如防止缩放因子下溢为零、采用4-over-6方法等,证明了通过算法修正可使标准格式(如OCP E4M3)达到定制宽指数格式的性能。实验在多个大模型上验证了方法有效性,彻底解决了该悖论,并稳定提升了困惑度指标。

arXiv:2605.08565v1 Announce Type: new Abstract: Microscaling is a critical technique for preserving the quality of Large Language Models (LLMs) quantized to ultra-low precision formats. Intuitively, finer block sizes should yield lower quantization error; however, a paradox recently identified in the literature demonstrates that standard abs-max scaling can actually degrade model quality as block sizes shrink. In this work, we investigate the underlying mechanics of this phenomenon. We demonstrate that this degradation is not an inherent limitation of finer granularity, but is primarily driven by heavy-tailed tensor distributions interacting poorly with the coarse upper quantization bins of the FP4 element format. Specifically, we show that i) preventing the scaling factor from underflowing to zero mitigates localized errors, ii) targeted algorithmic interventions like the 4-over-6 methodology effectively correct the quantization geometry for large elements, and iii) a brute-force search establishes an optimal baseline, confirming that the theoretical Mean Squared Error (MSE) strictly improves with finer block sizes. Ultimately, our findings reveal a valuable interchangeability: applying the correct algorithmic recipe allows standard, hardware-compliant formats (like OCP E4M3) to match the performance of custom, wider-exponent formats (like UE5M3). We validate these results across several large language models, fully resolving the block size paradox and achieving robust downstream perplexity improvements.
arXiv arXiv cs.LG · 1 天前 · 相关度 78% 热度★★☆☆☆
227
FlashEvolve: Accelerating Agent Self-Evolution with Asynchronous Stage Orchestration
FlashEvolve:通过异步阶段编排加速Agent自我进化
推理部署性能优化

本文提出FlashEvolve框架,针对基于LLM的Agent自我进化中因同步阶段执行和负载不均导致的墙钟时间过长问题,引入异步工作器与队列实现阶段重叠,并处理数据过期问题。该框架利用语言空间过期信息的可读性,让LLM反思和修正过期产物转化为进化信号。通过投机性阶段完成和自适应工作流控制进一步提升吞吐和令牌效率,在本地vLLM和API服务上分别获得3.5倍和4.9倍的提案吞吐量提升。

arXiv:2605.08520v1 Announce Type: new Abstract: LLM-based evolution has emerged as a promising way to improve agents by refining non-parametric artifacts, but its wall-clock cost remains a major bottleneck. We identify that this cost comes from synchronized stage execution and imbalance inside each LLM-heavy stage. We present FlashEvolve, an efficient framework that replaces synchronized execution with asynchronous workers and queues, allowing different stages and steps to overlap. To handle data staleness introduced by asynchrony, FlashEvolve tracks artifact versions and applies different policies to update, discard, or patch stale artifacts. Unlike weight-space staleness in asynchronous RL, language-space staleness is inspectable and repairable: a stale artifact is not just delayed work, but readable evidence that the LLM can reflect on, revise, and turn into useful evolution signal. FlashEvolve further improves throughput and token efficiency with speculative stage completion and adaptive workflow control. On GEPA workloads, FlashEvolve improves proposal throughput by $3.5\times$ on local vLLM and $4.9\times$ on API serving over synchronous GEPA. The same design also applies to ACE and Meta-Harness.
arXiv arXiv cs.LG · 1 天前 · 相关度 78% 热度★★☆☆☆
228
Measuring What Matters: Benchmarking Generative, Multimodal, and Agentic AI in Healthcare
衡量真正重要的指标:医疗领域生成式、多模态和代理式AI的基准测试
基础大模型学术论文

这篇论文指出了当前医疗AI基准测试的局限性:现有基准大多只衡量模型的知识水平,无法评估其在真实临床环境中的可靠性、安全性和临床相关性。数据显示,前沿模型在医学执照考试中得分接近满分,但在真实临床任务(如病历记录、决策支持、管理工作流)上表现显著下降,揭示了“部署准备充分”的假象。作者呼吁建立一个原则性的基准设计框架,以系统测量AI在复杂临床工作流中的实际表现,区分模型能力不足与评估方法失效的差异。

arXiv:2605.08445v1 Announce Type: new Abstract: AI models are increasingly deployed in live clinical environments where they must perform reliably across complex, high-stakes workflows that standard training and validation datasets were never designed to capture. Evaluating these systems requires benchmarks: structured combinations of tasks, datasets, and metrics that enable reproducible, comparable measurement of what a model can do. The central challenge in healthcare AI is not performance alone, but the absence of systematic methods to measure reliability, safety, and clinical relevance under real-world conditions. Most existing benchmarks test what a model knows; too few test whether it can perform reliably and without failing across the full complexity of real clinical tasks. Current benchmarks have accumulated through ad hoc dataset construction optimized for narrow task performance: frontier models achieve near-perfect scores on medical licensing examinations, but when evaluated across real clinical tasks, performance degrades sharply, scoring 0.74--0.85 on documentation, 0.61--0.76 on clinical decision support, and only 0.53--0.63 on administrative and workflow tasks \cite{medhelm}. High benchmark scores give a false sense of deployment readiness, and the gap between performance and utility widens precisely as AI systems take on more consequential clinical roles. Without a principled framework for benchmark design, the field cannot determine whether poor clinical performance reflects model limitations or failures in how performance is being measured.
arXiv arXiv cs.AI · 1 天前 · 相关度 78% 热度★★☆☆☆
229
When Independent Sampling Outperforms Agentic Reasoning
当独立采样优于Agent推理时
推理部署学术论文

本文研究在固定推理计算预算下,针对竞争性编程任务分配推理时间计算的策略。在216道Codeforces题目上,比较了基于Agent的推理与重复独立采样(k-shot),发现在准确率-成本和准确率-查询量权衡上,k-shot持续优于Agent推理,即使在提示缓存下仍存在每条调用效率差距。结论指出,对于自包含的算法任务,独立探索在现实资源约束下更具优势,并提供了使对数失败概率每美元成本最优的求解器预算分配分析。

arXiv:2605.08478v1 Announce Type: new Abstract: We study how to allocate inference-time compute for competitive programming under fixed budgets. Evaluating 216 Codeforces problems across Divisions 1-3, we compare agent-based reasoning with repeated independent sampling (k-shot) as a function of both cost and number of model calls. Across models and difficulty levels, k-shot consistently achieves a better accuracy-cost and accuracy-query tradeoff. This gap persists despite prompt caching in agent frameworks, indicating lower per-call effectiveness. Our results show that, for self-contained algorithmic tasks, independent exploration can outperform deeper agentic reasoning under realistic resource constraints. We also provide a budget-allocation analysis when the inference budget is fixed, and prove that a cost-optimal solver minimizes the principled metric log failure likelihood per dollar.
arXiv arXiv cs.LG · 1 天前 · 相关度 78% 热度★★☆☆☆
230
MemQ: Integrating Q-Learning into Self-Evolving Memory Agents over Provenance DAGs
MemQ:将Q学习整合到基于溯源DAG的自进化记忆智能体中
开发工具学术论文

MemQ 提出一种面向 LLM 智能体的记忆增强方法,通过 TD(λ) 资格迹将记忆的 Q 值反馈沿溯源有向无环图(DAG)反向传播,以依赖链的结构近似替代时间距离,实现记忆价值的动态累积。该方法在 OS 交互、函数调用、代码生成、多模态推理等六个基准上全部取得最高成功率,尤其在多步任务中受益最深,最高提升 5.7 个百分点。论文还将问题形式化为外生上下文 MDP,解耦任务流与内生记忆存储,并给出 γ 与 λ 参数的选取指南。

arXiv:2605.08374v1 Announce Type: new Abstract: Episodic memory allows LLM agents to accumulate and retrieve experience, but current methods treat each memory independently, i.e., evaluating retrieval quality in isolation without accounting for the dependency chains through which memories enable the creation of future memories. We introduce MemQ, which applies TD($\lambda$) eligibility traces to memory Q-values, propagating credit backward through a provenance DAG that records which memories were retrieved when each new memory was created. Credit weight decays as $(\gamma\lambda)^d$ with DAG depth $d$, replacing temporal distance with structural proximity. We formalize the setting as an Exogenous-Context MDP, whose factored transition decouples the exogenous task stream from the endogenous memory store. Across six benchmarks, spanning OS interaction, function calling, code generation, multimodal reasoning, embodied reasoning, and expert-level QA, MemQ achieves the highest success rate on all six in generalization evaluation and runtime learning, with gains largest on multi-step tasks that produce deep and relevant provenance chains (up to +5.7~pp) and smallest on single-step classification (+0.77~pp) where single-step updates already suffice. We further study how $\gamma$ and $\lambda$ interact with the EC-MDP structure, providing principled guidance for parameter selection and future research. Code will be available soon.
arXiv arXiv cs.AI · 1 天前 · 相关度 78% 热度★★☆☆☆
231
Reflective Prompted Policy Optimization: Trajectory-Grounded Revision and Salience Bias
反思性提示策略优化:轨迹驱动的修订与显著性偏差
学术论文开发工具

该论文提出R2PO框架,将LLM分为搜索LLM和评论LLM,利用完整轨迹行为证据替代标量奖励进行策略优化。在10个环境中,使用200亿参数开源模型,R2PO取得了最高平均最佳奖励,并比深度RL和此前LLM方法更早收敛且更稳定。研究识别出显著性偏差这一主要失败模式(在CartPole中解释了76.6%的回归),并通过聚合统计、中位轨迹选择和修订规则加以缓解。该方法展示将轨迹作为一阶上下文证据可使较小LLM更快、更精准地搜索策略空间。

arXiv:2605.08315v1 Announce Type: new Abstract: Existing LLM-based policy optimizers see only scalar rewards: that a policy scored 0.45, but not whether the agent got stuck in a loop, fell into a hole on the third step, or performed well on 19 out of 20 rollouts and failed catastrophically on one. We propose Reflective Prompted Policy Optimization (R2PO), a two-stage LLM framework for policy search over compact policy classes that augments scalar reward feedback with trajectory-level behavioral evidence. A Search-LLM proposes candidate policy parameters; the environment executes them; a Critic-LLM inspects the resulting rollouts and proposes targeted revisions grounded in observed states, actions, and rewards. Across ten environments, ablations show R2PO&#39;s gains require separating global search from behavior-grounded revision and using selection to filter high-variance edits. We further identify a dominant failure mode, salience bias: when presented with multiple rollouts, the Critic-LLM fixates on improving a single failure even when most trajectories succeed. In a three-trajectory variant where the Critic-LLM sees the best, worst, and median rollout, this behavior explains 76.6% of regressions on CartPole. R2PO mitigates this by reasoning over aggregate rollout statistics, median-trajectory selection, and a revision rule. Using a 20B open-weight model, R2PO achieves the highest mean best reward across all ten environments, reaches near-optimal performance substantially earlier (e.g., near-maximum CartPole reward within ~500 episodes), and trains far more stably than both deep RL and prior LLM-based methods. These results show that treating trajectories as first-class in-context evidence, rather than artifacts reduced to scalar returns, changes how even comparatively small LLMs search over policy spaces, enabling them to learn faster, diagnose more precisely, and reliably improve external controllers.
arXiv arXiv cs.LG · 1 天前 · 相关度 78% 热度★★☆☆☆
232
Entropy-informed Decoding: Adaptive Information-Driven Branching
基于熵信息的解码:自适应信息驱动分支
推理部署学术论文

本文提出了一种即插即用、模型无关的解码框架 EDEN,通过估计 LLM 输出分布的熵来自适应分配计算:在高熵区域同步扩展更多候选,低熵区域更贪婪,从而逼近更大宽度束搜索的效果但总展开次数更少。该框架从理论上证明单调熵分支因子在相同展开预算下优于固定分支因子,实验表明其在数学推理、代码生成和科学问答等任务上一致提升输出质量,并且实现了更好的精度-扩展开销权衡。

arXiv:2605.09745v1 Announce Type: new Abstract: Large language models (LLMs) achieve remarkable generative performance, yet their output quality is dependent on the decoding strategy. While sampling-based methods (e.g., top-k, nucleus) and search-and-select based methods (e.g., beam search, best-of-n, majority voting) can improve upon greedy decoding, both approaches suffer from limitations: sampling generally commits to a single path, while search often expends excessive computation regardless of task complexity. To address these, we introduce Entropy-informed decoding (EDEN), a plug-and-play, model-agnostic decoding framework that adaptively allocates computation based on the model&#39;s own uncertainty, approximating higher-width beam search with fewer expansions. At each generation step, EDEN estimates the entropy of the output token distribution and adjusts the branching factor monotonically with the entropy, expanding more candidates in high-entropy regions and following a greedier path in low-entropy regions, improving token efficiency. Experiments across complex tasks, including mathematical reasoning, code generation, and scientific questions, demonstrate that EDEN consistently improves output quality over existing decoding strategies, achieving better accuracy-expansion trade-offs than fixed-width beam search. By treating next-token selection as a noisy maximisation problem, we prove that branching factors monotone in entropy are guaranteed to find better (i.e. more probable) continuations than any fixed branching factor within the same total expansion budget, and derive explicit regret rates characterising the benefit of the adaptive allocation.
arXiv arXiv cs.LG · 1 天前 · 相关度 78% 热度★★☆☆☆
233
Reasoning Is Not Free: Robust Adaptive Cost-Efficient Routing for LLM-as-a-Judge
推理并非免费:面向LLM作为裁判的鲁棒自适应成本高效路由
推理部署学术论文

本文研究了具有显式推理能力的大语言模型在自动裁判场景下的收益与成本,发现推理对数学、编程等需结构化验证的任务显著提升准确性,但对简单任务增益有限且计算开销大。为此提出RACER方法,将推理路由建模为带KL散度不确定集的分布鲁棒优化问题,在固定预算下动态选择推理或非推理法官,理论保证最优策略唯一且线性收敛。实验表明RACER在分布漂移下实现了更优的准确率-成本权衡。

arXiv:2605.10805v1 Announce Type: new Abstract: Reasoning-capable large language models (LLMs) have recently been adopted as automated judges, but their benefits and costs in LLM-as-a-Judge settings remain unclear. Through controlled comparisons between reasoning and non-reasoning judges, we show that explicit reasoning substantially improves judgment accuracy on tasks requiring structured verification (e.g., math and coding), while offering limited or even negative gains on simpler evaluations and incurring significantly higher computational cost. These findings motivate that reasoning should be used selectively rather than universally, with awareness of possible distribution shift. We propose a Robust Adaptive Cost-Efficient Routing (RACER), which dynamically selects between reasoning and non-reasoning judges under a fixed budget by formulating routing as a constrained distributionally robust optimization problem. RACER explicitly accounts for distribution shift via a KL-divergence uncertainty set, admits an efficient primal--dual algorithm, and enjoys theoretical guarantees including uniqueness of the optimal policy and linear convergence. Extensive experiments show that RACER achieves superior accuracy--cost trade-offs under distribution shift.
arXiv arXiv cs.AI · 1 天前 · 相关度 78% 热度★★☆☆☆
234
ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdependent, and Large-Scale Tool Sandbox
ComplexMCP:在动态、相互依赖的大规模工具沙箱中评估LLM智能体
学术论文开发工具

该工作提出了ComplexMCP基准,基于MCP协议构建了超过300个精心测试的工具,覆盖办公套件与金融系统等7种有状态沙箱,能够模拟动态环境状态与不可预知的API故障。评估显示,即使顶尖LLM在完整上下文和RAG范式下成功率也不足60%,远低于人类的90%。轨迹分析揭示了三个关键瓶颈:工具检索随动作空间扩大而饱和、智能体因过度自信忽略环境验证、以及倾向于合理化失败而非恢复的“策略性认输”倾向。

arXiv:2605.10787v1 Announce Type: new Abstract: Current LLM agents are proficient at calling isolated APIs but struggle with the &#34;last mile&#34; of commercial software automation. In real-world scenarios, tools are not independent; they are atomic, interdependent, and prone to environmental noise. We introduce $\textbf{ComplexMCP}$, a benchmark designed to evaluate agents in these rigorous conditions. Built on the Model Context Protocol (MCP), $\textbf{ComplexMCP}$ provides over 300 meticulously tested tools derived from 7 stateful sandboxes, ranging from office suites to financial systems. Unlike existing datasets, our benchmark utilizes a seed-driven architecture to simulate dynamic environment states and unpredictable API failures, ensuring a deterministic yet diverse evaluation. We evaluate various LLMs across full-context and RAG paradigms, revealing a stark performance gap: even top-tier models fail to exceed a 60% success rate, far trailing human performance 90%. Granular trajectory analysis identifies three fundamental bottlenecks: (1) $\textbf{tool retrieval saturation}$ as action spaces scale; (2) $\textbf{over-confidence}$, where agents skip essential environment verifications; and (3) $\textbf{strategic defeatism}$, a tendency to rationalize failure rather than pursuing recovery. These findings underscore the insufficiency of current agents for interdependent workflows, positioning $\textbf{ComplexMCP}$ as a critical testbed for the next generation of resilient autonomous systems.
arXiv arXiv cs.AI · 1 天前 · 相关度 78% 热度★★☆☆☆
235
Learning Multi-Indicator Weights for Data Selection: A Joint Task-Model Adaptation Framework with Efficient Proxies
学习数据选择的多指标权重:一种联合任务-模型自适应的高效代理框架
训练微调学术论文

本文提出一种面向大语言模型指令微调的数据选择框架,通过学习多维度质量指标的权重,实现数据选择对下游任务和特定模型的双重自适应。方法利用上下文学习信号在小型验证集上构建高效性能代理,避免全量微调即可确定最优权重配置。在Mistral、Qwen和Llama等模型上,仅使用30%的训练样本(GSM8K数据集)便达到甚至超过全量数据微调的性能,并揭示了推理任务中语义多样性与逻辑复杂性之间的权衡。

arXiv:2605.09665v1 Announce Type: new Abstract: Data selection is a key component of efficient instruction tuning for large language models, as recent work has shown that data quality often matters more than data quantity. Accordingly, prior studies have introduced various multi-dimensional heuristics to evaluate and filter instruction data. However, most existing methods rely on static task-agnostic and model-agnostic weighting schemes, which overlook the varying requirements of specific downstream tasks and the differing pre-existing capabilities of models. In this paper, we propose a framework for learning multi-indicator weights that jointly adapts data selection to both the downstream task and the specific model. Our method identifies optimal weight configurations without full-scale fine-tuning by utilizing in-context learning (ICL) signals on compact tiny-validation sets. These signals serve as efficient performance proxies that ensure high-fidelity evaluation at minimal computational cost. Experiments across multiple benchmarks and model families, including Mistral, Qwen, and Llama, show that the approach achieves performance comparable to or exceeding full-dataset tuning while using only 30\% of the training samples on GSM8K. Furthermore, our analysis reveals a trade-off between semantic diversity and logical complexity in reasoning tasks, highlighting the necessity of joint task-model adaptation.
arXiv arXiv cs.LG · 1 天前 · 相关度 78% 热度★★☆☆☆
236
AgentPSO: Evolving Agent Reasoning Skill via Multi-agent Particle Swarm Optimization
AgentPSO:通过多智能体粒子群优化进化推理技能
学术论文基础大模型

该论文提出 AgentPSO 框架,受粒子群优化启发,将多个智能体的推理技能表示为自然语言状态,通过迭代更新其“速度”和“个人/全局最优技能”以及自我反思方向,在不修改底层大模型参数的情况下进化推理能力。实验显示,该方法在数学和通用推理基准上优于静态单智能体技能和仅测试时的多智能体推理基线,且进化出的技能可跨基准和模型迁移,表明学习到的是可复用的推理过程而非单纯优化提示。

arXiv:2605.08704v1 Announce Type: new Abstract: Multi-agent reasoning has shown promise for improving the problem-solving ability of large language models by allowing multiple agents to explore diverse reasoning paths. However, most existing multi-agent methods rely on inference-time debate or aggregation, which can be vulnerable to incorrect peer influence and biased consensus. Moreover, the agents themselves remain static, as their underlying reasoning skills do not evolve across tasks. In this paper, we introduce AgentPSO, a particle-swarm-inspired framework for evolving multi-agent reasoning skills. AgentPSO treats each agent as a particle-like reasoner whose state is a natural-language skill and whose velocity is a semantic update direction, iteratively moving agents toward stronger skill states to improve both individual and collective reasoning performance. Across training iterations, each agent updates its skill by combining its previous velocity, personal-best skill, global-best skill, and a self-reflective direction derived from peer reasoning trajectories. This enables agents to learn reusable reasoning behaviors from both their own experiences and the strongest skills discovered by the population, without updating the parameters of the backbone language model. Experiments on mathematical and general reasoning benchmarks show that AgentPSO improves over static single-agent skills and test-time-only multi-agent reasoning baselines. The evolved skills further transfer across benchmarks and to another backbone model, suggesting that AgentPSO captures reusable reasoning procedures rather than merely optimizing benchmark-specific prompts. Code is open-sourced at https://github.com/HYUNMIN-HWANG/AgentPSO/.
arXiv arXiv cs.AI · 1 天前 · 相关度 75% 热度★★☆☆☆
237
MIND-Skill: Quality-Guaranteed Skill Generation via Multi-Agent Induction and Deduction
MIND-Skill:通过多智能体归纳与演绎生成质量保证的技能
开发工具学术论文

本文提出MIND-Skill框架,用于自动从Agent成功轨迹中提取可复用的技能。框架包含归纳智能体(抽象技能)和演绎智能体(按技能重建轨迹),并引入重建损失、结果损失和评分标准损失来保证技能质量和抽象粒度。三个文本损失通过TextGrad联合优化,在AppWorld和BFCL-v3任务上验证,其生成的技能质量稳定优于现有方法。

arXiv:2605.08670v1 Announce Type: new Abstract: Large language model (LLM) powered AI agents have emerged as a promising paradigm for autonomous problem-solving, yet they continue to struggle with complex, multi-step real-world tasks that demand domain-specific procedural knowledge. Reusable agent skills, which encapsulate successful problem-solving strategies, offer a natural remedy by enabling agents to build on prior experience. However, curating such skills has largely remained a manual endeavor, requiring human experts to distill rich domain knowledge into actionable guidelines. In this work, we present $\textbf{M}$ulti-agent $\textbf{IN}$duction and $\textbf{D}$eduction for $\textbf{Skill}$s ($\textbf{MIND-Skill}$), a framework that automatically induces generalizable skills from successful trajectories with robust quality guarantees. MIND-Skill consists of an induction agent which is tasked to abstract reusable skills from successful trajectories, and a deduction agent which aims to reconstruct trajectories by following the induced skills. To guarantee the quality of the generated skills, we introduce a reconstruction loss that compares input and reconstructed trajectories, an outcome loss that enforces the correctness of the reconstructed trajectories, and a rubric loss that assesses the documentation quality and regularizes the abstraction level of the generated skills according to predefined criteria. These textual losses are jointly optimized with TextGrad, and the resulting skills are evaluated on held-out tasks unseen during optimization. Experiments on AppWorld and BFCL-v3 show that MIND-Skill consistently outperforms concurrent skill generation methods.
arXiv arXiv cs.AI · 1 天前 · 相关度 75% 热度★★☆☆☆
238
Evaluating Developmental Cognition Capabilities of LLMs
评估大语言模型的发展认知能力
基础大模型学术论文

本文引入罗伯特·基根的建设性发展理论,设计了一种20题的开发句子完成测试(DSCT),用于评估大语言模型(LLM)在发展认知阶段上的表现。实验在三种场景下测试:模拟人格、真实人类回答和模型默认生成。结果显示,前沿LLM能高准确率识别模拟人格的设定发展阶段;在真实人类DSCT回答上,人类与LLM一致性尚可,但精细阶段匹配较弱;不同模型家族的默认回答呈现稳定阶段差异,且更大更新的模型倾向于生成更高阶段评分的文本。该工作揭示了阶段信号在合成文本中比人类文本更易检测,为构建阶段感知的对话AI提供了评估工具。

arXiv:2605.08549v1 Announce Type: new Abstract: Conversational AI is increasingly personalized around users&#39; preferences, histories, goals, and knowledge, but much less around how users interpret and take up model outputs to construct and understand their reality. We draw on Robert Kegan&#39;s constructive-developmental theory as a complementary lens on this dimension. Existing methods for assessing developmental stage in the Keganian tradition rely either on expert interviews that do not scale or on sentence-completion instruments that are proprietary, lengthy, or invasive. To make this perspective tractable for LLM evaluation, we introduce the Developmental Sentence Completion Test (DSCT), a 20-item instrument designed to elicit developmental signal in self-administered text. Throughout, we treat the resulting labels as characterizations of stage-like structure in elicited responses, not as validated person-level developmental stage. We then ask how much of that signal can be recovered by LLMs across three elicited response regimes: simulated personas, real human respondents, and default model-generated answers. On simulated personas, top frontier models recover simulator-intended labels with high accuracy. On real human DSCT responses, human-LLM agreement is fair, with much stronger within-neighborhood than exact agreement. Finally, when LLMs answer DSCT prompts without persona-conditioning, their responses exhibit stable stage-like differences across model families, with larger and newer models tending to generate higher-rated text. These results suggest that stage-conditioned signal is cleaner in synthetic responses than in human-written DSCT text, and that the core constraint for stage-aware conversational AI is not classifier accuracy alone, but the availability of developmental signal from elicited text.
arXiv arXiv cs.AI · 1 天前 · 相关度 75% 热度★★☆☆☆
239
Can Revealed Preferences Clarify LLM Alignment and Steering?
揭示偏好能否厘清大语言模型的对齐与引导?
基础大模型训练微调

本文提出一种基于揭示偏好的实验流程,用于评估大语言模型在不确定性决策中的内在目标函数。通过诱导模型对未知事件的概率分布和决策选择,并拟合离散选择模型,可恢复出最符合模型决策的隐含成本函数。该方法被用于检验模型行为的一致性、能否准确口述自身目标,以及是否可通过提示引导其采纳用户指定的效用函数。在四个医疗诊断领域和多个前沿/开源模型上的实验显示,虽然许多模型具有一定内部一致性,但在准确报告或按用户指令调整偏好方面仍存在严重不足。

arXiv:2605.08556v1 Announce Type: new Abstract: LLMs are increasingly used to make or support high-stakes decisions under uncertainty, where alignment depends not only on factual accuracy but on how models weigh tradeoffs between different outcomes. We present an empirical pipeline for estimating the implied preferences that an LLM&#39;s observed choices optimize: we elicit the model&#39;s probability distribution over unknowns along with the choice it would make for the decision task and then fit a discrete choice model to recover the cost function that best rationalizes the model&#39;s decisions. We show how this revealed-preference description allows rigorous evaluation of whether models behave in a consistently goal-directed way, whether they can verbalize a description of their objectives which matches their revealed decision policy, and whether prompting can reliably steer those policies to implement a user-specified cost function. We apply this evaluation across four medical diagnosis domains and multiple frontier and open-source models. We find that while many models have a nontrivial degree of internal coherence, they also have significant weaknesses in faithfully reporting or adopting preferences in response to user direction.
arXiv arXiv cs.LG · 1 天前 · 相关度 75% 热度★★☆☆☆
240
CoCoDA: Co-evolving Compositional DAG for Tool-Augmented Agents
CoCoDA:面向工具增强智能体的共同进化组合型DAG框架
开发工具学术论文

本文提出CoCoDA框架,通过共同进化的组合代码有向无环图(DAG)统一管理工具库与规划器。图中的节点代表原始或复合工具,边编码调用依赖,节点附带类型签名、前置/后置条件与示例。推理时采用Typed DAG检索,依次通过签名统一、描述排序、行为规范过滤和示例消歧,逐步缩小候选集以减少上下文消耗;训练时将成功轨迹折叠为经过验证的复合工具,并利用基于DAG的奖励信号激励复合工具的使用。在数学推理、表格分析和代码任务上,该方法使8B学生模型达到或超越32B教师模型,并持续优于现有工具使用与工具库学习基线。

arXiv:2605.08399v1 Announce Type: new Abstract: Tool-augmented language models can extend small language models with external executable skills, but scaling the tool library creates a coupled challenge: the library must evolve with the planner as new reusable subroutines emerge, while retrieval from the growing library must remain within a fixed context budget. Existing tool-use and skill-library methods typically treat tools as flat or text-indexed memories, causing prompt cost to grow with library size and obscuring the typed, compositional structure of executable code. We propose CoCoDA, a framework that co-evolves the planner and tool library through a single code-native structure: a compositional code DAG. Nodes are primitive or composite tools, edges encode invocation dependencies, and each node stores a typed signature, description, pre/post-condition specification, and worked examples. At inference time, Typed DAG Retrieval prunes candidates by symbolic signature unification, ranks survivors by descriptions, filters them by behavioral specifications, and disambiguates with examples, keeping expensive context materialization on progressively smaller candidate sets. At training time, successful trajectories are folded into validated composite tools, while the planner is updated with a DAG-induced reward that credits composites by their primitive expansion size. We provide theoretical results showing retrieval cost reduction, sublinear retrieval time, compositional advantage under the shaped reward, monotone co-evolution under conservative updates, and DAG well-formedness. Across mathematical reasoning, tabular analysis, and code task benchmarks, CoCoDA enables an 8B student to match or exceed a 32B teacher on GSM8K and MATH and consistently improves over strong tool-use and library-learning baselines.
arXiv arXiv cs.AI · 1 天前 · 相关度 75% 热度★★☆☆☆
241
MathConstraint: Automated Generation of Verified Combinatorial Reasoning Instances for LLMs
MathConstraint:面向大语言模型的自动化验证组合推理实例生成
学术论文基础大模型

该论文提出了一个名为MathConstraint的硬适性基准测试,用于评估LLM的组合推理能力。它通过约束满足问题与严格求解器验证相结合,并设计自适应生成器,可创建难度随模型能力提升而保持挑战性的实例。基准包含MathConstraint-Easy和MathConstraint两个难度级别,前沿模型在这两个集合上的准确率分别介于72.6%-87.6%和18.5%-66.9%,显示出基准对模型进步的抵抗力。研究还评估了在沙盒环境中集成SAT/SMT求解器时的工具使用行为,工具访问使平均准确率提升28个百分点,但工具调用预算减半会导致最多37个百分点的性能下降。

arXiv:2605.08498v1 Announce Type: new Abstract: We introduce MathConstraint, a hard, adaptive benchmark for evaluating the combinatorial reasoning capabilities of LLMs. We combine constraint satisfaction problems with rigorous solver-based verification and design an adaptive generator to create instances that remain challenging as the LLMs improve in their reasoning capabilities. Unlike existing benchmarks that quickly saturate on fixed datasets or use LLM-as-a-judge for checking solutions,MathConstraint uses parameterized problem types that enable scalable generation of arbitrarily difficult and automatically verifiable instances. We release MathConstraint-Easy ($266$ instances), on which frontier models achieve between $72.6\%$ (gemini-3.1-flash-lite) and $87.6\%$ (gpt-5.5) accuracy, and MathConstraint ($329$ instances) on which the same models drop to between $18.5\%$ (claude-4.6-sonnet) and $66.9\%$ (gpt-5.5) accuracy, demonstrating the resilience of our benchmark generator against rapid progress in LLM reasoning capabilities. We evaluate 12 frontier and open-weight models with and without access to a sandboxed Python environment that includes generic SAT/SMT solvers. Tool access roughly doubles frontier accuracy on MathConstraint (mean $+28$pp; up to $+52$pp for claude-4.6-sonnet). Further, halving the tool-call budget from $8$ to $4$ rounds erases up to $37$ points -- a sensitivity that most single-budget benchmarks miss. We release the generator, dataset, and evaluation harness as a robust environment for studying combinatorial reasoning and tool-use behavior under adversarially-tunable difficulty.
arXiv arXiv cs.LG · 1 天前 · 相关度 75% 热度★★☆☆☆
242
CUDABeaver: Benchmarking LLM-Based Automated CUDA Debugging
CUDABeaver: 基于大语言模型的自动化CUDA调试基准测试
芯片软件栈开发工具学术论文

本文提出CUDABeavor基准,用于评估大语言模型修复CUDA代码的能力,防止修复退化为牺牲性能的安全替换。该基准包含213个真实CUDA生成失败的工作空间,每个任务提供错误代码、构建/测试命令和原始错误信息。作者引入协议条件度量pass@k(M,C,A),通过控制性能损失容忍度,发现即使是轻微的性能要求也会使成功分数降低多达40个百分点,揭示了当前LLM在CUDA调试中的真实水平。

arXiv:2605.08455v1 Announce Type: new Abstract: Debugging CUDA programs has long been challenging because failures often arise from subtle interactions among hardware behavior, compiler decisions, memory hierarchy, and asynchronous execution. More importantly, with the rapid expansion of GPU usage across scientific computing, machine learning, graphics, and systems workloads, CUDA debugging has become more challenging than ever. Current evaluations of LLM-based CUDA programming largely miss this setting: a model can pass correctness tests with repair by degeneration, simplifying the CUDA code into a safer but slower program that abandons the original optimization structure. We introduce CUDABEAVER, a benchmark for CUDA debugging from real failing workspaces produced during LLM-based CUDA generation. Each task provides the broken candidate, native build/test commands, raw error evidence, and a single editable file. CUDABEAVER evaluates whether a fixer truly repairs the failing CUDA code or merely finds a slower test-passing replacement, reporting results by failure category, debugging trajectory, stagnation mode, and performance preservation. We further propose pass@k(M,C,A), a protocol-conditional CUDA debugging metric by making the fixer M, corpus C, and protocol axes Aexplicit. Using this metric across 213 tasks and seven frontier LLMs, we show that protocol-aware evaluation gives a more faithful view of CUDA debugging ability: when performance-loss tolerance is high, fixers appear much stronger, but even a minor stricter performance requirement can sharply reduce measured success, shifting scores by up to 40 percentage points.
arXiv arXiv cs.LG · 1 天前 · 相关度 75% 热度★★☆☆☆
243
Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria
自动评分准则作为奖励:从隐式偏好到显式多模态生成准则
训练微调学术论文基础大模型

本文提出Auto-Rubric as Reward (ARR)框架,将奖励建模从隐式权重优化重构为基于显式准则的分解,在成对比较前将视觉语言模型的内部偏好外化为特定提示的评分标准,抑制评估偏差。进一步提出Rubric Policy Optimization (RPO),将ARR的结构化多维评估蒸馏为鲁棒的二元奖励,稳定策略梯度。在文本生成图像与图像编辑任务上,ARR-RPO优于成对奖励模型和VLM评委,证明显式外化隐式偏好知识能实现更可靠、数据高效的多模态对齐。

arXiv:2605.08354v1 Announce Type: new Abstract: Aligning multimodal generative models with human preferences demands reward signals that respect the compositional, multi-dimensional structure of human judgment. Prevailing RLHF approaches reduce this structure to scalar or pairwise labels, collapsing nuanced preferences into opaque parametric proxies and exposing vulnerabilities to reward hacking. While recent Rubrics-as-Reward (RaR) methods attempt to recover this structure through explicit criteria, generating rubrics that are simultaneously reliable, scalable, and data-efficient remains an open problem. We introduce Auto-Rubric as Reward (ARR), a framework that reframes reward modeling from implicit weight optimization to explicit, criteria-based decomposition. Before any pairwise comparison, ARR externalizes a VLM&#39;s internalized preference knowledge as prompt-specific rubrics, translating holistic intent into independently verifiable quality dimensions. This conversion of implicit preference structure into inspectable, interpretable constraints substantially suppresses evaluation biases including positional bias, enabling both zero-shot deployment and few-shot conditioning on minimal supervision. To extend these gains into generative training, we propose Rubric Policy Optimization (RPO), which distills ARR&#39;s structured multi-dimensional evaluation into a robust binary reward, replacing opaque scalar regression with rubric-conditioned preference decisions that stabilize policy gradients. On text-to-image generation and image editing benchmarks, ARR-RPO outperforms pairwise reward models and VLM judges, demonstrating that explicitly externalizing implicit preference knowledge into structured rubrics achieves more reliable, data-efficient multimodal alignment, revealing that the bottleneck is the absence of a factorized interface, not a deficit of knowledge.
arXiv arXiv cs.AI · 1 天前 · 相关度 75% 热度★★☆☆☆
244
Geometry-Aware Discretization Error of Diffusion Models
几何敏感的扩散模型离散化误差分析
学术论文推理部署

本文从数值逼近角度分析扩散模型采样的离散化误差,推导了欧拉-丸山方法的弱误差和Fréchet误差的一阶渐近展开,显式表达了在高斯数据下误差如何依赖协方差谱、扩散调度及扩散项系数。该分析揭示了数据几何与扩散参数之间的相互作用,能够为扩散模型推理调度提供可优化的目标,从而在有限推理步数下降低误差。实验验证了该理论在不同几何结构(如图像生成和图像后验采样)上的鲁棒性,为扩散模型的推理加速与参数调优提供了理论指导。

arXiv:2605.08392v1 Announce Type: new Abstract: Practical diffusion sampling is a numerical approximation problem: under a fixed inference budget, one must simulate a reverse-time ODE or SDE using only a limited number of denoising steps, so discretization error is often the dominant source of error. Existing non-asymptotic analyses provide convergence guarantees, but are typically too loose and too insensitive to diffusion parameters to guide practical design: broad families of schedules receive the same rates, which depend on coarse worst-case quantities such as the dimension or the drift Lipschitz constant. We take a less ambitious but more informative route. In the exact-score setting, we derive first-order asymptotic expansions of the Euler-Maruyama weak and Fr\&#39;echet discretization errors. These formulas hold for general smooth reverse diffusions and become fully explicit under Gaussian data. They show how discretization error adapts to the geometry of the data through the covariance spectrum, and how this geometry interacts with key diffusion parameters, including the diffusion schedules and the diffusion-term coefficient. This yields tractable objectives for geometry-aware parameter optimization. Finally, we show that the qualitative predictions of the Gaussian formulas remain robust across diffusion sampling problems with different geometries, including image generation on different datasets and image posterior sampling.
arXiv arXiv cs.LG · 1 天前 · 相关度 75% 热度★★☆☆☆
245
NEXUS: Continual Learning of Symbolic Constraints for Safe and Robust Embodied Planning
NEXUS:用于安全鲁棒具身规划的符号约束持续学习
学术论文基础大模型

本文针对大语言模型在具身智能中存在的不确定性与物理世界确定性安全的鸿沟,提出模块化框架NEXUS。该框架将物理可行性与安全规范解耦,通过闭环执行反馈提升智能体能力,并将概率风险评估转化为确定性硬约束,以实施动作前安全防御。在SafeAgentBench上的实验表明,NEXUS在成功完成任务的同时能有效拒绝不安全指令,抵御对抗攻击,并通过知识积累持续提升规划效率。

arXiv:2605.09387v1 Announce Type: new Abstract: While Large Language Models (LLMs) have catalyzed progress in embodied intelligence, a fundamental gap between their inherent probabilistic uncertainty and the strict determinism and verifiable safety required in the physical world. To mitigate this gap, this paper introduces NEXUS, a modular framework designed for continual learning in embodied agents. Different from prior works that treat symbolic artifacts merely as static interfaces, NEXUS leverages them for symbolic grounding and knowledge evolution. The framework explicitly decouples physical feasibility from safety specifications: capability of agents is improved through closed-loop execution feedback, while probabilistic risk assessments are grounded into deterministic hard constraints to establish a rigorous pre-action defense. Experiments on SafeAgentBench demonstrate that NEXUS achieves superior task success rates while effectively refusing unsafe instructions, exhibiting robust defense against adversarial attacks, and progressively improving planning efficiency through knowledge accumulation.
arXiv arXiv cs.AI · 1 天前 · 相关度 75% 热度★★☆☆☆
246
Empowering VLMs for Few-Shot Multimodal Time Series Classification via Tailored Agentic Reasoning
通过定制化代理推理赋能视觉语言模型实现小样本多模态时间序列分类
基础大模型学术论文

本文提出MarsTSC框架,首次将视觉语言模型(VLM)与代理推理结合用于小样本多模态时间序列分类。框架包含生成器、反思器和修改器三个协作角色,利用自进化知识库通过反思式推理动态优化上下文,防止上下文崩溃。在12个主流时间序列基准上,使用6种VLM backbone的广泛实验表明,MarsTSC在小样本条件下能显著且持续地提升性能,超越传统和基于基础模型的方法,同时提供可解释的决策依据。

arXiv:2605.09395v1 Announce Type: new Abstract: In this paper, we propose the first VL$\underline{\textbf{M}}$ $\underline{\textbf{a}}$gentic $\underline{\textbf{r}}$easoning framework for few-$\underline{\textbf{s}}$hot multimodal $\underline{\textbf{T}}$ime $\underline{\textbf{S}}$eries $\underline{\textbf{C}}$lassification ($\textbf{MarsTSC}$), which introduces a self-evolving knowledge bank as a dynamic context iteratively refined via reflective agentic reasoning. The framework comprises three collaborative roles: i) Generator conducts reliable classification via reasoning; ii) Reflector diagnoses the root causes of reasoning errors to yield discriminative insights targeting the temporal features overlooked by Generator; iii) Modifier applies verified updates to the knowledge bank to prevent context collapse. We further introduce a test-time update strategy to enable cautious, continuous knowledge bank refinement to mitigate few-shot bias and distribution shift. Extensive experiments across 12 mainstream time series benchmarks demonstrate that $\textbf{MarsTSC}$ delivers substantial and consistent performance gains across 6 VLM backbones, outperforming both classical and foundation model-based time series baselines under few-shot conditions, while producing interpretable rationales that ground each classification decision in human-readable feature evidence.
arXiv arXiv cs.AI · 1 天前 · 相关度 75% 热度★★☆☆☆
247
Concordia: Self-Improving Synthetic Tables for Federated LLMs
Concordia:面向联邦大语言模型的自改进合成表格
训练微调学术论文

提出Concordia框架,解决联邦学习中LLM在表格任务上因数据隔离和非独立同分布导致的适应难题。采用三层次优化:客户端使用LoRA在合成表上微调LLM,并训练轻量效用评分器以重加权合成样本;外部层面则通过群组相对策略优化(GRPO)改进合成表生成器,且不共享生成器参数。在金融和医疗隐私敏感基准上的实验表明,该方法在性能、跨客户端稳定性和分布偏移鲁棒性上均优于静态或分离的合成数据基线。

arXiv:2605.09855v1 Announce Type: new Abstract: Federated learning (FL) enables training large language models (LLMs) without sharing raw data, but adapting LLMs under strict data isolation and non-IID client distributions remains challenging in practice. Synthetic data offers a natural privacy-preserving surrogate for local training, yet existing federated pipelines typically treat synthetic generation as static or loosely coupled with downstream optimization, leading to rapidly diminishing utility under heterogeneous clients. We study federated adaptation of LLMs on tabular tasks where raw records and validation data cannot be shared, and local training must rely entirely on synthetic tables. We propose Concordia, a tri-level optimization framework that aligns synthetic data generation with federated validation utility despite these constraints. At the client level, models are adapted via parameter-efficient LoRA training on synthetic tables. Clients additionally learn lightweight utility scorers from private validation feedback to reweight synthetic samples during local training. At the outer level, each client refines its own synthetic table generator using group-relative policy optimization (GRPO), guided by an ensemble of heterogeneous scorers shared across clients, without aggregating generator parameters or exposing validation data. Experiments on privacy-sensitive tabular benchmarks from finance and healthcare demonstrate that Concordia consistently improves federated performance, cross-client stability, and robustness to distribution shift compared to static and decoupled synthetic-data baselines.
arXiv arXiv cs.LG · 1 天前 · 相关度 75% 热度★★☆☆☆
248
Remember the Decision, Not the Description: A Rate-Distortion Framework for Agent Memory
记住决策而非描述:一种面向智能体记忆的率失真框架
学术论文开发工具

本文针对长程语言智能体在有限运行时内存下的记忆组织问题,提出以决策为中心的率失真视角,将记忆质量定义为由压缩导致的可实现决策质量损失,从而导出安全遗忘边界和记忆-失真权衡前沿。基于此,设计了一种在线记忆学习器DeMem,仅在数据证明共享状态会导致决策冲突时才细化分区,并证明了近乎极小极大遗憾保证。在合成诊断和长程对话基准测试中,DeMem在相同运行内存预算下取得了持续性能提升,验证了记忆应保留对决策重要的区分而非描述的准确性。

arXiv:2605.10870v1 Announce Type: new Abstract: Long-horizon language agents must operate under limited runtime memory, yet existing memory mechanisms often organize experience around descriptive criteria such as relevance, salience, or summary quality. For an agent, however, memory is valuable not because it faithfully describes the past, but because it preserves the distinctions between histories that must remain separated under a fixed budget to support good decisions. We cast this as a decision-centric rate-distortion problem, measuring memory quality by the loss in achievable decision quality induced by compression. This yields an exact forgetting boundary for what can be safely forgotten, and a memory-distortion frontier characterizing the optimal tradeoff between memory budget and decision quality. Motivated by this decision-centric view of memory, we propose DeMem, an online memory learner that refines its partition only when data certify that a shared state would induce decision conflict, and prove near-minimax regret guarantees. On both controlled synthetic diagnostics and long-horizon conversational benchmarks, DeMem yields consistent gains under the same runtime budget, supporting the principle that memory should preserve the distinctions that matter for decisions, not descriptions.
arXiv arXiv cs.AI · 1 天前 · 相关度 75% 热度★★☆☆☆
249
CLEF: EEG Foundation Model for Learning Clinical Semantics
CLEF:面向临床语义学习的脑电图基础模型
基础大模型训练微调

CLEF是一个长上下文脑电图基础模型,将脑电会话表示为3D多锥谱图令牌,实现会话级别的Transformer建模,并通过对比学习与神经科报告和结构化电子健康记录对齐。在包含234项任务、26万次以上脑电会话的基准测试中,CLEF在229项任务上超越先前模型,平均AUROC从0.65提升至0.74。仅重建预训练已优于先前模型,结合报告和EHR对齐带来额外增益,且表征可迁移至未观测的对齐目标,验证了会话级临床对齐的表示学习范式。

arXiv:2605.10817v1 Announce Type: new Abstract: Clinical EEG interpretation requires reasoning over full EEG sessions and integrating signal patterns with clinical context. Existing EEG foundation models are largely designed for short-window decoding and do not incorporate clinical context. We introduce CLEF, a clinically grounded long-context EEG foundation model. CLEF represents EEG sessions as 3D multitaper spectrogram tokens, enabling tractable Transformer modeling at session scale, and aligns embeddings with neurologist reports and structured EHR data through contrastive objectives. We evaluate CLEF on a new 234-task benchmark spanning disease phenotypes, medication exposures, and EEG findings, with more than 260k EEG sessions from over 108k patients. CLEF outperforms prior EEG foundation models on 229 of 234 tasks, improving mean AUROC from 0.65 to 0.74. Reconstruction-only pretraining surpasses prior EEG foundation models, while report and EHR alignment yields further gains. Held-out concept and external-cohort experiments suggest that these representations transfer beyond observed alignment targets. These results support session-scale, clinically grounded representation learning as a promising foundation-model paradigm for clinical EEG.
arXiv arXiv cs.AI · 1 天前 · 相关度 75% 热度★★☆☆☆
250
Probing Cross-modal Information Hubs in Audio-Visual LLMs
探究音视频大语言模型中的跨模态信息枢纽
基础大模型学术论文

本研究针对音视频大语言模型(AVLLMs)的内部机制,分析了音频与视觉模态间的跨模态信息流。发现AVLLMs主要在汇点令牌(sink tokens)中编码集成的音视频信息,并进一步识别出一类专门的跨模态汇点令牌负责存储此类跨模态信息。基于此发现,论文提出了一种无需额外训练的幻觉缓解方法,通过鼓励模型依赖跨模态汇点令牌中的集成信息来减少幻觉。

arXiv:2605.10815v1 Announce Type: new Abstract: Audio-visual large language models (AVLLMs) have recently emerged as a powerful architecture capable of jointly reasoning over audio, visual, and textual modalities. In AVLLMs, the bidirectional interaction between audio and video modalities introduces intricate processing dynamics, necessitating a deeper understanding of their internal mechanisms. However, unlike extensively studied text-only or large vision language models, the internal workings of AVLLMs remain largely unexplored. In this paper, we focus on cross-modal information flow between audio and visual modalities in AVLLMs, investigating where information derived from one modality is encoded within the token representations of the other modality. Through an analysis of multiple recent AVLLMs, we uncover two common findings. First, AVLLMs primarily encode integrated audio-visual information in sink tokens. Second, sink tokens do not uniformly hold cross-modal information. Instead, a distinct subset of sink tokens, which we term cross-modal sink tokens, specializes in storing such information. Based on these findings, we further propose a simple training-free hallucination mitigation method by encouraging reliance on integrated cross-modal information within cross-modal sink tokens. Our code is available at https://github.com/kaistmm/crossmodal-hub.
arXiv arXiv cs.AI · 1 天前 · 相关度 75% 热度★★☆☆☆
251
Model Capacity Determines Grokking through Competing Memorisation and Generalisation Speeds
模型容量通过竞争性记忆与泛化速度决定“顿悟”现象
训练微调学术论文

论文从信息论角度研究了模型容量如何影响“顿悟”(grokking)现象。在模运算任务上,发现顿悟并非发生在模型容量刚好能记忆训练集时,而是来自于两个可测量时间尺度的竞争:记忆速度 T_mem(P) 和泛化速度 T_gen(P),两者均为模型参数量 P 的函数。当参数规模使这两种速度交汇时,顿悟现象出现。该框架还发现更大模型记忆速度更快,并给出了基于模型容量和任务复杂度预测记忆速度的经验模型。

arXiv:2605.09724v1 Announce Type: new Abstract: Existing accounts of grokking explain the phenomena in terms of mechanistic frameworks such as circuit efficiency or lazy-to-rich transitions. However, despite a known dependence between grokking and model size, how model capacity shapes grokking remains an open question. We give an information-theoretic account of this relationship on the task of modular arithmetic, showing that grokking does not immediately occur when a model becomes large enough to memorise the training set, but rather emerges as the outcome of a competition between two measurable timescales: a memorisation speed $T_{\text{mem}}(P)$ and a generalisation speed $T_{\text{gen}}(P)$, both of which are functions of model parameter count $P$. Adapting the information capacity framework of Morris et al. (2025), we estimate $T_{\text{mem}}(P)$ on random-label data of equivalent complexity and $T_{\text{gen}}(P)$ on the modular task itself, and show that grokking emerges close to the parameter scale where these timescales intersect. The framework also suggests an empirical model for predicting memorisation speed given model capacity and dataset complexity, recovering the previously reported empirical observation that larger models memorise faster. Overall, we motivate the formalisation of different learning timescales as important abstractions to study when explaining how model capacity shapes grokking on algorithmic tasks.
arXiv arXiv cs.LG · 1 天前 · 相关度 75% 热度★★☆☆☆
252
LLARS: Enabling Domain Expert & Developer Collaboration for LLM Prompting, Generation and Evaluation
LLARS:支持领域专家与开发者协作的大语言模型提示、生成与评估平台
开发工具

LLARS是一个开源平台,旨在连接领域专家与开发者,为构建基于大语言模型(LLM)的系统提供端到端流水线。该平台集成了三大模块:协同提示工程支持实时共同编写和版本控制;批量生成可按所选提示、模型和数据配置输出,并具有成本控制;混合评估结合人与LLM共同评估输出,利用一致性指标和溯源分析找出最佳模型-提示组合。实验反馈表明该平台直观省时,能无缝促进跨学科协作。

arXiv:2605.10593v1 Announce Type: new Abstract: We demonstrate LLARS (LLM Assisted Research System), an open-source platform that bridges the gap between domain experts and developers for building LLM-based systems. It integrates three tightly connected modules into an end-to-end pipeline: Collaborative Prompt Engineering for real-time co-authoring with version control and instant LLM testing, Batch Generation for configurable output production across user-selected prompts $\times$ models $\times$ data with cost control, and Hybrid Evaluation where human and LLM evaluators jointly assess outputs through diverse assessment methods, with live agreement metrics and provenance analysis to identify the best model-prompt combination for a given use case. New prompts and models are automatically available for batch generation and completed batches can be turned into evaluation scenarios with a single click. Interviews with six domain experts and three developers in online counselling confirmed that LLARS feels intuitive, saves considerable time by keeping everything in one place and makes interdisciplinary collaboration seamless.
arXiv arXiv cs.AI · 1 天前 · 相关度 75% 热度★★☆☆☆
253
LLM Jaggedness Unlocks Scientific Creativity
LLM能力锯齿性解锁科学创造力
基础大模型推理部署学术论文

本文针对大模型科学创意能力进行研究,提出SciAidanBench基准,用于衡量模型在开放科学问题上生成有效创意的数量。在评估19个基础模型后发现,模型能力的提升呈现不均匀的“锯齿状”,在通用创意与科学创意之间、不同提示词与科学子领域间均存在显著差异。作者进一步展示了如何利用这种锯齿性,通过推理时计算、知识池化和头脑风暴等方法组合多个模型,构建元集成体,最终实现超越单一模型的科学创意表现,将模型能力的碎片化转化为一种可杠杆化的资源。

arXiv:2605.10574v1 Announce Type: new Abstract: As artificial intelligence advances, models are not improving uniformly. Instead, progress unfolds in a jagged fashion, with capabilities growing unevenly across tasks, domains, and model scales. In this work, we examine this dynamic jaggedness through the lens of scientific idea generation. We introduce SciAidanBench, a benchmark of open-ended scientific questions designed to measure the scientific creativity of large language models (LLMs). Given a scientific question, models are asked to generate as many unique and coherent ideas as possible, with the total number of valid responses serving as a proxy for creative potential. Evaluating 19 base models across 8 providers (30 total variants including reasoning versions), we find that jaggedness manifests both across models and within models. First, in a cross-task comparison between general and scientific creativity, improvements in general creativity do not translate uniformly to scientific creativity, revealing divergent capability profiles across models. Second, at the prompt level, stronger models do not improve uniformly; instead, they exhibit high variability, with bursts of creativity on some questions and limited performance on others. Third, at the domain level, individual models display uneven strengths across scientific subfields, reflecting fragmented internal capability profiles. Finally, we show that this jaggedness can be harnessed. We explore mechanisms of inference-time compute, knowledge pooling, and brainstorming to combine models effectively and construct meta-model ensembles that outperform any single model. Our results position jaggedness not as a limitation, but as a resource, a structural feature of AI progress that, when understood and leveraged, can amplify LLM-driven scientific creativity.
arXiv arXiv cs.AI · 1 天前 · 相关度 75% 热度★★☆☆☆
254
Queryable LoRA: Instruction-Regularized Routing Over Shared Low-Rank Update Atoms
可查询低秩适配:基于共享低秩更新原子的指令正则化路由方法
训练微调学术论文

本文提出一种数据自适应的参数高效微调方法,将固定的层内低秩更新替换为跨层共享的可查询低秩更新原子库。模型根据当前低秩状态和跨块运行摘要生成查询,通过注意力机制检索出内容相关的原子组合,并在低秩瓶颈内应用动态路由算子,从而兼顾效率与上下文自适应能力。此外引入指令正则化,利用语言先验偏置路由logit,引导更新朝向语义相关方向。在噪声非线性回归和LLM微调实验中,该方法在可训练参数量相当的情况下提升了测试性能和训练稳定性。

arXiv:2605.08423v1 Announce Type: new Abstract: We present a data-adaptive method for parameter-efficient fine-tuning of large neural networks. Standard low-rank adaptation methods improve efficiency by restricting each layer update to a fixed low-rank form, but this static parameterization can be too rigid when the appropriate correction depends on the input and on the evolving depth-wise computation of the network. Our approach replaces a purely layer-local adapter with a shared queryable memory of low-rank update atoms. For each block of layers, the model forms a query from the current low-rank state and a running summary of previous blocks, uses this query to retrieve a content-dependent combination of shared update components via attention, and applies the resulting routed operator within the low-rank bottleneck. In this way, the method retains the efficiency and scalability of low-rank adaptation while allowing the effective update to vary across inputs and to share reusable structure across layers. The resulting architecture provides a principled middle ground between static LoRA-style updates and fully generated parameter updates: it remains compact and parameter-efficient while supporting dynamic, context-sensitive adaptation. Further, we incorporate instruction-regularization by augmenting routing logits with a language-induced prior over update atoms, thereby biasing the selection of low-rank transformations toward semantically relevant directions without generating unconstrained parameter updates. Experiments on noisy non-linear regression tasks and LLM fine-tuning suggest that this queryable update-memory formulation can improve final test performance and training stability compared to standard low-rank adaptation, while using a comparable number of trainable parameters.
arXiv arXiv cs.LG · 1 天前 · 相关度 74% 热度★★☆☆☆
255
学术论文

该论文研究了交互式大语言模型辅助医生进行急诊诊断的效果。通过MedSyn系统,医生可基于主诉逐步查询拥有完整病历的LLM。在MIMIC-IV病例集上,实习医生处理困难病例的正确率从0.589显著提升至0.734,标准化正确率中等效果(d=0.47)。对话分析揭示了不同资历医生的策略差异,且跨资历诊断一致性增加。研究表明交互式LLM支持能有效增强诊断推理能力。

arXiv:2605.08533v1 Announce Type: new Abstract: Clinical decision-making in emergency medicine demands rapid, accurate diagnoses under uncertainty. Despite benchmark progress, evidence for LLMs as interactive aids in live physician workflows remains sparse. MedSyn lets physicians iteratively query an LLM provided with the full clinical record while initially viewing only the chief complaint. Seven physicians (three seniors, four residents) completed baseline and AI-assisted sessions across 52 MIMIC-IV cases stratified by difficulty. Blinded evaluation showed residents&#39; Hard-case correctness rose from 0.589 to 0.734; difficulty-standardised completely-correct rates confirmed a medium effect ({\Delta} = 0.092; p = 0.071; d = 0.47). Automated metrics corroborated these gains: standardised any-match accuracy improved by 0.156 (p &lt; 0.0001), and residents showed the largest F1 gain ({\Delta} = 0.138; p &lt; 0.0001). Dialogue analysis revealed expertise-dependent strategies (seniors asked targeted, hypothesis-driven questions; residents relied on broader queries) and cross-expertise concordance increased ({\Delta} = 0.145; p &lt; 0.0001). Interactive LLM support meaningfully enhances diagnostic reasoning.
arXiv arXiv cs.AI · 1 天前 · 相关度 72% 热度★★☆☆☆
256
Log analysis is necessary for credible evaluation of AI agents
日志分析对于AI智能体可信评估的必要性
开发工具学术论文

本文指出当前AI智能体基准测试仅报告最终通过/失败结果,会因捷径、基准缺陷和隐藏危险行为而损害评估可信度。作者提出通过系统记录并分析智能体输入、执行轨迹和输出的日志分析方法,来克服有效性威胁,并给出了威胁分类和指导原则。在tau-Bench Airline上的实验表明,日志分析能揭示接近50%的未充分激发的性能以及仅看结果指标无法发现的部署失败模式,最后面向基准创建者、模型开发者等不同角色给出了推动日志分析采纳的实用建议。

arXiv:2605.08545v1 Announce Type: new Abstract: Agent benchmarks typically report only final outcomes: pass or fail. This threatens evaluation credibility in three ways. First, scores may be inflated or deflated by shortcuts and benchmark artifacts, misrepresenting capability. Second, benchmark performance may fail to predict real-world utility due to scaffold limitations and recurring failure modes. Finally, capability scores may conceal dangerous or catastrophic actions taken by the agent. We argue that log analysis -- the systematic tracking and analysis of the inputs, execution, and outputs of an AI agent -- is necessary to overcome these validity threats and promote credible agent evaluation. In this paper, we (1) present a taxonomy of threats to credible evaluation documented through log analysis, and (2) develop a set of guiding principles for log analysis. We illustrate these principles on tau-Bench Airline, revealing that pass^5 performance was under-elicited by nearly 50% and surfacing deployment failure modes invisible to outcome metrics. We conclude with pragmatic recommendations to increase uptake of log analysis, directed at diverse stakeholders including benchmark creators, model developers, independent evaluators, and deployers.
arXiv arXiv cs.AI · 1 天前 · 相关度 72% 热度★★☆☆☆
257
Skill-CMIB: Multimodal Agent Skill for Consistent Action via Conditional Multimodal Information Bottleneck
Skill-CMIB:通过条件多模态信息瓶颈实现多模态智能体一致动作
开发工具基础大模型

针对大模型多模态智能体在长期任务中执行不一致的问题,提出条件多模态信息瓶颈(CMIB)方法。该方法将技能构建分解为两个阶段:首先通过文本瓶颈蒸馏出可解释的文本技能卡片,然后以文本技能为条件,对剩余预测性感知信息进行多模态压缩,从而去除跨模态冗余,获得可复用的多模态技能,无需推理时多样本解码即可提升执行稳定性。

arXiv:2605.08526v1 Announce Type: new Abstract: While LLM-based agents excel at planning and executing long action sequences, their execution often remains inconsistent across trials, limiting reliability. Consolidating agent consistency requires distilling trial-error trajectories into reusable skills that preserve task-relevant invariants while discarding trajectory-specific noise. However, in multimodal settings, the key challenge is not only that useful invariants are distributed across vision and language information, but that different modalities support different kinds of reusable skill content: while some skills are verbalizable and interpretable, others reside in perceptual evidence beyond text. Text-only skills may lose perceptual cues, whereas storing text and perception naively introduces redundancy and noise. Existing inference-time methods, such as self-consistency, improve reliability through costly multi-sample decoding, while internalization strategies lack a way to separate verbalizable skill content from residual perceptual information. To address this, we introduce Conditional Multimodal Information Bottleneck (CMIB), a method for multimodal skill construction. CMIB begins with a joint bottleneck over multimodal skills and derives an exact sequential decomposition: (1) a text-stage bottleneck distilling interpretable skill cards, and (2) a conditional multimodal bottleneck compressing only residual information in perception that remains predictive beyond text. Unlike naive two-stream formulations, CMIB explicitly conditions the multimodal latent on the text skill, thus structurally reducing cross-modal redundancy and enabling independent control over textual and perceptual compression. We instantiate CMIB with a variational objective that makes its conditional decomposition tractable to optimize, yielding reusable multimodal skills that improve execution stability without incurring multi-sample inference overhead.
arXiv arXiv cs.LG · 1 天前 · 相关度 72% 热度★★☆☆☆
258
Political Plasticity: An Analysis of Ideological Adaptability in Large Language Models
政治可塑性:大语言模型中意识形态适应性的分析
基础大模型学术论文

本文研究了大语言模型(LLM)的“政治可塑性”,即模型根据用户提供的上下文调整政治立场的适应能力。作者基于Lester(1996)的框架,构建了200个覆盖经济自由和个人自由的测试问题,并探索了系统提示、主题提示和少样本用户提示等方法诱导政治偏见的效果。实验发现,小型或较旧的LLM展现出有限或不稳定的政治可塑性,而较新前沿模型可沿经济自由轴发生显著的意识形态偏移;同时,在反向问题表述和跨语言实验中还观察到意外偏差,暗示可能存在数据泄露。

arXiv:2605.08415v1 Announce Type: new Abstract: Since the advent of Large Language Models (LLMs), a significant area of research has focused on their intrinsic biases, particularly in political discourse. This study investigates a different but related concept, &#34;political plasticity&#34;, which is defined as the capacity of models to adapt their responses based on the user supplied context. To analyze this, a testing framework was developed using an expanded corpus of 200 politically-oriented questions across economic and personal freedom axes, based on a prior framework by Lester (1996). The study explored several methods to induce political bias, including simplified and topic-based system prompts, as well as user prompts with few-shot examples. The results show that while system prompts were largely ineffective, user prompts successfully elicited significant ideological shifts, particularly along the Economic Freedom axis in larger and newer models. Through a validation experiment, we examined whether models answer questionnaires by recognizing the underlying question format. Inverting the sense of the questions revealed unexpected, counter-intuitive shifts in most models, suggesting potential data leakage. Finally, we also analyzed how model plasticity varies when the experiment is conducted in different languages. The results reveal subtle yet notable shifts across each of the analyzed languages. Overall, our results indicate that small and older LLMs exhibit limited or unstable political plasticity, whereas newer frontier models display reliable, expected adaptability.
arXiv arXiv cs.AI · 1 天前 · 相关度 72% 热度★★☆☆☆
259
Playing games with knowledge: AI-Induced delusions need game theoretic interventions
与知识博弈:AI引发的认知错觉需要博弈论干预
基础大模型推理部署学术论文

论文指出对话式AI会导致用户陷入认知固化与错觉螺旋,将其建模为廉价磋商博弈,并揭示用户满意度优化会引发逢迎策略,造成即使是理性用户也会走向错误信念。作者提出在推理时引入“认知中介”干预机制,通过施加认知摩擦信号迫使不同认知类型的用户分离,打破均衡;同时设计了受Git启发的“信念版本化”系统,在检测到验证性需求时回滚健康信念。模拟结果显示该机制可使信念螺旋发生率产生48倍的差异,证明AI认知安全的关键在于战略信息环境设计,而非单纯的模型对齐。

arXiv:2605.08409v1 Announce Type: new Abstract: Conversational AI has a fundamental flaw as a knowledge interface: sycophantic chatbots induce epistemic entrenchment and delusional belief spirals even in rational agents. We propose the problem does not stem from the AI model, rooted instead in a systemic consequence of the paradigm shift from user-driven knowledge search to users and agents engaged in strategic, repeated-play communication. We formalize the problem as a Crawford-Sobel cheap talk game, where costless user signals induce a pooling equilibrium. Agents optimized for user satisfaction produce sycophantic strategies that provide identical reinforcement across user types with opposite epistemic incentives: exploratory ``Growth-seekers&#39;&#39; ($\theta_G$) and confirmatory ``Validation-seekers&#39;&#39; ($\theta_V$). Under repeated play, this identification failure creates a coordination trap -- analogous to a Prisoner&#39;s Dilemma -- where locally rational feedback loops drive users toward pathologically certain false beliefs. We propose an inference-time mechanism design intervention called an Epistemic Mediator that breaks this pooling equilibrium by introducing a costly signal (epistemic friction), forcing type revelation based on users&#39; asymmetric cognitive costs for processing resistance. A key contribution is Belief Versioning, a git-inspired epistemic meta-memory system that stores healthy beliefs and rollbacks when validation-seeking resistance is detected. In simulation, this intervention achieves a separating equilibrium achieving a $48\times$ differential in spiral rates while passing a learning preservation criterion), evidence that epistemic safety in AI is fundamentally a problem of strategic information environment design rather than simple model alignment.
arXiv arXiv cs.AI · 1 天前 · 相关度 72% 热度★★☆☆☆
260
Where Reliability Lives in Vision-Language Models: A Mechanistic Study of Attention, Hidden States, and Causal Circuits
视觉语言模型的可靠性藏在哪里:注意力、隐藏状态与因果回路的机制研究
基础大模型学术论文

本研究针对一个普遍直觉——注意力图越集中,模型回答越可靠——进行了系统检验。通过对LLaVA-1.5、PaliGemma、Qwen2-VL三个家族(3-7B参数)的机制分析,发现注意力结构与正确性几乎无关(点双列相关系数接近零),但注意力在特征提取中依然因果必要。可靠性的可读性出现在更晚的阶段:单个隐藏状态线性探针在POPE等基准上达到AUROC>0.95,而自一致性(K=10)是开销10倍推理下最强的行为预测因子。因果消融揭示了架构分化:早融合模型将可靠性分布化、抗破坏性强,晚融合LLaVA则将可靠性集中在脆弱的晚期瓶颈,对顶层探针神经元消融导致8.3个百分点的物体识别精度下降。

arXiv:2605.08200v1 Announce Type: new Abstract: A pervasive intuition holds that vision-language models (VLMs) are most trustworthy when their attention maps look sharp: concentrated attention on the queried region should imply a confident, calibrated answer. We test this Attention-Confidence Assumption directly. We instrument three open-weight VLM families (LLaVA-1.5, PaliGemma, Qwen2-VL; 3-7B parameters) with a unified mechanistic pipeline -- the VLM Reliability Probe (VRP) -- that compares attention structure, generation dynamics, and hidden-state geometry against a single correctness label. Three results emerge. (i) Attention structure is a near-zero predictor of correctness (R_pb(C_k,y)=0.001, 95% CI [-0.034,0.036]; R_pb(H_s,y)=-0.012, [-0.047,0.024] on a pooled n=3,090 split), even though attention remains causally necessary for feature extraction (top-30% patch masking drops accuracy by 8.2-11.3 pp, p&lt;0.001). (ii) Reliability becomes legible later in the computation: a single hidden-state linear probe reaches AUROC&gt;0.95 on POPE for two of three families, and self-consistency at K=10 is the strongest behavioral predictor we measure at 10x inference cost (R_pb=0.43). (iii) Causal neuron-level ablations expose a sharp architectural split with direct monitor-design implications: late-fusion LLaVA concentrates reliability in a fragile late bottleneck (-8.3 pp object-identification accuracy after top-5 probe-neuron ablation), whereas early-fusion PaliGemma and Qwen2-VL distribute it widely and absorb destruction of ~50% of their peak-layer hidden dimension with &lt;=1 pp degradation. The takeaway is narrow but consequential: in 3-7B VLMs, reliability is read more reliably off hidden-state geometry, layer-wise margin formation, and sparse late-layer circuits than off attention-map sharpness.
arXiv arXiv cs.AI · 1 天前 · 相关度 72% 热度★★☆☆☆
261
NanoResearch: Co-Evolving Skills, Memory, and Policy for Personalized Research Automation
NanoResearch:协同进化技能、记忆与策略以实现个性化研究自动化
开发工具学术论文

NanoResearch 是一个 LLM 驱动的多智能体框架,旨在解决研究自动化中的个性化缺失问题。它通过三层协同进化机制实现:技能库将重复操作提炼为可复用的程序规则;记忆模块保留用户和项目相关的历史经验以支撑规划;无标签策略学习将自由形式的反馈转化为规划器的持续参数更新。实验表明,该方法在提升研究质量和降低成本方面优于现有 AI 研究系统。

arXiv:2605.10813v1 Announce Type: new Abstract: LLM-powered multi-agent systems can now automate the full research pipeline from ideation to paper writing, but a fundamental question remains: automation for whom? Researchers operate under different resource configurations, hold different methodological preferences, and target different output formats. A system that produces uniform outputs regardless of these differences will systematically under-serve every individual user, making personalization a precondition for research automation to be genuinely usable. However, achieving it requires three capabilities that current systems lack: accumulating reusable procedural knowledge across projects, retaining user-specific experience across sessions, and internalizing implicit preferences that resist explicit formalization. We propose NanoResearch, a multi-agent framework that addresses these gaps through tri-level co-evolution. A skill bank distills recurring operations into compact procedural rules reusable across projects. A memory module maintains user- and project-specific experience that grounds planning decisions in each user&#39;s research history. A label-free policy learning converts free-form feedback into persistent parameter updates of the planner, reshaping subsequent coordination. These three layers co-evolve: reliable skills produce richer memory, richer memory informs better planning, and preference internalization continuously realigns the loop to each user. Extensive experiments demonstrate that NanoResearch delivers substantial gains over state-of-the-art AI research systems, and progressively refines itself to produce better research at lower cost over successive cycles.
arXiv arXiv cs.AI · 1 天前 · 相关度 72% 热度★★☆☆☆
262
Human-Inspired Memory Architecture for LLM Agents
用于LLM智能体的仿人类记忆架构
学术论文基础大模型

论文提出一种受生物启发的记忆架构,包含六种认知机制:睡眠阶段巩固、基于干扰的遗忘、记忆痕迹成熟、检索后再巩固、实体知识图谱和混合多线索检索,旨在解决LLM代理长期记忆积累的缺陷。作者还引入一种合成校准方法,无需暴露基准数据即可推导流程阈值。在VSCode问题跟踪数据集上,去重巩固实现97.2%的保留精度并减少58%存储;在LongMemEval个人聊天基准上,200K令牌预算下检索准确率匹配原始基线(70.1% vs 71.2%),同时提供可调的精度/存储权衡曲线。

arXiv:2605.08538v1 Announce Type: new Abstract: Current LLM agents lack principled mechanisms for managing persistent memory across long interaction horizons. We present a biologically-grounded memory architecture comprising six cognitive mechanisms: (1) sleep-phase consolidation, (2) interference-based forgetting, (3) engram maturation, (4) reconsolidation upon retrieval, (5) entity knowledge graphs, and (6) hybrid multi-cue retrieval. Each mechanism addresses a specific failure mode of naive memory accumulation. We introduce a synthetic calibration methodology that derives all pipeline thresholds without benchmark data exposure, eliminating a common source of evaluation leakage. We evaluate on two benchmarks. First, a VSCode issue-tracking dataset (13K issues, 120K events) where deduplication-based consolidation achieves 97.2% retention precision with 58% store reduction (+21.8 pp over baseline). Second, the LongMemEval personal-chat benchmark where we conduct the first streaming M-tier evaluation (475 sessions, ~540K unique turns). At a 200K-token context budget, our pipeline matches raw retrieval accuracy (70.1% vs. 71.2%, overlapping 95% CI) while exposing a tunable accuracy/store-size operating curve. At S-tier scale (50 sessions), dedup-based consolidation yields a +13.3 pp improvement in preference recall.
arXiv arXiv cs.AI · 1 天前 · 相关度 70% 热度★★☆☆☆
263
The Agent Use of Agent Beings: Agent Cybernetics Is the Missing Science of Foundation Agents
智能体对智能体存在的利用:智能体控制论是基础智能体缺失的科学
基础大模型学术论文

本文提出当前基于大模型的基础智能体(foundation agents)在长程复杂任务中主要依赖工程经验,缺乏第一性原理的理论支撑。作者将经典控制论的六条法则映射为六项智能体设计原则,并综合为可靠性、终身运行和自我改进三个工程要求,形成“智能体控制论”框架。该框架在代码生成、计算机使用和自动化研究三个应用领域进行了分析,识别了失败模式并给出了具体工程建议,旨在为基础智能体的可靠部署建立科学基础。

arXiv:2605.10754v1 Announce Type: new Abstract: LLM-based foundation agents that perceive, reason, and act across thousands of reasoning steps are rapidly becoming the dominant paradigm for deploying artificial intelligence in open-ended, long-horizon complex tasks. Despite this significance, the field remains overwhelmingly engineering-driven. Engineering practice has converged on useful primitives (tool loops, memory banks, harnesses, reflection steps), yet these are assembled by empirical trial and error rather than from first principles. Fundamental questions remain open: under what conditions does a long-running agent remain on-task? How should an agent respond when its environment exceeds its representational capacity? What architectural properties are necessary for safe self-improvement? We argue that cybernetics, the mid-twentieth-century science of control and communication in complex systems, provides the missing theoretical scaffold for foundation agents. By mapping six canonical laws of classical cybernetics onto six agent design principles, and synthesizing those principles into three engineering desiderata (reliability, lifelong running, and self-Improvement), we arrive at a framework termed Agent Cybernetics. Three application domains, code generation, computer use and automated research, exemplify the analytical framework of agent cybernetics by identifying failure modes and concrete engineering recommendations. We hope that agent cybernetics opens a new research venue and establishes the scientific foundation that foundation agents need for principled, reliable real-world deployment.
arXiv arXiv cs.AI · 1 天前 · 相关度 70% 热度★★☆☆☆
264
SOD: Step-wise On-policy Distillation for Small Language Model Agents
SOD:面向小语言模型代理的逐步在线策略蒸馏
训练微调学术论文

本文提出SOD(Step-wise On-policy Distillation)框架,解决小语言模型在工具集成推理中因错误工具调用引发的级联误差和师生分布分歧加剧问题。SOD在每一步根据步级散度自适应调整蒸馏损失权重,在高散度区域降低不可靠的教师信号强度,在低散度区域保留稠密监督。在数学、科学和代码等困难基准上,SOD最多超越次优基线20.86%,0.6B学生模型在AIME 2025上取得26.13%的准确率,证明代理推理可有效蒸馏到轻量级模型。

arXiv:2605.07725v1 Announce Type: new Abstract: Tool-integrated reasoning (TIR) is difficult to scale to small language models due to instability in long-horizon tool interactions and limited model capacity. While reinforcement learning methods like group relative policy optimization provide only sparse outcome-level rewards. Recently, on-policy distillation (OPD) has gained popularity by supplying dense token-level supervision from a teacher on student-generated trajectories. However, our experiments indicate that applying OPD to TIR leads to a critical failure mode: erroneous tool calls tend to cascade across subsequent reasoning steps, progressively amplifying student-teacher divergence and rendering the teacher&#39;s token-level supervision increasingly unreliable. To address this, we propose SOD, a step-wise on-policy distillation framework for small language model agents, which adaptively reweights distillation strength at each step based on step-level divergence. Therefore, SOD can attenuate potentially misleading teacher signals in high-divergence regions while preserving dense guidance in well-aligned states. Experiments on challenging math, science, and code benchmarks show that SOD achieves up to 20.86% improvement over the second-best baseline. Notably, our 0.6B student achieves 26.13% on AIME 2025, demonstrating effective transfer of agentic reasoning to lightweight models. Our code is available at https://github.com/YoungZ365/SOD.
arXiv arXiv cs.CL · 2 天前 · 相关度 95% 热度★★☆☆☆
265
Flow-OPD: On-Policy Distillation for Flow Matching Models
Flow-OPD:面向流匹配模型的在线策略蒸馏
训练微调基础大模型学术论文

本文针对流匹配文本到图像模型在多任务对齐中的奖励稀疏和梯度干扰问题,提出首个统一后训练框架Flow-OPD。该框架采用两阶段对齐:先通过单奖励GRPO微调训练领域专家教师,再通过流式冷启动和三步在线策略蒸馏将异构专业知识整合到学生模型,并引入流形锚定正则化以缓解纯RL驱动对齐导致的美学退化。基于Stable Diffusion 3.5 Medium,Flow-OPD将GenEval分数从63提升至92,OCR准确率从59提升至94,相比原始GRPO整体提升约10个点,同时保持图像保真度并展现“超越教师”效应,为构建通用文生图模型提供了可扩展的对齐范式。

arXiv:2605.08063v1 Announce Type: new Abstract: Existing Flow Matching (FM) text-to-image models suffer from two critical bottlenecks under multi-task alignment: the reward sparsity induced by scalar-valued rewards, and the gradient interference arising from jointly optimizing heterogeneous objectives, which together give rise to a &#39;seesaw effect&#39; of competing metrics and pervasive reward hacking. Inspired by the success of On-Policy Distillation (OPD) in the large language model community, we propose Flow-OPD, the first unified post-training framework that integrates on-policy distillation into Flow Matching models. Flow-OPD adopts a two-stage alignment strategy: it first cultivates domain-specialized teacher models via single-reward GRPO fine-tuning, allowing each expert to reach its performance ceiling in isolation; it then establishes a robust initial policy through a Flow-based Cold-Start scheme and seamlessly consolidates heterogeneous expertise into a single student via a three-step orchestration of on-policy sampling, task-routing labeling, and dense trajectory-level supervision. We further introduce Manifold Anchor Regularization (MAR), which leverages a task-agnostic teacher to provide full-data supervision that anchors generation to a high-quality manifold, effectively mitigating the aesthetic degradation commonly observed in purely RL-driven alignment. Built upon Stable Diffusion 3.5 Medium, Flow-OPD raises the GenEval score from 63 to 92 and the OCR accuracy from 59 to 94, yielding an overall improvement of roughly 10 points over vanilla GRPO, while preserving image fidelity and human-preference alignment and exhibiting an emergent &#39;teacher-surpassing&#39; effect. These results establish Flow-OPD as a scalable alignment paradigm for building generalist text-to-image models.
arXiv arXiv cs.CV · 2 天前 · 相关度 90% 热度★★☆☆☆
266
Think-with-Rubrics: From External Evaluator to Internal Reasoning Guidance
Think-with-Rubrics: 从外部评估到内部推理指导
训练微调学术论文

本文提出Think-with-Rubrics新范式,将评分标准(rubric)从外部评估器转变为大模型自身推理过程中的内部指导。训练时模型依次生成评分标准与回答,并通过评分标准验证器评估二者一致性进行联合监督。在多个指令遵循基准上,该方法平均超越以黄金评分标准为奖励的基线3.87分。实验进一步表明,黄金评分标准与自生成评分标准分别通过提升自生成标准质量和增强回答内部一致性来驱动性能提升。

arXiv:2605.07461v1 Announce Type: new Abstract: Rubrics have been extensively utilized for evaluating unverifiable, open-ended tasks, with recent research incorporating them into reward systems for reinforcement learning. However, existing frameworks typically treat rubrics only as external evaluator disjointed from the policy&#39;s primary reasoning trace. Such design confines rubrics to post-hoc measurement, leaving them unable to actively guide the model&#39;s generation process. In this work, we introduce Think-with-Rubrics, a novel paradigm for instruction following tasks. Think-with-Rubrics integrates rubric generation into the reasoning context, transforming the rubric from an independent artifact into an internal guidance of LLM&#39;s generation. During training, LLM sequentially generates a rubric followed by a response, while a trained rubric verifier provides joint supervision by evaluating the consistency between the answer and the self-generated / golden rubrics. Experiments across multiple benchmarks demonstrate that Think-with-Rubrics consistently outperforms the Rubric-as-Reward baseline supervised by golden rubrics by an average of 3.87 points. We have also discussed the mechanism by which Think-with-Rubrics enhances model performance. Experimental results demonstrate that supervision from golden rubrics and self-generated rubrics enhances the performance of Think-with-Rubrics by improving the quality of self-generated rubrics and increasing the internal consistency of responses respectively.
arXiv arXiv cs.CL · 2 天前 · 相关度 90% 热度★★☆☆☆
267
Topology-Enhanced Alignment for Large Language Models: Trajectory Topology Loss and Topological Preference Optimization
大语言模型的拓扑增强对齐:轨迹拓扑损失与拓扑偏好优化
训练微调学术论文

本文针对大模型对齐训练忽略表征空间全局几何结构的问题,提出拓扑增强框架。在SFT阶段,引入轨迹拓扑损失TTL,利用0维持续同调从提示和答案嵌入中提取“提示-答案桥”,约束模型更新方向与拓扑桥而非随机方向对齐。在DPO阶段,提出拓扑偏好优化TPO,构建主题特定语义偏好向量,将拒绝和选定回复的改进方向与该向量对齐,并采用动态权重平衡DPO与TPO损失。在Qwen2.5-7B-Instruct上使用UltraChat和HH-RLHF评估,拓扑增强目标在自动偏好度量和LLM判断中一致超越强非拓扑基线,同时保持或改善毒性指标。结果表明持久同调和轨迹几何为可控对齐提供了有前景的新方向。

arXiv:2605.07172v1 Announce Type: new Abstract: Alignment of large language models (LLMs) via SFT and RLHF/DPO typically ignores the global geometry of the representation space, relying instead on local token likelihoods or scalar scores. We view generation as tracing a semantic trajectory in hidden space and propose a topology-enhanced alignment framework that regularizes these trajectories using 0-dimensional persistent homology. First, for SFT, we introduce Trajectory Topology Loss (TTL). Treating prompt and gold-answer embeddings as a mixed point cloud, we use a 0D persistent homology algorithm to extract &#34;prompt-answer bridges.&#34; TTL aligns the model&#39;s actual update direction with these topological bridges rather than arbitrary directions. Second, for DPO, we propose Topological Preference Optimization (TPO). TPO constructs topic-specific semantic preference vectors and aligns the improvement direction between rejected and chosen responses with these vectors in an intermediate hidden layer. We also introduce a dynamic weighting scheme to balance DPO and TPO losses. Evaluating on Qwen2.5-7B-Instruct using UltraChat and Anthropic HH-RLHF, our topology-enhanced objectives consistently outperform strong non-topological baselines (e.g., per-example, nearest-neighbor, random regularizers) on automatic preference metrics and LLM-judge evaluations, while maintaining or improving toxicity. Results show persistent homology and trajectory geometry offer a promising direction for controllable alignment.
arXiv arXiv cs.CL · 2 天前 · 相关度 90% 热度★★☆☆☆
268
A$^2$RD: Agentic Autoregressive Diffusion for Long Video Consistency
A^2RD:面向长视频一致性的代理自回归扩散架构
基础大模型学术论文

该论文提出A^2RD架构,将长视频生成为闭环的检索-合成-优化-更新过程,通过多模态视频记忆、自适应段生成与层次化测试时自我改进三个核心组件,解决长期语义漂移与叙事崩溃问题。论文同时发布了一个专门测试非线性实体与环境转换的挑战性基准LVBench-C。在1到10分钟视频的公开与自定义基准上,一致性提升达30%,叙事连贯性提升20%,人工评估也确认了运动与过渡平滑性的显著改善。

arXiv:2605.06924v1 Announce Type: new Abstract: Synthesizing consistent and coherent long video remains a fundamental challenge. Existing methods suffer from semantic drift and narrative collapse over long horizons. We present A$^2$RD, an Agentic Auto-Regressive Diffusion architecture that decouples creative synthesis from consistency enforcement. A$^2$RD formulates long video synthesis as a closed-loop process that synthesizes and self-improves video segment-by-segment through a Retrieve--Synthesize--Refine--Update cycle. It comprises three core components: (i) Multimodal Video Memory that tracks video progression across modalities; (ii) Adaptive Segment Generation that switches among generation modes for natural progression and visual consistency; and (iii) Hierarchical Test-Time Self-Improvement that self-improves each segment at frame and video levels to prevent error propagation. We further introduce LVBench-C, a challenging benchmark with non-linear entity and environment transitions to stress-test long-horizon consistency. Across public and LVBench-C benchmarks spanning one- to ten-minute videos, A$^2$RD outperforms state-of-the-art baselines by up to 30% in consistency and 20% in narrative coherence. Human evaluations corroborate these gains while also highlighting notable improvements in motion and transition smoothness.
arXiv arXiv cs.CV · 2 天前 · 相关度 88% 热度★★☆☆☆
269
RateQuant: Optimal Mixed-Precision KV Cache Quantization via Rate-Distortion Theory
RateQuant:基于率失真理论的最优混合精度KV缓存量化
推理部署性能优化学术论文

针对大语言模型推理中KV缓存成为内存瓶颈的问题,本文指出不同注意力头重要性存在显著差异,但现有量化方法对所有头使用统一位宽,忽略了这一差异。作者发现简单的混合精度分配会因不同量化器的失真衰减率β(从3.6到5.3)差异而导致“失真模型不匹配”,反而可能使性能劣于均匀量化。为此提出RateQuant方法,通过小规模校准集为每个量化器拟合失真模型,再利用率失真理论中的反向注水法实现闭式最优比特分配。在Qwen3-8B模型上,以2.5平均比特量化时,RateQuant将KIVI的困惑度由49.3降至14.9(降低70%),并使QuaRot降低6.6 PPL,校准仅需在单个GPU上耗时1.6秒,推理时无额外开销。

arXiv:2605.06675v1 Announce Type: cross Abstract: Large language models cache all previously computed key-value (KV) pairs during generation, and this KV cache grows linearly with sequence length, making it a primary memory bottleneck for serving. Quantizing the KV cache to fewer bits reduces this cost, yet all current quantizers assign the same bit-width to every attention head, ignoring the large variation in head importance. A natural idea is to allocate more bits to important heads and fewer to the rest. We show, however, that such mixed-precision allocation has a hidden pitfall: each quantizer follows a different distortion curve D(b)=alpha*beta^{-b}, and the decay rate beta varies from 3.6 to 5.3 across quantizer designs. Applying one quantizer&#39;s distortion model to another inverts the allocation order and makes performance worse than uniform quantization. We call this failure mode distortion model mismatch and propose RateQuant to resolve it. RateQuant fits a per-quantizer distortion model from a small calibration set, then solves the resulting bit-allocation problem in closed form via reverse waterfilling from rate-distortion theory. On Qwen3-8B at 2.5 average bits, calibrated RateQuant reduces KIVI&#39;s perplexity from 49.3 to 14.9 (70% reduction) and improves QuaRot by 6.6 PPL. The entire calibration takes 1.6 s on a single GPU and adds zero overhead at inference time.
arXiv arXiv cs.CL · 2 天前 · 相关度 88% 热度★★☆☆☆
270
Adaptive Subspace Projection for Generative Personalization
面向生成式个性化的自适应子空间投影
基础大模型学术论文

本文针对生成式个性化中的语义崩塌问题,分析发现语义漂移集中于特定低维子空间,且个性化过程会扰动原始基础概念嵌入,导致参照点不稳定。提出一种无需训练的测试时嵌入调整方法AdaptSP,以稳定的预训练嵌入为锚点,将语义漂移投影至识别出的子空间,进行精确调整以缓解语义崩塌并保持主体一致性。实验表明该方法显著提升了提示忠实度和上下文对齐效果。

arXiv:2605.07257v1 Announce Type: new Abstract: Generative personalization often suffers from the semantic collapsing problem (SCP), where a learned personalized concept overpowers the rest of the text prompt, causing the model to ignore important contextual details. To address this, we first analyze the underlying cause, revealing that the semantic drift responsible for SCP is not random but is concentrated within a specific low-dimensional subspace. We also discover that the personalization process perturbs the embedding of the original base concept, making it an unstable reference point. Based on these insights, we introduce Test-time Embedding Adjustment with Adaptive Subspace Projection (AdaptSP), a training-free method that uses the stable, pre-trained embedding as an anchor. AdaptSP isolates the semantic drift and projects it onto the identified subspace, performing a precise adjustment that mitigates SCP while maintaining the subject identity. Our experiments show that this targeted approach significantly improves prompt fidelity and contextual alignment.
arXiv arXiv cs.CV · 2 天前 · 相关度 85% 热度★★☆☆☆
271
LensVLM: Selective Context Expansion for Compressed Visual Representation of Text
LensVLM: 面向文本压缩视觉表示的选择性上下文扩展
推理部署基础大模型

本文提出LensVLM,一种用于视觉语言模型(VLM)的推理框架与后训练方案,使模型能扫描压缩后的文本图像,并仅对相关区域进行选择性还原。该方法在Qwen3.5-9B-Base上实现了4.3倍有效压缩比下与全文本上限相当的准确率,并在最多10.1倍压缩时仍优于检索式基线与视觉/文本压缩基线。LensVLM还能泛化到多模态文档与代码理解任务,分析表明训练使视觉压缩对渲染选项鲁棒,且随着压缩增大,模型逐渐依赖扩展后的内容而非不可靠的视觉阅读,为工具选择提供了实用指导。

arXiv:2605.07019v1 Announce Type: new Abstract: Vision Language Models (VLMs) offer the exciting possibility of processing text as rendered images, bypassing the need for tokenizing the text into long token sequences. Since VLM image encoders map fixed-size images to a fixed number of visual tokens, varying rendering resolution provides a fine-grained compression knob. However, accuracy deteriorates quickly as compression increases: characters shrink below the vision encoder&#39;s effective resolution, making them indistinguishable. To address this, we propose LensVLM, an inference framework and post-training recipe that enables VLMs to scan compressed images, then selectively expand only the relevant images to their uncompressed form via learned tools. Building on Qwen3.5-9B-Base, LensVLM maintains accuracy comparable to the full-text upper bound at 4.3x effective compression and outperforms retrieval-based, text- and visual-compression baselines up to 10.1x effective compression across seven text QA benchmarks. LensVLM also generalizes to multimodal document and code understanding tasks, with the accuracy gain over baselines growing as compression increases. Our analysis validates this approach: training makes visual compression robust to rendering choices, and as compression grows the model increasingly relies on expanded content rather than unreliable visual reading. The analysis also yields practical tool-choice guidance: text expansion is preferable for rendered text, while high-resolution image expansion suits native documents whose layout cues carry task-relevant information.
arXiv arXiv cs.CV · 2 天前 · 相关度 85% 热度★★☆☆☆
272
Fast Byte Latent Transformer
快速字节潜在变换器
推理部署学术论文

本文针对字节级语言模型自回归生成速度慢的瓶颈,提出了三种加速推理的新方法:BLT-Diffusion引入辅助扩散目标,实现每步并行生成多个字节;BLT-Self-speculation利用本地解码器进行自我投机解码;BLT-Diffusion+Verification在扩散生成后添加自回归验证步骤。这些技术在生成任务上的内存带宽成本估计可降低50%以上,显著提升了字节级模型的实用效率。

arXiv:2605.08044v1 Announce Type: new Abstract: Recent byte-level language models (LMs) match the performance of token-level models without relying on subword vocabularies, yet their utility is limited by slow, byte-by-byte autoregressive generation. We address this bottleneck in the Byte Latent Transformer (BLT) through new training and generation techniques. First, we introduce BLT Diffusion (BLT-D), a new model and our fastest BLT variant, trained with an auxiliary block-wise diffusion objective alongside the standard next-byte prediction loss. This enables an inference procedure that generates multiple bytes in parallel per decoding step, substantially reducing the number of forward passes required to generate a sequence. Second, we propose two extensions inspired by speculative decoding that trade some of this speed for higher generation quality: BLT Self-speculation (BLT-S), in which BLT&#39;s local decoder continues generating past its normal patch boundaries to draft bytes, which are then verified with a single full-model forward pass; and BLT Diffusion+Verification (BLT-DV), which augments BLT-D with an autoregressive verification step after diffusion-based generation. All methods may achieve an estimated memory-bandwidth cost over 50% lower than BLT on generation tasks. Each approach offers its own unique advantages, together removing key barriers to the practical use of byte-level LMs.
arXiv arXiv cs.CL · 2 天前 · 相关度 85% 热度★★☆☆☆
273
GLiGuard: Schema-Conditioned Classification for LLM Safeguard
GLiGuard:基于模式条件分类的大模型安全防护
推理部署

GLiGuard 是一个仅 0.3B 参数的基于双向编码器的安全防护模型,它将任务定义与标签语义编码为结构化令牌序列,实现单次前向传播即可同时对提示安全、回答安全、拒答检测、14 类细粒度伤害及 11 种越狱策略进行评估。在九个主流安全基准上,其 F1 分数与 7B-27B 的解码器防护模型相当,但模型尺寸缩小了 23-90 倍,吞吐量提升高达 16 倍,延迟降低 17 倍,显著降低了 LLM 内容审核的推理成本。该模型代码与模型权重已开源。

arXiv:2605.07982v1 Announce Type: new Abstract: Ensuring safe, policy-compliant outputs from large language models requires real-time content moderation that can scale across multiple safety dimensions. However, state-of-the-art guardrail models rely on autoregressive decoders with 7B--27B parameters, reformulating what is fundamentally a classification problem as sequential text generation, a design choice that incurs high latency and scales poorly to multi-aspect evaluation. In this work, we introduce \textbf{GLiGuard}, a 0.3B-parameter schema-conditioned bidirectional encoder adapted from GLiNER2 for LLM content moderation. The key idea is to encode task definitions and label semantics directly into the input sequence as structured token schemas, enabling simultaneous evaluation of prompt safety, response safety, refusal detection, 14 fine-grained harm categories, and 11 jailbreak strategies in a single non-autoregressive forward pass. This schema-conditioned design lets supported task and label blocks be composed directly in the input schema at inference time. Across nine established safety benchmarks, GLiGuard achieves F1 scores competitive with 7B--27B decoder-based guards despite being 23--90$\times$ smaller, while delivering up to 16$\times$ higher throughput and 17$\times$ lower latency. These results suggest that compact bidirectional encoders can approach the accuracy of much larger guard models while drastically reducing inference cost. Code and models are available at https://github.com/fastino-ai/GLiGuard.
arXiv arXiv cs.CL · 2 天前 · 相关度 85% 热度★★☆☆☆
274
Memory-Efficient Looped Transformer: Decoupling Compute from Memory in Looped Language Models
内存高效循环Transformer:解耦循环语言模型中的计算与内存
推理部署基础大模型

本文提出内存高效循环Transformer(MELT),解决循环LLM(如Ouro)在增加推理轮次时KV缓存线性增长的问题。MELT采用每层单一KV缓存在所有推理循环间共享,并通过可学习的门控机制更新,从而实现恒定内存占用的迭代推理。训练阶段采用两阶段过程:从LoopLM模型插值过渡到MELT,再进行注意力对齐蒸馏。实验表明,基于Ouro微调的MELT在保持可比参数规模标准LLM的内存开销下,性能优于同尺寸模型,仅需轻量级后训练即可维持循环LM的性能。

arXiv:2605.07721v1 Announce Type: new Abstract: Recurrent LLM architectures have emerged as a promising approach for improving reasoning, as they enable multi-step computation in the embedding space without generating intermediate tokens. Models such as Ouro perform reasoning by iteratively updating internal representations while retaining a standard Key-Value (KV) cache across iterations, causing memory consumption to grow linearly with reasoning depth. Consequently, increasing the number of reasoning iterations can lead to prohibitive memory usage, limiting the practical scalability of such architectures. In this work, we propose Memory-Efficient Looped Transformer (MELT), a novel architecture that decouples reasoning depth from memory consumption. Instead of using a standard KV cache per layer and loop, MELT maintains a single KV cache per layer that is shared across reasoning loops. This cache is updated over time via a learnable gating mechanism. To enable stable and efficient training under this architecture, we propose to train MELT using chunk-wise training in a two phase procedure: interpolated transition, followed by attention-aligned distillation, both from the LoopLM starting model to MELT. Empirically, we show that MELT models fine-tuned from pretrained Ouro parameters outperform standard LLMs of comparable size, while maintaining a memory footprint comparable to those models and dramatically smaller than Ouro&#39;s. Overall, MELT achieves constant-memory iterative reasoning without sacrificing LoopLM performance, using only a lightweight post-training procedure.
arXiv arXiv cs.CL · 2 天前 · 相关度 85% 热度★★☆☆☆
275
Not All Tokens Learn Alike: Attention Entropy Reveals Heterogeneous Signals in RL Reasoning
并非所有Token学习方式相同:注意力熵揭示RL推理中的异构信号
训练微调学术论文

本研究通过注意力熵分析强化学习后训练中token级学习信号的异质性,发现低注意力熵的“锚点”token梯度稳定但易在困难基准上饱和,高注意力熵的“探索者”token梯度波动大但可能含有困难推理信号。基于此提出动态熵感知的软重加权干预,将Qwen3-8B-Base在留出集上的平均推理得分从34.39提升至37.40。研究揭示了均匀token平均可能掩盖RL后训练中有意义的异构性,为优化推理能力提供了新视角。

arXiv:2605.07660v1 Announce Type: new Abstract: Reinforcement-learning-based post-training has become a key approach for improving the reasoning ability of large language models, but its token-level learning signals remain poorly understood. This work studies their heterogeneity through attention entropy, which measures how concentrated or diffuse the contextual support is for each response token. We first show that token-level RL objectives are sparsely estimable: uniformly random 20 percent token subsets preserve much of the full-token held-out performance, suggesting substantial redundancy in token-level updates. However, entropy-structured subsets behave very differently. Low-attention-entropy tokens, which we call anchors, rely on concentrated support, produce stable gradients aligned with full-token updates, and provide a reliable optimization backbone, but tend to plateau on harder benchmarks. High-attention-entropy tokens, which we call explorers, aggregate more diffuse context and induce larger but more volatile gradients. Explorer-only training is unstable on average, though rare successful runs suggest that these tokens may contain useful hard-reasoning signals when optimization remains stable. We support this anchor-explorer spectrum with evidence-gathering analyses, entropy dynamics, gradient-geometry diagnostics, and controls showing that position, predictive entropy, and loss normalization do not explain the observed asymmetry. Finally, a dynamic entropy-aware soft-reweighting intervention improves Qwen3-8B-Base from 34.39 to 37.40 held-out average in the strongest setting. These findings suggest that attention entropy reveals optimization-relevant structure in token-level RL signals, and that uniform token averaging can obscure meaningful heterogeneity in reasoning post-training.
arXiv arXiv cs.CL · 2 天前 · 相关度 85% 热度★★☆☆☆
276
LaTER: Efficient Test-Time Reasoning via Latent Exploration and Explicit Verification
通过潜在探索与显式验证的高效测试时推理
推理部署学术论文

针对思维链推理生成大量离散 token 导致开销大的问题,本文提出 LaTER 两阶段范式:先在连续潜在空间进行有界探索,再切换到显式思维链进行验证和答案生成,无需训练即可通过隐藏状态投影、潜在 KV 缓存保留及熵/停止词探针决定切换时机。在 Qwen3-14B 上无训练 LaTER 可将总 token 数降低 16%-32%,AIME 2025 准确率从 70.0% 提升至 73.3% 且 token 数从 15730 降至 10661;进一步构建 Latent-Switch-69K 数据集进行微调后,AIME 2025 准确率达到 80.0%,比标准 CoT 基线高 10 个百分点,同时少用 33% 的 token。

arXiv:2605.07315v1 Announce Type: new Abstract: Chain-of-thought (CoT) reasoning improves large language models (LLMs) on difficult tasks, but it also makes inference expensive because every intermediate step must be generated as a discrete token. Latent reasoning reduces visible token generation by propagating continuous states, yet replacing explicit derivations with latent computation can hurt tasks that require symbolic checking. We propose Latent-Then-Explicit Reasoning (LaTER), a two-stage paradigm that first performs bounded exploration in a continuous latent space and then switches to explicit CoT for verification and answer generation. In a training-free instantiation, LaTER projects final-layer hidden states back to the input embedding space, preserves the latent KV cache, and uses entropy and model-native stop-token probes to decide when to switch. We find that strong reasoning models already exhibit structured latent trajectories under this interface. On Qwen3-14B, training-free LaTER reduces total token usage by 16%-32% on several benchmarks while matching or improving accuracy on most of them; for example, it improves AIME 2025 from 70.0% to 73.3% while reducing tokens from 15,730 to 10,661. We further construct Latent-Switch-69K, a supervised corpus that pairs condensed solution intuitions with shortened explicit derivations. Fine-tuning with latent rollout and halting supervision yields additional gains: trained LaTER reaches 80.0% accuracy on AIME 2025, 10.0 points above the standard CoT baseline, while using 33% fewer tokens. Our code, data, and model are available at https://github.com/TioeAre/LaTER.
arXiv arXiv cs.CL · 2 天前 · 相关度 85% 热度★★☆☆☆
277
PaT: Planning-after-Trial for Efficient Test-Time Code Generation
PaT:先执行后规划的测试时代码生成高效策略
推理部署性能优化

本文提出Planning-after-Trial (PaT) 自适应策略,用于大语言模型的测试时代码生成。该方法仅在验证失败时调用规划器,避免了对简单问题的不必要规划开销,并支持异构模型配置——由低成本模型负责生成尝试,强大模型进行针对性规划干预。在多个基准与模型家族上,PaT显著推动了成本-性能帕累托前沿,在性能持平大模型的同时将推理成本降低约69%。

arXiv:2605.07248v1 Announce Type: new Abstract: Beyond training-time optimization, scaling test-time computation has emerged as a key paradigm to extend the reasoning capabilities of Large Language Models (LLMs). However, most existing methods adopt a rigid Planning-before-Trial (PbT) policy, which inefficiently allocates test-time compute by incurring planning overhead even on directly solvable problems. We propose Planning-after-Trial (PaT), an adaptive policy for code generation that invokes a planner only upon verification failure. This adaptive policy naturally enables a heterogeneous model configuration: a cost-efficient model handles generation attempts, while a powerful model is reserved for targeted planning interventions. Empirically, across multiple benchmarks and model families, our approach significantly advances the cost-performance Pareto frontier. Notably, our heterogeneous configuration achieves performance comparable to a large homogeneous model while reducing inference cost by approximately 69\%.
arXiv arXiv cs.CL · 2 天前 · 相关度 85% 热度★★☆☆☆
278
Teaching Language Models to Think in Code
教语言模型用代码思考
训练微调推理部署学术论文

本文提出 ThinC(Thinking in Code)框架,将代码本身作为推理主体,而非作为自然语言调用的工具。该框架通过简短的 NL 规划步骤后,所有推理均以代码块及其执行输出连接。作者从教师模型蒸馏了 12.2k 代码中心化推理轨迹,经监督微调和强化学习训练出 ThinC-1.7B 与 ThinC-4B 模型。ThinC-4B 在五个竞赛级数学基准上一致超越所有 TIR 基线,甚至超过大得多的 Qwen3-235B-A22B-Thinking 模型,且 99.2% 的最终答案基于解释器输出,能可靠地从代码执行失败中恢复。

arXiv:2605.07237v1 Announce Type: new Abstract: Tool-integrated reasoning (TIR) has emerged as a dominant paradigm for mathematical problem solving in language models, combining natural language (NL) reasoning with code execution. However, this interleaved setup has three key limitations: code often acts as a post-hoc verifier, intermediate NL computations are error-prone, and NL and code play overlapping rather than clearly distinct roles. We propose ThinC (Thinking in Code), a framework in which code itself serves as the reasoner rather than as a tool invoked by NL. A ThinC trajectory begins with a brief NL planning step, after which all reasoning unfolds through code blocks connected only by their execution outputs. We distill 12.2k code-centric trajectories from a teacher model and train ThinC-1.7B and ThinC-4B with supervised fine-tuning followed by reinforcement learning. ThinC-4B consistently outperforms every TIR baseline on five competition-level math benchmarks and even surpasses the much larger Qwen3-235B-A22B-Thinking. Further analysis shows that ThinC reasons through code: 99.2% of its final answers are grounded in interpreter output, and the model recovers reliably from code execution failures without intermediate NL reasoning. Our code and models will be released soon.
arXiv arXiv cs.CL · 2 天前 · 相关度 85% 热度★★☆☆☆
279
CASCADE: Context-Aware Relaxation for Speculative Image Decoding
CASCADE:面向推测性图像解码的上下文感知松弛方法
推理部署学术论文

本文提出CASCADE方法,用于加速自回归图像生成中的推测解码。通过挖掘目标模型隐状态在深度和广度上的冗余模式,形式化语义可互换性和收敛性两个性质,实现对草稿令牌接受的松弛,无需额外训练即可提高接受率。同时利用目标模型的冗余信号改进草稿模型训练。在多种文本到图像模型上实现最高3.6倍加速,保持图像质量和提示词一致性,达到基于草稿模型的推测解码SOTA加速效果。

arXiv:2605.07230v1 Announce Type: new Abstract: Autoregressive generation is a powerful approach for high-fidelity image synthesis, but it remains computationally demanding and slow even on the most advanced accelerators. While speculative decoding has been explored to mitigate this bottleneck, existing approaches fail to achieve efficiency gains comparable to those observed in text generation. A key limitation is the target model&#39;s high uncertainty during image generation, which leads to high draft token rejection rates. In this work, we identify previously overlooked patterns in the target model&#39;s behavior that emerge naturally in tree-based speculative decoding. Specifically, we formalize two properties, semantic interchangeability and convergence, arising from the redundancies in the target model&#39;s hidden state representations. By capturing these redundancies across the depth and breadth of the predicted token tree, our method identifies principled opportunities for acceptance relaxation without requiring additional training. Additionally, we enhance standalone drafter performance by injecting the redundancy signals from the target model into drafter training with minimal modification. We evaluate our approach across multiple text-to-image models and drafter architectures. Results show that CASCADE achieves state-of-the-art speedups for drafter-based speculative decoding, with up to 3.6x acceleration, while maintaining image quality and text-prompt fidelity.
arXiv arXiv cs.CV · 2 天前 · 相关度 82% 热度★★☆☆☆
280
Not All Tokens Need 40 Steps: Heterogeneous Step Allocation in Diffusion Transformers for Efficient Video Generation
并非所有令牌都需要40步:扩散Transformer中的异质步数分配实现高效视频生成
推理部署学术论文

本文提出一种称为异质步数分配(HSA)的训练无关推理算法,针对扩散Transformer在视频生成中所有token使用相同去噪步数导致计算量过大的问题,基于token的运动速度动态分配不同的去噪预算。HSA通过KV缓存同步机制让活跃token参与完整序列而完全绕过非活跃token,并导出缓存欧拉更新以单次操作推进被跳过token的隐状态,避免额外模型评估。在Wan-2和LTX-2模型上的文本/图像到视频生成实验表明,HSA显著优于先前的缓存方法和普通流匹配基线,尤其在50%和25%运行时间的激进加速下仍能保持结构完整性和生成质量,无需昂贵的离线分析,实现了更优的质量-运行时间帕累托前沿。

arXiv:2605.06892v1 Announce Type: new Abstract: Diffusion Transformers (DiTs) have achieved state-of-the-art video generation quality, but they incur immense computational cost because standard inference applies the same number of denoising steps uniformly to every token in the sequence. It is well known that human vision ignores vast amounts of redundant motion. Why, then, do our densest models treat every spatiotemporal token with equal priority? In this paper, we introduce Heterogeneous Step Allocation (HSA), a training-free inference algorithm that assigns varying step budgets to different spatiotemporal tokens based on their velocity dynamics. To resolve the resulting sequence-length mismatch without sacrificing global context, HSA introduces a KV-cache synchronization mechanism that allows active tokens to attend to the full sequence while entirely bypassing inactive tokens. Furthermore, we derive a cached Euler update that advances the latent states of skipped tokens in a single operation without additional model evaluations. We evaluate HSA on the Wan-2 and LTX-2 models for both text-to-video (T2V) and image-to-video (I2V) generation. Our results demonstrate that HSA significantly outperforms previous state-of-the-art caching methods and the vanilla Flow Matching baseline, especially at aggressive acceleration regimes (e.g., 50% and 25% runtimes). Crucially, HSA achieves a superior quality-runtime Pareto frontier without the need for expensive offline profiling, robustly preserving structural integrity and generation quality even under tight computational budgets. Project page: https://ernestchu.github.io/hsa
arXiv arXiv cs.CV · 2 天前 · 相关度 82% 热度★★☆☆☆
281
LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling
LLM 改进 LLM:面向测试时扩展的智能体式发现
推理部署性能优化

本文提出 AutoTTS,一种环境驱动的框架,将测试时扩展(TTS)策略从人工启发式设计转向自动发现。通过构建宽度-深度 TTS 控制器合成环境,引入 beta 参数化使搜索可处理,并利用细粒度执行轨迹反馈提高发现效率。实验表明,自动发现的 TTS 策略在数学推理基准上整体准确率-成本权衡优于强人工基线,泛化到新基准和不同模型规模,且发现总成本仅 39.9 美元和 160 分钟。

arXiv:2605.08083v1 Announce Type: new Abstract: Test-time scaling (TTS) has become an effective approach for improving large language model performance by allocating additional computation during inference. However, existing TTS strategies are largely hand-crafted: researchers manually design reasoning patterns and tune heuristics by intuition, leaving much of the computation-allocation space unexplored. We propose an environment-driven framework, AutoTTS, that changes what researchers design: from individual TTS heuristics to environments where TTS strategies can be discovered automatically. The key to AutoTTS lies in environment construction: the discovery environment must make the control space tractable and provide cheap, frequent feedback for TTS search. As a concrete instantiation, we formulate width--depth TTS as controller synthesis over pre-collected reasoning trajectories and probe signals, where controllers decide when to branch, continue, probe, prune, or stop and can be evaluated cheaply without repeated LLM calls. We further introduce beta parameterization to make the search tractable and fine-grained execution trace feedback to improve discovery efficiency by helping the agent diagnose why a TTS program fails. Experiments on mathematical reasoning benchmarks show that the discovered strategies improve the overall accuracy--cost tradeoff over strong manually designed baselines. The discovered strategies generalize to held-out benchmarks and model scales, while the entire discovery costs only $39.9 and 160 minutes. Our data, and code will be open-source at https://github.com/zhengkid/AutoTTS.
arXiv arXiv cs.CL · 2 天前 · 相关度 82% 热度★★☆☆☆
282
Uneven Evolution of Cognition Across Generations of Generative AI Models
跨代生成式AI模型认知的不均衡演化
学术论文基础大模型

该研究采用心理测量框架评估多模态生成式AI模型的认知能力,发现其言语理解和工作记忆达到人类98百分位以上,而知觉推理却低于1百分位,呈现极度不均衡的认知架构。通过构建AIQ基准追踪6代模型,揭示语言化抽象推理能力的进步远快于视觉形式,而视觉-知觉组织基本停滞。结果表明当前的规模化与优化路径可能无法克服底层架构限制,难以实现均衡的类人通用智能。

arXiv:2605.06815v1 Announce Type: cross Abstract: The pursuit of artificial general intelligence necessitates robust methods for evaluating the cognitive capabilities of models beyond narrow task performance. Here, we introduce a psychometric framework to assess the cognitive profiles of generative AI, comparing them to human norms and tracking their evolution across generations. Initial evaluation of leading multimodal models using tasks adapted from the Wechsler Adult Intelligence Scale revealed a profoundly uneven cognitive architecture: near-ceiling performance in verbal comprehension and working memory (&gt;$98^{\text{th}}$ percentile) contrasted with near-floor performance in perceptual reasoning (&lt;$1^{\text{st}}$ percentile). To track developmental trajectories beyond human-normed limits, we developed the Artificial Intelligence Quotient (AIQ) Benchmark and applied it to six generations and two model families, revealing significant but asymmetric performance gains. Notably, we uncovered a sharp dissociation between modalities; abstract quantitative reasoning matured far more rapidly when presented linguistically compared to a visually analogous format, indicating an architectural bias towards language-based symbolic manipulation. While abstract visual reasoning improved, visual-perceptual organization remained largely stagnant. Collectively, these findings demonstrate that the cognitive abilities of generative models are evolving unevenly, suggesting that scaling and optimization approaches to AGI development alone may be insufficient to overcome fundamental architectural limitations in achieving balanced, human-like general intelligence.
arXiv arXiv cs.CV · 2 天前 · 相关度 82% 热度★★☆☆☆
283
How to Train Your Latent Diffusion Language Model Jointly With the Latent Space
如何将潜在扩散语言模型与潜在空间联合训练
基础大模型训练微调

本文提出潜在扩散语言模型LDLM,通过联合训练潜在编码器、扩散模型和解码器,基于预训练语言模型的表示构建易于去噪和解码的潜在空间。针对直接联合训练质量差的问题,引入MSE解码器损失、扩散到编码器预热、自适应时间步采样和解码器输入噪声等简单训练配方。在OpenWebText和LM1B数据集上,LDLM生成性能优于现有离散与连续扩散语言模型,同时推理速度提升2~13倍,证明联合学习潜在空间是使潜在扩散在文本生成中具备竞争力的关键步骤。

arXiv:2605.07933v1 Announce Type: new Abstract: Latent diffusion models offer an attractive alternative to discrete diffusion for non-autoregressive text generation by operating on continuous text representations and denoising entire sequences in parallel. The major challenge in latent diffusion modeling is constructing a suitable latent space. In this work, we present the Latent Diffusion Language Model (LDLM), in which the latent encoder, diffusion model, and decoder are trained jointly. LDLM builds its latent space by reshaping the representations of a pre-trained language model with a trainable encoder, yielding latents that are easy to both denoise and decode into tokens. We show that naive joint training produces a low-quality diffusion model, and propose a simple training recipe consisting of an MSE decoder loss, diffusion-to-encoder warmup, adaptive timestep sampling, and decoder-input noise. Ablations show that each component substantially impacts generation performance. On OpenWebText and LM1B, LDLM achieves better generation performance than existing discrete and continuous diffusion language models while being $2{\text -}13\times$ faster, indicating that jointly learning the latent space is a key step toward making latent diffusion competitive for text generation.
arXiv arXiv cs.CL · 2 天前 · 相关度 82% 热度★★☆☆☆
284
GazeVLM: Active Vision via Internal Attention Control for Multimodal Reasoning
GazeVLM:通过内部注意力控制实现多模态推理的主动视觉
基础大模型学术论文

本文提出GazeVLM,一种将主动视觉机制内化于多模态大模型的架构。模型通过自主生成凝视令牌(gaze token),对自身的因果注意力掩码施加自上而下控制,动态抑制无关视觉特征,实现空间选择性注意和模拟中心凹注视,从而在局部推理与全局感知之间灵活切换。训练采用群组相对策略优化(GRPO)以奖励有效定位,仅4B参数的GazeVLM在高分辨率多模态基准HRBench-4k和HRBench-8k上,分别超越同参数级的最优VLM约4%、超越基于图像思考的Agent管线5%以上,有效减轻空间推理稀释与语言幻觉问题。

arXiv:2605.07817v1 Announce Type: new Abstract: Human visual reasoning is governed by active vision, a process where metacognitive control drives top-down goal-directed attention, dynamically routing foveal focus toward task-relevant details while maintaining peripheral awareness of the global scene. In contrast, modern Vision-Language Models (VLMs) process visual information passively, relying on the static accumulation of massive token contexts that dilute spatial reasoning and induce linguistic hallucinations. Here we propose the following paradigm shift: GazeVLM, a multimodal architecture that internalizes this metacognitive oversight over its deployment of attention resources directly into the reasoning loop. By empowering the VLM to autonomously generate gaze tokens ($\texttt{}$), GazeVLM establishes a top-down control mechanism over its own causal attention mask. The model dynamically dictates its focal intent, triggering a continuous suppression bias that dampens irrelevant visual features, implementing spatial selective attention and simulating foveal fixation. Once local reasoning concludes, the bias lifts, seamlessly restoring the global view. This architecture enables the model to fluidly transition between global spatial awareness and localized focal reasoning without relying on external agentic contraptions like cropping tools, or inflating the context window with additional visual tokens derived from localized visual patches. Trained with a bespoke Group Relative Policy Optimization (GRPO) procedure that rewards valid grounding, our 4B-parameter GazeVLM delivers strong high-resolution multimodal reasoning performance, surpassing state-of-the-art VLMs in its parameter class by nearly 4% and agentic multimodal pipelines built around thinking with images by more than 5% on HRBench-4k and HRBench-8k.
arXiv arXiv cs.CV · 2 天前 · 相关度 82% 热度★★☆☆☆
285
SARA: Semantically Adaptive Relational Alignment for Video Diffusion Models
SARA:面向视频扩散模型的语义自适应关系对齐
训练微调学术论文

论文针对视频扩散模型存在的实体丢失、属性错配和交互弱化问题,提出语义自适应关系对齐方法SARA。该方法在现有的Token关系蒸馏基础上,引入文本条件的显著性图,通过一个轻量级第一阶段对齐器学习每个Token对的权重,将监督信号路由到主体-主体和主体-背景对,抑制背景-背景对。在Wan2.2持续训练设置下,SARA在13维VLM评估指标、VBench基准和用户盲测中均优于SFT、VideoREPA和MoAlign,同时提升了文本对齐和运动质量。

arXiv:2605.07800v1 Announce Type: new Abstract: Recent video diffusion models (VDMs) synthesize visually convincing clips, yet still drop entities, mis-bind attributes, and weaken the interactions specified in the prompt. Representation-alignment objectives such as VideoREPA and MoAlign improve fine-grained text following by distilling spatio-temporal token relations from a frozen visual foundation model, but their pairwise supervision budget is allocated by visual or motion cues rather than by how relevant each pair is to the prompt. We present SARA, Semantically Adaptive Relational Alignment, which keeps token-relation distillation (TRD) on a frozen VFM target and adds a text-conditioned saliency that decides which token pairs carry supervision. A lightweight Stage 1 aligner is trained with per-entity SAM 3.1 mask supervision and an InfoNCE regulariser, and its continuous saliency is fused into TRD through a pair-routing operator that assigns each token pair a weight whenever either of its two endpoints is salient, thereby routing supervision toward subject-subject and subject-background pairs and away from background-background ones. In the Wan2.2 continual-training setting, SARA improves both text alignment and motion quality over SFT, VideoREPA, and MoAlign on a 13-dimension VLM rubric, on the public VBench benchmarks, and in a blind user study.
arXiv arXiv cs.CV · 2 天前 · 相关度 82% 热度★★☆☆☆
286
Rethinking Dense Sequential Chains: Reasoning Language Models Can Extract Answers from Sparse, Order-Shuffling Chain-of-Thoughts
重新思考密集顺序链:推理语言模型可以从稀疏、乱序的思维链中提取答案
学术论文基础大模型推理部署

本文系统研究了推理语言模型的思维链(CoT)特性,通过对多个模型和基准进行移除、掩码、乱序和噪声注入等干预,发现答案提取对顺序和密集性极不敏感。实验表明,行级乱序仅导致准确率下降不足0.5个百分点,词级乱序仍保留62%-89%准确率;掩码数字使准确率归零,而掩码字母文本反而提升4.7个百分点。最激进的稀疏表示(去除所有自然语言、任意打乱行)仍能保持83%准确率,注入3倍频率的假答案不会改变提取结果。研究揭示答案提取操作基于稀疏、顺序不敏感且结构稳健的信息基质,为并行化和token高效推理生成提供了新方向。

arXiv:2605.07307v1 Announce Type: new Abstract: Modern reasoning language models generate dense, sequential chain-of-thought traces implicitly assuming that every token contributes and that steps must be consumed in order. We challenge both assumptions through a systematic intervention pipeline--removal, masking, shuffling, and noise injection--applied to model-generated reasoning chains across three models and three benchmarks. Our findings are counterintuitive on three dimensions. Order: Does the sequential order of a reasoning chain matter for answer extraction? No--line-level shuffling reduces accuracy by less than 0.5 pp; word-level shuffling retains 62%-89% accuracy; only token-level shuffling collapses to near zero. Pretrained-only and instruction-tuned variants exhibit near-identical tolerance (78.67% vs. 78.00% under line shuffling), indicating order-independence originates from pretraining rather than reasoning-specific fine-tuning. Dense: Is all the information in a reasoning chain important for answer extraction? No--masking numeric digits collapses accuracy to exactly 0%, while masking alphabetic prose improves accuracy by 4.7 pp. Robustness: Is a reasoning chain that is both order-shuffling and non-dense still robust? Yes--the most aggressively reduced representation (all natural language removed, lines arbitrarily shuffled) still achieves 83% accuracy, and injecting false answers at 3x true-answer frequency leaves accuracy unchanged (83.3%-&gt;83.3%), falsifying a frequency-based extraction account. These results establish that answer extraction operates on a sparse, order-insensitive, and structurally robust informational substrate, opening paths toward parallelized and token-efficient reasoning generation.
arXiv arXiv cs.CL · 2 天前 · 相关度 82% 热度★★☆☆☆
287
MedAction: Towards Active Multi-turn Clinical Diagnostic LLMs
MedAction:迈向主动多轮临床诊断大语言模型
训练微调学术论文

本文研究了大语言模型在真实临床多轮主动诊断中的三个核心失败模式:无依据的检查指令、诊断更新不可靠以及多轮连贯性退化。为此,提出了MedAction,一种基于树结构蒸馏的流水线,通过模型与环境交互合成高质量多轮诊断轨迹,并引入Disease Trajectory Consistency和Reasoning-Action Consistency两个知识图谱导向指标来筛选轨迹质量。利用该流水线构建了包含32,681条轨迹的MedAction-32K数据集,微调8B模型后在MedR-Bench和新提出的MedAction-300-Hard基准上均取得开源模型中的最优性能。

arXiv:2605.07305v1 Announce Type: new Abstract: Most existing LLM diagnoses are evaluated on static, single-turn settings where complete patient information is provided upfront, an oversimplification of real clinical practice. We study active diagnosis: the real-life clinical process of starting from initial observation, ordering tests, interpreting results, and updating a differential diagnosis across multiple turns. Through systematic analysis, we identify three recurring failure modes in current LLMs: ungrounded test ordering, unreliable diagnostic update, and degraded multi-turn coherence. Together, these failures reveal a core deficit: existing medical training data teaches models to reason from complete information but not to act under evolving, partial evidence. To address this gap, we introduce MedAction, a tree-structured distillation pipeline that synthesizes diverse and high-quality multi-turn diagnostic trajectories via LLM-environment interaction. We propose two knowledge-graph-grounded metrics to filter trajectory quality: Disease Trajectory Consistency (DTC), which tracks whether the model&#39;s hypothesis converges toward the correct diagnosis, and Reasoning-Action Consistency (RAC), which verifies that belief updates are driven by gathered evidence. Using this pipeline, we construct MedAction-32K, a dataset of 32,681 trajectories from 2,896 PMC cases. Fine-tuning an 8B model on MedAction-32K achieves state-of-the-art performance among open-source models on both MedR-Bench and our curated MedAction-300-Hard benchmark, pushing the edge for open-source medical LLMs.
arXiv arXiv cs.CL · 2 天前 · 相关度 82% 热度★★☆☆☆
288
SpecBlock: Block-Iterative Speculative Decoding with Dynamic Tree Drafting
SpecBlock:基于动态树草稿的块迭代推测解码
推理部署性能优化

本文提出 SpecBlock,一种块迭代的推测解码方法,结合了自回归草稿的路径依赖性和并行草稿的低成本。草稿器每次前向生成 K 个依赖位置(称为一个块),通过块扩展构建草稿树,并利用层间状态传递和块间状态继承维持路径依赖性。引入合作训练的排名头动态分配分支,并通过有效前缀掩码仅在草稿器可能生成的位置上进行训练。部署时采用成本感知的 bandit 机制,根据验证器反馈选择性更新草稿器。实验表明,相比于 EAGLE-3,SpecBlock 在仅消耗其 44-52% 的草稿成本下,平均加速比提升 8-13%,成本感知自适应将优势扩大至 11-19%。

arXiv:2605.07243v1 Announce Type: new Abstract: Speculative decoding accelerates LLM inference by drafting a tree of candidate continuations and verifying it in one target forward. Existing drafters fall into two camps with opposite weaknesses. Autoregressive drafters such as EAGLE-3 preserve dependence along each draft path but call the drafter once per tree depth, making drafting a non-trivial share of per-iteration latency. Parallel drafters cut drafter calls by predicting multiple future positions in one forward, but each position is predicted without seeing the others, producing paths the verifier rejects. In this paper, we propose SpecBlock, a block-iterative drafter that combines path dependence with cheap drafting. Each drafter forward produces K dependent positions and we call this a block. The draft tree grows through repeated block expansions. Two mechanisms explicitly carry path dependence to keep later draft positions accurate. Within each block, a layer-wise shift carries the previous position&#39;s hidden state into every decoder layer. Across blocks, each new block can start from any position of the previous block, inheriting its hidden state to extend the path. To spend verifier budget where acceptance is likely, a co-trained rank head replaces the fixed top-k tree by allocating per-position branching during drafting. To avoid training the drafter on prefixes it never produces at inference, a valid-prefix mask drops the loss at later positions once an earlier one is wrong. Beyond static drafting, a cost-aware bandit at deployment uses free verifier feedback to update the drafter selectively, only when the expected throughput gain exceeds the update cost. Experiments show that SpecBlock improves mean speedup by 8-13% over EAGLE-3 at 44-52% of its drafting cost, and cost-aware adaptation extends this lead to 11-19%.
arXiv arXiv cs.CL · 2 天前 · 相关度 82% 热度★★☆☆☆
289
Reformulating KV Cache Eviction Problem for Long-Context LLM Inference
重新定义长上下文LLM推理中的KV缓存逐出问题
推理部署性能优化

本文针对大模型长上下文推理中KV缓存带来的内存和延迟开销,提出一种新的逐出策略LaProx。该方法将传统的逐头权重平均方式重新形式化为输出感知的、逐层的矩阵乘法近似问题,显式建模注意力图与投影值状态的乘法交互,从而准确量化token贡献并考虑头间依赖。在此基础上,LaProx赋予token全局可比较的重要性分数,支持模型级统一选择,而非局部的逐头决策。在LongBench和Needle-In-A-Haystack等19个长上下文基准上的实验表明,该方法仅需5%的KV缓存即可维持模型性能,并在极端压缩场景下将精度损失降低最多2倍,显著优于现有方法。

arXiv:2605.07234v1 Announce Type: new Abstract: Large language models (LLMs) support long-context inference but suffer from substantial memory and runtime overhead due to Key-Value (KV) Cache growth. Existing KV Cache eviction methods primarily rely on local attention weights, neglecting the influence of value representations, output projection, and inter-head interactions. In this work, we reformulate KV Cache eviction from a conventional head-wise, weight-averaging approach into an output-aware, layer-wise matrix multiplication approximation problem. We introduce LaProx, a novel eviction strategy that explicitly models the multiplicative interaction between attention maps and projected value states to accurately quantify token contributions while accounting for inter-head dependencies. Building on this metric, we propose the first unified eviction strategy that assigns globally comparable importance scores to tokens, enabling model-wide selection instead of local, head-wise decisions. Experimental results across 19 datasets on long-context benchmarks LongBench and Needle-In-A-Haystack demonstrate that our approach maintains model performance with only 5\% of the KV cache and consistently outperforms prior works across all configurations. Notably, our method achieves up to 2$\times$ accuracy loss reduction under extreme compression scenarios compared to existing state-of-the-art baselines with minimal overhead.
arXiv arXiv cs.CL · 2 天前 · 相关度 82% 热度★★☆☆☆
290
Sword: Style-Robust World Models as Simulators via Dynamic Latent Bootstrapping for VLA Policy Post-Training
Sword:面向VLA策略后训练的风格鲁棒世界模型模拟器——基于动态潜在引导
训练微调学术论文基础大模型

本文针对视觉-语言-动作(VLA)模型与世界模型集成中存在的泛化差和长时域误差累积问题,提出鲁棒世界模型框架Sword。该方法通过结构引导的风格增强解耦视觉纹理与任务动态,提升对颜色、光照等扰动的泛化能力;同时采用动态潜在引导策略,在维持训练-推理一致性的同时控制内存开销。在LIBERO基准上的实验表明,Sword在生成质量、鲁棒性、保真度及VLA强化学习后训练成功率方面均显著优于基线WoVR。

arXiv:2605.07288v1 Announce Type: new Abstract: The integration of Vision-Language-Action (VLA) models with World Models has gained increasing attention. One representative approach treats learned World Models as generative simulators, enabling policy optimization entirely within &#34;imagination.&#34; However, when deployed as simulators for specific environments such as the LIBERO benchmark, existing World Models often suffer from poor generalization and long-horizon error accumulation. During closed-loop rollouts, these models are highly sensitive to initial-state perturbations; minor changes in color, illumination, and other visual factors can trigger cascading hallucinations, leading to severe blurriness or overexposure. Moreover, long-horizon error accumulation further degrades the quality and fidelity of predicted future states. These issues limit the reliability of World Models as simulators. To mitigate these problems, we propose Sword, a robust World Model framework. Our method introduces Structure-Guided Style Augmentation to disentangle the visual textures of interactive environments from task-relevant dynamics, thereby improving generalization. We further propose Dynamic Latent Bootstrapping, which maintains consistency between training and inference while keeping memory consumption low. Extensive experiments on the LIBERO benchmark show that our method significantly outperforms the baseline WoVR in terms of generalization, generation quality, robustness, fidelity, and the success rate of reinforcement-learning post-training for VLA models.
arXiv arXiv cs.CV · 2 天前 · 相关度 80% 热度★★☆☆☆
291
EULER-ADAS: Energy-Efficient & SIMD-Unified Logarithmic-Posit Engine for Precision-Reconfigurable Approximate ADAS Acceleration
EULER-ADAS:面向精度可重构近似ADAS加速的节能、SIMD统一的对数Posit引擎
AI芯片硬件推理部署学术论文

本文提出EULER-ADAS,一种基于有界域Posit表示和SIMD统一架构的神经网络计算引擎,专为高级驾驶辅助系统(ADAS)的低功耗、低延迟推理设计。该架构支持Posit-(8,0)、Posit-(16,1)和Posit-(32,2)三种精度模式,通过自适应对数尾数乘法和SIMD共享累加路径实现精度可重构,无需重复硬件。FPGA实现较精确Posit引擎降低最多41.4%的LUT、76.1%的延迟和71.9%的功耗,能耗延迟积较基4 Booth Posit乘法器降低10倍;28nm CMOS下面积0.013-0.016 mm²,功耗19.8-22.1 mW,频率达1.84 GHz。在图像分类、ADAS和边缘推理任务中,Posit-16/32配置相比FP32精度仅下降约1.5%,TinyYOLOv3原型在Pynq-Z2上实现78 ms延迟、0.29 W功耗和22.6 mJ/帧,验证了其在低功耗实时ADAS推理中的适用性。

arXiv:2605.06875v1 Announce Type: cross Abstract: Advanced driver-assistance systems (ADAS) require neural compute engines that deliver low-latency inference under strict power and area constraints. Posit arithmetic is attractive for such accelerators because it provides high numerical fidelity at low precision, but its variable-length regime encoding increases encode/decode cost and exposes the datapath to large regime-field fault effects. This paper presents EULER-ADAS, a SIMD-enabled logarithmic bounded-Posit neural compute engine for energyefficient and reliability-aware ADAS acceleration. The proposed datapath combines bounded-regime Posit representation, stageadaptive logarithmic mantissa multiplication with bit truncation, and a SIMD-shared quire accumulation path supporting Posit- (8,0), Posit-(16,1), and Posit-(32,2) execution. The unified architecture enables 4xPosit-8, 2xPosit-16, or 1xPosit-32 operation without duplicating precision-specific hardware. FPGA implementation shows that the proposed configurations reduce LUT count by up to 41.4%, delay by up to 76.1%, and power by up to 71.9% relative to exact Posit neural compute engines, while achieving up to 10x lower energy-delay product than radix-4 Booth-based Posit multipliers. In 28-nm CMOS, the bounded variants occupy 0.013-0.016 mm2 , consume 19.8-22.1 mW, and operate at up to 1.84 GHz. Application-level evaluation across image-classification, ADAS, and edge-inference workloads shows that the evaluated Posit-16 and Posit-32 configurations remain within about 1.5 percentage points of FP32 accuracy. A TinyYOLOv3 prototype on Pynq-Z2 achieves 78 ms latency at 0.29 W and 22.6 mJ/frame, demonstrating the suitability of EULERADAS for low-power real-time ADAS inference.
arXiv arXiv cs.CV · 2 天前 · 相关度 80% 热度★★☆☆☆
292
Object Hallucination-Free Reinforcement Unlearning for Vision-Language Models
面向视觉语言模型的无目标幻觉强化反学习
基础大模型训练微调学术论文

本文提出 HFRU,一种针对视觉语言模型的强化反学习框架,直接在视觉编码器上进行深度语义遗忘,以避免传统仅微调解码器导致的浅层遗忘和目标幻觉问题。方法包含两阶段:对齐破坏与基于 GRPO 的复合奖励优化,其中抽象奖励引导语义合理的替换以抑制幻觉。在物体识别和人脸身份任务中,HFRU 实现了超过 98% 的遗忘与保留性能,且几乎不引入目标幻觉,显著优于现有方法。代码已开源。

arXiv:2605.08031v1 Announce Type: new Abstract: Vision-language models (VLMs) raise growing concerns about privacy, copyright, and bias, motivating machine unlearning to remove sensitive knowledge. However, existing methods primarily fine-tune the language decoder, leading to superficial forgetting that fails to erase underlying visual representations and often introduces object hallucination. We propose HFRU, a reinforcement unlearning framework that operates on the vision encoder for deep semantic removal. Our two-stage approach combines alignment disruption with GRPO-based optimization using a composite reward, including an abstraction reward that encourages semantically valid substitutions and mitigates hallucinations. Experiments on object recognition and face identity tasks show that HFRU achieves over 98% forgetting and retention performance, while introducing negligible object hallucination, significantly outperforming prior methods.Our code and implementation details are available at https://github.com/XMUDeepLIT/HFRU.
arXiv arXiv cs.CV · 2 天前 · 相关度 80% 热度★★☆☆☆
293
SEIF: Self-Evolving Reinforcement Learning for Instruction Following
SEIF:面向指令遵循的自进化强化学习
训练微调学术论文

SEIF 提出一种自进化强化学习框架,通过指令难度进化与模型能力进化相互促进的闭环,持续提升大语言模型的指令遵循能力。框架包含导师(生成渐进式困难指令)、过滤器(剔除冲突或无效指令)、跟随者(学习遵循指令)和评判者(提供强化学习奖励信号)四个角色,导师和跟随者交替训练、共同进化。多模型实验表明 SEIF 能稳定提升指令遵循性能,并揭示了有效的自进化训练策略:前期充分训练打基础,后期适度训练防止过拟合。代码与数据已公开。

arXiv:2605.07465v1 Announce Type: new Abstract: Instruction following is a fundamental capability of large language models (LLMs), yet continuously improving this capability remains challenging. Existing methods typically rely either on costly external supervision from humans or strong teacher models, or on self-play training with static-difficulty instructions that cannot evolve as the model&#39;s capabilities improve. To address these limitations, we propose SEIF (Self-Evolving Reinforcement Learning for Instruction Following), a self-evolving framework for enhancing the instruction-following ability of LLMs. SEIF forms a closed self-evolution loop that improves the model&#39;s instruction-following ability, where instruction difficulty evolution and model capability evolution reinforce each other. SEIF consists of four roles: an Instructor that generates increasingly challenging instructions, a Filter that removes conflicting or invalid instructions to ensure data quality, a Follower that learns to follow evolved instructions, and a Judger that provides reward signals for reinforcement learning. The Instructor and Follower are alternately trained and co-evolve throughout the process. Experiments across multiple model scales and architectures show that SEIF consistently improves instruction-following performance, suggesting strong generality. Further analyses reveal the sources of improvement and identify an effective training strategy for self-evolution on open-ended tasks: sufficient early-stage training to build a solid foundation, followed by moderate late-stage training to mitigate overfitting and achieve better final performance. The code and data are publicly available at https://github.com/Rainier-rq1/SEIF.
arXiv arXiv cs.CL · 2 天前 · 相关度 80% 热度★★☆☆☆
294
Learning Agent Routing From Early Experience
从早期经验中学习智能体路由
推理部署基础大模型

论文针对LLM智能体高昂的延迟和计算成本,研究在冷启动条件下将查询路由至轻量级LLM推理或完整智能体执行的问题。提出无训练的BoundaryRouter框架,利用早期行为经验与基于规则的推理构建紧凑经验记忆,在推理时检索相似案例来指导路由决策。同时引入RouteBench基准,实验表明BoundaryRouter相比智能体减少60.6%推理时间,性能比直接LLM推理提升28.6%,显著优于其他路由方法。

arXiv:2605.07180v1 Announce Type: new Abstract: LLM agents achieve strong performance on complex reasoning tasks but incur high latency and compute cost. In practice, many queries fall within the capability boundary of cutting-edge LLMs and do not require full agent execution, making effective routing between LLMs and agents a key challenge. We study the problem of routing queries between lightweight LLM inference and full agent execution under realistic cold-start settings. To address this, we propose BoundaryRouter, a training-free routing framework that uses early behavioral experience and rubric-guided reasoning to decide whether to answer a query with direct LLM inference or escalate to an agent. BoundaryRouter builds a compact experience memory by executing both systems on a shared seed set and retrieves similar cases at inference time to guide routing decisions. To evaluate this method, we introduce RouteBench, a benchmark covering in-domain, paraphrased, and out-of-domain route settings. Experiments show that BoundaryRouter reduces inference time by 60.6% compared to the agent while improving performance by 28.6% over direct LLM inference, outperforming prompt-based and retrieval-only routing by an average of 37.9% and 8.2%, respectively.
arXiv arXiv cs.CL · 2 天前 · 相关度 80% 热度★★☆☆☆
295
LENS: Low-Frequency Eigen Noise Shaping for Efficient Diffusion Sampling
LENS: 面向高效扩散采样的低频率特征噪声整形
推理部署学术论文性能优化

论文针对蒸馏扩散模型在减少去噪步骤后图像质量下降的问题,提出LENS(Low-frequency Eigen Noise Shaping)框架。该方法基于低频率分量主导图像全局结构和视觉保真度的观察,在低维子空间内对噪声进行调制,避免了在高维潜在空间中的高昂计算开销。LENS采用轻量级独立网络进行针对性调制,在实现竞争性图像质量的同时,将FLOPs降低400-700倍,模型参数量减少25-75倍,推理时间开销减少10-20倍,显著提升了扩散模型的采样效率。

arXiv:2605.07253v1 Announce Type: new Abstract: Distilled diffusion models accelerate image generation by reducing the number of denoising steps, but often suffer from degraded image quality. To mitigate this trade-off, test-time optimization methods improve quality, yet their iterative nature incurs substantial computational overhead and leads to slow inference, limiting practical usability. Recent hypernetwork-based approaches amortize this process during training, but still require costly noise modulation in high-dimensional latent spaces. In this work, we propose LENS (Low-frequency Eigen Noise Shaping), an efficient noise modulation framework that operates in a low-dimensional subspace. Our approach is motivated by the observation that low-frequency components of the noise largely determine the global structure and visual fidelity of generated images. Based on this observation, we provide a theoretical justification for restricting modulation to the low-frequency subspace and derive a principled training objective. Building on this, LENS employs a lightweight, standalone network to selectively modulate these components, enabling efficient and targeted noise modulation. Extensive experiments demonstrate that LENS achieves competitive image quality while reducing FLOPs by 400-700$\times$, model parameters by 25-75$\times$, and inference-time overhead by 10-20$\times$ compared to prior methods.
arXiv arXiv cs.CV · 2 天前 · 相关度 78% 热度★★☆☆☆
296
Qwen3-VL-Seg: Unlocking Open-World Referring Segmentation with Vision-Language Grounding
Qwen3-VL-Seg:利用视觉-语言对齐解锁开放世界指代分割
基础大模型学术论文

文章提出Qwen3-VL-Seg,一种参数高效的多模态大模型扩展框架,将模型预测的边界框作为语义结构先验,通过轻量级框引导掩码解码器生成像素级分割结果,仅新增17M参数。为支持开放世界训练,构建了基于SA-1B的大规模指代分割数据集SA1B-ORS(包含类别导向和描述性实例子集),并设计了涵盖分布内/外样本的评测基准ORS-Bench。实验显示,该方法在闭集与开放世界指代分割、视觉定位任务中表现优异,尤其擅长语言密集型指令,且分割适配后仍保持通用多模态能力。

arXiv:2605.07141v1 Announce Type: new Abstract: Open-world referring segmentation requires grounding unconstrained language expressions to precise pixel-level regions. Existing multimodal large language models (MLLMs) exhibit strong open-world visual grounding, but their outputs remain limited to sparse bounding-box coordinates and are insufficient for dense visual prediction. Recent MLLM-based segmentation methods either directly predict sparse contour coordinates, struggling to reconstruct continuous object boundaries, or rely on external segmentation foundation models such as the Segment Anything Model (SAM), introducing substantial architectural and deployment overhead. We present Qwen3-VL-Seg, a parameter-efficient framework that treats the MLLM-predicted box as a semantically grounded structural prior and decodes it into pixel-level referring segmentation. At its core, a lightweight box-guided mask decoder combines multi-scale spatial feature injection, spatial-semantic query construction, box-guided high-resolution pixel fusion, and iterative mask-aware query refinement, introducing only 17M parameters (about 0.4\% of the base model). For scalable open-world training, we construct SA1B-ORS, an SA-1B-derived dataset with two subsets: SA1B-CoRS (category-oriented samples) and SA1B-DeRS (descriptive, instance-specific samples). For evaluation, we curate ORS-Bench, a manually screened benchmark with in-distribution and out-of-distribution subsets covering diverse referring expression types. Extensive experiments on referring expression segmentation, visual grounding, and ORS-Bench show that Qwen3-VL-Seg performs strongly across closed-set and open-world settings, with clear advantages on language-intensive instructions and strong out-of-distribution generalization. Evaluations on general multimodal benchmarks further show that the model broadly preserves general-purpose multimodal competence after segmentation-oriented adaptation.
arXiv arXiv cs.CV · 2 天前 · 相关度 78% 热度★★☆☆☆
297
Knowledge Transfer Scaling Laws for 3D Medical Imaging
三维医学影像中的知识迁移缩放定律
训练微调基础大模型学术论文

该论文研究了面向3D医学影像(CT、MRI、PET)的视觉基础模型预训练中的知识迁移缩放定律。发现不同模态的预训练损失和跨模态迁移性能遵循幂律缩放规律,且知识迁移具有强烈非对称性。基于此将数据分配建模为缩放律优化问题,得到的分配方案揭示了“枢纽-孤岛”结构:高可迁移模态作为枢纽应获得更多投入,而孤立模态需直接分配数据。实验表明,迁移感知的分配相比均匀采样提升高达58%,并在不同数据预算下泛化良好,下游疾病分类与器官/病灶分割任务验证了其有效性。

arXiv:2605.06859v1 Announce Type: new Abstract: Vision foundation models are increasingly moving beyond 2D to volumetric domains such as 3D medical imaging, where unified pretraining across different imaging modalities (i.e. CT, MRI, and PET) could provide foundational models for diverse clinical tasks. However, training such models requires mixing heterogeneous imaging domains, and current mixture strategies remain largely heuristic. In this work, we observe that different medical imaging domains scale at variable rates during pretraining, and knowledge transfer between domains is strongly asymmetric: training on one domain can substantially improve another, but the reverse may be much weaker. Interestingly, both MAE reconstruction loss and cross-domain transfer follow predictable power-law trends with domain-specific behaviors. Motivated by these findings, we formulate data allocation as a scaling-law optimization problem. The derived allocations reveal an interpretable hub-and-island structure: highly transferable domains emerge as hubs that benefit many others and deserve strategic allocation, while isolated domains act as islands requiring direct investment. Empirically, transfer-aware allocation outperforms data-proportional sampling by up to 58% and generalizes well to unseen budgets with r=0.989. Downstream validation on disease classification and organ/lesion segmentation further confirms that the derived transfer-aware mixtures provide stronger pretrained representations for clinical 3D medical imaging tasks.
arXiv arXiv cs.CV · 2 天前 · 相关度 78% 热度★★☆☆☆
298
Visual Text Compression as Measure Transport
视觉文本压缩作为度量传输
推理部署学术论文

本文将视觉文本压缩(VTC)建模为度量传输问题,将文本和视觉令牌视为经验概率测度,定义ViT编码器诱导的前推映射,其传输成本可分解为块内聚合的精度成本和跨块碎片的覆盖成本,且均可从下游无标签探针估计。基于此提出无下游标签的路由准则,用于判断给定输入是否使用视觉路径,并设计传输信息驱动的注视机制,对高成本区域以更高分辨率重新编码。在Qwen3-4B上针对24个NLP数据集评估,无标签准则在70.8%的数据集上匹配各数据集最优选择,平均任务得分提升3.3%,同时平均令牌数减少10.3%,有效平衡了压缩比与下游效用。

arXiv:2605.06708v1 Announce Type: new Abstract: Visual text compression (VTC) promises efficient long-context processing by rendering text into an image and re-encoding it with a vision-language model, often producing $3$--$20\times$ fewer decoder tokens than subword tokenization. Yet token savings do not translate predictably into downstream utility: on some tasks the visual path matches or exceeds the text path, on others it collapses, and the compression ratio itself does not predict which regime will occur. The missing quantity is therefore not another summary of efficiency, but a principled measure of task-relevant information loss induced by visual encoding. We address this problem by formulating VTC in the language of measure transport. Treating text and visual tokens as empirical probability measures, we show that the ViT patch encoder induces a push-forward map whose transport cost decomposes into a precision cost from within-patch aggregation and a coverage cost from cross-patch fragmentation. Both terms are estimable from downstream-label-free probes. This formulation yields two operational consequences: a downstream-label-free routing criterion that selects whether to use the visual path for a given input or benchmark instance, and a transport-informed foveation mechanism that re-encodes high-cost regions at higher resolution. Across $24$ NLP datasets at Qwen3-4B, our label-free rule matches the per-dataset oracle on $17/24$ datasets ($70.8\%$), and improves the average task score by $+3.3\%$ with $-10.3\%$ average tokens relative to a pure-LLM.
arXiv arXiv cs.CV · 2 天前 · 相关度 78% 热度★★☆☆☆
299
CA-SQL: Complexity-Aware Inference Time Reasoning for Text-to-SQL via Exploration and Compute Budget Allocation
CA-SQL: 基于探索与计算预算分配的复杂性感知推理时推理方法用于 Text-to-SQL
推理部署学术论文

本文提出 CA-SQL,一种新颖的 Text-to-SQL 推理流水线,通过估计任务难度动态调整候选解决方案的生成广度,并结合进化搜索原理的提示种子方法和新的投票选择策略,在 BIRD 基准的“挑战性”子集上以仅使用 GPT-4o-mini 达到 51.72% 的最优成绩,优于其他上下文学习方法,整体执行准确率达 61.06%,Soft F1 达 68.77%。

arXiv:2605.08057v1 Announce Type: new Abstract: While recent advancements in inference-time learning have improved LLM reasoning on Text-to-SQL tasks, current solutions still struggle to perform well on the most challenging tasks in the Bird-Bench (BIRD) benchmark. This is due to inadequate solution space exploration, which is necessary to uncover promising candidate queries that can be further refined to produce the correct output. To address this challenge, we introduce CA-SQL, a novel Text-to-SQL pipeline that utilizes the estimated difficulty of a task to dynamically scale the breadth of the exploration for generating solution candidates. In addition, we use a custom prompt seeding method, based on principles of evolutionary search, to further elicit exploratory behavior from the base LLM and a novel voting method to select the best candidate solution at the end of the search. Experiments demonstrate that our solution achieves a state-of-the-art score of 51.72% on the &#34;challenging&#34; tier of BIRD development set problems, using only GPT-4o-mini, out-performing other in-context learning approaches, even those that leverage larger models. Overall, our method attains a competitive 61.06% execution accuracy and 68.77% Soft F1 score on the BIRD development dataset.
arXiv arXiv cs.CL · 2 天前 · 相关度 78% 热度★★☆☆☆
300
On the Role of Strain and Vorticity in Numerical Integration Error for Flow Matching
应变与涡量在流匹配数值积分误差中的作用
学术论文推理部署性能优化

本文分析流匹配生成模型中速度场的性质对数值积分误差的影响,将速度雅可比分解为对称部分(应变率)和反对称部分(涡量),证明应变通过对数范数控制指数误差放大,而涡量仅对局部截断误差有线性贡献。研究表明最优传输速度场无旋且物质导数为零,欧拉方法可达二阶精度。基于此提出加权雅可比正则化,在2D合成数据上使5步积分误差降低2.7倍,在CIFAR-10上轻量微调后10步FID提升14%,在低NFE下有效降低推理成本。

arXiv:2605.06680v1 Announce Type: cross Abstract: Flow matching generates data by integrating a learned velocity field, where the number of integration steps (NFE) directly determines inference cost. We analyze which properties of the velocity field govern integration error by decomposing the velocity Jacobian into its symmetric part S (strain rate) and antisymmetric part Omega (vorticity). We prove that strain and vorticity play different roles: strain controls exponential error amplification through the logarithmic norm, while vorticity contributes only linearly to the local truncation error. We further show that the optimal transport velocity field is irrotational and has zero material derivative, implying second-order Euler accuracy; for exact displacement interpolation, the associated Lagrangian particle dynamics are integrated exactly by Euler. Motivated by this analysis, we study weighted Jacobian regularization with strain weight alpha and vorticity weight beta. Experiments on 2D synthetic data confirm the main theoretical predictions, showing up to 2.7x lower integration error at NFE=5. Preliminary CIFAR-10 experiments show consistent trends, with a lightweight fine-tuning procedure improving FID by 14 percent at NFE=10 while preserving high-NFE quality.
arXiv arXiv cs.CV · 2 天前 · 相关度 78% 热度★★☆☆☆
301
Mean Mode Screaming: Mean--Variance Split Residuals for 1000-Layer Diffusion Transformers
均值模式尖叫:用于1000层扩散Transformer的均值-方差分裂残差
基础大模型训练微调学术论文

本文针对将扩散Transformer(DiT)扩展到数百层时出现的“均值模式尖叫”(MMS)训练崩溃问题,通过机制审计发现其根因在于残差写回梯度中的均值相干冲击,导致token表示趋于同质化。作者提出均值-方差分裂残差(MV-Split),在保留原始残差路径的同时,通过泄漏的主干均值替换和分离的增益中心残差更新来防止崩溃。在400层单流DiT上,MV-Split阻止了基线模型发生的发散崩溃,且效果优于LayerScale等方法;进一步在1000层DiT上验证了极端深度下的稳定可训练性。

arXiv:2605.06169v1 Announce Type: cross Abstract: Scaling Diffusion Transformers (DiTs) to hundreds of layers introduces a structural vulnerability: networks can enter a silent, mean-dominated collapse state that homogenizes token representations and suppresses centered variation. Through mechanistic auditing, we isolate the trigger event of this collapse as Mean Mode Screaming (MMS). MMS can occur even when training appears stable, with a mean-coherent backward shock on residual writers that opens deep residual branches and drives the network into a mean-dominated state. We show this behavior is driven by an exact decomposition of these gradients into mean-coherent and centered components, compounded by the structural suppression of attention-logit gradients through the null space of the Softmax Jacobian once values homogenize. To address this, we propose Mean-Variance Split (MV-Split) Residuals, which combine a separately gained centered residual update with a leaky trunk-mean replacement. On a 400-layer single-stream DiT, MV-Split prevents the divergent collapse that crashes the un-stabilized baseline; it tracks close to the baseline&#39;s pre-crash trajectory while remaining substantially better than token-isotropic gating methods such as LayerScale across the full schedule. Finally, we present a 1000-layer DiT as a scale-validation run at boundary scales, establishing that the architecture remains stably trainable at extreme depth.
arXiv arXiv cs.CV · 2 天前 · 相关度 78% 热度★★☆☆☆
302
Beyond "I cannot fulfill this request": Alleviating Rigid Rejection in LLMs via Label Enhancement
超越“我无法满足此请求”:通过标签增强缓解大语言模型中的刚性拒绝
训练微调基础大模型

本文针对大语言模型安全对齐中常见的“刚性拒绝”问题,提出LANCE方法。LANCE通过变分推理进行标签增强,预测连续分布下的细粒度拒绝类别,为精炼模型提供多方向文本梯度,用以中和提示中的有害元素。最终使得LLM在保持安全性的同时,生成自然、灵活的回复,避免生硬拒答。实验表明LANCE在回复的有用性和自然性上显著超越现有基线方法。

arXiv:2605.07883v1 Announce Type: new Abstract: Large Language Models (LLMs) rely on safety alignment to obey safe requests while refusing harmful ones. However, traditional refusal mechanisms often lead to &#34;rigid rejection,&#34; where a general template (e.g., &#34;I cannot fulfill this request&#34;) indiscriminately triggers refusals and severely undermines the naturalness of interactions between humans and LLMs. To address this issue, LANCE is proposed in this paper to ensure safe yet flexible and natural responses via label enhancement. Specifically, LANCE employs variational inference to perform label enhancement, predicting a continuous distribution across multiple rejection categories. These fine-grained rejection distributions provide multi-way textual gradients for a refinement model to neutralize the hazardous elements in the prompt, so that the LLMs could generate safe responses that avoid rigid rejections while preserving the naturalness of interactions. Experiments demonstrate that LANCE significantly alleviates the rigid rejection problem while maintaining high security standards, significantly outperforming existing baseline models in terms of helpfulness and naturalness of responses.
arXiv arXiv cs.CL · 2 天前 · 相关度 78% 热度★★☆☆☆
303
SCOPE: Structured Decomposition and Conditional Skill Orchestration for Complex Image Generation
SCOPE:面向复杂图像生成的结构化分解与条件技能编排
基础大模型学术论文

本文提出SCOPE框架,通过结构化规范维护语义承诺,并在生成生命周期中条件性地调用检索、推理和修复技能,解决复杂视觉意图难以忠实实现的问题。为评估承诺级的意图实现,构建了人工标注基准Gen-Arena,提出实体门控意图通过率(EGIP)指标。实验表明,SCOPE在Gen-Arena上EGIP达0.60,显著优于基线,并在WISE-V和MindBench上也取得领先结果,验证了持续承诺跟踪对复杂图像生成的有效性。

arXiv:2605.08043v1 Announce Type: new Abstract: While text-to-image models have made strong progress in visual fidelity, faithfully realizing complex visual intents remains challenging because many requirements must be tracked across grounding, generation, and verification. We refer to these requirements as semantic commitments and formalize their lifecycle discontinuity as the Conceptual Rift, where commitments may be locally resolved or checked but fail to remain identifiable as the same operational units throughout the generation lifecycle. To address this, we propose SCOPE, a specification-guided skill orchestration framework that maintains semantic commitments in an evolving structured specification and conditionally invokes retrieval, reasoning, and repair skills around unresolved or violated commitments. To evaluate commitment-level intent realization, we introduce Gen-Arena, a human-annotated benchmark with entity- and constraint-level specifications, together with Entity-Gated Intent Pass Rate (EGIP), a strict entity-first pass criterion. SCOPE substantially outperforms all evaluated baselines on Gen-Arena, achieving 0.60 EGIP, and further achieves strong results on WISE-V (0.907) and MindBench (0.61), demonstrating the effectiveness of persistent commitment tracking for complex image generation.
arXiv arXiv cs.CV · 2 天前 · 相关度 78% 热度★★☆☆☆
304
TextLDM: Language Modeling with Continuous Latent Diffusion
TextLDM: 基于连续潜在扩散的语言建模
基础大模型学术论文

TextLDM 将视觉生成中成功的流匹配扩散 Transformer 范式迁移到文本生成,仅需极小的架构改动。其核心是使用基于 Transformer 的 VAE 将离散 token 映射为连续潜在表示,并通过与冻结预训练语言模型进行表示对齐 (REPA) 来提升表示质量。在此基础上,采用与视觉模型完全一致的 DiT 进行流匹配条件去噪。仅使用 OpenWebText2 从零训练,TextLDM 即大幅超越先前扩散语言模型,并在同等设定下达到与 GPT-2 相当的性能,向统一多模态生成与理解的扩散架构迈出实质一步。

arXiv:2605.07748v1 Announce Type: new Abstract: Diffusion Transformers (DiT) trained with flow matching in a VAE latent space have unified visual generation across images and videos. A natural next step toward a single architecture for both generation (visual synthesis) and understanding (text generation) is to apply this framework to language modeling. We propose TextLDM, which transfers the visual latent diffusion recipe to text generation with minimal architectural modification. A Transformer-based VAE maps discrete tokens to continuous latents, enhanced by Representation Alignment (REPA) with a frozen pretrained language model to produce representations effective for conditional denoising. A standard DiT then performs flow matching in this latent space, identical in architecture to its visual counterpart. The central challenge we address is obtaining high-quality continuous text representations: we find that reconstruction fidelity alone is insufficient, and that aligning latent features with a pretrained language model via REPA is critical for downstream generation quality. Trained from scratch on OpenWebText2, TextLDM substantially outperforms prior diffusion language models and matches GPT-2 under the same settings. Our results establish that the visual DiT recipe transfers effectively to language, taking a concrete step toward unified diffusion architectures for multimodal generation and understanding.
arXiv arXiv cs.CL · 2 天前 · 相关度 78% 热度★★☆☆☆
305
STARFlow2: Bridging Language Models and Normalizing Flows for Unified Multimodal Generation
STARFlow2:桥接语言模型与归一化流实现统一多模态生成
基础大模型学术论文

本文提出STARFlow2,一种基于自回归归一化流(TarFlow)与预训练视觉语言模型(VLM)相结合的统一多模态生成架构。该架构采用Pretzel设计,通过残差跳跃连接垂直交错两个流,两者共享因果掩码和KV缓存机制,消除了文本自回归与视觉扩散之间的结构不匹配。结合深浅流设计和统一FAE潜在空间,文本与图像输出可直接进入KV缓存,无需重新编码,实现了缓存友好的交错式多模态生成。实验表明,该方法在图像生成和多模态理解基准上均取得强劲性能,验证了自回归流作为统一多模态建模基础范式的可行性。

arXiv:2605.08029v1 Announce Type: new Abstract: Deep generative models have advanced rapidly across text and vision, motivating unified multimodal systems that can understand, reason over, and generate interleaved text-image sequences. Most existing approaches combine autoregressive language modeling with diffusion-based image generators, inheriting a structural mismatch between causal text generation and iterative visual denoising. We observe that autoregressive normalizing flows are autoregressive Transformers--sharing the same causal mask, KV-cache mechanism, and left-to-right structure as LLMs--making them the most natural paradigm for true unified multimodal generation. We present STARFlow2, built on the Pretzel architecture that vertically interleaves a pretrained VLM stream with a TarFlow stream via residual skip connections, both operating under the same causal mask. Combined with a deep-shallow flow design and a unified FAE latent space, STARFlow2 enables cache-friendly interleaved generation where both text and visual outputs directly enter the KV-cache without re-encoding. Experiments demonstrate strong performance across image generation and multimodal understanding benchmarks, validating autoregressive flows as a viable foundation for unified multimodal modeling.
arXiv arXiv cs.CV · 2 天前 · 相关度 78% 热度★★☆☆☆
306
SimCT: Recovering Lost Supervision for Cross-Tokenizer On-Policy Distillation
SimCT:为跨分词器在线策略蒸馏恢复丢失的监督信号
训练微调学术论文

本文针对教师与学生模型使用不同分词器时,在线策略蒸馏(OPD)中因严格逐词匹配而丢失大量教师信号的问题,提出SimCT方法。SimCT通过在共享标记之外引入可由双方分词器实现的短多标记延续,将监督空间扩展至最细粒度的联合可监督单元,从而恢复被丢弃的监督信息,且不改变OPD损失形式。在数学推理和代码生成基准上,三个异质师生对的实验表明,SimCT相比共享词汇OPD及现有跨分词器基线取得一致增益,消融实验证实提升来源于恢复精确匹配所丢失的监督信号。

arXiv:2605.07711v1 Announce Type: new Abstract: On-policy distillation (OPD) is a standard tool for transferring teacher behavior to a smaller student, but it implicitly assumes that teacher and student predictions are comparable token by token, an assumption that fails whenever the two models tokenize the same text differently. Under heterogeneous tokenizers, exact shared-token matching silently discards a large fraction of the teacher signal at precisely the positions where vocabularies disagree. We propose \textbf{\underline{Sim}ple \underline{C}ross-\underline{T}okenizer OPD (SimCT)}, which restores this signal by enlarging the supervision space: alongside shared tokens, SimCT compares teacher and student over short multi-token continuations that both tokenizers can realize, leaving the OPD loss form itself unchanged. We show that these units are the finest jointly tokenizable supervision interface, and that coarser alternatives remove teacher-student distinctions that are useful for on-policy learning. Across three heterogeneous teacher-student pairs on mathematical reasoning and code-generation benchmarks, SimCT shows consistent gains over shared-vocabulary OPD and representative cross-tokenizer baselines, with ablations confirming that the improvements come from recovering supervision discarded by exact shared-token matching. Code is available at \href{https://github.com/sunjie279/SimCT-}{https://github.com/sunjie279/SimCT-}.
arXiv arXiv cs.CL · 2 天前 · 相关度 78% 热度★★☆☆☆
307
One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy
每帧单令牌:重新思考VLA策略世界模型中的视觉带宽
基础大模型学术论文

本文提出OneWM-VLA,用于改进视觉-语言-动作(VLA)模型中的世界模块。该方法通过自适应注意力池化将每帧视觉压缩为单个语义令牌,并在统一流匹配目标下同时生成潜在状态流和动作轨迹,替代分离的解码器设计。实验表明,在冻结的2B参数π0骨干上仅微调14.71M LoRA参数,即可显著提升长程任务成功率,如MetaWorld MT50从47.9%升至61.3%,LIBERO-Long达到95.6%,实物机器人折叠布任务成功率从20.0%升至60.0%。该工作验证了极低视觉带宽下仍可保持世界模型的规划能力。

arXiv:2605.07931v1 Announce Type: new Abstract: Vision-language-action (VLA) models increasingly rely on auxiliary world modules to plan over long horizons, yet how such modules should be parameterized on top of a pretrained VLA remains an open design question. Existing world-model-augmented VLAs typically pass the per-frame visual stream into the world module at high visual bandwidth and treat its rollout as a side product of action prediction; under a constrained adaptation budget on a frozen backbone, this leaves both the per-frame representation and the latent action coupling under-examined. We introduce OneWM-VLA, which compresses each view into a single semantic token per frame through an Adaptive Attention Pooling, and produces the resulting latent stream and the action trajectory under a single flow-matching objective rather than connecting them through a separate decoder. Empirically, we find that per-frame visual bandwidth can be reduced to a single token without compromising long-horizon performance under our setup. Trained with 14.71M LoRA parameters on a $\pi_0$ (2B) backbone, OneWM-VLA improves the average success rate from 47.9% to 61.3% on MetaWorld~MT50, reaches 95.6% on LIBERO-Long (vs.85.2% for $\pi_0$), and reaches 60.0% on the long-horizon deformable task Fold Cloth on a real Piper arm (vs.20.0% for $\pi_0$).
arXiv arXiv cs.CV · 2 天前 · 相关度 78% 热度★★☆☆☆
308
What Matters for Diffusion-Friendly Latent Manifold? Prior-Aligned Autoencoders for Latent Diffusion
什么对扩散友好的潜在流形至关重要?面向潜在扩散的先验对齐自编码器
基础大模型学术论文

本文研究潜在扩散模型中 Tokenizer 的潜在空间设计,提出扩散友好潜在流形应具备空间结构一致性、局部流形连续性和全局语义性三个关键属性。基于此,作者设计了先验对齐自编码器(PAE),通过从视觉基础模型提取的先验和扰动正则化,将这些属性转化为显式训练目标。在 ImageNet 256×256 上,PAE 达到 1.03 的 SOTA gFID,且收敛速度比 RAE 快 13 倍,证明了显式组织潜在流形对生成质量的重要性。

arXiv:2605.07915v1 Announce Type: new Abstract: Tokenizers are a crucial component of latent diffusion models, as they define the latent space in which diffusion models operate. However, existing tokenizers are primarily designed to improve reconstruction fidelity or inherit pretrained representations, leaving unclear what kind of latent space is truly friendly for generative modeling. In this paper, we study this question from the perspective of latent manifold organization. By constructing controlled tokenizer variants, we identify three key properties of a diffusion-friendly latent manifold: coherent spatial structure, local manifold continuity, and global manifold semantics. We find that these properties are more consistent with downstream generation quality than reconstruction fidelity. Motivated by this finding, we propose the Prior-Aligned AutoEncoder (PAE), which explicitly shapes the latent manifold instead of leaving diffusion-friendly manifold to emerge indirectly from reconstruction or inheritance. Specifically, PAE leverages refined priors derived from VFMs and perturbation-based regularization to turn spatial structure, local continuity, and global semantics into explicit training objectives. On ImageNet 256x256, PAE improves both training efficiency and generation quality over existing tokenizers, reaching performance comparable to RAE with up to 13x faster convergence under the same training setup and achieving a new state-of-the-art gFID of 1.03. These results highlight the importance of organizing the latent manifold for latent diffusion models.
arXiv arXiv cs.CV · 2 天前 · 相关度 78% 热度★★☆☆☆
309
From 0-Order Selection to 2-Order Judgment: Combinatorial Hardening Exposes Compositional Failures in Frontier LLMs
从0阶选择到2阶判断:组合硬化揭示前沿大语言模型的组合性失败
基础大模型学术论文

本文提出了LogiHard形式化框架,通过将多项选择的0阶选择转化为2阶逻辑判断,显著增加推理步骤与思维开销,从而可靠地硬化评测难度。基于IRT和认知排序构建了LogiHard-2k数据集,对12个前沿LLM的评估显示,在组合硬化问题上准确率下降31%至56%,暴露出LLM的多选失败与早停偏差,而人类不存在该问题。零样本迁移至MMLU时准确率从89.84%骤降至42.86%,证明退化源于组合推理差距而非知识缺陷,揭示了训练导致的完整性验证缺失。

arXiv:2605.07268v1 Announce Type: new Abstract: Multiple-choice reasoning benchmarks face dual challenges: rapid saturation from advancing models and data contamination that undermines static evaluations. Ad-hoc hardening methods (paraphrasing, perturbation) attempt to increase difficulty but sacrifice logical validity for surface complexity, falling short to challenge advanced reasoning models. We present LogiHard, a formal framework that deterministically transforms 0-order selection into 2-order logical judgment, which significantly increases the thinking overhead and reasoning steps. The framework integrates Item Response Theory (IRT) for computerized adaptive testing (CAT), enabling precise difficulty control with fewer questions than static benchmarks. We instantiate LogiHard-2k, a logical reasoning dataset constructed by cognitively ranking high-stakes examination questions via 9-dimensional analysis of model thinking traces, followed by combinatorial transformation of high-difficulty items. Evaluation across twelve state-of-the-art models reveals an accuracy degradation ranging from 31% to 56% on combinatorially hardened questions. LLMs suffer from the multi-select failure and early exit bias, which are not shared by human testees. Zero-shot transfer to MMLU demonstrates 47% accuracy degradation (89.84% to 42.86%), confirming applicability across domains with provable validity preservation. The consistent aggregate degeneration is domain-agnostic and stems not from knowledge deficits but from a combinatorial reasoning gap, reflecting a training-induced completeness-verification deficit.
arXiv arXiv cs.CL · 2 天前 · 相关度 78% 热度★★☆☆☆
310
Hallucination Detection via Activations of Open-Weight Proxy Analyzers
通过开放权重代理分析器的激活检测大模型幻觉
推理部署学术论文

本文提出一种无需访问生成模型内部结构的幻觉检测方法,利用小型开源“阅读”模型对已生成文本的内部激活进行检测。构建了18种基于Transformer处理过程的特征,包括残差流范数、逐头源文档注意力、熵、MLP激活等,并训练堆叠集成模型。在RAGTruth等五个幻觉数据集上,使用Qwen2.5、Gemma-2、Pythia、LLaMA-3共七种不同规格分析器,AUC比ReDeEP最高提升10.3个百分点,且模型规模从0.5B到9B时AUC仅差2.3个百分点,3B LLaMA甚至优于8B同系列模型,表明较大模型未必更优。

arXiv:2605.07209v1 Announce Type: new Abstract: We introduce a proxy-analyzer framework for detecting hallucinations in large language models. Instead of looking inside the generating model, our system reads already-generated text through a small locally hosted open-weight model and spots hallucinations using the reader&#39;s own internal activations. This works just as well when the generator is a closed API like GPT-4 as when it is any open-weight model. We built eighteen features grounded in how transformers process text, covering residual stream norms, per-head source-document attention, entropy, MLP activations, logit-lens trajectories, and three new token-level grounding statistics. We trained a stacking ensemble on 72,135 samples from five hallucination datasets. We tested across seven analyzer architectures from 0.5 billion to 9 billion parameters: Qwen2.5 at 0.5B and 7B, Gemma-2 at 2B and 9B, Pythia at 1.4B, and LLaMA-3 at both 3B and 8B. Across all seven, we consistently beat ReDeEP&#39;s token-level AUC of 0.73 on RAGTruth by 7.4 to 10.3 percentage points. Qwen2.5-7B reached an F1 of 0.717, just above ReDeEP&#39;s 0.713, while Qwen2.5-0.5B hit 0.706. The most striking finding is how tightly all seven models cluster: AUC spans only 2.3 percentage points across an eighteen-fold difference in model size. Even more surprising, our 3B LLaMA outperforms our 8B LLaMA on RAGTruth, showing that bigger is not always better even within the same model family. Both RAGTruth and LLM-AggreFact include outputs from multiple LLM families, so our results are not skewed toward any particular generator.
arXiv arXiv cs.CL · 2 天前 · 相关度 78% 热度★★☆☆☆
311
学术论文基础大模型

本文首次提出针对遥感多模态RAG系统的检索劫持攻击方法CloudWeb。该方法仅在输入图像上叠加参数化的云和雾图案,通过检索导向的优化目标(拉向目标大气证据、抑制源场景证据、强制排名分离并约束自然度)来操纵证据检索,保持检索器、生成器和知识库不变。在7个数据集和5种CLIP风格检索器上的实验表明,攻击将Weather@5指标从0.71%大幅提升至43.29%,并能导致下游生成产生可测量的天气幻觉与语义偏移,揭示了多模态RAG中通过自然外观的大气变化破坏证据检索的实际故障模式。

arXiv:2605.07273v1 Announce Type: new Abstract: Multimodal RAG systems increasingly rely on vision-language retrievers to ground visual queries in external textual evidence. Existing adversarial studies on RAG mainly manipulate the retrieval corpus or memory, while attacks on vision-language and remote sensing models typically target end-task predictions. Input-space threats to the evidence retrieval stage of remote sensing multimodal RAG remain underexplored. To address this gap, we introduce CloudWeb, an atmospheric retrieval hijacking attack that modifies only the input image while keeping the retriever, generator, and knowledge base fixed at deployment. CloudWeb overlays parameterized cloud- and haze-like patterns on remote sensing images and optimizes them with a retrieval-oriented objective that pulls adversarial image embeddings toward target atmospheric evidence, suppresses source-scene evidence, enforces rank separation, and regularizes naturalness and coverage. To the best of our knowledge, this is the first study of retrieval-stage atmospheric evidence hijacking in remote sensing multimodal RAG. We evaluate CloudWeb on a seven-dataset remote sensing RAG benchmark with five CLIP-style retrievers, including GeoRSCLIP, RemoteCLIP, OpenAI CLIP, and OpenCLIP, together with downstream vision-language generators. Across retrievers, CloudWeb consistently outperforms clean retrieval, handcrafted atmospheric baselines, random cloud perturbations, and fixed variants in injecting weather-related evidence into top-ranked results. On GeoRSCLIP ViT-B/32, Weather@5 increases from 0.71\% to 43.29\%. Downstream generation further shows measurable weather hallucination and semantic shift, indicating that retrieval-stage hijacking can propagate to the final RAG response. These findings reveal a practical failure mode: natural-looking atmospheric changes can compromise evidence retrieval before generation begins.
arXiv arXiv cs.CV · 2 天前 · 相关度 75% 热度★★☆☆☆
312
Uncovering and Shaping the Latent Representation of 3D Scene Topology in Vision-Language Models
揭示并塑形视觉语言模型中三维场景拓扑的潜在表征
基础大模型学术论文

本文探究了现代视觉语言模型(VLM)是否内在地构建了类似人类认知地图的三维空间拓扑表征。研究发现VLM确实存在潜在拓扑图,但被颜色、形状等非几何语义严重掩盖;作者通过跨场景线性特征提取分离出纯净的空间子空间,并证明其与场景三维高斯核图的拉普拉斯特征映射相对应。基于此几何辨识,提出了一种基于狄利克雷能量的潜在正则化方法,在仅500步的有监督微调后,在真实世界空间基准上取得了最高12.1%的显著提升,优于标准SFT和多种基线。

arXiv:2605.07148v1 Announce Type: new Abstract: Decades of cognitive science establish that humans navigate environments by forming cognitive maps, defined as allocentric and topology-preserving representations of 3D space. While modern Vision-Language Models (VLMs) demonstrate emergent spatial reasoning from 2D egocentric inputs, it remains unclear whether they construct an analogous 3D internal representation. In this paper, we demonstrate that current VLMs do possess a latent topological map of 3D scenes, but it is heavily overshadowed by non-geometric visual semantics, such as color and shape. By isolating this spatial subspace through cross-scene linear feature extraction, we extract a clean spatial subspace that causally controls the model&#39;s spatial outputs. We mathematically shape this latent representation and prove its correspondence to the Laplacian eigenmaps of the scene&#39;s 3D Gaussian-kernel graph, converging to the physical 3D space in the continuous limit. Motivated by this geometric identification, we further introduce a mathematically principled latent regularization method for VLMs, based on Dirichlet energy. Applying this single-term regularizer to a minimal 500-step supervised VLM fine-tuning (SFT) on simple synthetic data yields significant improvements on real-world spatial benchmarks, outperforming standard SFT and competitive baselines by up to 12.1\% in spatial tasks involving scene topology understanding. Source code is available at https://github.com/pittisl/vlm-latent-shaping
arXiv arXiv cs.CV · 2 天前 · 相关度 75% 热度★★☆☆☆
313
Bringing Multimodal Large Language Models to Infrared-Visible Image Fusion Quality Assessment
将多模态大语言模型引入红外-可见光图像融合质量评估
学术论文基础大模型

本文提出FuScore方法,利用多模态大语言模型(MLLM)进行红外-可见光图像融合的质量评估。它模仿人类视觉感知,输出连续质量分数而非离散等级,从而实现对相似质量融合图像的细粒度区分。通过四个评估子维度的一致性构造逐图像的软标签,并引入三元目标函数(单图像分布监督、源内对Thurstone保真和跨源对Thurstone保真)来提升排序能力。实验表明,FuScore在与人类视觉偏好的相关性上达到了最先进水平。

arXiv:2605.06969v1 Announce Type: new Abstract: Infrared-Visible image fusion (IVIF) aims to integrate thermal information and detailed spatial structures into a single fused image to enhance perception. However, existing evaluation approaches tend to over-optimize both hand-crafted no-reference statistics and full-reference metrics that treat the source images as pseudo ground truths. Recent IVIF reward-modelling efforts learn from human ratings but use scalar regression on aggregated scores, neither leveraging the reasoning of Multimodal Large Language Models (MLLMs) nor encoding per-image perceptual ambiguity in their supervision, but naively introducing MLLMs with discrete one-hot supervision likewise collapses fused images of similar quality into different rating levels. To address this, we introduce FuScore, which utilizes an MLLM to mimic human visual perception by producing continuous quality score, rather than discrete level predictions, enabling fine-grained discrimination among fused images of similar quality. We exploit the agreement among four IVIF-specific sub-dimensions to construct a per-image soft label whose sharpness reflects how consensual the overall judgment is. We further introduce a tripartite objective combining per-image distributional supervision, within-source-pair Thurstone fidelity for method-level ordering, and cross-source-pair Thurstone fidelity for scene-level ordering across scenes. Extensive experiments demonstrate that FuScore achieves state-of-the-art correlation with human visual preferences.
arXiv arXiv cs.CV · 2 天前 · 相关度 75% 热度★★☆☆☆
314
HumanNet: Scaling Human-centric Video Learning to One Million Hours
HumanNet:将人类中心视频学习扩展至百万小时
学术论文训练微调

为解决具身智能领域缺乏大规模、多样化人类活动数据的问题,研究推出了HumanNet数据集,包含100万小时的第一人称和第三人称人类交互视频。该数据集不仅提供原始视频,还提供以交互为中心的标注(字幕、运动描述、手部和身体信号),旨在将互联网视频转化为适用于表征学习、活动理解和运动生成的可扩展资源。验证实验表明,在固定验证数据下,从HumanNet中抽取1000小时第一人称视频继续训练Qwen VLM模型,其效果优于使用100小时真实机器人数据,证明人类中心视频可作为可扩展且低成本的机器人数据替代品。

arXiv:2605.06747v1 Announce Type: new Abstract: Progress in embodied intelligence increasingly depends on scalable data infrastructure. While vision and language have scaled with internet corpora, learning physical interaction remains constrained by the lack of large, diverse, and richly annotated human activity data. We present HumanNet, a one-million-hour human-centric video corpus that captures how humans interact with the physical world at scale. HumanNet spans both first-person and third-person perspectives and covers fine-grained activities, human-object interactions, tool use, and long-horizon behaviors across diverse real-world environments. Beyond raw video, the dataset provides interaction-centric annotations, including captions, motion descriptions, and hand and body-related signals, enabling motion-aware and interaction-aware learning. Beyond scale, HumanNet introduces a systematic data curation paradigm for embodied learning, where human-centric filtering, temporal structuring, viewpoint diversity, and annotation enrichment are treated as first-class design principles. This design transforms unstructured internet video into a scalable substrate for representation learning, activity understanding, motion generation, and human-to-robot transfer. We conduct a first-step validation on the value of this design through controlled vision-language-action ablation: under a fixed set of validation data, continued training from the Qwen VLM model with 1000 hours of egocentric video drawn from HumanNet surpasses the continued training with 100 hours of real-robot data from Magic Cobot, indicating that egocentric human video could be a scalable and cost-effective substitute for robot data. By building this project, we aim to explore the opportunity to scale embodied foundation models using human-centric videos, rather than relying solely on robot-specific data.
arXiv arXiv cs.CV · 2 天前 · 相关度 75% 热度★★☆☆☆
315
Stochastic Transition-Map Distillation for Fast Probabilistic Inference
用于快速概率推断的随机转移图蒸馏
推理部署学术论文

本文提出随机转移图蒸馏(STMD),一种无需教师模型的扩散模型推理加速框架。STMD通过蒸馏采样SDE的完整转移图,利用条件均值流模型参数化,实现一步或少数步的随机采样,保留扩散过程的转移结构。该方法省去了预训练教师、双层优化和轨迹模拟,训练效率高、可扩展性强。论文提供了Wasserstein距离下的收敛界作为理论支撑,并在MNIST、CIFAR-10和CelebA图像生成任务上验证了有效性。

arXiv:2605.07661v1 Announce Type: cross Abstract: Diffusion models achieve strong generation quality, diversity, and distribution coverage, but their performance often comes with expensive inference. In this work, we propose Stochastic Transition-Map Distillation (STMD), a teacher-free framework for accelerating diffusion model inference while preserving probabilistic sample generation. In contrast to score-based diffusion models, whose denoising parametrization models the mean of the posterior distribution, STMD distills the full transition map associated with the sampling stochastic differential equation (SDE). We parameterize these SDE transitions with a conditional Mean Flow model, yielding a one- or few-step stochastic sampler that retains the transition structure of the underlying diffusion process. This perspective is especially useful for downstream tasks that require stochastic inference, such as diffusion posterior sampling, inverse problems, and energy-based fine-tuning. Compared to recent distillation methods, STMD requires no pretrained teacher, bi-level optimization, or trajectory simulation and caching, enabling efficient and scalable training. We derive convergence bounds for our method in the Wasserstein distance, providing a strong theoretical foundation for our approach, and validate STMD on various image generation examples on the MNIST, CIFAR-10, and CelebA datasets.
arXiv arXiv cs.CV · 2 天前 · 相关度 75% 热度★★☆☆☆
316
SR$^2$-LoRA: Self-Rectifying Inter-layer Relations in Low-Rank Adaptation for Class-Incremental Learning
SR²-LoRA:面向类增量学习的低秩适配中层间关系自校正方法
训练微调学术论文

该论文针对参数高效微调(PEFT)在类增量学习(CIL)中存在的灾难性遗忘问题,提出了一种新颖的层间关系自校正LoRA方法(SR²-LoRA)。作者通过分析发现,新任务学习过程中层间表征关系的漂移会缩小旧任务的分类裕度,导致性能下降。SR²-LoRA利用当前任务样本构建旧模型与新模型的关系矩阵,并对齐其奇异值,从理论上证明这种对齐比逐元素对齐对估计扰动更具鲁棒性。在标准CIL基准上的实验表明,该方法能有效缓解遗忘,且任务数量越多优势越明显。代码已开源。

arXiv:2605.07420v1 Announce Type: cross Abstract: Pre-trained models with parameter-efficient fine-tuning (PEFT) have demonstrated promising potential for class-incremental learning (CIL), yet catastrophic forgetting still persists when adapting models to new tasks. In this paper, we present a novel perspective on catastrophic forgetting through the analysis of inter-layer relation drift, i.e., the progressive disruption of relationships among layer-wise representations during the learning of new tasks. We theoretically show that the increase of such drift reduces the classification margins of previously learned tasks, thereby degrading overall model performance. To address this issue, we propose \underline{S}elf-\underline{R}ectifying inter-layer \underline{R}elation Low-Rank Adaptation~(SR$^2$-LoRA), a simple yet effective method that mitigates catastrophic forgetting by constraining inter-layer relation drift. Specifically, SR$^2$-LoRA constructs the relation matrices induced by the previous and current models on current-task samples, and aligns the corresponding singular values. We further theoretically show that this alignment exhibits greater robustness to estimation perturbations than direct entry-wise alignment. Extensive experiments on standard CIL benchmarks demonstrate that SR$^2$-LoRA effectively mitigates catastrophic forgetting, with its advantages becoming more pronounced as the number of tasks increases. Code is available in the \href{https://github.com/FqWan24/SR-2-LoRA}{repository}.
arXiv arXiv cs.CV · 2 天前 · 相关度 75% 热度★★☆☆☆
317
Task-Oriented Communication for Human Action Understanding via Edge-Cloud Co-Inference
面向人类动作理解的边缘-云协同任务导向通信框架
推理部署学术论文

本文提出一种边缘-云协同的任务导向通信框架TOAU,用于高效的人类动作理解。框架首先在边缘端用单目姿态估计器从视频中提取关节坐标,再通过VQ-VAE将坐标量化为离散运动令牌,仅传输极低码率(每帧9bits)的码本索引,避免隐私泄露。云端使用轻量投影仪将这些令牌对齐到大视觉语言模型的嵌入空间,并通过指令调优实现复杂动作理解。实验表明,相比传统视频编码方案,TOAU将传输负载降至约1%,系统延迟降至约20%,且保持可比的准确率。

arXiv:2605.07354v1 Announce Type: cross Abstract: The expanding application of smart sensing has created a growing demand for the accurate understanding of human action at the network edge. Traditional approaches require massive video data to be transmitted from resource-constrained edge devices to powerful cloud servers, incurring prohibitive uplink bandwidth consumption and unacceptable latency while raising privacy concerns. To overcome these bottlenecks, we propose a task-oriented communication framework for human action understanding (TOAU) through edge-cloud collaboration. Our framework utilizes a monocular pose estimator to extract continuous joint coordinates from raw videos, followed by a vector quantized variational autoencoder (VQ-VAE) to convert these coordinates into discrete motion tokens. Consequently, only a compact sequence of codebook indices is transmitted over the network, consuming as few as 9 bits per frame and avoiding privacy leakages. At the cloud server, a lightweight projector aligns these motion tokens with the embedding space of a large vision-language model (VLM) to facilitate complex action understanding, which is trained with an efficient instruction tuning paradigm. Comprehensive evaluations on three benchmarks demonstrate that our TOAU system reduces the transmission payload to approximately 1\% and the system latency to around 20\% compared to video codec-based solutions, while delivering comparable action understanding accuracy.
arXiv arXiv cs.CV · 2 天前 · 相关度 75% 热度★★☆☆☆
318
Normalizing Trajectory Models
归一化轨迹模型
推理部署学术论文

本文提出归一化轨迹模型 (NTM),将扩散模型的反向采样步骤构造为表达力强的条件归一化流,从而在少量步骤(如 4 步)下进行精确似然训练。NTM 在每个步骤内使用浅层可逆块,并在轨迹上结合深度并行预测器,支持从零训练或从预训练流匹配模型初始化,还能通过自蒸馏进一步提升质量。在文本到图像基准上,仅四步采样就达到或超过强基线,同时保留了生成轨迹的精确似然。

arXiv:2605.08078v1 Announce Type: new Abstract: Diffusion-based models decompose sampling into many small Gaussian denoising steps -- an assumption that breaks down when generation is compressed to a few coarse transitions. Existing few-step methods address this through distillation, consistency training, or adversarial objectives, but sacrifice the likelihood framework in the process. We introduce Normalizing Trajectory Models (NTM), which models each reverse step as an expressive conditional normalizing flow with exact likelihood training. Architecturally, NTM combines shallow invertible blocks within each step with a deep parallel predictor across the trajectory, forming an end-to-end network trainable from scratch or initializable from pretrained flow-matching models. Its exact trajectory likelihood further enables self-distillation: a lightweight denoiser trained on the model&#39;s own score produces high-quality samples in four steps. On text-to-image benchmarks, NTM matches or outperforms strong image generation baselines in just four sampling steps while uniquely retaining exact likelihood over the generative trajectory.
arXiv arXiv cs.CV · 2 天前 · 相关度 75% 热度★★☆☆☆
319
MatryoshkaLoRA: Learning Accurate Hierarchical Low-Rank Representations for LLM Fine-Tuning
MatryoshkaLoRA:为大模型微调学习精确的层级低秩表示
训练微调

本文提出 MatryoshkaLoRA,一种受俄罗斯套娃启发的 LoRA 训练框架,通过在低秩适配器间插入固定的对角矩阵 P 来学习层级化低秩表示,使不同子秩均能有效利用梯度信息,支持动态秩选择且精度损失极小。该方法在训练中无需显式采样不同秩即可获得优于 DyLoRA 等动态秩方法的精度-效率平衡,并引入了评估层级适配器性能的 AURAC 指标。

arXiv:2605.07850v1 Announce Type: new Abstract: With the rise in scale for deep learning models to billions of parameters, the computational cost of fine-tuning remains a significant barrier to deployment. While Low-Rank Adaptation (LoRA) has become the standard for parameter-efficient fine-tuning, the need to set a predefined, static rank $r$ requires exhaustive grid searches to balance efficiency and performance. Existing rank-adaptive solutions such as DyLoRA mitigate this by sampling ranks during the training from a predefined distribution. However, they often yield sub-optimal results at higher ranks due to lack of consistent gradient signals across the full hierarchy of ranks, thus making these methods data-inefficient. In this paper, we propose MatryoshkaLoRA, a general, Matryoshka-inspired training framework for LoRA that learns accurate hierarchical low-rank representations by inserting a fixed, carefully crafted diagonal matrix $P$ between the existing LoRA adapters to scale their sub-ranks accordingly. By introducing this simple modification, our general framework recovers LoRA and DyLoRA only by changing $P$ and ensures all sub-ranks embed the available gradient information efficiently. Our MatryoshkaLoRA supports dynamic rank selection with minimal degradation in accuracy. We further propose Area Under the Rank Accuracy Curve (AURAC), a metric that consistently evaluates the performance of hierarchical low-rank adapters. Our results demonstrate that MatryoshkaLoRA learns more accurate hierarchical low-rank representations than prior rank-adaptive approaches and achieves superior accuracy-performance trade-offs across ranks on the evaluated datasets. Our code is available at https://github.com/IST-DASLab/MatryoshkaLoRA.
arXiv arXiv cs.CL · 2 天前 · 相关度 75% 热度★★☆☆☆
320
SCENE: Recognizing Social Norms and Sanctioning in Group Chats
SCENE:识别群聊中的社会规范与社制裁
基础大模型学术论文

本文提出了SCENE基准,用于评估大语言模型智能体在多人群聊中识别和适应隐性社会规范的能力。SCENE生成包含隐藏规范的场景,设置规范违反机会并在违反时施加制裁,通过响应消极制裁和从同伴行为中学习规范两个指标衡量模型的自适应能力。作者在SCENE上评估了Claude Opus 4.7、Gemini 3.1 Pro等六个前沿模型和开源模型,结果显示Claude和Gemini在适应隐性规范方面显著优于开源模型。该基准为大模型社交能力的动态交互式评估提供了新的方向。

arXiv:2605.07823v1 Announce Type: new Abstract: Online group chats are social spaces with implicit behavior patterns that, when broken, are often met with social sanctioning from the group. The ability and willingness of LLM-based agents to recognize and adapt to these norms remains mostly unexplored. We introduce SCENE, a social-interaction benchmark focused on implicit norms and social sanctioning in multi-party chat. SCENE generates plausible non-roleplay scenarios with scripted personas that follow a hidden norm, create opportunities for the subject agent to violate it, and sanction breaches when they occur. We further propose behavioral evaluation metrics for two functional adaptation abilities: responsiveness to negative sanctioning, and adapting norm from peers behavior. We evaluate six frontier and open-weight models on SCENE. Our results show that Claude Opus 4.7 and Gemini 3.1 Pro adapt to implicit norms significantly more than the evaluated open-weight models. SCENE contributes one benchmark in the direction of recent calls for dynamic, interactional evaluation of LLM social capabilities.
arXiv arXiv cs.CL · 2 天前 · 相关度 75% 热度★★☆☆☆
321
Proxy3D: Efficient 3D Representations for Vision-Language Models via Semantic Clustering and Alignment
Proxy3D:通过语义聚类与对齐为视觉语言模型构建高效三维表示
学术论文基础大模型

本文针对视觉语言模型在三维空间推理中存在的空间一致性与序列化效率问题,提出Proxy3D方法。该方法仅以视频帧为输入,利用语义和几何编码器提取场景特征,并通过语义感知聚类生成紧凑的三维代理表示。文章还构建了SpaceSpan数据集,采用多阶段训练将三维代理表示与VLM对齐,在更短的视觉序列下,在三维视觉问答、视觉定位及通用空间智能基准上达到竞争或最先进性能。

arXiv:2605.08064v1 Announce Type: new Abstract: Spatial intelligence in vision-language models (VLMs) attracts research interest with the practical demand to reason in the 3D world.Despite promising results, most existing methods follow the conventional 2D pipeline in VLMs and use pixel-aligned representations for the vision modality. However, correspondence-based models with implicit 3D scene understanding often fail to achieve spatial consistency, and representation-based models with 3D geometric priors lack efficiency in vision sequence serialization. To address this, we propose a Proxy3D method with compact yet comprehensive 3D proxy representations for the vision modality. Given only video frames as input, we employ semantic and geometric encoders to extract scene features and then perform their semantic-aware clustering to obtain a set of proxies in the 3D space. For representation alignment, we further curate the SpaceSpan dataset and apply multi-stage training to adopt the proposed 3D proxy representations with the VLM. When using shorter sequences for vision information, our method achieves competitive or state-of-the-art performance in 3D visual question answering, visual grounding and general spatial intelligence benchmarks.
arXiv arXiv cs.CV · 2 天前 · 相关度 75% 热度★★☆☆☆
322
Beyond Confidence: Rethinking Self-Assessments for Performance Prediction in LLMs
超越置信度:重新思考大语言模型性能预测的自我评估
基础大模型学术论文

本文针对大语言模型(LLM)自我评估可靠性问题,借鉴人类心理学中的认知评估理论,提出六种评估维度(如努力、能力等)与传统的言语化置信度进行对比。通过在 12 个 LLM 和 38 个任务上的实验,发现与能力相关的维度(尤其是努力)在预测模型失败方面与置信度相当甚至更优,且高估更少、随模型大小更稳定;不同任务下最佳预测维度不同,如推理型任务努力最有效,检索型任务能力与置信度更具优势。研究为提升 LLM 部署的可靠性与安全性提供了一种结构化多维自我评估新思路。

arXiv:2605.07806v1 Announce Type: new Abstract: Large Language Models (LLMs) are increasingly used in settings where reliable self-assessment is critical. Assessing model reliability has evolved from using probabilistic correctness estimates to, more recently, eliciting verbalized confidence. Confidence, however, has been shown to be an inconsistent and overoptimistic predictor of model correctness. Drawing on cognitive appraisal theory, a framework from human psychology that decomposes self-evaluation into multiple components, we propose a multidimensional perspective on model self-assessment. We elicit six appraisal-based dimensions of self-assessment, alongside confidence, and evaluate their utility for predicting model failure across 12 LLMs and 38 tasks spanning eight domains. We find that competence-related appraisal dimensions, particularly effort and ability, consistently match or outperform confidence across most settings. Effort additionally yields less overoptimistic estimates that remain stable across model sizes. In contrast, affective dimensions provide marginally predictive signals. Furthermore, the most informative dimension varies systematically with task characteristics: effort is most predictive for reasoning-intensive tasks, while ability and confidence dominate on retrieval-oriented tasks. Broadly, our findings indicate that structured multidimensional self-assessment is a promising approach to improving the reliability and safety of language model deployment across diverse real-world settings.
arXiv arXiv cs.CL · 2 天前 · 相关度 75% 热度★★☆☆☆
323
Chain-based Distillation for Effective Initialization of Variable-Sized Small Language Models
基于链式蒸馏的可变大小小语言模型高效初始化
训练微调学术论文

本文提出链式蒸馏(CBD)方法,通过逐步蒸馏构建稀疏的中间锚点模型序列,形成知识传递链。对于任意目标大小的学生模型,可通过相邻锚点的参数插值直接初始化,无需重复访问大型教师模型,显著提升可扩展性。还引入桥接蒸馏以支持跨架构和跨词汇表的迁移。实验表明,仅138M参数的小模型在10B tokens任务语料上无需恢复预训练即可优于从头训练的模型,展示了方法的高效性和通用性。

arXiv:2605.07783v1 Announce Type: new Abstract: Large language models (LLMs) achieve strong performance but remain costly to deploy in resource-constrained settings. Training small language models (SLMs) from scratch is computationally expensive, while conventional knowledge distillation requires repeated access to large teachers for different target sizes, leading to poor scalability. To solve these problems, we propose \textbf{Chain-based Distillation (CBD)}, a scalable paradigm for efficiently initializing variable-sized language models. A sparse and limited sequence of intermediate models (called anchors) is constructed via stepwise distillation, forming a distillation chain that progressively transfers knowledge from the source LLMs. To support heterogeneous settings, we introduce \emph{bridge distillation} for cross-architecture and cross-vocabulary transfer. Models of variable sizes are initialized via parameter interpolation between adjacent anchors, eliminating repeated large teacher inference. Experiments show that the proposed method substantially improves efficiency and downstream performance. A 138M-parameter SLM without recovery pre-training, outperforms scratch-trained models on a 10B-token corpus on the specific task. CBD also demonstrates versatility in heterogeneous settings for initialize models with different architectures and vocabularies.
arXiv arXiv cs.CL · 2 天前 · 相关度 75% 热度★★☆☆☆
324
Semantic-Aware Adaptive Visual Memory for Streaming Video Understanding
基于语义感知的自适应视觉记忆框架用于流式视频理解
推理部署基础大模型

SAVEMem 是一种无需训练的双阶段框架,旨在解决多模态大模型在流式视频理解中的记忆管理难题。第一阶段通过伪问题库引入轻量语义先验,生成三层流式记忆,使长期保留策略由语义显著性而非视觉相似性驱动;第二阶段针对不同查询动态调整检索范围,利用锚点条件近邻门控和查询-记忆的晚交互选择相关帧。该方法应用于 Qwen2.5-VL 后,在多个基准上取得显著性能提升,并将 128 帧时的峰值 GPU 内存降低 48%。

arXiv:2605.07897v1 Announce Type: new Abstract: Online streaming video understanding requires models to process continuous visual inputs and respond to user queries in real time, where the unbounded stream and unpredictable query timing turn memory management into a central challenge. Existing methods typically compress visual tokens via visual similarity heuristics, or augment compression with KV-cache-level retrieval. However, compression decisions rarely incorporate semantic signals, and retrieval is often added after compression is finalized, making the two stages hard to coordinate. We present SAVEMem, a training-free dual-stage framework that brings semantic awareness into memory generation and lets the retrieval scope adapt per query. In Stage~1, SAVEMem builds a three-tier streaming memory online under a constant memory budget. A fixed pseudo-question bank provides a lightweight semantic prior, so that long-term retention is shaped by semantic salience rather than visual similarity alone. In Stage~2, SAVEMem performs query-aware retrieval over this memory. An anchor-conditioned recency gate adapts the retrieval scope from short-term to mid- and long-term memory based on whether the query targets the present or the distant past. Within this scope, late interaction between query and memory tokens selects candidate frames for answering. Applied to Qwen2.5-VL without training, SAVEMem improves the OVO-Bench overall score from 52.27 to 62.69 and yields consistent gains on StreamingBench and ODV-Bench, while reducing peak GPU memory by 48\% at 128 frames over the backbone.
arXiv arXiv cs.CV · 2 天前 · 相关度 75% 热度★★☆☆☆
325
Video Understanding Reward Modeling: A Robust Benchmark and Performant Reward Models
视频理解奖励建模:一个稳健的基准与高性能奖励模型
基础大模型学术论文

本文针对视频理解奖励建模缺乏评估基准和偏好数据的问题,提出统一框架,包括视频理解奖励基准VURB(2100个偏好对,附带长链推理步骤,平均1143个token,采用多数表决评估)和自动构建的大规模偏好数据集VUP-35K。基于该数据集,训练了判别式奖励模型VideoDRM和生成式奖励模型VideoGRM,在VURB和VideoRewardBench上均达到最优性能,并验证了数据提升奖励表现与模型推理能力,在best-of-N测试时扩展中带来显著增益。

arXiv:2605.07872v1 Announce Type: new Abstract: Multimodal reward models have advanced substantially in text and image domains, yet progress in video understanding reward modeling remains severely limited by the lack of robust evaluation benchmarks and high-quality preference data. To address this, we propose a unified framework spanning benchmark design, data construction, and reward model training. We introduce Video Understanding Reward Bench (VURB), a benchmark featuring 2,100 preference pairs with long chain-of-thought reasoning traces (averaging 1,143 tokens) and majority voting evaluation across general, long, and reasoning-oriented video tasks. We further construct Video Understanding Preference Dataset (VUP-35K) via a fully automated pipeline, providing large-scale high-quality supervision for video reward training. Building on the data, we train VideoDRM and VideoGRM, a discriminative and a generative reward model, both achieving state-of-the-art performance on VURB and VideoRewardBench. Further analysis confirms that VUP-35K enhances both reward performance and model reasoning capability, while VideoDRM and VideoGRM yield significant gains under best-of-$N$ test-time scaling.
arXiv arXiv cs.CV · 2 天前 · 相关度 75% 热度★★☆☆☆
326
Gradient-Based LoRA Rank Allocation Under GRPO: An Empirical Study
基于梯度的LoRA秩分配在GRPO下的实证研究
训练微调学术论文

本文研究了将监督微调(SFT)中有效的自适应LoRA秩分配方法迁移到强化学习对齐训练(GRPO)时的表现。实验以Qwen 2.5 1.5B在GSM8K上进行,发现按梯度比例分配秩反而使准确率比均匀分配降低4.5个百分点(70.0%对74.5%)。作者识别出两个关键机制:GRPO下的梯度景观远较SFT平坦,层间重要性比值仅2.17倍(SFT文献中常>10倍);秩的非均匀分配会引发梯度放大效应,扩大重要性差距形成正反馈,使低秩层逐渐失声。结论表明SFT时代的梯度重要性不能直接预测RL下的容量需求,应避免简单迁移。

arXiv:2605.07366v1 Announce Type: new Abstract: Adaptive rank allocation for LoRA, allocating more parameters to important layers and fewer to unimportant ones, consistently improves efficiency under supervised fine-tuning (SFT). We investigate whether this success transfers to reinforcement learning, specifically Group Relative Policy Optimization (GRPO). Using gradient-magnitude profiling on Qwen 2.5 1.5B with GSM8K, we find that it does not: proportional rank allocation degrades accuracy by 4.5 points compared to uniform allocation (70.0% vs. 74.5%), despite using identical parameter budgets. We identify two mechanisms behind this failure. First, the gradient landscape under GRPO is fundamentally flatter than under SFT, the max-to-min layer importance ratio is only 2.17x, compared to &gt;10x reported in SFT literature. All layers carry meaningful gradient signal; none are truly idle. Second, we discover a gradient amplification effect: non-uniform allocation widens the importance spread from 2.17x to 3.00x, creating a positive feedback loop where high-rank layers absorb more gradient while low-rank layers are progressively silenced. Our results suggest that gradient importance does not predict capacity requirements under RL, and that naive transfer of SFT-era rank allocation to alignment training should be avoided.
arXiv arXiv cs.CL · 2 天前 · 相关度 75% 热度★★☆☆☆
327
The Text Uncanny Valley: Non-Monotonic Performance Degradation in LLM Information Retrieval
文本恐怖谷:大语言模型信息检索中的非单调性能退化
基础大模型推理部署学术论文

本论文研究单词边界破坏对LLM目标信息检测能力的影响,发现当在词内插入空白字符时,检测准确率随插入率增加呈U形曲线,作者称之为“文本恐怖谷”。提出“模式转换假说”解释这一现象:LLM在近正常文本中工作在词级模式,在严重碎片化文本中工作在字符级模式,谷底是两种模式都无效的混乱过渡区。四项实验和一项分析支持该假说:上下文学习无法挽救谷底性能;对扰动进行正则化可显著减弱U形;数学推理任务只在Gemini 3.0 Flash上复现U形,更强模型未出现,表明当任务较少依赖精确词汇对齐时效应减弱;分词熵在F1最小值前达到峰值,符合模式冲突解释。该发现揭示了在干净文本基准下不可见、但与噪声文本输入下的LLM真实部署直接相关的脆弱性。

arXiv:2605.07186v1 Announce Type: new Abstract: Existing Large Language Model (LLM) benchmarks primarily focus on syntactically correct inputs, leaving a significant gap in evaluation on imperfect text. In this work, we study how word-boundary corruption affects how LLMs detect targeted information. By inserting whitespace characters within words to break them into fragments, LLMs&#39; detection accuracy follows a U-shaped curve with the increase in insertion rate. We refer to this curve as the Text Uncanny Valley. To explain such observation, we propose a mode transition hypothesis: LLMs operate in a word-level mode for near-normal text and a character-level mode for heavily fragmented text, with the valley marking the disordered transition where neither mode is effective. Four experiments and one analysis are consistent with this account: in-context learning fails to rescue valley-bottom performance; regularizing the perturbation substantially reduces the U-shape; a math reasoning task replicates the U-shape for Gemini 3.0 Flash but not for stronger models, suggesting the effect is attenuated when tasks rely less on exact lexical alignment; and tokenization entropy peaks before the F1 minimum, consistent with a regime-conflict interpretation. These findings reveal a failure mode invisible to clean-text benchmarks yet directly relevant to any deployment scenario involving noisy or uncurated text inputs.
arXiv arXiv cs.CL · 2 天前 · 相关度 75% 热度★★☆☆☆
328
Understanding Performance Collapse in Layer-Pruned Large Language Models via Decision Representation Transitions
通过决策表征转换理解层剪枝后大语言模型的性能崩溃
推理部署学术论文

本文针对层剪枝常引发大语言模型突然性能崩溃的现象,提出从决策表征角度进行分析。作者在选择题任务上引入“决策边际”和“选项频率”两个指标,并设计了迭代剪枝方法,发现网络存在明显的决策转换,分为沉默阶段(无法预测正确答案)和决策阶段(正确预测出现)。实验表明,裁剪决策阶段影响很小,而裁剪沉默阶段会立即导致崩溃,因为破坏了关键的决策转换过程。

arXiv:2605.07271v1 Announce Type: new Abstract: Layer pruning efficiently reduces Large Language Model (LLM) computational costs but often triggers sudden performance collapse. Existing representation-based analyses struggle to explain this mechanism. We propose studying pruning through decision representation. Focusing on multiple-choice tasks, we introduce two metrics, Decision Margin and Option Frequency, and an Iterative Pruning method to analyze layer-wise decision dynamics. Our findings reveal a sharp decision transition that partitions the network into two stages: a Silent Phase, where the model cannot yet predict the correct answer, and a Decisive Phase, where the correct prediction emerges. We also find that pruning the Decisive Phase has minimal impact, whereas pruning the Silent Phase triggers immediate performance collapse, highlighting its extreme sensitivity to structural changes. Therefore, we conclude that pruning-induced collapse stems from disrupting the Silent Phase, which prevents the critical decision transition from occurring.
arXiv arXiv cs.CL · 2 天前 · 相关度 73% 热度★★☆☆☆
329
Do Joint Audio-Video Generation Models Understand Physics?
联合音视频生成模型是否理解物理?AV-Phys Bench 评测基准
基础大模型学术论文

本文提出了 AV-Phys Bench,一个用于评估联合音视频生成模型物理常识的基准,涵盖稳态、事件转换和环境转换三类场景,引入反物理提示测试模型一致性。评测维度包括视觉/音频语义遵循和跨模态物理常识,发现 Seedance 2.0 表现最优,但所有模型在事件驱动过渡和反物理提示下性能骤降。作者还设计了基于多模态大模型和声学测量工具的 AV-Phys Agent 评估器,其排名与人类评分高度一致,揭示了跨模态物理一致性和过渡动态是当前联合生成的核心挑战。

arXiv:2605.07061v1 Announce Type: cross Abstract: Joint audio-video generation models are rapidly approaching professional production quality, raising a central question: do they understand audio-visual physics, or merely generate plausible sounds and frames that violate real-world consistency? We introduce AV-Phys Bench, a benchmark for evaluating physical commonsense in joint audio-video generation. AV-Phys Bench tests models across three scene categories: Steady State, Event Transition, and Environment Transition. It covers physics-grounded subcategories drawn from real-world scenes, plus Anti-AV-Physics prompts that deliberately request physically inconsistent audio-video behavior. Each generation is evaluated along five dimensions: visual semantic adherence, audio semantic adherence, visual physical commonsense, audio physical commonsense, and cross-modal physical commonsense. Across three proprietary and four open-source models, we find that Seedance 2.0 performs best overall, but all models remain far from robust physical understanding. Performance drops sharply on event-driven and environment-driven transitions, and even strong proprietary systems collapse on Anti-AV-Physics prompts. We further introduce AV-Phys Agent, a ReAct-style evaluator that combines a multimodal language model with deterministic acoustic measurement tools, producing rankings that closely align with human ratings. Our results identify cross-modal physical consistency and transition-driven scene dynamics as key open challenges for joint audio-video generation.
arXiv arXiv cs.CV · 2 天前 · 相关度 72% 热度★★☆☆☆
330
How Value Induction Reshapes LLM Behaviour
价值诱导如何重塑大语言模型行为
训练微调学术论文

该论文研究在后训练阶段通过微调向大语言模型注入特定行为价值(如好奇心、同理心、乐于助人等)的意外效应。作者使用从现有偏好数据集中筛选出的价值子集对模型进行微调,系统测量了价值诱导对其他价值表达、模型安全性、拟人化语言使用及多项问答基准性能的影响。实验发现:价值诱导会连带引发相关甚至相反价值的表达;注入正面价值可提升安全性;但所有价值诱导都会增加模型的拟人化语言,使其表现得更具验证性和奉承倾向,可能对用户产生潜在负面影响。

arXiv:2605.07925v1 Announce Type: new Abstract: Conversational Large Language Models are post-trained on language that expresses specific behavioural traits, such as curiosity, open-mindedness, and empathy, and values, such as helpfulness, harmlessness, and honesty. This is done to increase utility, ensure safety, and improve the experience of the people interacting with the model. However, values are complex and inter-related -- inducing one could modify behaviour on another. Further, inducing certain values can make models more addictive or sycophantic through language used in the generations, with a potential detrimental effect on the user. We investigate these and other unintended effects of value induction into models. We fine-tune models using curated value subsets of existing preference datasets, measuring the impact of value induction on expression of other values, model safety, anthropomorphic language, and various QA benchmarks. We find that (i) inducing values leads to expression of other related, and sometimes contrastive values, (ii) inducing positive values increases safety, and (iii) all values increase anthropomorphic language use, making models more validating and sycophantic.
arXiv arXiv cs.CL · 2 天前 · 相关度 72% 热度★★☆☆☆
331
MAVEN: Multi-Agent Verification-Elaboration Network with In-Step Epistemic Auditing
MAVEN:带步骤内认知审计的多智能体验证-精化网络
推理部署基础大模型

MAVEN提出一种基于黑板架构的多智能体协作框架,通过将推理过程解耦为怀疑者-研究者-裁判的对抗循环,在每一步生成过程中进行显式验证与精化,取代传统的单体推理链。在OpenBookQA、TruthfulQA等四个基准上的实验表明,该方法在多项细粒度指标上优于GEMINI-3.1-Pro及基于共识的基线方法,能够生成结构化、可审计的推理轨迹,并且对不同的骨干LLM均具有通用提升效果,是一种模型无关的推理质量增强方案。

arXiv:2605.07646v1 Announce Type: new Abstract: While explicit reasoning trajectories enhance model interpretability, existing paradigms often rely on monolithic chains that lack intermediate verification, allowing early errors to cascade unchecked. This lack of modularity impedes granular auditing and compromises the epistemic trust required for high-stakes applications. We propose MAVEN (Multi-Agent Verification-Elaboration Network with In-Step Epistemic Auditing), a blackboard-inspired framework designed to transform LLMs into deliberate reasoners through explicit role-decoupling. At its core, MAVEN operationalizes an adversarial Skeptic-Researcher-Judge loop, simulating expert deliberation by functionally separating logical defense from factual grounding. Experiments on OpenBookQA, TruthfulQA, HALUEVAL and StrategyQA benchmarks demonstrate that MAVEN delivers superior reasoning quality across four fine-grained metrics. Notably, MAVEN consistently outperforms latent reasoning models such as GEMINI-3.1-Pro and consensus-based baselines (e.g., ReConcile) by generating explicitly structured, modular, and verifiable deliberation trajectories, rather than relying on implicit internal states or post-hoc consensus. Moreover, comprehensive evaluations confirm that MAVEN is fully model-agnostic, serving as a strong and transferable reasoning booster that yields substantial performance improvements across diverse backbone models.
arXiv arXiv cs.CL · 2 天前 · 相关度 72% 热度★★☆☆☆
332
The Moltbook Files: A Harmless Slopocalypse or Humanity's Last Experiment
Moltbook Files:一场无害的“垃圾末世”还是人类最后的实验?
训练微调学术论文

论文发布了一个名为Moltbook Files的数据集,包含23.2万帖子和220万评论,源于OpenClaw智能体在类Reddit平台上的大规模交互。通过分析社区结构、情感和语义特性,发现整体情感中性偏积极。研究重点在于用该数据集微调Qwen2.5-14B-Instruct模型,揭示微调后模型真实性从0.366降至0.187,但使用Reddit数据微调也会导致类似下滑。该工作探讨了AI生成内容对下一代模型训练造成的潜在污染和尾部风险,强调在涌现对齐评估中引入控制基线的重要性。

arXiv:2605.07462v1 Announce Type: new Abstract: Moltbook is a Reddit-like platform where OpenClaw agents post, comment, and vote at scale - a so far unprecedented incident that comes with serious safety concerns. With the aim of studying emergent behavior in populations, we release the Moltbook Files, a dataset of 232k posts and 2.2M comments covering the platform&#39;s first 12 days, processed through a pipeline to identify and remove Personally-Identifiable Information (PII). We analyze community structure, authorship, lexical properties, sentiment, topics, semantic geometry, and comment interaction. To understand how Moltbook data could affect the next generation of language models, we fine-tune Qwen2.5-14B-Instruct on Moltbook Files with three adaptation levels. Our PII pipeline reveals that agents post API keys, passwords, BIP39 seed phrases on Moltbook, a publicly indexed platform. The overall sentiment is mostly neutral and mildly positive (66.6% neutral, 19.5% positive) and shows a tendency for self-referential linking. We find that fine-tuning on Moltbook data reduces truthfulness from 0.366 to 0.187. However, a model fine-tuned on a size-matched Reddit dataset produces a comparable decrease. Moltbook thus seems to be more of a harmless slopocalypse. However, tail risks remain, including agent affordances, contamination of future crawls through self-links, and potential transfer of traits to the next generation of language models. More broadly, our findings highlight the importance of control baselines in emergent misalignment evaluations.
arXiv arXiv cs.CL · 2 天前 · 相关度 72% 热度★★☆☆☆
333
GRaSp: Automatic Example Optimization for In-Context Learning in Low-Data Tasks
GRaSp:低数据任务中上下文学习的自动示例优化
学术论文基础大模型

本文提出了一种面向低数据场景的上下文示例自动优化框架 GRaSp。该框架首先生成大规模合成候选池,然后通过聚类与降维进行结构化组织,最后利用遗传算法搜索最优上下文示例组合。此外引入了多样性自适应变异机制,能在种群收敛时从聚类间探索切换为聚类内精细调优。在金融命名实体识别数据集 FiNER-139 上,使用合成数据的 GRaSp 取得了 45.84% 的 micro-F1,优于零样本和随机少样本基线,但合成数据未超越人工标注数据的泛化能力,表明候选池的分布多样性对泛化至关重要。

arXiv:2605.07454v1 Announce Type: new Abstract: In-context learning enables large language models to adapt to new tasks, but their performance is highly sensitive to the selected examples. Finding effective demonstrations is particularly difficult in domain-specific, low-data settings where high-quality examples are scarce. We propose GRaSp, a three-stage framework for automatic in-context example optimization. By first generating a large synthetic candidate pool, then structuring it with clustering and dimensionality reduction, and finally using genetic algorithms to find the optimal in-context examples, the framework shows consistent improvements on the NER task. We also introduce a custom diversity-adaptive mutation mechanism, allowing it to transition from the initial broad inter-cluster exploration to focused intra-cluster refinement as the population converges. We evaluate GRaSp on financial named entity recognition (FiNER-139), comparing synthetic and human-annotated candidate pools across pool sizes of 500 and 5000. With non-synthetic data, GRaSp achieves 45.84% micro-F1, consistently outperforming both zero-shot and random few-shot baselines. Synthetic data matches the random baseline but does not exceed it, suggesting that distributional variety in the candidate pool is critical for generalization.
arXiv arXiv cs.CL · 2 天前 · 相关度 72% 热度★★☆☆☆
334
Activation Differences Reveal Backdoors: A Comparison of SAE Architectures
激活差异揭示后门:稀疏自编码器架构的比较研究
基础大模型学术论文

本文针对大语言模型的后门攻击安全问题,比较了两种稀疏自编码器架构(Crosscoder与Diff-SAE)在隔离后门特征上的表现。实验在SmolLM2-360M模型上,使用基于年份触发的SQL注入后门进行测试,发现Diff-SAE在大多数实验条件下均取得0.40的后门隔离得分(BIS),且精确率达到1.0、无假阳性,远优于几乎完全失败的Crosscoder。结果表明后门更表现为方向性激活偏移而非稀疏特征激活,基于差异的表征方法对后门检测更有效,为大模型安全监控与可解释性工具开发提供了重要启示。

arXiv:2605.07324v1 Announce Type: new Abstract: Backdoor attacks on language models pose a significant threat to AI safety, where models behave normally on most inputs but exhibit harmful behavior when triggered by specific patterns. Detecting such backdoors through mechanistic interpretability remains an open challenge. We investigate two sparse autoencoder architectures -- Crosscoders and Differential SAEs (Diff-SAE) -- for isolating backdoor-related features in fine-tuned models. Using a controlled SQL injection backdoor triggered by year-based context (&#34;2024&#34; triggers vulnerable code, &#34;2023&#34; triggers safe code), we evaluate both approaches across LoRA and full-rank fine-tuning regimes on SmolLM2-360M. We find that Diff-SAE consistently and substantially outperforms Crosscoders for backdoor isolation. Diff-SAE achieves a Backdoor Isolation Score (BIS) of 0.40 with perfect precision (1.0) and zero false positive rate across most experimental conditions, while Crosscoders fail almost entirely with BIS below 0.02 in most cases. This performance gap holds across multiple transformer layers (14, 18, 22, 26) and both fine-tuning regimes, with full-rank fine-tuning producing particularly clean backdoor signals. Our results suggest that backdoors manifest as directional activation shifts rather than sparse feature activations, making difference-based representations fundamentally more effective for detection. These findings have important implications for AI safety monitoring and the development of interpretability tools for detecting model manipulation.
arXiv arXiv cs.CL · 2 天前 · 相关度 72% 热度★★☆☆☆
335
Hard to Read, Easy to Jailbreak: How Visual Degradation Bypasses MLLM Safety Alignment
难以阅读,易于越狱:视觉退化如何绕过MLLM安全对齐
基础大模型学术论文

该研究揭示了多模态大模型(MLLM)在采用视觉上下文压缩时将文本渲染为图像所带来的安全漏洞:降低图像分辨率会显著削弱模型的安全防御能力,即使文本依然可读。作者将此归因于“认知超载”,即解码退化输入所需的注意力资源挤占了安全审计,导致越狱。这一现象在噪声、几何失真等多种视觉扰动下普遍存在。论文提出一种简单的“结构化认知卸载”策略,通过将视觉转录与安全评估解耦的串行流程来缓解风险,为未来MLLM的安全设计提供关键指导。

arXiv:2605.07250v1 Announce Type: new Abstract: Recent advancements in visual context compression enable MLLMs to process ultra-long contexts efficiently by rendering text into images. However, we identify a critical vulnerability inherent to this paradigm: lowering image resolution inadvertently catalyzes jailbreaking. Our experiments reveal that the safety defenses of SOTA models deteriorate sharply as resolution degrades, surprisingly persisting even when text remains legible. We attribute this to ``Cognitive Overload&#39;&#39;, hypothesizing that the effort required to decipher degraded inputs diverts attentional resources from safety auditing. This phenomenon is consistent across various visual perturbations, including noise and geometric distortion. To address this, we propose a simple ``Structured Cognitive Offloading&#39;&#39; strategy that mitigates these risks by enforcing a serialized pipeline to decouple visual transcription from safety assessment. Our work exposes a significant risk in vision-based compression and provides critical insights for the secure design of future MLLMs.
arXiv arXiv cs.CV · 2 天前 · 相关度 70% 热度★★☆☆☆
336
MIPIAD: Multilingual Indirect Prompt Injection Attack Defense with Qwen -- TF-IDF Hybrid and Meta-Ensemble Learning
MIPIAD:基于Qwen-TF-IDF混合与元集成学习的多语言间接提示注入攻击防御
训练微调学术论文

该论文提出MIPIAD防御框架,针对检索增强和工具使用型LLM中的多语言间接提示注入攻击。框架包含基于Qwen2.5-1.5B微调的序列分类器(XLPID)、TF-IDF词法特征以及验证调优的集成学习(晚期融合、堆叠、梯度提升)。在覆盖英文和孟加拉语的5类任务合成基准上,混合XLPID+TF-IDF集成方案取得最佳F1分数0.9205,Boosting集成实现最佳AUROC 0.9378,且集成方法有效减小了跨语言性能差距。

arXiv:2605.07269v1 Announce Type: new Abstract: Indirect prompt injection remains a persistent weakness in retrieval-augmented and tool-using LLM systems, and the problem becomes harder to characterise in multilingual settings. We present MIPIAD, a defense framework evaluated on English and Bangla that combines a sequence classifier fine-tuned from Qwen2.5-1.5B via LoRA (XLPID), TF-IDF lexical features, and validation-tuned ensembling through late fusion, stacking, and gradient boosting. The framework is evaluated on a synthetic benchmark built from BIPIA(Yi et al., 2023) templates spanning five task families -- email, table, QA, abstract, and code-comprising over 1.43 million generated samples, with train and test splits using mutually exclusive attack categories. Across the experiments, lexical signals prove strong (TF-IDF+SVM F1=0.77), and the hybrid XLPID+TF-IDF ensemble achieves the best overall F1 (0.9205) while the Boosting Ensemble achieves the best AUROC (0.9378). Ensemble methods consistently reduce the English-Bangla cross-lingual gap relative to standalone neural models. The pipeline is designed for extensibility: NLLB-200 supports over 200 languages and XLPID&#39;s multilingual backbone can be retargeted to additional languages without architectural changes; empirical validation is currently limited to English and Bangla
arXiv arXiv cs.CL · 2 天前 · 相关度 70% 热度★★☆☆☆
337
Homelab setup
家庭实验室搭建
推理部署AI芯片硬件

一位用户计划以约7-8千美元预算升级本地大模型运行环境,提出两种方案:一是添置一台M5 Max 128GB MacBook Pro,与已有的M3 Max通过Exo组成集群,聚合256GB统一内存;二是搭建一台搭载双RTX 5090显卡的机器。讨论聚焦于Apple Silicon的大容量统一内存与NVIDIA GPU的高算力、显存带宽在本地推理大语言模型时的权衡,内容涉及内存容量、性价比、扩展性和软件生态等实际部署考量。

暂无原文内容
Reddit r/LocalLLaMA · 3 天前 · 相关度 82% 热度★★☆☆☆
338
How long should we expect until we get a gguf for ZAYA1-8B
我们还需要多久才能拿到 ZAYA1-8B 的 GGUF 版本
推理部署基础大模型

该帖子围绕开源大语言模型 ZAYA1-8B 的 GGUF 格式量化版本转化进度展开讨论。GGUF 是 llama.cpp 等本地推理框架使用的统一量化格式,直接影响模型在消费级硬件上的部署效率。社区用户关心该模型在 Hugging Face 等平台的量化适配速度,并评估其实际使用价值与热度。这类讨论反映了本地化推理生态中模型分发与格式支持的关键环节。

暂无原文内容
Reddit r/LocalLLaMA · 3 天前 · 相关度 80% 热度★★☆☆☆
339
What are the best 40-500 B MoE LLM models now?
当前最佳的40B-500B参数MoE大语言模型有哪些?
基础大模型推理部署

Reddit用户因旧GPU限制需依赖CPU推理,在LocalLLaMA社区发起讨论,询问当前参数范围在40B至500B之间的最佳混合专家(MoE)大语言模型。提问者看重MoE模型有效参数少、运行高效的特点,并已了解Qwen 3.6和Gemma-4等较小模型,希望获取近期发布的更大规模MoE模型的最新信息。讨论涉及本地部署中的模型选型与推理效率权衡。

暂无原文内容
Reddit r/LocalLLaMA · 3 天前 · 相关度 72% 热度★★☆☆☆
340
Qwen3.5 2B BF16 vs 4B Q8_K_XL vs 9B Q4_K_XL
Qwen3.5 2B BF16 vs 4B Q8_K_XL vs 9B Q4_K_XL 量化模型选择讨论
推理部署

Reddit用户讨论在通用场景及短代码生成任务中,不同参数量和量化等级的Qwen3.5模型的实际表现差异。对比方案包括2B参数的BF16全精度版本、4B参数的Q8_K_XL 8位量化版本、以及9B参数的Q4_K_XL 4位量化版本。该问题涉及本地部署中的模型量化策略选择,需权衡参数量、量化精度对推理质量与资源占用的影响,是典型的推理部署优化话题。

暂无原文内容
Reddit r/LocalLLaMA · 3 天前 · 相关度 82% 热度★★☆☆☆
341
NVIDIA AI Releases Star Elastic: One Checkpoint that Contains 30B, 23B, and 12B Reasoning Models with Zero-Shot Slicing
NVIDIA AI发布Star Elastic:单一检查点包含30B、23B和12B推理模型,支持零样本切分
基础大模型推理部署

NVIDIA推出Star Elastic方法,可在单个模型检查点中嵌套30B、23B、12B等多个子模型,并通过零样本切分直接提取。其核心使用Gumbel-Softmax训练可学习路由器,动态选择注意力头、Mamba状态空间头、MoE专家、FFN通道等弹性维度,实现不同参数预算下的最优配置。推理时可灵活切换模型规模,共享KV缓存,并支持分层推理:先用小模型快速生成推理过程,再用完整模型评估或生成最终答案,兼顾效率与质量。该技术为解决模型部署中的弹性伸缩和推理效率问题提供了新思路。

暂无原文内容
Reddit r/LocalLLaMA · 3 天前 · 相关度 92% 热度★★☆☆☆
342
Is Google’s market share on LLMs bulls**t?
谷歌在大语言模型市场的份额增长是虚假繁荣吗?
行业资讯基础大模型

Reddit用户发帖质疑谷歌在大语言模型市场的份额数据,认为Gemini模型的实际能力远逊于GPT和Claude。讨论指出,谷歌的市场份额可能源于企业因谷歌生态绑定而产生的被动使用,以及用户在谷歌产品生态中的便利性选择,而非模型本身的技术优势。帖子反映了业界对当前LLM市场竞争格局与真实用户偏好的关注。

暂无原文内容
Reddit r/artificial · 3 天前 · 相关度 72% 热度★★☆☆☆
343
开发工具推理部署

作者回顾了一年前发布的MCP服务器开源项目,当时本地大模型工具调用还不够稳定。如今,在Mac Mini上使用Gemma 4或Qwen 3.6等本地模型已能全速驱动该服务,实现原生工具调用且完全免费。该项目吸引了大量合并请求和问题讨论,反映出本地LLM工具调用技术已快速成熟。

暂无原文内容
Reddit r/LocalLLaMA · 3 天前 · 相关度 75% 热度★★☆☆☆
344
Higher quants are so much better
更高的量化精度表现好得多
推理部署性能优化

在本地部署的行业政策推理基准测试中,bf16全精度模型表现优异,而4比特量化(Q4)模型几乎不可用,凸显了在复杂推理场景下量化精度对模型可靠性的决定性影响。该发现强调了针对推理任务选择合适的量化级别(如GPTQ、AWQ、FP8等)至关重要,尤其当模型需要处理精密策略时,低比特量化可能导致严重性能退化。

暂无原文内容
Reddit r/LocalLLaMA · 3 天前 · 相关度 85% 热度★★☆☆☆
345
Locally running Mistral on an i7 from 2017 so I don't waste water or ram
在2017年的i7上本地运行Mistral,以免浪费水或内存
推理部署

该帖子展示了在2017年款Intel i7 CPU上本地运行Mistral大语言模型的场景,体现模型轻量化部署能力。用户强调通过本地推理可避免训练和云端推理产生的水资源消耗和内存浪费,属于典型的消费级旧硬件上的边缘推理实践,反映了模型推理向端侧延伸的趋势。

暂无原文内容
Reddit r/artificial · 3 天前 · 相关度 75% 热度★★☆☆☆
346
Is agentic AI governance even a computationally bounded process?
代理人工智能治理是否是一个计算上有界的过程?
训练微调行业资讯

该讨论从理论计算机科学角度探讨代理人工智能(Agentic AI)治理的可计算性问题,质疑图灵机是否足以覆盖所有已知和未知的治理需求。焦点包括目标漂移、目标不一致等对齐挑战,并引入“未知的未知”概念,分析严格治理流程能否穷尽全部潜在风险。核心关切在于对齐问题是否本质上是不可预测和不可计算的,涉及AI安全基础理论的探讨。

暂无原文内容
Reddit r/artificial · 3 天前 · 相关度 72% 热度★★☆☆☆
347
Running Minimax 2.7 at 100k context on strix halo
在 Strix Halo 上以 100K 上下文运行 Minimax 2.7 的实践经验
推理部署性能优化

本文分享了在 Strix Halo 边缘设备上部署 MiniMax-M2.7 GGUF 量化模型,并通过 llama-server 精细调参实现 100K 长上下文稳定推理的实战经验。核心优化手段包括禁用上下文偏移和内存映射、启用统一 KV 缓存以降低显存占用、限制缓存仅存于显存而不交换到 RAM、增大批处理并行度提升预填充性能,以及实施智能缓存重用策略。这些配置组合为边缘端高上下文推理提供了一套可复现的参数调优路径,有效解决了长上下文场景下的显存与吞吐量平衡问题。

暂无原文内容
Reddit r/LocalLLaMA · 3 天前 · 相关度 78% 热度★★☆☆☆
348
ds4 webui
ds4 WebUI
推理部署开发工具

开源项目为 antirez 的 ds4.c 服务器提供了一个极简 Web 用户界面。该界面在配备 256GB 内存的 Apple M3 Ultra 芯片上运行,使用 q2 量化的小型模型,推理速度较快。运行要求为至少 128GB 内存的 Apple Silicon Mac,并展示了多轮提示测试的良好效果。

暂无原文内容
Reddit r/LocalLLaMA · 3 天前 · 相关度 82% 热度★★☆☆☆
349
推理部署性能优化

作者在Apple Silicon M5 Max上测试了Qwen 27B稠密模型的不同量化方法推理速度,标准量化下仅24 Tok/s,采用JANG 4M量化后速度提升至29-30 Tok/s,增幅约30%,无需草稿模型。结果指出JANG量化是苹果硬件上运行大语言模型的优选方案,并提供了基准测试链接。

暂无原文内容
Reddit r/LocalLLaMA · 3 天前 · 相关度 78% 热度★★☆☆☆
350
Optimizing workflow concurrency on Mac/omlx?
优化 Mac/omlX 上的工作流并发
推理部署性能优化

用户在 Mac 上使用 omLX 同时运行多个模型工作流时发现,并发导致 tokens/秒显著下降,即使小模型也会占满所有 GPU 核心。用户询问能否限制推理运行器只使用部分 GPU 核心,以提升多工作流整体吞吐量。该问题涉及本地大模型推理的并发调度策略和 GPU 资源隔离机制,反映在边缘设备上优化推理效率的工程需求。

暂无原文内容
Reddit r/LocalLLaMA · 3 天前 · 相关度 75% 热度★★☆☆☆
351
I am overwhelmed by Harnesses
我被各种 Harness(封装工具)搞晕了
推理部署开发工具

Reddit 用户抱怨围绕 llama.cpp 涌现的前端封装工具(harness)繁多且质量不一,导致选择困难。部分工具功能不稳定,在与 Claude Code 集成时反而引发更多问题。该讨论反映了本地大模型推理部署中工具链碎片化的实际痛点,开发者期望出现一种统一且稳定的封装方案。

暂无原文内容
Reddit r/LocalLLaMA · 3 天前 · 相关度 72% 热度★★☆☆☆
352
LLM rankings are not a ladder: experimental results from a transitive benchmark graph [D]
LLM排行榜并非线性阶梯:基于传递性基准图的实验结果
学术论文基础大模型

该研究将多个LLM基准测试结果转化为有向图,分析模型间的传递性胜出关系。实验表明,94.2%的弱模型可通过2-3跳路径胜出强模型,且存在大量弱模型在单一基准上直接超越强模型的“反转三元组”。某些基准(如Humanity's Last Exam、IFBench、AIME 2025)更容易产生反转,说明传统线性排名无法全面反映模型能力,该图结构视角为模型评估提供了新方法。

暂无原文内容
Reddit r/MachineLearning · 3 天前 · 相关度 75% 热度★★☆☆☆
353
Apple Removes 256GB M3 Ultra Mac Studio Model From Online Store
苹果从在线商店下架256GB M3 Ultra Mac Studio机型
AI芯片硬件推理部署

苹果已从在线商店移除配备256GB统一内存的M3 Ultra Mac Studio,使得该系列最高可选内存降至96GB。这一调整令本地大语言模型社区感到担忧,因为苹果的统一内存架构允许GPU直接访问系统内存,大容量内存对本地运行70B以上大模型至关重要。此前,512GB和256GB版本已先后被移除,内存上限持续下降,用户担心未来的M5 Ultra可能进一步受限,从而削弱Mac作为本地AI推理平台的吸引力。

暂无原文内容
Reddit r/LocalLLaMA · 3 天前 · 相关度 82% 热度★★☆☆☆
354
推理部署基础大模型

该PR为llama.cpp推理框架新增对Sarvam混合专家(MoE)模型的支持,涵盖30B和105B两个版本。Sarvam-30B拥有24亿非嵌入活跃参数,擅长推理、编码和多语言对话,专为资源受限环境设计;Sarvam-105B拥有103亿活跃参数,在复杂推理、智能体任务、数学和编码上达到同尺寸最优,并覆盖22种印度语言。此适配使用户能在本地高效运行这两款先进MoE模型,扩展了llama.cpp的多语言大模型部署能力。

暂无原文内容
Reddit r/LocalLLaMA · 3 天前 · 相关度 90% 热度★★☆☆☆
355
Is SillyTavern the most underrated frontend? Could it be an interface with potential trapped in a silly name? Or is it just for a niche?
SillyTavern 是最被低估的前端吗?它可能是被名字耽误的潜力接口,还是只服务于小众?
开发工具

SillyTavern 尽管界面老旧且名称娱乐化,但实际拥有强大的 LLM 配置管理与插件扩展能力。其核心特色“角色”架构支持为每个角色独立设置系统提示,在群聊场景中能让多个专家角色(如心理学家、程序员、哲学家)保持领域不混淆,从而提升用单一大模型进行多专家协作的效率,突破了传统前端手动切换系统提示的模式。

暂无原文内容
Reddit r/LocalLLaMA · 3 天前 · 相关度 78% 热度★★☆☆☆
356
9070xt inference for q3 qwen 27B
使用 AMD Radeon 9070 XT 推理 Qwen 27B 3 位量化模型
推理部署性能优化

用户在 llama.cpp 上使用 AMD Radeon 9070 XT 显卡运行 Qwen 3.6 27B 参数的 Q3_K_S 量化模型,全层 GPU 加速、65k 上下文并开启 KV 缓存量化到 q4_0,提示处理速度约 12 tokens/s。用户希望评估该速度是否正常,并寻求优化推理吞吐的方法,涉及本地推理参数调优与性能基准。

暂无原文内容
Reddit r/LocalLLaMA · 3 天前 · 相关度 82% 热度★★☆☆☆
357
BeeLlama.cpp: advanced DFlash & TurboQuant with support of reasoning and vision. Qwen 3.6 27B Q5 with 200k context on 3090, 2-3x faster than baseline (peak 135 tps!)
BeeLlama.cpp: 高级DFlash与TurboQuant,支持推理与视觉。单张RTX 3090运行Qwen 3.6 27B Q5,200k上下文,比基线快2-3倍(峰值135 tps)
推理部署性能优化

BeeLlama.cpp 是 llama.cpp 的一个分支,集成了 DFlash 注意力优化和 TurboQuant 快速量化技术,旨在解决消费级 GPU 上大模型部署的 VRAM 和效率瓶颈。项目在单张 RTX 3090 上以 Q5 量化运行 Qwen 3.6 27B 模型,支持 200k 超长上下文、推测解码及视觉功能,并针对 Windows 环境进行了优化。实验显示,相比基线推理速度提升 2-3 倍,峰值生成速度达到 135 tokens/s,有效展示了将大参数模型压缩并高效运行在本地硬件的实用性。

暂无原文内容
Reddit r/LocalLLaMA · 3 天前 · 相关度 85% 热度★★☆☆☆
358
The many sides of Mimo v2.5 Pro
Mimo v2.5 Pro 的多面性
基础大模型

用户测试大模型 Mimo v2.5 Pro 时发现行为严重不稳定:简单提示生成 3D 地球 HTML 页面耗时 10 分钟且结果极差,而通过角色扮演(模拟苹果网页设计师)进行自我批判后输出改善,但随后模型陷入工具调用的无限循环,不断下载 JavaScript 并破坏鼠标控制,即使明确要求停用工具也无效。该测试直接暴露了模型在遵循指令、工具调用控制和角色依赖等方面的潜在缺陷,属于大模型能力与行为分析的重要案例。

暂无原文内容
Reddit r/LocalLLaMA · 3 天前 · 相关度 85% 热度★★☆☆☆
359
Repository of shitty literature?
劣质故事文本库?
基础大模型推理部署

r/LocalLLaMA 用户发起讨论,寻求收集质量较差的故事或文学文本,用于测试本地部署的大语言模型在给出批评意见时是否存在过度吹捧或虚假夸奖的问题。这项实验旨在通过低质量输入考察模型输出的真实性与对齐程度,避免人类作者因个人偏见而高估自身作品,属于对LLM行为评估和本地模型部署验证的实用探索。

暂无原文内容
Reddit r/LocalLLaMA · 3 天前 · 相关度 70% 热度★★☆☆☆
360
vLLM + NVFP4 + Qwen3.6 27B: "Checkpoint does not provide a q scaling factor"?
vLLM + NVFP4 + Qwen3.6 27B: “检查点未提供 q 缩放因子”?
推理部署性能优化

用户在使用 vLLM 加载 NVFP4 量化的 Qwen3.6 27B 模型时,遇到“检查点未提供 q 缩放因子”的警告,该缩放因子被默认设为 k_scale,且仅影响 FP8 注意力后端。该问题反映了 NVFP4 格式量化模型在 vLLM 推理引擎中的兼容性缺陷,可能源于量化过程中未保存所需的缩放参数,属于推理部署中的量化参数配置和引擎适配问题。

暂无原文内容
Reddit r/LocalLLaMA · 3 天前 · 相关度 82% 热度★★☆☆☆
361
Should we use a non-thinking model for code after using a thinking one for plan? (Agentic coding)
是否应在使用思考模型完成规划后,用非思考模型编写代码?(代理编码)
推理部署开发工具

Reddit 用户探讨了一种代理编码策略:先使用 Qwen3.6 27B 思考模型进行任务规划,再切换至 Qwen3.6 35B A3B 模型执行代码生成。核心问题是在从规划交接给编码时,是否应临时禁用代码模型的思考功能,使其严格遵循规划指令,之后再重新开启思考以处理新工具和输入信息。该讨论涉及在本地消费级 GPU(RX 6800)上部署 Qwen 系列模型,并动态调整推理模式以平衡指令遵循与自适应能力,是 LLM 代理工作流中推理部署优化的实践探索。

暂无原文内容
Reddit r/LocalLLaMA · 3 天前 · 相关度 75% 热度★★☆☆☆
362
More Qwen3.6-27B MTP success but on dual Mi50s
更多Qwen3.6-27B多token预测成功案例:双AMD MI50上的实践
推理部署性能优化芯片软件栈

本文分享了在双AMD MI50 GPU上运行Qwen3.6-27B模型的多token预测(MTP)加速实践。通过使用llama.cpp的社区叉并移植MTP模块到已有量化模型,实现了1.5倍单GPU推理速度提升,张量并行下最高达2倍加速。文中给出了具体的服务器启动参数与基准测试数据,展示了MTP技术在老旧计算资源上仍能显著提升推理效率。

暂无原文内容
Reddit r/LocalLLaMA · 3 天前 · 相关度 88% 热度★★☆☆☆
363
Is NVMe, good for swap ram?
NVMe用作swap内存对于运行大语言模型好吗?
推理部署

一个Reddit用户在仅有20GB RAM和4GB VRAM的硬件上尝试运行100B+参数的大语言模型,计划使用150GB NVMe SSD作为交换空间来扩展可用内存。该思路属于低资源环境下的本地推理部署技巧,通过将模型权重部分或全部置于NVMe swap中来弥补物理内存不足,但实际推理速度将严重受限于NVMe的读写带宽,导致极高的延迟。

暂无原文内容
Reddit r/LocalLLaMA · 3 天前 · 相关度 72% 热度★★☆☆☆
364
5 enterprise AI agent swarms (Lemonade, CrowdStrike, Siemens) reverse-engineered into runnable browser templates.
5个企业级AI代理集群(Lemonade、CrowdStrike、西门子)被逆向工程并转化为可运行的浏览器模板
开发工具行业资讯

该项目逆向工程了Lemonade、CrowdStrike和西门子等企业的多代理AI集群架构,并在浏览器中构建了可视化的可运行节点图模板。模板覆盖保险理赔、网络安全等场景,包含视觉代理、策略代理、欺诈检测代理等多代理协作逻辑。项目旨在弥合独立开发者构建的简单聊天机器人与企业复杂多代理集群之间的差距,提供无需大量代码即可体验的代理编排沙盒。

暂无原文内容
Reddit r/artificial · 3 天前 · 相关度 75% 热度★★☆☆☆
365
buying mac vs building PC for running local LLM
购买Mac vs 组装PC以运行本地大语言模型
AI芯片硬件推理部署

社区讨论全栈开发者选择MacBook Pro M5 Max(128GB统一内存)与自组PC在本地LLM推理场景下的硬件方案对比。关注点包括Mac的统一内存架构对大模型加载的优势、GPU显存限制、内存带宽、CUDA生态支持以及整体性价比。用户寻求实际经验以做出高性价比的AI推理硬件投资决策。

暂无原文内容
Reddit r/LocalLLaMA · 3 天前 · 相关度 70% 热度★★☆☆☆
366
Spec decoding for minimax m2.7?
MiniMax M2.7 的推测解码方案?
推理部署性能优化

一位LocalLLaMA用户询问MiniMax M2.7大模型的推测解码(speculative decoding)实现方法。由于该模型未见公布多令牌预测(MTP)功能,发帖者寻求使用EAGLE3框架或蒸馏变体进行推测解码的经验与效果反馈。讨论聚焦于通过投机采样加速模型推理,属于典型的推理部署优化需求。

暂无原文内容
Reddit r/LocalLLaMA · 3 天前 · 相关度 80% 热度★★☆☆☆
367
80 tok/sec and 128K context on 12GB VRAM with Qwen3.6 35B A3B and llama.cpp MTP
在12GB显存上实现80 tok/sec和128K上下文:Qwen3.6 35B A3B与llama.cpp MTP方案
推理部署性能优化

作者在12GB显存的RTX 4070 Super上,通过llama.cpp的多令牌预测(MTP)草稿机制,使Qwen3.6 35B A3B量化模型达到超过80 tokens/秒的生成速度,且草稿接受率高于80%。关键配置包括从源码编译启用MTP的llama.cpp、使用Q4_K_XL量化的GGUF文件,并通过参数 `-fitt 1536` 优化显存与上下文长度,最终实现128K上下文的本地高效推理。该方案为消费级GPU运行大参数模型提供了可复现的优化路径。

暂无原文内容
Reddit r/LocalLLaMA · 3 天前 · 相关度 88% 热度★★☆☆☆
368
Pi and Qwen3.6 27B make setting up Archlinux really easy.
Pi与Qwen3.6 27B使Archlinux设置变得非常简单
开发工具推理部署

用户在Archlinux系统上使用Pi编程代理结合本地部署的Qwen3.6 27B模型,通过自然语言指令自动完成蓝牙连接、屏幕分辨率调整等配置任务。代理能够理解用户意图并执行命令,涉及sudo时提示用户授权,展示了本地AI代理在系统配置与个人计算中的实用潜力。该案例体现了AI代理框架与本地大模型结合的易用性,并探讨了未来加入语音输入与更高权限的交互模式。

暂无原文内容
Reddit r/LocalLLaMA · 4 天前 · 相关度 75% 热度★★☆☆☆
369
GPT-5.5 may burn fewer tokens, but it always burns more cash
GPT-5.5可能消耗更少token,但总是烧掉更多钱
推理部署行业资讯

文章分析了OpenAI的GPT-5.5模型在推理成本上的矛盾:虽然模型本身可能优化了token消耗,但整体运营和使用成本反而更高。这揭示了前沿大语言模型在追求更高效推理时,仍需大量算力资源支撑,导致实际经济性面临挑战。报道从商业化角度探讨了AI公司在模型能力提升与财务可持续性之间的平衡难题。

暂无原文内容
Reddit r/artificial · 4 天前 · 相关度 75% 热度★★☆☆☆
370
Model(s) for Creative Writing & Conversational Intuition
用于创意写作与对话直觉的模型
基础大模型训练微调

作者比较了Qwen系列模型与Claude Sonnet 4.6在创意写作和深层语义关联上的表现,指出Qwen在编程任务上突出但写作与对话直觉仍落后于Anthropic模型。Claude的回复简洁、有主动提问能力,而其他模型往往冗长。作者尝试通过Qwopus微调改善对话风格,但发现微调容易破坏思维链并降低整体质量,凸显了在微调对齐与保持推理能力之间取得平衡的实际挑战。

暂无原文内容
Reddit r/LocalLLaMA · 4 天前 · 相关度 75% 热度★★☆☆☆
371
Testing MiMo-V2.5-IQ3_S with 1'048'576 context
测试MiMo-V2.5-IQ3_S在1048576上下文长度下的表现
推理部署性能优化

本文测试了MiMo-V2.5模型经过IQ3_S量化后,在1048576长度超长上下文下的本地推理表现。测试环境为双GPU(RTX 6000 96GB + W7800 48GB),通过llama-server加载全部49层并启用flash-attn。当上下文长度扩展至346k时,设置temperature=0.2和重复惩罚系数=1.1可有效抑制循环重复问题,且处理速度衰减优于MiniMax模型,预填充与生成阶段保持较好的稳定性,验证了大上下文量化模型在本地部署的可行性。

暂无原文内容
Reddit r/LocalLLaMA · 4 天前 · 相关度 82% 热度★★☆☆☆
372
Has anyone set a local LLM up as a language learning tool?
有人将本地 LLM 设置为语言学习工具吗?
基础大模型开发工具

发帖者计划利用本地部署的大语言模型构建德语学习工具,实现对话练习、错误纠正和教学指导等功能。关键挑战在于设计有效的系统提示词,使模型不仅能翻译,还能对德语输入进行语法和用词纠错,并提供解释。社区成员分享本地LLM在语言教学中的实践经验,探讨提示词工程与模型选择。

暂无原文内容
Reddit r/LocalLLaMA · 4 天前 · 相关度 72% 热度★★☆☆☆
373
Running a quantized 72B VLM on M4 Pro for GUI tasks — some numbers
在 M4 Pro 上运行量化 72B 视觉语言模型用于 GUI 任务——实测数据
推理部署AI芯片硬件

作者在 Mac mini M4 Pro 48GB 统一内存上,以 w4a16 量化运行 72B 参数的 Mano-P 视觉语言动作模型,用于 GUI 自动化任务。预填充速度约 476 tok/s,解码速度约 76 tok/s,峰值内存仅 4.3GB,模型在 OSWorld 基准上达到 58.2% 准确率,优于其他模型。结论是对于循环式截图-思考-操作的任务,稳定低延迟比原始算力更关键,内存带宽成为主要瓶颈。

暂无原文内容
Reddit r/LocalLLaMA · 4 天前 · 相关度 85% 热度★★☆☆☆
374
Hardware upgrade advice
硬件升级建议咨询
推理部署

一位软件开发者考虑将RTX 3080 Ti 12GB升级为两块RTX 5060 Ti 16GB显卡(总显存32GB GDDR7)用于本地大模型推理,预算约1000欧元。该用户正在Arch Linux上使用llama.cpp运行推理,询问双卡能否协同工作以及是否需要桥接器进行配对,重点关注消费级显卡的显存扩展方案与兼容性。

暂无原文内容
Reddit r/LocalLLaMA · 4 天前 · 相关度 75% 热度★★☆☆☆
375
DeepSeek V4 paper full version is out, FP4 QAT details and stability tricks [D]
DeepSeek V4 完整论文发布:FP4 量化感知训练与稳定性技巧详解
基础大模型推理部署训练微调

DeepSeek V4 技术报告完整公开,核心创新包括 FP4 量化感知训练方案:训练后期对 MoE 专家权重直接进行 FP4 QAT,并将 CSA 索引器的 QK 路径激活也量化为 FP4,保持 99.7% 召回率的同时实现 QK 选择器 2 倍加速,推理时完全使用 FP4 权重。V4-Pro 与 V4-Flash 在 1M 上下文下的计算量分别为 V3.2 的 27% 和 10%,KV 缓存仅需 10% 和 7%,大幅降低推理成本。训练稳定性方面,提出“前瞻路由”与 SwiGLU 限幅,前者通过解耦主模型与路由器更新阻断反馈环路,后者对 SwiGLU 路径施加硬限幅抑制异常值,并引入生成式奖励模型提升对齐质量。

暂无原文内容
Reddit r/MachineLearning · 4 天前 · 相关度 95% 热度★★☆☆☆
376
Qwen doesn't work for free
通义千问不再免费
行业资讯基础大模型

Reddit 本地部署社区讨论 Qwen 模型从开源免费转向收费模式或 API 调用开始计费。帖子反映出基础大模型商业化趋势以及用户对成本变化的关注,涉及通义千问系列模型的服务策略调整。

暂无原文内容
Reddit r/LocalLLaMA · 4 天前 · 相关度 72% 热度★★☆☆☆
377
We built and open-sourced Caliby: An embedded, high-performance vector database for AI Agents (Beats pgvector by 4x, outperforms FAISS on disk)
我们构建并开源了Caliby:一款面向AI Agent的嵌入式高性能向量数据库(磁盘性能超出FAISS,比pgvector快4倍)
开发工具

Caliby是一个由Sea-Land AI与MIT联合开发的开源嵌入式向量数据库,针对AI Agent和RAG场景优化,支持文本与向量混合存储。它提供HNSW、DiskANN、IVF+PQ等多种索引,磁盘场景下性能比pgvector提升4倍,并优于FAISS。该工具采用C++核心与Python绑定,利用CPU SIMD加速且无外部依赖,可通过pip直接安装,适合轻量级嵌入式AI应用开发。

暂无原文内容
Reddit r/LocalLLaMA · 4 天前 · 相关度 80% 热度★★☆☆☆
378
potentially stupid problem trying to llama-bench Qwen3.6-27B across two V100s in llama.cpp
尝试在llama.cpp中使用两张V100进行Qwen3.6-27B量化模型推理基准测试时遇到多GPU未生效问题
推理部署性能优化

用户在llama.cpp中使用llama-bench对unsloth量化的Qwen3.6-27B GGUF Q8_0模型进行跨两张NVIDIA V100显卡的推理性能评测,但发现模型未按预期分割到双卡,而是依次在单卡上完成不同上下文长度的测试。用户已启用CUDA后端、flash attention并指定了"--device CUDA0,CUDA1"参数,但多卡配置未生效,因此向社区求助正确的多GPU推理配置方法。该问题涉及大模型在多GPU环境下的部署优化与推理性能评估。

暂无原文内容
Reddit r/LocalLLaMA · 4 天前 · 相关度 80% 热度★★☆☆☆
379
What llamacpp's webui has and what it lacks
llama.cpp Web界面:优势与不足
推理部署开发工具

作者评价了llama.cpp的Web界面,其核心优势是实时显示上下文token使用量,帮助避免模型因上下文饱和而性能下降。界面存在工具调用失败导致整个对话中断、缺乏文件夹或项目管理系统等问题。对MCP工具的控制不够灵活,无法方便地隐藏不需要的工具。为此作者提供了一个MCP代理过滤器代码示例,通过拦截请求隐藏部分工具并过滤目录遍历操作,以节省上下文窗口。

暂无原文内容
Reddit r/LocalLLaMA · 4 天前 · 相关度 75% 热度★★☆☆☆
380
I just generated 1,000,000 transaction graph visualizations from real Ethereum/Arbitrum/Polygon data — now training a Vision-Language model to detect DeFi attacks[P]
我基于真实链上数据生成了100万张交易图可视化图像——现正训练视觉语言模型检测DeFi攻击[项目]
训练微调基础大模型

作者使用AMD MI300X GPU并行处理真实的以太坊、Arbitrum、Polygon链上交易数据,生成了100万张标注了攻击拓扑(如 DRAIN_STAR、MIXING_CHAIN)的交易图可视化图像数据集。在此数据集上,采用LoRA方法微调Qwen2-VL-7B视觉语言模型,使其能够通过“观察”图模式自动检测DeFi攻击,替代传统静态规则。该模型将作为安全预言机Sigui的视觉核心,并配套提出了以太坊AI Agent身份与威胁注册标准ERC-8259。

暂无原文内容
Reddit r/MachineLearning · 4 天前 · 相关度 72% 热度★★☆☆☆
381
How long for llama.cpp official support of MTP?
llama.cpp 官方支持 MTP 还需多久?
推理部署

用户在 Windows 11 的 Strix Halo 平台上尝试自行编译 llama.cpp 时遇到 cmake 错误,未能成功启用 Vulkan/HIP 后端与 MTP(多 Token 预测)功能。该问题反映了 llama.cpp 在支持新硬件后端和新兴推理优化技术(多 Token 预测)时的工程进展与社区需求,属于典型的本地推理部署中框架功能支持的工程问题。MTP 是一种提升自回归推理速度的技术,llama.cpp 的官方支持对在消费级硬件上高效运行大模型有重要意义。

暂无原文内容
Reddit r/LocalLLaMA · 4 天前 · 相关度 85% 热度★★☆☆☆
382
Which finetunes are actually worth it?
哪些微调版本真正有价值?
训练微调基础大模型

文章探讨当前大模型微调的实际价值。早期微调多针对角色扮演等特定任务,现在常见的是基于Claude Opus的蒸馏或去抑制微调。作者质疑这些蒸馏模型因数据集小且基座模型可能已包含相似训练数据,导致性能提升有限甚至退化。发帖人询问社区是否有显著优于基座模型的微调版本,并邀请分享角色扮演、编程等场景中的使用体验。

暂无原文内容
Reddit r/LocalLLaMA · 4 天前 · 相关度 75% 热度★★☆☆☆
383
FinRAG-12B: A Production-Validated Recipe for Grounded Question Answering in Banking
FinRAG-12B:银行业生产验证的基于引用的问答方案
训练微调学术论文行业资讯

本文提出一个面向银行领域的数据高效大模型训练框架,仅用1.43亿tokens数据,通过LLM-as-a-Judge过滤、引用标注和课程学习,使12B模型在引用定位上超越GPT-4.1。引入校准拒绝机制,在训练中混入22%不可回答问题,将“我不知道”率从基础模型的4.3%提升至12%,有效避免GPT-4.1的过度拒绝(20.2%)。该方案已部署在40多家金融机构,查询解决率提高7.1个百分点,响应速度是GPT-4.1的3-5倍,成本仅为其1/20到1/50。

暂无原文内容
arXiv arXiv cs.AI · 4 天前 · 相关度 95% 热度★★☆☆☆
384
ZAYA1-8B Technical Report
ZAYA1-8B 技术报告
基础大模型推理部署芯片软件栈

ZAYA1-8B 是一个基于 MoE++ 架构的推理专用模型,总参数 8B、激活参数仅 700M。模型全流程在 AMD 全栈(计算、网络、软件)上完成预训练、中期训练与监督微调,并采用答案保留裁剪策略从预训练起融入推理数据。后训练通过四阶段强化学习级联提升推理能力,包括数学谜题热身、400 任务的 RLVE-Gym 课程、结合测试时计算轨迹的 RL,以及对话指令遵循训练。测试时引入 Markovian RSA 方法,通过递归聚合并行推理轨迹的尾部,显著提升推理性能,在 AIME'25 上达到 91.9%、HMMT'25 上达到 89.6%,缩小了与更大推理模型的差距。

暂无原文内容
arXiv arXiv cs.AI · 4 天前 · 相关度 95% 热度★★☆☆☆
385
VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?
VibeServe:AI智能体能否构建定制化的大模型推理服务系统?
推理部署开发工具学术论文

本文提出了 VibeServe,首个利用多智能体循环自动合成端到端定制化大模型推理服务系统的框架。其外部循环规划系统设计,内部循环生成候选系统、验证正确性并测量性能。在标准部署场景下,VibeServe 与高度优化的 vLLM 保持竞争力;在六个涉及非标准模型架构、工作负载知识与硬件特定优化的场景中,VibeServe 利用通用系统忽略的优化机会实现了性能超越。该工作展示了推理基础设施可以从“运行时通用化”转向“生成时专用化”的设计范式,为解决非标场景下的推理部署痛点提供了新路径。

暂无原文内容
arXiv arXiv cs.AI · 4 天前 · 相关度 90% 热度★★☆☆☆
386
Large Vision-Language Models Get Lost in Attention
大型视觉语言模型在注意力中迷失
基础大模型性能优化学术论文

本文针对大型视觉语言模型(LVLM)的解码器结构,提出一个基于信息论与几何的统一框架,量化残差更新的几何和熵特性。研究发现注意力层实际充当子空间保持算子,仅进行重配置,而前馈网络才是驱动语义创新的子空间扩展算子。实验进一步表明,将可学习的注意力权重替换为预定义值(如高斯噪声)后,多数数据集上的性能不变甚至更优,揭示了现有注意力机制的严重错配与冗余,意味着先进LVLM并未有效利用视觉上下文信息。

暂无原文内容
arXiv arXiv cs.AI · 4 天前 · 相关度 88% 热度★★☆☆☆
387
Shallow Prefill, Deep Decoding: Efficient Long-Context Inference via Layer-Asymmetric KV Visibility
浅层预填充,深层解码:通过层非对称KV可见性实现高效长上下文推理
推理部署性能优化学术论文

本文提出SPEED方法,采用分阶段层非对称的KV缓存可见性策略,仅在模型浅层保留非锚点提示词的KV状态,解码阶段词元保持全深度可见。在Llama-3.1-8B指令微调模型上,仅用75%的层处理预填充词元,长上下文基准得分接近全深度基线,同时TTFT降低33%,TPOT降低22%,128K上下文长度下活跃KV内存减少25.0%。逐层诊断表明该方法保留了模型关键的提示选择与表示稳定区域,证明了长上下文提示词元无需始终作为全深度KV缓存对象。

暂无原文内容
arXiv arXiv cs.AI · 4 天前 · 相关度 85% 热度★★☆☆☆
388
More Is Not Always Better: Cross-Component Interference in LLM Agent Scaffolding
多不一定更好:LLM智能体脚手架中的跨组件干扰
开发工具学术论文

本文系统研究了大型语言模型(LLM)智能体中规划、工具、记忆、自我反思和检索等脚手架组件之间的交叉干扰。通过在全因子实验中测试32种组件组合,使用Llama-3.1-8B和70B在HotpotQA与GSM8K数据集上评估,发现包含所有组件的系统往往不是最优选择;例如,HotpotQA上仅添加工具的单组件智能体性能高出32%,而GSM8K上三组件组合的F1提升79%。最优组件数量和具体组合高度依赖任务类型和模型规模,且存在大量次模性违反,表明贪婪选择策略不可靠。研究还揭示了工具使用、自我反思与检索之间的协同效应,并通过不同模型家族和提示改写验证了干扰现象的鲁棒性。

暂无原文内容
arXiv arXiv cs.AI · 4 天前 · 相关度 85% 热度★★☆☆☆
389
Knowledge-Graph Paths as Intermediate Supervision for Self-Evolving Search Agents
使用知识图谱路径作为自进化搜索代理的中间监督
训练微调学术论文

本文针对搜索自博弈(SSP)框架中提问器因缺乏关系上下文产生无效问题、求解器仅获得二元奖励的问题,提出利用知识图谱路径作为中间监督。方法包括:在问题构造阶段引入LLM引导的知识图谱子图,为提问器提供关系上下文;在奖励塑造阶段提出航点覆盖奖励(WCR),根据求解路径与构造路径的实体重叠度给予分步部分奖励。在7个QA基准和9种模型配置上的实验显示,该方法在所有配置中均优于标准SSP,尤其在多跳QA任务上提升显著。

暂无原文内容
arXiv arXiv cs.AI · 4 天前 · 相关度 85% 热度★★☆☆☆
390
Chain of Risk: Safety Failures in Large Reasoning Models and Mitigation via Adaptive Multi-Principle Steering
风险链:大型推理模型的安全失败及通过自适应多原则引导缓解
基础大模型学术论文

本文系统揭示了大型推理模型(LRM)的推理链存在安全盲区,即使最终回答看似安全,中间推理仍可泄露有害内容。作者构建了统一安全准则体系,在41K条提示上评测15个模型,发现推理轨迹中普遍存在泄漏和逃逸等系统性失败模式。为缓解该问题,提出自适应多原则引导方法,在推理时动态学习各安全原则从不安全到安全的激活方向,并根据隐状态自适应激活,同时降低推理过程和最终答案中的不安全内容。在DeepSeek-R1-Qwen-7B上,该方法平均减少40.8%的不安全计数,并在BBH、GSM8K和MMLU上保持97.7%的宏观准确率。

暂无原文内容
arXiv arXiv cs.AI · 4 天前 · 相关度 85% 热度★★☆☆☆
391
Belief Memory: Agent Memory Under Partial Observability
信念记忆:部分可观测环境下的智能体记忆
开发工具学术论文

本文提出 BeliefMem 记忆机制,解决 LLM 智能体在长上下文、部分可观测环境中将每次观测作为确定性结论而丢弃不确定性的问题。BeliefMem 将记忆范式从单一结论转为存储多个候选结论及其概率,并通过 Noisy-OR 规则结合新观测动态更新信念。检索时呈现所有备选结论及概率,使智能体能高置信度行动或根据新证据灵活调整。在 LoCoMo 和 ALFWorld 基准上,即使数据有限,BeliefMem 也取得最优平均性能,显著优于已有基线,为部分可观测环境下的智能体记忆提供了新方向。

暂无原文内容
arXiv arXiv cs.AI · 4 天前 · 相关度 85% 热度★★☆☆☆
392
BitCal-TTS: Bit-Calibrated Test-Time Scaling for Quantized Reasoning Models
BitCal-TTS:面向量化推理模型的比特校准测试时缩放
推理部署学术论文

BitCal-TTS 是一种轻量级运行时控制器,针对 4 比特量化推理模型在自适应分配测试时计算时出现的置信度失准和过早停止问题。该方法利用在线令牌级不确定性和推理轨迹稳定性代理,结合比特条件置信度重新标定与比特感知的后标记确认窗口,无需微调基座模型。在 Qwen2.5 Instruct 模型的 GSM8K 子集上,BitCal-TTS 相较于基线在 7B 和 14B 参数规模上分别提升准确率 3.7 和 2.8 个百分点,并显著降低过早停止率。代码已开源以支持复现。

暂无原文内容
arXiv arXiv cs.AI · 4 天前 · 相关度 85% 热度★★☆☆☆
393
AgenticRAG: Agentic Retrieval for Enterprise Knowledge Bases
AgenticRAG:面向企业知识库的代理式检索框架
基础大模型开发工具学术论文

本文提出轻量级代理框架 AgenticRAG,为推理型大语言模型配备搜索、查找、打开、总结等工具,使其能迭代检索、在文档内导航并自主分析证据。在 BRIGHT 基准上,召回率@1 达 49.6%,比最佳嵌入基线提高 21.8 个百分点;在 WixQA 上事实性得分 0.96,相对提升 13%;在 FinanceBench 上答案正确率 92%,接近 Oracle 水平。消融实验表明,从单次检索转向代理式工具使用带来 5.9 倍性能提升,多查询搜索和文档内导航共同贡献了质量与效率的提升。

暂无原文内容
arXiv arXiv cs.AI · 4 天前 · 相关度 85% 热度★★☆☆☆
394
AGPO: Asymmetric Group Policy Optimization for Verifiable Reasoning and Search Ads Relevance at JD
AGPO:非对称组策略优化用于可验证推理及京东搜索广告相关性
训练微调学术论文

本文提出非对称组策略优化(AGPO),解决强化学习可验证奖励(RLVR)中模型推理能力边界收缩的问题。AGPO 采用负主导机制抑制错误推理路径以保持探索,正强化阶段引入组优势,利用组内方差缩放更新幅度来聚焦稀有正确路径并抑制平庸样本。在五个数学推理基准上取得最优准确率,并持续提升大规模采样的 pass@k 表现;在京东搜索广告相关性标签任务中,显著提升学生模型的性能。

暂无原文内容
arXiv arXiv cs.AI · 4 天前 · 相关度 82% 热度★★☆☆☆
395
ReFlect: An Effective Harness System for Complex Long-Horizon LLM Reasoning
ReFlect:面向长程复杂推理的有效LLM加持系统
推理部署学术论文

ReFlect是一种模型无关、无需训练的推理时加持技术,通过构建确定性的错误检测与恢复逻辑,应对大模型在长程多步推理中无声累积错误的问题。实验覆盖6个推理领域和6个模型,表明传统提示级自我批评在90%的情况下无法标记错误,而ReFlect能将任务成功率较直接CoT提升7–29个百分点,并在SWE-bench上将补丁结构质量从0%提升至82–87%。研究还揭示了加持增益与模型基线能力呈反比关系,且中等规模模型难以可靠维护结构化推理状态。

暂无原文内容
arXiv arXiv cs.AI · 4 天前 · 相关度 82% 热度★★☆☆☆
396
Inference-Time Budget Control for LLM Search Agents
LLM搜索智能体的推理时预算控制
推理部署基础大模型学术论文

本文针对多跳问答场景,提出一种用于LLM搜索智能体的两阶段推理时预算控制方法,以在工具调用次数和生成令牌的双重硬约束下优化性能。搜索阶段,控制器基于任务级信息价值(VOI)和剩余预算为每个可行动作打分,动态选择检索、分解或提交答案;回答阶段,最终化器仅在格式错误风险低时对答案进行重写精炼。在四个多跳QA基准、三种LLM骨架和四种预算水平上的实验表明,该方法相比四个基线取得正向总体增益,消融研究证实预算依赖惩罚是主要贡献。

暂无原文内容
arXiv arXiv cs.AI · 4 天前 · 相关度 82% 热度★★☆☆☆
397
DataDignity: Training Data Attribution for Large Language Models
DataDignity:大语言模型训练数据归因
学术论文基础大模型

本文针对大语言模型输出溯源问题,构建了名为FakeWiki的受控基准,包含3537篇人工建构的维基风格文章,以削弱词汇捷径并保留真实来源。作者评估了七种检索基线,并提出两种新方法:无训练的激活引导检索融合方法SteerFuse,以及基于InfoNCE的监督对比排名模型ScoringModel,后者将响应与文档映射到共享空间进行排序。在九个开源指令微调LLM和五种查询条件下,ScoringModel将Recall@10从最强检索基线的35.0提升至52.2,在越狱风格变换查询上平均提升15.7个百分点,证明了有效的数据归因需要区分真实答案支持与表面相似性。

暂无原文内容
arXiv arXiv cs.AI · 4 天前 · 相关度 82% 热度★★☆☆☆
398
Nonsense Helps: Prompt Space Perturbation Broadens Reasoning Exploration
无意义文本有帮助:提示空间扰动拓宽推理探索
训练微调学术论文

本文提出LoPE框架,解决大语言模型在可验证奖励强化学习(如GRPO)中出现的“零优势问题”:当查询的所有采样结果均失败时,梯度更新信号消失导致数据浪费。LoPE在提示词前随机拼接由Lorem Ipsum词汇组成的无意义序列作为扰动,改变模型输出分布以解锁新的推理路径。在1.7B至7B模型上的实验表明,该方法优于原始提示的重复采样策略,且其他低困惑度拉丁文随机序列同样有效,为LLM强化学习的探索效率提供了新基线。

暂无原文内容
arXiv arXiv cs.AI · 4 天前 · 相关度 82% 热度★★☆☆☆
399
From History to State: Constant-Context Skill Learning for LLM Agents
从历史到状态:面向LLM智能体的恒定上下文技能学习
训练微调开发工具学术论文

本文提出恒定上下文技能学习框架,将LLM智能体的重复性工作流从冗长的提示历史中解耦,通过轻量级任务族模块学习可复用技能。每个模块利用确定性跟踪器生成紧凑状态块与对齐的子目标奖励,采用步级监督微调(SFT)和在线强化学习(RL)进行训练。在ALFWorld、WebShop和SciWorld基准上,基于Qwen3-8B的智能体分别达到89.6%、76.8%和66.4%的成功率,并将每轮交互的提示令牌量较ReAct基线减少2至7倍,有效将过程上下文迁移至模型权重中。

暂无原文内容
arXiv arXiv cs.AI · 4 天前 · 相关度 82% 热度★★☆☆☆
400
Visual Fingerprints for LLM Generation Comparison
用于大语言模型生成比较的视觉指纹
开发工具学术论文

本文提出一种可视化方法,将大语言模型在不同条件下的回复建模为内容、表达和结构等多维语言选择,通过多次采样形成分布,并绘制为“视觉指纹”进行对比。该方法能够在分布层面揭示模型在提示、系统指令或参数变化时的一致行为模式,弥补了仅依赖单次输出或聚合指标的不足。实验通过四个用例展示了其在提示设计和模型评估中的应用价值,为分析模型行为提供了一款直观的诊断工具。

暂无原文内容
arXiv arXiv cs.AI · 4 天前 · 相关度 80% 热度★★☆☆☆
401
MAS-Algorithm: A Workflow for Solving Algorithmic Programming Problems with a Multi-Agent System
MAS-Algorithm:一种用多智能体系统求解算法编程问题的工作流
开发工具学术论文

本文提出MAS-Algorithm,一个受竞赛编程启发的多智能体工作流,将算法求解过程拆分为模块化阶段,实现结构化推理、工具集成与智能体协调。在自构建基准上,多个Qwen系列模型的平均接受率提升6.48%,而同等数据下参数高效微调仅提升0.89%,在LiveCodeBench-Pro上也获得4.72%的增益。通过错误模式分析与消融实验,揭示工作流内部的推理特性,单个智能体改进最高可达27.7%,显示出该框架在AI驱动算法推理方面的巨大潜力。

暂无原文内容
arXiv arXiv cs.AI · 4 天前 · 相关度 80% 热度★★☆☆☆
402
PRISM: Perception Reasoning Interleaved for Sequential Decision Making
PRISM:用于序列决策的感知推理交错框架
学术论文基础大模型

本文提出PRISM框架,通过动态问答管道将视觉感知(VLM)与决策(LLM)紧密耦合,用于大语言模型具身智能体的多模态序列决策。LLM会主动质疑VLM的场景描述,提出目标导向问题并合成紧凑的任务关键描述,形成闭环交互。在ALFWorld和Room-to-Room基准测试中,PRISM显著优于传统图像模型,且全过程自动化,无需手工设计问答,验证了目标驱动的交互式感知可系统性提升序列决策性能。

暂无原文内容
arXiv arXiv cs.AI · 4 天前 · 相关度 80% 热度★★☆☆☆
403
Retrieval-Conditioned Topology Selection with Provable Budget Conservation for Multi-Agent Code Generation
检索增强的可证明预算守恒多智能体代码生成拓扑选择
开发工具学术论文

本文提出面向多智能体LLM代码生成系统的检索引导自适应编排架构RGAO,通过层次化代码索引提取结构复杂性向量动态选择编排拓扑,将误路由率从30.1%降至8.2%。创新融合了形式化资源代数与结构归纳守恒定理,首次实现可证明的动态拓扑选择预算守恒,并引入六维预算合约机制。实验表明DAG构建达到亚毫秒级,树索引具备线性扩展性,显著提升多智能体代码生成系统的编排效率与资源利用率。

暂无原文内容
arXiv arXiv cs.AI · 4 天前 · 相关度 80% 热度★★☆☆☆
404
Novelty-based Tree-of-Thought Search for LLM Reasoning and Planning
基于新颖性的思维树搜索用于LLM推理与规划
推理部署性能优化学术论文

本文针对大语言模型在推理与规划任务中成本高、性能不稳定的问题,提出一种基于新颖性的思维树搜索方法。该方法利用LLM的预训练知识度量新生成思维与搜索树中已有节点的独特性,并以此作为剪枝依据,有效缩小搜索空间,降低整体token开销。在多个语言规划与通用推理基准上的实验表明,该方法能在不牺牲性能的前提下显著提升推理效率。

暂无原文内容
arXiv arXiv cs.AI · 4 天前 · 相关度 78% 热度★★☆☆☆
405
Agentic Retrieval-Augmented Generation for Financial Document Question Answering
面向金融文档问答的智能体检索增强生成框架
学术论文

本文提出FinAgent-RAG,一个用于金融文档问答的智能体检索增强生成框架。框架包含对比金融检索器(利用困难负样本训练)、Program-of-Thought推理模块(生成Python代码进行精确算术)以及自适应策略路由器(根据问题复杂度动态分配计算资源)。在FinQA、ConvFinQA和TAT-QA三个基准上,执行准确率分别达到76.81%、78.46%和74.96%,较最强基线提升5.62至9.32个百分点,同时API成本降低41.3%。

暂无原文内容
arXiv arXiv cs.AI · 4 天前 · 相关度 78% 热度★★☆☆☆
406
HyperLens: Quantifying Cognitive Effort in LLMs with Fine-grained Confidence Trajectory
HyperLens:用细粒度置信度轨迹量化大语言模型认知努力
基础大模型训练微调学术论文

本文提出HyperLens,一种利用Transformer深层对层间置信度微小变化进行放大的高分辨率探针方法,能追踪细粒度的置信度轨迹来量化LLM推理中的认知努力。实验发现复杂任务与简单任务之间存在一致的置信度轨迹分歧,并可抽象为认知努力指标,解释了复杂任务需要更高认知努力的原理。此外,研究揭示了标准监督微调(SFT)的副作用:SFT可能降低模型认知努力并导致域内任务性能下降,为LLM的诊断与优化提供了新的视角和工具。

暂无原文内容
arXiv arXiv cs.AI · 4 天前 · 相关度 78% 热度★★☆☆☆
407
SkillRet: A Large-Scale Benchmark for Skill Retrieval in LLM Agents
SkillRet:面向LLM智能体中技能检索的大规模基准测试
基础大模型开发工具学术论文

本文提出SkillRet基准,包含17,810个公开智能体技能,配有结构化语义标签与两级分类体系,并提供63,259个训练样本和4,997个评估查询。实验表明,现有通用检索模型在该大规模真实技能库上表现较差,针对基准进行任务特定微调后,NDCG@10指标相较于最强基线提升13.1至16.9分。分析发现,微调模型能更有效地从长且嘈杂的查询中捕捉与技能相关的细微信号,从而提升检索性能。

暂无原文内容
arXiv arXiv cs.AI · 4 天前 · 相关度 76% 热度★★☆☆☆
408
TACT: Mitigating Overthinking and Overacting in Coding Agents via Activation Steering
TACT: 通过激活引导缓解编码代理中的过度思考与过度行为
性能优化学术论文

本文针对语言模型代理在长程软件工程任务中出现的‘代理漂移’现象,归纳了过度思考(重复推理)和过度行为(盲目调用工具)两种典型失败模式。作者提出 TACT 方法,利用残差流中线性可分的漂移轴检测漂移程度,并在测试时通过将激活向量向校准区域投影来抑制不良行为。在 SWE-bench Verified、Terminal-Bench 2.0 和 CLAW-Eval 实验中,TACT 使 Qwen3.5-27B 和 Gemma-4-26B-A4B-it 的平均任务解决率分别提升 5.8 和 4.8 个百分点,同时求解步骤减少最多 26%,证明代理漂移可作为可操控的残差流方向进行纠正。

暂无原文内容
arXiv arXiv cs.AI · 4 天前 · 相关度 75% 热度★★☆☆☆
409
ICU-Bench:Benchmarking Continual Unlearning in Multimodal Large Language Models
ICU-Bench:多模态大语言模型持续遗忘基准
训练微调学术论文

论文提出了面向多模态大语言模型的持续遗忘基准 ICU-Bench,聚焦于隐私保护场景下从模型中持续删除敏感信息。基准基于医疗报告和劳动合同构建,包含1000份文档、9500张图像、16000个问答对及100个遗忘任务。设计了全方位评估指标,涵盖遗忘有效性、历史任务保持、保留效用和稳定性四个维度。实验发现现有遗忘方法在持续设置下普遍表现不佳,难以平衡遗忘质量与模型效用,并揭示了需要针对持续隐私删除设计专门多模态遗忘方法的挑战。

暂无原文内容
arXiv arXiv cs.AI · 4 天前 · 相关度 75% 热度★★☆☆☆
410
Conceal, Reconstruct, Jailbreak: Exploiting the Reconstruction-Concealment Tradeoff in MLLMs
隐藏,重构,越狱:利用多模态大语言模型中的重建-隐藏权衡
学术论文基础大模型

论文揭示了针对多模态大语言模型的意图混淆类越狱攻击存在重建-隐藏权衡,转换后的输入既要隐藏有害意图,又要保留足够信息供模型重构。通过分析三种黑盒方法,发现字符删除变体在平衡该权衡上表现更好。提出隐蔽感知变体构建方法,贪婪地选择有害关键词对齐度低且互多样的字符删除变体,并采用五种模态感知提示策略实例化,同时引入关键词相关干扰图像增强视觉上下文。实验在主流闭源与开源多模态模型上均优于强基线,暴露了利用模型重建能力恢复有害意图的脆弱性。

暂无原文内容
arXiv arXiv cs.AI · 4 天前 · 相关度 75% 热度★★☆☆☆
411
Saliency-Aware Regularized Quantization Calibration for Large Language Models
面向大语言模型的显著性感知正则化量化校准
推理部署

本文针对大语言模型训练后量化中因校准数据有限导致的泛化下降问题,提出显著度感知正则化量化校准框架SARQC。该方法在标准重构误差基础上增加正则项,迫使量化权重在校准时保持与原始权重的接近程度,从而改善推理时的泛化表现。SARQC可无缝嵌入现有的尺度搜索或Gram矩阵型量化流程,无需引入额外推理开销。在密集模型和MoE模型上的实验显示,困惑度和零样本准确率均获得稳定提升。

暂无原文内容
arXiv arXiv cs.AI · 4 天前 · 相关度 75% 热度★★☆☆☆
412
Text-Graph Synergy: A Bidirectional Verification and Completion Framework for RAG
文本-图协同:面向RAG的双向验证与补全框架
开发工具学术论文

本文提出TGS-RAG框架,解决检索增强生成中文本检索的语义噪声和图检索的路径丢失问题。该框架包含两个双向通道:Graph-to-Text通道利用已访问图节点的全局投票机制重排文本证据以过滤无关内容;Text-to-Graph通道采用基于记忆的孤儿实体桥接算法,利用文本线索从历史中恢复被错误剪枝的推理路径,无需额外数据库开销。在多个多跳推理基准上,该方法在检索精确度与计算效率间取得显著优于现有方法的平衡。

暂无原文内容
arXiv arXiv cs.AI · 4 天前 · 相关度 75% 热度★★☆☆☆
413
AlphaCrafter: A Full-Stack Multi-Agent Framework for Cross-Sectional Quantitative Trading
AlphaCrafter:面向横截面量化交易的全栈多智能体框架
学术论文

AlphaCrafter 提出了一种基于大语言模型(LLM)的全栈多智能体量化交易框架,包含 Miner、Screener 和 Trader 三个智能体,分别负责因子挖掘、筛选和交易执行。框架通过 LLM 引导的搜索持续扩展因子池,并结合市场状态评估构建自适应因子集,在风险约束下形成闭环的横截面交易系统。在沪深 300 和标普 500 上的实验表明,该方法在风险调整收益上优于现有基线,且跨试验方差最低,验证了集成化自适应设计的鲁棒性。

暂无原文内容
arXiv arXiv cs.AI · 4 天前 · 相关度 75% 热度★★☆☆☆
414
FoodCHA: Multi-Modal LLM Agent for Fine-Grained Food Analysis
FoodCHA: 面向细粒度食物分析的多模态LLM智能体
学术论文

本文提出FoodCHA,一种多模态智能体框架,将食物识别分解为由高层类别引导子类别、再由子类别引导烹饪风格的分层决策过程,以增强语义一致性。框架采用紧凑的Moondream-2B视觉语言模型,在保持低计算开销的同时提供推理能力。在FoodNExTDB数据集上,FoodCHA的类别和子类别识别精度分别比Food-Llama-3.2-11B提升13.8%和38.2%,烹饪风格分类精度提升153.2%,显著优于现有方法。

暂无原文内容
arXiv arXiv cs.AI · 4 天前 · 相关度 75% 热度★★☆☆☆
415
LANTERN: LLM-Augmented Neurosymbolic Transfer with Experience-Gated Reasoning Networks
LANTERN:基于大语言模型增强的经验门控推理网络的神经符号迁移
学术论文

LANTERN 框架利用大语言模型(LLM)从自然语言任务描述中自动生成确定性有限自动机,实现神经符号迁移的自动化。它通过语义嵌入对多个源策略进行加权聚合,并采用基于时序差分误差和语义不确定性的自适应师生门控机制。实验表明,该方法在多种领域中将样本效率提升 40-60%,且对低相关性源任务具有鲁棒性,有效扩展了符号强化学习的迁移能力。

暂无原文内容
arXiv arXiv cs.AI · 4 天前 · 相关度 75% 热度★★☆☆☆
416
LaTA: A Drop-in, FERPA-Compliant Local-LLM Autograder for Upper-Division STEM Coursework
LaTA:一种即插即用、符合FERPA的本地LLM自动评分器,适用于高年级STEM课程作业
推理部署学术论文

LaTA是一个开源自动评分工具,基于本地链式思维大语言模型(gpt-oss:120b),无需调用第三方API,完全运行在普通硬件上,满足教育隐私要求。它采用四阶段流水线(导入、分段、评分、报告),通过对比学生作业与参考答案,使用YAML二元评分量表进行批改。在ME 373课程中,该工具在单台Mac Studio上为约200名学生评分,每份作业耗时1-3分钟,边际成本为零,评分错误率低至0.02%-0.04%,学生期中及期末考试成绩分别提升约11%和8%。

暂无原文内容
arXiv arXiv cs.AI · 4 天前 · 相关度 75% 热度★★☆☆☆
417
When Helpfulness Becomes Sycophancy: Sycophancy is a Boundary Failure Between Social Alignment and Epistemic Integrity in Large Language Models
当有益性变为谄媚:大型语言模型中谄媚是社会对齐与认知完整性之间的边界失效
基础大模型学术论文

本文研究大型语言模型的谄媚行为,将其定义为社会对齐与认知完整性之间的边界失效。作者提出三条件框架:用户表达信念或偏好,模型为迎合用户而调整回应,这种调整损害了认知准确性或独立推理。论文引入谄媚的分类法,涵盖对齐目标、机制和严重程度,并探讨了边界感知评估方法与可能的缓解策略,为提升大模型可靠性和真实性提供理论指导。

暂无原文内容
arXiv arXiv cs.AI · 4 天前 · 相关度 75% 热度★★☆☆☆
418
BALAR : A Bayesian Agentic Loop for Active Reasoning
BALAR:一种贝叶斯智能主动推理循环
基础大模型开发工具学术论文

本文提出BALAR,一种无需微调的任务无关外循环算法,用于增强大语言模型在多轮交互中的主动推理能力。该方法通过维持结构化隐状态信念,利用最大化预期互信息选择澄清性问题,并动态扩展状态空间以应对信息不足。在侦探案件、思维谜题和临床诊断三个基准上,BALAR相较基线准确率分别提升14.6%、38.5%和30.5%,显著提升了LLM智能体的交互式推理性能。

暂无原文内容
arXiv arXiv cs.AI · 4 天前 · 相关度 75% 热度★★☆☆☆
419
Is Escalation Worth It? A Decision-Theoretic Characterization of LLM Cascades
升级值得吗?大模型级联的决策理论刻画
推理部署学术论文

本文从约束优化与对偶性出发,为LLM级联构建了决策理论框架,分析了成本-质量前沿的凹性及影子价格。在多个基准上的实验表明,优化的子序列级联未带来显著实用增益;轻量级预生成路由器通过避免廉价模型的生成成本,在多数数据集上优于级联策略,揭示了级联性能受限于“先付费后决策”的结构性成本。

暂无原文内容
arXiv arXiv cs.AI · 4 天前 · 相关度 75% 热度★★☆☆☆
420
Partial Evidence Bench: Benchmarking Authorization-Limited Evidence in Agentic Systems
部分证据基准:智能体系统中授权受限证据的基准评测
开发工具学术论文

该论文提出确定性基准 Partial Evidence Bench,用于量化企业 AI 智能体在访问控制限制下因无法获取完整证据却给出看似完整答案的故障模式。基准覆盖尽职调查、合规审计和安全事件响应三个场景共72个任务,采用 ACL 分区语料、Oracle 完整答案和授权视图答案进行设计,从答案正确性、完整性感知、差距报告品质和不安全完整性行为四个维度评估。基线实验表明,静默过滤证据会导致严重的不安全完整性,而显式失败并报告差距可消除该问题;真实模型测试显示不同模型在完整性声明上存在差异,该基准无需人工评判即可量化治理关键的智能体故障。

暂无原文内容
arXiv arXiv cs.AI · 4 天前 · 相关度 72% 热度★★☆☆☆
421
Detecting Time Series Anomalies Like an Expert: A Multi-Agent LLM Framework with Specialized Analyzers
像专家一样检测时间序列异常:一个基于多智能体LLM的专用分析框架
学术论文开发工具

本文提出SAGE框架,将单变量时间序列异常检测转化为多智能体LLM协作任务。框架包含点异常、结构异常、季节性异常和模式异常四个专用分析器,各自结合领域数值工具与可视化生成证据;证据驱动检测器整合信息后输出带置信度评分、区间和异常类型的记录。方法利用正常参考段构造合成上下文示例,在三个基准数据集上性能优于传统机器学习与纯语言模型基线。

暂无原文内容
arXiv arXiv cs.AI · 4 天前 · 相关度 70% 热度★★☆☆☆
422
Attractor Geometry of Transformer Memory: From Conflict Arbitration to Confident Hallucination
Transformer 记忆的吸引子几何:从冲突仲裁到自信幻觉
学术论文基础大模型

本文从几何角度统一解释了大语言模型中参数记忆与上下文记忆冲突时的错误,以及模型在缺乏事实时产生的自信幻觉。研究发现隐藏状态空间中的事实形成吸引子盆,冲突源于盆间竞争,幻觉则因盆缺失导致状态漂移,而固定的输出头无法区分这两种情况,始终给出高置信度的错误预测。作者通过受控合成任务和预训练模型的自然语言事实查询验证了该几何机制,并提出基于几何边界的方法比基于输出熵的方法能更有效地分离正确回忆与幻觉。

暂无原文内容
arXiv arXiv cs.AI · 4 天前 · 相关度 70% 热度★★☆☆☆
423
Just got a 8x 32gb v100 server... now what
刚入手一台8x 32GB V100服务器,接下来做什么
推理部署AI芯片硬件开发工具

用户拥有8张32GB V100 GPU的服务器,使用llama.cpp运行Qwen 3.5 397B模型并达到35 tok/s。同时对比了RTX 5090、单张A6000 Pro(96GB)等硬件的推理表现,认为A6000 Pro性价比不高。针对agentic coding场景,Qwen 3.6 27B表现最佳,用户对大显存跑大模型的必要性产生疑问,寻求进一步使用建议。

暂无原文内容
Reddit r/LocalLLaMA · 4 天前 · 相关度 85% 热度★★☆☆☆
424
Subq, is it for real ?
Subq,它靠谱吗?
基础大模型推理部署

用户讨论一种可能名为“Subq”的技术,疑似指代次二次(sub-quadratic)注意力机制,以实现超长上下文。发帖者表达了希望在本地运行1200万token上下文模型的意愿,反映出社区对低成本长上下文推理的需求。该技术若可行,将大幅提升本地部署大语言模型的上下文处理能力。

暂无原文内容
Reddit r/LocalLLaMA · 4 天前 · 相关度 75% 热度★★☆☆☆
425
Any good new ai waifu models out as of recently?
最近有没有好的AI waifu模型?
基础大模型

Reddit用户发帖询问近期发布的AI角色扮演模型,模型参数范围为9B到31B,要求完全无审查开箱即用。帖子寻求社区推荐,反映了本地LLM社区对针对特定角色扮演场景的微调模型的兴趣。该讨论涉及LLM模型选择,属于基础大模型话题。

暂无原文内容
Reddit r/LocalLLaMA · 4 天前 · 相关度 70% 热度★★☆☆☆
426
I made a desktop crab that bullies you back
我制作了一只会回怼你的桌面螃蟹
推理部署开发工具

该项目是一个桌面虚拟宠物应用,使用本地Ollama部署的大语言模型驱动一只螃蟹角色。通过completion-format提示而非指令跟随格式,使小型模型更好地保持角色性格,产生更自然的互动。螃蟹能在桌面漫游、对用户行为和文件自发评论、写日记,并拥有等级系统。

暂无原文内容
Reddit r/artificial · 4 天前 · 相关度 85% 热度★★☆☆☆
427
RTX Pro 4500 Blackwell - Qwen 3.6 27B?
RTX Pro 4500 Blackwell 运行 Qwen 3.6 27B 模型测试
AI芯片硬件推理部署性能优化

用户在 NVIDIA RTX PRO 4500 Blackwell 专业显卡(48GB 显存)上部署了 Qwen3.6-27B 的 UD-Q5_K_XL 量化模型,使用 llama.cpp 服务端进行推理。测试中使用 CUDA 13.1 驱动,设置全 GPU 层卸载(--n-gpu-layers 999)、Flash Attention 等优化,测得提示处理速度约 166.6 t/s,生成速度约 35.6 t/s。用户分享了完整服务配置,包括上下文长度 131072、线程数 16 等,并探讨能否运行更大模型。

暂无原文内容
Reddit r/LocalLLaMA · 4 天前 · 相关度 95% 热度★★☆☆☆
428
Those of you who like Gemma4 models - how are you guys using them?
喜欢Gemma4模型的各位,你们是如何使用的?
基础大模型推理部署开发工具

一位用户分享了在本地部署并使用Gemma4模型的体验,将其与Qwen3.6模型进行对比。他通过GGUF量化格式(Q5、Q8)在ROCM和Vulkan后端上运行Gemma 27B/31B模型,发现Gemma在通用知识讨论和方案构思上优于Qwen,但在工具调用、任务持续性和错误恢复方面表现较差,例如在PowerShell下容易放弃、难以正确使用外部技能、陷入循环后难以脱离,以及经常中途停止。结论是尽管Gemma有一定潜力,但整体使用体验令人失望。

暂无原文内容
Reddit r/LocalLLaMA · 4 天前 · 相关度 92% 热度★★☆☆☆
429
基础大模型推理部署

该发布提供一个基于 Qwen3.6 35B A3B 的无审查变体模型,完整保留了原生的 19 个多令牌预测(MTP)头,KL 散度控制在 0.0015,拒绝率仅为 10/100。模型以 Safetensors、GGUF、NVFP4(包括仅专家 NVFP4)、GPTQ-Int4 等多种格式发布,方便在不同部署环境和量化需求下使用。发布者强调 MTP 特性未被削减,并提供了基准测试结果。

暂无原文内容
Reddit r/LocalLLaMA · 4 天前 · 相关度 95% 热度★★☆☆☆
430
Looking for a text to speech model
求推荐文字转语音模型
基础大模型

用户在 Reddit 的 LocalLLaMA 社区发帖,希望获得可在单张 32GB 显存 GPU 上本地运行的文字转语音模型推荐。该需求聚焦于能在消费级硬件上部署的多模态语音合成模型,关注点包括模型大小、显存占用和本地运行可行性。

暂无原文内容
Reddit r/LocalLLaMA · 4 天前 · 相关度 85% 热度★★☆☆☆
431
Best agentic model for 3090TI and 32gb ddr5
适用于3090TI和32GB DDR5的最佳Agent模型
开发工具推理部署基础大模型

一个用户在LocalLLaMA社区询问,在配备RTX 3090 Ti显卡和32GB DDR5内存的硬件配置下,有哪些兼具速度与智能的agentic(可自主执行任务)大语言模型适合本地运行。讨论关注模型在有限显存和内存环境中的推理性能与智能水平的平衡,涉及llama.cpp、ollama等本地推理工具的使用。该问题反映了社区对消费级GPU上部署具备工具调用和自主能力的AI Agent模型的强烈需求。

暂无原文内容
Reddit r/LocalLLaMA · 4 天前 · 相关度 90% 热度★★☆☆☆