Tommy

Inner Speech Engine (ISE)

Inner Speech Engine: Vygotskian Self-Dialogue for AI Intent Deliberation

Paper (Markdown): ISE_Paper.md · Repo: github.com/Tommickey2020gmail/inner-speech-engine

The problem

LLM agents are increasingly autonomous, yet lack a principled mechanism for deciding whether to act. A personal assistant detects an important interview tomorrow — should it proactively remind the user? The answer depends on subtle factors: does the user already know, is now a good time, would the reminder be welcome or intrusive?

Existing architectures handle this poorly. Rule-based systems trigger rigidly. ReAct decides what to do, not whether. Reflexion does post-hoc reflection — the act has already happened. There is no dedicated scaffolding for the should-I layer.

Framework: four-stage self-dialogue

ISE operationalises Vygotsky’s inner speech theory into four cognitive stages:

α: Self-Questioning   → generates 5–7 evaluation questions
β: Self-Argumentation → pro/con reasoning per question, scores 0.0–1.0
γ: Self-Challenging   → adversarial self-examination
δ: Self-Deciding      → weighted aggregation + threshold → ACT / ABSTAIN

| Cognitive function | Inner-speech form | ISE stage |
| --- | --- | --- |
| Self-regulation | “What should I consider?” | Self-Questioning |
| Evaluation | “What are the reasons for/against?” | Self-Argumentation |
| Dialogic perspective-taking | “But what if I’m wrong?” | Self-Challenging |
| Executive decision | “On balance, I should/shouldn’t…” | Self-Deciding |
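
To make the δ stage concrete, here is a minimal sketch of "weighted aggregation + threshold". It is an illustration only, not the repository's code: the `QuestionScore` type, the per-question `weight`, the default threshold of 0.5, and the convention that higher scores argue for acting are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class QuestionScore:
    """One evaluation question with its score from the argumentation stage."""
    question: str
    score: float          # 0.0-1.0; assumed here: higher means "more reason to act"
    weight: float = 1.0   # hypothetical per-question importance


def self_decide(scores, threshold=0.5):
    """Return (decision, aggregate) from a weighted mean and a threshold."""
    total_weight = sum(s.weight for s in scores)
    if total_weight <= 0:
        raise ValueError("at least one question must carry positive weight")
    aggregate = sum(s.score * s.weight for s in scores) / total_weight
    decision = "ACT" if aggregate >= threshold else "ABSTAIN"
    return decision, aggregate


if __name__ == "__main__":
    example = [
        QuestionScore("Does the user already know?", 0.9, weight=2.0),
        QuestionScore("Is now a good time?", 0.3),
    ]
    print(self_decide(example))
```

In a real pipeline the weights and threshold would be tuned or produced by the model; the sketch only shows how a set of 0.0–1.0 scores collapses into ACT / ABSTAIN.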

We studied five implementation variants: ISE-Full (4 LLM calls, one per stage), ISE-Internalized (1 call, with all four stages wrapped in a single prompt), ISE-Gated (adaptive, 1–5 calls), ISE-NoChallenge, and ISE-Compressed.
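
The difference between the decomposed and internalised variants is mostly one of call structure. The sketch below contrasts the two under simplifying assumptions: `llm` is a hypothetical callable that takes a prompt string and returns text, the prompt wording is invented, and each decomposed stage is assumed to see only the previous stage's output.

```python
from typing import Callable

STAGES = [
    "Self-Questioning",
    "Self-Argumentation",
    "Self-Challenging",
    "Self-Deciding",
]


def run_full(scenario: str, llm: Callable[[str], str]) -> str:
    """Decomposed style: one call per stage, each fed the previous artefact."""
    artefact = scenario
    for stage in STAGES:
        artefact = llm(f"[{stage}]\n{artefact}")
    return artefact


def run_internalized(scenario: str, llm: Callable[[str], str]) -> str:
    """Internalised style: a single call that lists all four stages."""
    steps = "\n".join(f"{i}. {stage}" for i, stage in enumerate(STAGES, 1))
    prompt = (
        f"Scenario:\n{scenario}\n\n"
        "Work through these stages in order, then answer ACT or ABSTAIN:\n"
        f"{steps}"
    )
    return llm(prompt)
```

With a stub `llm` that records its prompts, `run_full` makes four calls and the last one no longer contains the scenario text, while `run_internalized` makes one call that does.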

Benchmark: PIDB

Proactive Intent Deliberation Benchmark — 219 human-annotated scenarios in three categories:

  • Clear-Act (94): proactive intervention is appropriate
  • Clear-Abstain (67): intervention should be withheld
  • Ambiguous (58): borderline, needs nuanced judgement

Inter-annotator agreement: κ = 0.844 (main), κ = 0.600 (ambiguous expansion).
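
For readers who want to reproduce an agreement figure of this kind, the sketch below computes Cohen's κ for two annotators. The post does not say which κ variant or how many annotators were used, so treating it as two-rater Cohen's κ is an assumption, and the function name is ours.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two label lists."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if both annotators labelled independently
    # according to their own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)
```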

Headline finding

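| Method | F1 | FPR | Recall |
| --- | --- | --- | --- |
| ISE-Internalized | 0.907 | 0.274 | 0.970 |
| Chain-of-Thought | 0.899 | 0.143 | |
| Direct Prompting | 0.897 | 0.179 | |
| ISE-Gated | 0.872 | 0.179 | |
| ISE-Full | 0.855 | 0.238 | |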
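
The three columns above are standard binary-classification metrics. As a reminder of how they relate, here is a small sketch that treats ACT as the positive class. The function name and the assumption that every scenario carries a gold ACT / ABSTAIN label are ours; the post does not describe how ambiguous scenarios are scored.

```python
def deliberation_metrics(gold, predicted, positive="ACT"):
    """Return F1, false-positive rate and recall for ACT/ABSTAIN decisions."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    tn = sum(g != positive and p != positive for g, p in pairs)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # FPR: share of should-abstain scenarios where the agent acted anyway.
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"f1": f1, "fpr": fpr, "recall": recall}
```

A high recall with a high FPR, as in the ISE-Internalized row, means the method rarely misses a warranted intervention but acts more often than it should on abstain cases.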

Cognitive scaffolding is valuable, but multi-stage decomposition is not. Wrapping the four stages inside a single call (ISE-Internalized) significantly outperforms decomposing them into four calls (ISE-Full). Bootstrap 95% CIs for ΔF1 exclude zero on both tested models ([+0.007, +0.081] / [+0.010, +0.111]). On ambiguous scenarios, ISE-Internalized’s advantage over Direct Prompting reaches significance on Qwen (McNemar’s p = 0.031).
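
The two statistical checks named above can be sketched in a few lines of standard-library Python. This is a generic illustration, not the paper's evaluation script: the percentile bootstrap, the 2,000 resamples, the fixed seed, and the exact (binomial) form of McNemar's test are all assumptions.

```python
import random
from math import comb


def f1_score(gold, pred, positive="ACT"):
    pairs = list(zip(gold, pred))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_delta_f1(gold, pred_a, pred_b, n_resamples=2000, seed=0):
    """Percentile 95% CI for F1(A) - F1(B), resampling scenarios in pairs."""
    rng = random.Random(seed)
    n = len(gold)
    deltas = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        g = [gold[i] for i in idx]
        a = [pred_a[i] for i in idx]
        b = [pred_b[i] for i in idx]
        deltas.append(f1_score(g, a) - f1_score(g, b))
    deltas.sort()
    return deltas[int(0.025 * n_resamples)], deltas[int(0.975 * n_resamples) - 1]


def mcnemar_exact(gold, pred_a, pred_b):
    """Two-sided exact McNemar p-value from the discordant pairs only."""
    triples = list(zip(gold, pred_a, pred_b))
    only_a = sum(a == g and b != g for g, a, b in triples)
    only_b = sum(b == g and a != g for g, a, b in triples)
    n = only_a + only_b
    if n == 0:
        return 1.0  # the two methods never disagree on correctness
    tail = sum(comb(n, k) for k in range(min(only_a, only_b) + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

A confidence interval that excludes zero, or a small McNemar p-value, indicates that the paired difference between two methods is unlikely to be resampling noise.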

This mirrors Vygotsky’s developmental trajectory: internalised inner speech outperforms externalised egocentric speech.

Two by-product findings

1. The severity anchor effect. Passing a single scalar “challenge severity” (0–1) from γ to δ caused the downstream LLM to over-weight that one number, ignoring the rich qualitative evidence alongside it. Replacing it with a structured qualitative summary (“5 of 7 dimensions favour acting”) improved F1 from 0.857 to 0.901. This is Tversky & Kahneman’s (1974) anchoring bias surfacing inside a multi-stage LLM pipeline.
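
One way to picture the fix is as a change in what gets serialised between stages. The sketch below builds a qualitative hand-off string from per-dimension scores instead of forwarding a single scalar. The dimension names, the 0.5 cut-off, and the exact wording are hypothetical; only the "N of M dimensions favour acting" phrasing comes from the example above.

```python
def qualitative_summary(dimension_scores, favour_threshold=0.5):
    """Turn per-dimension scores into a structured text summary.

    dimension_scores maps a dimension name to a 0.0-1.0 score, where a
    score at or above favour_threshold is read as favouring action.
    """
    favouring = [name for name, score in dimension_scores.items()
                 if score >= favour_threshold]
    opposing = [name for name in dimension_scores if name not in favouring]
    lines = [f"{len(favouring)} of {len(dimension_scores)} dimensions favour acting."]
    if favouring:
        lines.append("Favouring: " + ", ".join(favouring))
    if opposing:
        lines.append("Opposing: " + ", ".join(opposing))
    return "\n".join(lines)
```

The downstream stage then receives counts and named dimensions rather than one number it can latch onto.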

2. The self-reflection paradox. Self-Refine and Reflexion-style iterative self-critique underperform Direct Prompting on Doubao / Qwen, with FPR up to 33–37% on obviously-abstain scenarios. The mechanism: when the initial answer is correct (ABSTAIN), iterative critique generates hypothetical reasons why action might be warranted — and the model talks itself into the wrong answer.

Why internalisation beats decomposition

Two complementary mechanisms:

  • Information bottleneck. Each inter-stage serialisation is a lossy compression. The δ stage receives ~500 tokens of structured summary; ~1,000 tokens of original user context have been thrown away.
  • Attention window. In ISE-Internalized the LLM still attends to the original scenario while generating the deciding paragraph. In ISE-Full, δ sees only the artefacts of prior stages and cannot verify questions like “would the user actually welcome this?”

Design recommendation: when full deliberation context fits in one call, default to internalisation. Decompose only when context exceeds the window or intermediate inspectability is a hard requirement.
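
Read as a decision rule, the recommendation might look like the sketch below. The function name, the token-count inputs, and the reserved output budget of 1,024 tokens are illustrative assumptions, not values taken from the paper.

```python
def choose_variant(context_tokens, window_tokens,
                   needs_inspection=False, reserved_output_tokens=1024):
    """Default to a single internalised call; decompose only when forced to."""
    # Leave room in the window for the model's own deliberation output.
    fits = context_tokens + reserved_output_tokens <= window_tokens
    if fits and not needs_inspection:
        return "internalized"
    return "decomposed"
```

For example, a 1,000-token scenario in an 8,000-token window yields "internalized" unless intermediate stage outputs must be inspectable.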

Echoing Vygotsky

Mature inner speech is abbreviated — adults drop redundant elaboration that children have to say aloud. Our ablations echo this: in the internalised form, Self-Argumentation does not improve F1. When Questioning and Challenging are already present, explicit pro/con generation is like a child speaking the inner monologue out loud — redundant externalisation. The best scaffolding is not the most elaborate one but the minimum structure needed to activate the relevant cognitive functions.

Stack: Python 3.11+ · litellm · Pydantic v2 · Jinja2 · pytest.

Companion essay on the reversals encountered while writing the paper: Taking down the scaffolding.

Cross-links: The loop and the self, Predictive processing 101, Attention as relation, not state.
