Figure: Overview of Reinforcement Learning from Text Feedback (RLTF). The framework uses a feedback provider (judge) to generate critiques \(c_0\) on the policy's first-turn output \(y_0\). RLTF-SD (Self Distillation) trains the policy to match the feedback-conditioned second-turn generations \(y_1\), and RLTF-FM (Feedback Modeling) predicts the critiques \(c_0\) as an auxiliary objective. Both methods improve single-turn test-time performance by internalizing feedback during training.
The success of RL for LLM post-training stems from an unreasonably uninformative source: a single bit of information per rollout, in the form of a binary reward or preference label. At the other extreme, distillation offers dense supervision but requires demonstrations, which are costly and difficult to scale. We study text feedback as an intermediate signal: richer than scalar rewards, yet cheaper than complete demonstrations. Textual feedback is a natural mode of human interaction and is already abundant in many real-world settings, where users, annotators, and automated judges routinely critique LLM outputs. Towards leveraging text feedback at scale, we formalize a multi-turn RL setup, RL from Text Feedback (RLTF), in which text feedback is available during training but not at inference; models must therefore internalize the feedback to improve their single-turn test-time performance. To this end, we propose two methods: Self Distillation (RLTF-SD), which trains the single-turn policy to match its own feedback-conditioned second-turn generations, and Feedback Modeling (RLTF-FM), which predicts the feedback as an auxiliary objective. We provide theoretical analysis of both methods and empirically evaluate them on reasoning puzzles, competition math, and creative writing tasks. Both methods consistently outperform strong baselines across benchmarks, highlighting the potential of RL with an additional source of rich supervision at scale.
This problem is naturally formulated in a multi-turn setting: the model generates an attempt, feedback is appended to form an extended prompt, and the model revises. The natural objective is the multi-turn one, which maximizes the expected sum of rewards over an \(H\)-turn interaction:
\[ J_{\text{MultiTurn}}(\pi) = \mathbb{E}^{\pi}\left[\sum_{h=0}^{H-1} r_h\right] \]
However, feedback is typically unavailable at test time: users want good outputs on the first try. Our goal is therefore to improve single-turn competence, captured by the single-turn objective:
\[ J_{\text{SingleTurn}}(\pi) = \mathbb{E}_{x_0 \sim \mu}\left[\mathbb{E}_{y \sim \pi(\cdot|x_0)}[R(x_0, y)]\right] \]
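To make this concrete, here is a minimal sketch of how \(J_{\text{SingleTurn}}\) could be estimated by Monte Carlo sampling; `policy.sample` and `reward_fn` are hypothetical stand-ins for the policy and the task reward \(R\).

```python
# Minimal sketch: Monte Carlo estimate of J_SingleTurn(pi).
# `policy.sample(prompt)` and `reward_fn(prompt, completion)` are hypothetical helpers
# standing in for y ~ pi(.|x0) and the task reward R(x0, y).
def estimate_single_turn_objective(policy, prompts, reward_fn, samples_per_prompt=4):
    total, count = 0.0, 0
    for x0 in prompts:                      # x0 ~ mu (empirical prompt distribution)
        for _ in range(samples_per_prompt):
            y = policy.sample(x0)           # y ~ pi(. | x0)
            total += reward_fn(x0, y)       # R(x0, y)
            count += 1
    return total / count
```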
The central research question is: Given access to feedback-augmented trajectories during training, how can we design learning objectives and algorithms that improve \(J_{\text{SingleTurn}}(\pi)\)?
Text feedback can turn an incorrect first attempt into a correct second attempt. We convert this feedback-conditioned generation into improvement on the single-turn metric via Self Distillation: we treat the policy acting under the second-turn prompt as a teacher, and distill it into the original policy.
For each initial prompt \(x_0\), we sample a first-turn output \(y_0 \sim \pi(\cdot | x_0)\), obtain feedback \(c_0\), and form the feedback-augmented prompt \(x_1 = f(x_0, y_0, c_0)\). We then sample a revised output \(y_1 \sim \pi(\cdot | x_1)\) and use \(y_1\) to update \(\pi(\cdot | x_0)\) (not \(\pi(\cdot | x_1)\)).
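A minimal sketch of this data-collection loop is shown below; `policy.sample`, `judge.critique`, and `make_feedback_prompt` (the template \(f\)) are hypothetical helpers, and the actual RLTF-SD update may differ in detail.

```python
# Minimal sketch of RLTF-SD data collection (hypothetical helpers assumed).
def collect_self_distillation_pairs(policy, judge, prompts, make_feedback_prompt):
    """Collect (x0, y1) pairs: the single-turn policy is later trained to
    match its own feedback-conditioned second-turn generations."""
    pairs = []
    for x0 in prompts:
        y0 = policy.sample(x0)                 # first-turn attempt, y0 ~ pi(.|x0)
        c0 = judge.critique(x0, y0)            # text feedback on y0
        x1 = make_feedback_prompt(x0, y0, c0)  # x1 = f(x0, y0, c0)
        y1 = policy.sample(x1)                 # revised attempt, y1 ~ pi(.|x1)
        pairs.append((x0, y1))                 # distill y1 back into pi(.|x0)
    return pairs
```

The distillation step then increases \(\log \pi(y_1 \mid x_0)\) on these pairs, treating the feedback-conditioned second-turn policy as a teacher for the single-turn policy.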
We can also treat the critique itself as a supervision signal and explicitly model the feedback provider. Feedback Modeling trains the policy to predict the feedback as an auxiliary objective, providing dense token-level gradients even on failed rollouts.
To do this, we define a feedback-prediction distribution \(p_\pi(c | x, y) := \pi(c | f_{\text{FeeMol}}(x, y))\). Because the feedback model uses the same LM, it enables test-time scaling via self-feedback: the model can generate its own critiques and perform iterative refinement at inference time.
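As an illustration, the following is a minimal sketch of the feedback-prediction loss for a HuggingFace-style causal LM; `make_feedback_prediction_prompt` is a hypothetical stand-in for the template \(f_{\text{FeeMol}}\), and batching, masking, and weighting against the main RL objective are omitted.

```python
# Minimal sketch of the RLTF-FM auxiliary loss (hypothetical prompt template assumed).
import torch
import torch.nn.functional as F

def feedback_modeling_loss(model, tokenizer, x0, y0, c0, make_feedback_prediction_prompt):
    """Negative log-likelihood of the judge's critique c0 under the policy,
    conditioned on the feedback-prediction prompt f_FeeMol(x0, y0)."""
    prompt = make_feedback_prediction_prompt(x0, y0)
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    target_ids = tokenizer(c0, return_tensors="pt", add_special_tokens=False).input_ids
    input_ids = torch.cat([prompt_ids, target_ids], dim=1)
    logits = model(input_ids).logits
    # Predict each critique token from the preceding context; prompt positions are ignored.
    shift_logits = logits[:, prompt_ids.size(1) - 1 : -1, :]
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        target_ids.reshape(-1),
    )
```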
We evaluate on three domains: reasoning puzzles (Knights and Knaves, Binary Matrix, Shortest Path), competition math (MATH500, AIME24), and creative writing (LitBench, WritingBench).
| Benchmark | Base Model | GRPO Single-turn | GRPO Multi-turn | Feedback Descent | RLTF-SD | RLTF-FM |
|---|---|---|---|---|---|---|
| **Reasoning** | | | | | | |
| Knights and Knaves | 0.058 | 0.373 | 0.352 | 0.055 | 0.802 | 0.880 |
| Binary Matrix | 0.001 | 0.125 | 0.950 | 0.005 | 0.976 | 0.978 |
| Shortest Path | 0.034 | 0.385 | 0.384 | 0.035 | 0.830 | 0.905 |
| **Math** | | | | | | |
| MATH500 (DAPO) | 0.376 | 0.526 | 0.523 | 0.415 | 0.548 | 0.567 |
| AIME24 (DAPO) | 0.025 | 0.058 | 0.025 | 0.045 | 0.088 | 0.083 |
| MATH500 (DeepMath) | 0.376 | 0.558 | 0.578 | 0.424 | 0.598 | 0.636 |
| AIME24 (DeepMath) | 0.025 | 0.042 | 0.050 | 0.054 | 0.058 | 0.058 |
| **Creative Writing** | | | | | | |
| LitBench | 4.20 | 6.83 | 6.41 | 8.25 | 8.80 | 8.40 |
| WritingBench | 5.71 | 5.92 | 6.29 | 5.30 | 6.71 | 6.39 |
Main findings:
A key question is whether the semantic richness of text feedback matters, or whether simply knowing correctness is sufficient. We compare RLTF-SD using full text critiques against a correctness-only baseline that replaces the judge's critique with a simple sentence: "Your previous answer was {correct/incorrect}".
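For reference, the correctness-only ablation replaces the judge with a one-line template; the `is_correct` verifier below is a hypothetical stand-in for the task's answer checker.

```python
# Minimal sketch of the correctness-only feedback ablation (hypothetical verifier assumed).
def correctness_only_feedback(x0, y0, is_correct):
    verdict = "correct" if is_correct(x0, y0) else "incorrect"
    return f"Your previous answer was {verdict}."
```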
Figure: Evaluation curves on Knights and Knaves and MATH500 (trained on DAPO) for text feedback vs. correctness-only feedback. We compare single- and multi-turn accuracy for two algorithms: multi-turn GRPO and RLTF-SD. Overall, text feedback outperforms correctness-only feedback on both single- and multi-turn accuracy for both algorithms.
Key findings:
A unique advantage of RLTF-FM (Feedback Modeling) is that it enables test-time scaling via self-feedback. Because the model learns to predict critiques during training, it can generate its own feedback at inference time and perform iterative refinement—without requiring an external judge.
We evaluate the model trained with RLTF-FM on Knights and Knaves and MATH500 by allowing it to generate up to 5 rounds of self-feedback at inference time. We compare against a baseline that uses RL to improve the model's self-critique using second-turn reward, with early termination disabled during training.
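A minimal sketch of this self-refinement loop is given below; `policy.sample` and the two prompt templates are hypothetical helpers, and the number of rounds is fixed rather than adaptively terminated.

```python
# Minimal sketch of self-feedback test-time scaling (hypothetical helpers assumed).
def self_refine(policy, x0, make_feedback_prediction_prompt, make_feedback_prompt,
                max_rounds=5):
    """Iteratively self-critique and revise without an external judge."""
    y = policy.sample(x0)                                          # initial single-turn attempt
    for _ in range(max_rounds):
        c = policy.sample(make_feedback_prediction_prompt(x0, y))  # self-generated critique
        x = make_feedback_prompt(x0, y, c)                         # feedback-augmented prompt
        y = policy.sample(x)                                       # revised attempt
    return y
```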
Figure: Test-time scaling results on Knights and Knaves and MATH500 (trained on DAPO). The x-axis shows the number of self-feedback rounds at inference time. We compare RLTF-FM with multi-turn scalar-based RL, where the dashed line ("+ Self-Critique") denotes further using RL to improve the self-critique during training.
Key findings: