Solving the exploration inefficiency problem in RLVR by teaching models with In-Context Learning.
RLVR fails when models can't produce correct rollouts early in training: no reward signal means no learning. CBRL fixes this exploration inefficiency by injecting few-shot examples into training prompts, frequently at the start and annealing to zero over training. The model bootstraps new capabilities from the examples, then retains them after the examples are gone.
Few-shot demonstrations are stochastically prepended to training prompts, guiding the model toward successful rollouts.
Injection probability decreases linearly over training, forcing the model to solve problems independently.
The model internalizes the reasoning patterns. Performance persists long after demonstrations are removed.
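The mechanism above can be sketched in a few lines of Python. This is a minimal illustration, not the released training code; names like `few_shot_bank` and `p_start` are placeholders, and the linear schedule is the one described in this post.

```python
import random

def injection_prob(step, total_steps, p_start=0.5):
    """Linearly anneal the few-shot injection probability from p_start to 0."""
    return max(0.0, p_start * (1 - step / total_steps))

def build_prompt(task_prompt, few_shot_bank, step, total_steps, p_start=0.5):
    """Stochastically prepend few-shot demonstrations to a training prompt."""
    p_t = injection_prob(step, total_steps, p_start)
    if random.random() < p_t:
        # Early in training: demonstrations guide the model to valid rollouts.
        demos = "\n\n".join(few_shot_bank)
        return demos + "\n\n" + task_prompt
    # Late in training: p_t has annealed to 0, so the bare prompt is used
    # and the model must succeed without assistance.
    return task_prompt
```

At evaluation time only the bare prompt is used, so any gains reflect internalized behavior rather than in-context copying.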
CBRL improves every model-environment pair we tested across Reasoning Gym and Q Programming.
| Environment | Qwen 2.5-3B GRPO | Qwen 2.5-3B CBRL | Llama 3.2-3B GRPO | Llama 3.2-3B CBRL |
|---|---|---|---|---|
| ARC-1D | 26.00 | 30.67 (+4.67) | 17.00 | 25.00 (+8.00) |
| Manipulate Matrix | 6.00 | 8.00 (+2.00) | 3.33 | 8.33 (+5.00) |
| Spell Backward | 51.00 | 52.67 (+1.67) | 95.67 | 97.00 (+1.33) |
| Word Sorting | 53.33 | 75.67 (+22.34) | 80.00 | 84.33 (+4.33) |
| Puzzle-24 | 48.00 | 60.67 (+12.67) | 20.67 | 22.00 (+1.33) |
| Method | Valid Q (%) | Avg. Pass (%) | Pass@1 (%) |
|---|---|---|---|
| Baseline | 3.1 | 0.3 | 0.0 |
| GRPO | 89.1 | 27.3 | 5.0 |
| CBRL GRPO | 80.9 | 43.0 (+15.7) | 26.3 (+21.3) |
Shaded regions mark high injection (p_t > 0.25). The advantage persists after injection stops: CBRL bootstraps durable behaviors.
CBRL is not tied to GRPO. Under RLOO, Word Sorting jumps from 20% to 67%, Puzzle-24 from 23% to 66%.
| Environment | RLOO | CBRL RLOO |
|---|---|---|
| ARC-1D | 10.33 | 8.00 (−2.33) |
| Word Sorting | 20.33 | 67.33 (+47.00) |
| Puzzle-24 | 23.00 | 66.00 (+43.00) |
| Spell Backward | 63.67 | 89.67 (+26.00) |
| Manipulate Matrix | 8.67 | 1.67 (−7.00) |
Too little injection and the bootstrap never happens. Too much and the model never learns to solve problems without the demonstrations. p_start = 0.5 hits the sweet spot.
Task: Sort violates, yes, already, completing, pages, duty, his, EXPRESS, duly in ascending ASCII order.
Think: The words provided are: violates, yes, already, completing, pages, duty, his, EXPRESS, duly. We need to sort these words in ascending order using ASCII/Unicode ordering. The sorting will consider letter cases and special characters. Let's sort the words.
Think: The words need to be sorted in ascending ASCII order. We start by comparing the ASCII values of the first characters.
'v' in "violates" = 118
'y' in "yes" = 121
'a' in "already" = 97 …
Sorting: 69, 97, 99, 100, 100, 104, 112, 118, 121
Mapping: 'E', 'a', 'c', 'd', 'd', 'h', 'p', 'v', 'y' …
Think: Use ASCII/Unicode lexicographic ordering: compare strings by code points left to right …
"Transform" starts with 'T' (84), while the remaining words start with lowercase letters (≥97) …
Between "following" and "forgetful": 'l' (108) < 'r' (114), so "following" precedes "forgetful".
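The traces above are easy to check: Python's default string comparison is exactly code-point (ASCII/Unicode) lexicographic order, so `sorted` reproduces the expected answer for the Word Sorting example.

```python
words = ["violates", "yes", "already", "completing", "pages",
         "duty", "his", "EXPRESS", "duly"]

# Python compares strings by code points, so sorted() gives ASCII order:
# uppercase 'E' (69) precedes every lowercase letter (>= 97).
result = sorted(words)
print(result)
# → ['EXPRESS', 'already', 'completing', 'duly', 'duty',
#    'his', 'pages', 'violates', 'yes']

# First-character code points match the model's trace.
codes = [ord(w[0]) for w in result]
print(codes)  # → [69, 97, 99, 100, 100, 104, 112, 118, 121]
```

Note the tie between "duly" and "duty" is broken at the third character, 'l' (108) < 't' (116), the same position-by-position comparison the rollout performs.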
Reinforcement Learning from Verifiable Rewards (RLVR) suffers from exploration inefficiency, where models struggle to generate successful rollouts, resulting in minimal learning signal. This challenge is particularly severe for tasks that require the acquisition of novel reasoning patterns or domain-specific knowledge. To address this, we propose Context Bootstrapped Reinforcement Learning (CBRL), which augments RLVR training by stochastically prepending few-shot demonstrations to training prompts. The injection probability follows a curriculum: starting high to bootstrap early exploration, then annealing to zero so the model must ultimately succeed without assistance. This forces the policy to internalize reasoning patterns from the demonstrations rather than relying on them at test time. We validate CBRL across two model families and five Reasoning Gym tasks. Our results demonstrate that CBRL consistently improves success rate, provides better exploration efficiency, and is algorithm-agnostic. We further demonstrate CBRL's practical applicability by training on Q, a domain-specific programming language absent from typical pretraining corpora.