UCSB NLP × Cisco Research

Context Bootstrapped
Reinforcement Learning

Solving the exploration inefficiency problem in RLVR by teaching models with In-Context Learning.

¹UC Santa Barbara    ²Cisco Research

TL;DR

RLVR fails when models can't produce correct rollouts early in training: no reward signal means no learning. CBRL fixes this exploration inefficiency by injecting few-shot examples into training prompts, frequently at the start and annealing to zero over time. The model bootstraps new capabilities from the examples, then retains them after the examples are gone.

[Figure: injection probability annealing over training steps]
(a) Annealing Schedule. The proportion of exemplar-augmented samples in each batch decreases as p_t decays from 0.5 to 0 over training.
[Figure: prompt construction with and without few-shot examples]
(b) Prompt Construction. Samples without exemplars present only the query; exemplar samples prepend solved few-shot demonstrations as prior conversation turns.
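The prompt construction in panel (b) can be sketched as chat-message assembly, where solved demonstrations become prior user/assistant turns. A minimal sketch; the function and field names are illustrative, not from the paper's code:

```python
import random

def build_prompt(query, exemplars, p_t, rng=random):
    """Build a chat-format training prompt for CBRL.

    With probability p_t, prepend solved few-shot demonstrations
    as prior conversation turns; otherwise present only the query.
    (Illustrative sketch; names are not from the paper's code.)
    """
    messages = []
    if rng.random() < p_t:
        for demo_query, demo_solution in exemplars:
            messages.append({"role": "user", "content": demo_query})
            messages.append({"role": "assistant", "content": demo_solution})
    messages.append({"role": "user", "content": query})
    return messages
```

With p_t = 1.0 every sample is exemplar-augmented; with p_t = 0.0 the prompt is just the bare query, matching the two formats in the figure.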
1. Inject Examples Early. Few-shot demonstrations are stochastically prepended to training prompts, guiding the model toward successful rollouts.

2. Anneal to Zero. Injection probability decreases linearly over training, forcing the model to solve problems independently.

3. Retain the Gains. The model internalizes the reasoning patterns; performance persists long after demonstrations are removed.
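The schedule behind steps 1 and 2 can be sketched as a linear decay. This is a minimal sketch: the linear form and the p_start = 0.5 starting point follow the figure, while the function name and signature are illustrative:

```python
def injection_probability(step, total_steps, p_start=0.5):
    """Linearly anneal the exemplar-injection probability from
    p_start at step 0 down to 0 at total_steps.
    (Sketch of the linear schedule shown in the annealing figure.)"""
    frac = min(step / total_steps, 1.0)
    return p_start * (1.0 - frac)
```

At each training step, a rollout is exemplar-augmented with probability injection_probability(step, total_steps), so early batches lean on demonstrations and late batches contain none.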

Key Results

CBRL improves every model-environment pair we tested across Reasoning Gym and Q Programming.

Reasoning Gym

5 tasks · Qwen2.5-3B-Instruct & Llama-3.2-3B-Instruct · gains from +1.3% to +22.3%

                      Qwen 2.5-3B              Llama 3.2-3B
Environment         GRPO    CBRL      Δ      GRPO    CBRL      Δ
ARC-1D              26.00   30.67   +4.67    17.00   25.00   +8.00
Manipulate Matrix    6.00    8.00   +2.00     3.33    8.33   +5.00
Spell Backward      51.00   52.67   +1.67    95.67   97.00   +1.33
Word Sorting        53.33   75.67  +22.34    80.00   84.33   +4.33
Puzzle-24           48.00   60.67  +12.67    20.67   22.00   +1.33

Q Programming

Domain-specific language for kdb+ · qqWen-7B-Pretrain (Morgan Stanley)

Method        Valid Q (%)   Avg. Pass (%)    Pass@1 (%)
Baseline          3.1           0.3              0.0
GRPO             89.1          27.3              5.0
CBRL GRPO        80.9          43.0 (+15.7)    26.3 (+21.3)

CBRL Learns Faster & Retains the Advantage

Shaded regions mark high injection (p_t > 0.25). The advantage persists after injection stops: CBRL bootstraps durable behaviors.

[Figure: training curves, three panels]
(left) Q Programming (GRPO): qqWen-7B-Pretrain, 512 steps.
(middle) Word Sorting (GRPO): Qwen2.5-3B-Instruct, 500 steps.
(right) Word Sorting (RLOO): Qwen2.5-3B-Instruct, 500 steps.
Legend: Baseline, CBRL; shaded regions mark p_t > 0.25.

Ablations & Insights

Algorithm Agnostic

CBRL is not tied to GRPO. Under RLOO, Word Sorting jumps from 20% to 67%, Puzzle-24 from 23% to 66%.

Environment         RLOO    CBRL RLOO       Δ
ARC-1D              10.33      8.00       −2.33
Word Sorting        20.33     67.33      +47.00
Puzzle-24           23.00     66.00      +43.00
Spell Backward      63.67     89.67      +26.00
Manipulate Matrix    8.67      1.67       −7.00

How Much Injection?

Too little and you don't get the bootstrap; too much and the model never learns independence. p_start = 0.5 hits the sweet spot.
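One way to see what p_start controls: under a linear decay to zero, the average injection probability over training is p_start / 2, so p_start = 0.5 means roughly a quarter of all rollouts see demonstrations. A back-of-envelope sketch, not a figure from the paper:

```python
def mean_exemplar_fraction(p_start, total_steps=500):
    """Average injection probability over a linear decay
    p_t = p_start * (1 - t / total_steps), annealed to zero.
    Illustrative; approximates the closed form p_start / 2."""
    probs = [p_start * (1 - t / total_steps) for t in range(total_steps)]
    return sum(probs) / total_steps
```

Doubling p_start doubles the model's total exposure to demonstrations, which is where the "too much" failure mode comes from: the policy spends most of training imitating rather than exploring on its own.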

What the Model Actually Learns

Task: Sort violates, yes, already, completing, pages, duty, his, EXPRESS, duly in ascending ASCII order.

Baseline (GRPO)

Think: The words provided are: violates, yes, already, completing, pages, duty, his, EXPRESS, duly. We need to sort these words in ascending order using ASCII/Unicode ordering. The sorting will consider letter cases and special characters. Let's sort the words.

EXPRESS, already, completing, duty, his, violates, pages, duly, yes
CBRL (Ours)

Think: The words need to be sorted in ascending ASCII order. We start by comparing the ASCII values of the first characters.
'v' in "violates" = 118
'y' in "yes" = 121
'a' in "already" = 97 …
Sorting: 69, 97, 99, 100, 100, 104, 112, 118, 121
Mapping: 'E', 'a', 'c', 'd', 'd', 'h', 'p', 'v', 'y' …

EXPRESS, already, completing, duly, duty, his, pages, violates, yes
Few-Shot Example

Think: Use ASCII/Unicode lexicographic ordering: compare strings by code points left to right …
"Transform" starts with 'T' (84), while remaining words start with lowercase (≥97) …
Between "following" and "forgetful": 'l' (108) < 'r' (114), so "following" precedes "forgetful".

"799", "Transform", "business", "days", "dead", "following", "forgetful", "turning"
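Ascending ASCII order is exactly what Python's built-in sorted produces on strings, since str comparison proceeds code point by code point; the two worked answers above can be checked directly:

```python
# Task words from the worked example above.
task = ["violates", "yes", "already", "completing", "pages",
        "duty", "his", "EXPRESS", "duly"]
# Uppercase 'E' (69) precedes every lowercase letter (>= 97),
# and "duly" < "duty" because 'l' (108) < 't' (116).
print(sorted(task))  # matches the CBRL answer, not the baseline's

# Few-shot example words: digits ('7' = 55) sort before
# uppercase ('T' = 84), which sorts before lowercase.
fewshot = ["799", "Transform", "business", "days", "dead",
           "following", "forgetful", "turning"]
print(sorted(fewshot))  # matches the few-shot answer above
```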

Abstract

Reinforcement Learning from Verifiable Rewards (RLVR) suffers from exploration inefficiency, where models struggle to generate successful rollouts, resulting in minimal learning signal. This challenge is particularly severe for tasks that require the acquisition of novel reasoning patterns or domain-specific knowledge. To address this, we propose Context Bootstrapped Reinforcement Learning (CBRL), which augments RLVR training by stochastically prepending few-shot demonstrations to training prompts. The injection probability follows a curriculum: starting high to bootstrap early exploration, then annealing to zero so the model must ultimately succeed without assistance. This forces the policy to internalize reasoning patterns from the demonstrations rather than relying on them at test time. We validate CBRL across two model families and five Reasoning Gym tasks. Our results demonstrate that CBRL consistently improves success rate, provides better exploration efficiency, and is algorithm-agnostic. We further demonstrate CBRL's practical applicability by training on Q, a domain-specific programming language absent from typical pretraining corpora.