UCSB NLP × Cisco Research

Context Bootstrapped
Reinforcement Learning

Solving the exploration inefficiency problem in RLVR by teaching models with In-Context Learning.

¹UC Santa Barbara    ²Cisco Research

TL;DR

RLVR fails when models can't produce correct rollouts early in training: no reward signal means no learning. CBRL fixes this exploration inefficiency by injecting few-shot examples into training prompts, frequently at the start and annealing to zero over time. The model bootstraps new capabilities from the examples, then retains them after the examples are gone.

[Figure: injection probability annealing over training steps]
(a) Annealing Schedule. The proportion of exemplar-augmented samples in each batch decreases as p_t decays from 0.5 to 0 over training.
[Figure: prompt construction with and without few-shot examples]
(b) Prompt Construction. Samples without exemplars present only the query; exemplar samples prepend solved few-shot demonstrations as prior conversation turns.
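The prompt construction in panel (b) can be sketched as chat-message assembly, where solved demonstrations become prior user/assistant turns. A minimal sketch; the function and field names are illustrative, not from the paper's code:

```python
import random

def build_prompt(query, exemplars, p_t, rng=random):
    """Build a chat-format training prompt for CBRL.

    With probability p_t, prepend solved few-shot demonstrations
    as prior conversation turns; otherwise present only the query.
    (Illustrative sketch; names are not from the paper's code.)
    """
    messages = []
    if rng.random() < p_t:
        for demo_query, demo_solution in exemplars:
            messages.append({"role": "user", "content": demo_query})
            messages.append({"role": "assistant", "content": demo_solution})
    messages.append({"role": "user", "content": query})
    return messages
```

With p_t = 1.0 every sample is exemplar-augmented; with p_t = 0.0 the prompt is just the bare query, matching the two formats in the figure.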
1. Inject Examples Early. Few-shot demonstrations are stochastically prepended to training prompts, guiding the model toward successful rollouts.

2. Anneal to Zero. Injection probability decreases linearly over training, forcing the model to solve problems independently.

3. Retain the Gains. The model internalizes the reasoning patterns; performance persists long after demonstrations are removed.
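The schedule behind steps 1 and 2 can be sketched as a linear decay. This is a minimal sketch: the linear form and the p_start = 0.5 starting point follow the figure, while the function name and signature are illustrative:

```python
def injection_probability(step, total_steps, p_start=0.5):
    """Linearly anneal the exemplar-injection probability from
    p_start at step 0 down to 0 at total_steps.
    (Sketch of the linear schedule shown in the annealing figure.)"""
    frac = min(step / total_steps, 1.0)
    return p_start * (1.0 - frac)
```

At each training step, a rollout is exemplar-augmented with probability injection_probability(step, total_steps), so early batches lean on demonstrations and late batches contain none.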

Key Results

CBRL improves every model-environment pair we tested across Reasoning Gym and Q Programming.

Reasoning Gym

5 tasks · Qwen2.5-3B-Instruct & Llama-3.2-3B-Instruct · gains from +1.3% to +22.3%

                      Qwen 2.5-3B              Llama 3.2-3B
Environment         GRPO    CBRL      Δ      GRPO    CBRL      Δ
ARC-1D              26.00   30.67   +4.67    17.00   25.00   +8.00
Manipulate Matrix    6.00    8.00   +2.00     3.33    8.33   +5.00
Spell Backward      51.00   52.67   +1.67    95.67   97.00   +1.33
Word Sorting        53.33   75.67  +22.34    80.00   84.33   +4.33
Puzzle-24           48.00   60.67  +12.67    20.67   22.00   +1.33

Q Programming

Domain-specific language for kdb+ · qqWen-7B-Pretrain (Morgan Stanley)

Method        Valid Q (%)   Avg. Pass (%)    Pass@1 (%)
Baseline          3.1           0.3              0.0
GRPO             89.1          27.3              5.0
CBRL GRPO        80.9          43.0 (+15.7)    26.3 (+21.3)

CBRL Learns Faster & Retains the Advantage

Shaded regions mark high injection (p_t > 0.25). The advantage persists after injection stops: CBRL bootstraps durable behaviors.

[Figure: training curves, three panels]
(left) Q Programming (GRPO): qqWen-7B-Pretrain, 512 steps.
(middle) Word Sorting (GRPO): Qwen2.5-3B-Instruct, 500 steps.
(right) Word Sorting (RLOO): Qwen2.5-3B-Instruct, 500 steps.
Legend: Baseline, CBRL; shaded regions mark p_t > 0.25.

Ablations & Insights

Algorithm Agnostic

CBRL is not tied to GRPO. Under RLOO, Word Sorting jumps from 20% to 67%, Puzzle-24 from 23% to 66%.

Environment         RLOO    CBRL RLOO       Δ
ARC-1D              10.33      8.00       −2.33
Word Sorting        20.33     67.33      +47.00
Puzzle-24           23.00     66.00      +43.00
Spell Backward      63.67     89.67      +26.00
Manipulate Matrix    8.67      1.67       −7.00

How Much Injection?

Too little and you don't get the bootstrap; too much and the model never learns independence. p_start = 0.5 hits the sweet spot.
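One way to see what p_start controls: under a linear decay to zero, the average injection probability over training is p_start / 2, so p_start = 0.5 means roughly a quarter of all rollouts see demonstrations. A back-of-envelope sketch, not a figure from the paper:

```python
def mean_exemplar_fraction(p_start, total_steps=500):
    """Average injection probability over a linear decay
    p_t = p_start * (1 - t / total_steps), annealed to zero.
    Illustrative; approximates the closed form p_start / 2."""
    probs = [p_start * (1 - t / total_steps) for t in range(total_steps)]
    return sum(probs) / total_steps
```

Doubling p_start doubles the model's total exposure to demonstrations, which is where the "too much" failure mode comes from: the policy spends most of training imitating rather than exploring on its own.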

What the Model Actually Learns

Task: Sort violates, yes, already, completing, pages, duty, his, EXPRESS, duly in ascending ASCII order.

Baseline (GRPO)

Think: The words provided are: violates, yes, already, completing, pages, duty, his, EXPRESS, duly. We need to sort these words in ascending order using ASCII/Unicode ordering. The sorting will consider letter cases and special characters. Let's sort the words.

EXPRESS, already, completing, duty, his, violates, pages, duly, yes
CBRL (Ours)

Think: The words need to be sorted in ascending ASCII order. We start by comparing the ASCII values of the first characters.
'v' in "violates" = 118
'y' in "yes" = 121
'a' in "already" = 97 …
Sorting: 69, 97, 99, 100, 100, 104, 112, 118, 121
Mapping: 'E', 'a', 'c', 'd', 'd', 'h', 'p', 'v', 'y' …

EXPRESS, already, completing, duly, duty, his, pages, violates, yes
Few-Shot Example

Think: Use ASCII/Unicode lexicographic ordering: compare strings by code points left to right …
"Transform" starts with 'T' (84), while remaining words start with lowercase (≥97) …
Between "following" and "forgetful": 'l' (108) < 'r' (114), so "following" precedes "forgetful".

"799", "Transform", "business", "days", "dead", "following", "forgetful", "turning"
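Ascending ASCII order is exactly what Python's built-in sorted produces on strings, since str comparison proceeds code point by code point; the two worked answers above can be checked directly:

```python
# Task words from the worked example above.
task = ["violates", "yes", "already", "completing", "pages",
        "duty", "his", "EXPRESS", "duly"]
# Uppercase 'E' (69) precedes every lowercase letter (>= 97),
# and "duly" < "duty" because 'l' (108) < 't' (116).
print(sorted(task))  # matches the CBRL answer, not the baseline's

# Few-shot example words: digits ('7' = 55) sort before
# uppercase ('T' = 84), which sorts before lowercase.
fewshot = ["799", "Transform", "business", "days", "dead",
           "following", "forgetful", "turning"]
print(sorted(fewshot))  # matches the few-shot answer above
```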

Abstract

Reinforcement Learning from Verifiable Rewards (RLVR) suffers from exploration inefficiency, where models struggle to generate successful rollouts, resulting in minimal learning signal. This challenge is particularly severe for tasks that require the acquisition of novel reasoning patterns or domain-specific knowledge. To address this, we propose Context Bootstrapped Reinforcement Learning (CBRL), which augments RLVR training by stochastically prepending few-shot demonstrations to training prompts. The injection probability follows a curriculum: starting high to bootstrap early exploration, then annealing to zero so the model must ultimately succeed without assistance. This forces the policy to internalize reasoning patterns from the demonstrations rather than relying on them at test time. We validate CBRL across two model families and five Reasoning Gym tasks. Our results demonstrate that CBRL consistently improves success rate, provides better exploration efficiency, and is algorithm-agnostic. We further demonstrate CBRL's practical applicability by training on Q, a domain-specific programming language absent from typical pretraining corpora.