SPARKLE is a fine-grained framework for evaluating LLM reasoning improvements under RL, analyzing models along three key axes: plan-following and execution, knowledge utilization, and subproblem decomposition. The benchmark includes annotated planning skeletons, curated knowledge, and decomposed subproblems to systematically study these capabilities and the impact of problem difficulty.
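To make the annotation structure concrete, here is a minimal sketch of what a SPARKLE-style annotated instance and the per-axis probe prompts could look like. The class and function names are illustrative assumptions, not the repository's actual API.

```python
# Hypothetical sketch of an annotated instance and the probe prompts for the
# three axes; field and function names are illustrative, not the repo's API.
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class AnnotatedProblem:
    question: str
    plan_skeleton: List[str]      # ordered high-level steps (plan-following axis)
    curated_knowledge: List[str]  # relevant facts/theorems (knowledge axis)
    subproblems: List[str]        # decomposed sub-questions (decomposition axis)


def probe_prompts(p: AnnotatedProblem) -> Dict[str, object]:
    """Build one probe prompt (or list of prompts) per evaluation axis."""
    plan = "\n".join(f"{i + 1}. {step}" for i, step in enumerate(p.plan_skeleton))
    knowledge = "\n".join(f"- {fact}" for fact in p.curated_knowledge)
    return {
        "plan_following": f"{p.question}\n\nFollow this plan:\n{plan}",
        "knowledge_use": f"Useful facts:\n{knowledge}\n\nProblem: {p.question}",
        "decomposition": [f"{p.question}\n\nFirst solve: {sub}" for sub in p.subproblems],
    }
```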
Reinforcement learning (RL) has become the dominant paradigm for endowing language models with advanced reasoning capabilities. Despite the substantial empirical gains demonstrated by RL-based training methods like GRPO, a granular understanding of their advantages is still lacking. To address this gap, we introduce a fine-grained analytic framework to dissect the impact of RL on reasoning. Our framework specifically investigates key elements that have been hypothesized to benefit from RL training: (1) plan-following and execution, (2) problem decomposition, and (3) improved reasoning and knowledge utilization. Using this framework, we gain insights beyond mere accuracy. For instance, providing models with explicit step-by-step plans surprisingly degrades performance on the most challenging benchmarks, yet RL-tuned models exhibit greater robustness, experiencing markedly smaller performance drops than their base counterparts. This suggests that RL may not primarily enhance the execution of external plans but rather empower models to formulate and follow internal strategies better suited to their reasoning processes. Conversely, we observe that RL enhances the model's capacity to integrate provided knowledge into its reasoning process, leading to performance improvements across diverse tasks. We also study problem difficulty, showing that training can be improved by new ways of exploiting hard problems. Our findings lay a foundation for more principled training and evaluation of reasoning models.
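One of the difficulty-aware strategies explored in Stage 2 is augmenting hard problems with partial solutions (see the table below). The sketch here is a hedged illustration of that idea, assuming partial solutions are taken as step prefixes of a reference solution; the prefix fractions and prompt wording are assumptions, not the paper's exact recipe.

```python
# Minimal sketch of "augment hard problems with partial solutions":
# reveal the first k steps of a reference solution so the model starts
# closer to the answer. The fraction schedule is an assumption.
from typing import Dict, List


def augment_hard_problem(question: str,
                         reference_solution: List[str],
                         fractions=(0.25, 0.5, 0.75)) -> List[Dict[str, str]]:
    """Create easier variants of a hard problem by revealing a prefix of
    a reference solution in the prompt."""
    variants = []
    for frac in fractions:
        k = max(1, int(len(reference_solution) * frac))
        hint = "\n".join(reference_solution[:k])
        variants.append({
            "prompt": f"{question}\n\nPartial solution:\n{hint}\n\nContinue from here.",
            "revealed_steps": str(k),
        })
    return variants
```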
Model | AIME | AMC | MATH500 | GSM8K | Olympiad | Avg. |
---|---|---|---|---|---|---|
Qwen-2.5-Math-7B-Base | 16.67 | 42.50 | 44.03 | 42.53 | 28.65 | 35.23 |
SparkleRL-Stage 1 | 46.67 (↑30.00) | 67.50 (↑25.00) | 80.00 (↑35.97) | 91.77 (↑49.24) | 39.11 (↑10.46) | 65.01 |
SparkleRL-Stage 2 (Hard) | 41.67 (↑25.00) | 65.94 (↑23.44) | 80.50 (↑36.47) | 92.45 (↑49.92) | 37.39 (↑8.74) | 63.59 |
SparkleRL-Stage 2 (Mix) | 40.00 (↑23.33) | 63.44 (↑20.94) | 80.78 (↑36.75) | **92.52** (↑49.99) | 38.85 (↑10.20) | 63.12 |
SparkleRL-Stage 2 (Aug) | **50.42** (↑33.75) | **71.25** (↑28.75) | **81.00** (↑36.97) | 92.38 (↑49.85) | **40.11** (↑11.46) | **67.03** |
Table: Avg@8 performance across benchmarks. Best results are bolded; deltas in parentheses show the absolute gain over the base model. Stage 1 trains on the full dataset. Stage 2 explores problem difficulty via three strategies: using only hard problems, mixing difficulties, or augmenting hard problems with partial solutions.
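For reference, Avg@8 here means sampling 8 completions per problem and averaging per-sample accuracy. A minimal sketch follows; `generate` and `is_correct` are placeholders for the actual sampling and answer-checking code, which this snippet does not reproduce.

```python
# Avg@k: average accuracy over k independent samples per problem.
from typing import Callable, Dict, List


def avg_at_k(problems: List[Dict],
             generate: Callable[[str], str],
             is_correct: Callable[[str, Dict], bool],
             k: int = 8) -> float:
    """Return accuracy (in %) averaged over k samples per problem."""
    total, correct = 0, 0
    for prob in problems:
        for _ in range(k):
            answer = generate(prob["question"])
            correct += int(is_correct(answer, prob))
            total += 1
    return 100.0 * correct / total
```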
@misc{wang2025sparkle,
title={Beyond Accuracy: Dissecting Mathematical Reasoning for LLMs Under Reinforcement Learning},
author={Jiayu Wang and Yifei Ming and Zixuan Ke and Caiming Xiong and Shafiq Joty and Aws Albarghouthi and Frederic Sala},
year={2025},
eprint={2506.04723},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2506.04723},
}