Staff Research Scientist · RL & Systems for LLM Agents
Rafael Pardinas
My research focuses on scalable reinforcement learning for large language models, spanning both algorithms and training systems. I currently work on reasoning, self-improvement, efficient on-policy training at scale, long-horizon credit assignment in multi-turn RLVR, memory-augmented agents that improve through open-ended interaction, and cross-domain generalisation.
Good research and good engineering belong together: reproducible training recipes, reliable evaluations, and open-source systems that make ideas inspectable are all part of the same ecosystem.
Post-training
RLVR, domain mixtures, efficient reasoning traces
Systems
Asynchronous rollouts, on-policy freshness, distributed training
Agents
Long-horizon interaction, memory, privacy-aware deep research
Selected publications
Papers
2026 · RL post-training · First author
Apriel-Reasoner: RL Post-Training for General-Purpose and Efficient Reasoning
Reproducible multi-domain RLVR recipe for a 15B open-weight model, with adaptive domain sampling and a difficulty-aware length penalty for stronger, shorter reasoning.
Read paper
2025 · RL systems · Open source
PipelineRL: Faster On-policy Reinforcement Learning for Long Sequence Generation
Asynchronous RL infrastructure with in-flight weight updates for fast long-sequence generation while keeping training data near on-policy.
Read paper
2026 · Deep research agents · Privacy
MosaicLeaks: Privacy Risks in Querying-in-the-Open for Deep Research Agents
Benchmark and RL framework for agents that must balance task success against privacy leakage from external research queries over multi-hop local and web evidence.
Read paper
2026 · Efficient LLM serving · Apriel
Super Apriel: One Checkpoint, Many Speeds
A 15B supernet that supports multiple decoding speed-quality presets from one checkpoint, with released models, serving code, and placement tooling.
Read paper
2024 · Agent framework · Open source
TapeAgents: a Holistic Framework for Agent Development and Optimization
Tape-centered agent design for resumable state, debugging, evaluation, fine-tuning, prompt tuning, and reusable agent traces.
Read paper
Earlier RL and ML
Offline RL, functional regularization, and applied ML systems
Earlier work spans implicit offline RL, target-network regularization, active learning, and practical ML workflows for high-stakes investigation settings.
Research systems
Code and infrastructure
RLVR · Systems · Open source
PipelineRL
Distributed asynchronous reinforcement learning framework for long-horizon LLM training, with in-flight weight updates, multi-domain rollouts, tool use, and scalable post-training workflows.
Agents · Traces · Open source
TapeAgents
Framework for building, debugging, serving, and optimizing LLM agents through structured, replayable tapes that connect engineering traces back to model improvement.
Evaluation · Agent training
CUBE-harness
Evaluation and training infrastructure for long-horizon LLM agents, focused on repeatable measurement, agentic tasks, and systems that improve through interaction.
Research taste
Current focus
- RL for reasoning models and LLM agents
- Multi-turn RLVR and long-horizon credit assignment
- Efficient on-policy training at distributed scale
- Memory-augmented agents and open-ended interaction
- Cross-domain generalisation and robust evaluation loops
- Privacy-aware deep research agents
Engineering lens
How I work
My background combines applied AI research with production software, distributed systems, networking, and infrastructure. I am most interested in research ideas that can be made concrete: implemented, measured, debugged, scaled, and released in a form other people can build on.