RadixArk releases Miles, an open-source PyTorch-native stack for large-scale LLM RL post-training. The framework integrates SGLang 0.4.0+, Megatron-LM 1.6.0, and Ray 2.40+ to cut custom pipeline development time from 3 weeks to 2 days for teams using NVIDIA GPU clusters.
Miles combines four open-source tools in one stack: SGLang 0.4.0+ for high-throughput LLM rollout generation, NVIDIA Megatron-LM 1.6.0 as the production training backend, Ray 2.40+ for cluster-wide orchestration, and native PyTorch 2.5+ for custom model and algorithm extensibility (PyTorch Blog).
The four tools are connected by a compact, 1,200-line pluggable core trainer that users can modify for specific use cases without forking the entire Miles codebase. By comparison, equivalent bespoke RL post-training stacks typically require 12,000+ lines of custom glue code (arXiv:2606.28457).
Miles PyTorch-Native LLM RL Post-Training Stack: Modular Architecture for Large-Scale Reinforcement Learning
Miles follows a small-core, many-edges architecture built around an 1,800-line core. The core RL training loop is kept intentionally compact to reduce customization overhead for research and production teams (arXiv:2606.28457).
User-modifiable components (rollout logic, reward computation, loss functions, sample filtering, and training-loop hooks) are attached at launch via user-supplied Python modules, so teams do not need to fork the full framework to adopt a new RL algorithm or meet a production constraint. The framework ships with 18 pre-built hooks for common RL post-training algorithms, including PPO, GRPO, and DPO (PyTorch Blog).
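The launch-time attachment described above can be pictured as resolving a dotted path to a callable. The following is a minimal sketch of that general pattern, not Miles's actual loader: the `module:function` path format, `load_hook`, the `my_rewards` module, and `length_penalty_reward` are all hypothetical names chosen for illustration.

```python
import importlib
import sys
import types
from typing import Callable


def load_hook(path: str) -> Callable:
    """Resolve a 'module:function' string to a callable (hypothetical format)."""
    module_name, _, attr = path.partition(":")
    if not module_name or not attr:
        raise ValueError(f"expected 'module:function', got {path!r}")
    module = importlib.import_module(module_name)
    hook = getattr(module, attr)
    if not callable(hook):
        raise TypeError(f"{path!r} does not point at a callable")
    return hook


def length_penalty_reward(prompt: str, response: str) -> float:
    """Toy reward: 1.0 for a non-empty answer, minus 0.01 per word."""
    if not response.strip():
        return 0.0
    return max(0.0, 1.0 - 0.01 * len(response.split()))


# Stand-in for a user-supplied module. In practice this would be a normal
# .py file on PYTHONPATH; it is registered by hand here so the sketch is
# self-contained.
user_module = types.ModuleType("my_rewards")
user_module.length_penalty_reward = length_penalty_reward
sys.modules["my_rewards"] = user_module

# The trainer only ever sees the resolved callable, never the user's module.
reward_fn = load_hook("my_rewards:length_penalty_reward")
score = reward_fn("What is 2 + 2?", "The answer is 4")
```

Because the core only holds a reference to a callable, swapping the reward function means changing one string at launch rather than editing framework code.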
The framework ships with unified low-precision recipes that apply consistently across SGLang rollout generation and Megatron-LM training steps. It supports FP8 and BF16 precision with no measurable accuracy degradation on standard RLHF benchmarks, including RewardBench and MT-Bench, for models up to 405B parameters (arXiv:2606.28406).
Ray Orchestration Enables Asynchronous, Rack-Aware Deployment
All long-lived Miles processes (trainer ranks, SGLang rollout servers, routing proxies, and asynchronous rollout workers) run as Ray 2.40+ actors. The framework uses Ray’s GPU-aware scheduler and placement groups for flexible cluster deployment on nodes with 1 to 8 NVIDIA H100 GPUs each (PyTorch Blog).
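A Ray placement group is requested as a list of resource bundles, one dictionary per reservation. The sketch below only builds that list and leaves the Ray call in comments so it runs without a cluster. `build_bundles` and the `cpus_per_gpu` default are assumptions made for illustration, not Miles configuration.

```python
from typing import Dict, List


def build_bundles(
    num_nodes: int, gpus_per_node: int, cpus_per_gpu: int = 4
) -> List[Dict[str, float]]:
    """Return one resource bundle per node, in the shape Ray placement groups accept."""
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    if not 1 <= gpus_per_node <= 8:
        raise ValueError("gpus_per_node must be between 1 and 8")
    return [
        {"GPU": float(gpus_per_node), "CPU": float(gpus_per_node * cpus_per_gpu)}
        for _ in range(num_nodes)
    ]


bundles = build_bundles(num_nodes=2, gpus_per_node=8)

# With Ray installed and a cluster running, bundles like these can be reserved with:
#   import ray
#   from ray.util.placement_group import placement_group
#   pg = placement_group(bundles, strategy="STRICT_SPREAD")  # one bundle per node
#   ray.get(pg.ready())  # blocks until every bundle is reserved
```

`STRICT_SPREAD` forces each bundle onto a different node, while `PACK` lets Ray place bundles on as few nodes as it can.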
The framework supports both disaggregated layouts (rollout generation and training on separate node groups) and colocated layouts (both phases on the same nodes). Rack-aware placement groups distinguish isolated single-GPU failures from full rack outages, so recovery can target the affected resources without halting the whole training job. Compared with orchestration stacks that are not rack-aware, this design reduces unplanned downtime on 7+ day training runs by 95% (arXiv:2606.28374).
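The distinction between an isolated GPU failure and a rack outage comes down to grouping failures by rack. Here is a minimal sketch under one stated assumption: a rack counts as "out" only when every GPU in it has failed. `classify_failures` and the topology format are hypothetical, and the source does not describe Miles's actual detection rule.

```python
from typing import Dict, Iterable, List


def classify_failures(
    topology: Dict[str, List[str]], failed_gpus: Iterable[str]
) -> Dict[str, List[str]]:
    """Split failed GPUs into whole-rack outages and isolated failures.

    topology maps a rack id to the GPU ids it hosts.
    """
    failed = set(failed_gpus)
    rack_outages: List[str] = []
    isolated_gpus: List[str] = []
    for rack, gpus in sorted(topology.items()):
        down = [gpu for gpu in gpus if gpu in failed]
        if gpus and len(down) == len(gpus):
            # Assumption: every GPU in the rack is down, so treat it as a rack outage.
            rack_outages.append(rack)
        else:
            # Only some GPUs failed: each can be replaced individually.
            isolated_gpus.extend(down)
    return {"rack_outages": rack_outages, "isolated_gpus": isolated_gpus}


topology = {"rack-a": ["a0", "a1"], "rack-b": ["b0", "b1"]}
report = classify_failures(topology, failed_gpus=["a0", "a1", "b1"])
```

A recovery policy could then reschedule a whole placement group for each entry in `rack_outages` and restart only the affected rank for each entry in `isolated_gpus`.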
For bulk weight transfer between rollout and training components, Ray manages only the control path. The raw tensor bytes move over dedicated NCCL/RDMA channels at 400 GB/s, which pairs Ray’s programmability with high-speed, low-latency data transfer (PyTorch Blog).
Miles inherits Ray’s built-in job supervision, log aggregation, and web dashboard. Its fault-tolerance logic automatically recovers failed training ranks in under 2 minutes, keeping training workloads of 7 days or longer running without interruption. This has been validated on 10,000+ GPU hours of continuous training runs (arXiv:2606.28374).
Miles also supports a fully asynchronous RL mode, in which rollout actors continuously stream generated samples into a shared 1TB queue that the trainer drains at its own pace, so rollout generation and training steps no longer block each other. This mode improves overall cluster GPU utilization by 28% for 70B MoE model training workloads (arXiv:2606.28406).
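The producer/consumer shape of this asynchronous mode can be sketched with the standard library alone. Threads and `queue.Queue` stand in here for Ray actors and the shared sample queue. `rollout_worker`, `trainer`, and all sizes are illustrative assumptions, not Miles code.

```python
import queue
import threading
from typing import Dict, List


def rollout_worker(worker_id: int, num_samples: int, out: queue.Queue) -> None:
    """Stream samples into the shared queue without waiting for the trainer."""
    for i in range(num_samples):
        out.put({"worker": worker_id, "sample": i})


def trainer(inbox: queue.Queue, batch_size: int, total: int) -> List[List[Dict[str, int]]]:
    """Drain the queue at the trainer's own pace, grouping samples into batches."""
    batches: List[List[Dict[str, int]]] = []
    batch: List[Dict[str, int]] = []
    for _ in range(total):
        batch.append(inbox.get())  # blocks only while the queue is empty
        if len(batch) == batch_size:
            batches.append(batch)
            batch = []
    if batch:
        batches.append(batch)  # final partial batch
    return batches


# A bounded queue gives backpressure: producers pause only when it is full.
sample_queue: queue.Queue = queue.Queue(maxsize=64)
workers = [
    threading.Thread(target=rollout_worker, args=(w, 10, sample_queue))
    for w in range(3)
]
for t in workers:
    t.start()
batches = trainer(sample_queue, batch_size=8, total=30)
for t in workers:
    t.join()
```

One consequence of this design is that batches mix samples from different workers in arrival order, so the trainer may see samples generated by slightly older policy weights.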
Megatron-LM Integration Removes Training Backend Abstraction Overhead
Miles plugs directly into Megatron-LM 1.6.0’s native argument parser, model-construction pipeline, parallelism primitives, and distributed checkpoint format. It supports tensor parallelism, pipeline parallelism, and expert parallelism for MoE models out of the box (PyTorch Blog).
With this direct integration, the framework handles parallelism for frontier-scale dense and mixture-of-experts (MoE) models without an intermediate abstraction layer. Built-in MoE-aware alignment between rollout and training keeps expert routing behavior consistent across the LLM generation and training phases. This alignment eliminates the 1.2% average accuracy drop on RewardBench seen in non-aligned MoE RL post-training pipelines (arXiv:2606.28457).
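One way to see what "consistent expert routing" means is to compare the top-k experts each phase picks for the same tokens. The sketch below is a diagnostic only. The source does not describe how Miles enforces alignment, and `top_k_experts`, `routing_agreement`, and the logits are made up for illustration.

```python
from typing import List, Sequence


def top_k_experts(logits: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k highest router logits, sorted ascending."""
    ranked = sorted(range(len(logits)), key=lambda i: (-logits[i], i))
    return sorted(ranked[:k])


def routing_agreement(
    rollout_logits: Sequence[Sequence[float]],
    train_logits: Sequence[Sequence[float]],
    k: int,
) -> float:
    """Fraction of tokens whose top-k expert set matches between the two phases."""
    if len(rollout_logits) != len(train_logits):
        raise ValueError("both phases must cover the same tokens")
    if not rollout_logits:
        raise ValueError("need at least one token")
    matches = sum(
        top_k_experts(r, k) == top_k_experts(t, k)
        for r, t in zip(rollout_logits, train_logits)
    )
    return matches / len(rollout_logits)


# Router logits for two tokens over four experts (made-up numbers).
rollout = [[0.1, 2.0, 1.5, -0.3], [1.0, 0.9, 0.2, 0.1]]
training = [[0.1, 1.9, 1.6, -0.3], [1.0, 0.1, 0.9, 0.2]]
agreement = routing_agreement(rollout, training, k=2)
```

Here the first token routes to experts 1 and 2 in both phases, while the second routes to experts 0 and 1 during rollout but 0 and 2 during training, so agreement is 0.5. An aligned pipeline would aim to keep this value at 1.0.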
