RFC: Add SGLang backend to NeMo-RL

Feishu user 1203

Feishu user 4781

Modified on November 19

Motivation

To improve NeMo-RL's compatibility with the rapidly growing SGLang RL community, we propose integrating SGLang as a supported, high-performance rollout backend. SGLang brings a large, fast-growing RL user base and proven industry adoption. Integrating SGLang and co-designing our RL systems would benefit the NeMo ecosystem by:

1. attracting the SGLang community of developers and researchers, expanding NeMo-RL's user base and influence;

2. improving training efficiency by leveraging SGLang's high-throughput data collection capabilities, and keeping NeMo-RL at the forefront of the field through collaboration with SGLang's partners.

Proposal

To integrate SGLang cleanly into NeMo RL's inference subsystem, we propose implementing a full backend under nemo_rl/models/generation/sglang/.

The following components will be introduced:

• config.py: defines the configuration arguments (knobs) that SGLang needs.

• sglang_generation.py: the SGLang implementation of GenerationInterface.
  ◦ init: validate SglangConfig, set up Ray placement groups and the RayWorkerGroup, and create worker actors with the right runtime environment (the Python environment for the SGLang process).
  ◦ lifecycle methods: expose prepare_for_generation / generate / finish_generation, which call down to the workers.
  ◦ refit hooks: expose prepare_refit_info / update_weights_from_tensor / update_weights_from_distributed, which call down to the workers.

• sglang_worker.py: the Ray worker that launches and manages an SGLang engine. We plan to implement the following methods:
  ◦ init: store the config and rank, and defer engine setup.
  ◦ post_init: build the SGLang arguments and start the engine.
  ◦ generate: translate a BatchedDataDict into SGLang generate requests and wrap the responses back into generation outputs.
  ◦ prepare_refit_info: cache tensor metadata for upcoming weight pushes.
  ◦ update_weights_from_tensor: call SGLang's update_weights_from_tensor with serialized tensors and cache-flush controls, for co-located placement.
  ◦ update_weights_from_distributed: call SGLang's update_weights_from_distributed API, for disaggregated placement.
  ◦ sleep / wake_up: implement sleep() via SGLang's pause_generation + release_memory_occupation, and wake_up() via resume_memory_occupation + continue_generation.
  ◦ init_collective: a stub or minimal hook to satisfy GenerationInterface.
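The worker lifecycle described above (deferred engine start, then sleep / wake_up built from the four engine calls) can be sketched roughly as follows. This is a minimal illustration, not the planned implementation: SGLangWorkerSketch, RecordingEngine, and the engine_factory parameter are hypothetical names introduced here, and RecordingEngine is a stand-in that only records call order rather than a real SGLang engine.

```python
from typing import Any, Callable, Dict, List, Optional


class RecordingEngine:
    """Hypothetical stand-in for an SGLang engine; it only records call order."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    def pause_generation(self) -> None:
        self.calls.append("pause_generation")

    def release_memory_occupation(self) -> None:
        self.calls.append("release_memory_occupation")

    def resume_memory_occupation(self) -> None:
        self.calls.append("resume_memory_occupation")

    def continue_generation(self) -> None:
        self.calls.append("continue_generation")


class SGLangWorkerSketch:
    """Illustrative worker skeleton (hypothetical, not the NeMo RL class)."""

    def __init__(
        self,
        config: Dict[str, Any],
        rank: int,
        engine_factory: Callable[[Dict[str, Any]], Any],
    ) -> None:
        # init: store config and rank only; the engine is not started yet.
        self.config = config
        self.rank = rank
        self._engine_factory = engine_factory
        self.engine: Optional[Any] = None

    def post_init(self) -> None:
        # post_init: build the engine arguments and start the engine.
        self.engine = self._engine_factory(self.config)

    def _require_engine(self) -> Any:
        if self.engine is None:
            raise RuntimeError("post_init() must be called before using the engine")
        return self.engine

    def sleep(self) -> None:
        # Stop accepting work first, then release GPU memory.
        engine = self._require_engine()
        engine.pause_generation()
        engine.release_memory_occupation()

    def wake_up(self) -> None:
        # Reverse order: restore memory first, then resume generation.
        engine = self._require_engine()
        engine.resume_memory_occupation()
        engine.continue_generation()
```

Injecting the engine through a factory keeps the sketch testable without a GPU; a real worker would construct the SGLang engine inside post_init from the validated config.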

Timeline

People Involved

From the SGLang side, Xinyi Song and Tianrui Liu will be the core developers. Chenyang Zhao, leader of the SGLang RL team, will be the point of contact (PoC). It would be helpful to have an experienced engineer from the NeMo-RL team co-develop with us.

Future Plan

1. NeMo RL provides a GenerationInterface that facilitates integrating other inference engines. In the current structure, however, the generation modules concentrate Ray orchestration, worker glue, sampling, and refit hooks in a few large files, so a new backend must mirror a lot of logic instead of plugging into a smaller abstraction. Some vLLM-specific branches also leak into distributed/worker_groups.py and policy/. Introducing a BaseGenerationBackend plus a simple backend registry would centralize common behavior and make adding engines such as SGLang much easier.

2. Engine-based RL still inherits several structural problems: the rollout logic is tightly bound to the inference engine, upgrades require synchronized changes across frameworks, and multi-turn rollouts suffer from strict process-level barriers. SGLang proposes addressing these with an HTTP-based inference server. A server API decouples RL logic from the engine, which lets SGLang or any other backend evolve independently, reduces the dependency on Ray, simplifies integration with external frameworks, and supports distributed, asynchronous rollout pipelines that avoid the bottlenecks of the engine-based design.
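The BaseGenerationBackend and backend registry mentioned in item 1 might look something like the sketch below. Only the name BaseGenerationBackend comes from this RFC; register_backend, create_backend, the single generate method, and the EchoBackend example are assumptions made for illustration, and the real interface would carry the lifecycle and refit methods as well.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, List, Type


class BaseGenerationBackend(ABC):
    """Shared base class; common orchestration logic would live here."""

    @abstractmethod
    def generate(self, prompts: List[str]) -> List[str]:
        """Return one completion per prompt."""


# Hypothetical registry mapping a backend name to its class.
_BACKENDS: Dict[str, Type[BaseGenerationBackend]] = {}


def register_backend(
    name: str,
) -> Callable[[Type[BaseGenerationBackend]], Type[BaseGenerationBackend]]:
    """Class decorator that records a backend under the given name."""

    def decorator(cls: Type[BaseGenerationBackend]) -> Type[BaseGenerationBackend]:
        if name in _BACKENDS:
            raise ValueError(f"backend {name!r} is already registered")
        _BACKENDS[name] = cls
        return cls

    return decorator


def create_backend(name: str, **kwargs) -> BaseGenerationBackend:
    """Instantiate a registered backend; raises KeyError for unknown names."""
    if name not in _BACKENDS:
        raise KeyError(f"unknown backend {name!r}; known: {sorted(_BACKENDS)}")
    return _BACKENDS[name](**kwargs)


@register_backend("echo")
class EchoBackend(BaseGenerationBackend):
    """Toy backend used only to exercise the registry."""

    def generate(self, prompts: List[str]) -> List[str]:
        return list(prompts)
```

With this shape, an SGLang backend would register itself under a name such as "sglang", and callers would select it by configuration instead of branching on engine type in shared modules.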

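For the HTTP-based direction in item 2, the RL side would only need a thin client. The sketch below uses the Python standard library and assumes the SGLang server exposes a POST /generate endpoint that accepts a JSON body with "text" and "sampling_params"; treat that request shape, the helper names build_generate_request and generate, and the default values as assumptions to verify against the deployed SGLang version.

```python
import json
import urllib.request
from typing import Any, Dict


def build_generate_request(
    base_url: str,
    prompt: str,
    max_new_tokens: int = 32,
    temperature: float = 0.0,
) -> urllib.request.Request:
    """Build (but do not send) a POST request for the assumed /generate endpoint."""
    payload = {
        "text": prompt,
        "sampling_params": {
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
        },
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(base_url: str, prompt: str, timeout: float = 60.0, **kwargs: Any) -> Dict[str, Any]:
    """Send the request to a running server and decode the JSON response."""
    request = build_generate_request(base_url, prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Because the rollout code depends only on the HTTP contract, the server can be upgraded or swapped without touching the training process, which is the decoupling the item describes.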