verl-rl-training

Provides guidance for training LLMs with reinforcement learning using verl (Volcano Engine RL). Use when implementing RLHF, GRPO, PPO, or other RL algorithms for LLM post-training at scale with flexible infrastructure backends.

Packaged view

This page reorganizes the original catalog entry around fit, installability, and workflow context first. The original raw source lives below.

Stars: 5,196.

Hot score: 99.

Updated: March 20, 2026.

Overall rating: C (4.0).

Composite score: 4.0.

Best-practice grade: B (72.0).

Install command

npx @skill-hub/cli install orchestra-research-ai-research-skills-verl
Tags: Reinforcement Learning, RLHF, GRPO, PPO, Post-Training, Distributed Training

Repository

Orchestra-Research/AI-Research-SKILLs

Skill path: 06-post-training/verl

Best for

Primary workflow: Ship Full Stack.

Technical facets: Full Stack.

Target audience: everyone.

License: MIT.

Original source

Catalog source: SkillHub Club.

Repository owner: Orchestra-Research.

This is still a mirrored public skill entry. Review the repository before installing into production workflows.

What it helps with

  • Install verl-rl-training into Claude Code, Codex CLI, Gemini CLI, or OpenCode workflows
  • Review https://github.com/Orchestra-Research/AI-Research-SKILLs before adding verl-rl-training to shared team environments
  • Use verl-rl-training for development workflows

Works across

Claude Code, Codex CLI, Gemini CLI, OpenCode

Favorites: 0.

Sub-skills: 0.

Aggregator: No.

Original source / Raw SKILL.md

---
name: verl-rl-training
description: Provides guidance for training LLMs with reinforcement learning using verl (Volcano Engine RL). Use when implementing RLHF, GRPO, PPO, or other RL algorithms for LLM post-training at scale with flexible infrastructure backends.
version: 1.0.0
author: Orchestra Research
license: MIT
tags: [Reinforcement Learning, RLHF, GRPO, PPO, Post-Training, Distributed Training]
dependencies: [verl>=0.3.0, torch>=2.0.0, ray>=2.41.0, vllm>=0.8.2, transformers>=4.40.0]
---

# verl: Volcano Engine Reinforcement Learning for LLMs

verl is a flexible, efficient, and production-ready RL training library for large language models from ByteDance's Seed team. It implements the HybridFlow framework (EuroSys 2025) and powers models such as Doubao-1.5-pro, which reaches O1-level performance on math benchmarks.

## When to Use verl

**Choose verl when you need:**
- Production-ready RL training at scale (tested up to 671B parameters)
- Flexibility to swap backends (FSDP ↔ Megatron-LM ↔ vLLM ↔ SGLang)
- Support for multiple RL algorithms (PPO, GRPO, RLOO, REINFORCE++, DAPO)
- Multi-turn rollout with tool calling for agentic workflows
- Vision-language model RL training

**Consider alternatives when:**
- You need Megatron-native training → use **slime** or **miles**
- You want PyTorch-native abstractions with Monarch → use **torchforge**
- You only need simple SFT/DPO → use **TRL** or **Axolotl**

## Key Features

- **Training backends**: FSDP, FSDP2, Megatron-LM
- **Rollout engines**: vLLM, SGLang, HuggingFace Transformers
- **Algorithms**: PPO, GRPO, DAPO, RLOO, ReMax, REINFORCE++, SPIN, SPPO
- **Models**: Qwen-3, Llama-3.1, DeepSeek, Gemma-2 (0.5B to 671B)
- **Advanced**: LoRA RL, sequence parallelism, expert parallelism, multi-turn tools

## Installation

```bash
# Option 1: pip install
pip install "verl[vllm]"  # or "verl[sglang]" for the SGLang backend

# Option 2: Docker (recommended for production)
docker pull verlai/verl:vllm011.latest

# Option 3: From source
git clone https://github.com/volcengine/verl.git
cd verl && pip install -e ".[vllm,math]"
```

## Quick Start: GRPO Training

```bash
python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    data.train_files=~/data/gsm8k/train.parquet \
    actor_rollout_ref.model.path=Qwen/Qwen2.5-7B \
    actor_rollout_ref.rollout.n=8 \
    actor_rollout_ref.actor.use_kl_loss=True \
    trainer.n_gpus_per_node=8
```

## Core Architecture

verl uses a **HybridFlow** programming model separating control flow from computation:

```
┌─────────────────────────────────────────────────────────┐
│ Single-Process Controller (Ray)                         │
│ - Orchestrates: rollout → reward → train → sync        │
└─────────────────────┬───────────────────────────────────┘
                      │
┌─────────────────────▼───────────────────────────────────┐
│ Multi-Process Workers                                   │
│ ├── ActorRolloutRefWorker (policy + generation)        │
│ ├── CriticWorker (value estimation, PPO only)          │
│ └── RewardManager (model-based or rule-based rewards)  │
└─────────────────────────────────────────────────────────┘
```
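
The orchestration order in the diagram can be illustrated with a toy single-process loop. This is only a sketch: `StubRolloutWorker`, `StubTrainer`, `stub_reward`, and `control_loop` are hypothetical stand-ins, not verl classes, and the real controller dispatches these steps to Ray worker groups.

```python
from dataclasses import dataclass


@dataclass
class StubRolloutWorker:
    """Hypothetical stand-in for the generation side of a rollout worker."""
    weights_version: int = 0

    def generate(self, prompts):
        # A real worker would sample from the policy; here we tag the prompt.
        return [f"{p} -> answer(v{self.weights_version})" for p in prompts]

    def load_weights(self, version):
        self.weights_version = version


@dataclass
class StubTrainer:
    """Hypothetical stand-in for the training side of the actor."""
    version: int = 0

    def update(self, responses, rewards):
        # A real worker would run a policy-gradient step here.
        self.version += 1
        return self.version


def stub_reward(responses):
    # Placeholder rule-based reward: every response scores 1.0.
    return [1.0 for _ in responses]


def control_loop(prompts, steps):
    rollout, trainer = StubRolloutWorker(), StubTrainer()
    log = []
    for _ in range(steps):
        responses = rollout.generate(prompts)         # rollout
        rewards = stub_reward(responses)              # reward
        version = trainer.update(responses, rewards)  # train
        rollout.load_weights(version)                 # sync
        log.append((responses[0], version))
    return log


history = control_loop(["2+2?"], steps=2)
for response, version in history:
    print(response, "| trained to version", version)
```

Each iteration generates with the weights produced by the previous update, which is why the sync step closes the loop.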

---

## Workflow 1: Math Reasoning with GRPO

Use this workflow for training reasoning models on math tasks like GSM8K or MATH.

### Prerequisites Checklist
- [ ] GPU cluster with 8+ GPUs (H100 recommended)
- [ ] Dataset in parquet format with `prompt` and `reward_model` columns
- [ ] Base model from HuggingFace Hub

### Step 1: Prepare Dataset

```python
import pandas as pd

data = [
    {
        "prompt": [{"role": "user", "content": "What is 15 + 27?"}],
        "reward_model": {"ground_truth": "42"}
    },
    # ... more examples
]
df = pd.DataFrame(data)
df.to_parquet("train.parquet")
```
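
Before writing the file, it can help to check each record against the two columns listed in the prerequisites. The sketch below uses only the standard library; `validate_record` is a hypothetical helper, and its checks reflect the example schema above rather than everything verl may require.

```python
def validate_record(record):
    """Return a list of problems found in one training record (empty if none)."""
    problems = []

    prompt = record.get("prompt")
    if not isinstance(prompt, list) or not prompt:
        problems.append("prompt must be a non-empty list of chat messages")
    else:
        for i, msg in enumerate(prompt):
            if not isinstance(msg, dict) or not {"role", "content"} <= set(msg):
                problems.append(f"prompt[{i}] needs 'role' and 'content' keys")

    reward_model = record.get("reward_model")
    if not isinstance(reward_model, dict) or "ground_truth" not in reward_model:
        problems.append("reward_model must be a dict with a 'ground_truth' key")

    return problems


good = {
    "prompt": [{"role": "user", "content": "What is 15 + 27?"}],
    "reward_model": {"ground_truth": "42"},
}
bad = {"prompt": "What is 15 + 27?"}

print(validate_record(good))
print(validate_record(bad))
```

Running the check over the whole list before calling `to_parquet` surfaces malformed rows early, before a training job has been launched.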

### Step 2: Define Reward Function

```python
# reward_function.py
import re

def compute_reward(responses, ground_truths):
    rewards = []
    for response, gt in zip(responses, ground_truths):
        # Extract the first \boxed{...} answer (does not handle nested braces)
        match = re.search(r'\\boxed\{([^}]+)\}', response)
        if match and match.group(1).strip() == gt.strip():
            rewards.append(1.0)
        else:
            rewards.append(0.0)
    return rewards
```
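
The regex above stops at the first closing brace, so an answer such as `\boxed{\frac{1}{2}}` would be cut short. One possible alternative, shown here as a sketch, tracks brace depth instead. `extract_boxed` and `score_response` are hypothetical names, and taking the last `\boxed{...}` in the response is an assumption about where the final answer appears.

```python
def extract_boxed(text):
    """Return the contents of the last \\boxed{...} in text, or None."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth, chars = 1, []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    return None  # braces never balanced


def score_response(response, ground_truth):
    """Binary reward: 1.0 on an exact match of the boxed answer, else 0.0."""
    answer = extract_boxed(response)
    if answer is not None and answer.strip() == ground_truth.strip():
        return 1.0
    return 0.0


print(extract_boxed(r"x = \boxed{\frac{1}{2}}"))  # \frac{1}{2}
print(score_response(r"So the total is \boxed{ 42 }", "42"))  # 1.0
```

Exact string matching is still strict: `0.5` and `\frac{1}{2}` would not match, so numeric normalization may be worth adding for math datasets.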

### Step 3: Create Training Config

```yaml
# config/grpo_math.yaml
algorithm:
  adv_estimator: grpo
  gamma: 1.0
  lam: 1.0

data:
  train_files: /path/to/train.parquet
  val_files: /path/to/val.parquet
  train_batch_size: 256
  max_prompt_length: 512
  max_response_length: 2048

actor_rollout_ref:
  model:
    path: Qwen/Qwen2.5-7B-Instruct
  actor:
    use_kl_loss: true
    kl_loss_coef: 0.001
    ppo_mini_batch_size: 64
  rollout:
    name: vllm
    n: 8  # samples per prompt
    temperature: 0.7
    top_p: 0.95

trainer:
  total_epochs: 3
  n_gpus_per_node: 8
  save_freq: 100
```
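
The batch settings in this config interact, and a quick calculation shows the scale of one training step. The figures below follow directly from the values above. Treating `ppo_mini_batch_size` as a count of prompts per actor update is an assumption, so check the verl documentation for the exact semantics in your version.

```python
# Values copied from config/grpo_math.yaml above.
train_batch_size = 256       # prompts per training step
rollout_n = 8                # sampled responses per prompt
ppo_mini_batch_size = 64     # assumed to be counted in prompts
max_prompt_length = 512
max_response_length = 2048

responses_per_step = train_batch_size * rollout_n
max_tokens_per_sequence = max_prompt_length + max_response_length
max_tokens_per_step = responses_per_step * max_tokens_per_sequence
actor_updates_per_step = train_batch_size // ppo_mini_batch_size

print("responses per step:", responses_per_step)              # 2048
print("max tokens per sequence:", max_tokens_per_sequence)    # 2560
print("upper bound on tokens per step:", max_tokens_per_step) # 5242880
print("actor updates per step:", actor_updates_per_step)      # 4
```

The token figure is an upper bound, since most responses end before `max_response_length`.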

### Step 4: Launch Training

```bash
python3 -m verl.trainer.main_ppo \
    --config-path config \
    --config-name grpo_math \
    trainer.experiment_name=grpo_math_qwen7b
```

### Step 5: Monitor and Validate
- [ ] Check WandB/TensorBoard for loss curves
- [ ] Verify reward is increasing over steps
- [ ] Run evaluation on held-out test set

---

## Workflow 2: PPO with Critic Model

Use this workflow when you need value-based advantage estimation (GAE).

### Key Differences from GRPO
- Requires separate critic model
- Uses Generalized Advantage Estimation (GAE)
- Better for tasks with dense rewards

### Configuration

```yaml
algorithm:
  adv_estimator: gae  # Use GAE instead of GRPO
  gamma: 0.99
  lam: 0.95

critic:
  model:
    path: Qwen/Qwen2.5-7B-Instruct  # Can be same or different from actor
  ppo_mini_batch_size: 64

actor_rollout_ref:
  actor:
    use_kl_loss: true
    kl_loss_coef: 0.02
    clip_ratio: 0.2  # PPO clipping
```
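
To make the roles of `gamma` and `lam` concrete, here is a minimal pure-Python sketch of Generalized Advantage Estimation over a single trajectory. `gae_advantages` is a hypothetical helper, not verl's implementation; it assumes the value after the final step is zero and ignores masking and batching.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Compute GAE advantages for one trajectory, working backwards in time."""
    advantages = [0.0] * len(rewards)
    next_value = 0.0  # assumed value after the last step
    running = 0.0     # discounted sum of future TD errors
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # TD error
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


# Sparse reward at the final step, with critic estimates for each step.
advantages = gae_advantages(rewards=[0.0, 0.0, 1.0], values=[0.2, 0.4, 0.7])
print(advantages)
```

Setting `gamma=1.0` and `lam=1.0` reduces each advantage to the remaining return minus the value estimate at that step, which is the undiscounted Monte Carlo case.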

### Launch with Critic

```bash
python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=gae \
    critic.model.path=Qwen/Qwen2.5-7B-Instruct \
    trainer.n_gpus_per_node=8
```

---

## Workflow 3: Large-Scale Training with Megatron

Use this workflow for models >70B parameters or when you need expert parallelism.

### Prerequisites
- [ ] Install Megatron-LM bridge: `pip install mbridge`
- [ ] Convert model to Megatron format
- [ ] Multi-node cluster with NVLink/InfiniBand

### Configuration for 70B+ Models

```yaml
actor_rollout_ref:
  model:
    path: /path/to/megatron/checkpoint
    backend: megatron
  actor:
    strategy: megatron
    tensor_model_parallel_size: 8
    pipeline_model_parallel_size: 2
  rollout:
    name: vllm
    tensor_parallel_size: 8
```

### Launch Multi-Node

```bash
# On head node
ray start --head --port=6379

# On worker nodes
ray start --address='head_ip:6379'

# Launch training
python3 -m verl.trainer.main_ppo \
    trainer.nnodes=4 \
    trainer.n_gpus_per_node=8
```
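
A quick sanity check on the parallelism settings: the tensor-parallel and pipeline-parallel sizes from the config must divide the total GPU count from the launch command. The sketch below assumes the two snippets are used together and ignores expert and context parallelism; `data_parallel_size` is a hypothetical helper.

```python
def data_parallel_size(nnodes, gpus_per_node, tp, pp):
    """Number of model replicas that fit on the cluster for a given TP x PP layout."""
    world_size = nnodes * gpus_per_node
    gpus_per_replica = tp * pp
    if world_size % gpus_per_replica != 0:
        raise ValueError(
            f"{world_size} GPUs cannot be split into replicas of {gpus_per_replica}"
        )
    return world_size // gpus_per_replica


# trainer.nnodes=4, trainer.n_gpus_per_node=8,
# tensor_model_parallel_size=8, pipeline_model_parallel_size=2
print(data_parallel_size(nnodes=4, gpus_per_node=8, tp=8, pp=2))  # 2
```

With these numbers each replica spans 16 GPUs, so a tensor-parallel group of 8 fits inside one node and the pipeline stages cross node boundaries.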

---

## Configuration Reference

### Algorithm Selection

| Algorithm | `adv_estimator` | Use Case |
|-----------|-----------------|----------|
| GRPO | `grpo` | Critic-free, math/reasoning |
| PPO/GAE | `gae` | Dense rewards, value estimation |
| REINFORCE++ | `reinforce_plus_plus` | Variance reduction |
| RLOO | `rloo` | Leave-one-out baseline |
| ReMax | `remax` | Maximum reward baseline |
| OPO | `opo` | On-policy RL with an optimal reward baseline |
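
The critic-free estimators in the table differ mainly in the baseline they subtract from each reward. As a sketch of the idea for one prompt's group of sampled responses, the functions below show a group-normalized (GRPO-style) advantage and a leave-one-out (RLOO-style) advantage. The function names are hypothetical, the population standard deviation is an assumption (implementations differ on biased versus unbiased variance), and token-level details are omitted.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Normalize each reward by the mean and standard deviation of its group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def rloo_advantages(rewards):
    """Subtract the mean reward of the other samples in the group."""
    n, total = len(rewards), sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]


group = [1.0, 0.0, 0.0, 1.0]  # binary rewards for four samples of one prompt
print(grpo_advantages(group))
print(rloo_advantages(group))
```

Both need more than one sample per prompt. If every response in a group receives the same reward, both estimators return zeros and that prompt contributes no learning signal.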

### Key Parameters

```yaml
# Rollout parameters
actor_rollout_ref.rollout.n: 8              # Samples per prompt
actor_rollout_ref.rollout.temperature: 0.7  # Sampling temperature
actor_rollout_ref.rollout.top_p: 0.95       # Nucleus sampling

# Training parameters
actor_rollout_ref.actor.lr: 1e-6            # Learning rate
actor_rollout_ref.actor.ppo_mini_batch_size: 64
actor_rollout_ref.actor.clip_ratio: 0.2     # PPO clip range

# KL control
actor_rollout_ref.actor.use_kl_loss: true
actor_rollout_ref.actor.kl_loss_coef: 0.001
algorithm.kl_ctrl.target_kl: 0.1            # For adaptive KL control
```
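
To show what `clip_ratio` controls, here is a per-token sketch of the standard PPO clipped surrogate objective. `clipped_surrogate` is a hypothetical helper written for illustration; it is not taken from verl and leaves out masking, aggregation, and the KL term.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantage, clip_ratio=0.2):
    """PPO objective for one token: min(ratio * A, clip(ratio) * A)."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - clip_ratio, min(ratio, 1.0 + clip_ratio))
    return min(ratio * advantage, clipped * advantage)


# The new policy makes the token 1.5x more likely than the old policy did.
logp_old = math.log(0.2)
logp_new = math.log(0.3)
print(clipped_surrogate(logp_new, logp_old, advantage=1.0))   # capped near 1.2
print(clipped_surrogate(logp_new, logp_old, advantage=-1.0))  # not capped, near -1.5
```

The asymmetry is deliberate: gains from moving far from the old policy are capped, while penalties for making a bad action more likely are kept in full.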

---

## Common Issues and Solutions

### Issue: OOM During Rollout

**Symptoms**: CUDA out of memory during generation phase

**Solutions**:
```yaml
# Reduce batch size
actor_rollout_ref.rollout.log_prob_micro_batch_size: 4

# Enable gradient checkpointing
actor_rollout_ref.model.enable_gradient_checkpointing: true

# Use FSDP2 with CPU offloading
actor_rollout_ref.actor.strategy: fsdp2
actor_rollout_ref.actor.fsdp_config.offload_policy: true
```

### Issue: Training Instability

**Symptoms**: Loss spikes, reward collapse

**Solutions**:
```yaml
# Reduce learning rate
actor_rollout_ref.actor.lr: 5e-7

# Increase KL penalty
actor_rollout_ref.actor.kl_loss_coef: 0.01

# Enable gradient clipping
actor_rollout_ref.actor.max_grad_norm: 1.0
```
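
For intuition on what `max_grad_norm` does, the following sketch rescales a flat list of gradient values so that their global L2 norm does not exceed the limit. `clip_by_global_norm` is a hypothetical helper that mirrors the usual global-norm clipping rule; a real training run would clip tensors through the framework rather than Python lists.

```python
import math


def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Scale all gradients by one factor so their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm


clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(norm_before)  # 5.0
print(clipped)      # roughly [0.6, 0.8]
```

Because every gradient is scaled by the same factor, the update direction is preserved and only its size is limited, which is why clipping dampens loss spikes without biasing the step.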

### Issue: Slow Weight Sync

**Symptoms**: Long pauses between rollout and training

**Solutions**:
```bash
# Hydra overrides, appended to the main_ppo launch command
# Use FSDP2 for faster resharding
actor_rollout_ref.actor.strategy=fsdp2

# Enable async weight transfer
trainer.async_weight_update=true
```

### Issue: vLLM Version Mismatch

**Symptoms**: Import errors or generation failures

**Solution**: Use compatible versions:
```bash
pip install "vllm>=0.8.5,<=0.12.0"
# Avoid vLLM 0.7.x (known bugs)
```
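
A version check can also run at the start of a training script. The sketch below compares a version string against the range quoted above using only the standard library. `parse_version` and `in_supported_range` are hypothetical helpers, and the parsing is deliberately simple: suffixes such as `.post1` are ignored, so `packaging.version` is the more robust choice where it is available.

```python
def parse_version(text):
    """Turn '0.8.5' or '0.11.0.post1' into a tuple of its leading integers."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break  # stop at the first non-numeric component
        parts.append(int(piece))
    return tuple(parts)


def in_supported_range(version, low="0.8.5", high="0.12.0"):
    """True if version lies within [low, high], compared component by component."""
    return parse_version(low) <= parse_version(version) <= parse_version(high)


for candidate in ["0.7.3", "0.8.5", "0.11.0.post1", "0.12.1"]:
    print(candidate, in_supported_range(candidate))
```

In a real script the installed version could be read with `importlib.metadata.version("vllm")` and passed to the same check.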

---

## Advanced Topics

### Multi-Turn Tool Calling

See [references/multi-turn.md](references/multi-turn.md) for agentic workflows with tool use.

### Vision-Language Models

```yaml
actor_rollout_ref:
  model:
    path: Qwen/Qwen2.5-VL-7B-Instruct
  rollout:
    name: vllm
    enable_vision: true
```

### LoRA Training

```yaml
actor_rollout_ref:
  actor:
    lora:
      enabled: true
      r: 16
      alpha: 32
      target_modules: ["q_proj", "v_proj"]
```
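
The `r` and `alpha` values determine how many parameters are trained and how strongly the low-rank update is scaled. The arithmetic below assumes the standard LoRA formulation, where a `d_out x d_in` weight gains two factors of rank `r` and the update is multiplied by `alpha / r`. The hidden size of 4096 is a hypothetical example, not a property of any model named in this guide.

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters added to one linear layer: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)


def lora_scale(alpha, r):
    """Multiplier applied to the low-rank update."""
    return alpha / r


hidden = 4096  # hypothetical hidden size
per_module = lora_param_count(hidden, hidden, r=16)
full_module = hidden * hidden

print("LoRA params per projection:", per_module)          # 131072
print("fraction of full weight:", per_module / full_module)  # 0.0078125
print("update scale:", lora_scale(alpha=32, r=16))        # 2.0
```

Under these assumptions each adapted projection trains under 1% of the parameters of the full matrix, which is where the memory savings in optimizer state come from.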

---

## Resources

- **Documentation**: https://verl.readthedocs.io/
- **Paper**: https://arxiv.org/abs/2409.19256
- **GitHub**: https://github.com/volcengine/verl
- **Recipes**: https://github.com/verl-project/verl-recipe (DAPO, GSPO, etc.)
- **Community**: Slack at verl-project



---

## Skill Companion Files

> Additional files collected from the skill directory layout.

### references/api-reference.md

```markdown
# verl API Reference

## Core Classes

### RayPPOTrainer

The central controller for the training loop. Manages resource allocation and coordinates worker groups.

```python
from verl import RayPPOTrainer

trainer = RayPPOTrainer(
    config=config,
    resource_pool_manager=resource_manager,
    ray_worker_group_cls=RayWorkerGroup,
)
trainer.init_workers()
trainer.fit()
```

### ResourcePoolManager

Manages GPU allocation across different worker groups using Ray PlacementGroups.

```python
from verl.trainer.ppo.resource_pool import ResourcePoolManager

manager = ResourcePoolManager(
    resource_pool_spec={
        "actor_rollout_ref": {"gpu": 4},
        "critic": {"gpu": 2},
    }
)
```

### RayWorkerGroup

Abstraction for distributed method execution. Spawns Ray actors and dispatches method calls.

```python
from verl.trainer.ppo.ray_worker_group import RayWorkerGroup

worker_group = RayWorkerGroup(
    num_workers=8,
    worker_cls=ActorRolloutRefWorker,
    resource_pool=pool,
)
```

### ActorRolloutRefWorker

Worker class implementing policy training, generation, and reference model computations. Manages hybrid engine mode switching.

```python
# Typically configured via YAML, not instantiated directly
# See configuration section below
```

### RolloutReplica

Interface for inference backends with implementations for vLLM, SGLang, TensorRT-LLM, and HuggingFace.

```python
from verl.workers.rollout import RolloutReplica

# The backend is selected in the YAML config, not in Python:
#   rollout:
#     name: vllm  # or: sglang, hf, tensorrt-llm
```

## Configuration Schema

### PPO Configuration (`verl/trainer/config/ppo_trainer.yaml`)

```yaml
# Data configuration
data:
  train_files: /path/to/train.parquet
  val_files: /path/to/val.parquet
  train_batch_size: 256        # Global batch size of prompts
  max_prompt_length: 512
  max_response_length: 2048

# Algorithm configuration
algorithm:
  adv_estimator: gae           # gae, grpo, rloo, reinforce_plus_plus
  gamma: 0.99                  # Discount factor
  lam: 0.95                    # GAE lambda
  use_kl_in_reward: false      # Add KL term to reward

# Actor configuration
actor_rollout_ref:
  model:
    path: Qwen/Qwen2.5-7B-Instruct
    backend: fsdp              # fsdp, fsdp2, megatron
  actor:
    ppo_mini_batch_size: 64    # Mini-batch for actor updates
    ppo_epochs: 1              # Number of actor update epochs
    clip_ratio: 0.2            # PPO clip range
    use_kl_loss: true          # Use KL loss in actor
    kl_loss_coef: 0.001        # KL loss coefficient
    kl_loss_type: low_var      # KL divergence calculation method
    loss_agg_mode: token-mean  # token-mean or sequence-mean
    gradient_checkpointing: true
    max_grad_norm: 1.0         # Gradient clipping
    lr: 1e-6                   # Learning rate
  rollout:
    name: vllm                 # vllm, sglang, hf
    n: 8                       # Samples per prompt
    temperature: 0.7
    top_p: 0.95
    log_prob_micro_batch_size: 8

# Critic configuration (PPO only)
critic:
  model:
    path: Qwen/Qwen2.5-7B-Instruct
  ppo_mini_batch_size: 64
  ppo_epochs: 1                # Defaults to actor epochs

# Trainer configuration
trainer:
  total_epochs: 3
  n_gpus_per_node: 8
  nnodes: 1
  save_freq: 100
  experiment_name: my_experiment
  async_weight_update: false
```

### GRPO Configuration (`docs/algo/grpo.md`)

```yaml
algorithm:
  adv_estimator: grpo          # Enable GRPO
  gamma: 1.0
  lam: 1.0

actor_rollout_ref:
  rollout:
    n: 8                       # Must be > 1 for GRPO
  actor:
    use_kl_loss: true          # Required for GRPO
    kl_loss_coef: 0.001
    kl_loss_type: low_var      # or: k1, k2, k3
    loss_agg_mode: token-mean
```

### Multi-Turn Configuration (`verl/trainer/config/rollout/rollout.yaml`)

```yaml
actor_rollout_ref:
  rollout:
    name: sglang               # Required for multi-turn
    multi_turn:
      enable: true
      tool_config_path: /path/to/tools.yaml
      interaction_config_path: /path/to/interaction.yaml
```

## Reward Functions

### Built-in Reward Types

```yaml
# Model-based reward
reward_model:
  path: OpenRLHF/Llama-3-8b-rm-700k

# Custom function-based reward
custom_reward_function:
  path: /path/to/reward.py
  name: compute_score          # Function name, default: compute_score
```

### Custom Reward Function Signature

```python
# reward.py
def compute_score(responses: list[str], ground_truths: list[str], **kwargs) -> list[float]:
    """
    Compute rewards for a batch of responses.

    Args:
        responses: Generated completions
        ground_truths: Expected answers from data
        **kwargs: Additional metadata

    Returns:
        List of reward scores (floats)
    """
    rewards = []
    for response, gt in zip(responses, ground_truths):
        # Placeholder logic: exact string match. Replace with your own check.
        score = 1.0 if response.strip() == gt.strip() else 0.0
        rewards.append(score)
    return rewards
```

## Backend-Specific Configuration

### FSDP Configuration

```yaml
actor_rollout_ref:
  actor:
    strategy: fsdp
    fsdp_config:
      mixed_precision: bf16
      sharding_strategy: FULL_SHARD
      offload_policy: false
```

### FSDP2 Configuration

```yaml
actor_rollout_ref:
  actor:
    strategy: fsdp2
    fsdp_config:
      offload_policy: true     # CPU offloading
      reshard_after_forward: true
```

### Megatron Configuration

```yaml
actor_rollout_ref:
  model:
    backend: megatron
  actor:
    strategy: megatron
    tensor_model_parallel_size: 8
    pipeline_model_parallel_size: 2
    megatron:
      use_mbridge: true        # Required for format conversion
```

### vLLM Rollout Configuration

```yaml
actor_rollout_ref:
  rollout:
    name: vllm
    tensor_parallel_size: 2
    gpu_memory_utilization: 0.9
    max_num_seqs: 256
    enforce_eager: false
```

### SGLang Rollout Configuration

```yaml
actor_rollout_ref:
  rollout:
    name: sglang
    tp_size: 2
    mem_fraction_static: 0.8
    context_length: 8192
```

## Algorithm Reference

| Algorithm | `adv_estimator` | Requires Critic | Best For |
|-----------|-----------------|-----------------|----------|
| PPO | `gae` | Yes | Dense rewards, value estimation |
| GRPO | `grpo` | No | Sparse rewards, math/reasoning |
| RLOO | `rloo` | No | Leave-one-out baseline |
| REINFORCE++ | `reinforce_plus_plus` | No | Variance reduction |
| DAPO | `dapo` | No | Decoupled clipping and dynamic sampling |

## Vision-Language Model Support

```yaml
actor_rollout_ref:
  model:
    path: Qwen/Qwen2.5-VL-7B-Instruct
  rollout:
    name: vllm
    enable_vision: true
    max_model_len: 32768
```

## LoRA Configuration

```yaml
actor_rollout_ref:
  actor:
    lora:
      enabled: true
      r: 16
      alpha: 32
      target_modules: ["q_proj", "v_proj", "k_proj", "o_proj"]
      dropout: 0.05
```

## Resources

- Documentation: https://verl.readthedocs.io/
- GitHub: https://github.com/volcengine/verl
- Paper: https://arxiv.org/abs/2409.19256 (HybridFlow)

```

### references/troubleshooting.md

```markdown
# verl Troubleshooting Guide

## Common Issues and Solutions

### OOM (Out of Memory) Issues

#### Issue: OOM During Rollout

**Symptoms**: CUDA out of memory during generation phase

**Solutions**:

1. **Reduce log prob batch size**:
```yaml
actor_rollout_ref:
  rollout:
    log_prob_micro_batch_size: 4  # Reduce from 8
```

2. **Enable gradient checkpointing**:
```yaml
actor_rollout_ref:
  actor:
    gradient_checkpointing: true
```

3. **Use FSDP2 with CPU offloading**:
```yaml
actor_rollout_ref:
  actor:
    strategy: fsdp2
    fsdp_config:
      offload_policy: true
```

4. **Reduce vLLM memory utilization**:
```yaml
actor_rollout_ref:
  rollout:
    gpu_memory_utilization: 0.7  # Reduce from 0.9
```

#### Issue: OOM During Training

**Symptoms**: CUDA OOM in backward pass

**Solutions**:

1. **Reduce batch sizes**:
```yaml
actor_rollout_ref:
  actor:
    ppo_mini_batch_size: 32  # Reduce from 64
```

2. **Use gradient accumulation**:
```yaml
actor_rollout_ref:
  actor:
    gradient_accumulation_steps: 4
```

3. **Enable mixed precision**:
```yaml
actor_rollout_ref:
  actor:
    fsdp_config:
      mixed_precision: bf16
```

### Training Stability Issues

#### Issue: Training Instability / Loss Spikes

**Symptoms**: Loss spikes, reward collapse, divergence

**Solutions**:

1. **Reduce learning rate**:
```yaml
actor_rollout_ref:
  actor:
    lr: 5e-7  # Reduce from 1e-6
```

2. **Increase KL penalty**:
```yaml
actor_rollout_ref:
  actor:
    kl_loss_coef: 0.01  # Increase from 0.001
```

3. **Enable gradient clipping**:
```yaml
actor_rollout_ref:
  actor:
    max_grad_norm: 1.0
```

4. **Use smaller PPO clip range**:
```yaml
actor_rollout_ref:
  actor:
    clip_ratio: 0.1  # Reduce from 0.2
```

#### Issue: Policy Collapse (Entropy Drops to Zero)

**Symptoms**: Model outputs become deterministic, entropy approaches zero

**Solutions**:

1. **Increase temperature during rollout**:
```yaml
actor_rollout_ref:
  rollout:
    temperature: 0.9  # Increase from 0.7
```

2. **Add entropy bonus**:
```yaml
algorithm:
  entropy_coef: 0.01
```

3. **Reduce KL penalty**:
```yaml
actor_rollout_ref:
  actor:
    kl_loss_coef: 0.0001  # Reduce
```

### Weight Synchronization Issues

#### Issue: Slow Weight Sync

**Symptoms**: Long pauses between rollout and training phases

**Solutions**:

1. **Use FSDP2 for faster resharding**:
```yaml
actor_rollout_ref:
  actor:
    strategy: fsdp2
```

2. **Enable async weight transfer**:
```yaml
trainer:
  async_weight_update: true
```

3. **Reduce sync frequency**:
```yaml
trainer:
  weight_sync_interval: 2  # Sync every 2 steps
```

#### Issue: Weight Sync Timeout

**Symptoms**: Ray actor timeouts during weight synchronization

**Solutions**:

1. **Increase Ray timeout**:
```python
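import ray

ray.init(num_gpus=8)
# ray.init() takes no timeout argument; pass the timeout to the blocking call,
# e.g. ray.get(object_ref, timeout=3600) for a 1 hour limit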
```

2. **Use colocated mode** (if memory allows):
```yaml
trainer:
  colocate_actor_ref: true
```

### vLLM Version Issues

#### Issue: vLLM Import Errors or Generation Failures

**Symptoms**: Import errors, generation hangs, incorrect outputs

**Solutions**:

1. **Use compatible vLLM version**:
```bash
pip install "vllm>=0.8.2,<=0.12.0"
# Avoid vLLM 0.7.x (known bugs)
```

2. **For vLLM 0.8.x issues**:
```yaml
actor_rollout_ref:
  rollout:
    enforce_eager: true  # Disable CUDA graphs
```

3. **Check CUDA version compatibility**:
```bash
# vLLM 0.11+ requires CUDA 12.1+
nvidia-smi  # Check CUDA version
```

### Ray Issues

#### Issue: Ray Cluster Connection Failures

**Symptoms**: Cannot connect to Ray cluster

**Solutions**:

1. **Check Ray head node**:
```bash
ray status
```

2. **Restart Ray cluster**:
```bash
ray stop
ray start --head --port=6379 --num-gpus=8
```

3. **Verify network connectivity**:
```bash
ping head_node_ip
```

#### Issue: Ray Actor OOM

**Symptoms**: Ray actors killed due to OOM

**Solutions**:

1. **Increase Ray object store memory**:
```bash
ray start --head --object-store-memory=10000000000  # 10GB
```

2. **Enable spilling to disk**:
```bash
export RAY_object_spilling_config='{"type":"filesystem","params":{"directory_path":"/tmp/ray_spill"}}'
```

### Multi-Node Issues

#### Issue: NCCL Timeout

**Symptoms**: NCCL operations timeout on multi-node

**Solutions**:

1. **Set NCCL environment variables**:
```bash
export NCCL_DEBUG=INFO
export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_DISABLE=0  # Enable InfiniBand if available
```

2. **Increase NCCL timeout**:
```bash
export NCCL_TIMEOUT=1800  # 30 minutes
```

3. **Check network interface**:
```bash
ifconfig  # Verify correct interface
```

#### Issue: DeepSpeed GPU Index Out of Range

**Symptoms**: "GPU index out of range" error with DeepSpeed

**Solutions**:

```bash
export RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=1
```

### Data Issues

#### Issue: Empty Batches

**Symptoms**: Training receives empty batches

**Solutions**:

1. **Verify data format**:
```python
import pandas as pd
df = pd.read_parquet("train.parquet")
print(df.columns)  # Should include 'prompt', 'reward_model'
```

2. **Check data loading**:
```yaml
data:
  train_files: /absolute/path/to/train.parquet  # Use absolute path
```

#### Issue: Tokenization Errors

**Symptoms**: Tokenizer errors, sequence length mismatches

**Solutions**:

1. **Set padding token**:
```python
tokenizer.pad_token = tokenizer.eos_token
```

2. **Verify max length configuration**:
```yaml
data:
  max_prompt_length: 512
  max_response_length: 2048
# Total should not exceed model's max length
```

### Megatron-Specific Issues

#### Issue: Megatron Checkpoint Loading Fails

**Symptoms**: Cannot load Megatron checkpoints

**Solutions**:

1. **Enable mbridge conversion**:
```yaml
actor_rollout_ref:
  actor:
    megatron:
      use_mbridge: true
```

2. **Convert HuggingFace to Megatron format**:
```bash
python tools/convert_hf_to_megatron.py \
    --hf_model_path /path/to/hf/model \
    --save_path /path/to/megatron/checkpoint
```

#### Issue: Megatron on AMD GPUs

**Current Limitation**: Megatron-LM backend is not supported on AMD GPUs. Use FSDP backend instead:

```yaml
actor_rollout_ref:
  model:
    backend: fsdp
```

### Debugging Tips

#### Enable Verbose Logging

```yaml
trainer:
  logging_level: DEBUG
```

```bash
export VERL_DEBUG=1
export RAY_DEDUP_LOGS=0
```

#### Check GPU Utilization

```bash
watch -n 1 nvidia-smi
```

#### Profile Training

```python
# Add profiling to training loop
import torch.profiler

with torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CPU, torch.profiler.ProfilerActivity.CUDA],
    record_shapes=True,
) as prof:
    trainer.fit()
prof.export_chrome_trace("trace.json")
```

## Resources

- GitHub Issues: https://github.com/volcengine/verl/issues
- Documentation: https://verl.readthedocs.io/
- Community Slack: verl-project

```