verl-rl-training
Provides guidance for training LLMs with reinforcement learning using verl (Volcano Engine RL). Use when implementing RLHF, GRPO, PPO, or other RL algorithms for LLM post-training at scale with flexible infrastructure backends.
Install command
npx @skill-hub/cli install orchestra-research-ai-research-skills-verl
Repository
Skill path: 06-post-training/verl
Best for
Primary workflow: Ship Full Stack.
Technical facets: Full Stack.
Target audience: everyone.
License: MIT.
Original source
Catalog source: SkillHub Club.
Repository owner: Orchestra-Research.
This is still a mirrored public skill entry. Review the repository before installing into production workflows.
What it helps with
- Install verl-rl-training into Claude Code, Codex CLI, Gemini CLI, or OpenCode workflows
- Review https://github.com/Orchestra-Research/AI-Research-SKILLs before adding verl-rl-training to shared team environments
- Use verl-rl-training for development workflows
Original source / Raw SKILL.md
---
name: verl-rl-training
description: Provides guidance for training LLMs with reinforcement learning using verl (Volcano Engine RL). Use when implementing RLHF, GRPO, PPO, or other RL algorithms for LLM post-training at scale with flexible infrastructure backends.
version: 1.0.0
author: Orchestra Research
license: MIT
tags: [Reinforcement Learning, RLHF, GRPO, PPO, Post-Training, Distributed Training]
dependencies: [verl>=0.3.0, torch>=2.0.0, ray>=2.41.0, vllm>=0.8.2, transformers>=4.40.0]
---
# verl: Volcano Engine Reinforcement Learning for LLMs
verl is a flexible, efficient, and production-ready RL training library for large language models from ByteDance's Seed team. It implements the HybridFlow framework (EuroSys 2025) and powers models like Doubao-1.5-pro achieving O1-level performance on math benchmarks.
## When to Use verl
**Choose verl when you need:**
- Production-ready RL training at scale (tested up to 671B parameters)
- Flexibility to swap backends (FSDP ↔ Megatron-LM ↔ vLLM ↔ SGLang)
- Support for multiple RL algorithms (PPO, GRPO, RLOO, REINFORCE++, DAPO)
- Multi-turn rollout with tool calling for agentic workflows
- Vision-language model RL training
**Consider alternatives when:**
- You need Megatron-native training → use **slime** or **miles**
- You want PyTorch-native abstractions with Monarch → use **torchforge**
- You only need simple SFT/DPO → use **TRL** or **Axolotl**
## Key Features
- **Training backends**: FSDP, FSDP2, Megatron-LM
- **Rollout engines**: vLLM, SGLang, HuggingFace Transformers
- **Algorithms**: PPO, GRPO, DAPO, RLOO, ReMax, REINFORCE++, SPIN, SPPO
- **Models**: Qwen-3, Llama-3.1, DeepSeek, Gemma-2 (0.5B to 671B)
- **Advanced**: LoRA RL, sequence parallelism, expert parallelism, multi-turn tools
## Installation
```bash
# Option 1: pip install
pip install "verl[vllm]" # or "verl[sglang]" for the SGLang backend (quotes keep the shell from globbing the brackets)
# Option 2: Docker (recommended for production)
docker pull verlai/verl:vllm011.latest
# Option 3: From source
git clone https://github.com/volcengine/verl.git
cd verl && pip install -e ".[vllm,math]"
```
## Quick Start: GRPO Training
```bash
python3 -m verl.trainer.main_ppo \
algorithm.adv_estimator=grpo \
data.train_files=~/data/gsm8k/train.parquet \
actor_rollout_ref.model.path=Qwen/Qwen2.5-7B \
actor_rollout_ref.rollout.n=8 \
actor_rollout_ref.actor.use_kl_loss=True \
trainer.n_gpus_per_node=8
```
## Core Architecture
verl uses a **HybridFlow** programming model separating control flow from computation:
```
┌─────────────────────────────────────────────────────────┐
│ Single-Process Controller (Ray)                         │
│ - Orchestrates: rollout → reward → train → sync         │
└─────────────────────┬───────────────────────────────────┘
                      │
┌─────────────────────▼───────────────────────────────────┐
│ Multi-Process Workers                                   │
│ ├── ActorRolloutRefWorker (policy + generation)         │
│ ├── CriticWorker (value estimation, PPO only)           │
│ └── RewardManager (model-based or rule-based rewards)   │
└─────────────────────────────────────────────────────────┘
```
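The control flow in the diagram can be sketched in plain Python. This is a toy illustration, not verl's implementation: `rollout`, `score`, `train_step`, and `sync_weights` are hypothetical stand-ins for the worker calls that the controller would dispatch through Ray.

```python
# Toy single-controller loop: rollout -> reward -> train -> sync.
# Every function here is a hypothetical stand-in, not a verl API.

def rollout(policy_version, prompts):
    # Stand-in for generation on the rollout engine.
    return [f"v{policy_version}:{p}" for p in prompts]

def score(responses):
    # Stand-in for the reward manager (a dummy rule-based score).
    return [float(len(r) % 2) for r in responses]

def train_step(policy_version, responses, rewards):
    # Stand-in for one actor update; returns the new policy version.
    return policy_version + 1

def sync_weights(policy_version):
    # Stand-in for pushing updated weights to the rollout engine.
    return policy_version

def run(prompts, steps):
    trained = 0   # weights held by the training workers
    serving = 0   # weights held by the rollout engine
    log = []
    for _ in range(steps):
        responses = rollout(serving, prompts)
        rewards = score(responses)
        trained = train_step(trained, responses, rewards)
        serving = sync_weights(trained)
        log.append((serving, sum(rewards) / len(rewards)))
    return log

history = run(["2+2?", "3*3?"], steps=3)
```

The point of the separation is that the loop itself stays single-process and easy to read, while each stand-in would be a distributed call in practice.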
---
## Workflow 1: Math Reasoning with GRPO
Use this workflow for training reasoning models on math tasks like GSM8K or MATH.
### Prerequisites Checklist
- [ ] GPU cluster with 8+ GPUs (H100 recommended)
- [ ] Dataset in parquet format with `prompt` and `reward_model` columns
- [ ] Base model from HuggingFace Hub
### Step 1: Prepare Dataset
```python
import pandas as pd
data = [
    {
        "prompt": [{"role": "user", "content": "What is 15 + 27?"}],
        "reward_model": {"ground_truth": "42"}
    },
    # ... more examples
]
df = pd.DataFrame(data)
df.to_parquet("train.parquet")
```
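Before writing the parquet file, it may help to check each record against the layout used above. The following is a minimal sketch using only the standard library; `validate_record` is a hypothetical helper, and the rules it enforces are inferred from the example record rather than taken from verl's loader.

```python
def validate_record(record):
    """Return a list of problems found in one training record."""
    problems = []
    prompt = record.get("prompt")
    if not isinstance(prompt, list) or not prompt:
        problems.append("prompt must be a non-empty list of messages")
    else:
        for msg in prompt:
            # Each chat message is expected to carry a role and its content.
            if not {"role", "content"} <= set(msg):
                problems.append("each message needs 'role' and 'content'")
                break
    reward = record.get("reward_model")
    if not isinstance(reward, dict) or "ground_truth" not in reward:
        problems.append("reward_model must contain 'ground_truth'")
    return problems

good = {
    "prompt": [{"role": "user", "content": "What is 15 + 27?"}],
    "reward_model": {"ground_truth": "42"},
}
bad = {"prompt": "What is 15 + 27?"}  # plain string, no reward_model

good_problems = validate_record(good)
bad_problems = validate_record(bad)
```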
### Step 2: Define Reward Function
```python
# reward_function.py
import re
def compute_reward(responses, ground_truths):
    # Score each response 1.0 on an exact match with the ground truth, else 0.0
    rewards = []
    for response, gt in zip(responses, ground_truths):
        # Extract the final answer from \boxed{...} in the response
        match = re.search(r'\\boxed\{([^}]+)\}', response)
        if match and match.group(1).strip() == gt.strip():
            rewards.append(1.0)
        else:
            rewards.append(0.0)
    return rewards
```
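A quick sanity check of the exact-match logic above, run on made-up responses. The function is repeated so the snippet is self-contained; the sample strings are illustrative only.

```python
import re

def compute_reward(responses, ground_truths):
    rewards = []
    for response, gt in zip(responses, ground_truths):
        match = re.search(r"\\boxed\{([^}]+)\}", response)
        correct = bool(match) and match.group(1).strip() == gt.strip()
        rewards.append(1.0 if correct else 0.0)
    return rewards

samples = [
    "15 + 27 = 42, so the answer is \\boxed{42}.",  # boxed and correct
    "The answer is \\boxed{ 41 }.",                  # boxed but wrong
    "The answer is 42.",                             # right value, not boxed
]
rewards = compute_reward(samples, ["42", "42", "42"])
```

The third sample shows a common pitfall: a correct answer that is not wrapped in `\boxed{}` still scores 0.0, so the prompt should ask for that format explicitly.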
### Step 3: Create Training Config
```yaml
# config/grpo_math.yaml
algorithm:
  adv_estimator: grpo
  gamma: 1.0
  lam: 1.0
data:
  train_files: /path/to/train.parquet
  val_files: /path/to/val.parquet
  train_batch_size: 256
  max_prompt_length: 512
  max_response_length: 2048
actor_rollout_ref:
  model:
    path: Qwen/Qwen2.5-7B-Instruct
  actor:
    use_kl_loss: true
    kl_loss_coef: 0.001
    ppo_mini_batch_size: 64
  rollout:
    name: vllm
    n: 8 # samples per prompt
    temperature: 0.7
    top_p: 0.95
trainer:
  total_epochs: 3
  n_gpus_per_node: 8
  save_freq: 100
```
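The sizes in this config interact, and a little arithmetic shows the per-step workload. The sketch below assumes that `train_batch_size` and `ppo_mini_batch_size` both count prompts (not individual responses); check that against the installed verl version before relying on it.

```python
train_batch_size = 256      # prompts per rollout step (data.train_batch_size)
rollout_n = 8               # samples per prompt (rollout.n)
ppo_mini_batch_size = 64    # prompts per actor update (assumed unit: prompts)
max_prompt_length = 512
max_response_length = 2048

# Every prompt is sampled rollout_n times.
responses_per_step = train_batch_size * rollout_n

# Number of optimizer updates performed on each rollout batch.
updates_per_step = train_batch_size // ppo_mini_batch_size

# Upper bounds on sequence and batch size in tokens.
max_tokens_per_sequence = max_prompt_length + max_response_length
max_tokens_per_step = responses_per_step * max_tokens_per_sequence
```

Under these assumptions each step generates 2,048 responses, applies 4 actor updates, and processes at most about 5.2 million tokens.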
### Step 4: Launch Training
```bash
python3 -m verl.trainer.main_ppo \
--config-path config \
--config-name grpo_math \
trainer.experiment_name=grpo_math_qwen7b
```
### Step 5: Monitor and Validate
- [ ] Check WandB/TensorBoard for loss curves
- [ ] Verify reward is increasing over steps
- [ ] Run evaluation on held-out test set
---
## Workflow 2: PPO with Critic Model
Use this workflow when you need value-based advantage estimation (GAE).
### Key Differences from GRPO
- Requires separate critic model
- Uses Generalized Advantage Estimation (GAE)
- Better for tasks with dense rewards
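
For reference, GAE can be sketched in a few lines of plain Python. This is an illustration of the standard estimator, not verl's implementation; the function name `gae` and the toy rewards and values are made up for the example.

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory.

    rewards[t] and values[t] are per-step; the value after the final
    step is taken to be 0 because the episode ends there.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        # One-step TD error.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

# Sparse terminal reward with a constant critic estimate of 0.5.
adv = gae(rewards=[0.0, 0.0, 1.0], values=[0.5, 0.5, 0.5], gamma=1.0, lam=1.0)
```

With `gamma = lam = 1.0` the estimator reduces to "return minus value", so every step in this toy trajectory gets an advantage of 0.5.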
### Configuration
```yaml
algorithm:
  adv_estimator: gae # Use GAE instead of GRPO
  gamma: 0.99
  lam: 0.95
critic:
  model:
    path: Qwen/Qwen2.5-7B-Instruct # Can be same or different from actor
  ppo_mini_batch_size: 64
actor_rollout_ref:
  actor:
    use_kl_loss: true
    kl_loss_coef: 0.02
    clip_ratio: 0.2 # PPO clipping
```
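The effect of `clip_ratio` is easiest to see on a single token. The sketch below implements the standard PPO clipped surrogate in plain Python; it is an illustration of the textbook objective, and the function name and inputs are hypothetical rather than verl internals.

```python
import math

def clipped_surrogate(logp_new, logp_old, advantage, clip_ratio=0.2):
    """Per-token PPO objective (to be maximized)."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - clip_ratio, min(ratio, 1.0 + clip_ratio))
    # Taking the minimum makes the objective pessimistic about large
    # policy changes in the direction the advantage favors.
    return min(ratio * advantage, clipped * advantage)

# Ratio 1.5 with a positive advantage: the gain is capped at 1.2 * A.
capped = clipped_surrogate(math.log(1.5), 0.0, advantage=1.0)

# Ratio 0.5 with a negative advantage: the objective is held at 0.8 * A.
floored = clipped_surrogate(math.log(0.5), 0.0, advantage=-1.0)
```

A smaller `clip_ratio` narrows the band around 1.0, which trades learning speed for stability.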
### Launch with Critic
```bash
python3 -m verl.trainer.main_ppo \
algorithm.adv_estimator=gae \
critic.model.path=Qwen/Qwen2.5-7B-Instruct \
trainer.n_gpus_per_node=8
```
---
## Workflow 3: Large-Scale Training with Megatron
Use this workflow for models >70B parameters or when you need expert parallelism.
### Prerequisites
- [ ] Install Megatron-LM bridge: `pip install mbridge`
- [ ] Convert model to Megatron format
- [ ] Multi-node cluster with NVLink/InfiniBand
### Configuration for 70B+ Models
```yaml
actor_rollout_ref:
  model:
    path: /path/to/megatron/checkpoint
    backend: megatron
  actor:
    strategy: megatron
    tensor_model_parallel_size: 8
    pipeline_model_parallel_size: 2
  rollout:
    name: vllm
    tensor_parallel_size: 8
```
### Launch Multi-Node
```bash
# On head node
ray start --head --port=6379
# On worker nodes
ray start --address='head_ip:6379'
# Launch training
python3 -m verl.trainer.main_ppo \
trainer.nnodes=4 \
trainer.n_gpus_per_node=8
```
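The parallelism sizes must divide the GPU count evenly. As a rough check, assuming the usual layout where world size equals tensor-parallel × pipeline-parallel × data-parallel, the numbers above work out as follows; `data_parallel_size` is a hypothetical helper, not part of verl.

```python
def data_parallel_size(nnodes, gpus_per_node, tp, pp):
    """Number of model replicas that fit on the cluster."""
    world_size = nnodes * gpus_per_node
    gpus_per_replica = tp * pp
    if world_size % gpus_per_replica:
        raise ValueError(
            f"{world_size} GPUs cannot be split into replicas of {gpus_per_replica}"
        )
    return world_size // gpus_per_replica

# 4 nodes x 8 GPUs with TP=8 and PP=2: 32 GPUs / 16 per replica.
dp = data_parallel_size(nnodes=4, gpus_per_node=8, tp=8, pp=2)

# A single 8-GPU node cannot host even one TP=8, PP=2 replica.
try:
    data_parallel_size(nnodes=1, gpus_per_node=8, tp=8, pp=2)
    single_node_ok = True
except ValueError:
    single_node_ok = False
```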
---
## Configuration Reference
### Algorithm Selection
| Algorithm | `adv_estimator` | Use Case |
|-----------|-----------------|----------|
| GRPO | `grpo` | Critic-free, math/reasoning |
| PPO/GAE | `gae` | Dense rewards, value estimation |
| REINFORCE++ | `reinforce_plus_plus` | Variance reduction |
| RLOO | `rloo` | Leave-one-out baseline |
| ReMax | `remax` | Maximum reward baseline |
| OPO | `opo` | On-policy RL with an optimal reward baseline |
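
The critic-free estimators in the table differ mainly in the baseline they subtract from each reward. The sketch below contrasts GRPO's group normalization with RLOO's leave-one-out baseline on one group of sampled responses. It is a simplified illustration: it uses the population standard deviation, and actual implementations may differ in such details.

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-6):
    """Normalize each reward by the mean and std of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def rloo_advantages(rewards):
    """Subtract the mean reward of the other samples in the group."""
    n = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]

# Rewards for 4 responses sampled from the same prompt.
group = [1.0, 0.0, 0.0, 1.0]
grpo = grpo_advantages(group)
rloo = rloo_advantages(group)
```

Both need more than one sample per prompt, which is why `rollout.n` must be greater than 1 for these estimators.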
### Key Parameters
```yaml
# Rollout parameters
actor_rollout_ref.rollout.n: 8 # Samples per prompt
actor_rollout_ref.rollout.temperature: 0.7 # Sampling temperature
actor_rollout_ref.rollout.top_p: 0.95 # Nucleus sampling
# Training parameters
actor_rollout_ref.actor.lr: 1e-6 # Learning rate
actor_rollout_ref.actor.ppo_mini_batch_size: 64
actor_rollout_ref.actor.clip_ratio: 0.2 # PPO clip range
# KL control
actor_rollout_ref.actor.use_kl_loss: true
actor_rollout_ref.actor.kl_loss_coef: 0.001
algorithm.kl_ctrl.target_kl: 0.1 # For adaptive KL control
```
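Adaptive KL control is commonly implemented as a proportional controller that nudges the KL coefficient toward `target_kl`. The sketch below follows that widely used scheme; whether verl's controller matches it exactly is an assumption, and the class name, the `horizon` value, and the step count are hypothetical.

```python
class AdaptiveKLCoef:
    """Proportional controller for the KL penalty coefficient."""

    def __init__(self, init_coef, target_kl, horizon):
        self.coef = init_coef
        self.target_kl = target_kl
        self.horizon = horizon

    def update(self, current_kl, n_steps):
        # Relative error, clipped so one noisy batch cannot swing the coefficient.
        error = max(-0.2, min(current_kl / self.target_kl - 1.0, 0.2))
        self.coef *= 1.0 + error * n_steps / self.horizon
        return self.coef

# Observed KL above target: the coefficient grows slightly.
up = AdaptiveKLCoef(init_coef=0.001, target_kl=0.1, horizon=10000)
raised = up.update(current_kl=0.2, n_steps=256)

# Observed KL below target: the coefficient shrinks slightly.
down = AdaptiveKLCoef(init_coef=0.001, target_kl=0.1, horizon=10000)
lowered = down.update(current_kl=0.05, n_steps=256)
```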
---
## Common Issues and Solutions
### Issue: OOM During Rollout
**Symptoms**: CUDA out of memory during generation phase
**Solutions**:
```yaml
# Reduce batch size
actor_rollout_ref.rollout.log_prob_micro_batch_size: 4
# Enable gradient checkpointing
actor_rollout_ref.model.enable_gradient_checkpointing: true
# Use FSDP2 with CPU offloading
actor_rollout_ref.actor.strategy: fsdp2
actor_rollout_ref.actor.fsdp_config.offload_policy: true
```
### Issue: Training Instability
**Symptoms**: Loss spikes, reward collapse
**Solutions**:
```yaml
# Reduce learning rate
actor_rollout_ref.actor.lr: 5e-7
# Increase KL penalty
actor_rollout_ref.actor.kl_loss_coef: 0.01
# Enable gradient clipping
actor_rollout_ref.actor.max_grad_norm: 1.0
```
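To make the effect of `max_grad_norm` concrete, here is a minimal sketch of global-norm gradient clipping on a flat list of gradient values. It mirrors the usual technique in plain Python and is not verl's code; `clip_by_global_norm` is a hypothetical name.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Rescale gradients so their L2 norm does not exceed max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm

# A gradient of norm 5.0 is scaled down to norm ~1.0, keeping its direction.
clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))

# A gradient already inside the limit is left unchanged.
untouched, _ = clip_by_global_norm([0.3, 0.4], max_norm=1.0)
```

Because only the magnitude changes, clipping damps loss spikes without altering the update direction.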
### Issue: Slow Weight Sync
**Symptoms**: Long pauses between rollout and training
**Solutions**:
```bash
# These are config overrides: append them to the main_ppo launch command
# Use FSDP2 for faster resharding
actor_rollout_ref.actor.strategy=fsdp2
# Enable async weight transfer
trainer.async_weight_update=true
```
### Issue: vLLM Version Mismatch
**Symptoms**: Import errors or generation failures
**Solution**: Use compatible versions:
```bash
pip install "vllm>=0.8.5,<=0.12.0" # quotes stop the shell from treating > and < as redirections
# Avoid vLLM 0.7.x (known bugs)
```
---
## Advanced Topics
### Multi-Turn Tool Calling
See [references/multi-turn.md](references/multi-turn.md) for agentic workflows with tool use.
### Vision-Language Models
```yaml
actor_rollout_ref:
  model:
    path: Qwen/Qwen2.5-VL-7B-Instruct
  rollout:
    name: vllm
    enable_vision: true
```
### LoRA Training
```yaml
actor_rollout_ref:
  actor:
    lora:
      enabled: true
      r: 16
      alpha: 32
      target_modules: ["q_proj", "v_proj"]
```
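A short calculation shows why LoRA is cheap. Each adapted weight matrix of shape `d_out × d_in` gains two low-rank factors with `r * (d_in + d_out)` parameters in total. The hidden size of 4096 below is a hypothetical value chosen for illustration, not a property of any model named in this document.

```python
def lora_params(d_in, d_out, r):
    """Trainable parameters added by one LoRA adapter (A: r x d_in, B: d_out x r)."""
    return r * (d_in + d_out)

hidden = 4096          # hypothetical hidden size
r, alpha = 16, 32      # values from the config above

adapter = lora_params(hidden, hidden, r)   # one square projection, e.g. q_proj
full = hidden * hidden                     # parameters in the frozen matrix
fraction = adapter / full
scaling = alpha / r                        # factor applied to the low-rank update
```

With these numbers each adapter trains 131,072 parameters, under 1% of the matrix it modifies, and the low-rank update is scaled by 2.0.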
---
## Resources
- **Documentation**: https://verl.readthedocs.io/
- **Paper**: https://arxiv.org/abs/2409.19256
- **GitHub**: https://github.com/volcengine/verl
- **Recipes**: https://github.com/verl-project/verl-recipe (DAPO, GSPO, etc.)
- **Community**: Slack at verl-project
---
## Skill Companion Files
> Additional files collected from the skill directory layout.
### references/api-reference.md
```markdown
# verl API Reference
## Core Classes
### RayPPOTrainer
The central controller for the training loop. Manages resource allocation and coordinates worker groups.
```python
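from verl.single_controller.ray import RayWorkerGroup
from verl.trainer.ppo.ray_trainer import RayPPOTrainer
# `config` and `resource_manager` must be built beforehand
trainer = RayPPOTrainer(
    config=config,
    resource_pool_manager=resource_manager,
    ray_worker_group_cls=RayWorkerGroup,
)
trainer.init_workers()
trainer.fit()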
```
### ResourcePoolManager
Manages GPU allocation across different worker groups using Ray PlacementGroups.
```python
from verl.trainer.ppo.resource_pool import ResourcePoolManager
manager = ResourcePoolManager(
    resource_pool_spec={
        "actor_rollout_ref": {"gpu": 4},
        "critic": {"gpu": 2},
    }
)
```
### RayWorkerGroup
Abstraction for distributed method execution. Spawns Ray actors and dispatches method calls.
```python
from verl.single_controller.ray import RayWorkerGroup
# `ActorRolloutRefWorker` and `pool` must be defined beforehand
worker_group = RayWorkerGroup(
    num_workers=8,
    worker_cls=ActorRolloutRefWorker,
    resource_pool=pool,
)
```
### ActorRolloutRefWorker
Worker class implementing policy training, generation, and reference model computations. Manages hybrid engine mode switching.
```python
# Typically configured via YAML, not instantiated directly
# See configuration section below
```
### RolloutReplica
Interface for inference backends with implementations for vLLM, SGLang, TensorRT-LLM, and HuggingFace.
```python
from verl.workers.rollout import RolloutReplica
# The backend is selected via config rather than in code (YAML):
#   rollout:
#     name: vllm  # or: sglang, hf, tensorrt-llm
```
## Configuration Schema
### PPO Configuration (`verl/trainer/config/ppo_trainer.yaml`)
```yaml
# Data configuration
data:
  train_files: /path/to/train.parquet
  val_files: /path/to/val.parquet
  train_batch_size: 256 # Global batch size of prompts
  max_prompt_length: 512
  max_response_length: 2048
# Algorithm configuration
algorithm:
  adv_estimator: gae # gae, grpo, rloo, reinforce_plus_plus
  gamma: 0.99 # Discount factor
  lam: 0.95 # GAE lambda
  use_kl_in_reward: false # Add KL term to reward
# Actor configuration
actor_rollout_ref:
  model:
    path: Qwen/Qwen2.5-7B-Instruct
    backend: fsdp # fsdp, fsdp2, megatron
  actor:
    ppo_mini_batch_size: 64 # Mini-batch for actor updates
    ppo_epochs: 1 # Number of actor update epochs
    clip_ratio: 0.2 # PPO clip range
    use_kl_loss: true # Use KL loss in actor
    kl_loss_coef: 0.001 # KL loss coefficient
    kl_loss_type: low_var # KL divergence calculation method
    loss_agg_mode: token-mean # token-mean or sequence-mean
    gradient_checkpointing: true
    max_grad_norm: 1.0 # Gradient clipping
    lr: 1e-6 # Learning rate
  rollout:
    name: vllm # vllm, sglang, hf
    n: 8 # Samples per prompt
    temperature: 0.7
    top_p: 0.95
    log_prob_micro_batch_size: 8
# Critic configuration (PPO only)
critic:
  model:
    path: Qwen/Qwen2.5-7B-Instruct
  ppo_mini_batch_size: 64
  ppo_epochs: 1 # Defaults to actor epochs
# Trainer configuration
trainer:
  total_epochs: 3
  n_gpus_per_node: 8
  nnodes: 1
  save_freq: 100
  experiment_name: my_experiment
  async_weight_update: false
```
### GRPO Configuration (`docs/algo/grpo.md`)
```yaml
algorithm:
  adv_estimator: grpo # Enable GRPO
  gamma: 1.0
  lam: 1.0
actor_rollout_ref:
  rollout:
    n: 8 # Must be > 1 for GRPO
  actor:
    use_kl_loss: true # Required for GRPO
    kl_loss_coef: 0.001
    kl_loss_type: low_var # or: k1, k2, k3
    loss_agg_mode: token-mean
```
### Multi-Turn Configuration (`verl/trainer/config/rollout/rollout.yaml`)
```yaml
actor_rollout_ref:
  rollout:
    name: sglang # Required for multi-turn
    multi_turn:
      enable: true
      tool_config_path: /path/to/tools.yaml
      interaction_config_path: /path/to/interaction.yaml
```
## Reward Functions
### Built-in Reward Types
```yaml
# Model-based reward
reward_model:
  path: OpenRLHF/Llama-3-8b-rm-700k
# Custom function-based reward
custom_reward_function:
  path: /path/to/reward.py
  name: compute_score # Function name, default: compute_score
```
### Custom Reward Function Signature
```python
# reward.py
def correct(response: str, ground_truth: str) -> bool:
    """Placeholder check; replace with task-specific answer matching."""
    return response.strip() == ground_truth.strip()

def compute_score(responses: list[str], ground_truths: list[str], **kwargs) -> list[float]:
    """
    Compute rewards for a batch of responses.
    Args:
        responses: Generated completions
        ground_truths: Expected answers from data
        **kwargs: Additional metadata
    Returns:
        List of reward scores (floats)
    """
    rewards = []
    for response, gt in zip(responses, ground_truths):
        # Your reward logic
        score = 1.0 if correct(response, gt) else 0.0
        rewards.append(score)
    return rewards
```
## Backend-Specific Configuration
### FSDP Configuration
```yaml
actor_rollout_ref:
  actor:
    strategy: fsdp
    fsdp_config:
      mixed_precision: bf16
      sharding_strategy: FULL_SHARD
      offload_policy: false
```
### FSDP2 Configuration
```yaml
actor_rollout_ref:
  actor:
    strategy: fsdp2
    fsdp_config:
      offload_policy: true # CPU offloading
      reshard_after_forward: true
```
### Megatron Configuration
```yaml
actor_rollout_ref:
  model:
    backend: megatron
  actor:
    strategy: megatron
    tensor_model_parallel_size: 8
    pipeline_model_parallel_size: 2
    megatron:
      use_mbridge: true # Required for format conversion
```
### vLLM Rollout Configuration
```yaml
actor_rollout_ref:
  rollout:
    name: vllm
    tensor_parallel_size: 2
    gpu_memory_utilization: 0.9
    max_num_seqs: 256
    enforce_eager: false
```
### SGLang Rollout Configuration
```yaml
actor_rollout_ref:
  rollout:
    name: sglang
    tp_size: 2
    mem_fraction_static: 0.8
    context_length: 8192
```
## Algorithm Reference
| Algorithm | `adv_estimator` | Requires Critic | Best For |
|-----------|-----------------|-----------------|----------|
| PPO | `gae` | Yes | Dense rewards, value estimation |
| GRPO | `grpo` | No | Sparse rewards, math/reasoning |
| RLOO | `rloo` | No | Leave-one-out baseline |
| REINFORCE++ | `reinforce_plus_plus` | No | Variance reduction |
| DAPO | `dapo` | No | Decoupled clipping and dynamic sampling (long-form reasoning) |
## Vision-Language Model Support
```yaml
actor_rollout_ref:
  model:
    path: Qwen/Qwen2.5-VL-7B-Instruct
  rollout:
    name: vllm
    enable_vision: true
    max_model_len: 32768
```
## LoRA Configuration
```yaml
actor_rollout_ref:
  actor:
    lora:
      enabled: true
      r: 16
      alpha: 32
      target_modules: ["q_proj", "v_proj", "k_proj", "o_proj"]
      dropout: 0.05
```
## Resources
- Documentation: https://verl.readthedocs.io/
- GitHub: https://github.com/volcengine/verl
- Paper: https://arxiv.org/abs/2409.19256 (HybridFlow)
```
### references/troubleshooting.md
```markdown
# verl Troubleshooting Guide
## Common Issues and Solutions
### OOM (Out of Memory) Issues
#### Issue: OOM During Rollout
**Symptoms**: CUDA out of memory during generation phase
**Solutions**:
1. **Reduce log prob batch size**:
```yaml
actor_rollout_ref:
  rollout:
    log_prob_micro_batch_size: 4 # Reduce from 8
```
2. **Enable gradient checkpointing**:
```yaml
actor_rollout_ref:
  model:
    enable_gradient_checkpointing: true
```
3. **Use FSDP2 with CPU offloading**:
```yaml
actor_rollout_ref:
  actor:
    strategy: fsdp2
    fsdp_config:
      offload_policy: true
```
4. **Reduce vLLM memory utilization**:
```yaml
actor_rollout_ref:
  rollout:
    gpu_memory_utilization: 0.7 # Reduce from 0.9
```
#### Issue: OOM During Training
**Symptoms**: CUDA OOM in backward pass
**Solutions**:
1. **Reduce batch sizes**:
```yaml
actor_rollout_ref:
  actor:
    ppo_mini_batch_size: 32 # Reduce from 64
```
2. **Use gradient accumulation**:
```yaml
actor_rollout_ref:
  actor:
    gradient_accumulation_steps: 4
```
3. **Enable mixed precision**:
```yaml
actor_rollout_ref:
  actor:
    fsdp_config:
      mixed_precision: bf16
```
### Training Stability Issues
#### Issue: Training Instability / Loss Spikes
**Symptoms**: Loss spikes, reward collapse, divergence
**Solutions**:
1. **Reduce learning rate**:
```yaml
actor_rollout_ref:
  actor:
    lr: 5e-7 # Reduce from 1e-6
```
2. **Increase KL penalty**:
```yaml
actor_rollout_ref:
  actor:
    kl_loss_coef: 0.01 # Increase from 0.001
```
3. **Enable gradient clipping**:
```yaml
actor_rollout_ref:
  actor:
    max_grad_norm: 1.0
```
4. **Use smaller PPO clip range**:
```yaml
actor_rollout_ref:
  actor:
    clip_ratio: 0.1 # Reduce from 0.2
```
#### Issue: Policy Collapse (Entropy Drops to Zero)
**Symptoms**: Model outputs become deterministic, entropy approaches zero
**Solutions**:
1. **Increase temperature during rollout**:
```yaml
actor_rollout_ref:
  rollout:
    temperature: 0.9 # Increase from 0.7
```
2. **Add entropy bonus**:
```yaml
algorithm:
  entropy_coef: 0.01
```
3. **Reduce KL penalty**:
```yaml
actor_rollout_ref:
  actor:
    kl_loss_coef: 0.0001 # Reduce
```
### Weight Synchronization Issues
#### Issue: Slow Weight Sync
**Symptoms**: Long pauses between rollout and training phases
**Solutions**:
1. **Use FSDP2 for faster resharding**:
```yaml
actor_rollout_ref:
  actor:
    strategy: fsdp2
```
2. **Enable async weight transfer**:
```yaml
trainer:
  async_weight_update: true
```
3. **Reduce sync frequency**:
```yaml
trainer:
  weight_sync_interval: 2 # Sync every 2 steps
```
#### Issue: Weight Sync Timeout
**Symptoms**: Ray actor timeouts during weight synchronization
**Solutions**:
1. **Increase Ray timeout**:
```python
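import ray
ray.init(num_gpus=8)
# ray.init() takes no timeout argument; pass the timeout where results
# are awaited, e.g. ray.get(object_refs, timeout=3600) for a 1-hour limit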
```
2. **Use colocated mode** (if memory allows):
```yaml
trainer:
  colocate_actor_ref: true
```
### vLLM Version Issues
#### Issue: vLLM Import Errors or Generation Failures
**Symptoms**: Import errors, generation hangs, incorrect outputs
**Solutions**:
1. **Use compatible vLLM version**:
```bash
pip install "vllm>=0.8.2,<=0.12.0" # quotes stop the shell from treating > and < as redirections
# Avoid vLLM 0.7.x (known bugs)
```
2. **For vLLM 0.8.x issues**:
```yaml
actor_rollout_ref:
  rollout:
    enforce_eager: true # Disable CUDA graphs
```
3. **Check CUDA version compatibility**:
```bash
# vLLM 0.11+ requires CUDA 12.1+
nvidia-smi # Check CUDA version
```
### Ray Issues
#### Issue: Ray Cluster Connection Failures
**Symptoms**: Cannot connect to Ray cluster
**Solutions**:
1. **Check Ray head node**:
```bash
ray status
```
2. **Restart Ray cluster**:
```bash
ray stop
ray start --head --port=6379 --num-gpus=8
```
3. **Verify network connectivity**:
```bash
ping head_node_ip
```
#### Issue: Ray Actor OOM
**Symptoms**: Ray actors killed due to OOM
**Solutions**:
1. **Increase Ray object store memory**:
```bash
ray start --head --object-store-memory=10000000000 # 10GB
```
2. **Enable spilling to disk**:
```bash
export RAY_object_spilling_config='{"type":"filesystem","params":{"directory_path":"/tmp/ray_spill"}}'
```
### Multi-Node Issues
#### Issue: NCCL Timeout
**Symptoms**: NCCL operations timeout on multi-node
**Solutions**:
1. **Set NCCL environment variables**:
```bash
export NCCL_DEBUG=INFO
export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_DISABLE=0 # Enable InfiniBand if available
```
2. **Increase NCCL timeout**:
```bash
export NCCL_TIMEOUT=1800 # 30 minutes
```
3. **Check network interface**:
```bash
ifconfig # Verify correct interface
```
#### Issue: DeepSpeed GPU Index Out of Range
**Symptoms**: "GPU index out of range" error with DeepSpeed
**Solutions**:
```bash
export RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=1
```
### Data Issues
#### Issue: Empty Batches
**Symptoms**: Training receives empty batches
**Solutions**:
1. **Verify data format**:
```python
import pandas as pd
df = pd.read_parquet("train.parquet")
print(df.columns) # Should include 'prompt', 'reward_model'
```
2. **Check data loading**:
```yaml
data:
  train_files: /absolute/path/to/train.parquet # Use absolute path
```
#### Issue: Tokenization Errors
**Symptoms**: Tokenizer errors, sequence length mismatches
**Solutions**:
1. **Set padding token**:
```python
tokenizer.pad_token = tokenizer.eos_token
```
2. **Verify max length configuration**:
```yaml
data:
  max_prompt_length: 512
  max_response_length: 2048
  # Total should not exceed model's max length
```
### Megatron-Specific Issues
#### Issue: Megatron Checkpoint Loading Fails
**Symptoms**: Cannot load Megatron checkpoints
**Solutions**:
1. **Enable mbridge conversion**:
```yaml
actor_rollout_ref:
  actor:
    megatron:
      use_mbridge: true
```
2. **Convert HuggingFace to Megatron format**:
```bash
python tools/convert_hf_to_megatron.py \
--hf_model_path /path/to/hf/model \
--save_path /path/to/megatron/checkpoint
```
#### Issue: Megatron on AMD GPUs
**Current Limitation**: Megatron-LM backend is not supported on AMD GPUs. Use FSDP backend instead:
```yaml
actor_rollout_ref:
  model:
    backend: fsdp
```
### Debugging Tips
#### Enable Verbose Logging
```yaml
trainer:
  logging_level: DEBUG
```
```bash
export VERL_DEBUG=1
export RAY_DEDUP_LOGS=0
```
#### Check GPU Utilization
```bash
watch -n 1 nvidia-smi
```
#### Profile Training
```python
# Add profiling to training loop
import torch.profiler
with torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CPU, torch.profiler.ProfilerActivity.CUDA],
    record_shapes=True,
) as prof:
    trainer.fit()
prof.export_chrome_trace("trace.json")
```
## Resources
- GitHub Issues: https://github.com/volcengine/verl/issues
- Documentation: https://verl.readthedocs.io/
- Community Slack: verl-project
```