deepspeed
Expert guidance for distributed training with DeepSpeed - ZeRO optimization stages, pipeline parallelism, FP16/BF16/FP8, 1-bit Adam, sparse attention
Packaged view
This page reorganizes the original catalog entry around fit, installability, and workflow context first. The original raw source lives below.
Install command
npx @skill-hub/cli install orchestra-research-ai-research-skills-deepspeed
Repository
Skill path: 08-distributed-training/deepspeed
Expert guidance for distributed training with DeepSpeed - ZeRO optimization stages, pipeline parallelism, FP16/BF16/FP8, 1-bit Adam, sparse attention
Open repositoryBest for
Primary workflow: Ship Full Stack.
Technical facets: Full Stack.
Target audience: everyone.
License: MIT.
Original source
Catalog source: SkillHub Club.
Repository owner: Orchestra-Research.
This is still a mirrored public skill entry. Review the repository before installing into production workflows.
What it helps with
- Install deepspeed into Claude Code, Codex CLI, Gemini CLI, or OpenCode workflows
- Review https://github.com/Orchestra-Research/AI-Research-SKILLs before adding deepspeed to shared team environments
- Use deepspeed for development workflows
Works across
Favorites: 0.
Sub-skills: 0.
Aggregator: No.
Original source / Raw SKILL.md
---
name: deepspeed
description: Expert guidance for distributed training with DeepSpeed - ZeRO optimization stages, pipeline parallelism, FP16/BF16/FP8, 1-bit Adam, sparse attention
version: 1.0.0
author: Orchestra Research
license: MIT
tags: [DeepSpeed, Distributed Training, ZeRO, Pipeline Parallelism, Mixed Precision, Optimization, Microsoft, Large-Scale Training, FP16, FP8]
dependencies: [deepspeed, torch, transformers, accelerate]
---
# Deepspeed Skill
Comprehensive assistance with deepspeed development, generated from official documentation.
## When to Use This Skill
This skill should be triggered when:
- Working with deepspeed
- Asking about deepspeed features or APIs
- Implementing deepspeed solutions
- Debugging deepspeed code
- Learning deepspeed best practices
## Quick Reference
### Common Patterns
**Pattern 1:** DeepNVMe Contents Requirements Creating DeepNVMe Handles Using DeepNVMe Handles Blocking File Write Non-Blocking File Write Parallel File Write Pinned Tensors Putting it together Acknowledgements Appendix Advanced Handle Creation Performance Tuning DeepNVMe APIs General I/O APIs GDS-specific APIs Handle Settings APIs This tutorial will show how to use DeepNVMe for data transfers between persistent storage and tensors residing in host or device memory. DeepNVMe improves the performance and efficiency of I/O operations in Deep Learning applications through powerful optimizations built on Non-Volatile Memory Express (NVMe) Solid State Drives (SSDs), Linux Asynchronous I/O (libaio), and NVIDIA Magnum IOTM GPUDirect® Storage (GDS). Requirements Ensure your environment is properly configured to use DeepNVMe. First, you need to install DeepSpeed version >= 0.15.0. Next, ensure that the DeepNVMe operators are available in the DeepSpeed installation. The async_io operator is required for any DeepNVMe functionality, while the gds operator is required only for GDS functionality. You can confirm availability of each operator by inspecting the output of ds_report to check that compatible status is [OKAY]. Below is a snippet of ds_report output confirming the availability of both async_io and gds operators. If async_io operator is unavailable, you will need to install the appropriate libaio library binaries for your Linux flavor. For example, Ubuntu users will need to run apt install libaio-dev. In general, you should carefully inspect ds_report output for helpful tips such as the following: [WARNING] async_io requires the dev libaio .so object and headers but these were not found. [WARNING] async_io: please install the libaio-dev package with apt [WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found. To enable gds operator, you will need to install NVIDIA GDS by consulting the appropriate guide for bare-metal systems or Azure VMs (coming soon). Creating DeepNVMe Handles DeepNVMe functionality can be accessed through two abstractions: aio_handle and gds_handle. The aio_handle is usable on both host and device tensors. while gds_handle works only on CUDA tensors, but is more efficient. The first step to use DeepNVMe is to create a desired handle. aio_handle requires async_io operator, while gds_handle requires both async_io and gds operators. The following snippets illustrate aio_handle and gds_handle creation respectively. ### Create aio_handle from deepspeed.ops.op_builder import AsyncIOBuilder aio_handle = AsyncIOBuilder().load().aio_handle() ### Create gds_handle from deepspeed.ops.op_builder import GDSBuilder gds_handle = GDSBuilder().load().gds_handle() For simplicity, the above examples illustrate handle creation using default parameters. We expect that handles created with default parameters to provide good performance in most environments. However, you can see below for advanced handle creation. Using DeepNVMe Handles aio_handle and gds_handle provide identical APIs for storing tensors to files or loading tensors from files. A common feature of these APIs is that they take a tensor and a file path as arguments for the desired I/O operation. For best performance, pinned device or host tensors should be used for I/O operations (see here for details). For brevity, this tutorial will use aio_handle for illustration, but keep in mind that gds_handle works similarly. You can see the available APIs in a Python shell via tab completion on an aio_handle object . This is illustrated using tab completion of h.. >python Python 3.10.12 (main, Jul 29 2024, 16:56:48) [GCC 11.4.0] on linux Type "help", "copyright", "credits" or "license" for more information. >>> from deepspeed.ops.op_builder import AsyncIOBuilder >>> h = AsyncIOBuilder().load().aio_handle() >>> h. h.async_pread( h.free_cpu_locked_tensor( h.get_overlap_events( h.get_single_submit( h.new_cpu_locked_tensor( h.pwrite( h.sync_pread( h.wait( h.async_pwrite( h.get_block_size( h.get_queue_depth( h.get_intra_op_parallelism( h.pread( h.read( h.sync_pwrite( h.write( The APIs of interest for performing I/O operations are those named with pread and pwrite substrings. For brevity, we will focus on the file write APIs, namely sync_pwrite, async_pwrite, and pwrite. We will discuss only sync_pwrite and async_pwrite below because they are specializations of pwrite. Blocking File Write sync_pwrite provides the standard blocking semantics of Python file write. The example below illustrates using sync_pwrite to store a 1GB CUDA tensor to a local NVMe file. >>> import os >>> os.path.isfile('/local_nvme/test_1GB.pt') False >>> import torch >>> t=torch.empty(1024**3, dtype=torch.uint8).cuda() >>> from deepspeed.ops.op_builder import AsyncIOBuilder >>> h = AsyncIOBuilder().load().aio_handle() >>> h.sync_pwrite(t,'/local_nvme/test_1GB.pt') >>> os.path.isfile('/local_nvme/test_1GB.pt') True >>> os.path.getsize('/local_nvme/test_1GB.pt') 1073741824 Non-Blocking File Write An important DeepNVMe optimization is the non-blocking I/O semantics which enables Python threads to overlap computations with I/O operations. async_pwrite provides the non-blocking semantics for file writes. The Python thread can later use wait() to synchronize with the I/O operation. async_write can also be used to submit multiple back-to-back non-blocking I/O operations, of which can then be later blocked on using a single wait(). The example below illustrates using async_pwrite to store a 1GB CUDA tensor to a local NVMe file. >>> import os >>> os.path.isfile('/local_nvme/test_1GB.pt') False >>> import torch >>> t=torch.empty(1024**3, dtype=torch.uint8).cuda() >>> from deepspeed.ops.op_builder import AsyncIOBuilder >>> h = AsyncIOBuilder().load().aio_handle() >>> h.async_pwrite(t,'/local_nvme/test_1GB.pt') >>> h.wait() 1 >>> os.path.isfile('/local_nvme/test_1GB.pt') True >>> os.path.getsize('/local_nvme/test_1GB.pt') 1073741824 Warning for non-blocking I/O operations: To avoid data races and corruptions, .wait() must be carefully used to serialize the writing of source tensors, and the reading of destination tensors. For example, the following update of t during a non-blocking file write is unsafe and could corrupt /local_nvme/test_1GB.pt. >>> t=torch.empty(1024**3, dtype=torch.uint8).cuda() >>> from deepspeed.ops.op_builder import AsyncIOBuilder >>> h = AsyncIOBuilder().load().aio_handle() >>> h.async_pwrite(t,'/local_nvme/test_1GB.pt') >>> t += 1 # <--- Data race; avoid by preceding with `h.wait()` Similar safety problems apply to reading the destination tensor of a non-blocking file read without .wait() synchronization. Parallel File Write An important DeepNVMe optimization is the ability to parallelize individual I/O operations. This optimization is enabled by specifying the desired parallelism degree when constructing a DeepNVMe handle. Subsequent I/O operations with that handle are automatically parallelized over the requested number of host or device threads, as appropriate. I/O parallelism is composable with either the blocking or non-blocking I/O APIs. The example below illustrates 4-way parallelism of a file write using async_pwrite. Note the use of intra_op_parallelism argument to specify the desired parallelism degree in handle creation. >>> import os >>> os.path.isfile('/local_nvme/test_1GB.pt') False >>> import torch >>> t=torch.empty(1024**3, dtype=torch.uint8).cuda() >>> from deepspeed.ops.op_builder import AsyncIOBuilder >>> h = AsyncIOBuilder().load().aio_handle(intra_op_parallelism=4) >>> h.async_pwrite(t,'/local_nvme/test_1GB.pt') >>> h.wait() 1 >>> os.path.isfile('/local_nvme/test_1GB.pt') True >>> os.path.getsize('/local_nvme/test_1GB.pt') 1073741824 Pinned Tensors A key part of DeepNVMe optimizations is using direct memory access (DMA) for I/O operations, which requires that the host or device tensor be pinned. To pin host tensors, you can use mechanisms provided by Pytorch or DeepSpeed Accelerators. The following example illustrates writing a pinned CPU tensor to a local NVMe file. >>> import os >>> os.path.isfile('/local_nvme/test_1GB.pt') False >>> import torch >>> t=torch.empty(1024**3, dtype=torch.uint8).pin_memory() >>> from deepspeed.ops.op_builder import AsyncIOBuilder >>> h = AsyncIOBuilder().load().aio_handle() >>> h.async_pwrite(t,'/local_nvme/test_1GB.pt') >>> h.wait() 1 >>> os.path.isfile('/local_nvme/test_1GB.pt') True >>> os.path.getsize('/local_nvme/test_1GB.pt') 1073741824 On the other hand,gds_handle provides new_pinned_device_tensor() and pin_device_tensor() functions for pinning CUDA tensors. The following example illustrates writing a pinned CUDA tensor to a local NVMe file. >>> import os >>> os.path.isfile('/local_nvme/test_1GB.pt') False >>> import torch >>> t=torch.empty(1024**3, dtype=torch.uint8).cuda() >>> from deepspeed.ops.op_builder import GDSBuilder >>> h = GDSBuilder().load().gds_handle() >>> h.pin_device_tensor(t) >>> h.async_pwrite(t,'/local_nvme/test_1GB.pt') >>> h.wait() 1 >>> os.path.isfile('/local_nvme/test_1GB.pt') True >>> os.path.getsize('/local_nvme/test_1GB.pt') 1073741824 >>> h.unpin_device_tensor(t) Putting it together We hope that the above material helps you to get started with DeepNVMe. You can also use the following links to see DeepNVMe usage in real-world Deep Learning applications. Parameter swapper in ZeRO-Inference and ZeRO-Infinity. Optimizer swapper in ZeRO-Infinity. Gradient swapper in ZeRO-Infinity. Simple file read and write operations. Acknowledgements This tutorial has been significantly improved by feedback from Guanhua Wang, Masahiro Tanaka, and Stas Bekman. Appendix Advanced Handle Creation Achieving peak I/O performance with DeepNVMe requires careful configuration of handle creation. In particular, the parameters of aio_handle and gds_handle constructors are performance-critical because they determine how efficiently DeepNVMe interacts with the underlying storage subsystem (i.e., libaio, GDS, PCIe, and SSD). For convenience we make it possible to create handles using default parameter values which will provide decent performance in most scenarios. However, squeezing out every available performance in your environment will likely require tuning the constructor parameters, namely block_size, queue_depth, single_submit, overlap_events, and intra_op_parallelism. The aio_handle constructor parameters and default values are illustrated below: >>> from deepspeed.ops.op_builder import AsyncIOBuilder >>> help(AsyncIOBuilder().load().aio_handle()) Help on aio_handle in module async_io object: class aio_handle(pybind11_builtins.pybind11_object) | Method resolution order: | aio_handle | pybind11_builtins.pybind11_object | builtins.object | | Methods defined here: | | __init__(...) | __init__(self: async_io.aio_handle, block_size: int = 1048576, queue_depth: int = 128, single_submit: bool = False, overlap_events: bool = False, intra_op_parallelism: int = 1) -> None | | AIO handle constructor Performance Tuning As discussed earlier, achieving peak DeepNVMe performance for a target workload or environment requires using optimally configured aio_handle or gds_handle handles. For configuration convenience, we provide a utility called ds_nvme_tune to automate the discovery of optimal DeepNVMe configurations. ds_nvme_tune automatically explores a user-specified or default configuration space and recommends the option that provides the best read and write performance. Below is an example usage of ds_nvme_tune to tune aio_handle data transfers between GPU memory and a local NVVMe SSD mounted on /local_nvme. This example used the default configuration space of ds_nvme_tune for tuning. $ ds_nvme_tune --nvme_dir /local_nvme --gpu Running DeepNVMe performance tuning on ['/local_nvme/'] Best performance (GB/sec): read = 3.69, write = 3.18 { "aio": { "single_submit": "false", "overlap_events": "true", "intra_op_parallelism": 8, "queue_depth": 32, "block_size": 1048576 } } The above tuning was executed on a Lambda workstation equipped with two NVIDIA A6000-48GB GPUs, 252GB of DRAM, and a CS3040 NVMe 2TB SDD with peak read and write speeds of 5.6 GB/s and 4.3 GB/s respectively. The tuning required about four and half minutes. Based on the results, one can expect to achieve read and write transfer speeds of 3.69 GB/sec and 3.18 GB/sec respectively by using an aio_handle configured as below. >>> from deepspeed.ops.op_builder import AsyncIOBuilder >>> h = AsyncIOBuilder().load().aio_handle(block_size=1048576, queue_depth=32, single_submit=False, overlap_events=True, intra_op_parallelism=8) The full command line options of ds_nvme_tune can be obtained via the normal -h or --help. usage: ds_nvme_tune [-h] --nvme_dir NVME_DIR [NVME_DIR ...] [--sweep_config SWEEP_CONFIG] [--no_read] [--no_write] [--io_size IO_SIZE] [--gpu] [--gds] [--flush_page_cache] [--log_dir LOG_DIR] [--loops LOOPS] [--verbose] options: -h, --help show this help message and exit --nvme_dir NVME_DIR [NVME_DIR ...] Directory in which to perform I/O tests. A writeable directory on a NVMe device. --sweep_config SWEEP_CONFIG Performance sweep configuration json file. --no_read Disable read performance measurements. --no_write Disable write performance measurements. --io_size IO_SIZE Number of I/O bytes to read/write for performance measurements. --gpu Test tensor transfers between GPU device and NVME device. --gds Run the sweep over NVIDIA GPUDirectStorage operator --flush_page_cache Page cache will not be flushed and reported read speeds may be higher than actual ***Requires sudo access***. --log_dir LOG_DIR Output directory for performance log files. Default is ./_aio_bench_logs --loops LOOPS Count of operation repetitions --verbose Print debugging information. DeepNVMe APIs For convenience, we provide listing and brief descriptions of the DeepNVMe APIs. General I/O APIs The following functions are used for I/O operations with both aio_handle and gds_handle. Function Description async_pread Non-blocking file read into tensor sync_pread Blocking file read into tensor pread File read with blocking and non-blocking options async_pwrite Non-blocking file write from tensor sync_pwrite Blocking file write from tensor pwrite File write with blocking and non-blocking options wait Wait for non-blocking I/O operations to complete GDS-specific APIs The following functions are available only for gds_handle Function Description new_pinned_device_tensor Allocate and pin a device tensor free_pinned_device_tensor Unpin and free a device tensor pin_device_tensor Pin a device tensor unpin_device_tensor unpin a device tensor Handle Settings APIs The following APIs can be used to probe handle configuration. Function Description get_queue_depth Return queue depth setting get_single_submit Return whether single_submit is enabled get_intra_op_parallelism Return I/O parallelism degree get_block_size Return I/O block size setting get_overlap_events Return whether overlap_event is enabled Updated: November 5, 2025 Previous Next
```
libaio
```
**Pattern 2:** Mixture of Experts for NLG models Contents 1. Installation 2. Training NLG+MoE models 2.1. Changes to the model 2.2. Pre-training the Standard MoE model 2.3. Pre-training the PR-MoE model 2.4. Training MoS with reduced model size In this tutorial, we introduce how to apply DeepSpeed Mixture of Experts (MoE) to NLG models, which reduces the training cost by 5 times and reduce the MoE model size by 3 times (details in our Blog). We use the GPT-3 like models in Megatron-LM framework as the example. Before reading this tutorial, we recommend to first read the tutorials about Mixture of Experts and Megatron-LM GPT pre-training. 1. Installation You would need to install DeepSpeed v0.6.0 or higher to use the MoE feature. The MoE for NLG model examples are in the Megatron-DeepSpeed repo under the MoE folder. 2. Training NLG+MoE models 2.1. Changes to the model To apply MoE to the GPT-style model, we made several changes in Megatron framework, mostly in megatron/model/ where we add the MoE layers into the model. 2.2. Pre-training the Standard MoE model We provide example training scripts under examples_deepspeed/MoE which we used to perform the experiments in our Blog. There are a few new hyperparameters for standard MoE model: --num-experts: the number of experts per MoE layer. In our experiments we set it to 128. Larger number of experts tend to provide better convergence, but it’s a diminishing return. --moe-expert-parallel-size: degree of the MoE expert parallelism. In other words, there will be num-experts/moe-expert-parallel-size experts on each GPU. Thus --moe-expert-parallel-size should be no more than both number of GPUs, and --num-experts. --moe-loss-coeff: scaling coefficient for adding MoE loss to model loss. In our experiments we find that 0.01 is a good setting. --moe-train-capacity-factor, --moe-eval-capacity-factor, --moe-min-capacity: these configs determine how many tokens can a single expert handle. Larger numbers could lead to better convergence, but would also lead to slower training since the load would be more unbalanced on different experts. --disable-moe-token-dropping: this will completely remove the limitation of how many tokens can a single expert handle. For the same reason as above, we only recommend using this during inference/eval. 2.3. Pre-training the PR-MoE model PR-MoE is a new designed MoE models, standing for Pyramid-Residual-MoE, which improves the parameter efficiency up to 3x as compared to standard MoE. Please see our Blog for more details. We provide example training scripts under examples_deepspeed/MoE. There are a few different hyperparameters for PR-MoE model compared to standard MoE: --num-experts: Instead of providing a single number, to enable Pyramid-MoE, you need to provide a list, whose length is the same as the number of MoE layers. We suggest to use more experts in the latter stage (close to output) of the model. --mlp-type: chosen from [standard, residual]. When it is residual, Residual-MoE is enabled. In addition to the new hyperparameters above for standard MoE and PR-MoE, for NLG+MoE models we found that it’s helpful to lower the learning rate and increase the learning rate decay duration compared to the base dense model. Details of our tuning can be found in the example training scripts. Regarding training data, we are not able to release our internal data but any public data for Megatron-LM pre-training can be directly used to train MoE models (with the caveat that it might not provide the exact same model quality as in our experiments). For example, we evaluated The Pile dataset (pile.eleuther.ai, github.com/EleutherAI/the-pile) for both dense and MoE models. Table 1 below shows that this public data provides similar evaluation results as our internal data. Model size LAMBADA: completion prediction PIQA: commonsense reasoning BoolQ: reading comprehension RACE-h: reading comprehension TriviaQA: question answering WebQs: question answering Dense NLG: 350M, internal data 0.5203 0.6931 0.5364 0.3177 0.0321 0.0157 350M, public Pile 0.5106 0.6589 0.5933 0.3196 0.0257 0.0064 Standard MoE NLG: 350M+MoE-128, internal data 0.6270 0.7459 0.6046 0.3560 0.1658 0.0517 350M+MoE-128, public Pile 0.6128 0.7323 0.6040 0.3349 0.1111 0.0335 PR-MoE NLG: 350M+MoE-128, internal data 0.6365 0.7399 0.5988 0.3569 0.1630 0.0473 PR-MoE + MoS NLG: 350M+MoE-128, internal data 0.6346 0.7334 0.5807 0.3483 0.1369 0.0522 Table 1: Zero-shot evaluation results (last six columns) for different dense and MoE NLG models. All zero-shot evaluation results use the accuracy metric. 2.4. Training MoS with reduced model size MoS, standing for Mixture-of-Students, is a staged distillation-based technique for compressing large MoE models. MoS further reduces the model size by 12.5%, leading to up 3.7x model size reduction when combined with PR-MoE over the standard MoE. The reduced model size helps reduce the latency and cost during inference. To train an MoS model, one needs to specify a few additional parameters. We will use PR-MoE as an example: --mos: This would enable Mixture-of-Students via knowledge distillation. --load-teacher: This specifies the path to the teacher model checkpoint. This is a mandatory argument for using MoS and the teacher model checkpoint can be obtained by either training a standard MoE or the PR-MoE. num-layers-teacher, --hidden-size-teacher, --hidden-size-teacher, --num-experts-teacher: In addition to the teacher model checkpoint path, we also need to specify the model architecture of the teacher model such as its number of layers, hidden dimension size, and the number of experts per MoE layer. In the case of PR-MoE, we need to also provide a list of experts for the teacher model, where we remove a few expert layers from the teacher model. In addition to the new parameters above, we observe that using the teacher PR-MoE during the entire training process may adversely impact the final student model accuracy. In our experiments, we use a staged distillation method by stopping distillation early in the training process (e.g., after 400K steps) and perform optimization only against the standard language modeling loss for the rest of the training. We provide example training scripts under examples_deepspeed/MoE. Details of our parameter settings can be found in the example training scripts. The performance results of MoS can be seen from our blog post and our paper. Updated: November 5, 2025 Previous Next
```
megatron/model/
```
**Pattern 3:** MoS, standing for Mixture-of-Students, is a staged distillation-based technique for compressing large MoE models. MoS further reduces the model size by 12.5%, leading to up 3.7x model size reduction when combined with PR-MoE over the standard MoE. The reduced model size helps reduce the latency and cost during inference. To train an MoS model, one needs to specify a few additional parameters. We will use PR-MoE as an example:
```
--mos
```
**Pattern 4:** Learning Rate Range Test Contents Learning Rate Range Test (LRRT) Prerequisites LRRT Parameters Required Model Configuration Changes PyTorch Example: Tuning for Large Batch Sizes This tutorial shows how to use to perform Learning Rate range tests in PyTorch. Learning Rate Range Test (LRRT) Learning rate range test ( LRRT ) is a method for discovering the largest learning rate values that can be used to train a model without divergence. Data scientists are often interested in this information because large learning rates lead to faster model convergence than a small learning rates. Moreover, large learning rates are crucial in learning rate schedules such as CLR and 1Cycle, which are used to train effectively with large batch sizes. DeepSpeed provides LRRT for model training in PyTorch frameworks. Prerequisites To use DeepSpeed’s LRRT, you must satisfy the following two conditions: Integrate DeepSpeed into your training script using the Getting Started guide. Add the parameters to configure LRRT to the parameters of your model. The LRRT parameters are defined below. LRRT Parameters LRRT works by linearly increasing the learning rate by a predefined amount, at predefined intervals. Thus, LRRT is a form of learning rate schedule because it defines how and when the learning rate should change during model training. To configure LRRT, you will need to set these parameters: lr_range_test_min_lr : The initial learning rate for training (float) lr_range_test_step_size: The interval for scaling up learning rate, defined in training steps (integer) lr_range_test_step_rate: The scaling factor for increasing learning rate (float) lr_range_test_staircase: If true, learning rate is changed every lr_range_test_step_size training steps, otherwise learning rate is changed at every training step (boolean) Required Model Configuration Changes We will illustrate the required model configuration changes an example LRRT schedule that: Starts training with an initial learning rate of 0.0001 Uses a scaling rate of 5 Uses a scaling interval of 200 training steps Scales learning rate at every training step, i.e., does not use staircase PyTorch For PyTorch models, LRRT is implemented as a learning rate scheduler, a feature that is available in PyTorch versions 1.0.1 and newer. Thus, you can add a "scheduler" entry of type "LRRangeTest" into your model configuration as illustrated below: "scheduler": { "type": "LRRangeTest", "params": { "lr_range_test_min_lr": 0.0001, "lr_range_test_step_size": 200, "lr_range_test_step_rate": 5, "lr_range_test_staircase": false } } Example: Tuning for Large Batch Sizes We illustrate how LRRT can benefit data scientists with a snippet of our experience of tuning an internal production model to converge efficiently on larger batch sizes, as we scaled from one GPU (batch size 512) to four GPUs (batch size 2048). Our goal was to train the model with the larger batch size to match the performance of the smaller batch size using the same amount of data samples. The challenge here is the well known problem of slow convergence of large batch size training. Our approach was to use a 1Cycle schedule in DeepSpeed to tackle this problem, and we used LRRT to configure the schedule. In the plots below, we illustrate using LRRT to discover the maximum learning rates for effective training with batch size 2048. The plot on the left shows the impact of large learning rates on validation loss over the first 9000 batches of training. The plot on the right shows the learning rate values during the same period of training. Using grid search we discover that the best fixed learning rate for the batch size 2048 is 0.0002. The blue line (lr=0.0002) represents training with this fixed learning rate. We compare the two LRRT schedules with this fixed learning rate. The orange (lr_range_test_step_rate=5) and gray (lr_range_test_step_rate=50) lines represent training with similar LRRT schedules that differ only in lr_range_test_step_rate values. Although the LRRT schedules start from the same base learning rate, the gray line’s learning rate grows about 10 times faster than the orange line. Also, the learning rates of the LRRT schedules had grown larger than that of the blue line in the presented data points. We subsequently refer to the gray line as “fast growing”, and the orange line as “slow growing” LRRT schedules respectively. We make the following observations from this small example. Larger learning rates clearly benefit model performance, up to some point. The fast growing LRRT schedule achieves validation loss of 0.46 after 3000 batches, which the fixed learning rate does not achieve with 9000 batches. The slow growing LRRT does not match that score until after 6000 batches, however it maintains an increasing performance advantage over the fixed learning rate. There is an upper bound on learning rate values that are useful for training the model. The fast growing LRRT schedule hits this boundary quickly and diverges, while the slow growing LRRT will later diverge for the same reason. LRRT helped us discover these boundaries quickly, using less than 2% of the training data. These boundaries are useful information for constructing learning rate schedules. These observations from LRRT helped us to configure the learning rate boundaries and the cycle span for a 1Cycle schedule that solves the problem, as shown below. "OneCycle": { "cycle_min_lr": 0.002, "cycle_max_lr": 0.005, "cycle_first_step_size": 2000, "cycle_second_step_size": 2000, ... } In our experience these are four most critical parameters of 1Cycle schedules. We chose to use the slower LRRT schedule (lr_range_test_step_rate=5) to set cycle_min_lr because it achieves the best loss and the faster schedule diverges fairly quickly. We set cycle_max_lr to 0.005 even though the plot shows that performance was still improving at slightly higher learning rate. This is because we observed that if we wait till the maximum learning rate, the model could be at the point of divergence and impossible to recover. Since it takes 8000 batches for the learning rate to become 0.005, we set cycle_first_step_size and (cycle_second_step_size) to 2000 which is the number of steps that it takes for four GPUs to process 8000 batches. We hope this brief example sparks your imagination on using LRRT for your own unique tuning challenges. Updated: November 5, 2025 Previous Next
```
lr_range_test_min_lr
```
**Pattern 5:** Training Overview and Features Contents Overview Distributed, Effective, and Efficient Training with Ease Speed Memory efficiency Scalability Communication efficiency Data efficiency Supporting long sequence length Fast convergence for effectiveness Good Usability Features Distributed Training with Mixed Precision Mixed Precision Training Single-GPU, Multi-GPU, and Multi-Node Training Pipeline Parallelism Model Parallelism Support for Custom Model Parallelism Integration with Megatron-LM The Zero Redundancy Optimizer Optimizer State and Gradient Partitioning Activation Partitioning Constant Buffer Optimization (CBO) Contiguous Memory Optimization (CMO) ZeRO-Offload Additional Memory and Bandwidth Optimizations Smart Gradient Accumulation Communication Overlapping Training Features Simplified training API Activation Checkpointing API Gradient Clipping Automatic loss scaling with mixed precision Training Optimizers 1-bit Adam, 0/1 Adam and 1-bit LAMB optimizers with up to 26x less communication Fused Adam optimizer and arbitrary torch.optim.Optimizer CPU-Adam: High-Performance vectorized implementation of Adam Memory bandwidth optimized FP16 Optimizer Large Batch Training with LAMB Optimizer Memory-Efficient Training with ZeRO Optimizer Training Agnostic Checkpointing Advanced parameter search Learning Rate Range Test 1Cycle Learning Rate Schedule Simplified Data Loader Data Efficiency Curriculum Learning Performance Analysis and Debugging Wall Clock Breakdown Timing Activation Checkpoint Functions Flops Profiler Autotuning Monitor Communication Logging Sparse Attention Mixture of Experts (MoE) Overview Training advanced deep learning models is challenging. Beyond model design, model scientists also need to set up the state-of-the-art training techniques such as distributed training, mixed precision, gradient accumulation, and checkpointing. Yet still, scientists may not achieve the desired system performance and convergence rate. Large model sizes are even more challenging: a large model easily runs out of memory with pure data parallelism and it is difficult to use model parallelism. DeepSpeed addresses these challenges to accelerate model development and training. Distributed, Effective, and Efficient Training with Ease The DeepSpeed API is a lightweight wrapper on PyTorch. This means that you can use everything you love in PyTorch and without learning a new platform. In addition, DeepSpeed manages all of the boilerplate state-of-the-art training techniques, such as distributed training, mixed precision, gradient accumulation, and checkpoints so that you can focus on your model development. Most importantly, you can leverage the distinctive efficiency and effectiveness benefit of DeepSpeed to boost speed and scale with just a few lines of code changes to your PyTorch models. Speed DeepSpeed achieves high performance and fast convergence through a combination of efficiency optimizations on compute/communication/memory/IO and effectiveness optimizations on advanced hyperparameter tuning and optimizers. For example: DeepSpeed trains BERT-large to parity in 44 mins using 1024 V100 GPUs (64 DGX-2 boxes) and in 2.4 hours using 256 GPUs (16 DGX-2 boxes). BERT-large Training Times Devices Source Training Time 1024 V100 GPUs DeepSpeed 44 min 256 V100 GPUs DeepSpeed 2.4 hr 64 V100 GPUs DeepSpeed 8.68 hr 16 V100 GPUs DeepSpeed 33.22 hr BERT code and tutorials will be available soon. DeepSpeed trains GPT2 (1.5 billion parameters) 3.75x faster than state-of-art, NVIDIA Megatron on Azure GPUs. Read more: GPT tutorial Memory efficiency DeepSpeed provides memory-efficient data parallelism and enables training models without model parallelism. For example, DeepSpeed can train models with up to 13 billion parameters on a single GPU. In comparison, existing frameworks (e.g., PyTorch’s Distributed Data Parallel) run out of memory with 1.4 billion parameter models. DeepSpeed reduces the training memory footprint through a novel solution called Zero Redundancy Optimizer (ZeRO). Unlike basic data parallelism where memory states are replicated across data-parallel processes, ZeRO partitions model states and gradients to save significant memory. Furthermore, it also reduces activation memory and fragmented memory. The current implementation (ZeRO-2) reduces memory by up to 8x relative to the state-of-art. You can read more about ZeRO in our paper, and in our blog posts related to ZeRO-1 and ZeRO-2. With this impressive memory reduction, early adopters of DeepSpeed have already produced a language model (LM) with over 17B parameters called Turing-NLG, establishing a new SOTA in the LM category. For model scientists with limited GPU resources, ZeRO-Offload leverages both CPU and GPU memory for training large models. Using a machine with a single GPU, our users can run models of up to 13 billion parameters without running out of memory, 10x bigger than the existing approaches, while obtaining competitive throughput. This feature democratizes multi-billion-parameter model training and opens the window for many deep learning practitioners to explore bigger and better models. Scalability DeepSpeed supports efficient data parallelism, model parallelism, pipeline parallelism and their combinations, which we call 3D parallelism. 3D parallelism of DeepSpeed provides system support to run models with trillions of parameters, read more in our press-release and tutorial. DeepSpeed can run large models more efficiently, up to 10x faster for models with various sizes spanning 1.5B to hundred billion. More specifically, the data parallelism powered by ZeRO is complementary and can be combined with different types of model parallelism. It allows DeepSpeed to fit models using lower degree of model parallelism and higher batch size, offering significant performance gains compared to using model parallelism alone. Read more: ZeRO paper, and GPT tutorial. The figure depicts system throughput improvements of DeepSpeed (combining ZeRO-powered data parallelism with model parallelism of NVIDIA Megatron-LM) over using Megatron-LM alone. Communication efficiency Pipeline parallelism of DeepSpeed reduce communication volume during distributed training, which allows users to train multi-billion-parameter models 2–7x faster on clusters with limited network bandwidth. 1-bit Adam, 0/1 Adam and 1-bit LAMB reduce communication volume by up to 26x while achieving similar convergence efficiency to Adam, allowing for scaling to different types of GPU clusters and networks. 1-bit Adam blog post, 1-bit Adam tutorial, 0/1 Adam tutorial, 1-bit LAMB tutorial. Data efficiency DeepSpeed Data Efficiency Library provides efficient data sampling via curriculum learning and efficient data routing via random layerwise token dropping. The composed solution enables up to 2x data and 2x time saving during GPT-3/BERT pretraining and GPT/ViT finetuning, or further improve model quality under the same data/time. See more in the tutorial. Supporting long sequence length DeepSpeed offers sparse attention kernels—an instrumental technology to support long sequences of model inputs, whether for text, image, or sound. Compared with the classic dense Transformers, it powers an order-of-magnitude longer input sequence and obtains up to 6x faster execution with comparable accuracy. It also outperforms state-of-the-art sparse implementations with 1.5–3x faster execution. Furthermore, our sparse kernels support efficient execution of flexible sparse format and empower users to innovate on their custom sparse structures. Read more here. Fast convergence for effectiveness DeepSpeed supports advanced hyperparameter tuning and large batch size optimizers such as LAMB. These improve the effectiveness of model training and reduce the number of samples required to convergence to desired accuracy. Read more: Tuning tutorial. Good Usability Only a few lines of code changes are needed to enable a PyTorch model to use DeepSpeed and ZeRO. Compared to current model parallelism libraries, DeepSpeed does not require a code redesign or model refactoring. It also does not put limitations on model dimensions (such as number of attention heads, hidden sizes, and others), batch size, or any other training parameters. For models of up to 13 billion parameters, you can use ZeRO-powered data parallelism conveniently without requiring model parallelism, while in contrast, standard data parallelism will run out of memory for models with more than 1.4 billion parameters. In addition, DeepSpeed conveniently supports flexible combination of ZeRO-powered data parallelism with custom model parallelisms, such as tensor slicing of NVIDIA’s Megatron-LM. Features Below we provide a brief feature list, see our detailed feature overview for descriptions and usage. Distributed Training with Mixed Precision 16-bit mixed precision Single-GPU/Multi-GPU/Multi-Node Model Parallelism Support for Custom Model Parallelism Integration with Megatron-LM Pipeline Parallelism 3D Parallelism The Zero Redundancy Optimizer Optimizer State and Gradient Partitioning Activation Partitioning Constant Buffer Optimization Contiguous Memory Optimization ZeRO-Offload Leverage both CPU/GPU memory for model training Support 10B model training on a single GPU Ultra-fast dense transformer kernels Sparse attention Memory- and compute-efficient sparse kernels Support 10x long sequences than dense Flexible support to different sparse structures 1-bit Adam, 0/1 Adam and 1-bit LAMB Custom communication collective Up to 26x communication volume saving Additional Memory and Bandwidth Optimizations Smart Gradient Accumulation Communication/Computation Overlap Training Features Simplified training API Gradient Clipping Automatic loss scaling with mixed precision Training Optimizers Fused Adam optimizer and arbitrary torch.optim.Optimizer Memory bandwidth optimized FP16 Optimizer Large Batch Training with LAMB Optimizer Memory efficient Training with ZeRO Optimizer CPU-Adam Training Agnostic Checkpointing Advanced Parameter Search Learning Rate Range Test 1Cycle Learning Rate Schedule Simplified Data Loader Data Efficiency Efficient data sampling via curriculum learning and efficient data routing via random layerwise token dropping Up to 2x data and 2x time saving during GPT-3/BERT pretraining and GPT/ViT finetuning Or further improve model quality under the same data/time Curriculum Learning A curriculum learning-based data pipeline that presents easier or simpler examples earlier during training Stable and 3.3x faster GPT-2 pre-training with 8x/4x larger batch size/learning rate while maintaining token-wise convergence speed Complementary to many other DeepSpeed features Note that the Data Efficiency Library above provides more general curriculum learning support. This legacy curriculum learning feature is still supported but we recommend to use the Data Efficiency Library. Progressive Layer Dropping Efficient and robust compressed training Up to 2.5x convergence speedup for pre-training Performance Analysis and Debugging Mixture of Experts (MoE) title: “Feature Overview” layout: single permalink: /features/ toc: true toc_label: “Contents” — Distributed Training with Mixed Precision Mixed Precision Training Enable 16-bit (FP16) training by in the deepspeed_config JSON. "fp16": { "enabled": true, "loss_scale": 0, "loss_scale_window": 1000, "hysteresis": 2, "consecutive_hysteresis": false, "min_loss_scale": 1 } Single-GPU, Multi-GPU, and Multi-Node Training Easily switch between single-GPU, single-node multi-GPU, or multi-node multi-GPU execution by specifying resources with a hostfile. deepspeed --hostfile=<hostfile> \ <client_entry.py> <client args> \ --deepspeed --deepspeed_config ds_config.json The script <client_entry.py> will execute on the resources specified in <hostfile>. Pipeline Parallelism DeepSpeed provides pipeline parallelism for memory- and communication- efficient training. DeepSpeed supports a hybrid combination of data, model, and pipeline parallelism and has scaled to over one trillion parameters using 3D parallelism. Pipeline parallelism can also improve communication efficiency and has accelerated training by up to 7x on low-bandwidth clusters. Model Parallelism Support for Custom Model Parallelism DeepSpeed supports all forms of model parallelism including tensor slicing based approaches such as the Megatron-LM. It does so by only requiring the model parallelism framework to provide a model parallelism unit (mpu) that implements a few bookkeeping functionalities: mpu.get_model_parallel_rank() mpu.get_model_parallel_group() mpu.get_model_parallel_world_size() mpu.get_data_parallel_rank() mpu.get_data_parallel_group() mpu.get_data_parallel_world_size() Integration with Megatron-LM DeepSpeed is fully compatible with Megatron. Please see the Megatron-LM tutorial for details. The Zero Redundancy Optimizer The Zero Redundancy Optimizer (ZeRO) is at the heart of DeepSpeed and enables large model training at a scale that is simply not possible with model parallelism alone. When enabled, ZeRO allows training models with over 13 billion parameters without any model parallelism, and up to 200 billion parameter models with model parallelism on current generation hardware. For more details see the ZeRO paper, GPT tutorial on integration with DeepSpeed. Optimizer State and Gradient Partitioning Optimizer State and Gradient Partitioning in ZeRO reduces the memory consumption of the model states (optimizer states, gradients and parameters) by 8x compared to standard data parallelism by partitioning these states across data parallel process instead of replicating them. Activation Partitioning Activation Partitioning is a memory optimization in ZeRO that can reduce the memory consumed by activations during model parallel training (MP). In MP certain activations maybe required by all MP processes, resulting in a replication of activations across MP GPUs. Activation Partitioning stores these activations in a partitioned state once they are used for computation in the forward propagation. These activations are allgathered right before they are needed again during the backward propagation. By storing activations in a partitioned state, ZeRO in DeepSpeed can reduce the activation memory footprint proportional to the MP degree. Constant Buffer Optimization (CBO) CBO enables high network and memory throughput while restricting memory usage to a constant size. For memory- and network-bound operations such as normalization or allreduce collectives, the performance depends on the size of the operand. Simply fusing all operands into a single large operand can enable great throughput at the expense of unnecessary memory overhead. CBO in DeepSpeed fuses smaller operands into approximately a pre-defined sized buffer large enough to achieve great performance without the unnecessary memory overhead. Contiguous Memory Optimization (CMO) CMO reduces memory fragmentation during training, preventing out of memory errors due to lack of contiguous memory. Memory fragmentation is a result of interleaving between short lived and long lived memory objects. During the forward propagation activation checkpoints are long lived but the activations that recomputed are short lived. Similarly, during the backward computation, the activation gradients are short lived while the parameter gradients are long lived. CMO transfers activation checkpoints and parameter gradients to contiguous buffers preventing memory fragmentation. ZeRO-Offload ZeRO-Offload pushes the boundary of the maximum model size that can be trained efficiently using minimal GPU resources, by exploiting computational and memory resources on both GPUs and their host CPUs. It allows training up to 13-billion-parameter models on a single NVIDIA V100 GPU, 10x larger than the state-of-the-art, while retaining high training throughput of over 30 teraflops per GPU. For more details see the ZeRO-Offload release blog, and tutorial on integration with DeepSpeed. Additional Memory and Bandwidth Optimizations Smart Gradient Accumulation Gradient accumulation allows running larger batch size with limited memory by breaking an effective batch into several sequential micro-batches, and averaging the parameter gradients across these micro-batches. Furthermore, instead of averaging the gradients of each micro-batch across all GPUs, the gradients are averaged locally during each step of the sequence, and a single allreduce is done at the end of the sequence to produce the averaged gradients for the effective batch across all GPUs. This strategy significantly reduces the communication involved over the approach of averaging globally for each micro-batch, specially when the number of micro-batches per effective batch is large. Communication Overlapping During back propagation, DeepSpeed can overlap the communication required for averaging parameter gradients that have already been computed with the ongoing gradient computation. This computation-communication overlap allows DeepSpeed to achieve higher throughput even at modest batch sizes. Training Features Simplified training API The DeepSpeed core API consists of just a handful of methods: initialization: initialize training: backward and step argument parsing: add_config_arguments checkpointing : load_checkpoint and store_checkpoint DeepSpeed supports most of the features described in this document, via the use of these API, along with a deepspeed_config JSON file for enabling and disabling the features. Please see the core API doc for more details. Activation Checkpointing API DeepSpeed’s Activation Checkpointing API supports activation checkpoint partitioning, cpu checkpointing, and contiguous memory optimizations, while also allowing layerwise profiling. Please see the core API doc for more details. Gradient Clipping { "gradient_clipping": 1.0 } DeepSpeed handles gradient clipping under the hood based on the max gradient norm specified by the user. Please see the core API doc for more details. Automatic loss scaling with mixed precision DeepSpeed internally handles loss scaling for mixed precision training. The parameters for loss scaling can be specified in the deepspeed_config JSON file. Please see the core API doc for more details. Training Optimizers 1-bit Adam, 0/1 Adam and 1-bit LAMB optimizers with up to 26x less communication DeepSpeed has three communication-efficient optimizers called 1-bit Adam, 0/1 Adam and 1-bit LAMB. They offer the same convergence as Adam/LAMB, incur up to 26x less communication that enables up to 6.6x higher throughput for BERT-Large pretraining and up to 2.7x higher throughput for SQuAD fine-tuning on bandwidth-limited clusters. For more details on usage and performance, please refer to the 1-bit Adam tutorial, 1-bit Adam blog post, 0/1 Adam tutorial and 1-bit LAMB tutorial. For technical details, please refer to the 1-bit Adam paper, 0/1 Adam paper and 1-bit LAMB paper. Fused Adam optimizer and arbitrary torch.optim.Optimizer With DeepSpeed, the user can choose to use a high performance implementation of ADAM from NVIDIA, or any training optimizer that extends torch’s torch.optim.Optimizer class. CPU-Adam: High-Performance vectorized implementation of Adam We introduce an efficient implementation of Adam optimizer on CPU that improves the parameter-update performance by nearly an order of magnitude. We use the AVX SIMD instructions on Intel-x86 architecture for the CPU-Adam implementation. We support both AVX-512 and AVX-2 instruction sets. DeepSpeed uses AVX-2 by default which can be switched to AVX-512 by setting the build flag, DS_BUILD_AVX512 to 1 when installing DeepSpeed. Using AVX-512, we observe 5.1x to 6.5x speedups considering the model-size between 1 to 10 billion parameters with respect to torch-adam. Memory bandwidth optimized FP16 Optimizer Mixed precision training is handled by the DeepSpeed FP16 Optimizer. This optimizer not only handles FP16 training but is also highly efficient. The performance of weight update is primarily dominated by the memory bandwidth, and the achieved memory bandwidth is dependent on the size of the input operands. The FP16 Optimizer is designed to maximize the achievable memory bandwidth by merging all the parameters of the model into a single large buffer, and applying the weight updates in a single kernel, allowing it to achieve high memory bandwidth. Large Batch Training with LAMB Optimizer DeepSpeed makes it easy to train with large batch sizes by enabling the LAMB Optimizer. For more details on LAMB, see the LAMB paper. Memory-Efficient Training with ZeRO Optimizer DeepSpeed can train models with up to 13 billion parameters without model parallelism, and models with up to 200 billion parameters with 16-way model parallelism. This leap in model size is possible through the memory efficiency achieved via the ZeRO Optimizer. For more details see ZeRO paper . Training Agnostic Checkpointing DeepSpeed can simplify checkpointing for you regardless of whether you are using data parallel training, model parallel training, mixed-precision training, a mix of these three, or using the zero optimizer to enable larger model sizes. Please see the Getting Started guide and the core API doc for more details. Advanced parameter search DeepSpeed supports multiple Learning Rate Schedules to enable faster convergence for large batch scaling. Learning Rate Range Test Please refer to the Learning Rate Range Test tutorial. 1Cycle Learning Rate Schedule Please refer to the 1Cycle Learning Rate Schedule tutorial. Simplified Data Loader DeepSpeed abstracts away data parallelism and model parallelism from the user when it comes to data loading. Users simply provide a PyTorch dataset, and DeepSpeed data loader can automatically handle batch creation appropriately. Data Efficiency Please refer to the Data Efficiency tutorial. Curriculum Learning Please refer to the Curriculum Learning tutorial. Note that the Data Efficiency Library above provides more general curriculum learning support. This legacy curriculum learning feature is still supported but we recommend to use the Data Efficiency Library. Performance Analysis and Debugging DeepSpeed provides a set of tools for performance analysis and debugging. Wall Clock Breakdown DeepSpeed provides a detailed breakdown of the time spent in different parts of the training. This can be enabled by setting the following in the deepspeed_config file. { "wall_clock_breakdown": true, } Timing Activation Checkpoint Functions When activation checkpointing is enabled, profiling the forward and backward time of each checkpoint function can be enabled in the deepspeed_config file. { "activation_checkpointing": { "profile": true } } Flops Profiler The DeepSpeed flops profiler measures the time, flops and parameters of a PyTorch model and shows which modules or layers are the bottleneck. When used with the DeepSpeed runtime, the flops profiler can be configured in the deepspeed_config file as follows: { "flops_profiler": { "enabled": true, "profile_step": 1, "module_depth": -1, "top_modules": 3, "detailed": true, } } The flops profiler can also be used as a standalone package. Please refer to the Flops Profiler tutorial for more details. Autotuning The DeepSpeed Autotuner uses model information, system information, and heuristics to efficiently tune Zero stage, micro batch size, and other Zero configurations. Using the autotuning feature requires no code change from DeepSpeed users. While "autotuning": {"enabled": true} is the minimal required to enable autotuning, there are other parameters users can define to configure the autotuning process. Below shows major parameters and their default values in the autotuning configuration. Please refer to the Autotuning tutorial for more details. { "autotuning": { "enabled": true, "results_dir": null, "exps_dir": null, "overwrite": false, "metric": "throughput", "num_nodes": null, "num_gpus": null, "start_profile_step": 3, "end_profile_step": 5, "fast": true, "num_tuning_micro_batch_sizes": 3, "tuner_type": "model_based", "tuner_early_stopping": 5, "tuner_num_trials": 50, "arg_mappings": null } } The flops profiler can also be used as a standalone package. Please refer to the Flops Profiler tutorial for more details. Monitor The DeepSpeed Monitor logs live training metrics to one or more monitoring backends, including PyTorch’s TensorBoard, WandB, or simply to CSV files. The Monitor can be configured with one or more backends in the deepspeed_config file as follows: { "tensorboard": { "enabled": true, "output_path": "output/ds_logs/", "job_name": "train_bert" } "wandb": { "enabled": true, "team": "my_team", "group": "my_group", "project": "my_project" } "csv_monitor": { "enabled": true, "output_path": "output/ds_logs/", "job_name": "train_bert" } } The Monitor can also be added to log custom metrics and client codes. Please refer to the Monitor tutorial for more details. Communication Logging DeepSpeed provides logging of all communication operations launched within deepspeed.comm. The communication logger can be configured in the deepspeed_config file as follows: { "comms_logger": { "enabled": true, "verbose": false, "prof_all": true, "debug": false } } Client codes can then print a summary with a call to deepspeed.comm.log_summary(). For more details and example usage, see the Communication Logging tutorial. Sparse Attention DeepSpeed offers sparse attention to support long sequences. Please refer to the Sparse Attention tutorial. --deepspeed_sparse_attention "sparse_attention": { "mode": "fixed", "block": 16, "different_layout_per_head": true, "num_local_blocks": 4, "num_global_blocks": 1, "attention": "bidirectional", "horizontal_global_attention": false, "num_different_global_patterns": 4 } Mixture of Experts (MoE) To learn more about training Mixture of Experts (MoE) models with DeepSpeed, see our tutorial for more details.
```
torch.optim.Optimizer
```
**Pattern 6:** Flops Profiler Contents Overview Flops Measurement Multi-GPU, Multi-node, Data Parallelism, and Model Parallelism Usage Usage With the DeepSpeed Runtime Example: Megatron-LM Usage Outside the DeepSpeed Runtime In Model Inference Example: AlexNet Example: Bert In Model Training Workflow Example Training Workflow In this tutorial, we introduce the DeepSpeed Flops Profiler and provide examples of its usage. Overview Flops Measurement Multi-GPU, Multi-node, Data Parallelism, and Model Parallelism Usage Overview Effective use of hardware resources is critical to good performance, but performance inefficiency in existing implementations for large-scale model training and inference are often hard to spot and attribute to specific module components. DeepSpeed Flops Profiler helps users easily measure both the model training/inference speed (latency, throughput) and efficiency (floating-point operations per second, i.e., FLOPS) of a model and its submodules, with an eye towards eliminating inefficiencies in existing implementations. Below is an example output for BERT-Large(NVIDIA) on an A100 GPU with batch size 80: -------------------------- DeepSpeed Flops Profiler -------------------------- Profile Summary at step 10: Notations: data parallel size (dp_size), model parallel size(mp_size), number of parameters (params), number of multiply-accumulate operations(MACs), number of floating-point operations (flops), floating-point operations per second (FLOPS), fwd latency (forward propagation latency), bwd latency (backward propagation latency), step (weights update latency), iter latency (sum of fwd, bwd and step latency) world size: 1 data parallel size: 1 model parallel size: 1 batch size per GPU: 80 params per gpu: 336.23 M params of model = params per GPU * mp_size: 336.23 M fwd MACs per GPU: 3139.93 G fwd flops per GPU: 6279.86 G fwd flops of model = fwd flops per GPU * mp_size: 6279.86 G fwd latency: 76.67 ms bwd latency: 108.02 ms fwd FLOPS per GPU = fwd flops per GPU / fwd latency: 81.9 TFLOPS bwd FLOPS per GPU = 2 * fwd flops per GPU / bwd latency: 116.27 TFLOPS fwd+bwd FLOPS per GPU = 3 * fwd flops per GPU / (fwd+bwd latency): 102.0 TFLOPS step latency: 34.09 us iter latency: 184.73 ms samples/second: 433.07 ----------------------------- Aggregated Profile per GPU ----------------------------- Top modules in terms of params, MACs or fwd latency at different model depths: depth 0: params - {'BertForPreTrainingPreLN': '336.23 M'} MACs - {'BertForPreTrainingPreLN': '3139.93 GMACs'} fwd latency - {'BertForPreTrainingPreLN': '76.39 ms'} depth 1: params - {'BertModel': '335.15 M', 'BertPreTrainingHeads': '32.34 M'} MACs - {'BertModel': '3092.96 GMACs', 'BertPreTrainingHeads': '46.97 GMACs'} fwd latency - {'BertModel': '34.29 ms', 'BertPreTrainingHeads': '3.23 ms'} depth 2: params - {'BertEncoder': '302.31 M', 'BertLMPredictionHead': '32.34 M'} MACs - {'BertEncoder': '3092.88 GMACs', 'BertLMPredictionHead': '46.97 GMACs'} fwd latency - {'BertEncoder': '33.45 ms', 'BertLMPredictionHead': '2.61 ms'} depth 3: params - {'ModuleList': '302.31 M', 'Embedding': '31.79 M', 'Linear': '31.26 M'} MACs - {'ModuleList': '3092.88 GMACs', 'Linear': '36.23 GMACs'} fwd latency - {'ModuleList': '33.11 ms', 'BertPredictionHeadTransform': '1.83 ms''} depth 4: params - {'BertLayer': '302.31 M', 'LinearActivation': '1.05 M''} MACs - {'BertLayer': '3092.88 GMACs', 'LinearActivation': '10.74 GMACs'} fwd latency - {'BertLayer': '33.11 ms', 'LinearActivation': '1.43 ms'} depth 5: params - {'BertAttention': '100.76 M', 'BertIntermediate': '100.76 M'} MACs - {'BertAttention': '1031.3 GMACs', 'BertIntermediate': '1030.79 GMACs'} fwd latency - {'BertAttention': '19.83 ms', 'BertOutput': '4.38 ms'} depth 6: params - {'LinearActivation': '100.76 M', 'Linear': '100.69 M'} MACs - {'LinearActivation': '1030.79 GMACs', 'Linear': '1030.79 GMACs'} fwd latency - {'BertSelfAttention': '16.29 ms', 'LinearActivation': '3.48 ms'} ------------------------------ Detailed Profile per GPU ------------------------------ Each module profile is listed after its name in the following order: params, percentage of total params, MACs, percentage of total MACs, fwd latency, percentage of total fwd latency, fwd FLOPS BertForPreTrainingPreLN( 336.23 M, 100.00% Params, 3139.93 GMACs, 100.00% MACs, 76.39 ms, 100.00% latency, 82.21 TFLOPS, (bert): BertModel( 335.15 M, 99.68% Params, 3092.96 GMACs, 98.50% MACs, 34.29 ms, 44.89% latency, 180.4 TFLOPS, (embeddings): BertEmbeddings(...) (encoder): BertEncoder( 302.31 M, 89.91% Params, 3092.88 GMACs, 98.50% MACs, 33.45 ms, 43.79% latency, 184.93 TFLOPS, (FinalLayerNorm): FusedLayerNorm(...) (layer): ModuleList( 302.31 M, 89.91% Params, 3092.88 GMACs, 98.50% MACs, 33.11 ms, 43.35% latency, 186.8 TFLOPS, (0): BertLayer( 12.6 M, 3.75% Params, 128.87 GMACs, 4.10% MACs, 1.29 ms, 1.69% latency, 199.49 TFLOPS, (attention): BertAttention( 4.2 M, 1.25% Params, 42.97 GMACs, 1.37% MACs, 833.75 us, 1.09% latency, 103.08 TFLOPS, (self): BertSelfAttention( 3.15 M, 0.94% Params, 32.23 GMACs, 1.03% MACs, 699.04 us, 0.92% latency, 92.22 TFLOPS, (query): Linear(1.05 M, 0.31% Params, 10.74 GMACs, 0.34% MACs, 182.39 us, 0.24% latency, 117.74 TFLOPS,...) (key): Linear(1.05 M, 0.31% Params, 10.74 GMACs, 0.34% MACs, 57.22 us, 0.07% latency, 375.3 TFLOPS,...) (value): Linear(1.05 M, 0.31% Params, 10.74 GMACs, 0.34% MACs, 53.17 us, 0.07% latency, 403.91 TFLOPS,...) (dropout): Dropout(...) (softmax): Softmax(...) ) (output): BertSelfOutput( 1.05 M, 0.31% Params, 10.74 GMACs, 0.34% MACs, 114.68 us, 0.15% latency, 187.26 TFLOPS, (dense): Linear(1.05 M, 0.31% Params, 10.74 GMACs, 0.34% MACs, 64.13 us, 0.08% latency, 334.84 TFLOPS, ...) (dropout): Dropout(...) ) ) (PreAttentionLayerNorm): FusedLayerNorm(...) (PostAttentionLayerNorm): FusedLayerNorm(...) (intermediate): BertIntermediate( 4.2 M, 1.25% Params, 42.95 GMACs, 1.37% MACs, 186.68 us, 0.24% latency, 460.14 TFLOPS, (dense_act): LinearActivation(4.2 M, 1.25% Params, 42.95 GMACs, 1.37% MACs, 175.0 us, 0.23% latency, 490.86 TFLOPS,...) ) (output): BertOutput( 4.2 M, 1.25% Params, 42.95 GMACs, 1.37% MACs, 116.83 us, 0.15% latency, 735.28 TFLOPS, (dense): Linear(4.2 M, 1.25% Params, 42.95 GMACs, 1.37% MACs, 65.57 us, 0.09% latency, 1310.14 TFLOPS,...) (dropout): Dropout(...) ) ) ... (23): BertLayer(...) ) ) (pooler): BertPooler(...) ) (cls): BertPreTrainingHeads(...) ) ------------------------------------------------------------------------------ In the summary profile, the DeepSpeed Flops Profiler outputs the number of parameters, floating-point operations (flops), FLOPS, latency, and throughput in samples/second of the model. This profile shows how much performance gap (compared to the peak hardware performance) the current model execution has and helps users tune the training or inference setup (e.g., hyperparameters, data parallelism, model parallelism, system configurations, etc.) for better performance. The DeepSpeed Flops Profiler also measures significant modules at different model depths (aggregated profile) and module-specific profile in the model architecture (detailed profile). Using these profiles, DeepSpeed users can understand how each layer or submodule contributes to the overall model complexity/performance. Then users can adjust or refactor the model design to improve performance. For example, using the profiler, DeepSpeed users can quantitatively tell if stacking smaller layers is lighter or more performant than having bigger ones. The aggregated and detailed profiles also allow users to quickly identify bottleneck modules. In the BERT-Large example above, using the DeepSpeed Flops Profiler, we find that BertLayer is the most significant layer and contains quite a few dropout, softmax, and layer norm along with linear modules. These modules are not heavy in flops and would trigger many GPU kernel invocations and create excessive read/write requests to memory. The pattern shown in the detailed profile suggests this is a perfect match for kernel fusion, and we developed fused transformer-kernels to reduce data movement (see DeepSpeedBert). After applying our optimizations, we see a 25% improvement in FLOPS per GPU and overall training samples/second in the DeepSpeed Flops Profiler output. The DeepSpeed Flops Profiler can be used with the DeepSpeed runtime without any user code change or be used independently from DeepSpeed as a standalone package. When using DeepSpeed for model training, the profiler can be enabled in the DeepSpeed configuration file. As a standalone package, the profiler API can be used in both training and inference code. The DeepSpeed profiler is still under active development and includes just initial features. Stay connected for more exciting features to be added soon. Flops Measurement Similar to existing flops calculation tools or methods, the DeepSpeed Flops Profiler measures the flops of the forward pass of a module and the flops of the backward pass is estimated as 2 times of that of the forward pass. Different from the PyTorch profiler which calculates the flops of PyTorch operators, the DeepSpeed Flops Profiler measures the flops within modules in a model and provides more insights to the users about the model execution. The flops estimation is partly inspired by ptflops with the major difference being that the DeepSpeed Flops Profiler not only supports flops computation directly at module level, but can also capture torch.nn.functional invoked in a module to estimate the flops. Thus the DeepSpeed Flops Profiler allows for customized modules in the model, e.g., ParallelTransformerLayerworks, ParallelSelfAttention, RowParallelLinear, etc. in Megatron-LM. This is in contrast to ptflops which requires users to write customized flops calculation functions for each customized module. Multi-GPU, Multi-node, Data Parallelism, and Model Parallelism The DeepSpeed Flops Profiler outputs the per GPU profile as well as the world size, data parallel size, and model parallel size. For models running on multi-GPU or multi-node, only change of the model parallelism (e.g., --model-parallel-size in Megatron-LM) affects the number of flops and parameters profiled, i.e., model_parallel_size * flops = total_flops and model_parallel_size * parameters = total_parameters. The data parallel size or world size (related to the number of GPUs or nodes) does not affect the per GPU profile. Usage The DeepSpeed Flops Profiler can be used with the DeepSpeed runtime or as a standalone package. When using DeepSpeed for model training, the profiler can be configured in the deepspeed configuration file without user code changes. To use the flops profiler outside the DeepSpeed runtime, install DeepSpeed and import the flops_profiler package to use the APIs directly. Examples of each usage are given below. Usage With the DeepSpeed Runtime Example: Megatron-LM Usage Outside the DeepSpeed Runtime In Model Inference Example: AlexNet Example: Bert In Model Training Workflow Example Training Workflow Usage With the DeepSpeed Runtime When using DeepSpeed for model training, the profiler can be configured in the deepspeed configuration file. No explicit API calls are needed to use the profiler. The profiler can be enabled by adding the following field to deepspeed’s configuration json file. Refer to flops profiler for details. { "flops_profiler": { "enabled": true, "profile_step": 1, "module_depth": -1, "top_modules": 1, "detailed": true, "output_file": null } } Example: Megatron-LM For information on running Megatron-LM with DeepSpeed, please refer to our tutorial Megatron-LM. An example output of 12-layer Megatron-LM model (hidden_size = 8192, num_attention_heads = 32, batch_size = 1024, seq_length = 1024) is shown below. -------------------------- DeepSpeed Flops Profiler -------------------------- Profile Summary at step 10: Notations: data parallel size (dp_size), model parallel size(mp_size), number of parameters (params), number of multiply-accumulate operations(MACs), number of floating-point operations (flops), floating-point operations per second (FLOPS), fwd latency (forward propagation latency), bwd latency (backward propagation latency), step (weights update latency), iter latency (sum of fwd, bwd and step latency) world size: 1 data parallel size: 1 model parallel size: 1 batch size per GPU: 1024 params per gpu: 1.29 M params of model = params per GPU * mp_size: 1.29 M fwd MACs per GPU: 41271.95 G fwd flops per GPU: 82543.9 G fwd flops of model = fwd flops per GPU * mp_size: 82543.9 G fwd latency: 1.89 s bwd latency: 5.38 s fwd FLOPS per GPU = fwd flops per GPU / fwd latency: 43.68 TFLOPS bwd FLOPS per GPU = 2 * fwd flops per GPU / bwd latency: 30.7 TFLOPS fwd+bwd FLOPS per GPU = 3 * fwd flops per GPU / (fwd+bwd latency): 34.07 TFLOPS step latency: 34.12 s iter latency: 41.39 s samples/second: 24.74 ----------------------------- Aggregated Profile per GPU ----------------------------- Top 1 modules in terms of params, MACs or fwd latency at different model depths: depth 0: params - {'GPT2Model': '1.29 M'} MACs - {'GPT2Model': '41271.95 GMACs'} fwd latency - {'GPT2Model': '1.84 s'} depth 1: params - {'TransformerLanguageModel': '1.29 M'} MACs - {'TransformerLanguageModel': '39584.03 GMACs'} fwd latency - {'TransformerLanguageModel': '1.83 s'} depth 2: params - {'ParallelTransformer': '1.29 M'} MACs - {'ParallelTransformer': '39584.03 GMACs'} fwd latency - {'ParallelTransformer': '1.81 s'} depth 3: params - {'ModuleList': '1.28 M'} MACs - {'ModuleList': '39584.03 GMACs'} fwd latency - {'ModuleList': '1.3 s'} depth 4: params - {'ParallelTransformerLayerPart2': '688.15 k'} MACs - {'ParallelTransformerLayerPart2': '26388.28 GMACs'} fwd latency - {'ParallelTransformerLayerPart2': '865.73 ms'} depth 5: params - {'ParallelMLP': '491.54 k'} MACs - {'ParallelMLP': '26388.28 GMACs'} fwd latency - {'ParallelMLP': '849.4 ms'} ------------------------------ Detailed Profile per GPU ------------------------------ Each module profile is listed after its name in the following order: params, percentage of total params, MACs, percentage of total MACs, fwd latency, percentage of total fwd latency, fwd FLOPS Note: 1. A module can have torch.nn.module or torch.nn.functional to compute logits (e.g. CrossEntropyLoss). They are not counted as submodules, thus not to be printed out. However they make up the difference between a parent's MACs(or latency) and the sum of its submodules'. 1. Number of floating-point operations is a theoretical estimation, thus FLOPS computed using that could be larger than the maximum system throughput. 2. The fwd latency listed in the top module's profile is directly captured at the module forward function in PyTorch, thus it's less than the fwd latency shown above which is captured in DeepSpeed. GPT2Model( 1.29 M, 100.00% Params, 41271.95 GMACs, 100.00% MACs, 1.84 s, 100.00% latency, 44.78 TFLOPS, (language_model): TransformerLanguageModel( 1.29 M, 100.00% Params, 39584.03 GMACs, 95.91% MACs, 1.83 s, 99.11% latency, 43.34 TFLOPS, (embedding): Embedding( 2, 0.00% Params, 0 MACs, 0.00% MACs, 18.1 ms, 0.98% latency, 0.0 FLOPS, (word_embeddings): VocabParallelEmbedding(1, 0.00% Params, 0 MACs, 0.00% MACs, 164.75 us, 0.01% latency, 0.0 FLOPS, ) (position_embeddings): Embedding(1, 0.00% Params, 0 MACs, 0.00% MACs, 489.23 us, 0.03% latency, 0.0 FLOPS, 1024, 8192) (embedding_dropout): Dropout(0, 0.00% Params, 0 MACs, 0.00% MACs, 93.94 us, 0.01% latency, 0.0 FLOPS, p=0.1, inplace=False) ) (transformer): ParallelTransformer( 1.29 M, 100.00% Params, 39584.03 GMACs, 95.91% MACs, 1.81 s, 98.11% latency, 43.78 TFLOPS, (layers): ModuleList( 1.28 M, 98.73% Params, 39584.03 GMACs, 95.91% MACs, 1.3 s, 70.66% latency, 60.79 TFLOPS, (0): ParallelTransformerLayerPart1( 49.15 k, 3.80% Params, 1099.65 GMACs, 2.66% MACs, 23.5 ms, 1.27% latency, 93.6 TFLOPS, (input_layernorm): FusedLayerNorm(16.38 k, 1.27% Params, 0 MACs, 0.00% MACs, 128.75 us, 0.01% latency, 0.0 FLOPS, torch.Size([8192]), eps=1e-05, elementwise_affine=True) (attention): ParallelSelfAttention( 32.77 k, 2.53% Params, 1099.65 GMACs, 2.66% MACs, 22.8 ms, 1.24% latency, 96.46 TFLOPS, (query_key_value): ColumnParallelLinear(24.58 k, 1.90% Params, 824.63 GMACs, 2.00% MACs, 8.93 ms, 0.48% latency, 184.7 TFLOPS, ) (scale_mask_softmax): FusedScaleMaskSoftmax(0, 0.00% Params, 134.22 MMACs, 0.00% MACs, 151.16 us, 0.01% latency, 1.78 TFLOPS, ) (attention_dropout): Dropout(0, 0.00% Params, 0 MACs, 0.00% MACs, 79.63 us, 0.00% latency, 0.0 FLOPS, p=0.1, inplace=False) (dense): RowParallelLinear(8.19 k, 0.63% Params, 274.88 GMACs, 0.67% MACs, 2.67 ms, 0.14% latency, 205.81 TFLOPS, ) ) ) (1): ParallelTransformerLayerPart2( 57.35 k, 4.43% Params, 2199.02 GMACs, 5.33% MACs, 77.53 ms, 4.21% latency, 56.73 TFLOPS, (post_attention_layernorm): FusedLayerNorm(16.38 k, 1.27% Params, 0 MACs, 0.00% MACs, 116.11 us, 0.01% latency, 0.0 FLOPS, torch.Size([8192]), eps=1e-05, elementwise_affine=True) (mlp): ParallelMLP( 40.96 k, 3.16% Params, 2199.02 GMACs, 5.33% MACs, 76.19 ms, 4.13% latency, 57.72 TFLOPS, (dense_h_to_4h): ColumnParallelLinear(32.77 k, 2.53% Params, 1099.51 GMACs, 2.66% MACs, 10.79 ms, 0.59% latency, 203.81 TFLOPS, ) (dense_4h_to_h): RowParallelLinear(8.19 k, 0.63% Params, 1099.51 GMACs, 2.66% MACs, 14.38 ms, 0.78% latency, 152.95 TFLOPS, ) ) ) ... (23): ParallelTransformerLayerPart2(...) ) (final_layernorm): FusedLayerNorm(16.38 k, 1.27% Params, 0 MACs, 0.00% MACs, 110.86 us, 0.01% latency, 0.0 FLOPS, torch.Size([8192]), eps=1e-05, elementwise_affine=True) ) ) ) ------------------------------------------------------------------------------ Usage Outside the DeepSpeed Runtime The profiler can be used as a standalone package outside of the DeepSpeed runtime. One can simply install DeepSpeed and import the flops_profiler package to use the APIs directly. Refer to installation of DeepSpeed for installing DeepSpeed. In Model Inference To profile a trained model in inference, use the get_model_profile function. Examples are given below. Example: AlexNet The following example shows how to profile AlexNet using the DeepSpeed flops profiler. import torchvision.models as models import torch from deepspeed.profiling.flops_profiler import get_model_profile from deepspeed.accelerator import get_accelerator with get_accelerator().device(0): model = models.alexnet() batch_size = 256 flops, macs, params = get_model_profile(model=model, # model input_shape=(batch_size, 3, 224, 224), # input shape to the model. If specified, the model takes a tensor with this shape as the only positional argument. args=None, # list of positional arguments to the model. kwargs=None, # dictionary of keyword arguments to the model. print_profile=True, # prints the model graph with the measured profile attached to each module detailed=True, # print the detailed profile module_depth=-1, # depth into the nested modules, with -1 being the inner most modules top_modules=1, # the number of top modules to print aggregated profile warm_up=10, # the number of warm-ups before measuring the time of each module as_string=True, # print raw numbers (e.g. 1000) or as human-readable strings (e.g. 1k) output_file=None, # path to the output file. If None, the profiler prints to stdout. ignore_modules=None) # the list of modules to ignore in the profiling Example: Bert from functools import partial import torch from transformers import BertForSequenceClassification, BertTokenizer from deepspeed.profiling.flops_profiler import get_model_profile from deepspeed.accelerator import get_accelerator def bert_input_constructor(batch_size, seq_len, tokenizer): fake_seq = "" for _ in range(seq_len - 2): # ignore the two special tokens [CLS] and [SEP] fake_seq += tokenizer.pad_token inputs = tokenizer([fake_seq] * batch_size, padding=True, truncation=True, return_tensors="pt") labels = torch.tensor([1] * batch_size) inputs = dict(inputs) inputs.update({"labels": labels}) return inputs with get_accelerator().device(0): tokenizer = BertTokenizer.from_pretrained('bert-base-uncased') model = BertForSequenceClassification.from_pretrained('bert-base-uncased') batch_size = 4 seq_len = 128 enable_profile = True if enable_profile: flops, macs, params = get_model_profile( model, kwargs=bert_input_constructor(batch_size, seq_len, tokenizer), print_profile=True, detailed=True, ) else: inputs = bert_input_constructor((batch_size, seq_len), tokenizer) outputs = model(inputs) In Model Training Workflow To profile model forward in a training workflow, use the FlopsProfilerclass. The FlopsProfilerclass provides the following methods: start_profile() - starts profiling get_total_flops(as_string=False) - returns the total number of floating-point operations in the model get_total_macs(as_string=False) - returns the total number of MACs in the model get_total_params(as_string=False) - returns the total number of parameters in the model print_model_profile(profile_step=1, module_depth=-1, top_modules=3, detailed=True, output_file=None) - prints the model profile stop_profile() - stops profiling. This stops the flops counting in the model. end_profile() - cleans up. This cleans up the profile attributes added to the model during the profiling. This should be invoked at the end of the profiling and AFTER get_total_flops, get_total_params or print_model_profile. Example Training Workflow Below is an example of this usage in a typical training workflow. from deepspeed.profiling.flops_profiler import FlopsProfiler model = Model() prof = FlopsProfiler(model) profile_step = 5 print_profile= True for step, batch in enumerate(data_loader): # start profiling at training step "profile_step" if step == profile_step: prof.start_profile() # forward() method loss = model(batch) # end profiling and print output if step == profile_step: # if using multi nodes, check global_rank == 0 as well prof.stop_profile() flops = prof.get_total_flops() macs = prof.get_total_macs() params = prof.get_total_params() if print_profile: prof.print_model_profile(profile_step=profile_step) prof.end_profile() # runs backpropagation loss.backward() # weight update optimizer.step() Updated: November 5, 2025 Previous Next
```
80
```
**Pattern 7:** DeepSpeed Configuration JSON Contents Batch Size Related Parameters Optimizer Parameters Scheduler Parameters Communication options FP16 training options BFLOAT16 training options Automatic mixed precision (AMP) training options Gradient Clipping ZeRO Optimizations for FP16 Training Parameter offloading Optimizer offloading Asynchronous I/O Logging Autotuning Flops Profiler Activation Checkpointing Sparse Attention Data Efficiency Curriculum Learning Monitoring Module Elastic Training Config (V0.1 and V0.2) Communication Logging Compression Layer Reduction Weight Quantization Activation Quantization Sparse Pruning Row Pruning Head Pruning Channel Pruning Checkpoint options Data Type options Batch Size Related Parameters Note: train_batch_size must be equal to train_micro_batch_size_per_gpu * gradient_accumulation_steps * number of GPUs. For simplicity, you can choose to only specify two of the three parameters, the last one will be inferred automatically by DeepSpeed. train_batch_size: [integer] Value Example The effective training batch size. This is the amount of data samples that leads to one step of model update. train_batch_size is aggregated by the batch size that a single GPU processes in one forward/backward pass (a.k.a., train_micro_batch_size_per_gpu), the gradient accumulation steps (a.k.a., gradient_accumulation_steps), and the number of GPUs. Can be omitted if both train_micro_batch_size_per_gpu and gradient_accumulation_steps are provided. 32 train_micro_batch_size_per_gpu: [integer] Description Default Batch size to be processed by one GPU in one step (without gradient accumulation). Can be omitted if both train_batch_size and gradient_accumulation_steps are provided. train_batch_size value gradient_accumulation_steps: [integer] Description Default Number of training steps to accumulate gradients before averaging and applying them. This feature is sometimes useful to improve scalability since it results in less frequent communication of gradients between steps. Another impact of this feature is the ability to train with larger batch sizes per GPU. Can be omitted if both train_batch_size and train_micro_batch_size_per_gpu are provided. 1 Optimizer Parameters optimizer: [dictionary] Fields Value Example type The optimizer name. DeepSpeed natively supports Adam, AdamW, OneBitAdam, Lamb, and OneBitLamb optimizers (See here for details) and will import other optimizers from torch. "Adam" params Dictionary of parameters to instantiate optimizer. The parameter names must match the optimizer constructor signature (e.g., for Adam). {"lr": 0.001, "eps": 1e-8} Example of optimizer with Adam "optimizer": { "type": "Adam", "params": { "lr": 0.001, "betas": [ 0.8, 0.999 ], "eps": 1e-8, "weight_decay": 3e-7 } } The Adam optimizer also supports the following two params keys/values in addition to the standard parameters from torch.optim.Adam: “params” key Description Default torch_adam Use torch’s implementation of adam instead of our fused adam implementation false adam_w_mode Apply L2 regularization (also known as AdamW) true Another example of optimizer with 1-bit Adam specific parameters is as follows. "optimizer": { "type": "OneBitAdam", "params": { "lr": 0.001, "betas": [ 0.8, 0.999 ], "eps": 1e-8, "weight_decay": 3e-7, "freeze_step": 400, "cuda_aware": false, "comm_backend_name": "nccl" } } The 1-bit Adam optimizer supports the following three params keys/values in addition to the standard Adam (learn more in our tutorial): “params” key Description Default freeze_step Number of warm up steps before 1-bit compression gets applied to the communication 100000 cuda_aware To indicate that the underlying MPI library supports CUDA-Aware communication false comm_backend_name To indicate which backend implementation to use “nccl” A variant optimizer for 1-bit Adam is 0/1 Adam, which further optimizes 1-bit Adam via adaptive variance freezing and 1-bit synchronization over optimizer states. "optimizer": { "type": "ZeroOneAdam", "params": { "lr": 1e-3, "weight_decay": 0.01, "bias_correction": false, "var_freeze_step": 1000, "var_update_scaler": 16, "local_step_scaler": 1000, "local_step_clipper": 16, "cuda_aware": false, "comm_backend_name": "nccl" } } 0/1 Adam supports the following params key/values in addition to standard Adam (learn more in our tutorial.) “params” key Description Default var_freeze_step The latest step to update the variance 100000 var_update_scaler The interval to update the variance 16 local_step_scaler The interval to scale the local steps interval according to the learning rate policy 32678 local_step_clipper The largest interval for local steps with learning rate policy 16 cuda_aware To indicate that the underlying MPI library supports CUDA-Aware communication false comm_backend_name To indicate which backend implementation to use “nccl” Another example of optimizer with 1-bit LAMB "optimizer": { "type": "OneBitLamb", "params": { "lr": 11e-3, "weight_decay": 0.01, "bias_correction": false, "max_coeff": 0.3, "min_coeff": 0.01, "freeze_step": 1000, "cuda_aware": false, "comm_backend_name": "nccl", "coeff_beta": 0.9, "factor_max": 4.0, "factor_min": 0.5, "factor_threshold": 0.1 } } The 1-bit LAMB optimizer supports the following params keys/values in addition to the standard LAMB (learn more in our tutorial): “params” key Description Default max_coeff Scaling coefficient upper bound for original LAMB algorithm and 1-bit LAMB’s warmup stage 10.0 min_coeff Scaling coefficient lower bound for original LAMB algorithm and 1-bit LAMB’s warmup stage 0.01 freeze_step Number of warm up steps before 1-bit compression gets applied to the communication 100000 cuda_aware To indicate that the underlying MPI library supports CUDA-Aware communication false comm_backend_name To indicate which backend implementation to use “nccl” coeff_beta Coefficient used for computing running averages of lamb coefficient 0.9 factor_max Maximum value of scaling factor to the frozen lamb coefficient during compression stage 4.0 factor_min Minimum value of scaling factor to the frozen lamb coefficient during compression stage 0.5 factor_threshold Threshold of how much the scaling factor can fluctuate between steps 0.1 Scheduler Parameters DeepSpeed calls the step() method of the scheduler at every training step when model_engine.step() is executed. scheduler: [dictionary] Fields Value Example type The scheduler name. See here for list of support schedulers. "WarmupLR" params Dictionary of parameters to instantiate scheduler. The parameter names should match scheduler constructor signature. {"warmup_min_lr": 0, "warmup_max_lr": 0.001} Example of scheduler "scheduler": { "type": "WarmupLR", "params": { "warmup_min_lr": 0, "warmup_max_lr": 0.001, "warmup_num_steps": 1000 } } Communication options communication_data_type: [string] Description Default During gradient averaging perform communication with selected data type. By default it will be determined by selected regime None prescale_gradients: [boolean] Description Default Scale gradients before doing allreduce false gradient_predivide_factor: [float] Description Default Before gradient averaging predivide gradients by a specified factor, can sometimes help with fp16 stability when scaling to large numbers of GPUs 1.0 sparse_gradients: [boolean] Description Default Enable sparse compression of torch.nn.Embedding gradients. This feature is essentially deprecated as we don’t see use cases for it as much anymore. It should be noted that this feature is not compatible with torch.sparse related features. false FP16 training options Note: this mode cannot be combined with the amp mode described below. fp16: [dictionary] Description Default Configuration for using mixed precision/FP16 training that leverages NVIDIA’s Apex package. An example, including the available dictionary keys is illustrated below. NOTE: this does not use Apex’s AMP mode that allows for more flexibility in mixed precision training modes, this mode is similar to AMP’s O2 mode. Please see AMP support below if you want to use more complex mixed precision modes. If you want to use ZeRO (currently) you must use this mode. None "fp16": { "enabled": true, "auto_cast": false, "loss_scale": 0, "initial_scale_power": 16, "loss_scale_window": 1000, "hysteresis": 2, "consecutive_hysteresis": false, "min_loss_scale": 1 } fp16:enabled: [boolean] Description Default enabled is a fp16 parameter indicating whether or not FP16 training enabled. false fp16:auto_cast: [boolean] Description Default auto_cast automatically casts inputs to fp16 false fp16:loss_scale: [float] Description Default loss_scale is a fp16 parameter representing the loss scaling value for FP16 training. The default value of 0.0 results in dynamic loss scaling, otherwise the value will be used for static fixed loss scaling. 0.0 fp16:initial_scale_power: [integer] Description Default initial_scale_power is a fp16 parameter representing the power of the initial dynamic loss scale value. The actual loss scale is computed as 2initial_scale_power. 16 fp16:loss_scale_window: [integer] Description Default loss_scale_window is a fp16 parameter representing the window over which to raise/lower the dynamic loss scale value. 1000 fp16:hysteresis: [integer] Description Default hysteresis is a fp16 parameter representing the delay shift in dynamic loss scaling. 2 fp16:consecutive_hysteresis: [boolean] Description Default consecutive_hysteresis is a fp16 parameter representing whether to refill the hysteresis if we reach an iteration that doesn’t overflow false fp16:min_loss_scale: [integer] Description Default min_loss_scale is a fp16 parameter representing the minimum dynamic loss scale value. 1 BFLOAT16 training options Note: this mode cannot be combined with the amp mode described below. Note: this mode cannot be combined with the fp16 mode described above. bf16: [dictionary] Description Default Configuration for using bfloat16 floating-point format as an alternative to FP16. BFLOAT16 requires hardware support (e.g., NVIDIA A100). An example, including the available dictionary keys is illustrated below. Training with bfloat16 does not require loss scaling. None "bf16": { "enabled": true } bf16:enabled: [boolean] Description Default enabled indicates whether BFLOAT16 training is enabled. false Automatic mixed precision (AMP) training options Note: this mode cannot be combined with the fp16 mode described above. In addition this mode is not currently compatible with ZeRO. amp: [dictionary] Description Default Configuration for using automatic mixed precision (AMP) training that leverages NVIDIA’s Apex AMP package. An example, including the available dictionary keys is illustrated below. Is not compatible with fp16 mode above or ZeRO. Any parameters outside of “enabled” will be passed to AMP’s initialize call, see the API and descriptions here at the apex.amp.initialize documentation. None "amp": { "enabled": true, ... "opt_level": "O1", ... } amp:enabled: [boolean] Description Default enabled is an amp parameter indicating whether or not AMP training is enabled. false amp params: [various] Description Default Any parameters outside of “enabled” will be passed to AMP’s initialize call, see the API and descriptions here at the apex.amp.initialize documentation. None Gradient Clipping gradient_clipping: [float] Description Default Enable gradient clipping with value 1.0 ZeRO Optimizations for FP16 Training Enabling and configuring ZeRO memory optimizations "zero_optimization": { "stage": [0|1|2|3], "allgather_partitions": [true|false], "allgather_bucket_size": 5e8, "overlap_comm": false, "reduce_scatter": [true|false], "reduce_bucket_size": 5e8, "contiguous_gradients" : [true|false], "offload_param": { ... }, "offload_optimizer": { ... }, "stage3_max_live_parameters" : 1e9, "stage3_max_reuse_distance" : 1e9, "stage3_prefetch_bucket_size" : 5e8, "stage3_param_persistence_threshold" : 1e6, "sub_group_size" : 1e12, "elastic_checkpoint" : [true|false], "stage3_gather_16bit_weights_on_model_save": [true|false], "ignore_unused_parameters": [true|false], "round_robin_gradients": [true|false], "zero_hpz_partition_size": 1, "zero_quantized_weights": [true|false], "zero_quantized_gradients": [true|false], "log_trace_cache_warnings": [true|false], } zero_optimization: [dictionary] Description Default Enable ZeRO memory optimizations, compatible with FP16/BF16/FP32 and the Adam optimizer. false stage: [integer] Description Default Chooses different stages of ZeRO Optimizer. Stage 0, 1, 2, and 3 refer to disabled, optimizer state partitioning, and optimizer+gradient state partitioning, and optimizer+gradient+parameter partitioning, respectively. 0 allgather_partitions: [boolean] Description Default Chooses between allgather collective or a series of broadcast collectives to gather updated parameters from all the GPUs at the end of each step true allgather_bucket_size: [integer] Description Default Number of elements allgathered at a time. Limits the memory required for the allgather for large model sizes 5e8 overlap_comm: [boolean] Description Default Attempts to overlap the reduction of the gradients with backward computation false reduce_scatter: [boolean] Description Default Uses reduce or reduce scatter instead of allreduce to average gradients true reduce_bucket_size: [integer] Description Default Number of elements reduced/allreduced at a time. Limits the memory required for the allgather for large model sizes 5e8 contiguous_gradients: [boolean] Description Default Copies the gradients to a contiguous buffer as they are produced. Avoids memory fragmentation during backward pass. True load_from_fp32_weights: [boolean] Description Default Initialize fp32 master weights from fp32 copies in checkpoint (no precision loss) or from model’s fp16 copies (with precision loss). This can be used to initialize optimizer state even when checkpoint is missing optimizer state. True grad_hooks: [boolean] Description Default For use with ZeRO stage 1, enable backward hooks to reduce gradients during the backward pass or wait until the end of the backward pass. True round_robin_gradients: [boolean] Description Default Stage 1 and 2 optimization for CPU offloading that parallelizes gradient copying to CPU memory among ranks by fine-grained gradient partitioning. Performance benefit grows with gradient accumulation steps (more copying between optimizer steps) or GPU count (increased parallelism). False offload_param: [dictionary] Description Default Enable offloading of model parameters to CPU or NVMe. This frees up GPU memory for larger models or batch sizes. Valid only with stage 3. See here for more details. False offload_optimizer: [dictionary] Description Default Enable offloading of optimizer state to CPU or NVMe, and optimizer computation to CPU. This frees up GPU memory for larger models or batch sizes. Valid for ZeRO stage 1, 2, 3. See here for more details. False stage3_max_live_parameters: [integer] Description Default The maximum number of parameters resident per GPU before releasing. Smaller values use less memory, but perform more communication. 1e9 stage3_max_reuse_distance: [integer] Description Default Do not release a parameter if it will be reused within this threshold of parameters. Smaller values use less memory, but perform more communication. 1e9 stage3_prefetch_bucket_size: [integer] Description Default The size of the fixed buffer for prefetching parameters. Smaller values use less memory, but can increase stalls due to communication. 5e8 stage3_param_persistence_threshold: [integer] Description Default Do not partition parameters smaller than this threshold. Smaller values use less memory, but can greatly increase communication (especially latency-bound messages). 1e5 stage3_gather_16bit_weights_on_model_save: [boolean] Description Default Consolidate the weights before saving the model by save_16bit_model(). Since the weights are partitioned across GPUs, they aren’t part of state_dict, so this function automatically gathers the weights when this option is enabled and then saves the fp16 model weights. False stage3_module_granularity_threshold: [integer] | Description | Default | |——————————————————————————————————————————————————————————————————————————————————————————–| ——- | | The granularity of a module is determined by the ratio of parameter_count / (1 + descendant_count). ZeRO3 classifies modules with a granularity below the threshold as fine-grained, treating them as integral units during parameter fetching. This reduces host and communication overhead from separate hooks. | 0 | zero_hpz_partition_size: [integer] Description Default Number of ranks in hiearchical partitioning ZeRO (hpZ) secondary tensor group of ZeRO++, default is 1 meaning no hpZ, ideal is number of ranks (gpus) per node. 1 zero_quantized_weights: [boolean] Description Default Boolean indicating whether to enable communication efficient quantized weights of ZeRO++. False zero_quantized_gradients: [boolean] Description Default Boolean indicating whether to enable communication efficient quantized gradients of ZeRO++. False log_trace_cache_warnings: [boolean] Description Default Log warnings from trace cache optimization of parameter sharding, such as cache invalidation events. False cpu_offload: [boolean] Deprecated: cpu_offload is deprecated and will be removed in future, please use offload_optimizer instead. Description Default Enable offloading of optimizer memory and computation to CPU. This frees up GPU memory for larger models or batch sizes. Valid with stage 1 and 2. False Parameter offloading Enabling and configuring ZeRO optimization of parameter offloading to CPU/NVMe. Available only with ZeRO stage 3. Note that if the value of “device” is not specified or not supported, an assertion will be triggered. "offload_param": { "device": "[cpu|nvme]", "nvme_path": "/local_nvme", "pin_memory": [true|false], "buffer_count": 5, "buffer_size": 1e8, "max_in_cpu": 1e9 } device: [string] Description Default Device memory to offload model parameters. Supported options are cpu and nvme. cpu nvme_path: [string] Description Default Filesystem path for NVMe device for parameter offloading. /local_nvme pin_memory: [boolean] Description Default Offload to page-locked CPU memory. This could boost throughput at the cost of extra memory overhead. false buffer_count: [integer] Description Default Number of buffers in buffer pool for parameter offloading to NVMe. 5 buffer_size: [integer] Description Default Size of buffers in buffer pool for parameter offloading to NVMe. 1e8 max_in_cpu: [integer] Description Default Number of parameter elements to maintain in CPU memory when offloading to NVMe is enabled. 1e9 Optimizer offloading Enabling and configuring ZeRO optimization of offloading optimizer computation to CPU and state to CPU/NVMe. CPU offloading is available with ZeRO stage 1, 2, 3. NVMe offloading is available only with ZeRO stage 3. Note that if the value of “device” is not specified or not supported, an assertion will be triggered. "offload_optimizer": { "device": "[cpu|nvme]", "nvme_path": "/local_nvme", "pin_memory": [true|false], "ratio": 0.3, "buffer_count": 4, "fast_init": false } device: [string] Description Default Device memory to offload optimizer state. Supported options are cpu and nvme. Optimizer computation is offload to CPU regardless of device option. cpu nvme_path: [string] Description Default Filesystem path for NVMe device for optimizer state offloading. /local_nvme pin_memory: [boolean] Description Default Offload to page-locked CPU memory. This could boost throughput at the cost of extra memory overhead. false ratio: [float] Description Default the ratio of parameters updating (i.e. optimizer step) on CPU side. 1 buffer_count: [integer] Description Default Number of buffers in buffer pool for optimizer state offloading to NVMe. This should be at least the number of states maintained per parameter by the optimizer. For example, Adam optimizer has 4 states (parameter, gradient, momentum, and variance). 4 fast_init: [boolean] Description Default Enable fast optimizer initialization when offloading to NVMe. false Asynchronous I/O Configuring the asynchronous I/O module for offloading parameter and optimizer states to persistent (NVMe) storage. This module uses Linux native asynchronous I/O (libaio). "aio": { "block_size": 1048576, "queue_depth": 8, "thread_count": 1, "single_submit": false, "overlap_events": true } block_size: [integer] Description Default I/O block size in bytes. 1048576 queue_depth: [integer] Description Default I/O queue depth. 8 thread_count: [integer] Description Default Intra-request parallelism for each read/write submitted by a user thread. 1 single_submit: [boolean] Description Default Submit requests to storage device as multiple individual requests as opposed to one block of requests. false overlap_events: [boolean] Description Default Submit requests to storage device in an overlapped fashion without waiting for completion of earlier requests. true ignore_unused_parameters: [boolean] Description Default Unused parameters in modules may be unexpected in static networks, but could be normal in dynamic networks. This controls whether or not training should terminate with an error message when unused parameters are detected. This is set to True by default, which means unused parameters are ignored and training continues. Now is just used in stage 2. True Logging steps_per_print: [integer] Description Default Print progress report every N training steps. The report includes the number of training steps, number of skipped optimizer updates (likely due to overflows in mixed-precision training), current learning rate, and current momentum. 10 wall_clock_breakdown: [boolean] Description Default Enable timing of the latency of forward/backward/update training phases false dump_state: [boolean] Description Default Print out state information of DeepSpeed object after initialization false Autotuning { "autotuning": { "enabled": false, "results_dir": "autotuning_results", "exps_dir": "autotuning_exps", "overwrite": false, "metric": "throughput", "start_profile_step": 3, "end_profile_step": 5, "fast": true, "max_train_batch_size": null, "mp_size": 1, "num_tuning_micro_batch_sizes": 3, "tuner_type": "model_based", "tuner_early_stopping": 5, "tuner_num_trials": 50, "arg_mappings": null } } enabled: [boolean] Description Default Enables the autotuner. false results_dir: [string] Description Default Path to the autotuning experiment results directory. The default appears in the working directory from which Deepspeed was launched. “autotuning_results” exps_dir: [string] Description Default Path to the auotuning experiment descriptions directory. The default appears in the working directory from which Deepspeed was launched. “autotuning_exps” overwrite: [boolean] Description Default Whether to run autotuning experiments whose results already exist. Setting it to true would overwrite the existing result. false metric: [string] Description Default The performance metric to use for ranking autotuning experiments. latency, throughput, and FLOPS are currently supported, referring to training step latency, training samples per second, and floating-point operations per second achieved per GPU respectively. throughput start_profile_step: [integer] Description Default The global training step at which to start profiling in an autotuning experiment. Note that warm-up is needed for accurate performance measurement. 3 end_profile_step: [integer] Description Default The global training step at which to end profiling in an autotuning experiment. Must not be less than start_profile_step. 5 fast: [boolean] Description Default Enables fast-model autotuning where only Zero stages and micro-batch sizes per GPU are tuned. true max_train_batch_size: [int] Description Default The maximum train batch size (global effective batch size) for the model training. null mp_size: [int] Description Default Model parallelism degree. 1 num_tuning_micro_batch_sizes: [integer] Description Default The number of micro-batch sizes to explore. 3 tuner_type: [string] Description Default The algorithm defines the order of autotuning space exploration within a ZeRO stage. model_based tuner_early_stopping: [integer] Description Default The number of experiments to run beyond the current best experiment. If no better experiment is found within that number, the Autotuner stops the exploration. 5 tuner_num_trials: [integer] Description Default The maximum number of experiments to explore in the tuning space within a ZeRO stage. 50 Flops Profiler { "flops_profiler": { "enabled": false, "profile_step": 1, "module_depth": -1, "top_modules": 1, "detailed": true, "output_file": null, } } enabled: [boolean] Description Default Enables the flops profiler. This would also enables wall_clock_breakdown false profile_step: [integer] Description Default The global training step at which to profile. Note that warm up steps are needed for accurate time measurement. 1 module_depth: [integer] Description Default The depth of the model at which to print the aggregated module information. When set to -1, it prints information from the top module to the innermost modules (the maximum depth). -1 top_modules: [integer] Description Default Limits the aggregated profile output to the number of top modules specified. 1 detailed: [boolean] Description Default Whether to print the detailed model profile. true output_file: [string] Description Default Path to the output file. If None, the profiler prints to stdout.. null Activation Checkpointing "activation_checkpointing": { "partition_activations": false, "cpu_checkpointing": false, "contiguous_memory_optimization": false, "number_checkpoints": null, "synchronize_checkpoint_boundary": false, "profile": false } partition_activations: [boolean] Description Default Enables partition activation when used with model parallelism false cpu_checkpointing: [boolean] Description Default Offloads partitioned activations to CPU if partition_activations is enabled false contiguous_memory_optimization: [boolean] Description Default Copies partitioned activations so that they are contiguous in memory false number_checkpoints: [integer] Description Default Total number of activation checkpoints used to allocate memory buffer for contiguous_memory_optimization None synchronize_checkpoint_boundary: [boolean] Description Default Inserts get_accelerator().synchronize() at each checkpoint boundary. false profile: [boolean] Description Default Logs the forward and backward time for each checkpoint function false Sparse Attention sparse_attention: [dictionary] Fields Value Example mode A string determining sparsity structure type. Deepspeed currently supports "dense", "fixed", "bigbird", "bslongformer", and "variable". "fixed" block An integer determining the block size. Current implementation of sparse self-attention is based on blocked sparse matrices. In which this parameter defines size of such blocks, Block X Block. 16 different_layout_per_head A boolean determining if each head should be assigned a different sparsity layout; this will be satisfied based on availability. false num_local_blocks An integer determining the number of random blocks in each block row; only used in "fixed" mode. 4 num_global_blocks An integer determining how many consecutive blocks in a local window is used as the representative of the window for global attention; used in "fixed" and "bigbird" modes. 1 attention A string determining attention type. Attention can be "unidirectional", such as autoregressive models, in which tokens attend only to tokens appear before them in the context. Considering that, the upper triangular of attention matrix is empty. Or it can be "bidirectional", such as BERT, in which tokens can attend to any other tokens before or after them. Then, the upper triangular part of the attention matrix is mirror of the lower triangular; used in "fixed" and "variable" modes. "bidirectional" horizontal_global_attention A boolean determining if blocks that are global representative of a local window, also attend to all other blocks. This is valid only if attention type is "bidirectional". Looking at the attention matrix, that means global attention not only includes the vertical blocks, but also horizontal blocks; used in "fixed" and "variable" modes. false num_different_global_patterns An integer determining number of different global attentions layouts. While global attention can be fixed by which block/s are representative of any local window, since there are multi-heads, each head can use a different global representative; used only in "fixed" mode. 4 num_random_blocks An integer determining the number of random blocks in each block row; used in "variable" and "bigbird" modes. 0 local_window_blocks A list of integers determining the number of blocks in each local attention window. It assumes first number determines # of blocks in the first local window, second the second window, …, and the last number determines the number of blocks in the remaining local windows; only used in "variable" mode. [4] global_block_indices A list of integers determining which blocks are considered as global attention. Given indices, determine the blocks that all other token blocks attend to and they attend to all other token blocks. Notice that if global_block_end_indices parameter is set, this parameter is used as starting index of each global window; used in "variable" and "bslongformer" modes. [0] global_block_end_indices A list of integers determining end indices of global window blocks. By default this is not used. But if it is set, it must have the same size of global_block_indices parameter, and combining this two parameters, for each index i, blocks from global_block_indices[i] to global_block_end_indices[i], exclusive, are considered as global attention; used in "variable" and "bslongformer" modes. None num_sliding_window_blocks An integer determining the number of blocks in sliding local attention window; used in "bigbird" and "bslongformer" modes. 3 Example of sparse_attention "sparse_attention": { "mode": "fixed", "block": 16, "different_layout_per_head": true, "num_local_blocks": 4, "num_global_blocks": 1, "attention": "bidirectional", "horizontal_global_attention": false, "num_different_global_patterns": 4, "num_random_blocks": 0, "local_window_blocks": [4], "global_block_indices": [0], "global_block_end_indices": None, "num_sliding_window_blocks": 3 } Data Efficiency DeepSpeed Data Efficiency Library includes two techniques: curriculum learning and random layerwise token dropping (random-LTD). Read more about how to use the DeepSpeed Data Efficiency Library in our tutorial. "data_efficiency": { "enabled": true, "seed": 1234, "data_routing": { "enabled": true, "random_ltd":{ "enabled": true, "total_layer_num": 24, "random_ltd_layer_num": 22, "random_ltd_layer_id": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22], "model_mask_name": "attention_mask", "model_type": "decoder", "hidden_state_order": "seq_batch_dim", "random_ltd_schedule": { "min_value": 128, "max_value": 2048, "schedule_type":"fixed_linear", "schedule_config": { "require_steps": 200000, "seq_per_step": 16 } } } }, "data_sampling": { "enabled": true, "num_epochs": 1, "num_workers": 0, "curriculum_learning": { "enabled": true, "data_cluster_path": "/path/to/data_clusters", "curriculum_metrics": { "vocabularyrarity": { "index_to_sample_path": "/path/to/index_to_sample", "index_to_metric_path": "/path/to/index_to_metric", "difficulty_type": "percentile", "clustering_type": "schedule_based", "min_difficulty": 1, "max_difficulty": 100, "schedule_type": "fixed_root", "schedule_config": { "total_curriculum_step": 110000, "difficulty_step": 1, "root_degree": 2 } } } } } } data_efficiency: [dictionary] Fields Value Default enabled: [boolean] Enable data efficiency or not. false seed: [integer] Random seed for data sampling. 1234 data_routing: [dictionary] Configs for data routing techniques. N/A data_sampling: [dictionary] Configs for data sampling techniques. N/A data_routing: [dictionary] Fields Value Default enabled: [boolean] Enable data routing techniques or not. false random_ltd: [dictionary] Configs for random-LTD technique. N/A data_sampling: [dictionary] Fields Value Default enabled: [boolean] Enable data sampling techniques or not. false num_epochs: [integer] At most how many epoches of the original dataset will be iterated. 1000 num_workers: [integer] Data loader number of workers. 0 curriculum_learning: [dictionary] Configs for curriculum learing technique. N/A random_ltd: [dictionary] Fields Value Default enabled: [boolean] Enable random-LTD technique or not. false total_layer_num: [integer] The number of layer (or the depth) for the pretraining/fine-tuning model. N/A random_ltd_layer_num: [integer] The number of layers that will be applied with random-LTD. N/A random_ltd_layer_id: [list] The exact layer_id that will be applied with random-LTD. The length of this list must be the same as random_ltd_layer_num. N/A model_mask_name: [str] The variable name of the attention_mask. Different libraries have different names, such as att_mask. For huggingface model, it’s named “attention_mask”. Users need to check the forward function in the original model files. If the attention mask input in the original model’s forward function is not a keyword/named argument (e.g., attention_mask=None), user would need to change it to a keyword/named argument and provide that keyword as model_mask_name. N/A model_type: [str] Users need to identify whether the model is decoder or encoder. Currently we only support these two. N/A hidden_state_order: [str] Users need to know the input order of the hidden state tensor. Normally, it’s batch, sequence and then the hidden dimension, which is batch_seq_dim. Somethings, the order between batch and sequence will be switch like seq_batch_dim. Currently, we support these two. N/A random_ltd_schedule: [dictionary] The schedule of the effective sequence length after token dropping. It’s a linear function where random-LTD gradually drops less tokens and increases effective sequence length. N/A min_value: [integer] The initial effective sequence length (after token dropping) at step/iteration 0. N/A max_value: [integer] The max effective sequence length (usually the case without any token dropping). Usually this is set as baseline’s seqlen. N/A schedule_type: [str] The sequence length follows a linear increasing function starting from min_value and reaching max_value. We currently only support this type. N/A schedule_config: [dictionary] Configs for the linear increasing function. N/A require_steps: [integer] How many iterations will be needed to reach max_value from min_value. N/A seq_per_step: [integer] At any time, the effective sequence length be multiple of this seq_per_step. Set this to multiple of 8 (for FP16 data) or 16 (for INT8 data) to enable NVIDIA Tensor Core acceleration. N/A curriculum_learning: [dictionary] Fields Value Default enabled: [boolean] Enable curriculum learing technique or not. false data_cluster_path: [str] Path to directory where curriculum learning will store the indexes of data samples within the same difficulty ranges. N/A curriculum_metrics: [dictionary] This dictionary includes all desired curriculum metrics and their configs. Each metric will be a separate sub-dictionary, where the key is the metric name and the values are configs below. N/A index_to_sample_path: [str] Path to the index_to_sample file generated during offline data analysis. Note that data analysis will generate two kinds of index_to_sample files: The metric_name_index_to_sample_percentile_merged file is a concatenated index for perf improvement, but it only works when you set difficulty_type=percentile. If you use difficulty_type=value, you need to change this to use the metric_name_index_to_sample file. N/A index_to_metric_path: [str] Path to the index_to_metric_path file generated during offline data analysis. N/A difficulty_type: [str] During training, how to increase the max accepted difficulty. Currently support value (increase by absolute value) and percentile (increase by difficulty percentile). N/A clustering_type: [str] Currently support schedule_based (cluster data based on the difficulty schedule (pacing function) below) and single_cluster (no clustering required and probably CL is achieved by data postprocessing, such as sequence length truncation). N/A min_difficulty: [integer] Starting difficulty at first step. When difficulty_type=value the min_difficulty is an absolute difficulty value. When difficulty_type=percentile the min_difficulty is a difficulty percentile value. N/A max_difficulty: [integer] Final max difficulty. When difficulty_type=value the max_difficulty is an absolute difficulty value. When difficulty_type=percentile the max_difficulty is a difficulty percentile value. N/A schedule_type: [str] The difficulty schedule (pacing function) that defines how the max accepted difficulty increases from min_difficulty to max_difficulty during training. Currently support fixed_linear, fixed_root, fixed_discrete, and custom. N/A schedule_config: [dictionary] Configs for the pacing function. When schedule_type=custom this dictionary is not necessary. Instead user needs to provide a callback function (via the set_custom_curriculum_learning_schedule API in deepspeed/runtime/engine.py) which will update the max accepted difficulty during training. Configs below are all belongs to schedule_config. N/A total_curriculum_step: [integer] How many steps the curriculum learning takes to go from min difficulty to max difficulty. Used by fixed_linear and fixed_root schedule. N/A difficulty_step: [integer] The max accepted difficulty level determined every step must be a multiple of this difficulty_step. This is used to ensure the use of NVIDIA Tensor Core acceleration (requires multiple of 8 (FP16) or 16 (INT8)). Used by fixed_linear and fixed_root schedule. N/A root_degree: [integer] The degree of the root function. Degree of 2 means square root and degree of 3 means cube root. Degree of 1 is equivalent to linear. Used by fixed_root schedule. N/A difficulty: [list] List of max accepted difficulty levels to be used during schedule. Used by fixed_discrete schedule. N/A max_step: [list] List of which step to change max accepted difficulty level. Used by fixed_discrete schedule. N/A Curriculum Learning Note: On 12/12/2022, we released DeepSpeed Data Efficiency Library which provides a more general curriculum learning support. This legacy curriculum learning feature below is still supported but we recommend to use the Data Efficiency Library. "curriculum_learning": { "enabled": true, "curriculum_type": "seqlen", "min_difficulty": 8, "max_difficulty": 1024, "schedule_type": "fixed_linear", "schedule_config": { "total_curriculum_step": 40000, "difficulty_step": 8 } } enabled: [boolean] Description Default Set to true to enable curriculum learning false curriculum_type: [string] Description Default Type of curriculum difficulty metric. Currently support seqlen. N/A min_difficulty: [integer] Description Default The starting difficulty level N/A max_difficulty: [integer] Description Default The ending difficulty level N/A schedule_type: [string] Description Default Type of curriculum schedule. Currently support fixed_linear, fixed_root, and fixed_discrete. N/A total_curriculum_step: [integer] Description Default Total number of steps for the curriculum learning. One of the schedule_config when the fixed_linear and fixed_root schedule_type are used. N/A difficulty_step: [integer] Description Default At any time, the curriculum learning difficulty must be multiple of this difficulty_step. Set this to multiple of 8 (for FP16 data) or 16 (for INT8 data) to enable NVIDIA Tensor Core acceleration. One of the schedule_config when the fixed_linear and fixed_root schedule_type are used. N/A root_degree: [integer] Description Default Root degree of the curriculum schedule function. One of the schedule_config when the fixed_root schedule_type is used. N/A difficulty: [list of integer] Description Default List of difficulty levels to be used during schedule. One of the schedule_config when the fixed_discrete schedule_type is used. N/A max_step: [list of integer] Description Default List of which step to change difficulty level. One of the schedule_config when the fixed_discrete schedule_type is used. N/A Monitoring Module Note: Deepspeed logs to TensorBoard through PyTorch. Logging to TensorBoard requires that the tensorboard package is installed (read more in the PyTorch documentation). Note: Logging to WandB requires that the wandb package is installed (read more in the WandB documentation). Note: Logging to Comet requires that the comet_ml package is installed (read more in the Comet documentation). Deepspeed’s Monitor module can log training details into a Tensorboard-compatible file, to WandB, to Comet or to simple CSV files. Below is an overview of what DeepSpeed will log automatically. Field Description Conditions Train/Samples/train_loss The training loss. None Train/Samples/lr The learning rate during training. None Train/Samples/loss_scale The loss scale when training using fp16. fp16 must be enabled. Train/Eigenvalues/ModelBlockParam_{i} Eigen values per param block. eigenvalue must be enabled. Train/Samples/elapsed_time_ms_forward The global duration of the forward pass. flops_profiler.enabled or wall_clock_breakdown. Train/Samples/elapsed_time_ms_backward The global duration of the forward pass. flops_profiler.enabled or wall_clock_breakdown. Train/Samples/elapsed_time_ms_backward_inner The backward time that does not include the gradient reduction time. Only in cases where the gradient reduction is not overlapped, if it is overlapped then the inner time should be about the same as the entire backward time. flops_profiler.enabled or wall_clock_breakdown. Train/Samples/elapsed_time_ms_backward_allreduce The global duration of the allreduce operation. flops_profiler.enabled or wall_clock_breakdown. Train/Samples/elapsed_time_ms_step The optimizer step time flops_profiler.enabled or wall_clock_breakdown. tensorboard: [dictionary] Fields Value Default enabled Whether logging to Tensorboard is enabled. false output_path Path to where the Tensorboard logs will be written. If None, the output path is set under the training script’s launching path. null job_name Name for the current job. This will become a new directory inside output_path. "DeepSpeedJobName" Example of tensorboard configuration: "tensorboard": { "enabled": true, "output_path": "output/ds_logs/", "job_name": "train_bert" } wandb: [dictionary] Fields Value Default enabled Whether logging to WandB is enabled. false group Name for the WandB group. This can be used to group together runs. None team Name for the WandB team. None project Name for the WandB project. deepspeed Example of wandb configuration: "wandb": { "enabled": true, "group": "my_group", "team": "my_team", "project": "my_project" } comet: [dictionary] Fields Value Default enabled Whether logging to Comet is enabled. false workspace Comet workspace name. None project Comet project name. None samples_log_interval Metrics will be submitted to Comet after processing every samples_log_intervas samples. 100 experiment_name The name for comet experiment to be used for logging. None api_key Comet API key. It’s not recommended to save the Comet API Key in code. None experiment_key The key for comet experiment to be used for logging. Must be an alphanumeric string whose length is between 32 and 50 characters. None online If True, the data will be logged to Comet server, otherwise it will be stored locally in offline experiment. Default is True. None mode Control how the Comet experiment is started. “get”: Continue logging to an existing experiment identified by the experiment_key value. “create”: Always creates of a new experiment, useful for HPO sweeps. “get_or_create” (default): Starts a fresh experiment if required, or persists logging to an existing one. None Example of comet configuration: "comet": { "enabled": true, "workspace": "my_workspace", "project": "my_project", "samples_log_interval": 50, "experiment_name": "llama-fine-tuning", "experiment_key": "0c4a1c4a90664f2a8084e600b19a9d7", "online": false, "mode": "get", } csv_monitor: [dictionary] Fields Value Default enabled Whether logging to local CSV files is enabled. false output_path Path to where the csv files will be written. If None, the output path is set under the training script’s launching path. null job_name Name for the current job. This will become a new directory inside output_path "DeepSpeedJobName" Example of csv_monitor configuration: "csv_monitor": { "enabled": true, "output_path": "output/ds_logs/", "job_name": "train_bert" } Elastic Training Config (V0.1 and V0.2) "elasticity": { "enabled": true, "max_train_batch_size": "seqlen", "micro_batch_sizes": 8, "min_gpus": 1024, "max_gpus": "fixed_linear", "min_time": "seqlen", "version": 8, "ignore_non_elastic_batch_info": 1024, "num_gpus_per_node": "fixed_linear", "model_parallel_size": MODEL_PARALLEL_SIZE } Field Description Default enabled Enables computation of global batch size in elastic training. false max_train_batch_size Max acceptable batch size can be used in training. 2000 micro_batch_sizes Acceptable micro batch sizes, same as train_micro_batch_size_per_gpu [2,4,6] min_gpus Min number of GPUs to search over when computing highly composite batch size in v0.1 and v0.2. 1 max_gpus Max number of GPUs to search over when computing highly composite batch size in v0.1 and v0.2. 10000 min_time Minimum running time (minutes) before the scheduler will scale again (only used in v0.1). 0 implies it’s unknown 0 prefer_large_batch When finding a suitable batch size, attempt to find one that is closest to the max train batch size given. true version Version of elastic logic to use. 0.2 ignore_non_elastic_batch_info Ignore all batch info provided outside the elastic config. To reduce confusion, we require all batch related info to be given in elastic config only. false num_gpus_per_node Number of GPUs per node. This information is used by v0.2 to support model-parallel training (only used by v0.2) 1 model_parallel_size Tensor or model parallel size (only used by v0.2) 1 Communication Logging DeepSpeed provides a flexible communication logging tool which can automatically detect and record communication operations launched via deepspeed.comm. NOTE: All logging communication calls are synchronized in order to provide accurate timing information. This may hamper performance if your model heavily uses asynchronous communication operations. Once the logs are populated, they can be summarized with deepspeed.comm.log_summary(). For more detail and example usage, see the tutorial comms_logger: [dictionary] Fields Value Default enabled Whether communication logging is enabled. false verbose Whether to immediately print every communication operation false prof_all Whether to profile all operations. true debug Appends the caller function to each communication operation’s log_name. false prof_ops A list of communication operations to log (only the specified ops will be profiled). [] Example of recommended comms_logger configuration: "comms_logger": { "enabled": true, "verbose": false, "prof_all": true, "debug": false } Example of comms_logger configuration for logging specific operations only: "comms_logger": { "enabled": true, "verbose": false, "prof_all": false, "debug": false, "prof_ops": ["all_reduce", "all_gather"] } Compression Note: Compression has seven different components, including layer reduction, weight quantization, activation quantization, sparse pruning, row pruning, head pruning, and channel pruning. We explain them one by one with simple json examples. Read more about how to use the DeepSpeed Compression library in our tutorial. Layer Reduction Note: Layer reduction works much better when using knowledage distillation (learn more in our tutorial): "compression_training": { "layer_reduction": { "enabled": true, "keep_number_layer": 5, "module_name_prefix": "bert.encoder.layer", "teacher_layer": [ 2, 4, 6, 8, 10 ], "other_module_name": [ "bert.pooler", "bert.embeddings", "classifier" ] } } layer_reduction: [dictionary] Fields Value Default enabled: [boolean] Enable layer reduction or not. false keep_number_layer: [list] The number of layer in the model to be kept. N/A module_name_prefix: [str] The (uniform) name prefix of the model’s modules of which the associated weight parameters are to be reinitialized. N/A teacher_layer: [list] The layer of the weight parameters are to be reinitialized. The length of the list equals to ‘keep_number_layer’. N/A other_module_name: [list] The name of modules of which the associated weight parameters are to be reinitialized. It is an complemenatory or alternative of module_name_prefix. For instance, “other_module_name”: [“bert.encoder.layer.2”,”bert.encoder.layer.4”] equals to “module_name_prefix”:”bert.encoder.layer” and “teacher_layer”: [2,4]. N/A Weight Quantization "compression_training": { "weight_quantization": { "shared_parameters":{ "enabled": true, "quantizer_kernel": false, "schedule_offset": 0, "quantize_groups": 1, "quantize_verbose": false, "quantization_type": "symmetric", "rounding": "nearest", "quantize_weight_in_forward": false, "fp16_mixed_quantize":{ "enabled": false, "quantize_change_ratio": 0.001 } }, "different_groups":{ "wq1": { "params": { "start_bits": 8, "target_bits": 8, "quantization_period": 50 }, "modules": [ "attention.self", "intermediate" ] }, "wq2": { "params": { "start_bits": 4, "target_bits": 4, "quantization_period": 50 }, "modules": [ "attention.output" ] } } } } shared_parameters: [dictionary] Shared parameters for all weight quantization groups. Fields Value Default enabled: [boolean] Enable weight quantization or not. false quantizer_kernel: [boolean] Use DeepSpeed quantization kernel for >=4 bit quantization. This can only be enabled when using DeepSpeed FP16 optimizer. false schedule_offset: [integer] Enable weight quantization after scheduled steps (can be treated as warmup steps). 0 quantize_groups: [integer] Split the weight matrix into different number of groups, and each of them has its own scaling factor. 1 quantize_verbose: [boolean] Print the quantization related logs. false quantization_type: [string] Choose the quantization algorithm, symmetric or asymmetric. "symmetric" rounding: [string] Rounding algorithm associated with quantization, nearest or stochastic. "nearest" quantize_weight_in_forward: [boolean] Quantize weight in optimizer or forward step, must set to be true for FP32 optimizer training. false fp16_mixed_quantize: [dictionary] Using the value mixed by FP16 value and the quantized value. N/A enabled: [boolean] Whether fp16 mixed quantization is enabled. false quantize_change_ratio: [float] Initial quantize value ratio, will gradually increase to 1. 0.001 different_groups: [dictionary] Different quantization sets, this is used for different quantization parameters. In this example, we give two different sets. In practice, you can choose the number of sets based on your requirements. Fields Value Default params: [dictionary] start_bits: [integer] Quantization starting bits, will gradaully reduce to target bits. 8 target_bits: [integer] Quantization target bits, need to be <= start_bits. 8 quantization_period: [integer] For every n steps, the quantization bits will be reduce by 1. 1 modules: [list] Scope of weight parameters associated to the params setting. "All Linear and CONV2D layers" Activation Quantization "compression_training": { "activation_quantization": { "shared_parameters":{ "enabled": true, "quantization_type": "asymmetric", "range_calibration": "dynamic", "schedule_offset": 50 }, "different_groups":{ "aq1": { "params": { "bits": 8 }, "modules": [ "attention.output" ] } } } shared_parameters: [dictionary] Shared parameters for all activation quantization groups. Fields Value Default enabled: [boolean] Enable activation quantization or not. false quantization_type: [string] Choose the quantization algorithm, symmetric or asymmetric. "symmetric" range_calibration: [string] Using dynamic (per token or per image) or static (fixed min/max using momentum) for inference. "static" schedule_offset: [integer] Enable activation quantization after scheduled steps (can be treated as warmup steps). 0 different_groups: [dictionary] Different quantization sets, this is used for different quantization parameters. In this example, we give one set. In practice, you can choose the number of sets based on your requirements. Fields Value Default params: [dictionary] bits: [integer] Number of bits used for activation target bits, need to be >= 4. 8 modules: [list] Scope of weight parameters associated to the params setting. "All Linear and CONV2D layers" Sparse Pruning "compression_training": { "sparse_pruning":{ "shared_parameters":{ "enabled": true, "schedule_offset": 30, "method": "l1" }, "different_groups":{ "sp1": { "params": { "dense_ratio": 0.5 }, "modules": [ "attention.self" ] } } } } "compression_training": { "sparse_pruning":{ "shared_parameters":{ "enabled": true, "schedule_offset": 30, "schedule_offset_end": 90, "schedule_offset_stride": 15, "method": "snip_momentum", "block_pattern": "4x1", "dense_ratio": 0.4, "excluded_modules": ['classifier', 'pooler'] }, "different_groups":{ } } } shared_parameters: [dictionary] Shared parameters for all sparse pruning groups. Fields Value Default enabled: [boolean] Enable sparse pruning or not. false schedule_offset: [integer] Enable sparse pruning after scheduled steps (can be treated as warmup steps). 0 schedule_offset_end: [integer] Disable sparse pruning after scheduled steps, mandotory for snip_momentum. 0 schedule_offset_stride: [integer] The stride of pruning on training steps, mandotory for snip_momentum. "1" method: [string] Choose different pruning methods, l1 (static, magnitude based), topk (dynamic, learnable) or snip_momentum (structured pruning). "l1" block_pattern: [string] Choose different structured pruning block patterns, NxM or N:M (N and M are integers). For instance, “4x1” or “2:4” are common block patterns, mandotory for snip_momentum. "4x1" dense_ratio: [float] Used to get the targeted global sparsity ratio, mandotory for snip_momentum. "0.1" excluded_modules: [list] Excluded pruning scope on some special modules like output layer. [] different_groups: [dictionary] Different pruning sets, this is used for different pruning parameters. In this example, we give one set. In practice, you can choose the number of sets based on your requirements. Note for snip_momentum method, you can leave it as empty. Fields Value Default params: [dictionary] dense_ratio: [float] The percentage of weights to keep after pruning. 0.5 modules: [list] Scope of weight parameters associated to the params setting. "All Linear and CONV2D layers" Row Pruning Note: Row Pruning is a feature designed for two back-to-back linear layers (e.g., Feed Forward Network in Transformers). As such, we suggested use row pruning for the first linear layer (i.e., the intermediate.dense layer for BERT). Reducing the row dimension of this matrix can help reducing the column of the follow-up matrix (i.e., layer.\\w+.output.dense layer for BERT). It should also work for other linear layers as well. "compression_training": { "row_pruning":{ "shared_parameters":{ "enabled": true, "schedule_offset": 20, "method": "topk" }, "different_groups":{ "rp1": { "params": { "dense_ratio": 0.5 }, "modules": [ "intermediate.dense" ], "related_modules":[ ["layer.\\w+.output.dense"] ] } } } } shared_parameters: [dictionary] Shared parameters for all row pruning groups. Fields Value Default enabled: [boolean] Enable row pruning or not. false schedule_offset: [integer] Enable row pruning after scheduled steps (can be treated as warmup steps). 0 method: [string] Choose different pruning methods, l1 (static, magnitude based) or topk (dynamic, learnable). "l1" different_groups: [dictionary] Different pruning sets, this is used for different pruning parameters. In this example, we give one set. In practice, you can choose the number of sets based on your requirements. Fields Value Default params: [dictionary] dense_ratio: [float] The percentage of weights to keep after pruning. 0.5 modules: [list] Scope of weight parameters associated to the params setting. "All Linear and CONV2D layers" related_modules: [list[list]] Related module to the row pruned module, which can be performed column pruning. None Head Pruning Note: Head Pruning is a feature designed for two attention layers (e.g., Multi Head Attention in Transformers). For now, it can only be applied to output matrix of the Transformer (i.e., attention.output.dense in BERT). Pruning the output matrix can lead to the pruning of Query/Key/Value matrix as well. "compression_training": { "head_pruning":{ "shared_parameters":{ "enabled": true, "schedule_offset": 10, "method": "topk", "num_heads": 12 }, "different_groups":{ "rp1": { "params": { "dense_ratio": 0.5 }, "modules": [ "attention.output.dense" ], "related_modules":[ ["self.query", "self.key", "self.value"] ] } } } } shared_parameters: [dictionary] Shared parameters for all head pruning groups. Fields Value Default enabled: [boolean] Enable head pruning or not. false schedule_offset: [integer] Enable head pruning after scheduled steps (can be treated as warmup steps). 0 method: [string] Choose different pruning methods. For now, we only support topk (dynamic, learnable). "topk" num_heads: [int] Number of heads (must be provided by user). N/A different_groups: [dictionary] Different pruning sets, this is used for different pruning parameters. In this example, we give one set. In practice, you can choose the number of sets based on your requirements. Fields Value Default params: [dictionary] dense_ratio: [float] The percentage of weights to keep after pruning. 0.5 modules: [list] Scope of weight parameters associated to the params setting. "All Linear and CONV2D layers" related_modules: [list[list]] Related module (Usually Q/K/V) to the head pruned module (i.e., the output matrix). For now, this feature only works for BERT. None Channel Pruning Note: Channel Pruning is a feature designed for two back-to-back CONV2d layers (e.g., residual connection in ResNet). As such, we suggested use channel pruning for the first CONV2d layer. Reducing the number of output channels of this layer can help reducing the number of input channels the follow-up layer. It should also work for other CONV2d layers as well. "compression_training": { "channel_pruning":{ "shared_parameters":{ "enabled": true, "schedule_offset": 0, "method": "topk" }, "different_groups":{ "cp1": { "params": { "dense_ratio": 0.5 }, "modules": [ "layer....conv1" ], "related_modules": [ ["layer....conv2", "layer....bn1"] ] } } } } shared_parameters: [dictionary] Shared parameters for all channel pruning groups. Fields Value Default enabled: [boolean] Enable channel pruning or not. false schedule_offset: [integer] Enable channel pruning after scheduled steps (can be treated as warmup steps). 0 method: [string] Choose different pruning methods, l1 (static, magnitude based) or topk (dynamic, learnable). "l1" different_groups: [dictionary] Different pruning sets, this is used for different pruning parameters. In this example, we give one set. In practice, you can choose the number of sets based on your requirements. Fields Value Default params: [dictionary] dense_ratio: [float] The percentage of weights to keep after pruning. 0.5 modules: [list] Scope of weight parameters associated to the params setting. "All CONV2D layers" related_modules: [list[list]] Related module to the channel pruned module. None Checkpoint options "checkpoint": { "tag_validation"="Warn", "load_universal"=false, "use_node_local_storage"=false, "parallel_write":{ "pipeline_stage": false } } tag_validation: [“Ignore” “Warn” “Fail”] Description Default Enables level of checking to ensure checkpoint tags are consistent across all ranks. Useful when restoring with different world sizes. “Warn” load_universal: [boolean] Description Default Load the latest checkpoint for all. false use_node_local_storage: [boolean] Description Default If true DeepSpeed will store model parameter states and checkpoint states based on local rank allowing checkpoints to be loaded without access to a shared filesystem. false pipeline_stage: [boolean] Description Default Use pipeline stages to parallelize the writing of checkpoints. false Data Type options "data_types": { "grad_accum_dtype"=["fp32"|"fp16"|"bf16"] } } grad_accum_dtype: [“fp32” “fp16” “bf16”] Description Default Specifies the data type in which to do gradient accumulation. If None the default is to match the model type. None
```
32
```
**Pattern 8:** Monitor Contents Overview Usage Automatic Monitoring Custom Monitoring In this tutorial, we introduce the DeepSpeed Monitor and provide examples of its usage. Overview Usage Overview Monitoring model and system metrics during training is vital to ensure hardware resources are fully utilized. The DeepSpeed Monitor enables live logging of metrics through one or more monitoring backends such as PyTorch’s TensorBoard, WandB, Comet and simple CSV files. Below is a live monitoring view for TensorBoard: Below is a live monitoring view for WandB: Below is a live monitoring view for Comet: Usage The DeepSpeed Monitor is configured within the deepspeed configuration file. DeepSpeed will automatically monitor key training metrics, including those tracked with the wall_clock_breakdown configuration option. In addition, users can log their own custom events and metrics. Automatic Monitoring Custom Monitoring Automatic Monitoring When using DeepSpeed for model training, the Monitor can be configured in the DeepSpeed configuration file. No explicit API calls are needed to use the Monitor. The Monitor can be enabled by adding the following field to DeepSpeed’s configuration json file. Refer to Monitoring for details. { "tensorboard": { "enabled": true, "output_path": "output/ds_logs/", "job_name": "train_bert" } "wandb": { "enabled": true, "team": "my_team", "group": "my_group", "project": "my_project" } "comet": { "enabled": true, "project": "my_project", "experiment_name": "my_experiment" } "csv_monitor": { "enabled": true, "output_path": "output/ds_logs/", "job_name": "train_bert" } } DeepSpeed will automatically log to all available and enabled monitoring backends listed in the config, and will generate live monitoring views such as those listed above. Custom Monitoring In addition to automatic monitoring, users can log their own custom metrics in client scripts. Currently, there are two ways to initialize Monitor objects: (Recommended) - Create a MonitorMaster(ds_config.monitor_config) object, which automatically initializes all monitor backends present in the DeepSpeed configuration Create a specific TensorBoardMonitor(ds_config.monitor_config), WandbMonitor(ds_config.monitor_config), csvMonitor(ds_config.monitor_config) object which will only initialize a specific monitor backend present in the DeepSpeed configuration The steps to create a custom monitor are as follows: Add import to your desired Monitor Initialize monitor with DeepSpeed config’s monitor_config Create a list of one or more 3-tuples in the format [("label", value, ds_engine.global_samples), ...]* Call monitor.write_events on the list from step 3 * Note - Some Monitor backends don’t support mixed sample values. Be sure to use your DeepSpeed engine object’s global_samples attribute in each 3-tuple For example usage, see the following modified DeepSpeedExamples/cifar example: # Step 1: Import monitor (and DeepSpeed config, if needed) from deepspeed.monitor.monitor import MonitorMaster from deepspeed.runtime.config import DeepSpeedConfig # Step 2: Initialized monitor with DeepSpeed config (get DeepSpeed config object, if needed) ds_config = DeepSpeedConfig("ds_config.json") monitor = MonitorMaster(ds_config.monitor_config) for epoch in range(2): running_loss = 0.0 for i, data in enumerate(trainloader): pre = time.time() inputs, labels = data[0].to(model_engine.local_rank), data[1].to( model_engine.local_rank) if fp16: inputs = inputs.half() outputs = model_engine(inputs) loss = criterion(outputs, labels) model_engine.backward(loss) model_engine.step() post = time.time() # Step 3: Create list of 3-tuple records (single entry in this case) events = [("Time per step", post-pre, model_engine.global_samples)] # Step 4: Call monitor.write_events on the list from step 3 monitor.write_events(events) Updated: November 5, 2025 Previous Next
```
wall_clock_breakdown
```
### Example Code Patterns
**Example 1** (python):
```python
### Create aio_handle
from deepspeed.ops.op_builder import AsyncIOBuilder
aio_handle = AsyncIOBuilder().load().aio_handle()
```
## Reference Files
This skill includes comprehensive documentation in `references/`:
- **08.md** - 08 documentation
- **09.md** - 09 documentation
- **2020.md** - 2020 documentation
- **2023.md** - 2023 documentation
- **assets.md** - Assets documentation
- **mii.md** - Mii documentation
- **other.md** - Other documentation
- **tutorials.md** - Tutorials documentation
Use `view` to read specific reference files when detailed information is needed.
## Working with This Skill
### For Beginners
Start with the getting_started or tutorials reference files for foundational concepts.
### For Specific Features
Use the appropriate category reference file (api, guides, etc.) for detailed information.
### For Code Examples
The quick reference section above contains common patterns extracted from the official docs.
## Resources
### references/
Organized documentation extracted from official sources. These files contain:
- Detailed explanations
- Code examples with language annotations
- Links to original documentation
- Table of contents for quick navigation
### scripts/
Add helper scripts here for common automation tasks.
### assets/
Add templates, boilerplate, or example projects here.
## Notes
- This skill was automatically generated from official documentation
- Reference files preserve the structure and examples from source docs
- Code examples include language detection for better syntax highlighting
- Quick reference patterns are extracted from common usage examples in the docs
## Updating
To refresh this skill with updated documentation:
1. Re-run the scraper with the same configuration
2. The skill will be rebuilt with the latest information
---
## Skill Companion Files
> Additional files collected from the skill directory layout.
### references/08.md
```markdown
# Deepspeed - 08
**Pages:** 1
---
## DeepSpeed powers 8x larger MoE model training with high performance
**URL:** https://www.deepspeed.ai/2021/08/17/deepspeed-moe.html
**Contents:**
- DeepSpeed powers 8x larger MoE model training with high performance
- Contents
Updated: August 17, 2021
---
```
### references/09.md
```markdown
# Deepspeed - 09
**Pages:** 2
---
## DeepSpeed-MoE for NLG: Reducing the training cost of language models by 5 times
**URL:** https://www.deepspeed.ai/2021/12/09/deepspeed-moe-nlg.html
**Contents:**
- DeepSpeed-MoE for NLG: Reducing the training cost of language models by 5 times
- Contents
- MoE based NLG model architecture
- MoE training infrastructure and dataset
- MoE leads to better quality for NLG models
- Same quality with 5x less training cost
- MoE for Inference
- Conclusion and Release
- Acknowledgement
Autoregressive transformer-based natural language generation (referred to as NLG in the rest of the blog) models can offer convincing solutions to a broad range of language tasks from document summarization, headline generation, question and answering to even generating code in a wide variety of programming languages. Due to the general applicability of these models, improving their quality has been of great interest for both academia and industry alike.
The quality of NLG improves with the increase in model size. However, today we are getting close to the limit of what the current generation of hardware can do. The Megatron-Turing NLG 530B model took 3 months to train on over 2K A100 GPUs on the NVIDIA Selene Supercomputer, consuming over 3 million GPU hours. Another 3 to 5 times of increase in model size would be infeasible within a reasonable timeframe. Given the exorbitant compute resources required to train the state-of-art NLG models, a natural question to ask is: “Is it possible to make non-trivial improvement to model quality without increasing the compute cost?” Or equivalently, “Is it possible to produce model with similar quality using 3 to 5 times less resources?”
Recent works like GShard and Switch Transformers have shown that Mixture of Experts (MoE) model structure reduces large model training cost significantly for transformer-based encoder-decoder models. An MoE model contains a set of sparsely gated experts. During training and inference, only a subset of these experts is activated for each input token. Therefore, the model could scale to billions of parameters without a proportional increase in the computation. Despite showing promising results, the effectiveness of MoE for the much more computation intensive NLG family models remains mostly unknown.
Given the tremendous compute and energy requirements for training NLG family of models, we explore the opportunities that MoE presents to reduce their training cost. We show that MoE can be applied to NLG family of models to significantly improve their model quality with the same training cost. Alternatively, it can achieve 5x reduction in training cost to achieve the same model quality of a dense NLG model. For example, by applying MoE we achieved the model quality of a 6.7B parameter dense NLG model at the cost of training a 1.3B parameter dense model, thanks to the sparse structure of MoE.
Assuming the scaling holds, the results have the potential to completely transform the large model training landscape in terms of cost. For example, a trillion-parameter dense model can be potentially trained at the cost of a 200B parameter (like GPT-3) sized dense model, translating to millions of dollars in training cost reduction and energy savings (Brown et al., 2020, Language models are few-shot learners).
To create an MoE based NLG model we studied the GPT like transformer-based NLG model. To complete training in a reasonable timeframe, the following models are selected: 350M (24 layers, 1024 hidden size, 16 attention heads), 1.3B (24 layers, 2048 hidden size, 16 attention heads), and 6.7B (32 layers, 4096 hidden size, 32 attention heads). We use “350M+MoE-128” to denote a MoE model that uses 350M dense model as the base model and adds 128 experts on every other feedforward layer. That is to say, there are in total 12 MoE layers for both 350M+MoE-128 and 1.3B+MoE-128.
We use a gating function to activate a subset of experts in the MoE layer for each token. Specifically, in our experiments, only the top-1 expert is selected. Therefore, during both training and inference, our MoE model will have the same number of parameters to be activated for each token as their dense part. For example, our 1.3B+MoE-128 will only activate 1.3B parameter per token, and the amount of training computation per token will be similar to a 1.3B dense model.
We pre-trained both the dense and MoE version of the above models using DeepSpeed on 128 A100 GPUs. DeepSpeed uses a combination of data parallel and expert parallel training to effectively scale the MoE model training.
We used the same training data as described in the MT-NLG blog. For a fair comparison, we use 300B tokens to train both the dense model and the MoE model.
Figure 1 shows that the validation loss for the MoE versions of the model is significantly better than their dense counter parts. Furthermore, notice that the validation loss of the MoE model, 350M+MoE-128, is on par with the validation loss of the 1.3B dense model with 4x larger base. This is also true for 1.3B+MoE-128 in comparison with 6.7B dense model with 5x larger base. Furthermore, the model quality is on par not only for the validation loss but also for a wide variety of 6 ZeRO-shot evaluation tasks as shown in Table 1, demonstrating that these models in fact have very similar model quality.
Figure 1: Token-wise validation loss curves for dense and MoE NLG models with different model sizes.
Table 1: ZeRO-shot evaluation results (last six columns) for different dense and MoE NLG models. All ZeRO-shot evaluation results use the accuracy metric.
As we saw from the results above, adding MoE with 128 experts to the NLG model significantly improves the quality of the NLG model. However, these experts do not change the compute requirements of the model as each token is only processed by a single expert. Therefore, the compute requirements for dense model and its corresponding MoE models with the same base are similar.
More concretely, a 1.3B+MoE-128 model training requires roughly the same amount of compute operations as 1.3B dense, while offering much better model quality. Furthermore, our results show that by applying MoE we can achieve the model quality of a 6.7B parameter dense model at the training cost of 1.3B parameter dense model, resulting in an effective training compute reduction of 5x.
This compute cost reduction can directly be translated into throughput gain, training time and training cost reduction by leveraging the efficient DeepSpeed MoE training system. Table 2 shows the training throughput of the 1.3B+MoE-128 model in comparison to the 6.7B dense model on 128 NVIDIA A100 GPUs.
Table 2: Training throughput (on 128 A100 GPUs) comparing MoE based model vs dense model that can both achieve the same model quality.
The training cost reduction of MoE is not free and comes at the expense of increasing the total number of parameters required to achieve the same model quality compared to dense models. The 1.3B+MoE-128 have roughly 8x the number of parameters (52B) compared to the 6.7B dense model. So, does this mean inference will be 8x slower than the dense model, since inference is generally limited by the time taken to read all the model parameters, especially for small batch sizes?
Not quite. Note that in the 1.3B+MoE-128 model, each token is processed by a unique expert per MoE layer, and the total number of parameters used in processing the token is just 1.3B. This can in theory result in even faster inference than the quality-equivalent dense 6.7B model because of 5x less compute and parameter read. In reality though, the number of tokens in a batch during inference is generally larger than 1. Inferencing, a long sequence length or a non-unit batch size may require loading all the experts, increasing the total number of parameters loaded by 8x compared to the quality-equivalent dense model. Therefore, achieving good inference performance with MoE is still challenging even though the parameters used and the computation incurred per token is small compared to the quality-equivalent dense model.
Nonetheless, we believe that it is possible to use different forms of parallelism to leverage massive memory bandwidth by scaling across a large number of devices to speed up MoE inference, making it comparable or faster than quality-equivalent dense models for extended inference scenarios and creating opportunities to make MoE based models cost efficient for inference in addition to training.
We demonstrate that MoE based models can be applied to NLG task, reducing the training cost by 5x compared to dense, autoregressive transformer-based models like GPT-3 and MT-NLG 530B. Through MoE based low-cost training we hope to make high quality language models accessible to a broad audience, even with limited compute resources.
To this end we are releasing our end-to-end pipeline for training MoE based NLG models, along with specific example scripts and tutorial to help get started with our pipeline. We look forward to the application and the innovations that this may bring to the deep learning community.
This work was done in collaboration with Brandon Norick, Zhun Liu, Xia Song from the Turing Team, and Young Jin Kim, Alex Muzio, Hany Hassan Awadalla from Z-Code Team. We also thank Luis Vargas, Umesh Madan, Gopi Kumar, Andrey Proskurin and Mikhail Parakhin for their continuous support and guidance.
Updated: December 9, 2021
---
## ZeRO-Inference: Democratizing massive model inference
**URL:** https://www.deepspeed.ai/2022/09/09/zero-inference.html
**Contents:**
- ZeRO-Inference: Democratizing massive model inference
- Contents
- Introduction
- How ZeRO-Inference works
- Offload all model weights
- Optimizations
- Alternative approach: Host some model weights in GPU memory
- Model Scaling on 1 GPU
- Token Generation Performance
- Models
The current trends in artificial intelligence (AI) domains such as image, speech, and natural language, demonstrate that model quality can be improved by increasing model size. In natural language processing, for example, the state-of-the-art (SOTA) model has grown from 300 million parameters (Bert-Large) to 500 billion parameters (Megatron-Turing-530B) in less than four years. However, this dramatic growth in model sizes has significantly increased the GPU cost to train, finetune or inference these models, making them unaffordable to most users. To democratize access to AI innovations, large organizations, such as Hugging Face (BigScience), Meta, and Yandex have recently publicly released pre-trained massive models. Unfortunately, even these publicly available models are not broadly usable because many users cannot afford the dozens of GPUs required to fit them for inference computation. For example, half-precision inference computation on Megatron-Turing-530B (SOTA model for natural language) requires at least 40 A100-40GB GPUs, which is unaffordable to many students, model scientists, hobbyists, and small businesses that could benefit from using these powerful models. And so, a real concern is that if the dramatic increase in model sizes continues, then a growing fraction of users could be excluded from the benefits of these AI innovations.
DeepSpeed, a part of Microsoft’s AI at Scale Initiative, has developed the ZeRO-Inference technology to address these obstacles to AI democratization. ZeRO-Inference comes from the family of ZeRO technologies, which are a collection of powerful memory and parallelism optimizations for efficient large scale model training and inference on modern GPU clusters. DeepSpeed had previously developed ZeRO-Infinity, a technology that leverages heterogeneous memory (GPU, CPU, and NVMe) to efficiently scale model training to extreme levels. ZeRO-Inference adapts and optimizes ZeRO-Infinity techniques for model inference on GPUs by hosting the model weights in CPU or NVMe memory, thus hosting no (zero) weights in GPU. This approach is inspired by the observation that the aggregate capacity of CPU and NVMe memories in most commodity computing devices (e.g., laptops, desktops, workstations, etc.) is on the order of terabytes and sufficient to host the largest known models for inference computation. By leveraging this non-GPU memory, ZeRO-Inference enables inference computation of massive models (with hundreds of billions of parameters) on as few as a single GPU, thereby making massive model inference accessible to almost everyone. Moreover, by dramatically reducing GPU memory requirements with CPU or NVMe memory which are significantly cheaper, it significantly reduces the cost of massive model inference, offering an affordable inference path to SOTA models.
The massive computational requirements of large model inference means that accelerators like GPUs are required for efficient execution. Therefore, an important design decision for large model inference on limited GPU budget is how to apportion GPU memory among model weights, inference inputs, and intermediate results.
ZeRO-Inference pins the entire model weights in CPU or NVMe (whichever is sufficient to accommodate the full model) and streams the weights layer-by-layer into the GPU for inference computation. After computing a layer, the outputs are retained in GPU memory as inputs for the next layer, while memory consumed by the layer weights is released for use by the next layer. Thus, model inference time is composed of the time to compute the layers on GPU, and the time to fetch the layers over PCIe. For large model inference, this approach provides scaling and efficiency benefits, as explained below.
ZeRO-Inference offers scaling benefits in two ways. First, by keeping just one (or a few) model layers in GPU memory at any time, ZeRO-Inference significantly reduces the amount of GPU memory required to inference massive models. For current SOTA models which have about a hundred layers (e.g., 96 and 105 layers in GPT3-175B and Megatron-Turing-530B respectively), ZeRO-Inference reduces the GPU memory requirements by up to two orders of magnitude. For example, with ZeRO-Inference, GPU memory consumption of Megaton-Turing-530B for half-precision inference drops from 1TB to 10GB. Second, by fitting the model into CPU or NVMe memory which are orders of magnitude cheaper than GPU memory, ZeRO-Inference makes scaling to future SOTA models (e.g., with trillions or tens-of-trillions of parameters) more affordable compared to approaches that fit the entire model into GPU memory.
ZeRO-Inference delivers efficient computation for throughput-oriented inference applications despite the latency of fetching model weights from CPU or NVMe over PCIe interconnect. The primary reason for this is that by limiting GPU memory usage of the model to one or a few layers of weights, ZeRO-Inference can use the majority of GPU memory to support a large amount of input tokens in the form of long sequences or large batch sizes. A large model layer requires a significant amount of computation, especially when processing inputs with many input tokens. For example, one GPT3-175B layer requires about 7 TFlops to process an input of batch size 1 and sequence length of 2048. Therefore, for inference scenarios with long sequence length and large batch sizes, the computation time dominates the latency of fetching model weights, which ultimately improves efficiency. In summary, ZeRO-Inference’s strategy to utilize GPU memory to support large number of input tokens results in high performance inference for large models.
To further improve system efficiency, ZeRO-Inference leverages two additional optimizations to reduce the latency of fetching layer weights from CPU or NVMe memory into GPU memory.
The first optimization involves overlapping the fetch of a layer with the computation of an earlier layer, a.k.a., layer prefetching. Layer prefetching allows ZeRO-Inference to hide portions of the transfer latency of the prefetched layers. This is especially useful when computation time is not large enough or cannot be sufficiently increased (e.g., with larger batch size) to dominate the latency of fetching layer weights.
The second optimization, which is applicable for inference on multiple GPUs, involves parallelizing the fetch of each layer across multiple GPUs by using each GPU to fetch only a portion of the layer. Employing the aggregate PCIe links of the GPUs in this manner essentially increases the transfer bandwidth linearly, thus reducing the latency. With this approach, fetching layers into GPU memory occurs in two phases. First, each GPU independently fetches a partition of the layer over PCIe into its memory. At this point, only a partition of the layer will be resident on each GPU. Next, each GPU assembles the full layer for computation by fetching the missing layer pieces from other GPUs over the high-bandwidth GPU-GPU interconnect (e.g., NVLink, xGMI, etc.). Since GPU-GPU interconnect bandwidth is typically over an order of magnitude higher than PCIe bandwidth, efficient multi-GPU or multi-node communication primitives, such as NCCL or RCCL all-gather, can be used to efficiently assemble the full layer on all GPUs with negligible latency compared to the PCIe latency.
An alternative approach to ZeRO-Inference is to pin as many of the model weights as possible into GPU memory and fetch the remainder (from CPU or NVMe) when needed for computation. A benefit of this approach is avoidance of the latency of fetching weights that are already pinned in GPU memory. However, this approach has two downsides: (i) the latency savings for hundred-billion parameter models are negligible since only a small fraction of the weights can fit in GPU memory, and (ii) even when a decent portion of the model weights can fit (e.g., > 50% for ~10B models), the remaining GPU memory can only fit small batch sizes which hurts inference throughput. We later show evaluation results to demonstrate that this approach is sub-optimal.
ZeRO-Inference enables significant model scaling for inference on a single GPU compared to a baseline that hosts the model in GPU memory (i.e., HBM). As an example, we consider half-precision model inference using a single NVIDIA Tesla V100 GPU in a NVIDIA DGX2 system. While the V100 GPU has 32GB of memory, the system is equipped with 1.5TB of CPU DRAM and 30TB of NVMe storage. The maximum model size supported for inference computation on GPU depends on the memory in which the model is hosted. Figure 1 below shows the achievable model scales in this system for GPU inference with ZeRO-Inference. In comparison, the baseline cannot support models larger than 16 billion parameters for GPU inference1. In contrast, ZeRO-Inference has the flexibility to host the model in a different memory (DRAM or NVMe) than HBM. This flexibility allows ZeRO-Inference to support much larger models than baseline. For example, by hosting a model on NVMe memory, Zero-Inference can support models with up to 15 trillion parameters for GPU inference, which is almost a thousand times larger compared to baseline. A practical takeaway from Figure 1 is that ZeRO-Inference enables single GPU inference computation of current SOTA models, since they are smaller than 15 trillion parameters.
An important inference workload is token generation based on an input prompt. In this workload the model is provided a text sequence as input prompt, and based on this prompt, the model generates output text of configurable length. We use this workload to demonstrate the performance of ZeRO-Inference. This workload consists of two phases: (1) the prompt processing phase where the model processes the input prompt, and (2) the generation phase where the model generates the output tokens.
ZeRO-Inference is targeted for throughput-oriented inference applications, and so the performance metric that we use for this workload is the number of tokens generated per second in the generation phase. We use the Hugging Face token generation pipeline in our experiments to measure the performance of using a greedy search algorithm to generate ten output tokens given an input prompt of four tokens. The generation pipeline in our experiments uses KV-caching optimization to improve performance by caching generated tokens to avoid re-computation. We consider the performance impact of three aspects of ZeRO-Inference design choices and optimizations: (1) full offloading model weights as opposed to partial offloading, (2) prefetching layer weights ahead of use, and (3) using multiple GPUs to parallelize layer fetching over PCIe. Additionally, we measure the performance impact of varying the number of output tokens.
For our experiments, we use the three publicly available massive language models listed in Table 1. We configure these models for half-precision inference computations. ZeRO-Inference is required to inference these models on a single V100-32GB since they are bigger than GPU memory.
A key design choice in ZeRO-Offload is to offload all the weights of models larger than GPU memory rather than host a subset of the weights in GPU memory. Our intuition for this approach is that for throughput-oriented inference applications, the larger batch sizes enabled by full offload yields better performance than partial offload. In Table 2, we present results for OPT-30B token generation on a single V100-32GB that compare fully offloading the model weights versus hosting a portion (i.e., 10 and 12 billion parameters2) in GPU memory. The results show that full offload delivers the best performance for both CPU memory (43 tokens per second) and NVMe memory (30 tokens per second). With both CPU and NVMe memory, full offload is over 1.3x and 2.4x faster than partial offload of 18 and 20 billion parameters respectively. The performance advantage of full offload comes from the larger batch sizes compared to the partial offload options. Thus when a model does not fit in GPU, using GPU memory to increase batch size rather than to partially fit the model leads to faster token generation.
ZeRO-Inference fetches layers ahead of use, overlapping with current layer computation, to hide layer transfer latency. We measure the impact of prefetching on token generation performance on a single V100-32GB and summarize the results in Table 3. We observe that prefetching did not improve CPU offload. This is because the relatively short sequences in token generation (i.e., less than 50 tokens) resulted in layer computation time that is insufficient to hide a significant portion of layer fetch time from CPU. In contrast, prefetching improves NVMe offloading performance by 1.13x, 1.14x and 1.21x for OPT-30B, OPT-175B, and BLOOM-176B respectively. This is because transferring weights from NVMe through CPU memory allows prefetching to overlap transfers from CPU to GPU memory with transfers from NVMe to CPU boosting the effective transfer bandwidth.
ZeRO-Inference leverages the four PCIe interconnects between GPUs and CPU memory to parallelize layer fetching for faster inference computations on multiple GPUs. In Table 4, we report the throughput improvements for token generation on two and four GPUs compared to a single GPU3 . These results were collected with layer prefetching enabled. The reported throughput numbers are per GPU showing that token generation becomes faster on each GPU as the aggregated PCIe links reduce the layer fetch latencies. The improved per GPU throughput translates to super-linear scaling performance. Additionally, these results suggest improved bandwidths of future PCIe generations could help to improve ZeRO-Inference performance.
We measure the performance impact of the number of output tokens since the memory overhead of KV-caching optimization increases with longer output tokens and could limit batch size. First, we consider the impact of token lengths 10, 20, 50, and 100 on batch size that can fit one V100-32GB GPU. The results in Table 5 show a 2X reduction in batch size for a 5X increase in token count (compared to baseline count of 10).
Next, we measure the impact on generation throughput using four V100-32GB GPUs. The results are presented in Table 6 for CPU offload, and Table 7 for NVMe-Offload. We observe an impact that is consistent across models and offload memory, which is that increasing the number of output tokens reduces throughput proportionally to batch size reduction. These results also demonstrate the importance of large batch sizes to the performance of ZeRO-Inference.
We briefly discuss how users can determine when ZeRO-Inference is suitable for their application and how to enable ZeRO-Inference in DeepSpeed.
ZeRO-Inference is designed for inference applications that require GPU acceleration but lack sufficient GPU memory to host the model. Also, ZeRO-Inference is optimized for inference applications that are throughput-oriented and allow large batch sizes. Alternative techniques, such as Accelerate, DeepSpeed-Inference, and DeepSpeed-MII that fit the entire model into GPU memory, possibly using multiple GPUs, are more suitable for inference applications that are latency sensitive or have small batch sizes.
ZeRO-Inference is available in the DeepSpeed library versions >= 0.6.6. Integrating ZeRO-Inference into token generation pipelines, such as Hugging Face generate, requires updating the DeepSpeed configuration to set ZeRO optimization to stage 3 and parameter offloading to CPU or NVMe.
Below is a configuration snippet for enabling ZeRO-Inference with offloading to CPU memory.
Below is a configuration snippet for offloading to a NVMe device mounted on “/local_nvme”.
Recent advances in AI technology have primarily come from extreme scaling of model sizes. However, extreme model scaling has also made the hardware cost of training and inferencing prohibitive for all but the largest organizations, severely restricting access to AI innovations. To help democratize AI, we developed ZeRO-Inference, a technology that enables inference computations of massive models on as few as a single GPU. ZeRO-Inference reduces the GPU cost of SOTA model inference by hosting the model on CPU or NVMe memory and streaming the model layers into GPU memory for inference computation. ZeRO-Inference complements the democratization efforts of large organizations that publicly release pre-trained SOTA models by ensuring that inference computation of these models is affordable for most users (e.g., students, hobbyists, model scientists, etc.).
The DeepSpeed team would like to acknowledge Stas Bekman for previewing this blog and providing valuable feedback.
16 billion parameters model won’t fit in V100-32GB for half-precision inference since no memory will be left for inputs and intermediate results. ↩
Pinning more parameters in GPU memory resulted in out of memory errors for small batch sizes. ↩
For multiple GPU runs, we select GPUs with independent PCIe interconnects to CPU memory. ↩
Updated: September 9, 2022
**Examples:**
Example 1 (unknown):
```unknown
"zero_optimization": {
"stage": 3,
"offload_param": {
"device": "cpu",
...
},
...
}
```
Example 2 (unknown):
```unknown
"zero_optimization": {
"stage": 3,
"offload_param": {
"device": "nvme",
"nvme_path": "/local_nvme",
...
},
...
}
```
---
```
### references/2020.md
```markdown
# Deepspeed - 2020
**Pages:** 16
---
## DeepSpeed Microsoft Research Webinar is now on-demand
**URL:** https://www.deepspeed.ai/2020/08/06/webinar-on-demand.html
**Contents:**
- DeepSpeed Microsoft Research Webinar is now on-demand
- Contents
Updated: August 6, 2020
---
## An Order-of-Magnitude Larger and Faster Training with ZeRO-2
**URL:** https://www.deepspeed.ai/2020/05/18/zero-stage2.html
**Contents:**
- An Order-of-Magnitude Larger and Faster Training with ZeRO-2
ZeRO-2 expands the scope of memory optimizations in the original ZeRO by tackling the full spectrum of memory consumption during training. More specifically, ZeRO-2 introduces new technology to reduce the memory footprint of gradients, activation memory, and fragmented memory, in addition to optimizer state memory optimization in the original ZeRO. Altogether, the memory savings empower DeepSpeed to improve the scale and speed of deep learning training by an order of magnitude. More concretely, ZeRO-2 allows training models as large as 170 billion parameters up to 10x faster compared to state of the art.
For more information on ZeRO-2, see our blog post.
For more information on how to use ZeRO-2, see an example of training GPT family of models in this tutorial.
For a technical overview, see our technical report.
Updated: May 18, 2020
---
## 10x bigger model training on a single GPU with ZeRO-Offload
**URL:** https://www.deepspeed.ai/2020/09/08/ZeRO-Offload.html
**Contents:**
- 10x bigger model training on a single GPU with ZeRO-Offload
We introduce a new technology called ZeRO-Offload to enable 10X bigger model training on a single GPU. ZeRO-Offload extends ZeRO-2 to leverage both CPU and GPU memory for training large models. Using a machine with a single GPU, our users now can run models of up to 13 billion parameters without running out of memory, 10x bigger than the existing approaches, while obtaining competitive throughput. This feature democratizes multi-billion-parameter model training and opens the window for many deep learning practitioners to explore bigger and better models.
Updated: September 8, 2020
---
## Progressive Layer Dropping
**URL:** https://www.deepspeed.ai/2020/10/28/progressive-layer-dropping-news.html
**Contents:**
- Progressive Layer Dropping
We introduce a new technology called progressive layer dropping (PLD) to speedup the pre-training of Transformer-based networks through efficient and robust compressed training. The pre-training step of Transformer networks often suffer from unbearable overall computational expenses. We analyze the training dynamics and stability of Transformer networks and propose PLD to sparsely update Transformer blocks following a progressive dropping schedule, which smoothly increases the layer dropping rate for each mini-batch as training evolves along both the temporal and the model depth dimension. PLD is able to allow the pre-training to be 2.5X faster to get similar accuracy on downstream tasks and allows the training to be 24% faster when training the same number of samples, not at the cost of excessive hardware resources.
Updated: October 28, 2020
---
## ZeRO-2 & DeepSpeed: Shattering Barriers of Deep Learning Speed & Scale
**URL:** https://www.deepspeed.ai/2020/05/18/press-release.html
**Contents:**
- ZeRO-2 & DeepSpeed: Shattering Barriers of Deep Learning Speed & Scale
- Contents
Updated: May 18, 2020
---
## ZeRO stage 1 with reduced communication
**URL:** https://www.deepspeed.ai/2020/03/17/reduce-scatter.html
**Contents:**
- ZeRO stage 1 with reduced communication
- Contents
- Further updates coming soon!
Updated: March 17, 2020
---
## Powering 10x longer sequences and 6x faster execution through DeepSpeed Sparse Attention
**URL:** https://www.deepspeed.ai/2020/09/08/sparse-attention-news.html
**Contents:**
- Powering 10x longer sequences and 6x faster execution through DeepSpeed Sparse Attention
DeepSpeed offers sparse attention kernels, an instrumental technology to support long sequences of model inputs, whether for text, image, or sound. Compared with the classic dense Transformers, it powers an order-of-magnitude longer input sequence and obtains up to 6x faster execution with comparable accuracy. It also outperforms state-of-the-art sparse implementations with 1.5-3x faster execution. Furthermore, our sparse kernels support efficient execution of flexible sparse format and empower users to innovate on their custom sparse structures.
Updated: September 8, 2020
---
## ZeRO & DeepSpeed: New system optimizations enable training models with over 100 billion parameters
**URL:** https://www.deepspeed.ai/2020/02/13/release.html
**Contents:**
- ZeRO & DeepSpeed: New system optimizations enable training models with over 100 billion parameters
- Contents
Updated: February 13, 2020
---
## Microsoft DeepSpeed achieves the fastest BERT training time
**URL:** https://www.deepspeed.ai/2020/05/27/fastest-bert-training.html
**Contents:**
- Microsoft DeepSpeed achieves the fastest BERT training time
- Contents
- Performance Results for BERT Pretraining
- Performance Results for Fine-Tuning Tasks
- BERT Highly Optimized Transformer Kernels
- (a) Advanced fused kernels to reduce data movement
- (b) Invertible operators to save memory and run large batches
- Overlapping I/O with Computation through Asynchronous Prefetching Queue
- Exploiting Sparsity of BERT’s Output Processing
- Pre-LayerNorm vs Post-LayerNorm Architecture
Good news! DeepSpeed obtains the fastest BERT training record: 44 minutes on 1024 NVIDIA V100 GPU. This is a 30% improvement over the best published result of 67 mins in end-to-end training time to achieve the same accuracy on the same number and generation of GPUs. This improvement does not come at the cost of excessive hardware resources but comes from improved software efficiency. For example, DeepSpeed can attain a staggering 64 teraflops of single GPU performance on a NVIDIA V100 GPU which is over 50% of the hardware peak.
In this blog post, we will discuss four technological improvements that enable DeepSpeed to achieve this record-breaking BERT training time.
These optimizations not only benefit BERT; they are also applicable to many other transformer-based models such as RoBERTa, XLNet, and UniLM. Furthermore, besides the improvements mentioned for pre-training, DeepSpeed achieves up to 1.5x speedups for the downstream tasks, such as the fine-tuning for Bing-BERT SQuAD.
Compared to SOTA, DeepSpeed significantly improves single GPU performance for transformer-based model like BERT. Figure 1 shows the single GPU throughput of training BERT-Large optimized through DeepSpeed, comparing with the two well-known PyTorch implementations from NVIDIA BERT and Hugging Face BERT. DeepSpeed reaches as high as 64 and 53 teraflops throughputs (corresponding to 272 and 52 samples/second) for sequence lengths 128 and 512, respectively, exhibiting up to 28% throughput improvements over NVIDIA BERT and up to 62% over HuggingFace BERT. We also support up to 1.8x larger batch size without running out of memory.
To achieve this performance, DeepSpeed implements a stochastic transformer which exhibits some level of non-deterministic noise without affecting overall convergence. In addition, DeepSpeed also implements a deterministic transformer kernel that is completely reproducible at the expense of a small performance regression of approximately 2% on average. Users can easily choose and switch between the two versions depending on their usage scenarios: Stochastic version pursues ultimate training performance goal, and deterministic version may save development time by better facilitating experimentation and debugging. We report performance numbers for both these kernels in Figure 1. The performance numbers were collected with a gradient accumulation step of 10 for all batch sizes and configurations, since on average an overall batch size used in practical scenarios range from a few hundred to a few thousand.
Figure 1: Performance evaluation of BERT-Large on a single V100 GPU, comparing DeepSpeed with NVIDIA and HuggingFace versions of BERT in mixed-sequence length training. The labeled points show the highest throughput of each implementation in teraflops (Tflops). DeepSpeed boosts throughput and allows for higher batch sizes without running out-of-memory.
Looking at distributed training across GPUs, Table 1 shows our end-to-end BERT-Large pre-training time (F1 score of 90.5 for SQUAD) using 16 to 1024 GPUs. We complete BERT pre-training in 44 minutes using 1024 V100 GPUs (64 NVIDIA DGX-2 nodes). In comparison, the previous SOTA from NVIDIA takes 47 mins using 1472 V100 GPUs. DeepSpeed is not only faster but also uses 30% less resources. Using the same 1024 GPUS,NVIDIA BERT takes 67 minutes using the same 1024 GPUs [1] BERT, whereas DeepSpeed takes 44 minutes, reducing training time by 30%. Similarly, on 256 GPUs, NVIDIA BERT takes 236 minutes while DeepSpeed takes 144 minutes (39% faster).
Table 1: BERT-Large training time using 1 to 64 DGX-2’s with DeepSpeed.
At the recent GTC 2020, NVIDIA announced the next generation hardware A100, which now offers 2.5X hardware peak performance over the V100 GPU. Assuming the A100 GPU allows us to obtain the same percentage of hardware peak performance (50%) as we obtained on V100 GPUs, we expect to obtain even higher throughput by combining our software optimizations with the new hardware. We project it would reduce BERT training time further to less than 25 minutes on a cluster of 1024 A100 GPUs.
In addition to the performance benefits we show for the pre-training, we have evaluated the performance of our customized kernel for fine-tuning the downstream tasks. Tables 2 and 3 show the samples-per-second achieved when running Bing-BERT SQuAD on NVIDIA V100 using 16 and 32 GB of memory, using PyTorch and DeepSpeed transformer kernels. For the 16-GB V100, we can achieve up to 1.5x speedup while supporting 2x larger batch size per GPU. On the other hand, we can support as large as 32 batch size (2.6x more than Pytorch) using 32GB of memory, while providing 1.3x speedup for the end-to-end fine-tune training. Note, that we use the best samples-per-second to compute speedup for the cases that PyTorch runs out-of-memory (OOM).
Table 2. Samples/second for running SQuAD fine-tuning on NVIDIA V100 (16-GB) using PyTorch and DeepSpeed transformer kernels.
Table 3: Samples/second for running SQuAD fine-tuning on NVIDIA V100 (32-GB) using PyTorch and DeepSpeed transformer kernels.
GPUs have very high peak floating-point throughput, but the default Transformer blocks in most framework implementations are far from reaching this peak. Figure 2 shows the structure of a Transformer block with the LayerNorm placed on the input stream of the two sublayers: Attention and Feed-Forward. To approach the GPU peak performance, we employ two lines of optimizations in our own Transformer kernel implementation: advanced fusion, and invertible operators.
Figure 2: Transformer Layer with Pre-LayerNorm Architecture
We observe that transformer-based networks trigger many invocations of CUDA kernels operating in a producer-consumer fashion, adding a lot of cost for transferring data to and from global memory and overhead from kernel launching. Existing compiler-based approaches perform fine-grained fusion (e.g., fusion of element-wise operations), leading to missed fusion opportunities. In contrast, we fully exploit both fine-grain and coarse-grained fusion, tailored for Transformer blocks.
QKV and various fusions. We merge the three Query (Q), Key (K), and Value (V) weight matrices to dispatch a larger QKV GEMM to expose more parallelism and improve data locality on GPU’s shared memory and register files, as shown in Figure 3. Next, we combine the data-layout transformation of the QKV’s output matrix with the bias addition. We then partition the large QKV matrix into three transformed ones, used for the following self-attention computation.
As Figure 3 illustrates, we read the QKV matrix in consecutive rows (shown by red box), and write them in the three transformed Q, K, and V matrices. Since each matrix starts from a different offset, we may have uncoalesced access to the main memory. Thus, we use the shared memory as an intermediate buffer, in order to rearrange the data in a way that we can put the data in consecutive parts of memory. Even though we produce an uncoalesced pattern when accessing shared memory, we reduce the cost of uncoalesced access to main memory to better exploit memory bandwidth, resulting in 3% to 5% performance improvement in the end-to-end training.
Figure 3: QKV’s GEMM and transform Kernel-Fusion
We perform additional fusions such as merging the addition of bias from the attention-output GEMM with the addition from the residual connection and also dropout, which allows accesses to happen in the register files and shared memory, which are orders of magnitude faster than the expensive write-back to the global memory.
Warp-level communication. To alleviate the synchronization overhead among parallel GPU cores and further increase the resource utilization of the fused kernels, we use the warp-level (data shuffle instructions) instead of the default inter-warp communication. Take the layer-normalization and SoftMax kernel as examples, we perform each reduction operation inside a warp, while distributing different reductions across different warps. This way, we alleviate the synchronization among the parallel threads and further increase the GPU resource utilization.
Stochastic vs deterministic kernels. DL training is generally robust to some level of stochasticity, and in some cases, controlled noises such as dropouts act as regularizer which improve generalization. In designing our transformer kernel, we embrace some level of stochasticity to improve throughput by allowing for limited data race conditions to exist in the kernel: We leverage implicit warp synchronous programming to achieve higher performance for the warp-level cooperative operations [3]. The lack of explicit warp level synchronization act as non-deterministic noise without affecting the overall convergence behavior of the transformer kernels while giving a decent throughput boost.
In addition, DeepSpeed also implements a non-stochastic transformer kernel with explicit warp synchronization that produces deterministic results at the expense of a small performance regression. Users can easily choose and switch between the two versions depending on their usage scenarios: Stochastic version pursues ultimate training performance goal, and deterministic version may save development time by better facilitating experimentation and debugging.
In our experiments, we use stochastic kernels for the pre-training BERT, while using non-stochastic kernels for fine-tuning to achieve fully reproducible results. We recommend using stochastic kernels for training tasks involving massive amounts of data such as pre-training, while using non-stochastic version when training with limited data such as in the case of fine-tuning for more consistent results.
Cost-effective rematerialization. When fusing kernels of the different operations, we observe that some operators are inexpressive to compute but incur expensive data movement cost, such as addition of bias and dropout. For these operations, we avoid saving their results in the forward pass, but instead recompute them during the backward pass, which turns out to be much faster than having their results written and reloaded from the main memory.
We also observe that the intermediate activations from several operators in the Transformer blocks incur a large memory consumption, such as SoftMax and Layer Norm. For these operators, we drop the inputs to these layers to reduce the footprint of activation memory, by leveraging the fact that they are invertible functions, which are functions whose backward pass is independent of the inputs and can be formulated based only on the outputs [2]. Figure 4 and Figure 5 show the examples of the original implementation of SoftMax and Layer-Norm in PyTorch versus the invertible SoftMax implementation in DeepSpeed. Through this optimization, we are able to reduce the activation memory of the operator by half, and the reduced memory allows us to train with larger batch sizes, which once again improves GPU efficiency.
Figure 4: DeepSpeed invertible SoftMax operation versus Default PyTorch SoftMax operation
Figure 5: DeepSpeed invertible LayerNorm operation versus Default PyTorch LayerNorm operation
Beyond highly optimized transformer kernels, the BERT training has other performance limiting factors, e.g., data loading. We develop our own asynchronous worker which prefetches batches of data into a queue only at “safe points” – points when the CPUs are idle (e.g., right after asynchronously launching the forward pass). In this way, we make sure that there is no dequeuing and copying data from CPU to GPU when there is computation on the CPU side. This is different from the default PyTorch data loader, which can prefetch data at any points and cause performance interference. By using this method, we hide almost all I/O overhead, which accounts for 4% of the original training time.
We improve the end-to-end training time by 5.4% by recognizing and exploiting sparsity in BERT’s output processing. The output processing involves two steps: i) BERT projection from the hidden output dimension of the final transformer layer to the language vocabulary, using a matrix-matrix multiplication, and ii) a cross-entropy of the masked output tokens to the get each sequence’s prediction error. The cost of the first step is proportional to the vocabulary size, hidden output dimension and the sequence length, and can be as expensive as a transformer layer computation or more. However, only about 15% of the tokens are masked, and we only need the cross-entropy for the masked tokens. Therefore, the projection can be done as an efficient sparse computation. To do so, we discard the rows of the final transformer layer that corresponding to the non-masked tokens before doing the projection, reducing the computation cost of output processing by 85%.
We observe that with large batch size (e.g., 64K) the default BERT pre-training suffers from training instability, which can result in model divergence or convergence to bad/suspicious local optima. Further investigation shows that the default BERT has vanishing gradients issue. To mitigate the issue, we changed the placement of LayerNorm (Post-LayerNorm) by placing it only on the input stream of the sublayers in the Transformer block (called Pre-LayerNorm), a modification described by several recent works for neural machine translation. The Pre-LayerNorm results in several useful characteristics such as avoiding vanishing gradient, stable optimization, and performance gain. It allows us to train at aggregated batch size of 64K with increased learning rate and faster convergence.
To try out these optimizations and training recipe, please check out our BERT training tutorial and source code at the DeepSpeed GitHub repo.
[1] “NVIDIA Clocks World’s Fastest BERT Training Time and Largest Transformer Based Model, Paving Path For Advanced Conversational AI” https://devblogs.nvidia.com/training-bert-with-gpus/.
[2] S. R. Bulo, L. Porzi, and P. Kontschieder, “In-place activated batch norm for memory-optimized training of dnns” 2017. http://arxiv.org/abs/1712.02616.
[3] Mark Harris and Kyrylo Perelygin, “Cooperative Groups: Flexible CUDA Thread Programming”, https://devblogs.nvidia.com/cooperative-groups/.
Updated: May 27, 2020
---
## Training a Trillion Parameters with Pipeline Parallelism
**URL:** https://www.deepspeed.ai/2020/09/08/pipeline-parallelism.html
**Contents:**
- Training a Trillion Parameters with Pipeline Parallelism
- Contents
DeepSpeed includes new support for pipeline parallelism! DeepSpeed’s training engine provides hybrid 3D parallelism for training models with over a trillion parameters. In addition to scaling to the extreme, we have demonstrated that hybrid parallelism accelerates training on clusters with low-bandwidth network by up to 7x.
Updated: September 8, 2020
---
## Turing-NLG: A 17-billion-parameter language model by Microsoft
**URL:** https://www.deepspeed.ai/2020/02/13/turing-nlg.html
**Contents:**
- Turing-NLG: A 17-billion-parameter language model by Microsoft
- Contents
Updated: February 13, 2020
---
## Up to 5x less communication and 3.4x faster training through 1-bit Adam
**URL:** https://www.deepspeed.ai/2020/09/08/onebit-adam-news.html
**Contents:**
- Up to 5x less communication and 3.4x faster training through 1-bit Adam
Adam is an effective and probably the most well-utilized optimizer for training many large-scale deep learning models. However, Adam is generally not compatible with communication-efficient optimization algorithms, and therefore the communication cost could become a bottleneck while scaling across distributed devices. We introduce a new algorithm - 1-bit Adam - and its efficient implementation in DeepSpeed. 1-bit Adam offers the same convergence as Adam, incurs up to 5x less communication that enables up to 3.5x higher throughput for BERT-Large pretraining and up to 2.7x higher throughput for SQuAD fine-tuning on bandwidth-limited clusters.
Updated: September 8, 2020
---
## DeepSpeed Sparse Attention
**URL:** https://www.deepspeed.ai/2020/09/08/sparse-attention.html
**Contents:**
- DeepSpeed Sparse Attention
- Contents
- Performance Results
Attention-based deep learning models such as the transformers are highly effective in capturing the relationship between tokens in an input sequence, even across long distances. As a result, they are used with text, image, and sound-based inputs, where the sequence length can be in thousands of tokens. However, despite the effectiveness of attention modules to capture long term dependencies, in practice, their application to long sequence input is limited by compute and memory requirements of the attention computation that grow quadratically, O(n^2), with the sequence length n.
To address this limitation, DeepSpeed offers a suite of sparse attention kernels –an instrumental technology that can reduce the compute and memory requirement of attention computation by orders-of-magnitude via block-sparse computation. The suite not only alleviates the memory bottleneck of attention calculation, but also performs sparse computation efficiently. Its APIs allow convenient integration with any transformer-based models. Along with providing a wide spectrum of sparsity structures, it has the flexibility of handling any user-defined block-sparse structures. More specifically, sparse attention (SA) can be designed to compute local attention between nearby tokens, or global attention via summary tokens computed with local attention. Moreover, SA can also allow random attention, or any combination of local, global, and random attention as shown in the following figure with blue, orange, and green blocks, respectively. As a result, SA decreases the memory footprint to O(wn), in which 1 < w < n is a parameter, whose value depends on the attention structure.
This library is PyTorch based and develops required kernels through Triton platform; kernels are not written in CUDA, which leaves the door open for CPU/OpenCL/Vulkan support in the future. The library is an extension to DeepSpeed and can be used through DeepSpeed as well as stand alone. Block-sparse computations handled by DeepSpeed Sparse Attention kernels are illustrated in following figures for forward and backward passes respectively. In the figures, S stands for a block-sparse matrix and D a dense matrix.
To learn more about Sparsity Config, and also how to use this library, please check our tutorial that provides detailed information about it.
We also define a template to have variable structure (top figure), which can be used to simply customize any block-sparse random/local/global attention pattern. In addition to this list, user can add any other sparsity structure as described in tutorial section.
Updated: September 8, 2020
---
## The Fastest and Most Efficient BERT Training through Optimized Transformer Kernels
**URL:** https://www.deepspeed.ai/2020/05/18/bert-record.html
**Contents:**
- The Fastest and Most Efficient BERT Training through Optimized Transformer Kernels
We introduce new technology to accelerate single GPU performance via kernel optimizations. These optimizations not only create a strong foundation for scaling out large models, but also improve the single GPU performance of highly tuned and moderately sized models like BERT by more than 30%, reaching a staggering performance of 66 teraflops per V100 GPU, which is 52% of the hardware peak. Using optimized transformer kernels as the building block, DeepSpeed achieves the fastest BERT training record: 44 minutes on 1,024 NVIDIA V100 GPUs, compared with the best published result of 67 minutes on the same number and generation of GPUs.
Updated: May 18, 2020
---
## DeepSpeed Microsoft Research Webinar on August 6th, 2020
**URL:** https://www.deepspeed.ai/2020/07/23/deepspeed-webinar.html
**Contents:**
- DeepSpeed Microsoft Research Webinar on August 6th, 2020
- Contents
Updated: July 23, 2020
---
## DeepSpeed with 1-bit Adam: 5x less communication and 3.4x faster training
**URL:** https://www.deepspeed.ai/2020/09/08/onebit-adam-blog-post.html
**Contents:**
- DeepSpeed with 1-bit Adam: 5x less communication and 3.4x faster training
- Contents
- 1. Introduction
- 1.1 Background: Classic compression techniques
- 1.2 Challenges in applying error-compensation to Adam
- 2. Compressing communication with 1-bit Adam
- 2.1 How 1-bit Adam works under the hood
- 2.2 Addressing system challenges for 1-bit Adam
- 3. Benefits of 1-bit Adam on communication-constrained systems
- 4. Dive deeper into 1-bit Adam evaluation results
Scalable training of large models (like BERT and GPT-3) requires careful optimization rooted in model design, architecture, and system capabilities. From a system standpoint, communication has become a major bottleneck, especially on commodity systems with standard TCP interconnects that offer limited network bandwidth. Communication compression is an important technique to reduce training time on such systems. One of the most effective ways to compress communication is via error compensation compression, which offers robust convergence speed, even under 1-bit compression. However, state-of-the-art error compensation techniques only work with basic optimizers like Stochastic Gradient Descent (SGD) and momentum SGD, which are linearly dependent on the gradients. They do not work with non-linear gradient-based optimizers like Adam, which offers state-of-the-art convergence efficiency and accuracy for many tasks, including training of BERT-like models. For a powerful optimizer like ADAM, the non-linear dependency on gradient (in the variance term) makes it challenging to develop error compensation-based compression techniques, limiting the practical value of the state-of-the-art communication compression techniques.
One way of communication compression is 1-bit compression, which can be expressed as:
With this compression, we could achieve a 32x reduction of memory size by representing each number using one bit. The problem is that using this straightforward method would significantly degrade the convergence speed, which makes this method inapplicable. To solve this problem, recent studies show that by using error compensation compression, we could expect almost the same convergence rate with communication compression. The idea of error compensation can be summarized as: 1) doing compression, 2) memorizing the compression error, and then 3) adding the compression error back in during the next iteration. For SGD, doing error compression leads to:
Where C(⋅) is the 1-bit compression operator. The good thing about doing this error compensation is that the history compression error (e_t and e_(t-1)) would be canceled by itself eventually, which can be seen by:
This strategy has been proven to work for optimization algorithms that are linearly dependent on the gradient, such as SGD and Momentum SGD.
We provide an overview of the Adam algorithm below. The update rules are as follows.
As shown in the equations above, the variance term v_t is nonlinearly dependent on the gradient g_t. If we apply basic error compensation compression to Adam, we observe that Adam will not converge as shown in Figure 1.
Figure 1: Inapplicability of Error-compensation Compression for Adam due to non-linear dependence on the gradient
To compress communication while using the Adam optimizer, we develop 1-bit Adam, which addresses the non-linearity in gradients via preconditioning. We observe that the magnitude of changes on the non-linear term, variance ( v_t), decrease significantly after a few epochs of training and setting v_t constant afterwards will not change the convergence speed. The proposed 1-bit Adam optimizer, as shown in Figure 2, consists of two parts: the warmup stage, which is essentially the vanilla Adam algorithm; and the compression stage, which keeps the variance term constant and compresses the remaining linear term, that is the momentum, into 1-bit representation.
The compression stage of the algorithm is controlled by a threshold parameter (as shown in Figure 2). When we detect that the change in “variance” falls below a certain threshold, we switch to the compression stage. Our study shows that only 15-20% of the overall training steps are needed for the warmup stage.
Figure 2: Comparison of distributed training steps in classic Adam and the proposed 1-bit compressed Adam algorithm
The weight update rule for 1-bit Adam is governed by the following equations.
For the i-th worker, in the compression stage:
Where x_t is the model after iteration (t-1), m_t^(i), e_t^(i) are the momentum and compression error on worker i after iteration (t-1), and v_warmup is the variance term after the warmup stage.
Besides the algorithmic challenge, there are two system challenges in applying 1-bit Adam in training systems. First, we need efficient kernels that convert the momentum to 1-bit representations. Second, we need efficient communication schemes to exchange this compressed momentum across different GPUs. The goal of compression is to reduce the overall training time so that commodity systems with bandwidth-limited interconnects can be used to train large models. We address these challenges in DeepSpeed and introduce a fully optimized 1-bit Adam implementation for training on communication-constrained systems.
1-bit Adam offers the same convergence as Adam, incurs up to 5x less communication that enables up to 3.5x higher throughput for BERT-Large pretraining and up to 2.7x higher throughput for SQuAD fine-tuning. This end-to-end throughput improvement is enabled by the 6.6x (Figure 3) and 6.2x (Figure 4) speedup observed during the compression stage. It is worth mentioning that our 1-bit Adam optimizer scales so well on a 40 Gigabit Ethernet system that its performance is comparable to Adam’s scalability on a 40 Gigabit InfiniBand QDR system. We note that the effective bandwidth on 40 Gigabit Ethernet is 4.1 Gbps based on iperf benchmarks whereas InfiniBand provides near-peak bandwidth of 32Gbps based on InfiniBand perftest microbenchmarks.
Figure 3: Scalability of 1-bit Adam for BERT-Large Pretraining on V100 GPUs with batch size of 16/GPU.
Figure 4: Scalability of 1-bit Adam for SQuAD Finetuning on V100 GPUs with batch size of 3/GPU.
One major question for using 1-bit Adam is the convergence speed, and we find that 1-bit Adam can achieve the same convergence speed and comparable testing performance using the same number of training samples as shown in Figure 5.
Figure 5: 1-bit Adam converges like Adam using the same number of training samples.
Detailed BERT-Base and BERT-Large results are shown in Table 1. We see that the scores are on par with or better than the original model for both the uncompressed and compressed cases.
Table 1: Verifying correctness of 1-bit Adam on various testing tasks
Up to 5x less communication: 1-bit Adam provides the same convergence as Adam and reduces the communication volume by 16x during the compression stage for 16-bit (FP16) training. For BERT pretraining, this leads to an overall communication reduction of 5x as we observed the warmup stage to be just 15% of the end-to-end training time.
The formula to calculate the communication volume ratio of the original versus 1-bit Adam is as follows:
In the case of warmup equaling 15%, original Adam incurs 5x of the communication as 1-bit Adam.
We present two main results for training BERT-Large on systems with two different bandwidth-limited interconnects: 1) 40 gigabit Ethernet (Figure 5) and 2) 40 gbps InfiniBand QDR (Figure 6). During the compression phase, we observe up to 6.6x higher throughput on the system with Ethernet and up to 2x higher throughput on the system with InfiniBand, resulting in end-to-end speed up (including both warmup and compression stages) of 3.5x and 2.7x, respectively. The major benefit of 1-bit Adam comes from the communication volume reduction—enabled by our compressed momentum exchange—and from our custom allreduce operation that implements efficient 1-bit communication using non-blocking gather operations followed by an allgather operation.
It is important to note that one can also increase total batch size to reduce communication using optimizers like LAMB instead of Adam for BERT pretraining. However, 1-bit Adam avoids the need for rigorous hyperparameter tuning, which is often more difficult for large batches from our experience. Furthermore, 1-bit Adam also works very well for workloads that have small critical batch size (cannot converge well with large batch size) like many fine-tuning tasks.
Figure 5: Performance of 1-bit Adam for BERT-Large training on 40 Gbps Ethernet interconnect during the compression stage.
Figure 6: Performance of 1-bit Adam for BERT-Large training on 40 Gbps InfiniBand interconnect during the compression stage.
1-bit Adam offers scalability not only on large-scale training tasks but also on tasks like SQuAD fine-tuning. As shown in Figures 7 and 8, 1-bit Adam scales well on both Ethernet- and InfiniBand-based systems and offers up to 6.2x higher throughput (during the compression stage) on the Ethernet-based system, resulting in 2.7x end-to-end speedup (25% warmup plus 75% compression stage). For SQuAD fine-tuning, we observed that a total batch size of 96 offers the best F1 score. Batch sizes larger than this value lower the convergence rate and require additional hyperparameter tuning. Therefore, in order to scale to 32 GPUs, we can only apply a small batch size of 3-4 per GPU. This makes fine-tuning tasks communication intensive and hard to scale. 1-bit Adam addresses the scaling challenge well, obtaining 3.4x communication reduction without enlarging batch size, and it results in a 2.7x end-to-end speedup.
Figure 7: Performance of 1-bit Adam for SQuAD fine-tuning on 40 gbps Ethernet during the compression stage.
Figure 8: Performance of 1-bit Adam for SQuAD fine-tuning on 40 gbps InfiniBand interconnect during the compression stage.
Updated: September 8, 2020
**Examples:**
Example 1 (unknown):
```unknown
1 / (warmup + (1 – warmup)/16)
```
---
```
### references/2023.md
```markdown
# Deepspeed - 2023
**Pages:** 21
---
## DeepSpeed-VisualChat: Improve Your Chat Experience with Multi-Round Multi-Image Inputs
**URL:** https://www.deepspeed.ai/2023/10/03/deepspeed-visualchat.html
**Contents:**
- DeepSpeed-VisualChat: Improve Your Chat Experience with Multi-Round Multi-Image Inputs
- Contents
Updated: October 3, 2023
---
## DeepSpeed4Science:利用先进的AI系统优化技术实现科学发现
**URL:** https://www.deepspeed.ai/2023/09/18/deepspeed4science-chinese.html
**Contents:**
- DeepSpeed4Science:利用先进的AI系统优化技术实现科学发现
- Contents
Updated: September 18, 2023
---
## DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models
**URL:** https://www.deepspeed.ai/2023/08/23/ulysses.html
**Contents:**
- DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models
- Contents
Updated: August 23, 2023
---
## DeepSpeed Ulysses: 训练极长序列Transformer模型的系统优化
**URL:** https://www.deepspeed.ai/2023/08/23/ulysses-chinese.html
**Contents:**
- DeepSpeed Ulysses: 训练极长序列Transformer模型的系统优化
- Contents
Updated: August 23, 2023
---
## DeepSpeed Chat: 一键式RLHF训练,让你的类ChatGPT千亿大模型提速省钱15倍
**URL:** https://www.deepspeed.ai/2023/04/23/deepspeed-chat-chinese.html
**Contents:**
- DeepSpeed Chat: 一键式RLHF训练,让你的类ChatGPT千亿大模型提速省钱15倍
- Contents
Updated: April 23, 2023
---
## DeepSpeed ZeRO++: LLMやチャットモデルの訓練を劇的に高速化 – 通信オーバヘッドを1/4に大幅削減 -
**URL:** https://www.deepspeed.ai/2023/06/21/zeropp-japanese.html
**Contents:**
- DeepSpeed ZeRO++: LLMやチャットモデルの訓練を劇的に高速化 – 通信オーバヘッドを1/4に大幅削減 -
- Contents
Updated: June 21, 2023
---
## DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference
**URL:** https://www.deepspeed.ai/2023/11/05/deepspeed-fastgen.html
**Contents:**
- DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference
- Contents
Updated: November 5, 2023
---
## DeepSpeed-VisualChat: 複数ラウンド・複数画像の入力が可能なAIチャット体験を実現
**URL:** https://www.deepspeed.ai/2023/10/03/deepspeed-visualchat-japanese.html
**Contents:**
- DeepSpeed-VisualChat: 複数ラウンド・複数画像の入力が可能なAIチャット体験を実現
- Contents
Updated: October 3, 2023
---
## DeepSpeed-FastGen: MIIとDeepSpeed-InferenceによるLLMのための高速なテキスト生成
**URL:** https://www.deepspeed.ai/2023/11/05/deepspeed-fastgen-japanese.html
**Contents:**
- DeepSpeed-FastGen: MIIとDeepSpeed-InferenceによるLLMのための高速なテキスト生成
- Contents
Updated: November 5, 2023
---
## Zero Inference
**URL:** https://www.deepspeed.ai/2023/09/12/ZeRO-Inference.html
**Contents:**
- Zero Inference
- Contents
title: “ZeRO-Inference: 20X faster inference through weight quantization and KV cache offloading” excerpt: “” link: https://github.com/deepspeedai/DeepSpeedExamples/blob/master/inference/huggingface/zero_inference/README.md date: 2023-09-12 00:09:00 tags: inference ZeRO quantization English —
Updated: September 12, 2023
---
## DeepSpeed Ulysses: Transformerモデルを非常に長いシーケンスで訓練するための最適化
**URL:** https://www.deepspeed.ai/2023/08/23/ulysses-japanese.html
**Contents:**
- DeepSpeed Ulysses: Transformerモデルを非常に長いシーケンスで訓練するための最適化
- Contents
Updated: August 23, 2023
---
## DeepSpeed-VisualChat:多轮图像+文字,为你展现不一样的AI聊天魅力
**URL:** https://www.deepspeed.ai/2023/10/03/deepspeed-visualchat-chinese.html
**Contents:**
- DeepSpeed-VisualChat:多轮图像+文字,为你展现不一样的AI聊天魅力
- Contents
Updated: October 3, 2023
---
## DeepSpeed ZeRO++: A leap in speed for LLM and chat model training with 4X less communication
**URL:** https://www.deepspeed.ai/2023/06/21/zeropp.html
**Contents:**
- DeepSpeed ZeRO++: A leap in speed for LLM and chat model training with 4X less communication
- Contents
Updated: June 21, 2023
---
## Announcing the DeepSpeed4Science Initiative: Enabling large-scale scientific discovery through sophisticated AI system technologies
**URL:** https://www.deepspeed.ai/2023/09/18/deepspeed4science.html
**Contents:**
- Announcing the DeepSpeed4Science Initiative: Enabling large-scale scientific discovery through sophisticated AI system technologies
- Contents
Updated: September 18, 2023
---
## Scaling Large-Scale Generative Mixture-of-Expert Multimodal Model With VL-MoE
**URL:** https://www.deepspeed.ai/2023/03/30/multi-modal.html
**Contents:**
- Scaling Large-Scale Generative Mixture-of-Expert Multimodal Model With VL-MoE
- Contents
The field of Artificial Intelligence-Generated Content (AIGC) is rapidly growing, with the goal of making content creation more efficient and accessible. One of the most exciting areas of AIGC is the development of large-scale multi-modal models like Flamingo, BLIP, and GPT4, which can accept inputs from multiple resources, e.g., image, text, audio, etc., and generate a variety of formats as outputs. For example, image creation can be made through stable diffusion and DALLE using the prompt text, and the new feature in the coming Office can create slides with texts, images, animations, etc., by leveraging the power of the new Microsoft Office Copilot.
Scaling up the model size is one common approach to boost usability and capability of AIGC tasks. However, simply scaling up dense architectures (e.g., from GPT-1 to GPT-3) is usually extremely resource-intensive and time-consuming for both model training and inference. One effective way to tackle this challenge is to apply mixture of experts (MoE). In particular, recent text-based MoE and vision-based MoE studies have demonstrated that MoE models can significantly reduce the training and resource cost as compared to a quality-equivalent dense model, or produce a higher quality model under the same training budget. Up to now, the effectiveness of jointly training MoE for multi-modal models remains not well understood. To explore this important capability, DeepSpeed team is proud to announce our first large-scale generative mixture-of-expert (MoE) multimodal model, named VL-MoE.
Figure 1: The new encoding process in our VL-MoE for various modality inputs, for which gray and colored blocks indicate non-activated and activated modules, respectively.
Specifically, we incorporate the MoE structure into the classical single-tower multi-modal model by comprising of the following components: (1) a shared self-attention module across modalities, (2) a pool of modality-specific experts in the feed-forward network (FFN), and (3) a sparse gated MoE extended from the dense FFN. Subsequently, under the same amount of training resources as that used in VLMO (200k training steps), we demonstrate VL-MoE’s advantages over the state-of-the-art dense counterparts in the following two aspects:
(1) VL-MoE can achieve significant accuracy improvement in comparison to its dense counterparts. Table 1 demonstrates that under the same training budget (i.e., have the same number of activated parameters for each token), VL-MoE Base with 32 experts achieves better accuracy than the VLMO-Base dense model on all four vision-language datasets.
(2) VL-MoE achieves similar model quality with a much smaller activated number of parameters compared to its dense counterparts. Our results show that the finetuning performance of our VL-MoE is similar to that of the 3.1X larger VLMO-Large dense model (i.e., 3.1X more activated number of parameters per token). This can directly translate to approximately 3.1X training cost reduction as the training FLOPs for transformers are proportional to the activated model size per token.
Table 1: Comparison of finetuning accuracy results for different models used in vision-language classification tasks and image-text retrieval tasks.
A sophisticated MoE model design requires a highly efficient and scalable training system that can support multi-dimensional parallelism and efficient memory management. DeepSpeed MoE training system offers such advanced capabilities including easy-to-use APIs enabling flexible combinations of data, tensor, and expert parallelism. Furthermore, DeepSpeed MoE enables larger model scale than state-of-the-art systems by exploiting expert parallelism and ZeRO optimizations together. By leveraging the DeepSpeed MoE system, VL-MoE Base with 32 experts achieves similar model quality as VLMO-dense Large with about 2.5x training speedup.
DeepSpeed MoE system is already open-sourced and can be easily used as plug-and-play component to achieve high-performance low-cost training for any large-scale MoE models. The tutorial of how to use DeepSpeed MoE is available here. VL-MoE is currently in the process of being integrated as a model example of DeepSpeed Examples. Please stay tuned for our upcoming updates on this thread.
Updated: March 30, 2023
---
## DeepSpeed-FastGen:通过 MII 和 DeepSpeed-Inference 实现 LLM 高吞吐量文本生成
**URL:** https://www.deepspeed.ai/2023/11/05/deepspeed-fastgen-chinese.html
**Contents:**
- DeepSpeed-FastGen:通过 MII 和 DeepSpeed-Inference 实现 LLM 高吞吐量文本生成
- Contents
Updated: November 5, 2023
---
## DeepSpeed4Scienceイニシアティブ: 洗練されたAIシステムのテクノロジーにより大規模な科学的発見を可能に
**URL:** https://www.deepspeed.ai/2023/09/18/deepspeed4science-japanese.html
**Contents:**
- DeepSpeed4Scienceイニシアティブ: 洗練されたAIシステムのテクノロジーにより大規模な科学的発見を可能に
- Contents
Updated: September 18, 2023
---
## DeepSpeed Chat: Easy, Fast and Affordable RLHF Training of ChatGPT-like Models at All Scales
**URL:** https://www.deepspeed.ai/2023/04/23/deepspeed-chat.html
**Contents:**
- DeepSpeed Chat: Easy, Fast and Affordable RLHF Training of ChatGPT-like Models at All Scales
- Contents
Updated: April 23, 2023
---
## DeepSpeed ZeRO++:降低4倍网络通信,显著提高大模型及类ChatGPT模型训练效率
**URL:** https://www.deepspeed.ai/2023/06/21/zeropp-chinese.html
**Contents:**
- DeepSpeed ZeRO++:降低4倍网络通信,显著提高大模型及类ChatGPT模型训练效率
- Contents
Updated: June 21, 2023
---
## DeepSpeed主要技術の概要紹介
**URL:** https://www.deepspeed.ai/2023/06/06/deepspeed-overview-japanese.html
**Contents:**
- DeepSpeed主要技術の概要紹介
- Contents
我々が研究開発しているDeepSpeedについて、主要技術を日本語で説明した資料を公開しました。GPT3やChatGPTのような生成型AIのための大規模言語モデルを含む、様々な深層学習の訓練や推論に容易に適用でき、モデルの大規模化、高速化、コスト削減を可能にします。こちらよりダウンロードしてください。
Updated: June 6, 2023
---
## DeepSpeed Chat: ChatGPTライクなモデルを簡単・高速・低コストに、あらゆるスケールで学習
**URL:** https://www.deepspeed.ai/2023/04/23/deepspeed-chat-japanese.html
**Contents:**
- DeepSpeed Chat: ChatGPTライクなモデルを簡単・高速・低コストに、あらゆるスケールで学習
- Contents
Updated: April 23, 2023
---
```
### references/assets.md
```markdown
# Deepspeed - Assets
**Pages:** 29
---
##
**URL:** https://www.deepspeed.ai/assets/images/zero1_dp8_1.5B_log.png
---
##
**URL:** https://www.deepspeed.ai/assets/images/mii/bert.png
---
##
**URL:** https://www.deepspeed.ai/assets/files/DeepSpeed_Overview_Japanese_2023Jun7th.pdf
---
##
**URL:** https://www.deepspeed.ai/assets/images/zero_offload_dp1_10B_smi.png
---
##
**URL:** https://www.deepspeed.ai/assets/images/zero3-offload-512-v100.png
---
##
**URL:** https://www.deepspeed.ai/assets/images/data_efficiency/data_efficiecy_fig1.png
---
##
**URL:** https://www.deepspeed.ai/assets/images/zeropp/ZeRO-baseline.png
---
##
**URL:** https://www.deepspeed.ai/assets/images/mii/azure-cost.png
---
##
**URL:** https://www.deepspeed.ai/assets/images/data_efficiency/data_efficiecy_fig0.png
---
##
**URL:** https://www.deepspeed.ai/assets/images/zero3-offload-200B-scalability.png
---
##
**URL:** https://www.deepspeed.ai/assets/images/mii/hero.png
---
##
**URL:** https://www.deepspeed.ai/assets/images/zero_offload_dp1_10B_log.png
---
##
**URL:** https://www.deepspeed.ai/assets/images/zero2_dp32_10B_smi.png
---
##
**URL:** https://www.deepspeed.ai/assets/images/data_efficiency/data_efficiecy_fig3.png
---
##
**URL:** https://www.deepspeed.ai/assets/images/mii/roberta.png
---
##
**URL:** https://www.deepspeed.ai/assets/images/zero_offload_dp1_10B_cpu.png
---
##
**URL:** https://www.deepspeed.ai/assets/images/oom_dp8_1.5B_log.png
---
##
**URL:** https://www.deepspeed.ai/assets/images/data_efficiency/data_efficiecy_fig2.png
---
##
**URL:** https://www.deepspeed.ai/assets/images/vl_moe.png
---
##
**URL:** https://www.deepspeed.ai/assets/images/zero3-offload-1-v100.png
---
##
**URL:** https://www.deepspeed.ai/assets/images/zero3-offload-memory-overview.png
---
##
**URL:** https://www.deepspeed.ai/assets/images/mii/opt-bloom.png
---
##
**URL:** https://www.deepspeed.ai/assets/images/zero1_dp8_1.5B_smi.png
---
##
**URL:** https://www.deepspeed.ai/assets/images/mii/tput-llms.png
---
##
**URL:** https://www.deepspeed.ai/assets/images/mii/llm-latency-sd-latency-zoom.png
---
##
**URL:** https://www.deepspeed.ai/assets/images/mii/gpt.png
---
##
**URL:** https://www.deepspeed.ai/assets/images/zeropp/ZeROpp.png
---
##
**URL:** https://www.deepspeed.ai/assets/images/mii/mii-arch.png
---
##
**URL:** https://www.deepspeed.ai/assets/images/zero2_dp32_10B_log.png
---
```
### references/index.md
```markdown
# Deepspeed Documentation Index
## Categories
### 08
**File:** `08.md`
**Pages:** 1
### 09
**File:** `09.md`
**Pages:** 2
### 2020
**File:** `2020.md`
**Pages:** 16
### 2023
**File:** `2023.md`
**Pages:** 21
### Assets
**File:** `assets.md`
**Pages:** 29
### Mii
**File:** `mii.md`
**Pages:** 1
### Other
**File:** `other.md`
**Pages:** 15
### Tutorials
**File:** `tutorials.md`
**Pages:** 59
```
### references/mii.md
```markdown
# Deepspeed - Mii
**Pages:** 1
---
## DeepSpeed-MII: instant speedup on 24,000+ open-source DL models with up to 40x cheaper inference
**URL:** https://www.deepspeed.ai/2022/10/10/mii.html
**Contents:**
- DeepSpeed-MII: instant speedup on 24,000+ open-source DL models with up to 40x cheaper inference
- Contents
- How does MII work?
- Supported Models and Tasks
- Inference Optimizations with MII
- MII-Public and MII-Azure
- Quantifying Latency and Cost Reduction
- Latency Critical Scenarios
- Cost Sensitive Scenarios
- Deployment Options
The Deep Learning (DL) open-source community has seen tremendous growth in the last few months. Incredibly powerful text generation models such as the Bloom 176B, or image generation models such as Stable Diffusion are now available to anyone with access to a handful or even a single GPU through platforms such as Hugging Face. While open-sourcing has democratized access to AI capabilities, their application is still restricted by two critical factors: 1) inference latency and 2) cost.
There has been significant progress in system optimizations for DL model inference that can drastically reduce both latency and cost, but those are not easily accessible. The main reason for this limited accessibility is that the DL model inference landscape is diverse with models varying in size, architecture, system performance characteristics, hardware requirements, etc. Identifying the appropriate set of system optimizations applicable to a given model and applying them correctly is often beyond the scope of most data scientists, making low latency and low-cost inference mostly inaccessible.
DeepSpeed Model Implementations for Inference (MII) is a new open-source python library from DeepSpeed, aimed towards making low-latency, low-cost inference of powerful models not only feasible but also easily accessible.
Figure 1: MII Architecture, showing how MII automatically optimizes OSS models using DS-Inference before deploying them on-premises using GRPC, or on Microsoft Azure using AML Inference.
Under-the-hood MII is powered by DeepSpeed-Inference. Based on the model type, model size, batch size, and available hardware resources, MII automatically applies the appropriate set of system optimizations from DeepSpeed-Inference to minimize latency and maximize throughput. It does so by using one of many pre-specified model injection policies, that allows MII and DeepSpeed-Inference to identify the underlying PyTorch model architecture and replace it with an optimized implementation (see Figure 1). In doing so, MII makes the expansive set of optimizations in DeepSpeed-Inference automatically available for thousands of popular models that it supports.
MII supports a growing list of tasks such as text generation, question-answering, text classification, etc, across thousands of transformer models available through multiple open-sourced model repositories such as Hugging Face, FairSeq, EluetherAI, etc. It supports dense models based on BERT, RoBERTa, GPT, OPT, and BLOOM architectures ranging from a few hundred million parameters in size to hundreds of billions of parameters in size. At the same time, it supports recent image generation models such as Stable Diffusion.
See the MII GitHub repo for an up-to-date list of models and tasks supported by MII.
Here we provide a summary of the expansive set of optimizations from DeepSpeed-inference made available via MII. For more details, please refer to [1, 2]:
DeepFusion for Transformers: For transformer-based models such as Bert, Roberta, GPT-2, and GPT-J, MII leverages the transformer kernels in DeepSpeed-Inference that are optimized to achieve low latency at small batch sizes and high throughput at large batch sizes using DeepFusion.
Multi-GPU Inference with Tensor-Slicing: For massive models such as Bloom 176B, MII automatically enables tensor-parallelism within a node to leverage aggregate memory bandwidth and compute across multiple GPUs to achieve the lowest latency and throughput compared to anything else that is currently available.
INT8 Inference with ZeroQuant: For massive models with tens or hundreds of billions of parameters, MII supports INT8 Inference with ZeroQuant. Using this feature not only reduces the memory footprint and the number of GPUs required for inference but also increases the inference throughput by supporting larger batch sizes and using INT8 compute, thus lowering cost compared to FP16.
ZeRO-Inference for Resource Constrained Systems: Models such as Bloom 176B, require over 176 GB of memory to just fit the model even with INT8 support. In the absence of the aggregate GPU memory across multiple GPUs required to deploy such models, MII enables ZeRO-Inference that can leverage the system CPU memory to deploy these massive models with a single GPU with limited memory.
Compiler Optimizations: When applicable, MII automatically applies compiler-based optimizations via TorchScript, nvFuser, and CUDA graph, in addition to the above optimizations, to further lower latency and improve throughput.
MII can work with two variations of DeepSpeed-Inference. The first, referred to as ds-public, contains most of the optimizations discussed above and is also available via our open-source DeepSpeed library. The second referred to as ds-azure, offers tighter integration with Azure, and is available via MII to all Microsoft Azure customers. We refer to MII running the two DeepSpeed-Inference variants as MII-Public and MII-Azure, respectively.
Both MII-Public and MII-Azure offer significant latency and cost reduction compared to open-sourced PyTorch implementation (Baseline). However for certain generative workloads, they can have differentiated performance: MII-Azure provides further improvements beyond MII-Public. We quantify the latency and cost reduction for both variations in the next section.
Inference workloads can be either latency critical, where the primary objective is to minimize latency, or cost sensitive, where the primary objective is to minimize cost. In this section, we quantify the benefits of using MII for both latency-critical and cost-sensitive scenarios.
For latency-critical scenarios, where a small batch size of 1 is often used, MII can reduce the latency by up to 6x for a wide range of open-source models, across multiple tasks. More specifically, we show model latency reduction of 1:
Up to 5.7x for multi-GPU inference for text generation using massive models such as Big Science Bloom, Facebook OPT, and EluetherAI NeoX (Figure 2 (left))
Up to 1.9x for image generation tasks model using Stable Diffusion (Figure 2 (right))
Up to 3x for relatively smaller text generation models (up to 7B parameters) based on OPT, BLOOM, and GPT architectures, running on a single GPU (Figures 3 and 4)
Up to 9x for various text representation tasks like fill-mask, text classification, question answering, and token classification using RoBERTa- and BERT- based models (Figures 5 and 6).
Figure 2: (left) Best achievable latency for large models. MII-Azure (int8) offers 5.7X lower latency compared to Baseline for Bloom-176B. (right) Stable Diffusion text to image generation latency comparison.
Figure 3: Latency comparison for OPT and BLOOM models. MII-Azure is up to 2.8x faster than baseline.
Figure 4: Latency comparison for GPT models. MII-Azure is up to 3x faster than baseline.
Figure 5: Latency comparison for RoBERTa models. MII offers up to 9x lower model latency and up to 3x lower end-to-end latency than baseline on several tasks and RoBERTa variants 1.
Figure 6: Latency comparison for BERT models. MII offers up to 8.9x lower model latency and up to 4.5x end-to-end latency across several tasks and BERT variants1.
MII can significantly reduce the inference cost of very expensive language models like Bloom, OPT, etc. To get the lowest cost, we use a large batch size that maximizes throughput for both baseline and MII. Here we look at the cost reduction from MII using two different metrics: i) tokens generated per second per GPU, and ii) dollars per million tokens generated.
Figures 7 and 8 show that MII-Public offers over 10x throughput improvement and cost reduction compared to the baseline, respectively. Furthermore, MII-Azure offers over 30x improvement in throughput and cost compared to the baseline.
Figure 7: Throughput comparison per A100-80GB GPU for large models. MII-Public offers over 15x throughput improvement while MII-Azure offers over 40x throughput improvement.
Figure 8: Cost of generating 1 million tokens on Azure with different model types. MII-Azure reduces the cost of generation by over 40x.
MII supported models can be deployed in two different ways as shown in Figure 1 with just a few lines of code.
MII-Public can be deployed on-premises or on any cloud offering. MII creates a lightweight GRPC server to support this form of deployment and provides a GRPC inference endpoint for queries. The code below shows how a supported model can be deployed with MII-Public Deployment.
MII supports deployment on Azure via AML Inference. To enable this, MII generates AML deployment assets for a given model that can be deployed using the Azure-CLI, as shown in the code below. Furthermore, deploying on Azure, allows MII to leverage DeepSpeed-Azure as its optimization backend, which offers better latency and cost reduction than DeepSpeed-Public.
To learn more about these deployment options and get started with MII, please the MII getting started guide.
We are very excited to share MII with the community and improve it with your feedback. We will continue to add support for more models in MII as well as enhance both MII-Public and MII-Azure for both on-premise and Azure users. Our hope is that while open sourcing has made powerful AI capabilities accessible to many, MII will allow for a wider infusion of these capabilities into a diverse set of applications and product offerings by instantly reducing the latency and cost of inferencing.
The table below shows the mapping between model aliases used in Figures 3, 4, 5, and 6 and real model names.
The end-to-end latency of an inference workload is comprised of two components: i) actual model execution, and ii) pre-/post-processing before and after the model execution. MII optimizes the actual model execution but leaves the pre-/post-processing pipeline for future optimizations. We notice that text representation tasks have significant pre-/post-processing overhead (Figures G and H). We plan to address those in a future update. ↩ ↩2 ↩3
Updated: October 10, 2022
**Examples:**
Example 1 (unknown):
```unknown
import mii
mii.deploy(task="text-to-image",
model="CompVis/stable-diffusion-v1-4",
deployment_name="sd-deployment")
```
Example 2 (unknown):
```unknown
import mii
mii.deploy(task="text-to-image",
model="CompVis/stable-diffusion-v1-4",
deployment_name="sd-deployment",
deployment_type=DeploymentType.AML)
```
---
```
### references/other.md
```markdown
# Deepspeed - Other
**Pages:** 15
---
## Training Overview and Features
**URL:** https://www.deepspeed.ai/training/
**Contents:**
- Training Overview and Features
- Contents
- Overview
- Distributed, Effective, and Efficient Training with Ease
- Speed
- Memory efficiency
- Scalability
- Communication efficiency
- Data efficiency
- Supporting long sequence length
Training advanced deep learning models is challenging. Beyond model design, model scientists also need to set up the state-of-the-art training techniques such as distributed training, mixed precision, gradient accumulation, and checkpointing. Yet still, scientists may not achieve the desired system performance and convergence rate. Large model sizes are even more challenging: a large model easily runs out of memory with pure data parallelism and it is difficult to use model parallelism. DeepSpeed addresses these challenges to accelerate model development and training.
The DeepSpeed API is a lightweight wrapper on PyTorch. This means that you can use everything you love in PyTorch and without learning a new platform. In addition, DeepSpeed manages all of the boilerplate state-of-the-art training techniques, such as distributed training, mixed precision, gradient accumulation, and checkpoints so that you can focus on your model development. Most importantly, you can leverage the distinctive efficiency and effectiveness benefit of DeepSpeed to boost speed and scale with just a few lines of code changes to your PyTorch models.
DeepSpeed achieves high performance and fast convergence through a combination of efficiency optimizations on compute/communication/memory/IO and effectiveness optimizations on advanced hyperparameter tuning and optimizers. For example:
DeepSpeed trains BERT-large to parity in 44 mins using 1024 V100 GPUs (64 DGX-2 boxes) and in 2.4 hours using 256 GPUs (16 DGX-2 boxes).
BERT-large Training Times
BERT code and tutorials will be available soon.
DeepSpeed trains GPT2 (1.5 billion parameters) 3.75x faster than state-of-art, NVIDIA Megatron on Azure GPUs.
Read more: GPT tutorial
DeepSpeed provides memory-efficient data parallelism and enables training models without model parallelism. For example, DeepSpeed can train models with up to 13 billion parameters on a single GPU. In comparison, existing frameworks (e.g., PyTorch’s Distributed Data Parallel) run out of memory with 1.4 billion parameter models.
DeepSpeed reduces the training memory footprint through a novel solution called Zero Redundancy Optimizer (ZeRO). Unlike basic data parallelism where memory states are replicated across data-parallel processes, ZeRO partitions model states and gradients to save significant memory. Furthermore, it also reduces activation memory and fragmented memory. The current implementation (ZeRO-2) reduces memory by up to 8x relative to the state-of-art. You can read more about ZeRO in our paper, and in our blog posts related to ZeRO-1 and ZeRO-2.
With this impressive memory reduction, early adopters of DeepSpeed have already produced a language model (LM) with over 17B parameters called Turing-NLG, establishing a new SOTA in the LM category.
For model scientists with limited GPU resources, ZeRO-Offload leverages both CPU and GPU memory for training large models. Using a machine with a single GPU, our users can run models of up to 13 billion parameters without running out of memory, 10x bigger than the existing approaches, while obtaining competitive throughput. This feature democratizes multi-billion-parameter model training and opens the window for many deep learning practitioners to explore bigger and better models.
DeepSpeed supports efficient data parallelism, model parallelism, pipeline parallelism and their combinations, which we call 3D parallelism.
DeepSpeed can run large models more efficiently, up to 10x faster for models with various sizes spanning 1.5B to hundred billion. More specifically, the data parallelism powered by ZeRO is complementary and can be combined with different types of model parallelism. It allows DeepSpeed to fit models using lower degree of model parallelism and higher batch size, offering significant performance gains compared to using model parallelism alone.
Read more: ZeRO paper, and GPT tutorial.
The figure depicts system throughput improvements of DeepSpeed (combining ZeRO-powered data parallelism with model parallelism of NVIDIA Megatron-LM) over using Megatron-LM alone.
Pipeline parallelism of DeepSpeed reduce communication volume during distributed training, which allows users to train multi-billion-parameter models 2–7x faster on clusters with limited network bandwidth.
1-bit Adam, 0/1 Adam and 1-bit LAMB reduce communication volume by up to 26x while achieving similar convergence efficiency to Adam, allowing for scaling to different types of GPU clusters and networks. 1-bit Adam blog post, 1-bit Adam tutorial, 0/1 Adam tutorial, 1-bit LAMB tutorial.
DeepSpeed Data Efficiency Library provides efficient data sampling via curriculum learning and efficient data routing via random layerwise token dropping. The composed solution enables up to 2x data and 2x time saving during GPT-3/BERT pretraining and GPT/ViT finetuning, or further improve model quality under the same data/time. See more in the tutorial.
DeepSpeed offers sparse attention kernels—an instrumental technology to support long sequences of model inputs, whether for text, image, or sound. Compared with the classic dense Transformers, it powers an order-of-magnitude longer input sequence and obtains up to 6x faster execution with comparable accuracy. It also outperforms state-of-the-art sparse implementations with 1.5–3x faster execution. Furthermore, our sparse kernels support efficient execution of flexible sparse format and empower users to innovate on their custom sparse structures. Read more here.
DeepSpeed supports advanced hyperparameter tuning and large batch size optimizers such as LAMB. These improve the effectiveness of model training and reduce the number of samples required to convergence to desired accuracy.
Read more: Tuning tutorial.
Only a few lines of code changes are needed to enable a PyTorch model to use DeepSpeed and ZeRO. Compared to current model parallelism libraries, DeepSpeed does not require a code redesign or model refactoring. It also does not put limitations on model dimensions (such as number of attention heads, hidden sizes, and others), batch size, or any other training parameters. For models of up to 13 billion parameters, you can use ZeRO-powered data parallelism conveniently without requiring model parallelism, while in contrast, standard data parallelism will run out of memory for models with more than 1.4 billion parameters. In addition, DeepSpeed conveniently supports flexible combination of ZeRO-powered data parallelism with custom model parallelisms, such as tensor slicing of NVIDIA’s Megatron-LM.
Below we provide a brief feature list, see our detailed feature overview for descriptions and usage.
title: “Feature Overview” layout: single permalink: /features/ toc: true toc_label: “Contents” —
Enable 16-bit (FP16) training by in the deepspeed_config JSON.
Easily switch between single-GPU, single-node multi-GPU, or multi-node multi-GPU execution by specifying resources with a hostfile.
The script <client_entry.py> will execute on the resources specified in <hostfile>.
DeepSpeed provides pipeline parallelism for memory- and communication- efficient training. DeepSpeed supports a hybrid combination of data, model, and pipeline parallelism and has scaled to over one trillion parameters using 3D parallelism. Pipeline parallelism can also improve communication efficiency and has accelerated training by up to 7x on low-bandwidth clusters.
DeepSpeed supports all forms of model parallelism including tensor slicing based approaches such as the Megatron-LM. It does so by only requiring the model parallelism framework to provide a model parallelism unit (mpu) that implements a few bookkeeping functionalities:
DeepSpeed is fully compatible with Megatron. Please see the Megatron-LM tutorial for details.
The Zero Redundancy Optimizer (ZeRO) is at the heart of DeepSpeed and enables large model training at a scale that is simply not possible with model parallelism alone. When enabled, ZeRO allows training models with over 13 billion parameters without any model parallelism, and up to 200 billion parameter models with model parallelism on current generation hardware.
For more details see the ZeRO paper, GPT tutorial on integration with DeepSpeed.
Optimizer State and Gradient Partitioning in ZeRO reduces the memory consumption of the model states (optimizer states, gradients and parameters) by 8x compared to standard data parallelism by partitioning these states across data parallel process instead of replicating them.
Activation Partitioning is a memory optimization in ZeRO that can reduce the memory consumed by activations during model parallel training (MP). In MP certain activations maybe required by all MP processes, resulting in a replication of activations across MP GPUs. Activation Partitioning stores these activations in a partitioned state once they are used for computation in the forward propagation. These activations are allgathered right before they are needed again during the backward propagation. By storing activations in a partitioned state, ZeRO in DeepSpeed can reduce the activation memory footprint proportional to the MP degree.
CBO enables high network and memory throughput while restricting memory usage to a constant size. For memory- and network-bound operations such as normalization or allreduce collectives, the performance depends on the size of the operand. Simply fusing all operands into a single large operand can enable great throughput at the expense of unnecessary memory overhead. CBO in DeepSpeed fuses smaller operands into approximately a pre-defined sized buffer large enough to achieve great performance without the unnecessary memory overhead.
CMO reduces memory fragmentation during training, preventing out of memory errors due to lack of contiguous memory. Memory fragmentation is a result of interleaving between short lived and long lived memory objects. During the forward propagation activation checkpoints are long lived but the activations that recomputed are short lived. Similarly, during the backward computation, the activation gradients are short lived while the parameter gradients are long lived. CMO transfers activation checkpoints and parameter gradients to contiguous buffers preventing memory fragmentation.
ZeRO-Offload pushes the boundary of the maximum model size that can be trained efficiently using minimal GPU resources, by exploiting computational and memory resources on both GPUs and their host CPUs. It allows training up to 13-billion-parameter models on a single NVIDIA V100 GPU, 10x larger than the state-of-the-art, while retaining high training throughput of over 30 teraflops per GPU.
For more details see the ZeRO-Offload release blog, and tutorial on integration with DeepSpeed.
Gradient accumulation allows running larger batch size with limited memory by breaking an effective batch into several sequential micro-batches, and averaging the parameter gradients across these micro-batches. Furthermore, instead of averaging the gradients of each micro-batch across all GPUs, the gradients are averaged locally during each step of the sequence, and a single allreduce is done at the end of the sequence to produce the averaged gradients for the effective batch across all GPUs. This strategy significantly reduces the communication involved over the approach of averaging globally for each micro-batch, specially when the number of micro-batches per effective batch is large.
During back propagation, DeepSpeed can overlap the communication required for averaging parameter gradients that have already been computed with the ongoing gradient computation. This computation-communication overlap allows DeepSpeed to achieve higher throughput even at modest batch sizes.
The DeepSpeed core API consists of just a handful of methods:
DeepSpeed supports most of the features described in this document, via the use of these API, along with a deepspeed_config JSON file for enabling and disabling the features. Please see the core API doc for more details.
DeepSpeed’s Activation Checkpointing API supports activation checkpoint partitioning, cpu checkpointing, and contiguous memory optimizations, while also allowing layerwise profiling. Please see the core API doc for more details.
DeepSpeed handles gradient clipping under the hood based on the max gradient norm specified by the user. Please see the core API doc for more details.
DeepSpeed internally handles loss scaling for mixed precision training. The parameters for loss scaling can be specified in the deepspeed_config JSON file. Please see the core API doc for more details.
DeepSpeed has three communication-efficient optimizers called 1-bit Adam, 0/1 Adam and 1-bit LAMB. They offer the same convergence as Adam/LAMB, incur up to 26x less communication that enables up to 6.6x higher throughput for BERT-Large pretraining and up to 2.7x higher throughput for SQuAD fine-tuning on bandwidth-limited clusters. For more details on usage and performance, please refer to the 1-bit Adam tutorial, 1-bit Adam blog post, 0/1 Adam tutorial and 1-bit LAMB tutorial. For technical details, please refer to the 1-bit Adam paper, 0/1 Adam paper and 1-bit LAMB paper.
With DeepSpeed, the user can choose to use a high performance implementation of ADAM from NVIDIA, or any training optimizer that extends torch’s torch.optim.Optimizer class.
We introduce an efficient implementation of Adam optimizer on CPU that improves the parameter-update performance by nearly an order of magnitude. We use the AVX SIMD instructions on Intel-x86 architecture for the CPU-Adam implementation. We support both AVX-512 and AVX-2 instruction sets. DeepSpeed uses AVX-2 by default which can be switched to AVX-512 by setting the build flag, DS_BUILD_AVX512 to 1 when installing DeepSpeed. Using AVX-512, we observe 5.1x to 6.5x speedups considering the model-size between 1 to 10 billion parameters with respect to torch-adam.
Mixed precision training is handled by the DeepSpeed FP16 Optimizer. This optimizer not only handles FP16 training but is also highly efficient. The performance of weight update is primarily dominated by the memory bandwidth, and the achieved memory bandwidth is dependent on the size of the input operands. The FP16 Optimizer is designed to maximize the achievable memory bandwidth by merging all the parameters of the model into a single large buffer, and applying the weight updates in a single kernel, allowing it to achieve high memory bandwidth.
DeepSpeed makes it easy to train with large batch sizes by enabling the LAMB Optimizer. For more details on LAMB, see the LAMB paper.
DeepSpeed can train models with up to 13 billion parameters without model parallelism, and models with up to 200 billion parameters with 16-way model parallelism. This leap in model size is possible through the memory efficiency achieved via the ZeRO Optimizer. For more details see ZeRO paper .
DeepSpeed can simplify checkpointing for you regardless of whether you are using data parallel training, model parallel training, mixed-precision training, a mix of these three, or using the zero optimizer to enable larger model sizes. Please see the Getting Started guide and the core API doc for more details.
DeepSpeed supports multiple Learning Rate Schedules to enable faster convergence for large batch scaling.
Please refer to the Learning Rate Range Test tutorial.
Please refer to the 1Cycle Learning Rate Schedule tutorial.
DeepSpeed abstracts away data parallelism and model parallelism from the user when it comes to data loading. Users simply provide a PyTorch dataset, and DeepSpeed data loader can automatically handle batch creation appropriately.
Please refer to the Data Efficiency tutorial.
Please refer to the Curriculum Learning tutorial. Note that the Data Efficiency Library above provides more general curriculum learning support. This legacy curriculum learning feature is still supported but we recommend to use the Data Efficiency Library.
DeepSpeed provides a set of tools for performance analysis and debugging.
DeepSpeed provides a detailed breakdown of the time spent in different parts of the training. This can be enabled by setting the following in the deepspeed_config file.
When activation checkpointing is enabled, profiling the forward and backward time of each checkpoint function can be enabled in the deepspeed_config file.
The DeepSpeed flops profiler measures the time, flops and parameters of a PyTorch model and shows which modules or layers are the bottleneck. When used with the DeepSpeed runtime, the flops profiler can be configured in the deepspeed_config file as follows:
The flops profiler can also be used as a standalone package. Please refer to the Flops Profiler tutorial for more details.
The DeepSpeed Autotuner uses model information, system information, and heuristics to efficiently tune Zero stage, micro batch size, and other Zero configurations. Using the autotuning feature requires no code change from DeepSpeed users. While "autotuning": {"enabled": true} is the minimal required to enable autotuning, there are other parameters users can define to configure the autotuning process. Below shows major parameters and their default values in the autotuning configuration. Please refer to the Autotuning tutorial for more details.
The flops profiler can also be used as a standalone package. Please refer to the Flops Profiler tutorial for more details.
The DeepSpeed Monitor logs live training metrics to one or more monitoring backends, including PyTorch’s TensorBoard, WandB, or simply to CSV files. The Monitor can be configured with one or more backends in the deepspeed_config file as follows:
The Monitor can also be added to log custom metrics and client codes. Please refer to the Monitor tutorial for more details.
DeepSpeed provides logging of all communication operations launched within deepspeed.comm. The communication logger can be configured in the deepspeed_config file as follows:
Client codes can then print a summary with a call to deepspeed.comm.log_summary(). For more details and example usage, see the Communication Logging tutorial.
DeepSpeed offers sparse attention to support long sequences. Please refer to the Sparse Attention tutorial.
To learn more about training Mixture of Experts (MoE) models with DeepSpeed, see our tutorial for more details.
**Examples:**
Example 1 (unknown):
```unknown
"fp16": {
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 1000,
"hysteresis": 2,
"consecutive_hysteresis": false,
"min_loss_scale": 1
}
```
Example 2 (unknown):
```unknown
deepspeed --hostfile=<hostfile> \
<client_entry.py> <client args> \
--deepspeed --deepspeed_config ds_config.json
```
Example 3 (unknown):
```unknown
mpu.get_model_parallel_rank()
mpu.get_model_parallel_group()
mpu.get_model_parallel_world_size()
mpu.get_data_parallel_rank()
mpu.get_data_parallel_group()
mpu.get_data_parallel_world_size()
```
Example 4 (unknown):
```unknown
{
"gradient_clipping": 1.0
}
```
---
## Latest News
**URL:** https://www.deepspeed.ai/
**Contents:**
- Latest News
- Contents
- Extreme Speed and Scale for DL Training
- DeepSpeed Adoption
- Contributing
- Contributor License Agreement
- Code of Conduct
- Publications
- Videos
[2025/10] SuperOffload: Unleashing the Power of Large-Scale LLM Training on Superchips
[2025/10] Study of ZenFlow and ZeRO offload performance with DeepSpeed CPU core binding
[2025/08] ZenFlow: Stall-Free Offloading Engine for LLM Training
[2025/06] Arctic Long Sequence Training (ALST) with DeepSpeed: Scalable And Efficient Training For Multi-Million Token Sequences
[2025/06] DeepNVMe: Affordable I/O scaling for Deep Learning Applications
DeepSpeed enabled the world’s most powerful language models (at the time of this writing) such as MT-530B and BLOOM. DeepSpeed offers a confluence of system innovations, that has made large scale DL training effective, and efficient, greatly improved ease of use, and redefined the DL training landscape in terms of scale that is possible. These innovations include ZeRO, 3D-Parallelism, DeepSpeed-MoE, ZeRO-Infinity, etc.
DeepSpeed has been used to train many different large-scale models. Below is a list of several examples that we are aware of (if you’d like to include your model please submit a PR):
DeepSpeed has been integrated with several different popular open-source DL frameworks such as:
DeepSpeed is an integral part of Microsoft’s AI at Scale initiative to enable next-generation AI capabilities at scale.
DeepSpeed welcomes your contributions! Please see our contributing guide for more details on formatting, testing, etc.
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.
Xinyu Lian, Sam Ade Jacobs, Lev Kurilenko, Masahiro Tanaka, Stas Bekman, Olatunji Ruwase, Minjia Zhang. (2024) Universal Checkpointing: Efficient and Flexible Checkpointing for Large Scale Distributed Training arXiv:2406.18820
---
## Supporting efficient large model training on AMD Instinct GPUs with DeepSpeed
**URL:** https://www.deepspeed.ai/2022/03/20/amd-support.html
**Contents:**
- Supporting efficient large model training on AMD Instinct GPUs with DeepSpeed
- Contents
Updated: March 20, 2022
---
## DeepSpeed Configuration JSON
**URL:** https://www.deepspeed.ai/docs/config-json/
**Contents:**
- DeepSpeed Configuration JSON
- Contents
- Batch Size Related Parameters
- Optimizer Parameters
- Scheduler Parameters
- Communication options
- FP16 training options
- BFLOAT16 training options
- Automatic mixed precision (AMP) training options
- Gradient Clipping
Note: train_batch_size must be equal to train_micro_batch_size_per_gpu * gradient_accumulation_steps * number of GPUs. For simplicity, you can choose to only specify two of the three parameters, the last one will be inferred automatically by DeepSpeed.
train_batch_size: [integer]
train_micro_batch_size_per_gpu: [integer]
gradient_accumulation_steps: [integer]
optimizer: [dictionary]
Example of optimizer with Adam
The Adam optimizer also supports the following two params keys/values in addition to the standard parameters from torch.optim.Adam:
Another example of optimizer with 1-bit Adam specific parameters is as follows.
The 1-bit Adam optimizer supports the following three params keys/values in addition to the standard Adam (learn more in our tutorial):
A variant optimizer for 1-bit Adam is 0/1 Adam, which further optimizes 1-bit Adam via adaptive variance freezing and 1-bit synchronization over optimizer states.
0/1 Adam supports the following params key/values in addition to standard Adam (learn more in our tutorial.)
Another example of optimizer with 1-bit LAMB
The 1-bit LAMB optimizer supports the following params keys/values in addition to the standard LAMB (learn more in our tutorial):
DeepSpeed calls the step() method of the scheduler at every training step when model_engine.step() is executed.
scheduler: [dictionary]
communication_data_type: [string]
prescale_gradients: [boolean]
gradient_predivide_factor: [float]
sparse_gradients: [boolean]
Note: this mode cannot be combined with the amp mode described below.
fp16:enabled: [boolean]
fp16:auto_cast: [boolean]
fp16:loss_scale: [float]
fp16:initial_scale_power: [integer]
fp16:loss_scale_window: [integer]
fp16:hysteresis: [integer]
fp16:consecutive_hysteresis: [boolean]
fp16:min_loss_scale: [integer]
Note: this mode cannot be combined with the amp mode described below.
Note: this mode cannot be combined with the fp16 mode described above.
bf16:enabled: [boolean]
Note: this mode cannot be combined with the fp16 mode described above. In addition this mode is not currently compatible with ZeRO.
amp:enabled: [boolean]
amp params: [various]
gradient_clipping: [float]
Enabling and configuring ZeRO memory optimizations
zero_optimization: [dictionary]
allgather_partitions: [boolean]
allgather_bucket_size: [integer]
overlap_comm: [boolean]
reduce_scatter: [boolean]
reduce_bucket_size: [integer]
contiguous_gradients: [boolean]
load_from_fp32_weights: [boolean]
grad_hooks: [boolean]
round_robin_gradients: [boolean]
offload_param: [dictionary]
offload_optimizer: [dictionary]
stage3_max_live_parameters: [integer]
stage3_max_reuse_distance: [integer]
stage3_prefetch_bucket_size: [integer]
stage3_param_persistence_threshold: [integer]
stage3_gather_16bit_weights_on_model_save: [boolean]
stage3_module_granularity_threshold: [integer] | Description | Default | |——————————————————————————————————————————————————————————————————————————————————————————–| ——- | | The granularity of a module is determined by the ratio of parameter_count / (1 + descendant_count). ZeRO3 classifies modules with a granularity below the threshold as fine-grained, treating them as integral units during parameter fetching. This reduces host and communication overhead from separate hooks. | 0 |
zero_hpz_partition_size: [integer]
zero_quantized_weights: [boolean]
zero_quantized_gradients: [boolean]
log_trace_cache_warnings: [boolean]
cpu_offload: [boolean]
Deprecated: cpu_offload is deprecated and will be removed in future, please use offload_optimizer instead.
Enabling and configuring ZeRO optimization of parameter offloading to CPU/NVMe. Available only with ZeRO stage 3. Note that if the value of “device” is not specified or not supported, an assertion will be triggered.
pin_memory: [boolean]
buffer_count: [integer]
buffer_size: [integer]
max_in_cpu: [integer]
Enabling and configuring ZeRO optimization of offloading optimizer computation to CPU and state to CPU/NVMe. CPU offloading is available with ZeRO stage 1, 2, 3. NVMe offloading is available only with ZeRO stage 3. Note that if the value of “device” is not specified or not supported, an assertion will be triggered.
pin_memory: [boolean]
buffer_count: [integer]
Configuring the asynchronous I/O module for offloading parameter and optimizer states to persistent (NVMe) storage. This module uses Linux native asynchronous I/O (libaio).
block_size: [integer]
queue_depth: [integer]
thread_count: [integer]
single_submit: [boolean]
overlap_events: [boolean]
ignore_unused_parameters: [boolean]
steps_per_print: [integer]
wall_clock_breakdown: [boolean]
dump_state: [boolean]
results_dir: [string]
start_profile_step: [integer]
end_profile_step: [integer]
max_train_batch_size: [int]
num_tuning_micro_batch_sizes: [integer]
tuner_early_stopping: [integer]
tuner_num_trials: [integer]
profile_step: [integer]
module_depth: [integer]
top_modules: [integer]
output_file: [string]
partition_activations: [boolean]
cpu_checkpointing: [boolean]
contiguous_memory_optimization: [boolean]
number_checkpoints: [integer]
synchronize_checkpoint_boundary: [boolean]
sparse_attention: [dictionary]
Example of sparse_attention
DeepSpeed Data Efficiency Library includes two techniques: curriculum learning and random layerwise token dropping (random-LTD). Read more about how to use the DeepSpeed Data Efficiency Library in our tutorial.
data_efficiency: [dictionary]
data_routing: [dictionary]
data_sampling: [dictionary]
random_ltd: [dictionary]
curriculum_learning: [dictionary]
Note: On 12/12/2022, we released DeepSpeed Data Efficiency Library which provides a more general curriculum learning support. This legacy curriculum learning feature below is still supported but we recommend to use the Data Efficiency Library.
curriculum_type: [string]
min_difficulty: [integer]
max_difficulty: [integer]
schedule_type: [string]
total_curriculum_step: [integer]
difficulty_step: [integer]
root_degree: [integer]
difficulty: [list of integer]
max_step: [list of integer]
Note: Deepspeed logs to TensorBoard through PyTorch. Logging to TensorBoard requires that the tensorboard package is installed (read more in the PyTorch documentation).
Note: Logging to WandB requires that the wandb package is installed (read more in the WandB documentation).
Note: Logging to Comet requires that the comet_ml package is installed (read more in the Comet documentation).
Deepspeed’s Monitor module can log training details into a Tensorboard-compatible file, to WandB, to Comet or to simple CSV files. Below is an overview of what DeepSpeed will log automatically.
tensorboard: [dictionary]
Example of tensorboard configuration:
Example of wandb configuration:
Example of comet configuration:
csv_monitor: [dictionary]
Example of csv_monitor configuration:
DeepSpeed provides a flexible communication logging tool which can automatically detect and record communication operations launched via deepspeed.comm. NOTE: All logging communication calls are synchronized in order to provide accurate timing information. This may hamper performance if your model heavily uses asynchronous communication operations.
Once the logs are populated, they can be summarized with deepspeed.comm.log_summary(). For more detail and example usage, see the tutorial
comms_logger: [dictionary]
Example of recommended comms_logger configuration:
Example of comms_logger configuration for logging specific operations only:
Note: Compression has seven different components, including layer reduction, weight quantization, activation quantization, sparse pruning, row pruning, head pruning, and channel pruning. We explain them one by one with simple json examples. Read more about how to use the DeepSpeed Compression library in our tutorial.
Note: Layer reduction works much better when using knowledage distillation (learn more in our tutorial):
layer_reduction: [dictionary]
shared_parameters: [dictionary]
Shared parameters for all weight quantization groups.
different_groups: [dictionary]
Different quantization sets, this is used for different quantization parameters. In this example, we give two different sets. In practice, you can choose the number of sets based on your requirements.
shared_parameters: [dictionary]
Shared parameters for all activation quantization groups.
different_groups: [dictionary]
Different quantization sets, this is used for different quantization parameters. In this example, we give one set. In practice, you can choose the number of sets based on your requirements.
shared_parameters: [dictionary]
Shared parameters for all sparse pruning groups.
different_groups: [dictionary]
Different pruning sets, this is used for different pruning parameters. In this example, we give one set. In practice, you can choose the number of sets based on your requirements. Note for snip_momentum method, you can leave it as empty.
Note: Row Pruning is a feature designed for two back-to-back linear layers (e.g., Feed Forward Network in Transformers). As such, we suggested use row pruning for the first linear layer (i.e., the intermediate.dense layer for BERT). Reducing the row dimension of this matrix can help reducing the column of the follow-up matrix (i.e., layer.\\w+.output.dense layer for BERT). It should also work for other linear layers as well.
shared_parameters: [dictionary]
Shared parameters for all row pruning groups.
different_groups: [dictionary]
Different pruning sets, this is used for different pruning parameters. In this example, we give one set. In practice, you can choose the number of sets based on your requirements.
Note: Head Pruning is a feature designed for two attention layers (e.g., Multi Head Attention in Transformers). For now, it can only be applied to output matrix of the Transformer (i.e., attention.output.dense in BERT). Pruning the output matrix can lead to the pruning of Query/Key/Value matrix as well.
shared_parameters: [dictionary]
Shared parameters for all head pruning groups.
different_groups: [dictionary]
Different pruning sets, this is used for different pruning parameters. In this example, we give one set. In practice, you can choose the number of sets based on your requirements.
Note: Channel Pruning is a feature designed for two back-to-back CONV2d layers (e.g., residual connection in ResNet). As such, we suggested use channel pruning for the first CONV2d layer. Reducing the number of output channels of this layer can help reducing the number of input channels the follow-up layer. It should also work for other CONV2d layers as well.
shared_parameters: [dictionary]
Shared parameters for all channel pruning groups.
different_groups: [dictionary]
Different pruning sets, this is used for different pruning parameters. In this example, we give one set. In practice, you can choose the number of sets based on your requirements.
load_universal: [boolean]
use_node_local_storage: [boolean]
pipeline_stage: [boolean]
**Examples:**
Example 1 (unknown):
```unknown
"optimizer": {
"type": "Adam",
"params": {
"lr": 0.001,
"betas": [
0.8,
0.999
],
"eps": 1e-8,
"weight_decay": 3e-7
}
}
```
Example 2 (unknown):
```unknown
"optimizer": {
"type": "OneBitAdam",
"params": {
"lr": 0.001,
"betas": [
0.8,
0.999
],
"eps": 1e-8,
"weight_decay": 3e-7,
"freeze_step": 400,
"cuda_aware": false,
"comm_backend_name": "nccl"
}
}
```
Example 3 (unknown):
```unknown
"optimizer": {
"type": "ZeroOneAdam",
"params": {
"lr": 1e-3,
"weight_decay": 0.01,
"bias_correction": false,
"var_freeze_step": 1000,
"var_update_scaler": 16,
"local_step_scaler": 1000,
"local_step_clipper": 16,
"cuda_aware": false,
"comm_backend_name": "nccl"
}
}
```
Example 4 (unknown):
```unknown
"optimizer": {
"type": "OneBitLamb",
"params": {
"lr": 11e-3,
"weight_decay": 0.01,
"bias_correction": false,
"max_coeff": 0.3,
"min_coeff": 0.01,
"freeze_step": 1000,
"cuda_aware": false,
"comm_backend_name": "nccl",
"coeff_beta": 0.9,
"factor_max": 4.0,
"factor_min": 0.5,
"factor_threshold": 0.1
}
}
```
---
## DeepSpeed ZeRO-3 Offload
**URL:** https://www.deepspeed.ai/2021/03/07/zero3-offload.html
**Contents:**
- DeepSpeed ZeRO-3 Offload
- Contents
- Overview of ZeRO family of technology
- ZeRO-3 Offload
- Unprecedented model scale
- Ease of supporting very large models
- Excellent training efficiency
- How to use ZeRO-3 Offload
Today we are announcing the release of ZeRO-3 Offload, a highly efficient and easy to use implementation of ZeRO Stage 3 and ZeRO Offload combined, geared towards our continued goal of democratizing AI by making efficient large-scale DL training available to everyone. The key benefits of ZeRO-3 Offload are:
The ZeRO Redundancy Optimizer (abbreviated ZeRO) is a family of memory optimization technologies for large-scale distributed deep learning. Unlike data parallelism (that is efficient but can only support a limited model size) or model parallelism (that can support larger model sizes but requires significant code refactoring while adding communication overhead that limits efficiency), ZeRO allows fitting larger models in memory without requiring code refactoring while remaining very efficient. ZeRO does so by eliminating the memory redundancy that is inherent in data parallelism while limiting the communication overhead to a minimum. ZeRO removes the memory redundancies across data-parallel processes by partitioning the three model states (optimizer states, gradients, and parameters) across data-parallel processes instead of replicating them. By doing this, it boosts memory efficiency compared to classic data-parallelism while retaining its computational granularity and communication efficiency. There are three stages in ZeRO corresponding to three model states, as shown in the Figure 1: the first stage (ZeRO-1) partitions only the optimizer states, the second stage (ZeRO-2) partitions both the optimizer states and the gradients and the final stage (ZeRO-3) partitions all three model states (for more details see the ZeRO paper).
Figure 1. Overview of ZeRO memory savings
In addition to these three stages, ZeRO family of technology also consists of ZeRO-2 Offload. ZeRO-2 Offload is a heterogeneous DL training technology that works in conjunction with ZeRO-2 to offload partitioned optimizer states and gradients to CPU memory. ZeRO-2 Offload offers the full memory advantage of ZeRO-2 even on a single GPU, while at the same time offering great scalability of ZeRO-2 on multi-GPU setup. DeepSpeed library has been offering ZeRO-2 Offload since Sept 2020. For details, please see below:
With today’s release of ZeRO-3 Offload, we are adding support for partitioning and offloading parameters in addition to optimizer states and gradients partitioning already supported by ZeRO-2 Offload in DeepSpeed. With parameter partitioning ZeRO-3 Offload implements the full set of features in the three stages of ZeRO, that allows for a linear growth in model size with the number of GPUs. In addition, ZeRO-3 Offload can also optionally offload all these model states to CPU to further reduce GPU memory consumption, leveraging both CPU and GPU to maximize memory and compute efficiency of the entire system.
We believe ZeRO-3 Offload offers a massive leap for large model training, in three regards:
i) Unprecedented model scale,
ii) Ease of supporting very-large models, and
iii) Achieving excellent training efficiency.
Unlike ZeRO-2 and ZeRO-Offload where the parameters have to fit in the memory of a single GPU, ZeRO-3 Offload can partition the parameters across GPUs, and offload them to CPU, supporting model sizes that are much larger than the memory on a single GPU. Furthermore, ZeRO-3 Offload goes beyond the state-of-the-art hybrid 3D-parallelism (data, model and pipeline parallelism combined). While 3D Parallelism is limited by the aggregate GPU memory, ZeRO-3 Offload can exploit both GPU and CPU memory, the latter of which is much larger and cheaper compared to GPU memory. This allows ZeRO-3 Offload to train larger model sizes with the given GPU and CPU resources than any other currently available technology.
Model Scale on Single GPU: ZeRO-3 Offload can train models with over 40B parameters efficiently on a single GPU (e.g., 32GB V100 GPU + 1.5TB CPU memory). This is 3x larger than what is possible with ZeRO-2 Offload, the current state-of-the art.
Model Scale on Multi-GPUs: With ZeRO-3 Offload you can train a trillion and two trillion parameter models on NVIDIA 32GB V100 DGX-2 cluster with 256 GPUs and 512 GPUs, respectively. In contrast, the state-of-art 3D Parallelism requires 800 GPUs, and 1600 GPUs, respectively, to fit the same sized models. This represents a 3x reduction in GPUs required to fit models with over a trillion parameters.
From a system perspective, training models with hundreds of billions and trillions of parameters is extremely challenging. Data parallelism cannot scale the model size much further beyond a billion parameters, model parallelism (with tensor slicing) cannot be used to scale model size efficiently beyond a single node boundary due to massive communication overheads, and pipeline parallelism cannot scale beyond the number of layers available in a model, which limits both the model size and the number of GPUs that it can scale to.
The only existing parallel technology available that can scale to over a trillion parameters on massively parallel GPU clusters is the 3D parallelism that combines data, model and pipeline parallelism in complex ways. While such a system can be very efficient, it requires major model code refactoring from data scientists to split the model into load balanced pipeline stages. This also makes 3D parallelism inflexible in the type of models that it can support, since models with complex dependency graphs cannot be easily converted into a load balanced pipeline.
ZeRO-3 Offload address these challenges in two ways:
i) With ground-breaking memory efficiency, ZeRO-3 and ZeRO-3 Offload are the only DL parallel technology that can efficiently scale to over a trillion parameters by itself, without requiring a hybrid parallelism strategy, greatly simplifying the system stack for DL training.
ii) ZeRO-3 Offload requires virtually no model refactoring from model scientists, liberating data scientists to scale up complex models to hundreds of billions to trillions of parameters.
High-performance per-GPU throughput on multiple nodes: ZeRO-3 Offload offers excellent training efficiency for multi-billion and trillion parameter models on multiple nodes. It achieves a sustained throughput of up to 50 Tflops per GPU running on 32 DGX2 nodes comprising 512 NVIDIA V100 GPUs (see Figure 2). In comparison, the standard data parallel training with PyTorch can only achieve 30 TFlops per GPU for a 1.2B parameter model, the largest model that can be trained using data parallelism alone.
Figure 2. ZeRO-3 Offload: Multi-billion and trillion parameter model throughput on 512 V100 GPUs
ZeRO-3 Offload obtains high efficiency despite the 50% communication overhead of ZeRO Stage 3 compared to standard data parallel training for a fixed batch size. This is made possible through a communication overlap centric design and implementation, which allows ZeRO-3 Offload to hide nearly all of the communication volume with computation, while taking advantage of a larger batch size for improved efficiency resulting from better GPU memory efficiency.
Efficient multi-billion parameter model training on a single GPU: ZeRO-3 Offload further democratizes AI by enabling efficient training of multi-billion parameter models on a single GPU. For single GPU training, ZeRO-3 Offload provides benefits over ZeRO-2 Offload along two dimensions. First, ZeRO-3 Offload increases the size of models trainable on a single V100 from 13B to 40B. Second, for ZeRO-3 Offload provides speedups (e.g., 2.3X for 13B) compared to ZeRO-2 Offload for model sizes trainable by both solutions. These results are summarized in Figure 3.
Figure 3. Multi-billion parameter model training on one V100 GPU
Super-Linear scalability across GPUs: Additionally, ZeRO-3 Offload also preserves the super-linear scalability characteristics that we have demonstrated with all our previous ZeRO technologies (ZeRO Stage 1, ZeRO Stage 2 and ZeRO Offload). ZeRO-3 Offload can exploit the aggregate PCI-E bandwidth between GPU and CPU across all the GPUs in multi-GPU training configuration, and at the same time, it can also exploit the aggregate CPU compute across all the nodes. As a result, the CPU-GPU-CPU communication time as well as the optimizer update time decreases linearly with number of GPUs and nodes, respectively, allowing ZeRO-3 Offload to exhibit super-linear scaling (see Figure 4).
Figure 4. ZeRO-3 Offload Superlinear Scalability for a 200B parameter model.
As with many other existing DeepSpeed features, once the user model has been converted to use DeepSpeed, enabling ZeRO-3 Offload is as easy as turning on a couple of flags in DeepSpeed Config file. Supporting advanced features like weight sharing, or enabling extremely large models that requires to be partitioned across GPUs/nodes to fit in GPU/CPU memory, can be done with just a couple of additional lines of code change using the ZeRO-3 Offload API.
If you are already a DeepSpeed user, you can find our detailed tutorial on ZeRO-3 Offload below. If you are new to DeepSpeed, we recommend that you start at the getting started page before trying out our ZeRO-3 Offload Tutorial.
DeepSpeed: Getting Started Page
ZeRO-3 Offload Documentation, Tutorial
The DeepSpeed Team is very excited to share ZeRO-3 Offload with the DL community.
Updated: March 7, 2021
---
## DeepSpeed: Advancing MoE inference and training to power next-generation AI scale
**URL:** https://www.deepspeed.ai/2022/01/18/moe-inference.html
**Contents:**
- DeepSpeed: Advancing MoE inference and training to power next-generation AI scale
- Contents
Updated: January 18, 2022
---
## Azure empowers easy-to-use, high-performance, and hyperscale model training using DeepSpeed
**URL:** https://www.deepspeed.ai/2022/07/25/deepspeed-azure.html
**Contents:**
- Azure empowers easy-to-use, high-performance, and hyperscale model training using DeepSpeed
- Contents
- Introduction
- Making distributed training faster and easier on Azure using DeepSpeed
- Key Performance Benefits
- Experimental Setup
- Hardware (Azure instances)
- Training setup using AzureML
- Training setup using Azure VMSS
- Performance Evaluation on Various Model Configurations
Large-scale transformer-based deep learning models trained on large amounts of data have shown great results in recent years in several cognitive tasks and are behind new products and features that augment human capabilities. These models have grown several orders of magnitude in size during the last five years. Starting from a few million parameters of the original transformer model all the way to the latest 530 billion-parameter Megatron-Turing model as shown in Figure 1. There is a growing need for customers to train and fine tune large models at an unprecedented scale.
Figure 1: Landscape of large models and hardware capabilities
To train these models, users needed to set up and maintain a complex distributed training infrastructure that usually required several manual and error-prone steps. These lead to a subpar experience both in terms of usability and performance. We recently announced how we are making great strides to simplify this and enable easy-to-use and high-performance training at 1K+ GPU scale on Azure.
In this extended post, we share the details of how DeepSpeed users can train trillion-parameter models with a new easy-to-use, streamlined, scalable, and high-performance distributed training experience on Azure. We also share details of the experimental setup, model configurations, additional performance trends, and guide our users on how to run these experiments in their own environments.
We compare the existing manual and error-prone workflow with our proposed easy-to-use workflow for DeepSpeed on Azure in Figure 2. Customers can now use easy-to-use training pipelines to launch training jobs at scale. The new workflow reduces the number of steps from 11 to just 1 if users rely on the recommended AzureML recipes.
Figure 2: An easy-to-use and streamlined distributed training experience with DeepSpeed on Azure
For users who have custom environments built using Azure VMs or Azure VMSS, only two steps are needed:
We already shared a summary of our key performance results in the Azure announcement. We enable the capability to train 2x larger model sizes (2 trillion vs. 1 trillion parameters), scale to 2x more GPUs (1024 vs. 512), and offer up to 1.8x higher compute throughput/GPU (150 TFLOPs vs. 81 TFLOPs) compared to other cloud providers.
DeepSpeed on Azure offers near-linear scalability both in terms of increase in model size as well as increase in number of GPUs. As shown in Figure 3a, together with the DeepSpeed ZeRO-3, its novel CPU offloading capabilities, and a high-performance Azure stack powered by InfiniBand interconnects and A100 GPUs, we were able to maintain an efficient throughput/GPU (>157 TFLOPs) in a near-linear fashion as the model size increases from 175 billion parameters to 2 trillion parameters. On the other hand, for a given model size, e.g., 175B, we achieve near-linear scaling as we increase the number of GPUs from 128 all the way to 1024 as shown in Figure 3b. The key takeaway is that Azure and DeepSpeed together are breaking the GPU memory wall and enabling our customers to easily and efficiently train trillion-parameter models at scale.
Figure 3: (a) Near-perfect throughput/GPU as we increase the model size from 175 billion to 2 trillion parameters (BS/GPU=8). (b) Near-perfect performance scaling with the increase in number of GPU devices for the 175B model (BS/GPU=16). The sequence length is 1024 for both cases.
We share the details of our experimental setup and some of the best practices we followed. The users can either directly use them to reproduce our results or modify them to fit their own setup in terms of model scale as well as the scale of Azure hardware being provisioned.
We used NDm A100 v4-series instances in our experiments. Each instance includes two socket AMD EPYC 7V12 64-Core CPUs, 1.7TB main memory and eight A100 80GB GPUs. The system has a balanced PCIe topology connecting 4 GPU devices to each CPU socket. Each GPU within the VM is provided with its own dedicated, topology-agnostic 200 Gb/s NVIDIA Mellanox HDR InfiniBand connection providing an accelerated 200 Gbps high speed fabric. The DeepSpeed library exploits offload capabilities where the activation and optimizer states are allocated in the main memory. Hence, 1.7TB memory capacity per node helps us to scale to large model sizes.
Users can directly use the AzureML studio and use our published recipes to run experiments without any additional setup. This is the easiest and recommended way of running experiments on Azure.
Existing VMSS customers and others who have custom Azure VM based environments can follow the setup as follows. The scripts to make these steps easy will be released in the coming weeks. A cluster is created using Azure Virtual Machine Scale Sets (VMSS) to provision the desired number of compute nodes running the new Azure HPAI VM image specialized for extreme-scale deep learning applications using the software stack listed in Table 1.
Table 1: Detailed version information of the software packages in the Azure HPC VM image
Users can create a VMSS with up to 600 VM instances enabling up to 4,800 A100 GPUs. In addition to the VMSS for the compute nodes, we provision a distinct login node using an inexpensive D4s v4 (or similar) instance with 4-core Intel VCPU, running the same image, for compiling, launching, and monitoring jobs. The login node, compute nodes, and a shared storage filesystem are grouped within an Azure Virtual Network (vnet) allowing VMs to connect to each other over SSH and to shared NFS volume shown in Figure 4.
Figure 4: Organization of our VMSS-based experimental setup
We ran our experiments with four different model sizes – 175B, 530B, 1T, and 2T – using the configurations shown in Table 2.
Table 2: Model configuration
For each of these configurations, we report peak throughput of the system using TFLOPs/GPU as the main performance metric. To calculate TFLOPs, we use the formula used by the Megatron paper as shown below.
FLOPs/GPU = 96 * B * s * l * h2 * (1 + s/6h + V/(16*l*h))
B is batch size, s is sequence length, l is the number of layers, h is hidden size, and V is vocabulary size.
Figures 5a and 5b show the results of 175B model with sequence length 512 and 1024, respectively. We only scale to 512 GPUs for seq-length 512 as adding more GPUs shows similar performance. On the other hand, with sequence length 1024, we saw linear performance increase to 1024 GPUs. Overall, the peak throughput of 204.49 TFLOPs/GPU was achieved on 256 GPUs with a micro batch size of 32 and sequence length of 512.
Figure 5: Performance characteristics of 175B model on 512 and 1K GPUs respectively. The colored columns signify different micro batch sizes.
Next, we report the 530B model scaling. Previous results on the 530B MT-NLG model using DeepSpeed and Megatron-LM on 280 DGX A100 servers on the Selene supercomputer showed the peak throughput of 126 TFLOPS/GPU. However, we were able to surpass that throughput and achieved up to 171.37 TFLOPs/GPU on 128 NDm A100 v4-series A100 systems (i.e., 1024 GPUs) as shown in Figure 6.
The benefit of this 530B model is its simpler parallelization configuration as there is no tensor/pipeline parallelism. With ZeRO powered data parallelism, there are fewer heuristics required to optimally configure the distributed model. In addition, the consistent steady state performance of more than 140 TFLOPs/GPU for micro batch sizes >1 demonstrates a robust software and hardware platform.
Figure 6: Throughput achieved with a 530B parameter model on 512 and 1024 GPUs for micro-batch sizes per GPU of 1, 2, 4, and 8, with sequence length 1,024.
The 1T parameter model contains 128 layers with 160 attention heads. Training such an extreme-scale model is not an easy task. Figure 7 shows the throughput achieved for each of the model configurations we explored on 512 and 1024 GPUs. Peak throughput achieved was 165.36 TFLOPs/GPU for micro batch size of 8 across 1024 GPUs and the model reached steady state performance within the first 3-4 iterations.
Figure 7: Performance characteristics of 1T parameter model on 512 and 1024 GPUs with 1, 2, 4, and 8 micro batch sizes, with sequence length 1,024.
The 2T parameter model consists of 160 layers, 32k hidden dimension, and 128 attention heads. Given the large size of the model and the significant time required on 1024 GPUs, we limited our benchmark runs for the 2T model to a batch size of 8 per GPU with a sequence length of 1024. We were able to achieve 157 TFLOPs/GPU on 1,024 GPUs.
We recognize that DeepSpeed users are diverse and have different environments. In this tutorial, our focus is on making things simpler for users who plan to run large model training experiments on Azure.
The easiest way to do model training on Azure is via the Azure ML recipes. The job submission and data preparation scripts have been made available here. Users simply need to setup their Azure ML workspace following the guide and submit experiment using the aml_submit.py file.
Some users have customized environments built on top of Azure VMs and VMSS based clusters. To simplify training on such setups, we are working on an easy-to-use cluster setup script that will be published in the next few weeks. If you already have a cluster setup running, you can use the azure recipes for the 175B and the 1T model. The recipes can easily be modified to train other model configurations.
This blog post was written by the DeepSpeed team in collaboration with the AzureML and the AzureHPC team. We would like to acknowledge several individuals who made this work possible:
Updated: July 25, 2022
---
## DeepSpeed Data Efficiency: A composable library that makes better use of data, increases training efficiency, and improves model quality
**URL:** https://www.deepspeed.ai/2022/12/11/data-efficiency.html
**Contents:**
- DeepSpeed Data Efficiency: A composable library that makes better use of data, increases training efficiency, and improves model quality
- Contents
- Efficient Data Sampling via Curriculum Learning
- Motivation
- Design
- Evaluation Results
- Efficient Data Routing via Random Layerwise Token Dropping
- Motivation
- Design
- Evaluation Results
Recently, large-scale deep learning models are empowering us to achieve more in many ways, such as improving programming efficiency by code generation and providing art inspiration by text-to-image generation. To enable these services and keep improving the quality, deep learning model architecture evolves rapidly, and the model size is also growing at a tremendous speed. For example, from GPT to GPT-3 the model size increased 1500x in 2 years. The increasing model size leads to unprecedented training cost, making it challenging for many AI practitioners to train their own models. On the other hand, a less-emphasized perspective is that data scale is actually increasing at a similar speed as model scale, and the training cost is proportional to both of them. In Figure 1 below we plot the model and data scales of several representative language models in the last 5 years. From the oldest model on the left to the newest models on the right, both the model and data scales increase at similar speed. This demonstrates the importance of improving data efficiency: achieve same model quality with less data and reduced training cost, or achieve better model quality with the same amount of data and similar training cost.
Figure 1: Model scale (number of parameters) and data scale (number of tokens consumed during training) of representative language models in the last 5 years.
There are two popular research directions among existing data efficiency techniques: Data sampling techniques aim to improve the convergence speed by sampling the most suitable next data batch from the whole data pool; Data routing techniques aim to reduce the computation by routing each data to only a subset of the model components. These techniques improve data and training efficiency, but existing solutions on them have limitations on extensibility, flexibility, and composability. They are commonly designed for specific training tasks, making them hard to be extended with customized strategies and making them less flexible to be applied on diverse workloads from different users. Furthermore, different techniques are implemented separately, making it challenging to compose multiple solutions to further improve data and training efficiency.
To address these challenges, we, the DeepSpeed team as part of Microsoft’s AI at Scale initiative, are proud to announce DeepSpeed Data Efficiency Library – a composable framework that makes better use of data, increases training efficiency, and improves model quality. DeepSpeed Data Efficiency takes extensibility, flexibility, and composability into consideration, and it specifically demonstrates the following innovations:
Efficient data sampling via curriculum learning. Curriculum learning (CL) improves data efficiency by sampling from easier data. We present a general curriculum learning library which enables users to employ curriculum learning to their models at maximum extensibility: users can easily analyze, index, and sample their training data based on various customizable strategies. Using this library, we were able to explore different CL strategies for GPT-3 and BERT pretraining and identify the best solution that provides up to 1.5x data saving while still maintaining similar model quality.
Efficient data routing via random layerwise token dropping. We present a novel data routing technique called random layerwise token dropping (random-LTD) to skip the computation of a subset of the input tokens at all middle layers. Random-LTD employs a simple yet effective routing strategy and requires minimal model architecture change. It is flexible to apply random-LTD to various tasks (GPT-3/BERT pretraining and GPT/ViT finetuning), and we achieve great data efficiency improvement (up to 1.5x data saving while still maintaining the model quality).
Seamlessly composing multiple methods. The proposed DeepSpeed Data Efficiency framework seamlessly composes the curriculum learning and random-LTD techniques, and only requires minimal changes on the user code side. Furthermore, by composing both methods we can achieve even better data and training efficiency: for GPT-3 1.3B pretraining, we achieve 2x data and 2x time savings together with better or similar model quality compared to the baseline training. When using the same amount of data, our approach further improves the model quality over the baseline. Users can also extend and contribute to the library by adding additional data efficiency techniques to compose together.
Each of these advances is explored further in the blog post below. For more about the technical details, please read our papers, “Random-LTD: Random and Layerwise Token Dropping Brings Efficient Training for Large-scale Transformers” which describes the random-LTD technique, and “DeepSpeed Data Efficiency: Improving Deep Learning Model Quality and Training Efficiency via Efficient Data Sampling and Routing” which describes the curriculum learning technique and overall DeepSpeed Data Efficiency framework.
Curriculum learning aims to improve training convergence speed by presenting relatively easier or simpler examples earlier during training. Building a curriculum learning solution usually requires two components: the difficulty metric (i.e., how to quantify the difficulty of each data sample) and the pacing function (i.e., how to decide the curriculum difficulty range when sampling next training data batch). Curriculum learning has been successfully applied to various training tasks, and last year we also released a specific curriculum learning technique (sequence length warmup) for GPT-style model pretraining (see technical details in our paper “The Stability-Efficiency Dilemma: Investigating Sequence Length Warmup for Training GPT Models” published in NeurIPS 2022). However, one common limitation among existing works is that there does not exist a generalized and extensible curriculum learning library, which allows practitioners to easily apply custom curriculum difficulty metrics, the combination of metrics, and pacing functions.
To solve the limitation of existing solutions, we design and implement a general curriculum learning library emphasizing the extensibility. It consists of three components as shown in Figure 2 below (top part). First, we use a data analyzer to perform the offline CPU-only data analysis which indexes the whole data pool based on any difficulty metric such as the sequence length, the vocabulary rarity, or anything defined by user. Next, during training, the curriculum scheduler determines the difficulty threshold for the current step based on a pacing function such as linear, rooted, or any strategy provided by users. Then the data sampler will sample the data with desired difficulty from the indexed data pool. Overall, this general implementation would enable users to explore curriculum learning on their workloads with maximum customizability (more technical details in our DeepSpeed Data Efficiency paper).
Figure 2: Design of the DeepSpeed Data Efficiency framework.
Using this general and extensible curriculum learning solution for GPT-3 and BERT-Large model pretraining, we are able to easily analyze and index the huge training data based on up to 7 difficulty metrics and enable better data and training efficiency. For GPT-3 pretraining, our solution with the best difficulty metric (combination of truncation-based sequence length and vocabulary rarity) achieves 1.5x data and training cost saving while still maintaining model quality as baseline (Table 1 Case (8) vs. (1)). For BERT-Large pretraining, our solution with the best difficulty metric (vocabulary rarity) achieves 1.5x saving while still maintaining model quality (Table 2 Case (8) vs. (1)). On the other hand, our solutions can further improve model quality when using the same amount of data as baseline (Table 1 Case (2) to (6), Table 2 Case (2) to (6)).
Table 1: GPT-3 1.3B pretraining data consumption and average evaluation accuracy on 19 tasks.
Table 2: BERT-Large pretraining data consumption and average GLUE finetuning score on 8 tasks.
Standard data routing usually feeds the full images/sequences into all layers of a model. However, this process may not be optimal for training efficiency since some parts of an image (or words of a sentence) do not require a frequent feature update. As such, the token dropping method has been proposed, which is illustrated in Figure 3 (b) below, to skip the compute of some tokens/words (i.e., G-2 tokens in Figure 3 (b)) of a sentence in order to save the compute cost.
Although existing methods show promising results, they also exhibit several caveats: (1) most works solely focus on BERT (encoder-only on text data) pretraining and do not include decoder pretraining and/or other modalities (e.g., images); (2) the ability to skip layers is limited, which bounds the total amount of compute saving. By analyzing existing methods, we found out the potential main issue that limits their skipping and coverage abilities is the loss of attention mechanism for G-2 tokens for all skipped layers, since multi-head attention focuses on different tokens at different layer depths and the attention map aligns with the dependency relation most strongly in the middle of transformer architectures.
To resolve this main issue, we propose random-LTD, a random and layerwise token dropping mechanism, which processes only a subset of tokens among the entire data batch for all middle layers in order to save compute cost (see more details in our Random-LTD paper). As such, each token rarely bypasses all middle layers and its dependency with other tokens can be captured by the model. The illustration of random-LTD compared to baseline is shown in Figure 3 below, where random-LTD splits the input tokens into two groups and only the first group involves the compute.
Figure 3: Comparison between baseline, existing token dropping methods, and random-LTD. Note that for random-LTD, only part of the inputs (Group 1) is used for Layer i.
Random-LTD is simple yet very effective. Particularly, compared to other existing token dropping methods, random-LTD (1) does a purely random selection for each layer for two different groups, as such we do not require any expert design for the selection criterion; (2) is able to apply to all middle layers to achieve better saving ratio; (3) demonstrates great generalizability for both encoder and decoder models; and (4) is easy to use without much modeling change. These advantages enable maximum flexibility when applying random-LTD to various workloads.
Thanks to its great flexibility, we were able to apply random-LTD method to broader applications, including BERT and GPT pretraining as well as ViT and GPT finetuning tasks. For all cases, random-LTD achieves similar model quality as baseline while using less data, and/or achieve better model quality while using the same amount of data (Table 3 to 6). For GPT-3 and BERT-Large pretraining, random-LTD achieves 1.5-2x data saving while still maintaining the same model quality. For GPT-3 we also tested random-LTD with full data which further improves the model quality compared to baseline.
Table 3: GPT-3 1.3B pretraining data consumption and average evaluation accuracy on 19 tasks.
Table 4: BERT-Large pretraining data consumption and average GLUE finetuning score on 8 tasks.
Table 5: Finetuning result of ViT on ImageNet.
Table 6: GPT-2 350M finetuning result on the PTB task.
The curriculum learning and random-LTD techniques are complementary. Inside DeepSpeed Data Efficiency framework, we seamlessly compose the two techniques as shown in Figure 2 above, where curriculum learning helps to sample the next data batch and random-LTD helps to decide how to route each sampled data inside the model. DeepSpeed Data Efficiency solves several complexities when composing the two techniques so that users can easily apply each technique or both to their training pipeline. The composability of DeepSpeed Data Efficiency also applies to data sampling and routing techniques in general, so that it provides a platform to implement and compose additional data efficiency techniques.
The composed DeepSpeed Data Efficiency solution leverages both data efficiency techniques and achieves even better data and training efficiency. Take the GPT-3 pretraining task as an example, composing CL and random-LTD, with 100% data, leads to the best model quality in our experiments (Table 7 Case (1) to (4)). When pretraining with 50% data, the baseline training results in worse zero-shot and 10-shot evaluation accuracy, and using either CL or random-LTD can only recover part of the 10-shot accuracy loss. On the other hand, the composed data efficiency solution achieves the same or better accuracy results as baseline with 100% data, demonstrating a 2x data and 2x time saving (Case (5) to (8)). Similar benefit such as 2x data saving was also observed when applying our solution to BERT pretraining.
Table 7: GPT-3 1.3B pretraining data/time consumption and average evaluation accuracy on 19 tasks.
We are very excited to share DeepSpeed Data Efficiency library with the community and improve it with your feedback. Please find the code, tutorial, and documents at the DeepSpeed GitHub, and website. And for more technical details please read our Random-LTD paper and DeepSpeed Data Efficiency paper. We believe that our composable library and novel data efficiency techniques will help users reduce training cost while maintaining model quality or achieve better quality under similar cost. And we hope DeepSpeed Data Efficiency could become a platform that motivates and accelerates future research on deep learning data efficiency.
Updated: December 11, 2022
---
## DeepSpeed Inference: Multi-GPU inference with customized inference kernels and quantization support
**URL:** https://www.deepspeed.ai/2021/03/15/inference-kernel-optimization.html
**Contents:**
- DeepSpeed Inference: Multi-GPU inference with customized inference kernels and quantization support
- Contents
- Multi-GPU Inference with Adaptive Parallelism
- Customized Inference Kernels for Boosted Compute Efficiency of Transformer Blocks
- Kernel-Fusion
- Seamless pipeline from training to inference with automatic kernel-injection
- Flexible quantization support
- Performance results
While DeepSpeed supports training advanced large-scale models, using these trained models in the desired application scenarios is still challenging due to three major limitations in existing inference solutions: 1) lack of support for multi-GPU inference to fit large models and meet latency requirements, 2) limited GPU kernel performance when running inference with small batch sizes, and 3) difficulties in exploiting quantization, which includes both quantizing the model to reduce the model size and latency as well as supporting high-performance inference of quantized models without specialized hardware.
To handle these challenges, we introduce DeepSpeed Inference, which seamlessly adds high-performance inference support to large models trained in DeepSpeed with three key features: inference-adapted parallelism for multi-GPU inference, inference-optimized kernels tuned for small batch sizes, and flexible support for quantize-aware training and inference kernels for quantized models.
Parallelism is an effective approach to fit large models and reduce per-device memory consumption for both training and inference. However, simply applying training parallelism choices and degree to inference does not work well. The MP and PP configuration is normally set during the model training, apart from the data parallelism (DP), based on the memory footprint and computation style, and resource budget. On one hand, inference computation intrinsically requires less memory, so it can afford a larger partition per device. It helps reduce the degree of parallelism needed for model deployment. On the other hand, optimizing latency or meeting latency requirements is often a first-class citizen in inference while training optimizes throughput.
To obtain desired latency, DeepSpeed Inference automatically adapts MP as an effective approach to reduce model latency, and its parallelism degree is often determined first. With MP, we can split the mode and parallelize computational operations across multiple devices (GPUs) to reduce latency, but it reduces computation granularity and increases communication that may hurt throughput. Once the latency target has been met, DeepSpeed can apply pipeline parallelism to maximize the throughput. Overall, DeepSpeed Inference supports flexible adaptation of both parallelism approach and degree choices from training to inference, minimizing latency while saving deployment costs.
To achieve high compute efficiency, DeepSpeed-inference offers inference kernels tailored for Transformer blocks through operator fusion, taking model-parallelism for multi-GPU into account. The main difference between our kernel-fusion scheme and similar approaches is that we not only fuse element-wise operations (such as bias-add, residual, and activation function), but also merge the General matrix multiply (GeMM) operations with other operations. To do this, we design an efficient implementation for the vector-matrix or skinny matrix-matrix multiplication that allows us to fuse more operations at the reduction boundary of GeMM operations.
We take two main policies for fusing operations: 1) keeping the access-pattern of inputs and outputs intact throughout the sequence of operations fused together; 2) fusing operations at each all-reduce boundary. The first policy ensures that different thread-blocks won’t encounter transferring data between Streaming-Multiprocessors (SMs). This is due to no straight-forward communication among SMs other than using the main memory which adds the block-synching overhead because of non-deterministic behavior of memory access. The reason behind the second policy is that we cannot continue the execution unless the partial results are reduced among the model-parallel GPUs.
Figure 1: Transformer Layer with Megatron-style model-parallelism all-reduce components. The figure illustrates the parts of layer fused together with broken lines (width of line shows the fusion depth).
Figure 1 shows the different components of a Transformer layer, and the groups of operations considered for fusion in our inference optimization. We also consider the NVIDIA Megatron-LM style of parallelism that partitions attention (Attn) and feed-forward (FF) blocks across multiple GPUs. Thus, we include the two all-reduce operations that reduce the results among parallel GPUs after Attn and FF blocks. As Figure 1 shows, we fuse the operations inside a Transformer layer at four main regions:
To fuse these operations, we exploit shared-memory as an intermediate cache for transferring data between reduction operations used in layer-norm and GeMM, and the element-wise operations. Moreover, we use the warp-level instructions to communicate data between threads when reducing partial computations. In addition, we use a new schedule for GeMM operations, which allows for fusing as many operations as needed for the third kernel-fusion. We also combine the GeMMs for the attention computation in the second kernel-fusion, by using an implicit matrix transformation in order to reduce the memory pressure. Compared to the unfused computation style using cuBLAS GeMM, we improve the performance by 1.5x, 2.9x. 3x, and 1.2x for all these kernel-fusions, respectively.
To run the model in Inference mode, DeepSpeed simply requires the location of the model checkpoints and the desired parallelism configuration, i.e., MP/PP degree. DeepSpeed Inference kernels can also be enabled for many well-known model architectures such as HuggingFace (Bert and GPT-2) or Megatron GPT-based models using a pre-defined policy map that maps the original parameters to the parameters in the inference kernels. For other transformer-based models, user can specify their own policy map. Note that DS-Inference can run independent of the training pipeline as long as it receives all model checkpoints, and the DeepSpeed Transformer kernels for inference can be injected into any Transformer model if the right mapping policy is defined. For more information on how to enable Transformer inference kernel as well as specifying parallelism, please refer to out inference tutorial.
To further reduce the inference cost for large-scale models, we created the DeepSpeed Quantization Toolkit, supporting flexible quantize-aware training and high-performance kernels for quantized inference.
For training, we introduce a novel approach called Mixture of Quantization (MoQ), which is inspired by mixed-precision training while seamlessly applying quantization. With MoQ, we can control the precision of the model by simulating the impact of quantization when updating the parameters at each step of training. Moreover, it supports flexible quantization policies and schedules—we find that by dynamically adjusting the number of quantization bits during training, the final quantized model provides higher accuracy under the same compression ratio. To adapt to different tasks, MoQ can also leverage the second order information of models to detect their sensitivity to precision and adjust the quantization schedule and target accordingly.
To maximize the performance gains from the quantization model, we provide inference kernels tailored for quantized models that reduce latency through optimizing data movement but do not require specialized hardware. Finally, our toolkit does not require any code changes on the client side, making it easy to use.
Boosting throughput and reducing inference cost. Figure 3 shows the inference throughput per GPU for the three model sizes corresponding to the three Transformer networks, GPT-2, Turing-NLG, and GPT-3. DeepSpeed Inference increases in per-GPU throughput by 2 to 4 times when using the same precision of FP16 as the baseline. By enabling quantization, we boost throughput further. We reach a throughput improvement of 3x for GPT-2, 5x for Turing-NLG, and 3x for a model that is similar in characteristics and size to GPT-3, which directly translates to 3–5x inference cost reduction on serving these large models. In addition, we achieve these throughput and cost improvements without compromising latency as shown in Figure 5.
Figure 3: Inference throughput for different model sizes. DeepSpeed Inference achieves 3x to 5x higher throughput than baseline.
One source of inference cost reduction is through reducing the number of GPUs for hosting large models as shown in Figure 4. The optimized GPU resources comes from 1) using inference-adapted parallelism, allowing users to adjust the model and pipeline parallelism degree from the trained model checkpoints, and 2) shrinking model memory footprint by half with INT8 quantization. As shown in this figure, we use 2x less GPUs to run inference for the 17B model size by adapting the parallelism. Together with INT8 quantization through DeepSpeed MoQ, we use 4x and 2x fewer GPUs for 17B and 175B sizes respectively.
Figure 4: Number of GPUs used for running inference on the different model sizes shown in Figure 4.
Reducing inference latency. For the application scenarios where inference latency is critical, we can increase model parallelism degree in DeepSpeed Inference to reduce inference latency further. As Figure 5 depicts, we can reduce the latency by 2.3x compared to PyTorch as we increase the model-parallelism size to 4. Furthermore, we can still have high latency improvement with a fewer number of GPUs by adapting the parallelism at inference and using MoQ to quantize the model. We obtain 1.3x and 1.9x speedups while using 4x and 2x lower resources than baseline, respectively.
For the application scenarios where inference latency is critical, we can increase model parallelism degree in DeepSpeed Inference to reduce inference latency further. As Figure 5 depicts, we can reduce the latency by 2.3x compared to PyTorch as we increase the model-parallelism size to 4. Furthermore, we can still have high latency improvement with a fewer number of GPUs by adapting the parallelism at inference and using MoQ to quantize the model. We obtain 1.3x and 1.9x speedups while using 4x and 2x lower resources than baseline, respectively.
Figure 5. Inference latency for the 17B model using different parallelism configuration to optimize latency.
Updated: March 15, 2021
---
## Inference Overview and Features
**URL:** https://www.deepspeed.ai/inference/
**Contents:**
- Inference Overview and Features
- Contents
DeepSpeed-Inference v2 is here and it’s called DeepSpeed-FastGen! For the best performance, latest features, and newest model support please see our DeepSpeed-FastGen release blog!
DeepSpeed-Inference introduces several features to efficiently serve transformer-based PyTorch models. It supports model parallelism (MP) to fit large models that would otherwise not fit in GPU memory. Even for smaller models, MP can be used to reduce latency for inference. To further reduce latency and cost, we introduce inference-customized kernels. Finally, we propose a novel approach to quantize models, called MoQ, to both shrink the model and reduce the inference cost at production. For more details on the inference related optimizations in DeepSpeed, please refer to our blog post.
DeepSpeed provides a seamless inference mode for compatible transformer based models trained using DeepSpeed, Megatron, and HuggingFace, meaning that we don’t require any change on the modeling side such as exporting the model or creating a different checkpoint from your trained checkpoints. To run inference on multi-GPU for compatible models, provide the model parallelism degree and the checkpoint information or the model which is already loaded from a checkpoint, and DeepSpeed will do the rest. It will automatically partition the model as necessary, inject compatible high performance kernels into your model and manage the inter-gpu communication. For list of compatible models please see here.
To get started with DeepSpeed-Inference, please checkout our tutorial.
---
## Mixture-of-Quantization: A novel quantization approach for reducing model size with minimal accuracy impact
**URL:** https://www.deepspeed.ai/2021/05/04/MoQ.html
**Contents:**
- Mixture-of-Quantization: A novel quantization approach for reducing model size with minimal accuracy impact
- Contents
- A unified suite for quantization-aware training and inference
- Quantization methodology
- Quantized Inference Kernels
- Ease of use
- Improving quantization accuracy.
Running large-scale models on multi-GPU might help reduce latency but increases the deployment cost significantly, especially as the model size grows bigger. To mitigate this issue, we resort to model compression techniques and introduce a new methodology that quantizes Transformer networks with a minimal impact on accuracy. Our technique achieves similar or better performance thanFP16 models through customized inference kernels on lower or equal number of GPUs.
Our scheme is flexible in the sense that it provides users the ability to experiment with any quantization configuration, such as the target number of bits used for quantization precision, and the scheduling by which the model gets quantized during training. Furthermore, we combine both the FP16 and quantized precision as a mixed-precision mechanism to smooth the transition from a high to low precision. Finally, we use the second-order gradient (eigenvalue) of the parameters to adjust the quantization schedule during training.
There are two main approaches of applying quantization: offline quantization on the trained model and quantization-aware training (QAT) that reduces the data-precision during training. Unlike the former scheme, QAT gets the model trained by taking the impact of precision loss into account during the training optimization. This will result in significant improvement of the quantized model accuracy. MoQ is designed on top QAT approach, with the difference that we use a mixture of precisions to train the model toward target quantization, as well as defining a scheduling for reducing the precision.
All existing QAT approaches quantize the model with a certain precision (number of bits) from the beginning of training until completion. However, even by using a relatively high quantization precision (8-bit), there will be some drop in model accuracy, which might not be acceptable for some downstream tasks. For instance, the Q8BERT work tries QAT for the BERT network, which results in good accuracy for some tasks while others (like SQuAD) lose 0.8% in the F1 score. Other techniques, such as Q-BERT, use grouped quantization with a large grouping size (128) when quantizing a parameter matrix to gain higher accuracy, but they are still inferior to the baseline.
Here, we present MoQ as a flexible solution for linear quantization that allows users to define a schedule as the model trains. Similar to iterative pruning to inject sparsity, we start quantization from a higher precision (16-bit quantization or FP16) and gradually reduce the quantization bits or the mixed-precision ratio for the FP16 part until reaching a target precision (8-bit). To control the precision transition, we define a hyperparameter, called quantization period, that indicates when the precision reduction should happen. We observe that by using such a schedule, we get the closest accuracy to the baseline. Note that in order to reach a certain precision, we need to define the starting bits and period in a way that within the number of samples to train, the model eventually gets quantized using the target number of bits. Please refer to the quantization tutorial for more information.
In order to dynamically adjust quantization precision, we employ eigenvalue as a metric that shows the sensitivity of training to the precision change. Eigenvalue has been previously used (Q-BERT) for quantization to choose the precision bits on different parts of the network. To combine this with MoQ, we cluster the eigenvalues into several regions based on their absolute values and tune the quantization period for each region accordingly, the higher the magnitude of eigenvalue, the larger the factor and the slower the precision decreases.
Figure 1. Quantization scheduling of one of the GLUE tasks (QNLI), using the eigenvalue of different layers. Different colors show the layers from 0 to 11 for Bert-Base.
Figure 1 shows the result of combining eigenvalue with MoQ for a 12-layer Bert Base model. As we see, the first few layers (0-4) tend to be more sensitive to reduced precision than the last layers, as their quantization period is an order of magnitude larger than the rest. Another observation from this figure is that the neighbor layers reduce the precision in the same way. For instance, layers 9, 10, and 11 on the left chart, and layers 0 and 4 and 1 and 3 on the right chart of Figure 1 get similar schedule. This is due to having similar eigenvalues for these layers throughout the training.
Figure 2: Mixed-precision quantization for the QNLI using target quantization period as 4 bits.
Figure 2 shows another mixed-precision quantization that sets target bits as 4, however the quantization period keeps updated through the eigenvalues of each layer. As we see, the end quantization bits are different for all layers. The first layers still get to 8-bit quantization as the training samples is not enough to decrease the quantization bits. On the other hand, the last layers keep reducing the precision. We finally reduce the average precision to 6 bits for the entire network while maintaining the accuracy of the model (0.3% drop in accuracy).
Figure 3: Mixed-precision quantization with MoQ for Bert SQuAD plus.
As another example, we use eigenvalue-based MoQ to quantize Bert-Large for SQuAD finetuning. Figure 3 shows the number of bits we get to at the end of finetuning on each layer. Here, we see slightly different precision spectrum compared to BertBase on GLUE tasks. As the figure shows, we can reduce the precision on the first few layers more aggressively than the middle ones. Also, the last few layers can tolerate very low precision similar to the beginning layers. This way of quantization finally results in 90.56 F1 Score which is pretty similar to the baseline.
By using other quantization methodologies, after the model is quantized, it can only have performance benefit if there is hardware support for integer-based operations. For this reason, the inputs and output of all GeMM operations need to be quantized. However, since the range of input may vary request by request, finding a range of data for each input at inference time is challenging. On the other hand, using a static range for all inputs can impact the inference accuracy.
To alleviate this problem, we introduce inference custom kernels that neither require the hardware support nor the input quantization. These kernels read quantized parameters and dequantize them on-the-fly and use the floating-point units of GPU cores for the GeMM operations. The main benefit of using these kernels is that they reduce the memory footprint required to load a model so that we can run inference on fewer number of GPUs, while improving the performance by saving the memory bandwidth required to run the inference on GPU.
Regarding the quantization implementation, we use different algorithms to quantize a value based on the range of data and the rounding policy. We support both symmetric and asymmetric quantization as the two mostly used schemes. We applied both techniques for QAT and see very similar results, however since symmetric approach is simpler to implement, we implement our inference kernels based on that. Regarding the rounding, we support stochastic rounding as another option besides the normal rounding. We have seen that for reducing the precision to as low as 4-bit or lower, stochastic rounding is more helpful as it has an unbiased random behavior during training.
For enabling quantization through Deepspeed, we only need to pass the scheduling through a JSON configuration file. To add the impact of quantization, we quantize and dequantize the parameters just before they are updated in the optimizer. Thus, we do not incur any change on the modeling side to quantize a model. Instead, we simulate the quantization impact by lowering the precision of data saved in FP16 format. By using this kind of implementation, we have the full flexibility of changing the precision using the training characteristics such as number of steps, and eigenvalue of the parameters and the original FP16 data format. As shown in this blog post, we can improve the quality of a quantized model by adaptively changing the scheduling of the quantization throughout training. For more information on how to use MoQ scheme, please look at our quantization tutorial.
To show how our quantization scheme preserves accuracy, we have experimented MoQ on several tasks and networks: GLUE tasks on Bert-Base and SQuAD on Bert-Large. Table 1 shows the accuracy results for the baseline without quantization (w/o Quant), basic quantization without using any scheduling during training (Basic Quant), and our MoQ scheme. Without using any scheduling, the accuracy for 8-bit quantization is often inferior to the baseline, and in this workload, it suffers from a drop of 1.02 point in accuracy (ACC). In contrast, MoQ powers 8-bit quantization to obtain comparable accuracy as the FP16 baseline, even with a slightly higher ACC, demonstrating the effectiveness of our quantization approach.
---
## DeepSpeed: Accelerating large-scale model inference and training via system optimizations and compression
**URL:** https://www.deepspeed.ai/2021/05/14/inference-release.html
**Contents:**
- DeepSpeed: Accelerating large-scale model inference and training via system optimizations and compression
- Contents
Updated: May 14, 2021
---
## Autotuning: Automatically discover the optimal DeepSpeed configuration that delivers good training speed
**URL:** https://www.deepspeed.ai/2021/11/16/autotuning.html
**Contents:**
- Autotuning: Automatically discover the optimal DeepSpeed configuration that delivers good training speed
We introduce a new feature called Autotuning to automatically discover the optimal DeepSpeed configuration that delivers good training speed. One pain point in model training is to figure out good performance-relevant configurations such as micro-batch size to fully utilize the hardware and achieve a high throughput number. This configuration exploring process is commonly done manually but is important since model training is repeated many times and benefits from using a good configuration. Not only is the hand-tuning process time-consuming, but the outcome is hardware-dependent. This means that a good configuration on one hardware might not be the best on another different hardware. The user thus has to hand tune the configuration again. With DeepSpeed, there are more configuration parameters that could potentially affect the training speed, thus making it more tedious to manually tune the configuration.
The DeepSpeed Autotuner mitigates this pain point and automatically discovers the optimal DeepSpeed configuration that delivers good training speed. It not only reduces the time and resources users spend on tuning, but also can discover configurations better than hand-tuned methods. DeepSpeedExamples would demonstrate the effectiveness of autotuning across different models.
Updated: November 16, 2021
---
## Contributing
**URL:** https://www.deepspeed.ai/contributing/
**Contents:**
- Contributing
- Contents
- Prerequisites
- Testing
- Unit Tests
- Model Tests
- Contributor License Agreement
- Code of Conduct
DeepSpeed welcomes your contributions!
DeepSpeed uses pre-commit to ensure that formatting is consistent across DeepSpeed. First, ensure that pre-commit is installed from either installing DeepSpeed or pip install pre-commit. Next, the pre-commit hooks must be installed once before commits can be made:
Afterwards, our suite of formatting tests run automatically before each git commit. You can also run these manually:
If a formatting test fails, it will fix the modified code in place and abort the git commit. After looking over the changes, you can git add <modified files> and then repeat the previous git commit command.
DeepSpeed tracks two types of tests: unit tests and more costly model convergence tests. The model convergence tests train DeepSpeedExamples and measure end-to-end convergence and related metrics. Unit tests are found in tests/unit/ and the model convergence tests are found in tests/model/.
PyTest is used to execute tests. PyTest can be installed from PyPI via pip install pytest. Simply invoke pytest --forked to run the unit tests:
You can also provide the -v flag to pytest to see additional information about the tests. Note that pytest-forked and the --forked flag are required to test CUDA functionality in distributed tests.
Model tests require four GPUs and training data downloaded for DeepSpeedExamples.
To execute model tests, first install DeepSpeed. The DeepSpeedExamples repository is cloned as part of this process. Next, execute the model test driver:
Note that the --forked flag is not necessary for the model tests.
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.
**Examples:**
Example 1 (unknown):
```unknown
pre-commit install
```
Example 2 (unknown):
```unknown
pre-commit run --all-files
```
Example 3 (unknown):
```unknown
pytest --forked tests/unit/
```
Example 4 (unknown):
```unknown
cd tests/model/
pytest run_sanity_check.py
```
---
## Latest News
**URL:** https://www.deepspeed.ai
**Contents:**
- Latest News
- Contents
- Extreme Speed and Scale for DL Training
- DeepSpeed Adoption
- Contributing
- Contributor License Agreement
- Code of Conduct
- Publications
- Videos
[2025/10] SuperOffload: Unleashing the Power of Large-Scale LLM Training on Superchips
[2025/10] Study of ZenFlow and ZeRO offload performance with DeepSpeed CPU core binding
[2025/08] ZenFlow: Stall-Free Offloading Engine for LLM Training
[2025/06] Arctic Long Sequence Training (ALST) with DeepSpeed: Scalable And Efficient Training For Multi-Million Token Sequences
[2025/06] DeepNVMe: Affordable I/O scaling for Deep Learning Applications
DeepSpeed enabled the world’s most powerful language models (at the time of this writing) such as MT-530B and BLOOM. DeepSpeed offers a confluence of system innovations, that has made large scale DL training effective, and efficient, greatly improved ease of use, and redefined the DL training landscape in terms of scale that is possible. These innovations include ZeRO, 3D-Parallelism, DeepSpeed-MoE, ZeRO-Infinity, etc.
DeepSpeed has been used to train many different large-scale models. Below is a list of several examples that we are aware of (if you’d like to include your model please submit a PR):
DeepSpeed has been integrated with several different popular open-source DL frameworks such as:
DeepSpeed is an integral part of Microsoft’s AI at Scale initiative to enable next-generation AI capabilities at scale.
DeepSpeed welcomes your contributions! Please see our contributing guide for more details on formatting, testing, etc.
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.
Xinyu Lian, Sam Ade Jacobs, Lev Kurilenko, Masahiro Tanaka, Stas Bekman, Olatunji Ruwase, Minjia Zhang. (2024) Universal Checkpointing: Efficient and Flexible Checkpointing for Large Scale Distributed Training arXiv:2406.18820
---
```