[XPU] Support XCCL on deepspeed side by yisustc · Pull Request #7113 · deepspeedai/DeepSpeed

yisustc · 2025-03-06T07:25:51Z

XCCL will be used for XPU devices in the future; we will also keep the old path so that torch-CCL can still be enabled.

loadams · 2025-03-06T16:24:27Z

Hi @ys950902 - thanks for fixing the DCO errors. Could you run the pre-commit formatter on the PR (`pre-commit run --all-files`)? That will fix the formatting issues.

For the no-torch and python tests, I believe we are running those tests with too old of a version of torch, before xpu support was added there. We can try to update these tests, but should we check has_attr xpu first?
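The `hasattr` guard suggested above could look roughly like the sketch below. It is illustrative only and not taken from this PR: the helper name `xpu_is_usable` is hypothetical, and the module is passed in as an argument so the check can be exercised without an XPU-enabled torch build.

```python
from types import SimpleNamespace


def xpu_is_usable(torch_module) -> bool:
    """Return True only if the module exposes `xpu` and it reports a device.

    Older torch builds have no `xpu` attribute at all, so the attribute
    is probed before `is_available()` is called.
    """
    xpu = getattr(torch_module, "xpu", None)
    if xpu is None:
        return False
    return bool(xpu.is_available())


# Hypothetical stand-ins for an old torch (no `xpu`) and a newer one.
old_torch = SimpleNamespace()
new_torch = SimpleNamespace(xpu=SimpleNamespace(is_available=lambda: True))

print(xpu_is_usable(old_torch))  # False
print(xpu_is_usable(new_torch))  # True
```

In real code the argument would simply be the imported `torch` module.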

loadams · 2025-03-17T16:08:27Z

@ys950902 - let us know if you want to discuss more on the backwards compat or have questions on the no-torch test failures.

delock · 2025-04-03T09:43:20Z

Hi @ys950902 can you take a look at no-torch failures?

loadams · 2025-05-19T17:39:30Z

@ys950902 - reminder on the no-torch failure here.

Signed-off-by: Shaik Raza Sikander <srsikander@habana.ai> Signed-off-by: Olatunji Ruwase <olruwase@microsoft.com> Signed-off-by: Max Kovalenko <mkovalenko@habana.ai> Signed-off-by: inkcherry <mingzhi.liu@intel.com> Signed-off-by: shaomin <wukon1992@gmail.com> Signed-off-by: Stas Bekman <stas@stason.org> Signed-off-by: siqi <siqi@tecorigin.com> Signed-off-by: Logan Adams <loadams@microsoft.com> Signed-off-by: Wei Wu <wuwei211x@gmail.com> Signed-off-by: ShellyNR <shelly.nahir@live.biu.ac.il> Signed-off-by: Lai, Yejing <yejing.lai@intel.com> Signed-off-by: Hongwei <hongweichen@microsoft.com> Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Max Kovalenko <mkovalenko@habana.ai> Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com> Co-authored-by: inkcherry <mingzhi.liu@intel.com> Co-authored-by: wukong1992 <wukong1992@users.noreply.github.com> Co-authored-by: shaomin <wukon1992@gmail.com> Co-authored-by: Hongwei Chen <33092912+hwchen2017@users.noreply.github.com> Co-authored-by: loadams <loadams@users.noreply.github.com> Co-authored-by: Stas Bekman <stas00@users.noreply.github.com> Co-authored-by: siqi654321 <siqi202311@163.com> Co-authored-by: siqi <siqi@tecorigin.com> Co-authored-by: Wei Wu <45323446+U-rara@users.noreply.github.com> Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com> Co-authored-by: Shelly Nahir <73890534+ShellyNR@users.noreply.github.com> Co-authored-by: snahir <snahir@habana.ai> Co-authored-by: Yejing-Lai <yejing.lai@intel.com> Signed-off-by: yisheng <yi.sheng@intel.com>

Support training multiple models, such as in [HF](https://huggingface.co/docs/accelerate/en/usage_guides/deepspeed_multiple_model)

Here is an update on supporting multiple DS engines with a single `loss.backward()`. The main message is that I think we can support this.

First, some context. Backward pass in ZeRO is complicated because the optimizations/features require special handling of gradients, such as:

1. Gradient partitioning
2. Overlapping backward and reduction
3. Upcasting for fp32 grad accumulation

So, we created `engine.backward(loss)` as a wrapper function to give us fine-grained control over backward, as below:

```python
def backward(loss):
    backward_prologue()  # setup logic for special gradient handling
    loss.backward()
    backward_epilogue()  # cleanup/teardown logic
```

As demonstrated by @muellerzr, this approach breaks down when the loss originates from multiple DS engines. Our proposed solution is to use backward hooks on the module to launch `backward_prologue()` and `backward_epilogue()`. Specifically:

1. A backward pre hook on `engine.module` to launch `backward_prologue()` before any module gradient is created.
2. A backward post hook on `engine.module` to launch `backward_epilogue()` after all module gradients are created.

We plan for this solution to preserve BC, i.e., `engine.backward()` will remain correct for single-engine scenarios.

The current status is that (1) is completed, while (2) is in progress. To unblock e2e testing for multi-engine scenarios, since there are probably other issues, we have temporarily added `engine._backward_prologue()`. You can try this out via the following artifacts:

1. Simple multi-engine test code: https://gist.github.com/tjruwase/f1adccf087b8fa269ffce2ab91c4f1c6#file-multi_engine-py
2. DS branch: https://github.com/microsoft/DeepSpeed/tree/olruwase/zero_multi_models

--------- Signed-off-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com> Co-authored-by: Stas Bekman <stas00@users.noreply.github.com> Signed-off-by: yisheng <yi.sheng@intel.com>
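The hook-based idea can be sketched with plain PyTorch module hooks. This is a toy illustration, not DeepSpeed's implementation: `ToyEngine` and its event log are hypothetical, and only `register_full_backward_pre_hook` and `register_full_backward_hook` are real PyTorch APIs.

```python
import torch
from torch import nn


class ToyEngine:
    """Hypothetical stand-in for a DeepSpeed engine wrapping one module."""

    def __init__(self, name, module, events):
        self.name = name
        self.module = module
        self.events = events
        # Fires when gradients w.r.t. the module outputs arrive.
        module.register_full_backward_pre_hook(self._pre_hook)
        # Fires when gradients w.r.t. the module inputs are computed (or,
        # if no input requires grad, w.r.t. the module outputs).
        module.register_full_backward_hook(self._post_hook)

    def _pre_hook(self, module, grad_output):
        self.events.append(f"{self.name}: prologue")

    def _post_hook(self, module, grad_input, grad_output):
        self.events.append(f"{self.name}: epilogue")


events = []
engine_a = ToyEngine("a", nn.Linear(4, 1), events)
engine_b = ToyEngine("b", nn.Linear(4, 1), events)

x = torch.randn(2, 4)
loss = engine_a.module(x).sum() + engine_b.module(x).sum()
loss.backward()  # one backward call reaches the hooks of both engines

print(events)
```

Note that PyTorch fires the post hook based on input/output gradients rather than on parameter-gradient accumulation, so this sketch only shows the triggering mechanism, not a complete epilogue.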

Keeps lines within PEP 8 length limits. Enhances readability with a single, concise expression. Preserves original functionality. --------- Signed-off-by: Shaik Raza Sikander <srsikander@habana.ai> Signed-off-by: Olatunji Ruwase <olruwase@microsoft.com> Signed-off-by: Max Kovalenko <mkovalenko@habana.ai> Signed-off-by: inkcherry <mingzhi.liu@intel.com> Signed-off-by: shaomin <wukon1992@gmail.com> Signed-off-by: Stas Bekman <stas@stason.org> Signed-off-by: siqi <siqi@tecorigin.com> Signed-off-by: Logan Adams <loadams@microsoft.com> Signed-off-by: Wei Wu <wuwei211x@gmail.com> Signed-off-by: ShellyNR <shelly.nahir@live.biu.ac.il> Signed-off-by: Lai, Yejing <yejing.lai@intel.com> Signed-off-by: Hongwei <hongweichen@microsoft.com> Signed-off-by: Liang Cheng <astarxp777@gmail.com> Signed-off-by: A-transformer <astarxp777@gmail.com> Co-authored-by: Raza Sikander <srsikander@habana.ai> Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Max Kovalenko <mkovalenko@habana.ai> Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com> Co-authored-by: inkcherry <mingzhi.liu@intel.com> Co-authored-by: wukong1992 <wukong1992@users.noreply.github.com> Co-authored-by: shaomin <wukon1992@gmail.com> Co-authored-by: Hongwei Chen <33092912+hwchen2017@users.noreply.github.com> Co-authored-by: loadams <loadams@users.noreply.github.com> Co-authored-by: Stas Bekman <stas00@users.noreply.github.com> Co-authored-by: siqi654321 <siqi202311@163.com> Co-authored-by: siqi <siqi@tecorigin.com> Co-authored-by: Wei Wu <45323446+U-rara@users.noreply.github.com> Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com> Co-authored-by: Shelly Nahir <73890534+ShellyNR@users.noreply.github.com> Co-authored-by: snahir <snahir@habana.ai> Co-authored-by: Yejing-Lai <yejing.lai@intel.com> Co-authored-by: A-transformer <astarxp777@gmail.com> Signed-off-by: yisheng <yi.sheng@intel.com>

Unpin transformers version for all workflows except `nv-torch-latest-v100` as this still has a tolerance issue with some quantization tests. Signed-off-by: Logan Adams <loadams@microsoft.com> Signed-off-by: yisheng <yi.sheng@intel.com>

Resolves deepspeedai#6997 This PR conditionally quotes environment variable values—only wrapping those containing special characters (like parentheses) that could trigger bash errors. Safe values remain unquoted. --------- Signed-off-by: Saurabh <saurabhkoshatwar1996@gmail.com> Signed-off-by: Saurabh Koshatwar <saurabhkoshatwar1996@gmail.com> Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com> Signed-off-by: yisheng <yi.sheng@intel.com>
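A minimal sketch of conditional quoting, using the standard library's `shlex.quote` as a stand-in: it leaves shell-safe strings untouched and single-quotes everything else. The PR's own rule for which characters count as special is not shown here and may differ; `export_line` is a hypothetical helper name.

```python
import shlex


def export_line(name: str, value: str) -> str:
    """Build a bash `export` line, quoting the value only when needed.

    `shlex.quote` returns the string unchanged when it contains only
    shell-safe characters and wraps it in single quotes otherwise.
    """
    return f"export {name}={shlex.quote(value)}"


print(export_line("NCCL_DEBUG", "INFO"))            # export NCCL_DEBUG=INFO
print(export_line("MY_FUNC", "f() { echo hi; }"))   # export MY_FUNC='f() { echo hi; }'
```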

Correct the BACKWARD_PREFETCH_SUBMIT mismatch FORWARD_PREFETCH_SUBMIT = 'forward_prefetch_submit' --------- Signed-off-by: Shaik Raza Sikander <srsikander@habana.ai> Signed-off-by: Olatunji Ruwase <olruwase@microsoft.com> Signed-off-by: Max Kovalenko <mkovalenko@habana.ai> Signed-off-by: inkcherry <mingzhi.liu@intel.com> Signed-off-by: shaomin <wukon1992@gmail.com> Signed-off-by: Stas Bekman <stas@stason.org> Signed-off-by: siqi <siqi@tecorigin.com> Signed-off-by: Logan Adams <loadams@microsoft.com> Signed-off-by: Wei Wu <wuwei211x@gmail.com> Signed-off-by: ShellyNR <shelly.nahir@live.biu.ac.il> Signed-off-by: Lai, Yejing <yejing.lai@intel.com> Signed-off-by: Hongwei <hongweichen@microsoft.com> Signed-off-by: A-transformer <astarxp777@gmail.com> Co-authored-by: Raza Sikander <srsikander@habana.ai> Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Max Kovalenko <mkovalenko@habana.ai> Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com> Co-authored-by: inkcherry <mingzhi.liu@intel.com> Co-authored-by: wukong1992 <wukong1992@users.noreply.github.com> Co-authored-by: shaomin <wukon1992@gmail.com> Co-authored-by: Hongwei Chen <33092912+hwchen2017@users.noreply.github.com> Co-authored-by: loadams <loadams@users.noreply.github.com> Co-authored-by: Stas Bekman <stas00@users.noreply.github.com> Co-authored-by: siqi654321 <siqi202311@163.com> Co-authored-by: siqi <siqi@tecorigin.com> Co-authored-by: Wei Wu <45323446+U-rara@users.noreply.github.com> Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com> Co-authored-by: Shelly Nahir <73890534+ShellyNR@users.noreply.github.com> Co-authored-by: snahir <snahir@habana.ai> Co-authored-by: Yejing-Lai <yejing.lai@intel.com> Signed-off-by: yisheng <yi.sheng@intel.com>

…Tests (deepspeedai#7146) Enhance CI/nightly coverage for the Gaudi2 device.

Tests added:

- test_autotp_training.py
- test_ulysses.py
- test_linear::TestLoRALinear and test_linear::TestBasicLinear
- test_ctx::TestEngine

These provide coverage for the model_parallelism and linear features. The tests are stable: 10/10 runs pass. Adding the new tests is expected to increase CI time by 3-4 minutes and nightly job time by 15 minutes.

Signed-off-by: Shaik Raza Sikander <srsikander@habana.ai> Signed-off-by: yisheng <yi.sheng@intel.com>

Changes from huggingface/transformers#36654 in transformers cause issues with the torch 2.5 version we were using. This just updated us to use a newer version. --------- Signed-off-by: Logan Adams <loadams@microsoft.com> Signed-off-by: yisheng <yi.sheng@intel.com>

@tjruwase Don't merge yet, I will leave a comment when it is ready for merge. Thank you. --------- Signed-off-by: Olatunji Ruwase <olruwase@microsoft.com> Signed-off-by: inkcherry <mingzhi.liu@intel.com> Signed-off-by: Logan Adams <loadams@microsoft.com> Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Logan Adams <loadams@microsoft.com> Signed-off-by: yisheng <yi.sheng@intel.com>

…epspeedai#7158) This PR is a continuation of the efforts to improve DeepSpeed performance when using PyTorch compile.

Dynamo breaks the graph because `flat_tensor.requires_grad = False`:

* Is a side-effecting operation on tensor metadata
* Occurs in a context where Dynamo expects static tensor properties for tracing

`flat_tensor.requires_grad` is redundant and can be safely removed because:

* `_allgather_params()` function is already decorated with `@torch.no_grad()` which ensures the desired property
* `flat_tensor` is created using the `torch.empty()` which sets the `requires_grad=False` by default.

--------- Signed-off-by: Max Kovalenko <mkovalenko@habana.ai> Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com> Co-authored-by: Hongwei Chen <33092912+hwchen2017@users.noreply.github.com> Signed-off-by: yisheng <yi.sheng@intel.com>

ZeRO3 requires explicit cleaning in tests when reusing the environment. This PR adds `destroy` calls to the tests to free memory and avoid potential errors due to memory leaks. Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com> Signed-off-by: yisheng <yi.sheng@intel.com>

# Background and rationale

In many use cases, particularly LLMs, one is faced with inputs (sentences) of variable lengths. A common practice is to pack batches by token count (not a fixed batch size), i.e. by putting together sentences whose given metric (e.g. sequence lengths) adds up to a user-provided value. As an example, in [Attention is all you need](https://arxiv.org/abs/1706.03762), section 5.1:

> Sentence pairs were batched together by approximate sequence length. Each training batch contained a set of sentence pairs containing approximately 25000 source tokens and 25000 target tokens.

Dynamic batch sizes have been requested in [DeepSpeed issue 1051](deepspeedai#1051), [DeepSpeed issue 3455](deepspeedai#3455), [Pytorch Lightning issue 16914](Lightning-AI/pytorch-lightning#16914), [huggingface issue 2647](huggingface/accelerate#2647) and are already available in many libraries, e.g. [NVIDIA Triton](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_configuration.md#dynamic-batcher) and [Meta FairSeq](https://github.com/facebookresearch/fairseq) (implementation [here](https://github.com/facebookresearch/fairseq/blob/34973a94d09ecc12092a5ecc8afece5e536b7692/fairseq/data/fairseq_dataset.py#L104)).

The immediate use case for this is when one needs to maximize GPU utilization. Moreover, this is particularly relevant for curriculum learning, where a `BxTxE` (Batch x Time x Embedding)-shaped input should ideally have high `B` and low `T` at the early curriculum steps (many short sentences packed together as a batch), and low `B` and high `T` at the late steps (few long sentences in the batch). A dynamic size `T` is already supported by DeepSpeed, e.g. in the documentation for pipeline parallelism's [reset_activation_shape()](https://deepspeed.readthedocs.io/en/stable/pipeline.html#deepspeed.runtime.pipe.engine.PipelineEngine.reset_activation_shape):

> For curriculum learning that changes the seqlen of each sample, we need to call this whenever the seqlen is going to change.

However, dynamic `B` is not supported. A dynamic `B` would require an adequate increase/decrease of the learning rate. This technique has been applied previously, and the two most common LR scaling algorithms have been described as:

1. Linear Scaling Rule: "When the minibatch size is multiplied by k, multiply the learning rate by k", as in [Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour, Goyal et al.](https://arxiv.org/abs/1706.02677)
2. Square Root scaling: "when multiplying the batch size by k, multiply the learning rate by √k, to keep the variance in the gradient expectation constant" by [One weird trick for parallelizing convolutional neural networks, A. Krizhevsky et al.](https://arxiv.org/abs/1404.5997)

In practice, the user picks the total token count per batch as the metric that drives batching, instead of batching by sentence count. During runtime, the variable batch size is computed and the LR is adjusted accordingly, based on the LR and batch size provided by the config.

# Illustration of dynamic batch size, sequence length and LR

Imagine we picked a limit of `30` tokens per batch, and have set a reference `lr=1e-3` for a `train_batch_size=2` (in the deepspeed config). The batching algorithm for curriculum may pack the data into batches of short sentences (left) at the early stages, and batches of long sentences (right) at later stages, e.g.:

![dynamic_batch_size_and_lr](https://github.com/microsoft/DeepSpeed/assets/150697676/324bda09-8f0b-430c-bb33-cc1bd01c3fe7)

Above, we collected samples until we filled up the batch with at most 30 tokens. The batch sizes (number of samples) then became `10` and `4` on the left and right examples, respectively. Using the linear scaling rule, the LR for those batches becomes `5e-3` and `2e-3`.

# Pipeline parallelism

Pipeline parallelism requires the same batch size and same sequence length across all micro-batches in a batch, as the activation sizes must be fixed between gradient accumulation steps. Between batches, these may change, as long as `engine.reset_activation_shape()` is called so that the new shapes are communicated on the first gradient accumulation step in the batch. Enforcing similar `BxTxE` between batches may lead to smaller micro-batches. As an example, below we can see an illustration of a 2-node 2-gradient-accumulation-step (i.e. 4 micro-batches) batching for the same dataset, when preparing data for the regular DDP (left) and for the pipeline parallelism use cases (right):

![dynamic_batch_size_and_lr_microbatching](https://github.com/microsoft/DeepSpeed/assets/150697676/3fed5e1c-f2f5-4efe-a9c5-5b5e20719d45)

We can see that the pipeline use case (right) has the same `BxTxE` shape across all the 4 micro-batches in the same batch, and in order to respect that, it packs fewer samples in the batch, when compared to the standard use case (left-hand side).

# Attention Head

For an input of size `BxTxE` the attention has a shape of `TxT` for a mask of fixed size across samples of same size, or `BxTxT` for a different mask per sample (when samples have different sizes, as in the dataset above). This 3D attention matrix can be illustrated for the DDP microbatch 1 (picture above top-left, 4 sentences) as:

![dynamic_batch_size_and_lr_attn_matrix](https://github.com/microsoft/DeepSpeed/assets/150697676/707d2f17-66da-4034-8a12-a87df2044bfb)

Note the memory savings: the attention head has a size of `BxTxT`, i.e. a linear memory dependency on the batch size `B` and quadratic memory dependency on the largest sequence length `T` in the (micro-) batch. Thus, supporting a dynamic size `T` allows for an increase of `B`.

# PR overview

This PR implements dynamic batching and LR scaling. The necessary dataloader and LR scheduler can be retrieved by calling `get_dataloader_and_lr_scheduler_for_variable_batch_size`. A small explanation of that function follows:

- The logic behind the algorithms for LR scaling is in `scale_lr`;
- The partitioning of samples into batches is done by `batch_by_seqlen`.
- For pipeline parallelism, it is required that all micro-batches in a pipeline pass have the same activation shapes. This is enabled by setting to `True` the following parameters:
  - `required_microbatches_of_same_sizes` that will force the `B` dimension to be the same across all gradient accumulation steps of all dataloaders on a batch;
  - `required_microbatches_of_same_lengths` that will force the `T` dimension to be the same across all gradient accumulation steps. Works by calling the user-provided `sample_padding_fn(sentence, len)` that pads a given sentence to the argument length;
- `batch_by_seqlen` returns `microbatch_sample_ids` (the list of sample ids per micro-batch), `batch_sizes` (the effective batch sizes), and `batch_max_seqlens` (longest sequence across all microbatches in a batch)
- `dataloader_for_variable_batch_size` relies on `microbatch_sample_ids` and will iterate/collate/pad samples for every batch and return a dataloader that iterates the final (variable-size) batches;
- `lr_scheduler_for_variable_batch_size` relies on `batch_sizes` to compute the learning rate for each effective batch, taking into account the batch size and LR in the config file, and scaling the LR based on the size of each effective batch, and the scaling rule mentioned above (Linear, Square root, etc).
- Special note on the returned `lr_scheduler`, which will accept either:
  1. a user-provided `Optimizer` that will scale the learning rates (in param groups) at every batch, or
  2. a user-defined `LRScheduler`, that in this case will first get the learning rate from the scheduler and then scale it accordingly.

# Example

An example for the use case with and without pipelining is provided in file [`DeepSpeedExamples/training/data_efficiency/variable_batch_size_and_lr/variable_batch_size_and_lr_example.py`](https://github.com/deepspeedai/DeepSpeedExamples/tree/master/training/data_efficiency/variable_batch_size_and_lr). The example shows an attention head with attention of variable-sized `BxTxT` per batch, followed by a fixed size feed forward network. These are the main blocks of a Large Language Model. The feed-forward (or linear layer) that follows the attention head requires a constant input size, equivalent to the largest sentence in the whole dataset, so the output of the attention must be padded (see `feedforward: needs to convert BxTxE to BxMxE by padding extra tokens` in the code).

# Config

The example file also annotates the relevant deepspeed config with comments:

```python
config = {
    "train_batch_size": 16,
    # `train_micro_batch_size_per_gpu` tells how many sequence packs of `max_tokens` each will be collated together.
    # I.e. the number of tokens per micro batch (ie per gpu iteration) is `train_micro_batch_size_per_gpu`*`max_tokens`.
    "train_micro_batch_size_per_gpu": 2,
    "data_efficiency": {
        "enabled": True,
        # seed to be applied to all data efficiency modules, including dynamic batching
        "seed": 42,
        "data_sampling": {
            "num_workers": 0,  # dataloader num_workers argument
            "pin_memory": False,  # dataloader pin_memory argument
            "dynamic_batching": {
                # enables or disables dynamic batching
                "enabled": True,
                # how many tokens we need to fill a pack of sequences (that will be collated together as a sample)
                "max_tokens": 100,
                # Path to read the length of every sequence from, or to write it to.
                # Sequence lengths will be loaded from: {metrics_path}/seqlen/seqlen_sample_to_metric.bin and *.idx
                # If files don't exist, they'll be computed and saved on the first run, and loaded on subsequent runs.
                "metrics_path": "./curriculum_output/",
                # As batch size increases/decreases, which method to use to scale LR accordingly?
                # Options: linear, sqrt (square root), or None to disable
                "lr_scaling_method": "linear",
                # how to pick sentences to be packed into samples:
                # - dataloader: by same order as they come in with the dataloader
                # - seqlen: by sequence length (shortest to longest)
                # - random: random order using the seed in config['data_efficiency']['seed']
                "sentence_picking_order": "dataloader",  # "random" / "seqlen" / "dataloader"
                # minimum number of sequences required to reach `max_tokens`. If sentence pack is smaller, it's discarded.
                "min_batch_size": 1,
                # maximum number of sequences required to reach `max_tokens`. If sentence pack is larger, it's discarded.
                "max_batch_size": 10,
                # enable the output of microbatching information about sentence packing
                "verbose": True,
            },
        },
    },
}
```

# Future work

A follow-up PR will enable dynamic batching when calling `deepspeed.initialize`. I.e. instead of this:

```python
engine, _, _, _ = deepspeed.initialize(config=config, model=model)
dataloader, lr_scheduler, _ = get_dataloader_and_lr_scheduler_for_variable_batch_size_deepspeed(...)
engine.lr_scheduler = lr_scheduler
```

we'd ideally have this:

```python
engine, _, dataloader, lr_scheduler = deepspeed.initialize(config=config, model=model)
```

where `initialize` will call internally `get_dataloader_and_lr_scheduler_for_variable_batch_size_deepspeed`.

--------- Signed-off-by: Bruno Magalhaes <bruno.magalhaes@synthesia.io> Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com> Signed-off-by: yisheng <yi.sheng@intel.com>

We should use `torch.utils.cpp_extension.ROCM_HOME` for ROCm pytorch.

```log
Traceback (most recent call last):
  File "<string>", line 2, in <module>
  File "<pip-setuptools-caller>", line 34, in <module>
  File "DeepSpeed/setup.py", line 195, in <module>
    builder.hipify_extension()
  File "DeepSpeed/op_builder/builder.py", line 750, in hipify_extension
    header_include_dirs=self.include_paths(),
                        ^^^^^^^^^^^^^^^^^^^^
  File "DeepSpeed/op_builder/dc.py", line 32, in include_paths
    return ['csrc/includes', os.path.join(torch.utils.cpp_extension.CUDA_HOME, "include")]
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen posixpath>", line 76, in join
TypeError: expected str, bytes or os.PathLike object, not NoneType
```

Signed-off-by: Hollow Man <hollowman@opensuse.org> Signed-off-by: yisheng <yi.sheng@intel.com>
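A rough sketch of the idea behind this fix, not the PR's actual change: `toolkit_include_dir` is a hypothetical helper, and in real code the two arguments would come from `torch.utils.cpp_extension.CUDA_HOME` and `torch.utils.cpp_extension.ROCM_HOME`. It takes plain values so it can run without torch.

```python
import os


def toolkit_include_dir(cuda_home, rocm_home):
    """Pick the include directory of whichever toolkit home is set.

    On a ROCm build the CUDA home can be None, and passing None to
    `os.path.join` raises the TypeError shown in the traceback above.
    """
    home = rocm_home if rocm_home is not None else cuda_home
    if home is None:
        raise RuntimeError("neither a ROCm nor a CUDA toolkit home was found")
    return os.path.join(home, "include")


print(toolkit_include_dir(cuda_home=None, rocm_home="/opt/rocm"))
```

Preferring the ROCm home whenever it is set is an assumption of this sketch; the op builder may decide which toolkit applies in a different way.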

Similar to deepspeedai#7211. When the optimizer is not specified, the optimizer will be of type `DeepSpeedZeRoOffload` instead of `DeepSpeedZeroOptimizer_Stage3` (e.g. for ZeRO-3 pure inference), and `DeepSpeedZeRoOffload` doesn't have `parameter_offload`.

https://github.com/deepspeedai/DeepSpeed/blob/56005d2b256eb81a88cba0a1984375f9663a3110/deepspeed/runtime/engine.py#L1684-L1707

```log
  File "deepspeed/runtime/engine.py", line 3919, in compile
    backend = init_z3(self, backend, compile_config, compile_kwargs, schedule)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "deepspeed/compile/init_z3.py", line 36, in init_z3
    optimizer.parameter_offload._remove_module_hooks()
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'DeepSpeedZeRoOffload' object has no attribute 'parameter_offload'
```

--------- Signed-off-by: Hollow Man <hollowman@opensuse.org> Signed-off-by: Logan Adams <loadams@microsoft.com> Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com> Co-authored-by: Logan Adams <loadams@microsoft.com> Signed-off-by: yisheng <yi.sheng@intel.com>
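One possible shape for such a guard is sketched below with stand-in classes. It is not necessarily what the PR does. It assumes, as the traceback suggests, that `_remove_module_hooks()` lives on the offload object, so when the optimizer is itself the offload object the call can go to it directly. `OffloadOnly`, `Stage3Like` and `remove_hooks` are hypothetical names.

```python
class OffloadOnly:
    """Stand-in for an optimizer object that has no `parameter_offload`."""

    def __init__(self):
        self.hooks_removed = False

    def _remove_module_hooks(self):
        self.hooks_removed = True


class Stage3Like:
    """Stand-in for an optimizer that owns a `parameter_offload` helper."""

    def __init__(self):
        self.parameter_offload = OffloadOnly()


def remove_hooks(optimizer):
    """Call `_remove_module_hooks()` on whichever object holds the hooks."""
    # Fall back to the optimizer itself when it has no `parameter_offload`.
    target = getattr(optimizer, "parameter_offload", optimizer)
    target._remove_module_hooks()
    return target


inference_only = OffloadOnly()
training = Stage3Like()
remove_hooks(inference_only)
remove_hooks(training)
print(inference_only.hooks_removed, training.parameter_offload.hooks_removed)
```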

When `with_cuda` is left at its default of `None`, Torch decides whether to add CUDA headers and libraries to the build (and therefore whether to hipify a JIT C++ extension) based on the presence of `.cu` or `.cuh` files in `sources`. https://github.com/pytorch/pytorch/blob/2a909cab1699e2be26fc7d01c7c2d20c726e1be6/torch/utils/cpp_extension.py#L1623-L1627 Some ops, such as DeepCompile, have no `.cu` or `.cuh` files in their sources but still need to be hipified on AMD, because the C++ code includes several CUDA headers. So it is better for us to control this behavior explicitly whenever it is not `build_for_cpu`; otherwise, the hipify step gets skipped. Signed-off-by: Hollow Man <hollowman@opensuse.org> Signed-off-by: yisheng <yi.sheng@intel.com>

…pspeedai#7227) Resolves deepspeedai#7223 When DeepCompile is enabled in ZeRO-3, contiguous_grad_buffer is released, so we should check and make sure it's not None before we continue. https://github.com/deepspeedai/DeepSpeed/blob/227a60c0c412ddf4619401b5d8d9d1674aee17b5/deepspeed/compile/init_z3.py#L22-L25 Signed-off-by: Hollow Man <hollowman@opensuse.org> Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com> Signed-off-by: yisheng <yi.sheng@intel.com>

Add a sentence to DeepCompile blog to recommend using the latest version. Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com> Signed-off-by: yisheng <yi.sheng@intel.com>

Signed-off-by: c8ef <c8ef@outlook.com> Signed-off-by: yisheng <yi.sheng@intel.com>

Add cpu accelerator fp16 dtype support --------- Signed-off-by: Lai, Yejing <yejing.lai@intel.com> Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com> Signed-off-by: yisheng <yi.sheng@intel.com>

Signed-off-by: Logan Adams <loadams@microsoft.com> Signed-off-by: yisheng <yi.sheng@intel.com>

This reverts commit 00b5678. Signed-off-by: yisheng <yi.sheng@intel.com>

Signed-off-by: Logan Adams <loadams@microsoft.com> Signed-off-by: yisheng <yi.sheng@intel.com>

Reword the sentence so that it reads more naturally and less robotic. Signed-off-by: yisheng <yi.sheng@intel.com>

Some systems seem not to have the `__nv_bfloat162` definition, so a placeholder was introduced. Newer CUDA libs have that definition, which breaks the compile process. This patch adds the official `cuda_bf16.h` guard while keeping the old code and a safety assert in case the definition should change in the future. See deepspeedai#7190 for reference. --------- Signed-off-by: LosCrossos <165311345+loscrossos@users.noreply.github.com> Signed-off-by: LosCrossos <165311345+mytait@users.noreply.github.com> Co-authored-by: LosCrossos <165311345+mytait@users.noreply.github.com> Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com> Signed-off-by: yisheng <yi.sheng@intel.com>

adding `Makefile` with `make format` and `make test` to make things easier to maintain. --------- Signed-off-by: Stas Bekman <stas@stason.org> Signed-off-by: yisheng <yi.sheng@intel.com>

This PR addresses this issue deepspeedai#7236. I might have reverted some of the recent changes introduced in this [PR](deepspeedai#6932), which was necessary to remove a misaligned address issue on the CUDA kernel. I will get back to this and try to make the necessary changes for the other pass. cc: @mrwyattii @jeffra --------- Co-authored-by: Reza Yazdani <reza.yazdani@snowflake.com> Co-authored-by: Reza Yazdani <rezay@microsoft.com> Co-authored-by: Jeff Rasley <jeffra45@gmail.com> Co-authored-by: Michael Wyatt <michael.wyatt@snowflake.com> Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com> Signed-off-by: yisheng <yi.sheng@intel.com>

Signed-off-by: Ma, Liangliang <liangliang.ma@intel.com> Signed-off-by: Logan Adams <loadams@microsoft.com> Co-authored-by: Logan Adams <loadams@microsoft.com> Signed-off-by: yisheng <yi.sheng@intel.com>

cc: @Liangliang-Ma --------- Signed-off-by: Logan Adams <loadams@microsoft.com> Signed-off-by: yisheng <yi.sheng@intel.com>

…7288) Signed-off-by: Logan Adams <loadams@microsoft.com> Signed-off-by: Olatunji Ruwase <tunji.ruwase@snowflake.com> Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com> Signed-off-by: yisheng <yi.sheng@intel.com>

This PR rolls back deepspeedai#6726, which caused deepspeedai#7116. --------- Signed-off-by: Guokai Ma <guokai.ma@gmail.com> Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com> Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com> Signed-off-by: yisheng <yi.sheng@intel.com>

Signed-off-by: yisheng <yi.sheng@intel.com>

yisustc force-pushed the sy/xccl_enable branch from c453611 to 67c80d5 · March 6, 2025 07:26

yisustc force-pushed the sy/xccl_enable branch 2 times, most recently from 93e0e15 to f1f1cd7 · March 10, 2025 06:31

yisustc and others added 23 commits May 21, 2025 10:31

support XCCL on deepspeed side

51f174e

Signed-off-by: yisheng <yi.sheng@intel.com>

fix keep_module_on_host (deepspeedai#7112)

0d64032

Reapply deepspeedai#6846. FYI @oelayan7 --------- Signed-off-by: inkcherry <mingzhi.liu@intel.com> Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com> Signed-off-by: yisheng <yi.sheng@intel.com>

Add sequential pytest mark to TestNVMeCheckpointing to resolve pytest…

64791b9

… forked hangs (deepspeedai#7131) Signed-off-by: Logan Adams <loadams@microsoft.com> Signed-off-by: yisheng <yi.sheng@intel.com>

Update CONTRIBUTING.md to reflect changes from CLA to DCO (deepspeedai#7135)

0f27e9c

Copy changes from deepspeedai/DeepSpeed-MII#558. Fixes issue where docs still referenced CLA. --------- Signed-off-by: Logan Adams <loadams@microsoft.com> Signed-off-by: yisheng <yi.sheng@intel.com>

Avoid missing attr error (deepspeedai#7133)

2e60410

Fix deepspeedai#7132 Signed-off-by: Olatunji Ruwase <olruwase@microsoft.com> Signed-off-by: yisheng <yi.sheng@intel.com>

fix leak of z3 buffer

7da6def

Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com> Signed-off-by: yisheng <yi.sheng@intel.com>

[NFC] Typo fix in SP layer. (deepspeedai#7152)

6fc960c

Signed-off-by: c8ef <c8ef@outlook.com> Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com> Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Signed-off-by: yisheng <yi.sheng@intel.com>

Link AutoTP blog in the front page (deepspeedai#7167)

86f2e31

Signed-off-by: Hongwei <hongweichen@microsoft.com> Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com> Signed-off-by: yisheng <yi.sheng@intel.com>

fix seq_parallel_communication_data_type constant. (deepspeedai#7175)

39a219a

A user couldn't override `seq_parallel_communication_data_type` because of a typo in the constant's name; this PR fixes it. Signed-off-by: yisheng <yi.sheng@intel.com>
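As a hedged illustration of how a misspelled key constant can make a setting impossible to override: the sketch below is not DeepSpeed's code. The misspelled key, the default value, and the helper `read_comm_dtype` are all invented for the example; only the key `seq_parallel_communication_data_type` comes from the commit message.

```python
# Hypothetical default, for illustration only.
DEFAULT_DTYPE = "fp32"

# Hypothetical misspelling: this constant does not match the key users write.
BROKEN_KEY = "seq_parallel_comm_data_type"
# The key named in the commit message.
FIXED_KEY = "seq_parallel_communication_data_type"


def read_comm_dtype(config, key):
    """Look up the dtype setting; dict.get falls back silently if the key is absent."""
    return config.get(key, DEFAULT_DTYPE)


# A user config that tries to override the setting.
user_config = {"seq_parallel_communication_data_type": "bf16"}

# With the misspelled constant the override is ignored without any error.
ignored = read_comm_dtype(user_config, BROKEN_KEY)
# With the corrected constant the user's value is picked up.
honored = read_comm_dtype(user_config, FIXED_KEY)
```

Because the lookup falls back to a default rather than raising, a typo like this tends to go unnoticed until someone checks the effective value.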

Fix typos in GDS blog (deepspeedai#7177)

86c1d9d

Signed-off-by: Logan Adams <loadams@microsoft.com> Signed-off-by: yisheng <yi.sheng@intel.com>

Update version.txt after 0.16.5 release (deepspeedai#7180)

1b0f96f

**Auto-generated PR to update version.txt after a DeepSpeed release** Released version - 0.16.5 Author - @loadams Co-authored-by: loadams <loadams@users.noreply.github.com> Signed-off-by: yisheng <yi.sheng@intel.com>

HollowMan6 and others added 21 commits May 21, 2025 10:31

Update version.txt after 0.16.7 release (deepspeedai#7232)

22b46cf

**Auto-generated PR to update version.txt after a DeepSpeed release** Released version - 0.16.7 Author - @loadams Co-authored-by: loadams <loadams@users.noreply.github.com> Signed-off-by: yisheng <yi.sheng@intel.com>

Recommend using latest (deepspeedai#7233)

d87acac

Add a sentence to DeepCompile blog to recommend using the latest version. Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com> Signed-off-by: yisheng <yi.sheng@intel.com>

[NFC] Fix comment related to SP group (deepspeedai#7234)

e0ee4ea

Signed-off-by: c8ef <c8ef@outlook.com> Signed-off-by: yisheng <yi.sheng@intel.com>

Add cpu accelerator fp16 dtype support (deepspeedai#7207)

0343a57

Add cpu accelerator fp16 dtype support --------- Signed-off-by: Lai, Yejing <yejing.lai@intel.com> Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com> Signed-off-by: yisheng <yi.sheng@intel.com>

Update torch cpu test version

49c6937

Signed-off-by: Logan Adams <loadams@microsoft.com> Signed-off-by: yisheng <yi.sheng@intel.com>

Revert "Update torch cpu test version"

862d4a2

This reverts commit 00b5678. Signed-off-by: yisheng <yi.sheng@intel.com>

Update CPU torch version to 2.7 (deepspeedai#7241)

86f44d4

Signed-off-by: Logan Adams <loadams@microsoft.com> Signed-off-by: yisheng <yi.sheng@intel.com>

Update README.md (deepspeedai#7246)

7a81da3

Make the sentence read more naturally and less robotic. Signed-off-by: yisheng <yi.sheng@intel.com>

add Makefile to ease maintenance (deepspeedai#7267)

8a0e979

adding `Makefile` with `make format` and `make test` to make things easier to maintain. --------- Signed-off-by: Stas Bekman <stas@stason.org> Signed-off-by: yisheng <yi.sheng@intel.com>

[XPU] update xpu-max1100 CI workflow to torch 2.7 (deepspeedai#7284)

f407e05

Signed-off-by: Ma, Liangliang <liangliang.ma@intel.com> Signed-off-by: Logan Adams <loadams@microsoft.com> Co-authored-by: Logan Adams <loadams@microsoft.com> Signed-off-by: yisheng <yi.sheng@intel.com>

Fix issues XPU tests hit with extra-index-url (deepspeedai#7291)

ad4dc62

cc: @Liangliang-Ma --------- Signed-off-by: Logan Adams <loadams@microsoft.com> Signed-off-by: yisheng <yi.sheng@intel.com>

Update patch version after 0.16.8 release (deepspeedai#7296)

e702877

Signed-off-by: yisheng <yi.sheng@intel.com>

fix non-torch failure, if the torch version is too old

49e1407

Signed-off-by: yisheng <yi.sheng@intel.com>
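The guard this commit title describes, and that the review discussion suggested (checking for the `xpu` attribute before using it), might look roughly like the sketch below. It is not taken from the PR diff. The helper name `pick_xpu_comm_backend` is hypothetical, and the sketch assumes that XCCL availability is exposed as `torch.distributed.is_xccl_available()` and that the torch-CCL backend is registered as `"ccl"`. The function takes the `torch` module as an argument so it can be exercised without torch installed.

```python
from types import SimpleNamespace


def pick_xpu_comm_backend(torch_module):
    """Return "xccl", "ccl", or None for a torch-like module (hypothetical helper)."""
    # Old torch builds have no `xpu` attribute at all, so check before touching it.
    xpu = getattr(torch_module, "xpu", None)
    if xpu is None or not xpu.is_available():
        return None
    # Assumption: newer torch builds report XCCL support through
    # torch.distributed.is_xccl_available(); older builds lack the function.
    dist = getattr(torch_module, "distributed", None)
    is_xccl_available = getattr(dist, "is_xccl_available", None)
    if callable(is_xccl_available) and is_xccl_available():
        return "xccl"
    # Otherwise keep the older torch-CCL path.
    return "ccl"


# Stand-ins for three torch builds, so the sketch runs without torch installed.
old_torch = SimpleNamespace()
ccl_torch = SimpleNamespace(
    xpu=SimpleNamespace(is_available=lambda: True),
    distributed=SimpleNamespace(),
)
xccl_torch = SimpleNamespace(
    xpu=SimpleNamespace(is_available=lambda: True),
    distributed=SimpleNamespace(is_xccl_available=lambda: True),
)
```

Using `getattr` with a default at each step means a torch build that predates XPU support returns `None` instead of raising `AttributeError`, which is the failure mode the no-torch tests were hitting according to the discussion.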

yisustc force-pushed the sy/xccl_enable branch from 0dda210 to 49e1407 Compare May 21, 2025 02:32

yisustc requested review from GuanhuaWang, hwchen2017, jomayeri, loadams, tjruwase and tohtana as code owners May 21, 2025 02:32

yisustc mentioned this pull request May 21, 2025

[XPU] Support XCCL on deepspeed side #7299

Merged

yisustc closed this May 21, 2025

[XPU] Support XCCL on deepspeed side #7113

yisustc wants to merge 59 commits into deepspeedai:master from yisustc:sy/xccl_enable

20 participants
