Hello everyone,
I am trying to train on 4 GPUs with the Megatron-DeepSpeed repo, using the script examples_deepspeed/finetune_hf_llama with TP=4, PP=1.
I am now profiling communication calls, and I notice that for huggyllama/llama-7b there is an allreduce of around 3.14 GB just before the optimizer is called. Its size seems to match the total weight gradients computed in the preceding forward/backward stages. I am using ZeRO stage 0 and have disabled all other parallelism except TP=4.
Can someone help me understand why that allreduce is needed? I expected that during the backward pass each GPU computes gradients only for its own share of the weights and then updates only those weights, without needing to allreduce the gradients across GPUs.
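As a rough sanity check on the 3.14 GB figure, here is a minimal sketch of the arithmetic. It assumes the standard LLaMA-7B dimensions, 2-byte (fp16/bf16) gradients, an even split of all parameters across the 4 TP ranks (norm weights are actually replicated, and vocab padding is ignored), and that the profiler reports GiB. The helper names are hypothetical and not part of Megatron-DeepSpeed.

```python
# Back-of-the-envelope check: per-rank gradient size for LLaMA-7B with TP=4.
# All helper names here are hypothetical, for illustration only.

# Standard LLaMA-7B dimensions.
VOCAB, HIDDEN, LAYERS, FFN = 32000, 4096, 32, 11008


def llama_param_count(vocab, hidden, layers, ffn):
    """Count parameters: embeddings, LM head, attention, MLP, RMSNorm."""
    attention = 4 * hidden * hidden   # q, k, v, o projections
    mlp = 3 * hidden * ffn            # gate, up, down projections
    norms = 2 * hidden                # two RMSNorm weights per layer
    per_layer = attention + mlp + norms
    # Input embedding + output head + final norm.
    return layers * per_layer + 2 * vocab * hidden + hidden


def per_rank_grad_bytes(total_params, tp, pp=1, bytes_per_grad=2):
    """Assumes parameters are split evenly across TP*PP ranks."""
    return total_params // (tp * pp) * bytes_per_grad


def data_parallel_size(world_size, tp, pp=1):
    """Ranks left over for data parallelism after TP and PP."""
    return world_size // (tp * pp)


total = llama_param_count(VOCAB, HIDDEN, LAYERS, FFN)
grad_gib = per_rank_grad_bytes(total, tp=4) / 2**30
dp = data_parallel_size(world_size=4, tp=4, pp=1)

print(f"total params:        {total:,}")
print(f"per-rank grads:      {grad_gib:.2f} GiB")
print(f"data-parallel size:  {dp}")
```

Under these assumptions the per-rank gradient shard comes out at about 3.14 GiB, which is consistent with the observed allreduce carrying one rank's full set of gradients, and the data-parallel size is 1. This only confirms the size; it does not by itself explain why the call is issued.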