1. Introduction
Horovod_gcu is an enterprise-grade distribution of the open-source Horovod, customized for the Enflame GCU; it currently supports TensorFlow and PyTorch. Horovod_gcu targets distributed training clusters of a thousand cards and beyond, with the goals of optimal performance, ease of use, and the best fit for the Enflame GCU. Field deployments have shown that the current Horovod_gcu release sustains a scaling efficiency (speedup ratio) of 0.90+ on ResNet-50 v1.5/v3.0 thousand-card training clusters.
2. Prerequisites
- OpenMPI: the currently supported version is 4.0.5, which ships with the training container
- Python >= 3.6, on a Linux system
- The TensorFlow build from the Enflame TopsRider package; currently only TensorFlow v1 is supported
- The PyTorch_dtu build from the Enflame TopsRider package, plus the matching PyTorch version
- tops-sdk from the Enflame TopsRider package
- tops-eccl from the Enflame TopsRider package
- tops-models from the Enflame TopsRider package
These are the minimum software dependencies for Horovod_gcu to run distributed training. If you only run TensorFlow, installing TensorFlow alone is enough; if you only run PyTorch, installing PyTorch plus PyTorch_dtu is enough.
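As a quick sanity check that the framework half of these dependencies is importable, the hedged sketch below assumes only the standard import names tensorflow and torch; skip whichever framework you do not use:
import importlib

# Probe the two supported frameworks; each is optional depending on
# which stack you installed (see the note above).
for module_name in ('tensorflow', 'torch'):
    try:
        importlib.import_module(module_name)
        print(module_name, 'is importable')
    except ImportError:
        print(module_name, 'is not installed')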
3. Installation and Deployment
3.1. OpenMPI
First make sure you are using the training-dedicated Docker image released by Enflame; after starting the container, openmpi-4.0.5 is available out of the box, as shown below:
# mpirun -version
+ mpirun.real -vv --allow-run-as-root -version
mpirun.real (OpenRTE) 4.0.5
3.2. TopsRider
Following the TopsRider installation guide, install tops-sdk, tops-eccl, TensorFlow / PyTorch + PyTorch_dtu, and tops-models.
3.3. horovod_gcu
Taking horovod_gcu v1.1.0 for Python 3.6 as an example (the wheel tag cp36 matches Python 3.6), the installation is:
# pip3.6 install horovod_gcu-1.1.0-cp36-cp36m-linux_x86_64.whl
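A minimal post-install check, assuming the wheel keeps the upstream package name horovod and its __version__ attribute (as upstream Horovod does):
import horovod

# Expect '1.1.0' for the wheel installed above, assuming the package
# name and version attribute match upstream Horovod.
print(horovod.__version__)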
4. Usage
4.1. Horovod_GCU API
For the Horovod API, see https://horovod.readthedocs.io/en/stable/api.html
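As a quick orientation before the framework-specific walkthroughs below, this sketch exercises the core process-topology calls that both horovod.tensorflow and horovod.torch expose (shown here with the torch binding):
import horovod.torch as hvd  # horovod.tensorflow exposes the same core calls

hvd.init()               # must be called before any other Horovod API
print(hvd.size())        # total number of workers in the job
print(hvd.rank())        # this worker's global rank, 0..size-1
print(hvd.local_rank())  # this worker's rank within its own node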
4.2. Horovod_GCU with TensorFlow
To program with Horovod + TensorFlow, first make the following changes in the training script:
1) Initialize Horovod with hvd.init();
2) Pin each process to a single GPU by its local_rank() (one GPU per process):
For TensorFlow v1:
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())
3) Scale the learning rate with the number of training workers; as the number of workers grows, the learning rate needs to be adjusted accordingly;
4) Wrap the optimizer with hvd.DistributedOptimizer;
5) Broadcast the initial variable state from rank 0 to every training card in the cluster;
For TensorFlow v1:
When using `MonitoredTrainingSession`, add `hvd.BroadcastGlobalVariablesHook(0)`;
When not using `MonitoredTrainingSession`, run the `hvd.broadcast_global_variables` op after global variable initialization (see the plain-session sketch after the example below).
6) Save checkpoints only on worker 0;
For TensorFlow v1:
Pass `checkpoint_dir=None` to `tf.train.MonitoredTrainingSession` if `hvd.rank() != 0`.
7) TensorFlow v1 example:
import tensorflow as tf
import horovod.tensorflow as hvd

# Initialize Horovod
hvd.init()

# Pin GPU to be used to process local rank (one GPU per process)
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Build model...
loss = ...
opt = tf.train.AdagradOptimizer(0.01 * hvd.size())

# Add Horovod Distributed Optimizer
opt = hvd.DistributedOptimizer(opt)

# Add hook to broadcast variables from rank 0 to all other processes during
# initialization.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]

# Make training operation
train_op = opt.minimize(loss)

# Save checkpoints only on worker 0 to prevent other workers from corrupting them.
checkpoint_dir = '/tmp/train_logs' if hvd.rank() == 0 else None

# The MonitoredTrainingSession takes care of session initialization,
# restoring from a checkpoint, saving to a checkpoint, and closing when done
# or an error occurs.
with tf.train.MonitoredTrainingSession(checkpoint_dir=checkpoint_dir,
                                       config=config,
                                       hooks=hooks) as mon_sess:
    while not mon_sess.should_stop():
        # Perform synchronous training.
        mon_sess.run(train_op)
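For the step 5 variant that does not use `MonitoredTrainingSession`, here is a minimal sketch of the plain-`tf.Session` path; the model construction and `num_steps` are placeholders:
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

loss = ...  # build model as in the example above
opt = hvd.DistributedOptimizer(tf.train.AdagradOptimizer(0.01 * hvd.size()))
train_op = opt.minimize(loss)

with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    # Step 5 without MonitoredTrainingSession: broadcast the freshly
    # initialized variables from rank 0 to all other workers.
    sess.run(hvd.broadcast_global_variables(0))
    for _ in range(num_steps):  # num_steps is a placeholder
        sess.run(train_op)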
4.3. Horovod with PyTorch
To program with Horovod + PyTorch, first make the following changes in the training script:
1) Initialize Horovod with hvd.init();
2) Pin each process to a single GPU by its local_rank() (one GPU per process):
if torch.cuda.is_available():
    torch.cuda.set_device(hvd.local_rank())
3) Scale the learning rate with the number of training workers; as the number of workers grows, the learning rate needs to be adjusted accordingly;
4) Wrap the optimizer with hvd.DistributedOptimizer;
5) Broadcast the initial variable state from rank 0 to every training card in the cluster:
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)
6) Save checkpoints only on worker 0 (see the sketch after the example below);
7) PyTorch example:
import torch
import torch.optim as optim
import torch.nn.functional as F
import horovod.torch as hvd

# Initialize Horovod
hvd.init()

# Pin GPU to be used to process local rank (one GPU per process)
torch.cuda.set_device(hvd.local_rank())

# Define dataset...
train_dataset = ...

# Partition dataset among workers using DistributedSampler
train_sampler = torch.utils.data.distributed.DistributedSampler(
    train_dataset, num_replicas=hvd.size(), rank=hvd.rank())
train_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=..., sampler=train_sampler)

# Build model...
model = ...
model.cuda()

# SGD needs an explicit learning rate; scale it with hvd.size() per step 3
# (the base value 0.01 is illustrative).
optimizer = optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Add Horovod Distributed Optimizer
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())

# Broadcast parameters from rank 0 to all other processes.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)

for epoch in range(100):
    for batch_idx, (data, target) in enumerate(train_loader):
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()
        # args.log_interval is assumed to be parsed elsewhere (e.g. argparse)
        if batch_idx % args.log_interval == 0:
            print('Train Epoch: {} [{}/{}]\tLoss: {}'.format(
                epoch, batch_idx * len(data), len(train_sampler), loss.item()))
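For step 6, PyTorch has no session hook, so a common pattern is to guard the save with `hvd.rank() == 0`; here is a minimal sketch continuing from the example above, with an illustrative path and contents:
# Save checkpoints only on worker 0 so other workers cannot corrupt them;
# every rank may call this function safely.
def save_checkpoint(model, optimizer, epoch, path='/tmp/checkpoint.pt'):
    if hvd.rank() == 0:
        torch.save({'model': model.state_dict(),
                    'optimizer': optimizer.state_dict(),
                    'epoch': epoch}, path)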
5. Running Examples
5.1. TensorFlow
# run.sh is a model training script
mpirun -np 8 --allow-run-as-root --bind-to numa bash ./run.sh
5.2. PyTorch
# train.py is the training script
horovodrun -np 4 python ./train.py
6. Q&A
Q: What is the difference between Horovod_gcu and the open-source Horovod?
A: horovod_gcu is a best-practice version customized specifically for the Enflame GCU; it delivers better performance and has been proven in practice on thousand-card clusters.
Q: What license does Horovod_gcu follow?
A: The same Apache 2.0 license as the open-source Horovod; see:
https://github.com/horovod/horovod/blob/master/LICENSE