1. Introduction

Horovod_gcu is an enterprise-grade version of the open-source Horovod customized specifically for Enflame GCU; it currently supports TensorFlow and PyTorch. Horovod_gcu aims to support distributed training clusters at the scale of a thousand cards and beyond, with optimal performance, ease of use, and the best fit for Enflame GCU. Real-world project experience has shown that the current Horovod_gcu release can sustain a scaling efficiency of 0.90+ for ResNet50 v1.5/v3.0 on thousand-card training clusters.

2. Prerequisites

  • OpenMPI, currently version 4.0.5, bundled in the training container

  • Python >= 3.6, on a Linux system

  • The TensorFlow build from the Enflame TopsRider installation package; currently only TensorFlow v1 is supported

  • The PyTorch_dtu build from the Enflame TopsRider installation package, together with the matching PyTorch version

  • The tops-sdk package from the Enflame TopsRider installation package

  • The tops-eccl package from the Enflame TopsRider installation package

  • The tops-models package from the Enflame TopsRider installation package

These are the minimum software dependencies for running distributed training with Horovod_gcu. If you only run TensorFlow workloads, installing TensorFlow alone is sufficient; if you only run PyTorch workloads, installing PyTorch + PyTorch_dtu is sufficient.

3. Installation and Deployment

3.1. OpenMPI

First make sure you are using the training-specific Docker image released by Enflame; a container started from it already ships with openmpi-4.0.5, as shown below:

# mpirun -version
+ mpirun.real -vv --allow-run-as-root -version
mpirun.real (OpenRTE) 4.0.5

3.2. TopsRider

Following the TopsRider installation guide, install: tops-sdk, tops-eccl, TensorFlow or PyTorch + PyTorch_dtu, and tops-models.

3.3. horovod_gcu

Taking horovod_gcu v1.1.0 for Python 3.6 as an example, the installation is as follows:

# pip3.6 install horovod_gcu-1.1.0-cp36-cp36m-linux_x86_64.whl
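
A quick sanity check after installation (a minimal sketch; it assumes the wheel exposes the standard horovod.torch module used in the PyTorch examples below, or horovod.tensorflow for the TensorFlow case):

# Prints the Horovod world size and this process's rank (1 and 0 when run standalone).
python3.6 -c "import horovod.torch as hvd; hvd.init(); print(hvd.size(), hvd.rank())"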

4. Usage

4.1. Horovod_GCU API

For the Horovod API, see https://horovod.readthedocs.io/en/stable/api.html
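
The examples in this guide rely on a small set of core Horovod calls; a minimal orientation sketch (standard Horovod API, documented at the link above):

import horovod.torch as hvd    # or horovod.tensorflow as hvd when using TensorFlow

hvd.init()                     # initialize Horovod (one process per training card)
size = hvd.size()              # total number of workers in the job
rank = hvd.rank()              # global rank of this worker, 0 .. size-1
local_rank = hvd.local_rank()  # rank of this worker within its host, used to pin a card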

4.2. Horovod_GCU with TensorFlow

To use Horovod with TensorFlow, first make the following changes to your training script:

1) Initialize Horovod with hvd.init();

2) Pin each process to a single GPU (one GPU per process) using local_rank():

For TensorFlow v1:

config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

3) Scale the learning rate with the number of training workers; as the worker count grows, the learning rate must be adjusted accordingly;
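
For instance, following the linear-scaling convention used in the full example below (0.01 is just a placeholder single-card base rate):

# Scale the single-card learning rate by the number of workers.
opt = tf.train.AdagradOptimizer(0.01 * hvd.size())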

4) Wrap the optimizer with the distributed optimizer hvd.DistributedOptimizer;
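
The wrapper leaves the rest of the training code unchanged; it averages gradients across all workers before they are applied:

# Wrap the original optimizer with Horovod's distributed optimizer.
opt = hvd.DistributedOptimizer(opt)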

5) Call broadcast to initialize the variable state across all training cards in the cluster;

For TensorFlow v1:

When using `MonitoredTrainingSession`, add `hvd.BroadcastGlobalVariablesHook(0)`;
When not using `MonitoredTrainingSession`, run the `hvd.broadcast_global_variables` op after the global variables have been initialized.
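
The hook variant appears in the full example below. For the second case, a minimal sketch (reusing the `config` from step 2 and assuming a plain `tf.Session`):

# Without MonitoredTrainingSession: broadcast rank 0's variables after initialization.
bcast_op = hvd.broadcast_global_variables(0)
with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(bcast_op)  # every worker now starts from identical variable values
    # ... run training ops ...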

6) Save checkpoints only on worker 0

For TensorFlow v1:

Pass `checkpoint_dir=None` to `tf.train.MonitoredTrainingSession` if `hvd.rank() != 0`.

7) TensorFlow v1 example

import tensorflow as tf
import horovod.tensorflow as hvd


# Initialize Horovod
hvd.init()

# Pin GPU to be used to process local rank (one GPU per process)
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Build model...
loss = ...
opt = tf.train.AdagradOptimizer(0.01 * hvd.size())

# Add Horovod Distributed Optimizer
opt = hvd.DistributedOptimizer(opt)

# Add hook to broadcast variables from rank 0 to all other processes during
# initialization.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]

# Make training operation
train_op = opt.minimize(loss)

# Save checkpoints only on worker 0 to prevent other workers from corrupting them.
checkpoint_dir = '/tmp/train_logs' if hvd.rank() == 0 else None

# The MonitoredTrainingSession takes care of session initialization,
# restoring from a checkpoint, saving to a checkpoint, and closing when done
# or an error occurs.
with tf.train.MonitoredTrainingSession(checkpoint_dir=checkpoint_dir,
                                       config=config,
                                       hooks=hooks) as mon_sess:
  while not mon_sess.should_stop():
    # Perform synchronous training.
    mon_sess.run(train_op)

4.3. Horovod_GCU with PyTorch

To use Horovod with PyTorch, first make the following changes to your training script:

1) Initialize Horovod with hvd.init();

2) Pin each process to a single GPU (one GPU per process) using local_rank():

if torch.cuda.is_available():
    torch.cuda.set_device(hvd.local_rank())

3) Scale the learning rate with the number of training workers; as the worker count grows, the learning rate must be adjusted accordingly;
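
A minimal sketch, assuming an SGD optimizer with a placeholder single-card base rate of 0.01, scaled linearly by the worker count:

# Scale the single-card learning rate by the number of workers.
optimizer = optim.SGD(model.parameters(), lr=0.01 * hvd.size())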

4) Wrap the optimizer with the distributed optimizer hvd.DistributedOptimizer;
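
As in the full example below, the wrapper is given the model's named parameters so it can match gradients to the corresponding allreduce operations:

# Wrap the optimizer so gradients are allreduced (averaged) across workers.
optimizer = hvd.DistributedOptimizer(optimizer,
                                     named_parameters=model.named_parameters())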

5) Call broadcast to initialize the variable state across all training cards in the cluster:

hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

6) Save checkpoints only on worker 0
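
The condensed example below omits checkpoint saving; a minimal sketch, where `checkpoint_path` is a hypothetical destination of your choosing:

# Only worker 0 writes checkpoints, so other workers cannot corrupt them.
if hvd.rank() == 0:
    torch.save({'model': model.state_dict(),
                'optimizer': optimizer.state_dict()},
               checkpoint_path)  # hypothetical path, set by your script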

7) PyTorch example

A condensed example follows; the open-source Horovod repository also provides complete training examples:

import torch
import torch.optim as optim
import torch.nn.functional as F
import horovod.torch as hvd

# Initialize Horovod
hvd.init()

# Pin GPU to be used to process local rank (one GPU per process)
torch.cuda.set_device(hvd.local_rank())

# Define dataset...
train_dataset = ...

# Partition dataset among workers using DistributedSampler
train_sampler = torch.utils.data.distributed.DistributedSampler(
    train_dataset, num_replicas=hvd.size(), rank=hvd.rank())

train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=..., sampler=train_sampler)

# Build model...
model = ...
model.cuda()

# SGD requires an explicit learning rate; scale the base rate by the number of workers (step 3).
optimizer = optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Add Horovod Distributed Optimizer
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())

# Broadcast initial parameters and optimizer state from rank 0 to all other processes.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

for epoch in range(100):
    for batch_idx, (data, target) in enumerate(train_loader):
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()
        if batch_idx % args.log_interval == 0:
            print('Train Epoch: {} [{}/{}]\tLoss: {}'.format(
                epoch, batch_idx * len(data), len(train_sampler), loss.item()))

5. Running Examples

5.1. TensorFlow

# run.sh is a model training script
mpirun -np 8 --allow-run-as-root --bind-to numa bash ./run.sh

5.2. PyTorch

# train.py is the training script
horovodrun -np 4 python ./train.py
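
For multi-host jobs, both launchers accept a host list; a sketch assuming two hypothetical hosts named node1 and node2 with 8 cards each:

# horovodrun: -H takes host:slots pairs; -np is the total number of processes.
horovodrun -np 16 -H node1:8,node2:8 python ./train.py

# Equivalent mpirun launch for the TensorFlow script:
mpirun -np 16 -H node1:8,node2:8 --allow-run-as-root --bind-to numa bash ./run.sh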

6. Q&A

Q: What are the differences between Horovod_gcu and the open-source Horovod?

A: Horovod_gcu is a best-practice version customized specifically for Enflame GCU; it delivers better performance and has been proven in practice on thousand-card clusters.

Q: Which license does Horovod_gcu follow?

A: The same Apache 2.0 license as the open-source Horovod; see:

https://github.com/horovod/horovod/blob/master/LICENSE