Distributed Training Frameworks Reading List

  • Horovod: Fast and Easy Distributed Deep Learning in TensorFlow (ArXiv’18) [PDF] [Code]
    • Further reading: Horovod - Distributed TensorFlow Made Easy [Original Link]
    • Further reading: Meet Horovod: Uber’s Open Source Distributed Deep Learning Framework for TensorFlow [Original Link]
  • A Generic Communication Scheduler for Distributed DNN Training Acceleration (SOSP’19) [PDF] [Reading Notes]
    • Framework: ByteScheduler
    • An efficient communication scheduling method designed by ByteDance
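  • PyTorch Distributed: Experiences on Accelerating Data Parallel Training (VLDB’20) [PDF] [Reading Notes]
    • Module: torch.nn.parallel.DistributedDataParallel
    • Covers the design, implementation, and evaluation of the PyTorch DDP module. Gradient communication is overlapped with the backward pass, and small tensors are grouped into buckets to make communication more efficient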
  • A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters (OSDI’20) [PDF] [Code] [Reading Notes]
    • Framework: BytePS
    • ByteDance Byt
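
The gradient bucketing noted for PyTorch DDP above can be illustrated with a small, framework-free sketch. It is not PyTorch's implementation. The function name `assign_buckets` and the byte sizes are hypothetical, chosen only for illustration. The sketch assumes gradients become ready roughly in reverse parameter order during the backward pass, so it fills buckets from the last parameter to the first. Each full bucket could then be all-reduced while earlier layers are still computing gradients.

```python
from typing import List, Sequence


def assign_buckets(sizes_bytes: Sequence[int], cap_bytes: int) -> List[List[int]]:
    """Greedily group parameter indices into buckets of at most cap_bytes.

    Indices are visited in reverse order, as an approximation of the order
    in which gradients become ready during backward (an assumption of this
    sketch). A single tensor larger than the cap gets a bucket of its own.
    """
    buckets: List[List[int]] = []
    current: List[int] = []
    current_bytes = 0
    for idx in reversed(range(len(sizes_bytes))):
        size = sizes_bytes[idx]
        # Close the current bucket if adding this tensor would exceed the cap.
        if current and current_bytes + size > cap_bytes:
            buckets.append(current)
            current, current_bytes = [], 0
        current.append(idx)
        current_bytes += size
    if current:
        buckets.append(current)
    return buckets


if __name__ == "__main__":
    # Hypothetical gradient sizes in bytes for five parameters.
    print(assign_buckets([4, 4, 16, 2, 2], cap_bytes=8))
```

In PyTorch itself, the bucket size is controlled by the `bucket_cap_mb` argument of `DistributedDataParallel`. The cap in this sketch plays an analogous role, although the real module handles many details that are omitted here.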
