GPU Sharing

  • AntMan: Dynamic Scaling on GPU Clusters for Deep Learning (OSDI’20) [Notes] [PDF] [Code]
    • Fine-grained scheduling of GPU memory and ops across co-located jobs in a GPU cluster, improving cluster utilization (see the sketch after this list)
  • Zico: Efficient GPU Memory Sharing for Concurrent DNN Training (ATC’21)
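A minimal sketch of the co-location idea behind AntMan, not its actual code: a resource-guaranteed job and a best-effort job share one GPU, and the scheduler dynamically shrinks the best-effort job's memory quota and pauses its op launches whenever the guaranteed job's demand grows. All class and method names here are hypothetical.

```python
# Hypothetical interfaces illustrating dynamic memory/op scaling
# between two co-located training jobs on a single GPU.

class JobHandle:
    """Handle to a co-located training job (hypothetical interface)."""
    def __init__(self, name, memory_quota_mb):
        self.name = name
        self.memory_quota_mb = memory_quota_mb
        self.paused = False

    def set_memory_quota(self, mb):
        # In AntMan this is enforced inside the DL framework's allocator;
        # tensors above the quota are spilled to host memory.
        self.memory_quota_mb = mb

    def throttle_ops(self, paused):
        # In AntMan the framework's executor delays launching the
        # best-effort job's ops so the guaranteed job's kernels run first.
        self.paused = paused


def rebalance(guaranteed, best_effort, gpu_total_mb, guaranteed_demand_mb):
    """Give the guaranteed job its full demand; best-effort gets the rest."""
    leftover = gpu_total_mb - guaranteed_demand_mb
    if leftover <= 0:
        # No headroom: pause the best-effort job's op launches entirely.
        best_effort.set_memory_quota(0)
        best_effort.throttle_ops(paused=True)
    else:
        best_effort.set_memory_quota(leftover)
        best_effort.throttle_ops(paused=False)
    guaranteed.set_memory_quota(guaranteed_demand_mb)


# Example: a 16 GB GPU where the guaranteed job's working set grows.
if __name__ == "__main__":
    prod = JobHandle("guaranteed", memory_quota_mb=8000)
    spot = JobHandle("best-effort", memory_quota_mb=8000)
    for demand in (8000, 12000, 16000):
        rebalance(prod, spot, gpu_total_mb=16000, guaranteed_demand_mb=demand)
        print(demand, spot.memory_quota_mb, spot.paused)
```

The point of the sketch is the control loop: utilization improves because the best-effort job soaks up whatever memory and compute the guaranteed job is not currently using, and yields it back immediately when demand rises.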

Memory Swap

  • Doing More with Less: Training Large DNN Models on Commodity Servers for the Masses (HotOS’21) [PDF]
    • Proposes Harmony to address training large DNN models on small commodity servers
    • Key idea: decompose the data, model, and ops into small tasks, schedule these tasks at fine granularity, and bind each task's compute and swaps to a device only late in scheduling (see the sketch after this list)
    • Goal: maximize the efficiency of large-model training. Four principles: 1. keep memory swaps as small as possible; 2. schedule tasks just in time; 3. use P2P swaps; 4. balance compute and swap load
  • ZeRO-Offload: Democratizing Billion-Scale Model Training (ATC’21)
  • Efficient Memory Management for GPU-based Deep Learning Systems
  • Salus: Fine-grained GPU Sharing Primitives for Deep Learning Applications (MLSys’20) [PDF]
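A minimal sketch of Harmony's key idea under a simplifying assumption that the model splits cleanly into per-layer-group tasks. This is my own illustration, not Harmony's implementation; `Task`, `swap_in`, `swap_out`, and `run_pass` are made-up names.

```python
# Decompose training into small tasks, schedule them just in time,
# and swap each task's tensors onto the device only right before use.
from collections import deque

class Task:
    def __init__(self, layer_group):
        self.layer_group = layer_group
        self.on_device = False  # are this task's weights in GPU memory?

def swap_in(task):
    # Principle 3: in Harmony this copy would go P2P between GPUs
    # where possible, avoiding a round trip through host memory.
    task.on_device = True

def swap_out(task):
    # Principle 1: evict as little as possible; here a task's weights
    # are evicted only after its compute finishes.
    task.on_device = False

def compute(task):
    print(f"compute layer group {task.layer_group}")

def run_pass(tasks):
    """Just-in-time schedule (principle 2): bind each task to the device
    only right before it runs. In a real system the prefetch below would
    be an async copy on a separate stream, overlapping with compute
    (principle 4); this sketch issues it sequentially for clarity."""
    queue = deque(tasks)
    if queue:
        swap_in(queue[0])          # warm up the first task
    while queue:
        current = queue.popleft()
        if queue:
            swap_in(queue[0])      # prefetch the next task's tensors
        compute(current)
        swap_out(current)

run_pass([Task(i) for i in range(4)])
```

The late compute/swap-to-device binding is what lets a scheduler balance compute and swap load (principle 4): tasks are interchangeable units until the moment they are placed.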
