GPU Sharing
- AntMan: Dynamic Scaling on GPU Clusters for Deep Learning (OSDI'20) [Notes] [PDF] [Code]
  - Fine-grained scheduling of GPU memory and operators within a cluster to raise overall cluster utilization (a policy sketch follows below)
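A minimal sketch of the memory-scaling policy described above, under stated assumptions: the `Job`/`rebalance` names are hypothetical, and AntMan's real mechanism hooks into the framework's memory allocator rather than a standalone scheduler. The idea shown: when a resource-guaranteed job needs more GPU memory, shrink the budgets of co-located opportunistic jobs instead of evicting them.

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    guaranteed: bool   # resource-guaranteed vs. opportunistic
    budget_mb: int     # current GPU memory cap enforced on the job
    min_mb: int        # floor below which the job must swap to host memory

def rebalance(jobs, total_mb, demand_mb, grower):
    """Grow `grower`'s budget to `demand_mb` by shrinking opportunistic jobs."""
    needed = demand_mb - grower.budget_mb
    for j in jobs:
        if needed <= 0:
            break
        if j is grower or j.guaranteed:
            continue                       # never shrink guaranteed jobs
        reclaim = min(needed, j.budget_mb - j.min_mb)
        j.budget_mb -= reclaim             # job falls back to host-memory swap
        needed -= reclaim
    grower.budget_mb = demand_mb - max(needed, 0)
    assert sum(j.budget_mb for j in jobs) <= total_mb

jobs = [Job("resnet", True, 8000, 4000), Job("bert-opp", False, 6000, 1000)]
rebalance(jobs, total_mb=16000, demand_mb=12000, grower=jobs[0])
print([(j.name, j.budget_mb) for j in jobs])  # resnet grows, bert-opp shrinks
```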
- Zico: Efficient GPU Memory Sharing for Concurrent DNN Training (ATC’21)
Memory Swap
- Doing More with Less: Training Large DNN Models on Commodity Servers for the Masses (HotOS’21) [PDF]
  - Proposes Harmony to address the problem of training large models on small commodity servers
  - key idea: decompose the data, model, and ops into small tasks, schedule these small tasks at fine granularity, and only bind each task's computation and swaps to devices late; goal: maximize the efficiency of training large models; four principles: 1. keep memory swaps as small as possible; 2. schedule tasks just-in-time; 3. use P2P swaps; 4. balance compute and swap load (see the sketch after this entry)
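A minimal sketch of principle 2 (just-in-time swap scheduling) under assumed PyTorch stream semantics; this is the overlap idea, not Harmony's scheduler. The weights of task i+1 are swapped in on a side CUDA stream while task i computes, so swap latency hides behind compute. Assumes a CUDA device, pinned host memory for true async copies, and room for two layers on the GPU at once.

```python
import torch

def forward_with_swaps(layers, x):
    """layers: nn.Modules kept in host memory; x: activation already on GPU."""
    swap_stream = torch.cuda.Stream()
    with torch.cuda.stream(swap_stream):
        layers[0].to("cuda", non_blocking=True)        # prefetch first task
    for i, layer in enumerate(layers):
        torch.cuda.current_stream().wait_stream(swap_stream)  # weights arrived
        if i + 1 < len(layers):
            with torch.cuda.stream(swap_stream):       # swap-in overlaps compute
                layers[i + 1].to("cuda", non_blocking=True)
        x = layer(x)                                   # compute task i
        layer.to("cpu")                                # swap-out frees GPU memory
    return x
```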
- ZeRO-Offload: Democratizing Billion-Scale Model Training (ATC'21)
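ZeRO-Offload keeps the optimizer states and the parameter update on the CPU, so GPU memory holds only what forward/backward needs. A minimal sketch of that offload pattern (the `CpuOffloadSGD` class is hypothetical, not DeepSpeed's API):

```python
import torch

class CpuOffloadSGD:
    """Master params + optimizer state live in host memory; GPU holds only
    the working copy used for forward/backward."""
    def __init__(self, gpu_params, lr=0.01, momentum=0.9):
        self.gpu_params = list(gpu_params)
        self.cpu_params = [p.detach().to("cpu", copy=True) for p in self.gpu_params]
        self.opt = torch.optim.SGD(self.cpu_params, lr=lr, momentum=momentum)

    @torch.no_grad()
    def step(self):
        for gp, cp in zip(self.gpu_params, self.cpu_params):
            cp.grad = gp.grad.to("cpu")   # gradients: GPU -> CPU
        self.opt.step()                   # update runs on CPU cores
        for gp, cp in zip(self.gpu_params, self.cpu_params):
            gp.copy_(cp)                  # updated weights: CPU -> GPU
            gp.grad = None
```

Usage follows the normal optimizer pattern: `opt = CpuOffloadSGD(model.parameters())`, then `opt.step()` after `loss.backward()`.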
- Efficient Memory Management for GPU-based Deep Learning Systems
- Salus: Fine-grained GPU Sharing Primitives for Deep Learning Applications (MLSys’20) [PDF]