Deep Learning Compilers
- Rammer: Enabling Holistic Deep Learning Compiler Optimizations with rTasks (OSDI’20) [PDF] [Code] [Reading Notes]
- A collaboration between Peking University, ShanghaiTech University, and MSRA; optimizes DNN inference workloads
- Improves accelerator utilization by scheduling across both inter-operator and intra-operator parallelism
- Roller: Fast and Efficient Tensor Compilation for Deep Learning (OSDI’22) [PDF]
- Tensor compilers have advanced rapidly in recent years, but generating the kernel for a single operator can take hours, which seriously slows down DNN development. Compilation is slow because ML-based search is used in pursuit of better performance. In this work, the authors take a different route: a constructive approach that generates kernels within seconds, with performance comparable to the SOTA.
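The constructive idea can be caricatured in a few lines: rather than searching a huge schedule space with a learned cost model, enumerate a small set of hardware-aligned tile shapes and score them with a simple analytical estimate. All names, alignments, and hardware numbers below are illustrative assumptions, not the paper's actual model:

```python
# Sketch of a Roller-style constructive approach (illustrative only):
# pick tile shapes aligned to the hardware, score with an analytical
# roofline-style estimate, no ML-guided search.

def aligned_tiles(dim, alignments=(16, 32, 64, 128)):
    # Only consider tile sizes aligned to the memory-transaction width
    # (assumes dim is divisible by at least one alignment).
    return [a for a in alignments if dim % a == 0]

def estimate_time(m, n, k, tm, tn, bandwidth=1e12, flops=1e13):
    # Crude analytical estimate for a tiled matmul: time is the max of
    # memory-bound and compute-bound terms (bytes assume fp32).
    tiles = (m // tm) * (n // tn)
    bytes_moved = tiles * (tm * k + k * tn + tm * tn) * 4
    return max(bytes_moved / bandwidth, 2 * m * n * k / flops)

def pick_tile(m, n, k):
    # Constructive selection: evaluate only the small aligned candidate set.
    candidates = [(tm, tn) for tm in aligned_tiles(m) for tn in aligned_tiles(n)]
    return min(candidates, key=lambda t: estimate_time(m, n, k, *t))
```

Because the candidate set is tiny and the cost model is closed-form, selection is effectively instantaneous, which is the source of the seconds-versus-hours gap the note describes.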
- SparTA: Deep-Learning Model Sparsity via Tensor-with-Sparsity-Attribute (OSDI’22) [PDF]
- This paper explores sparsity in deep neural networks. Using a new abstraction called TeSA, it expresses end-to-end sparsity patterns for a model and generates efficient, specialized operators for each pattern, giving sparse DNN models a smaller memory footprint and lower inference latency.
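As a rough illustration of the idea (a hypothetical sketch, not SparTA's actual TeSA API), one can think of TeSA as a tensor bundled with a sparsity attribute that a compiler propagates through operators to decide where specialized kernels apply:

```python
# Hypothetical TeSA-like abstraction: dense values plus a per-element
# sparsity attribute; the attribute propagates through each op.
import numpy as np

class TeSA:
    def __init__(self, data, mask):
        self.data = data   # dense values
        self.mask = mask   # sparsity attribute: 1 = kept, 0 = pruned

    def matmul(self, other):
        # Attribute propagation: an output element can be nonzero only
        # if some kept row element meets a kept column element.
        out_mask = (self.mask @ other.mask > 0).astype(np.int8)
        out = (self.data * self.mask) @ (other.data * other.mask)
        return TeSA(out, out_mask)
```

The propagated output attribute is what lets a compiler specialize the downstream operator (e.g. skip whole rows) instead of treating every tensor as dense.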
Inference
- Swayam: Distributed Autoscaling to Meet SLAs of Machine Learning Inference Services with Resource Efficiency (Middleware’17) [PDF]
- Meets the SLAs of inference jobs while remaining resource-efficient
- Serving DNNs like Clockwork: Performance Predictability from the Bottom Up (OSDI’20) [PDF] [Code]
- Uses centralized management to run inference on GPUs more efficiently
- Preloads the weights of multiple models into GPU memory while running only one inference task at a time; accurately predicts the completion time of each inference job
- INFaaS: Automated Model-less Inference Serving (ATC’21) [PDF]
- From a Stanford team; focuses on ease of use and cost efficiency at scale
- developers simply specify the performance and accuracy requirements for their applications without needing to specify a specific model-variant for each query.
- Orca: A Distributed Serving System for Transformer-Based Generative Models (OSDI’22) [PDF]
- The authors are from Seoul National University. They design Orca, a distributed inference system for Transformer-based generative models. Orca is already deployed in production at FriendliAI, the authors' company.
- Microsecond-scale Preemption for Concurrent GPU-accelerated DNN Inferences (OSDI’22) [PDF]
Serverless
- PyPlover: A System for GPU-enabled Serverless Instances [PDF]
- A technical report by a UC Berkeley student, proposing PyPlover, a serverless GPU framework
- Towards Demystifying Serverless Machine Learning Training (SIGMOD’21) [PDF]
Debugging
- Amazon SageMaker Debugger: A System for Real-Time Insights into Machine Learning Model Training (MLSys’21) [PDF]
- Amazon SageMaker Debugger automatically catches bugs during training, such as exploding gradients, vanishing gradients, and overfitting
- It detects these bugs by checking rules over the IR, helping developers debug
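The flavor of such rule-based checks can be sketched as predicates over captured training signals. The rule names and thresholds below are illustrative assumptions, not Amazon's actual rule implementations:

```python
# Illustrative training-bug rules in the spirit of SageMaker Debugger:
# each rule inspects captured metrics and fires on an anomaly.

def exploding_gradient(grad_norms, threshold=1e3):
    # Fires if any layer's gradient norm blows past the threshold.
    return any(g > threshold for g in grad_norms)

def vanishing_gradient(grad_norms, threshold=1e-7):
    # Fires if every layer's gradient norm has collapsed toward zero.
    return all(g < threshold for g in grad_norms)

def overfitting(train_loss, val_loss, gap=0.5):
    # Fires when validation loss diverges from training loss.
    return (val_loss - train_loss) > gap
```

In a real system such rules would run asynchronously against tensors captured from the training job, so checking adds no work to the training loop itself.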
Green DL
- Towards the Systematic Reporting of the Energy and Carbon Footprints of Machine Learning (Journal of Machine Learning Research 2020) [PDF]
- Treehouse: A Case For Carbon-Aware Datacenter Software [PDF]
- Experiences in Autotuning Matrix Multiplication for Energy Minimization on GPUs
- Power and Performance Characterization of Computational Kernels on the GPU
Distributed Systems
- Time, Clocks, and the Ordering of Events in a Distributed System [Reading Notes]
- Lamport's classic paper defining the ordering of events in distributed systems
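The paper's logical clock fits in a few lines (the class and method names here are mine, but the update rules are Lamport's): each process increments its clock on local events, stamps outgoing messages, and on receipt advances to `max(local, message) + 1`, which guarantees that a send is ordered before its receive.

```python
# Minimal sketch of Lamport logical clocks; one instance per process.
class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self):
        # Local event: advance the clock.
        self.time += 1
        return self.time

    def send(self):
        # Sending is an event; attach the new timestamp to the message.
        return self.tick()

    def receive(self, msg_time):
        # Receive rule: clock = max(local, message) + 1,
        # so send(m) is always ordered before receive(m).
        self.time = max(self.time, msg_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
t = a.send()     # a's clock becomes 1
b.receive(t)     # b's clock becomes max(0, 1) + 1 = 2
```

Note this gives only a partial "happened-before" order: two events with incomparable histories may carry any relative timestamps, which is exactly the distinction the paper formalizes.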
- An Empirical Model of Large-Batch Training