Deep Learning Compilers

  • Rammer: Enabling Holistic Deep Learning Compiler Optimizations with rTasks (OSDI’20) [PDF] [Code] [Notes]
    • Joint work by Peking University, ShanghaiTech University, and MSRA that optimizes DNN inference workloads
    • Improves accelerator utilization through holistic inter-operator and intra-operator scheduling
  • Roller: Fast and Efficient Tensor Compilation for Deep Learning (OSDI’22) [PDF]
    • Tensor compilers have advanced rapidly in recent years, but generating the kernel for a single operator can take hours, which severely slows down DNN development. Compilation is slow because machine learning is used in the search for better-performing kernels. In this work, the authors take a different route: a construction-based method that generates kernels in seconds with performance comparable to the state of the art (a toy sketch of the idea follows this list).
  • SparTA: Deep-Learning Model Sparsity via Tensor-with-Sparsity-Attribute (OSDI’22) [PDF]
    • This paper explores sparsity in deep neural networks. Using a new abstraction called TeSA, it propagates end-to-end sparsity patterns across a model and, for each sparsity pattern, generates efficient specialized operators, giving sparse DNN models a smaller memory footprint and lower inference latency (see the second sketch after this list).
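
Roller's construction-based idea can be illustrated with the toy sketch below. Everything here (the `Hardware` fields, the cost model, the function names) is a simplification invented for illustration, not the paper's actual interface; the point is that tile shapes aligned to hardware units can be ranked with an analytical cost model instead of ML-guided on-device search.

```python
from dataclasses import dataclass

@dataclass
class Hardware:
    warp_size: int = 32               # threads per warp
    mem_txn_elems: int = 8            # fp32 elements per memory transaction
    smem_capacity_elems: int = 12288  # shared-memory budget, in elements

def aligned_tiles(m, n, hw):
    """Enumerate tile shapes whose edges are multiples of hardware units,
    so every candidate saturates memory transactions by construction."""
    for tm in range(hw.warp_size, m + 1, hw.warp_size):
        for tn in range(hw.mem_txn_elems, n + 1, hw.mem_txn_elems):
            if tm * tn <= hw.smem_capacity_elems:
                yield tm, tn

def predicted_cost(tm, tn):
    """Crude analytical cost model: memory traffic per unit of compute;
    ranking candidates needs no on-device measurement."""
    return (tm + tn) / (tm * tn)      # lower is better (more data reuse)

def construct_kernel_config(m, n, hw=Hardware()):
    return min(aligned_tiles(m, n, hw), key=lambda t: predicted_cost(*t))

print(construct_kernel_config(1024, 1024))  # -> (96, 128) under this toy model
```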
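
And a minimal sketch of the TeSA idea, with hypothetical class and function names: a tensor carries a sparsity-attribute mask, the mask is propagated through an operator, and a real system would then emit a kernel specialized for the resulting pattern.

```python
import numpy as np

class TeSA:
    """Tensor-with-Sparsity-Attribute: values plus a binary mask."""
    def __init__(self, values, mask):
        self.values = values * mask   # enforce the sparsity attribute
        self.mask = mask

def matmul_propagate(a: TeSA, b: TeSA) -> TeSA:
    """Propagate sparsity: an output element is certainly zero only if
    every product contributing to it is zero."""
    out_mask = (a.mask @ b.mask > 0).astype(a.mask.dtype)
    out = a.values @ b.values         # a real system would emit a kernel
                                      # specialized for (a.mask, b.mask)
    return TeSA(out, out_mask)

rng = np.random.default_rng(0)
mask_a = (rng.random((4, 4)) > 0.7).astype(np.float32)   # ~70% sparse
a = TeSA(rng.random((4, 4), dtype=np.float32), mask_a)
b = TeSA(rng.random((4, 4), dtype=np.float32), np.ones((4, 4), np.float32))
c = matmul_propagate(a, b)
print(c.mask)   # rows of A that are entirely zero stay zero in C
```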

Inference

  • Swayam: Distributed Autoscaling to Meet SLAs of Machine Learning Inference Services with Resource Efficiency (Middleware’17) [PDF]
    • Meets SLAs for inference jobs while using resources efficiently
  • Serving DNNs like Clockwork: Performance Predictability from the Bottom Up (OSDI’20) [PDF] [Code]
    • Uses centralized management to run GPU inference more efficiently
    • Preloads the weights of multiple models into GPU memory and runs only one inference task at a time, which makes each inference job's completion time accurately predictable (see the first sketch after this list)
  • INFaaS: Automated Model-less Inference Serving (ATC’21) [PDF]
    • A paper from a Stanford team, focused on ease of use and cost efficiency at scale
    • Developers simply specify the performance and accuracy requirements for their applications, without naming a specific model-variant for each query (see the selection sketch after this list)
  • Orca: A Distributed Serving System for Transformer-Based Generative Models (OSDI’22) [PDF]
    • The authors, from Seoul National University, design Orca, a distributed serving system for Transformer-based generative models; Orca is already used in production at FriendliAI, the authors' company (a toy version of its iteration-level scheduling follows this list)
  • Microsecond-scale Preemption for Concurrent GPU-accelerated DNN Inferences (OSDI’22) [PDF]
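
A minimal sketch of Clockwork's predictability argument, with invented names and numbers: because each model's latency is profiled offline and each GPU runs exactly one inference at a time, a central controller can predict completion times up front and reject requests that would miss their deadline.

```python
PROFILED_LATENCY_MS = {"resnet50": 8.0, "bert-base": 22.0}  # measured offline

class Worker:
    def __init__(self):
        self.free_at_ms = 0.0          # when this GPU finishes its queue

    def predict_completion(self, model, now_ms):
        start = max(now_ms, self.free_at_ms)
        return start + PROFILED_LATENCY_MS[model]

class Controller:
    def __init__(self, workers):
        self.workers = workers

    def admit(self, model, now_ms, slo_deadline_ms):
        # Pick the worker with the earliest predicted completion.
        w = min(self.workers, key=lambda w: w.predict_completion(model, now_ms))
        finish = w.predict_completion(model, now_ms)
        if finish > slo_deadline_ms:
            return None                 # would miss its SLO: reject early
        w.free_at_ms = finish           # exclusive execution, no concurrency
        return finish

ctrl = Controller([Worker(), Worker()])
print(ctrl.admit("resnet50", now_ms=0.0, slo_deadline_ms=10.0))   # 8.0
print(ctrl.admit("bert-base", now_ms=0.0, slo_deadline_ms=20.0))  # None (22 > 20)
```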
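
A toy rendering of INFaaS's model-less interface, with made-up variant data: the client only states requirements, and the system picks the cheapest registered model-variant that satisfies them.

```python
VARIANTS = [  # (name, p99 latency ms, top-1 accuracy %, cost per 1k queries)
    ("resnet18-cpu",    45.0, 69.8, 0.08),
    ("resnet50-gpu",     9.0, 76.1, 0.35),
    ("resnet50-trt-gpu", 4.0, 75.9, 0.40),
]

def select_variant(max_latency_ms, min_accuracy):
    feasible = [v for v in VARIANTS
                if v[1] <= max_latency_ms and v[2] >= min_accuracy]
    if not feasible:
        raise ValueError("no variant meets the requirements")
    return min(feasible, key=lambda v: v[3])  # cheapest feasible variant

# The caller never names a model-variant, only requirements:
print(select_variant(max_latency_ms=10.0, min_accuracy=75.0))  # resnet50-gpu
```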
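
Orca's core scheduling idea in toy form (the real system additionally performs selective batching inside the model; names here are illustrative): a new batch is assembled at every decoding step, so finished sequences exit immediately and waiting requests join without waiting for the whole batch to drain.

```python
from collections import deque

def decode_step(batch):
    """Stand-in for one forward pass: each request emits one token."""
    for req in batch:
        req["generated"] += 1

def serve(requests, max_batch=4):
    waiting, running = deque(requests), []
    while waiting or running:
        # Admit new requests at iteration granularity, not per full sequence.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        decode_step(running)
        # Retire finished sequences right away, freeing batch slots.
        done = [r for r in running if r["generated"] >= r["target_len"]]
        for r in done:
            print(f"{r['id']} finished")
        running = [r for r in running if r not in done]

serve([{"id": f"req{i}", "generated": 0, "target_len": n}
       for i, n in enumerate([2, 5, 3, 7, 1])])
```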

Serverless

  • PyPlover: A System for GPU-enabled Serverless Instances [PDF]
    • A technical report by a UC Berkeley student, proposing PyPlover, a serverless GPU framework
  • Towards Demystifying Serverless Machine Learning Training (SIGMOD’21) [PDF]

Debugging

  • Amazon SageMaker Debugger: A System for Real-Time Insights into Machine Learning Model Training (MLSys’21) [PDF]
    • Amazon SageMaker Debugger automatically catches bugs during training, such as exploding gradients, vanishing gradients, and overfitting
      It checks for these bugs with built-in rules, assisting developers in debugging (a toy rule check follows this list)
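
A toy version of the rule-based checking described above; rule names and thresholds are illustrative, not SageMaker Debugger's real rule API. After each step, captured gradient tensors are evaluated against simple rules.

```python
import numpy as np

def rule_exploding_gradients(grads, threshold=1e3):
    return any(np.abs(g).max() > threshold for g in grads.values())

def rule_vanishing_gradients(grads, threshold=1e-7):
    return all(np.abs(g).mean() < threshold for g in grads.values())

def evaluate_rules(step, grads):
    for name, rule in [("ExplodingGradients", rule_exploding_gradients),
                       ("VanishingGradients", rule_vanishing_gradients)]:
        if rule(grads):
            print(f"step {step}: rule {name} fired")  # would alert or stop training

grads = {"layer1.weight": np.full((4, 4), 1e-9),
         "layer2.weight": np.full((4, 4), 1e-9)}
evaluate_rules(step=100, grads=grads)   # VanishingGradients fires
```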

Green DL

  • Carbon Emissions and Large Neural Network Training (ArXiv’21) [Notes] [PDF]
    • A paper by leading researchers from Google and UC Berkeley
  • Towards the Systematic Reporting of the Energy and Carbon Footprints of Machine Learning (Journal of Machine Learning Research 2020) [PDF]
  • Treehouse: A Case For Carbon-Aware Datacenter Software [PDF]
  • Experiences in Autotuning Matrix Multiplication for Energy Minimization on GPUs
  • Power and Performance Characterization of Computational Kernels on the GPU

Distributed Systems

  • Time, Clocks, and the Ordering of Events in a Distributed System [Notes]
    • Lamport's classic paper defining the ordering of events in distributed systems (a minimal logical-clock sketch follows)
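
A minimal Lamport clock following the paper's rules: bump the counter on every local event, attach it to outgoing messages, and on receive advance to max(local, received) + 1.

```python
class LamportClock:
    def __init__(self):
        self.time = 0

    def local_event(self):
        self.time += 1
        return self.time

    def send(self):
        self.time += 1
        return self.time          # timestamp carried by the message

    def receive(self, msg_time):
        self.time = max(self.time, msg_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
t = a.send()                      # a.time = 1
print(b.receive(t))               # b.time = max(0, 1) + 1 = 2
print(a.local_event(), b.local_event())  # 2 3, consistent with happens-before
```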

  • An Empirical Model of Large-Batch Training

