Reza Yazdani Aminabadi, Samyam Rajbhandari, Minjia Zhang et al. (11 total)
2022-06-30
SC22: International Conference for High Performance Computing, Networking, Storage and Analysis
10.1109/SC41404.2022.00051
494 citations
Abstract
The landscape of transformer model inference is increasingly diverse in model size, model characteristics, latency and throughput requirements, hardware requirements, etc. With such diversity, designing a versatile inference system is challenging. DeepSpeed-Inference addresses these challenges by (1) a multi-GPU inference solution to minimize latency while maximizing throughput for both dense and sparse transformers when the model fits in aggregate GPU memory, and (2) a heterogeneous inference s...