Analyzing GPU Tensor Core Potential for Fast Reductions

EasyChair Preprint 565

5 pages • Date: October 7, 2018

Roberto A. Carrasco Cavieres, Raimundo Vega and Cristóbal A. Navarro

Abstract

The Nvidia GPU architecture has introduced new computing elements such as tensor cores, which are special processing units dedicated to performing fast matrix-multiply-accumulate (MMA) operations and accelerating Deep Learning applications. In this work we present the idea of using tensor cores for a different purpose, the parallel arithmetic reduction problem, and propose a new GPU tensor-core based algorithm as well as analyze its potential performance benefits in comparison to a traditional GPU-based one. The proposed method encodes the reduction of n numbers as a set of m × m MMA tensor-core operations (for Nvidia’s Volta architecture m = 16) and takes advantage of the fact that each MMA operation takes just one GPU cycle. When analyzing the cost under a simplified GPU computing model, the result is that the new algorithm manages to reduce a problem of n numbers in T(n) = 5 log_{m^2}(n) steps with a speedup of S = (4/5) log_2(m^2).
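The idea of encoding a sum as MMA operations can be illustrated with a minimal CPU-side sketch. It is not the authors' GPU algorithm: it assumes one plausible encoding in which an m × m tile is multiplied by all-ones matrices, and it emulates each MMA in plain Python. The names `mma`, `reduce_tile`, `mma_reduce`, `tensor_steps` and `speedup` are hypothetical helpers introduced for illustration. The cost functions read the abstract's step count as a base-m^2 logarithm and its speedup as a base-2 logarithm, which is the reading under which S · T(n) gives a binary-tree cost of 4 log_2(n).

```python
import math

M = 16  # tile side; the abstract gives m = 16 for Volta


def mma(a, b, c):
    """Emulate one m x m matrix-multiply-accumulate: D = A * B + C."""
    m = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(m)) + c[i][j]
             for j in range(m)] for i in range(m)]


def reduce_tile(values, m=M):
    """Sum up to m*m numbers with two emulated MMAs (assumed encoding)."""
    padded = list(values) + [0] * (m * m - len(values))
    tile = [padded[r * m:(r + 1) * m] for r in range(m)]
    ones = [[1] * m for _ in range(m)]
    zero = [[0] * m for _ in range(m)]
    col_sums = mma(ones, tile, zero)   # every row now holds the column sums
    total = mma(col_sums, ones, zero)  # every entry now holds the full sum
    return total[0][0]


def mma_reduce(values, m=M):
    """Reduce a sequence level by level; each level shrinks it by m*m."""
    values = list(values)
    if not values:
        return 0, 0
    levels = 0
    while len(values) > 1:
        values = [reduce_tile(values[i:i + m * m], m)
                  for i in range(0, len(values), m * m)]
        levels += 1
    return values[0], levels


def tensor_steps(n, m=M):
    """Step count from the abstract: T(n) = 5 * log_{m^2}(n)."""
    return 5 * math.log(n, m * m)


def speedup(m=M):
    """Speedup from the abstract: S = (4/5) * log_2(m^2)."""
    return 4 / 5 * math.log2(m * m)


if __name__ == "__main__":
    total, levels = mma_reduce(range(1, 1001))
    print(total, levels)   # sum of 1..1000 and number of tile levels used
    print(speedup())       # model speedup for m = 16
```

For m = 16 the speedup formula evaluates to (4/5) · 8 = 6.4, and the number of tile levels in the sketch grows as the ceiling of log_{m^2}(n), which matches the shape of the abstract's T(n).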

Keyphrases: GPU computing, NVIDIA Tensor Cores, matrix-multiply-accumulate, reduction

Links:	https://easychair.org/publications/preprint/n9zS
	https://doi.org/10.29007/zlmg

BibTeX entry

BibTeX does not have the right entry for preprints. This is a hack for producing the correct reference:

@booklet{EasyChair:565,
  author    = {Roberto A. Carrasco Cavieres and Raimundo Vega and Cristóbal A. Navarro},
  title     = {Analyzing GPU Tensor Core Potential for Fast Reductions},
  doi       = {10.29007/zlmg},
  howpublished = {EasyChair Preprint 565},
  year      = {EasyChair, 2018}}
