
Towards Flexible and Compiler-Friendly Layer Fusion for CNNs on Multicore CPUs

EasyChair Preprint no. 6346

16 pages · Date: August 21, 2021


We demonstrate the performance benefits and tradeoffs of combining two convolution layers into a single layer in deep-learning pipelines on multicore CPUs. We analyze when and why fusion may yield runtime speedups, and study three types of layer fusion: (a) a 3-by-3 depthwise convolution with a 1-by-1 convolution, (b) a 3-by-3 convolution with a 1-by-1 convolution, and (c) two 3-by-3 convolutions. We show that whether fusion is beneficial depends on numerous factors, including arithmetic intensity, machine balance, memory footprint, memory access pattern, and the way the output tensor is tiled. We devise a schedule for all three fusion types that automatically generates fused kernels for multicore CPUs through auto-tuning. On more than 30 layers extracted from five CNNs, we achieve a 1.04x geomean (1.44x max) speedup against separate kernels from MKLDNN, and a 1.24x geomean (2.73x max) speedup against AutoTVM-tuned separate kernels in standalone kernel benchmarks. We also show a 1.09x geomean (1.29x max) speedup against TVM, and a 2.09x geomean (3.35x max) speedup against MKLDNN-backed PyTorch, in end-to-end inference tests.
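To illustrate fusion type (a), here is a minimal NumPy sketch of a 3-by-3 depthwise convolution fused with a 1-by-1 convolution. The fused variant computes the depthwise output one row tile at a time and immediately consumes it in the 1-by-1 convolution, so the intermediate tensor stays small and cache-resident. This is only an illustration of the idea; the paper's actual schedules are generated and auto-tuned for multicore CPUs, and the function names and the `tile` parameter here are hypothetical.

```python
import numpy as np

def depthwise3x3(x, wd):
    """Depthwise 3x3 conv, 'same' padding, stride 1. x: (C, H, W), wd: (C, 3, 3)."""
    C, H, W = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    y = np.zeros_like(x)
    for c in range(C):
        for i in range(H):
            for j in range(W):
                y[c, i, j] = np.sum(xp[c, i:i + 3, j:j + 3] * wd[c])
    return y

def pointwise1x1(x, wp):
    """1x1 conv: x: (C, H, W), wp: (K, C) -> (K, H, W)."""
    return np.einsum('kc,chw->khw', wp, x)

def fused(x, wd, wp, tile=8):
    """Fused depthwise-3x3 + 1x1 conv, tiled over output rows (illustrative)."""
    C, H, W = x.shape
    K = wp.shape[0]
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.empty((K, H, W))
    for i0 in range(0, H, tile):
        h = min(tile, H - i0)
        # Depthwise output for this row tile only -- the full (C, H, W)
        # intermediate tensor is never materialized.
        mid = np.zeros((C, h, W))
        for c in range(C):
            for i in range(h):
                for j in range(W):
                    mid[c, i, j] = np.sum(xp[c, i0 + i:i0 + i + 3, j:j + 3] * wd[c])
        # The 1x1 conv consumes the tile while it is still hot in cache.
        out[:, i0:i0 + h, :] = np.einsum('kc,chw->khw', wp, mid)
    return out
```

The fused version produces the same result as running the two layers separately; the performance question studied in the paper is when this reuse of the cache-resident intermediate outweighs the cost of the less regular access pattern.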

Keyphrases: auto-tuning, CNN, layer fusion, multicore CPUs

BibTeX entry
BibTeX does not have the right entry for preprints. This is a hack for producing the correct reference:
@booklet{EasyChair:6346,
  author = {Zhongyi Lin and Evangelos Georganas and John D. Owens},
  title = {Towards Flexible and Compiler-Friendly Layer Fusion for CNNs on Multicore CPUs},
  howpublished = {EasyChair Preprint no. 6346},
  year = {EasyChair, 2021}}