Optimizing Matrix Multiplication on NVIDIA GPUs

Project Overview

A step-by-step optimization of a CUDA matrix multiplication kernel that reaches 86.5% of cuBLAS performance.

Skills Used

CUDA programming. C/C++ programming. Shared and global memory. CUDA warps. Asynchronous data transfer.

Introduction.

Memory-Bound vs. Compute-Bound Matrix Multiplication.

Kernel 1: Naive implementation.
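As a point of comparison, here is a minimal CPU reference in C++ for the naive algorithm. It is a sketch for validating GPU output, not the CUDA kernel itself. The name matmul_naive and the row-major layout are assumptions. The two outer loops enumerate the (row, col) pairs that a naive GPU kernel would typically hand to individual threads.

```cpp
#include <cstddef>
#include <vector>

// Computes C = A * B with row-major storage.
// A is m x k, B is k x n, C is m x n (c must already hold m * n elements).
void matmul_naive(const std::vector<float>& a, const std::vector<float>& b,
                  std::vector<float>& c,
                  std::size_t m, std::size_t n, std::size_t k) {
    for (std::size_t row = 0; row < m; ++row) {
        for (std::size_t col = 0; col < n; ++col) {
            // Each (row, col) pair is independent of every other pair,
            // which is what makes the one-thread-per-element mapping possible.
            float acc = 0.0f;
            for (std::size_t i = 0; i < k; ++i) {
                acc += a[row * k + i] * b[i * n + col];
            }
            c[row * n + col] = acc;
        }
    }
}
```

Each output element reads a full row of A and a full column of B with no reuse between elements, so memory traffic dominates the work.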

Kernel 2: .

Kernel 3: Shared Memory Cache-Blocking.
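Cache-blocking stages small tiles of A and B in fast memory so that each loaded value is reused many times before it is evicted. The sketch below is a CPU analogue in C++, not the CUDA kernel itself: the local arrays tile_a and tile_b stand in for shared memory, and the name matmul_tiled and the tile size kTile = 32 are assumptions chosen for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

constexpr std::size_t kTile = 32;  // assumed tile edge, not taken from the source

// Computes C = A * B with row-major storage, one kTile x kTile block at a time.
// A is m x k, B is k x n, C is m x n (c must already hold m * n elements).
void matmul_tiled(const std::vector<float>& a, const std::vector<float>& b,
                  std::vector<float>& c,
                  std::size_t m, std::size_t n, std::size_t k) {
    std::fill(c.begin(), c.end(), 0.0f);
    float tile_a[kTile * kTile];  // stand-in for a shared-memory tile of A
    float tile_b[kTile * kTile];  // stand-in for a shared-memory tile of B

    for (std::size_t row0 = 0; row0 < m; row0 += kTile) {
        for (std::size_t col0 = 0; col0 < n; col0 += kTile) {
            for (std::size_t i0 = 0; i0 < k; i0 += kTile) {
                // Clamp the block extents at the matrix edges.
                const std::size_t rows = std::min(kTile, m - row0);
                const std::size_t cols = std::min(kTile, n - col0);
                const std::size_t depth = std::min(kTile, k - i0);

                // Stage the current tiles of A and B into the local buffers.
                for (std::size_t r = 0; r < rows; ++r)
                    for (std::size_t i = 0; i < depth; ++i)
                        tile_a[r * kTile + i] = a[(row0 + r) * k + (i0 + i)];
                for (std::size_t i = 0; i < depth; ++i)
                    for (std::size_t col = 0; col < cols; ++col)
                        tile_b[i * kTile + col] = b[(i0 + i) * n + (col0 + col)];

                // Accumulate the partial products using only the staged tiles.
                for (std::size_t r = 0; r < rows; ++r) {
                    for (std::size_t col = 0; col < cols; ++col) {
                        float acc = 0.0f;
                        for (std::size_t i = 0; i < depth; ++i)
                            acc += tile_a[r * kTile + i] * tile_b[i * kTile + col];
                        c[(row0 + r) * n + (col0 + col)] += acc;
                    }
                }
            }
        }
    }
}
```

Each staged value of A is reused for up to kTile output columns and each staged value of B for up to kTile output rows, which cuts traffic to the slower memory by roughly a factor of kTile compared with the naive loop.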

Kernel 4: .

Kernel 5: .

Kernel 6: .

Conclusion

This post walked step by step through optimizing a matrix multiplication kernel on an NVIDIA GPU.

References