Introduction.
Memory bound vs Compute bound for Matrix Multiplication.
Kernel 1: Naive implementation.
Kernel 2: .
Kernel 3: Shared Memory Cache-Blocking.
Kernel 4: .
Kernel 5: .
Kernel 6: .
Conclusion
This technical blog walked step by step through optimizing a matrix multiplication kernel for NVIDIA GPUs.