Runtime comparison of asynchronous and synchronous data transfers for a reduction kernel.
GPU microbenchmarking was used to identify the group of threads that makes up a warp.
CUDA kernel development and optimization for Euclidean distance matrix calculations.
A CUDA implementation of a three-layer neural network was compared with a PyTorch implementation.
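
To make the Euclidean distance matrix item concrete, here is a minimal CPU reference sketch in plain Python. It is not the CUDA kernel from this work. The function name `euclidean_distance_matrix` and the list-of-tuples input layout are assumptions chosen for illustration. A reference like this is one common way to check the output of a GPU kernel on small inputs.

```python
import math


def euclidean_distance_matrix(points):
    """Return the n x n matrix of pairwise Euclidean distances.

    `points` is a list of n coordinate sequences of equal length.
    Hypothetical helper for illustration, not the original kernel.
    """
    n = len(points)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Each (i, j) entry depends only on points i and j, so the
            # entries can be computed independently of one another.
            d = math.sqrt(
                sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            )
            dist[i][j] = d
            dist[j][i] = d  # the matrix is symmetric
    return dist
```

Because every entry is independent, the double loop maps naturally onto a grid of GPU threads with one thread per (i, j) pair. For example, calling the function on the two points (0, 0) and (3, 4) gives an off-diagonal distance of 5.0 and zeros on the diagonal.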