Profile a CUDA Program with Nsight

less than 1 minute read

Placeholder for my blog

Custom Gather-scatter Operator by CUTLASS

17 minute read

This post logs my experience building an efficient custom operator based on CUTLASS. Jump to the final implementation of gather and scatter matrix multi...

Compact Inference with CUDA graph and StaticCache

11 minute read

This post logs a minimal prototype of LLM inference with CUDA graphs to eliminate bubbles between kernel launches. Click here to jump to the final implem...

Efficient Gather-and-scatter Matrix Multiplication Kernel with Triton

20 minute read

This post logs my implementation of a gather-and-scatter matrix multiplication operation with Triton. Click here to jump to the final implementation code.

Understand CUDA Unified Memory

7 minute read

This post logs my experiments with CUDA unified memory and some innovative and interesting applications of UVM in large language models (LLMs).

Xueshen Liu
