Recent Posts

Custom Gather-scatter Operator by CUTLASS

19 minute read

This post logs my experience building an efficient custom operator based on CUTLASS. Jump to the final implementation of gather and scatter matrix multi...

Understand CUDA Unified Memory

7 minute read

This post logs my experiments with CUDA unified memory (UVM) and some interesting applications of UVM in large language models (LLMs).