Understanding CUDA Unified Memory


This post logs my experiments with CUDA unified memory and some innovative and interesting applications of UVM in large language models (LLMs).

Intro

Unified virtual memory (UVM) has been available in CUDA since the Pascal architecture. It is designed to unify the memory of the host (CPU) and the devices (GPUs) into the same address space, so that a piece of data can be accessed by both host code and kernel code.

The basic usage of UVM is already introduced in Unified Memory for CUDA Beginners and Maximizing Unified Memory Performance in CUDA, so this post focuses on the experiments themselves.

The name “unified virtual memory” actually indicates two major features: “unified” and “virtual”. We will exploit both of them in this study.

Using CuPy library

As the goal of this study is to apply UVM to LLMs, I use CuPy for quick demonstration and profiling. A CUDA C++ version might be added later.

UVM is exposed as managed memory in the CUDA APIs (e.g., cudaMallocManaged). There are two ways of allocating managed memory with CuPy.

import cupy as cp 

num_float32 = 1000
arr_dtype = cp.float32
# Method 1: using managed memory allocate API, similar to C++
memory = cp.cuda.malloc_managed(num_float32 * 4) # unit: byte
x_cp = cp.ndarray(num_float32, dtype=arr_dtype, memptr=memory)

# Method 2: set the default allocator to malloc_managed, more Pythonic
cp.cuda.set_allocator(cp.cuda.malloc_managed)
y_cp = cp.ndarray(num_float32, dtype=arr_dtype)

To use it in PyTorch for machine learning (ML) purposes, you only need to wrap it with the tensor interface,

import torch
# wrap the whole array as a tensor (zero-copy)
x_tensor = torch.as_tensor(x_cp, device='cuda')
# wrap a part of the array as a tensor (zero-copy)
subx_tensor = torch.as_tensor(x_cp[:100], device='cuda')
print(x_tensor.size(), subx_tensor.size())

Output:

torch.Size([1000]) torch.Size([100])

You can verify that they all share the same physical memory:

x_cp[0] = 100
x_tensor[1] = 11
subx_tensor[2] = 22
print(x_cp[:10])

Output:

[100.  11.  22.   0.   0.   0.   0.   0.   0.   0.]

which means any modification is done on the same piece of data.
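
Since a managed tensor behaves like any other CUDA tensor, it can be fed straight into PyTorch operators. Below is a minimal sketch; the sizes and the random weight matrix are just placeholders:

import cupy as cp
import torch

cp.cuda.set_allocator(cp.cuda.malloc_managed)
w_cp = cp.random.rand(1000, 1000, dtype=cp.float32)   # weight matrix in managed memory
w = torch.as_tensor(w_cp, device='cuda')              # zero-copy wrap

x = torch.randn(8, 1000, device='cuda')
y = x @ w                  # an ordinary CUDA matmul running on managed memory
print(y.shape)             # torch.Size([8, 1000])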

Unified memory address

Data allocated in unified memory can be accessed seamlessly on devices and the host, and users don’t need to call methods like .cuda() or .cpu() to deal with memory transfers. Instead, the UVM engine handles the transfers between devices and the host, triggered by page faults. TODO: add a figure of nsys. As shown in the section Using CuPy library, the same data can be accessed either as CPU memory or as CUDA memory.
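
In fact, because a managed pointer is also a valid host address, you can build a zero-copy NumPy view on top of the same allocation. Below is a minimal sketch, assuming a Pascal-or-newer GPU (which allows the CPU to touch managed memory); the ctypes wrapping is just an illustration, not an official CuPy API:

import ctypes
import numpy as np
import cupy as cp

cp.cuda.set_allocator(cp.cuda.malloc_managed)
x_cp = cp.arange(8, dtype=cp.float32)   # lives in managed memory

# make sure the GPU is done before the CPU touches the data
cp.cuda.Device(0).synchronize()

# wrap the same managed address with NumPy for a zero-copy host view
c_ptr = ctypes.cast(ctypes.c_void_p(x_cp.data.ptr), ctypes.POINTER(ctypes.c_float))
x_np = np.ctypeslib.as_array(c_ptr, shape=(8,))

x_np[0] = 42.0    # CPU write: the page migrates to the host on fault
print(x_cp[:4])   # GPU/CuPy read sees the same data: [42.  1.  2.  3.]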

Besides, UVM supports oversubscription, which means the size of managed memory can exceed the size of GPU memory; the oversized part is stored on the host. Once kernel code tries to read or write such an address, the UVM engine swaps the data back onto the device.

To test oversubscription, I wrote a script that keeps allocating new chunks of data until the total size of the allocated chunks is twice the device memory.

import cupy as cp

# use managed memory for all CuPy allocations
cp.cuda.set_allocator(cp.cuda.malloc_managed)
# get GPU memory info: (free, total) in bytes
mem = cp.cuda.Device(0).mem_info
print(f"Available: {mem[0]/1024/1024/1024}/{mem[1]/1024/1024/1024} GB")

# translate to maximum number of float32
max_float32 = mem[1] // 4
print(f"Max float32: {max_float32}")

# each chunk holds a quarter of the device memory (in float32 elements)
chunk_size = max_float32 // 4

# allocate 8 chunks, i.e. 2x the device memory in total
num_chunks = 8
chunks = []
for i in range(num_chunks):
    chunk_cp = cp.ones(chunk_size, dtype=cp.float32)
    chunk_cp *= i # do some computation
    chunks.append(chunk_cp)
    # check memory usage
    mem = cp.cuda.Device(0).mem_info
    print(f"Avaliable: {mem[0]/1024/1024/1024}/{mem[1]/1024/1024/1024} GB")

# run another loop to check if the data is still there
for i in range(num_chunks):
    chunk_sum = chunks[i].sum()
    print(f"chunk sum: {chunk_sum} == {i*chunk_size}")

Output:


which indicates that old chunks are evicted to host memory and swapped back in when they are needed.
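
If the page-fault-driven migration becomes a bottleneck, the pages can also be moved explicitly. Below is a minimal sketch, assuming your CuPy build exposes cudaMemPrefetchAsync as cupy.cuda.runtime.memPrefetchAsync (check the wrapper name in your CuPy version):

import cupy as cp

cp.cuda.set_allocator(cp.cuda.malloc_managed)
x_cp = cp.ones(1 << 24, dtype=cp.float32)   # 64 MB managed buffer

# explicitly migrate the pages to GPU 0 (a no-op if already resident),
# so later kernels do not stall on demand page faults
stream = cp.cuda.Stream.null
cp.cuda.runtime.memPrefetchAsync(x_cp.data.ptr, x_cp.nbytes, 0, stream.ptr)
print(x_cp.sum())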

Virtual paged memory

Compared with the unified memory address feature above, I believe this feature is not widely exploited by developers. However, I find it very powerful thanks to its deferred-allocation mechanism, which is similar to how an OS memory management unit handles virtual memory.

According to the CUDA documentation, UVM is managed in pages of 4 kB, which I suppose is designed to deal with the fragmentation problem. This sounds intuitive, but I found some cool applications of it.

Let’s see how the memory is actually allocated. Here I recommend NVTOP for monitoring memory usage at runtime. TODO: add a figure of nvtop

First, allocate a very large cp.ndarray:

import cupy as cp
import time

cp.cuda.set_allocator(cp.cuda.malloc_managed)
x_cp = cp.ndarray((50000, 50000), dtype=cp.float32) # ~10 GB > 4 GB of GPU memory
time.sleep(4) # waiting here to check nvtop

On nvtop,

TODO

the GPU memory usage is still zero percent, which indicates that the memory allocation is indeed deferred.
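
To confirm that physical pages are only committed when they are touched, write to a small slice of the big array and watch nvtop again. A minimal sketch (the exact usage you observe will depend on your GPU and driver):

import cupy as cp
import time

cp.cuda.set_allocator(cp.cuda.malloc_managed)
x_cp = cp.ndarray((50000, 50000), dtype=cp.float32)  # ~10 GB, nothing committed yet

x_cp[:1000, :] = 1.0             # touch only ~0.2 GB of the array
cp.cuda.Device(0).synchronize()
time.sleep(4)                    # check nvtop: usage grows roughly by the touched pages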
