Pytorch Performance Tuning Practices

Related to: Machine Learning

개요

PyTorch 모델 학습과 추론의 성능을 향상시키기 위한 실용적인 최적화 기법들을 정리합니다.

핵심 개념

General optimizations

Use async data loading

torch.utils.data.DataLoader(dataset, num_workers=num_workers)
- num_workers = 0: 메인 프로세스가 데이터를 디스크에서 동기식으로 로딩합니다.
- num_workers > 0: 여러 프로세스를 사용하여 디스크에서 데이터를 비동기식으로 읽고, 학습과 데이터로딩이 overlapping될 수 있도록 허용합니다. CPU의 데이터 로딩을 빠르게 처리하는 용도로 사용합니다.

Pin memory, transfer data asynchronously

Untitled 37.png

torch.utils.data.DataLoader(dataset, pin_memory=True) batch.to(device, non_blocking=True)
- GPU가 pageable host 메모리에서 곧바로 데이터를 가져올 수 없기 때문에, pinned (page-locked) 메모리를 활용
- pin_memory=True: 데이터 텐서를 자동으로 pinned 메모리로 가져오기 때문에, 데이터 전송이 빠릅니다.
- pin_memory=True, non_blocking=True: pinned 메모리에 있는 데이터에 한해서 GPU로 비동기식으로 데이터를 전송합니다.

Efficiently zero-out gradients

model.zero_grad(set_to_none=True)
- model.zero_grad() 대신 사용합니다.
- 모든 파라미터마다 memset을 실행하지 않습니다.
- Gradient를 업데이트할 때, ”+=” (read+write)이 아닌 ”=” (write)를 사용합니다.

Increase batch size

배치 크기를 키워서 GPU 메모리를 최대한 활용하는 것이 학습 시간을 단축하는데 큰 도움이 됩니다.
배치 크기가 크면, 수렴이 느려질 수 있기 때문에 아래와 같은 방법을 사용해서 보완할 수 있습니다.
- Tune learning rate, tune weight decay
- Add learning rate warm-ups & decay

예시 / 코드

GPU specific optimizations

Use 16-bit precision

scaler = torch.cuda.amp.GradScaler()
with torch.cuda.amp.autocast(enabled=use_fp16):
    output = model(input)
    loss = loss_fn(output, target)
if use_fp16:
    scaler.scale(loss).backward()
    if max_norm is not None:
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    scaler.step(optimizer)
    scaler.update()

Mixed precision training은 FP16, FP32를 같이 사용해서 학습하는 방법입니다. 일반적으로 2단계로 이루어집니다.

FP16으로 casting.
FP16 숫자가 0으로 되지 않도록 loss / gradient scaling.: FP16이 나타낼 수 있는 수의 최소 범위 (2²⁴) 보다 숫자가 작아서 0으로 강제 변환하는 문제를 scaling으로 해결.

Mixed precision training 이점:

FP32으로만 학습할 때와 비슷한 정확도.
필요한 메모리 사이즈 감소.
학습 시간 감소. V100 기준, 1.5~5배 speedup.

Enable cuDNN autotuner

torch.backends.cudnn.benchmark = True

Nvidia cuDNN은 convolution (CNN)을 계산하기 위해 다양한 알고리즘을 지원하고 있습니다.
Autotuner는 짧은 benchmark 실행하고, 하드웨어와 input 크기에 최적화된 알고리즘을 선택합니다.
(주의) 고정된 input 크기일 때만 효과적이고, input 크기가 동적으로 변하면 매번 최적화된 알고리즘을 찾게 되어 시간이 더 오래 걸릴 수도 있습니다.

Avoid unnecessary CPU-GPU synchronization

# don't
.item()
.cuda()
.cpu()
.to(device)
.nonzero()
print(tensor)

불필요하게 GPU, CPU간 데이터를 전송하는 경우, 성능이 크게 저하됩니다.

Construct tensors directly on GPUs

# don't
t = tensor.rand(2,2).cuda()
 
# do
t = tensor.rand(2,2, device=torch.device('cuda:0'))

Distributed optimizations

Use DistributedDataParallel not DataParallel

Untitled 1 31.png

Spawn processes

torch.multiprocessing.spawn(main_worker, nprocs=args.gpu_num, args=(args,))

Environment variable initialization

os.environ['MASTER_ADDR'] = master_address
os.environ['MASTER_PORT'] = str(master_port)

Initialize process group

torch.distributed.init_process_group(backend='nccl',
                                     init_method='env://',
                                     world_size=world_size,
                                     rank=rank)

Distributed Data Parallel

model = torch.nn.parallel.DistributedDataParallel(
    model,
    device_ids=[gpu],
    output_device=gpu,
)

참조

Top 10 Performance Tuning Practices for Pytorch

Pytorch 모델의 학습 및 추론을 가속화 할 수 있는 10가지 팁을 공유드립니다. https://medium.com/naver-shopping-dev/top-10-performance-tuning-practices-for-pytorch-e6c510152f76

HSV

Explorer

Pytorch Performance Tuning Practices

개요

핵심 개념

General optimizations

예시 / 코드

GPU specific optimizations

Distributed optimizations

관련 개념

참조

Graph View

Table of Contents

Backlinks