Related to: MLOps

ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm)

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[44], line 1
----> 1 run_training()
 
Cell In[43], line 38, in run_training()
     26 trainer = pl.Trainer(
     27     # gpus=1,
     28     val_check_interval=0.5,
   (...)
     34     precision=Config.PRECISION, accelerator="gpu" 
     35 )
     37 print("Running trainer.fit")
---> 38 trainer.fit(audio_model, train_dataloaders = dl_train, val_dataloaders = dl_val)                
     40 gc.collect()
     41 torch.cuda.empty_cache()
 
File ~/.local/share/virtualenvs/Competition.Kaggle.BirdCLEF2023-BGaCen5g/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py:520, in Trainer.fit(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
    518 model = _maybe_unwrap_optimized(model)
    519 self.strategy._lightning_module = model
--> 520 call._call_and_handle_interrupt(
    521     self, self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
    522 )
 
File ~/.local/share/virtualenvs/Competition.Kaggle.BirdCLEF2023-BGaCen5g/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py:44, in _call_and_handle_interrupt(trainer, trainer_fn, *args, **kwargs)
...
---> 66     _error_if_any_worker_fails()
     67     if previous_handler is not None:
     68         assert callable(previous_handler)
 
RuntimeError: DataLoader worker (pid 6865) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.

While training a model inside a Docker container on my local machine, the error above occurred.

When Docker creates a container, the container gets its own shared memory space (`/dev/shm`), which is only 64MB by default. This error occurs when that space runs out, for example when DataLoader workers pass batches between processes through shared memory.

To check the current shared memory setting, run the following command in the terminal:

  • df -h

    Filesystem  Size  Used  Avail  Use%  Mounted on
    overlay     251G  4.9G  234G     3%  /
    tmpfs        64M     0   64M     0%  /dev
    tmpfs        13G     0   13G     0%  /sys/fs/cgroup
    shm          64M   30M   35M    46%  /dev/shm
    R:\         932G  579G  354G    63%  /root
    /dev/sdc    251G  4.9G  234G     3%  /etc/hosts
    drivers     931G  452G  479G    49%  /usr/bin/nvidia-smi
    lib         931G  452G  479G    49%  /usr/lib/x86_64-linux-gnu/libcuda.so.1
    none         13G     0   13G     0%  /dev/dxg
    tmpfs        13G     0   13G     0%  /proc/acpi
    tmpfs        13G     0   13G     0%  /sys/firmware

In my case, it is set to 64MB (the `shm` line mounted on `/dev/shm`).
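To look at only the shared memory mount instead of the full `df -h` listing, you can query it directly:

```shell
# Show just the shared memory mount; the Size column is the current shm limit
df -h /dev/shm
```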

There are two ways to increase the value:

  1. Pass the --shm-size option when creating the container to set the size you need.
  2. Pass --ipc=host when creating the container so that it shares the host's IPC namespace; the container then uses the host's /dev/shm instead of its own 64MB limit.
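For example, the two options look like this (the image name `my-training-image` and the 8g size are placeholders; adjust them to your own setup):

```shell
# Option 1: give the container a larger private /dev/shm
docker run --gpus all --shm-size=8g my-training-image

# Option 2: share the host's IPC namespace, so the container
# uses the host's /dev/shm with no separate 64MB limit
docker run --gpus all --ipc=host my-training-image
```

Note that both flags only apply at container creation; for an existing container you need to recreate it (or edit its configuration) rather than change the limit at runtime.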

References

https://curioso365.tistory.com/136

https://jybaek.tistory.com/785