ECS Task Using GPU Taking Much Longer to Complete Than Expected

I’m using EC2 as a capacity provider for ECS tasks that need a GPU to train a model. Initially, the tasks were taking much longer to complete than expected.

After some debugging, it turned out that the CPU was being used rather than the GPU, which explained the slowness.

Calling the following function returned False:

torch.cuda.is_available() 
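A bare True/False doesn't tell you *why* the GPU is unavailable. As a quick diagnostic sketch (it assumes nothing beyond PyTorch optionally being importable), the following distinguishes "PyTorch missing", "CPU-only build", and "GPU visible":

```python
def cuda_diagnostic() -> str:
    """Explain, in one line, why the GPU is or isn't usable from PyTorch."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed in this environment"
    if not torch.cuda.is_available():
        # torch.version.cuda is None for CPU-only wheels
        build = torch.version.cuda or "CPU-only"
        return f"CUDA unavailable (this PyTorch build targets: {build})"
    return f"{torch.cuda.device_count()} GPU(s) visible, e.g. {torch.cuda.get_device_name(0)}"

print(cuda_diagnostic())
```

Running this inside the container makes it obvious whether the problem is a CPU-only wheel or a driver/runtime mismatch.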

In my case, there was one main reason for this:

The version of the CUDA driver on the host (which you can find by running nvidia-smi) needs to match the nvidia/cuda base image you build your container from, which in turn needs to match the CUDA version of the GPU-enabled PyTorch build you install.
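This mismatch is easy to detect programmatically. As a hedged sketch (nvidia-smi's header format can vary slightly across driver versions), the host's supported CUDA version can be scraped and compared against whatever `torch.version.cuda` reports inside the container:

```python
import re
import subprocess
from typing import Optional

def parse_smi_cuda_version(smi_output: str) -> Optional[str]:
    """Pull the 'CUDA Version: X.Y' field out of nvidia-smi's header.

    Note: this is the highest CUDA runtime the installed driver supports;
    the container's CUDA/PyTorch build must not exceed it.
    """
    match = re.search(r"CUDA Version:\s*([\d.]+)", smi_output)
    return match.group(1) if match else None

def host_cuda_version() -> Optional[str]:
    """Run nvidia-smi; returns None if no NVIDIA driver is present."""
    try:
        out = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
    except FileNotFoundError:
        return None
    return parse_smi_cuda_version(out.stdout)
```

Comparing `host_cuda_version()` with `torch.version.cuda` surfaces the mismatch immediately instead of silently falling back to CPU.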

My EC2 instance has CUDA 11.4 installed, so my Dockerfile looks something like this:

FROM --platform=linux/amd64 nvidia/cuda:11.4.0-base-centos7
RUN yum update -y
RUN yum -y groupinstall development
RUN yum install -y python3 python3-devel tar gzip awscli git vim xz wget gcc make zlib-devel libjpeg-devel libffi-devel libxslt-devel libxml2-devel

RUN mkdir -p /app
COPY . /app
WORKDIR /app

RUN python3 -m pip install -r requirements.txt
RUN python3 -m pip install torch torchvision torchaudio -f https://download.pytorch.org/whl/cu114/torch_stable.html

ENV ECS_ENABLE_GPU_SUPPORT=true
ENV CUDA_VISIBLE_DEVICES=0

Notice two things about this Dockerfile:

1.  nvidia/cuda:11.4.0-base-centos7 refers to 11.4, which matches the CUDA version installed on the host

2.  The 114 in this URL also matches the CUDA driver version: https://download.pytorch.org/whl/cu114/torch_stable.html
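One more thing worth checking on the ECS side: the task definition must actually reserve a GPU via resourceRequirements, and the ECS agent on the host must have GPU support enabled (on the ECS GPU-optimized AMI, ECS_ENABLE_GPU_SUPPORT=true is set in /etc/ecs/ecs.config on the host). A minimal sketch of the relevant container-definition fragment, with placeholder names and image URI:

```python
# Sketch of the GPU reservation in an ECS container definition.
# "name" and "image" are placeholders; "resourceRequirements" is the
# part that matters -- without it, ECS never exposes the GPU device
# to the container, and PyTorch silently falls back to CPU.
container_definition = {
    "name": "gpu-trainer",  # placeholder container name
    "image": "<account>.dkr.ecr.<region>.amazonaws.com/trainer:latest",  # placeholder
    "resourceRequirements": [
        {"type": "GPU", "value": "1"},  # reserve one physical GPU
    ],
}
```

This fragment would be registered as part of the containerDefinitions list in the task definition.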

Once I corrected this, torch.cuda.is_available() returned True.
