Note

ECS Task Using GPU Taking Much Longer to Complete Than Expected

Feb 06, 2023

I’m using EC2 as a capacity provider for ECS tasks that require a GPU for training a model. Initially, the tasks were taking much longer to complete than expected. After some debugging, it turned out that the CPU was being used rather than the GPU, which explains the slowness. Calling the following function would return `False`:

```python
torch.cuda.is_available()
```

In my case, there was one main reason for this: the version of the CUDA drivers on the host (which can be found by running `nvidia-smi`) needs to match the `nvidia/cuda` image from which you build your container, which in turn needs to match the version of PyTorch with GPU support installed.

My EC2 instance has CUDA 11.4 installed, so my Dockerfile looks something like this:

```dockerfile
FROM --platform=linux/amd64 nvidia/cuda:11.4.0-base-centos7

RUN yum update -y
RUN yum -y groupinstall development
RUN yum install -y python3 python3-devel tar gzip awscli git vim xz wget gcc make zlib-devel libjpeg-devel libffi-devel libxslt-devel libxml2-devel

RUN mkdir -p /app
COPY . /app
WORKDIR /app

RUN python3 -m pip install -r requirements.txt
RUN python3 -m pip install torch torchvision torchaudio -f https://download.pytorch.org/whl/cu114/torch_stable.html

ENV ECS_ENABLE_GPU_SUPPORT=true
ENV CUDA_VISIBLE_DEVICES=0
```

Notice two things about this Dockerfile:

1. `nvidia/cuda:11.4.0-base-centos7` refers to 11.4, which matches the version of CUDA installed on the host.
2. The `cu114` in this URL also matches the CUDA driver version: https://download.pytorch.org/whl/cu114/torch_stable.html

Once I corrected this, `torch.cuda.is_available()` returned `True`.
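As a quick sanity check inside the running container, something like the following (a minimal sketch, assuming PyTorch is already installed) prints the CUDA version the PyTorch build was compiled against, which should line up with what `nvidia-smi` reports on the host:

```python
import torch

# False usually means a CPU-only PyTorch build or a host driver / CUDA mismatch.
print("CUDA available:", torch.cuda.is_available())

# The CUDA version this PyTorch wheel was built against (e.g. "11.4").
# Compare this with the driver/CUDA version shown by nvidia-smi on the host.
print("PyTorch built with CUDA:", torch.version.cuda)

if torch.cuda.is_available():
    # Confirm the GPU is actually visible to PyTorch.
    print("Device:", torch.cuda.get_device_name(0))
```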