torch.cuda.is_available() returning False is one of the most frustrating errors to encounter, particularly when Docker is involved. I would suggest taking a step-by-step approach to diagnosing and fixing it.
1. Check the "CUDA Version" reported by running nvidia-smi on the host; this is the highest CUDA version the installed driver supports. You should see something like this:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.78.01 Driver Version: 525.78.01 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:01:00.0 Off | Off |
| 53% 73C P2 374W / 450W | 21160MiB / 24564MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1338 G /usr/lib/xorg/Xorg 107MiB |
| 0 N/A N/A 1706 G /usr/bin/gnome-shell 44MiB |
| 0 N/A N/A 11132 C ...nda3/envs/ldm/bin/python3 21004MiB |
+-----------------------------------------------------------------------------+
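If you want to grab that version programmatically rather than by eyeballing the banner, here is a minimal sketch (the helper name host_cuda_version is my own; it just assumes nvidia-smi prints a "CUDA Version:" field as shown above):

```python
import re
import subprocess

def host_cuda_version(smi_output):
    """Extract the 'CUDA Version' field from nvidia-smi's banner, if present."""
    match = re.search(r"CUDA Version:\s*([\d.]+)", smi_output)
    return match.group(1) if match else None

# Parsing the banner line from the output shown above:
banner = "| NVIDIA-SMI 525.78.01  Driver Version: 525.78.01  CUDA Version: 12.0 |"
print(host_cuda_version(banner))  # -> 12.0

# On a real host you would feed it live output, e.g.:
# smi = subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout
# print(host_cuda_version(smi))
```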
2. Pull a nvidia/cuda Docker image whose tag matches the "CUDA Version" from above. In my experience, the CUDA version of the nvidia/cuda image can be at or below the version on the host, but not above it. If you choose an image other than one supplied by nvidia/cuda, you are on your own for installing the CUDA dependencies, etc.
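As a sketch, a Dockerfile for the host above could start from a tag at or below CUDA 12.0 (I am reusing the 11.4.0 tag from step 5 below; check Docker Hub for the tags actually available):

```dockerfile
# Base image whose CUDA version (11.4.0) is at or below the host's 12.0
FROM nvidia/cuda:11.4.0-base-ubuntu20.04

# The base image does not ship Python, so install it ourselves
RUN apt-get update && apt-get install -y python3 python3-pip
```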
3. Install a version of PyTorch in the Docker container that is built against a CUDA version at or below the one on the host.
RUN python3 -m pip install torch torchvision torchaudio -f https://download.pytorch.org/whl/cu114/torch_stable.html
Notice the cu114 in the PyTorch URL? That corresponds to CUDA 11.4, which is at or below the CUDA version on the host.
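That "at or below" rule can be sanity-checked in a few lines. This helper is my own sketch (the name cu_tag_compatible is made up), assuming the usual cuXYZ wheel-tag layout where the last digit is the minor version:

```python
def cu_tag_compatible(wheel_tag, host_cuda):
    """Return True if a PyTorch wheel tag like 'cu114' targets a CUDA
    version at or below the host's, e.g. '12.0' from nvidia-smi."""
    digits = wheel_tag.removeprefix("cu")
    wheel = (int(digits[:-1]), int(digits[-1]))        # 'cu114' -> (11, 4)
    host = tuple(int(part) for part in host_cuda.split("."))
    return wheel <= host                               # tuple comparison

print(cu_tag_compatible("cu114", "12.0"))  # True: 11.4 <= 12.0
print(cu_tag_compatible("cu121", "11.4"))  # False: 12.1 > 11.4
```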
4. Use the --gpus flag with docker run like this:
sudo docker run --gpus all model-train:latest
5. Test that a container can also see the GPU by running this command:
docker run -it --gpus all nvidia/cuda:11.4.0-base-ubuntu20.04 nvidia-smi
You should see the same nvidia-smi output from within the container as you do on the host itself.