Skip to main content

Installing Nvidia drivers on GPU instances running Ubuntu

Overview

note

Consider using a Civo CUDA disk image instead — it ships with NVIDIA drivers and CUDA preinstalled, and handles the single-H100 GPU NVLink workaround for you.

To take advantage of the Nvidia GPU in Civo GPU instances, you may wish to install a specific version of Nvidia GPU drivers on an instance running Ubuntu. This document will detail the following:

Preparation:

  • Upgrading the Ubuntu package list
  • Removal of all existing Nvidia drivers preinstalled on Ubuntu

Driver installation:

  • Installation of the Ubuntu common drivers
  • Listing the devices
  • Installing the drivers for the above devices
  • Installing the Nvidia 535 driver

Testing the installation is successful:

  • Rebooting the instance
  • Running nvidia-smi

Installing CUDA:

  • Updating the package list
  • Installing the Nvidia CUDA Toolkit

Preparation

To upgrade the Ubuntu package list, run the following command in an SSH connection to the instance to update the listing and install any upgrades:

sudo apt update && sudo apt upgrade -y

The output will list any upgraded packages.

Then, to remove any existing pre-installed Nvidia packages, run the following command:

sudo apt autoremove nvidia* --purge

The output will list all known Nvidia packages, and the operation taken on each.

Driver installation

To ensure the Ubuntu common drivers are installed, run the following command in an SSH connection to the instance:

sudo apt install ubuntu-drivers-common -y

The output will list any newly installed packages and drivers, along with instructions on whether the instance needs to be rebooted.

To list devices and their drivers, run the command ubuntu-drivers devices:

civo@driver-demo-a16a-8cf238:~$ ubuntu-drivers devices
ERROR:root:aplay command not found
== /sys/devices/pci0000:00/0000:00:01.5/0000:06:00.0 ==
modalias : pci:v000010DEd000020F1sv000010DEsd0000145Fbc03sc02i00
vendor : NVIDIA Corporation
model : GA100 [A100 PCIe 40GB]
driver : nvidia-driver-525-server - distro non-free
driver : nvidia-driver-535-server-open - distro non-free
driver : nvidia-driver-470 - distro non-free
driver : nvidia-driver-525 - distro non-free
driver : nvidia-driver-450-server - distro non-free
driver : nvidia-driver-535 - distro non-free recommended
driver : nvidia-driver-525-open - distro non-free
driver : nvidia-driver-535-server - distro non-free
driver : nvidia-driver-535-open - distro non-free
driver : nvidia-driver-470-server - distro non-free
driver : xserver-xorg-video-nouveau - distro free builtin

The ERROR:root:aplay line can be ignored, as virtual machines do not support sound output provided by aplay. Your output may list different versions and model architecture, depending on the GPU and OS version you have chosen for your instance.

Next, install the drivers, by running:

sudo ubuntu-drivers autoinstall

The output will be a long list of installation confirmations and system messages confirming that the kernel and drivers are installed and up to date.

To install the Nvidia 535 driver, run:

sudo apt install nvidia-driver-535

If there is a different version of the driver you wish to install, substitute its name in the above command.

If your instance has a single H100 GPU, you need an extra step before the driver will load cleanly. The NVIDIA driver tries to bring up the NVLink fabric at load time, fails because there is no peer GPU on the host, and the driver never becomes ready. The symptom is nvidia-smi hanging or erroring out after the reboot below.

The fix is to set the NVreg_NvLinkDisable=1 module option for the nvidia kernel module via a file in /etc/modprobe.d/:

echo "options nvidia NVreg_NvLinkDisable=1" | sudo tee /etc/modprobe.d/nvidia-disable-nvlink.conf

The option is only read when the nvidia kernel module is loaded, so the change does not take effect until the module is reloaded. The simplest reliable way to do this is a reboot — which is the next step of this guide anyway.

warning

Do not apply this on multi-GPU instances. NVLink is how the GPUs talk to each other on those hosts — disabling it cuts peer-to-peer bandwidth and degrades multi-GPU workloads.

After the reboot, you can confirm the option is active with:

cat /proc/driver/nvidia/params | grep NvLinkDisable

Testing the installation

First, reboot the machine:

$ sudo reboot now
Connection to [IP address] closed by remote host.
Connection to [IP address] closed.

Then, after the reboot completes (in about 30 seconds), connect back to the instance using SSH.

Once connected, run the nvidia-smi command.

$ nvidia-smi
Mon Dec 18 21:14:04 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100-PCIE-40GB Off | 00000000:06:00.0 Off | 0 |
| N/A 39C P0 38W / 250W | 4MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+

The version numbers should show the number corresponding to the version you installed in the earlier steps.

Installing CUDA

note

If you just finished with the previous sections, you can skip the package update step and proceed straight to installing the CUDA toolkit.

Update the Ubuntu package listings and installation:

sudo apt update && sudo apt upgrade -y

Install the CUDA toolkit:

sudo apt install nvidia-cuda-toolkit -y

Once the packages are downloaded and installed, you can test that the toolkit is the version you expect:

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0