Building a cheap AI/ML machine to run on a homelab kubernetes cluster

Over the last few months, I’ve spent some time downloading and using models from Hugging Face. It’s an exciting time in ML, especially with the open source community leading the charge in so many areas of the industry.

The best machine I had for this type of work was an Intel i7 12000 with a 12GB RTX 3060. That setup isn’t bad for things like Stable Diffusion, but it bottoms out on many other models that expect a minimum of 16GB of VRAM, and it really isn’t up to training larger models, something I want to start getting into.

After some looking around I decided to build a new machine using end-of-life server hardware. I’ve previously put together a cluster of old Dell servers that I still have running in my basement and decided that adding an extra machine to that cluster would be the way to go.

The main goal here is learning, so I wanted to spend as little as I could. Given that, one of the best GPU options available at the moment is the Nvidia P40. These cards have 24GB of VRAM each and are end of life, which means they are easy to pick up second hand or refurbished. Demand is lower than you might expect for a cheap card with that much memory because they are passively cooled, which proves problematic with typical consumer-level gear.

All up, I spent $1100 on this new machine, which included:

  • 1x Dell R730 w/ 256GB RAM @ $550
  • 2x 18-core 2.3GHz CPUs (E5-2696 v3) @ $54 each
  • 2x 6TB SAS drives @ $48 each
  • 2x Nvidia P40s @ $174 each

Install Ubuntu

The initial step was to install Ubuntu. This actually proved more challenging than previous server installs.

Firstly, I had a number of issues getting the nvidia drivers working with the P40s on Ubuntu 22.04 server. I found I could avoid those issues by dropping back to 20.04 when the P40s were installed in my R720, but when replicating this with the R730, I hit problems with the networking driver under Ubuntu 20.04 on that machine.

Ubuntu 22.04 server also had issues on the R730 that seemed related to the onboard display controller, hanging during boot with the message i915 enabling device (0006 -> 0007).

Figuring this out cost a good number of precious hours switching out cards and doing OS installs. Installing Ubuntu 22.04 desktop appeared to solve the display controller issue, so I settled on the desktop version. I figured that, seeing as I’ll likely be doing image / video generation, having a GUI available might not be all that bad.

Once Ubuntu was up and running, it was a case of installing ssh and locking down the IP (as described in my previous home data center post).
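Roughly, that amounts to something like the below. The interface name and addresses are placeholders for my network, and since I ended up on the desktop image the interface may be managed by NetworkManager rather than networkd, so treat this as a sketch rather than a drop-in config.

sudo apt install -y openssh-server

# Pin a static address with netplan (interface name and addresses are placeholders)
cat <<'EOF' | sudo tee /etc/netplan/01-static.yaml
network:
  version: 2
  ethernets:
    eno1:
      dhcp4: false
      addresses: [10.0.0.201/24]
      routes:
        - to: default
          via: 10.0.0.1
      nameservers:
        addresses: [10.0.0.1]
EOF
sudo netplan apply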

Install nvidia-driver-525

The next step was to install the relevant nvidia drivers. For this install I used 525.

sudo apt install nvidia-driver-525

During this install, if you’re using secure boot, you’ll be asked to choose a password in order to enrol a Machine-Owner Key (MOK). This will look similar to the following:

 ┌───────────────────────┤ Configuring Secure Boot ├───────────────────────┐
 │                                                                         │
 │ Your system has UEFI Secure Boot enabled.                               │
 │                                                                         │
 │ UEFI Secure Boot requires additional configuration to work with         │
 │ third-party drivers.                                                    │
 │                                                                         │
 │ The system will assist you in configuring UEFI Secure Boot. To permit   │
 │ the use of third-party drivers, a new Machine-Owner Key (MOK) has been  │
 │ generated. This key now needs to be enrolled in your system's           │
 │ firmware.                                                               │
 │                                                                         │
 │ To ensure that this change is being made by you as an authorized user,  │
 │ and not by an attacker, you must choose a password now and then         │
 │ confirm the change after reboot using the same password, in both the    │
 │ "Enroll MOK" and "Change Secure Boot state" menus that will be          │
 │ presented to you when this system reboots.                              │
 │                                                                         │
 │ If you proceed but do not confirm the password upon reboot, Ubuntu      │
 │ will still be able to boot on your system but any hardware that         │
 │ requires third-party drivers to work correctly may not be usable.       │
 │                                                                         │
 │                                 <Ok>                                    │
 │                                                                         │
 └─────────────────────────────────────────────────────────────────────────┘

After the install completed, I rebooted the server, which gave an option to enrol these keys. I selected ‘Enrol Key’ and once the server completed its startup, I could see that the drivers were loaded and the GPUs were available using the nvidia-smi command.
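Before moving on, a couple of quick sanity checks are worth running. mokutil is usually present on a secure boot install, and nvidia-smi -L simply lists the GPUs the driver can see.

# Secure boot should still be enabled with the new MOK enrolled
mokutil --sb-state

# The nvidia kernel modules should be loaded and both P40s visible
lsmod | grep nvidia
nvidia-smi -L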

Install Docker

My current Kubernetes cluster is on 1.23, so I need a version of Docker that is compatible with that version of Kubernetes, which is 5:20.10.23~3-0~ubuntu-jammy.

To install the relevant docker version, it was a case of doing the following. This is a pretty common pattern so I’ll avoid getting into the details here, but in general it amounts to adding the docker apt source and then installing the specific version of docker that I required.

sudo apt-get update
sudo apt-get install \
    ca-certificates \
    curl \
    gnupg \
    lsb-release
sudo mkdir -m 0755 -p /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \
  $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
sudo apt-get install -y docker-ce=5:20.10.23~3-0~ubuntu-jammy

Install nvidia-container-toolkit

The next step is allowing docker to access the GPUs. Similar to installing docker itself, I added a source for nvidia-container-toolkit and then installed it directly.

distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
      && curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
      && curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
            sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
            sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit

Once the toolkit is installed, the docker daemon needs to be made aware of it. This is a case of updating the daemon configuration, which can be done using the following command (much nicer than having to do it manually!).

sudo nvidia-ctk runtime configure --runtime=docker

After this, the docker daemon configuration will have something similar added to it as below:

cat /etc/docker/daemon.json
{
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    }
}

Restarting the docker daemon picks up the new configuration which is just a case of using systemctl:

sudo systemctl restart docker

Then we can test that docker has access to the GPU by running nvidia-smi within a container:

docker run --rm --runtime=nvidia --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 525.78.01               Driver Version: 525.78.01   CUDA Version: 12.0     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla P40                      Off | 00000000:03:00.0 Off |                  Off |
| N/A   41C    P0              52W / 250W |   7226MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla P40                      Off | 00000000:82:00.0 Off |                  Off |
| N/A   29C    P8              10W / 250W |      6MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+

Yay! 48GB of VRAM!

Install Kubernetes

As mentioned above, my Kubernetes cluster is still on 1.23 so I need to install the 1.23 versions of kubeadm, kubectl and kubelet. I’ve covered this in my other post, but for reference the commands are similar to the below:

sudo apt-get update
sudo apt-get install -y apt-transport-https ca-certificates curl
# Old version per other posts and the old location for googles apt-key!
# sudo curl -fsSLo /etc/apt/keyrings/kubernetes-archive-keyring.gpg https://packages.cloud.google.com/apt/doc/apt-key.gpg
# New location of apt-key!
sudo curl -fsSLo /etc/apt/keyrings/kubernetes-archive-keyring.gpg https://dl.k8s.io/apt/doc/apt-key.gpg
echo "deb [signed-by=/etc/apt/keyrings/kubernetes-archive-keyring.gpg] https://apt.kubernetes.io/ kubernetes-xenial main" | sudo tee /etc/apt/sources.list.d/kubernetes.list
sudo apt-get update
sudo apt-get install -y kubeadm=1.23.16-00 kubelet=1.23.16-00 kubectl=1.23.16-00
sudo apt-mark hold kubelet kubeadm kubectl

Then on the existing control node, we can get a join command using the following:

kubeadm token create --print-join-command
kubeadm join 10.0.0.200:6443 --token REDACTED --discovery-token-ca-cert-hash sha256:REDACTED

Running that join command on the R730 made it available as a new node within the cluster, which is visible using kubectl get nodes.
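For reference, that check is just the below. The new node may sit in NotReady for a minute or two while the CNI pods start on it.

kubectl get nodes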

Install nvidia-device-plugin

The last step is to allow Kubernetes to access the GPU via docker which can be done with nvidia-device-plugin.

This is a pretty simple case of applying their config to the cluster.

kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.13.0/nvidia-device-plugin.yml

To expose the GPUs to pods running on this node, the default runtime needs to be set to nvidia, which ensures that each pod has access. This is a case of updating /etc/docker/daemon.json to add the following:

"default-runtime": "nvidia",

Without the above, the nvidia-device-plugin logs will likely have something like the below:

Detected non-NVML platform: could not load NVML: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
Detected non-Tegra platform: /sys/devices/soc0/family file not found
Incompatible platform detected

After the device plugin is running, I deploy cschranz/gpu-jupyter and log into JupyterLab.
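A minimal sketch of what that deployment can look like is below; the names, port and GPU count are illustrative rather than the exact manifest from my cluster.

cat <<'EOF' | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-jupyter
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu-jupyter
  template:
    metadata:
      labels:
        app: gpu-jupyter
    spec:
      containers:
        - name: gpu-jupyter
          image: cschranz/gpu-jupyter
          ports:
            - containerPort: 8888
          resources:
            limits:
              nvidia.com/gpu: 2    # hand both P40s to the notebook container
EOF

# Reach the lab locally while testing
kubectl port-forward deployment/gpu-jupyter 8888:8888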

Starting a new Python 3 notebook and running the following should show that the GPUs are available:

import tensorflow as tf
tf.config.list_physical_devices()

[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'),
 PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'),
 PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU')]

Wrap up

At this point, I have a new node in my Kubernetes cluster with GPU capabilities. I can apply new GPU-based deployments specifying affinity with this node and have them use the P40s.
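As a rough sketch of the affinity piece (the node name and label here are placeholders, and gpu-jupyter refers to the deployment sketched above), labelling the node and adding a nodeSelector is enough to pin a workload to the P40s:

# Label the new node (name and label are illustrative)
kubectl label node r730 gpu=p40

# Pin an existing deployment to that label via a nodeSelector
kubectl patch deployment gpu-jupyter -p '{"spec":{"template":{"spec":{"nodeSelector":{"gpu":"p40"}}}}}'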

This makes setting up new use cases quick and easy. One example is GPU-based Jupyter notebooks for ML tasks, but I have also tested out Stable Diffusion and Audiocraft and both work well.