Setting Up Kubeflow on Kubernetes: A Step-by-Step Guide

June 24, 2024 - 4 mins read

The car inspection went well, and I will spend the rest of my half-day leave documenting the steps for setting up Kubeflow, the machine learning toolkit for Kubernetes.

Preparation

Kustomize introduces a template-free way to customize application configuration, simplifying the use of off-the-shelf applications. The simplest way to get started is to download the precompiled binary with the official install script:

curl -s "https://raw.githubusercontent.com/kubernetes-sigs/kustomize/master/hack/install_kustomize.sh"  | bash

# Moves kustomize to a system-wide location
sudo mv kustomize /usr/local/bin/

kubeflow-install-kustomize
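Before moving on, it may be worth confirming that the tools used in the rest of this guide are on the PATH. The following is a minimal sketch; require_tools is a hypothetical helper name of my own, not part of kustomize or kubectl.

```shell
# Hypothetical helper: report any required command that is not on the PATH.
# Returns 0 when every tool is found, 1 otherwise.
require_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

# Example: require_tools kustomize kubectl git && kustomize version
```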

Next, clone the Kubeflow manifests repository:

git clone https://github.com/kubeflow/manifests.git
cd manifests

Installation

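Install all official Kubeflow components and common services with the following command. The loop retries until every resource applies cleanly, since the first passes can fail while custom resource definitions are still being registered: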

while ! kustomize build example | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 20; done
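The loop above retries forever, which can hide a genuinely broken install. As a sketch only, the same idea can be capped at a fixed number of attempts. The names retry and apply_example are hypothetical, and the values 15 and 20 in the example are arbitrary.

```shell
# Hypothetical helper: run a command up to MAX times, pausing DELAY seconds
# between failed attempts. Returns 0 on the first success, 1 if every attempt fails.
retry() {
  max=$1
  delay=$2
  shift 2
  attempt=1
  while ! "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      echo "Giving up after $attempt attempts" >&2
      return 1
    fi
    echo "Attempt $attempt failed, retrying in ${delay}s" >&2
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# The install pipeline from above, wrapped so it can be passed to retry.
apply_example() {
  kustomize build example | kubectl apply -f -
}

# Example: retry 15 20 apply_example
```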

It will take a while for everything to install. Here is the view from the Portainer dashboard after the installation is complete.

kubeflow-all-services
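If you do not have a dashboard such as Portainer, a rough way to see what is still starting up is to filter the pod list from kubectl. This is a sketch under the assumption that the fourth column of kubectl get pods -A --no-headers is the pod status; pending_pods is a hypothetical helper name.

```shell
# Hypothetical helper: read `kubectl get pods -A --no-headers` output on stdin
# and print namespace/name plus status for pods that are neither Running nor Completed.
pending_pods() {
  awk '$4 != "Running" && $4 != "Completed" { print $1 "/" $2 " " $4 }'
}

# Example: kubectl get pods -A --no-headers | pending_pods
```

An empty result suggests everything has settled; anything listed is worth a closer look with kubectl describe or kubectl logs.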

Once installed, you can access the Kubeflow Central Dashboard. On my Windows machine, I used this command:

kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80

Then visit http://localhost:8080 and log in with the default email address user@example.com and the default password 12341234.

kubeflow-dashboard-login

Here is the Kubeflow dashboard upon successful installation:

kubeflow-dashboard-success-install

Notebooks

Let’s start by creating a new notebook named first-notebook. Select the image and leave the other settings as default:

kubeflow-new-notebook

The creation of the new notebook will take some time:

kubeflow-first-notebook-created

To execute the first Python command, enter the following in the first cell and then press Shift + Enter:

print("Hello World!")

That’s it! We have successfully created our first working notebook in Kubeflow:

kubeflow-first-hello-world-python-command

Optional - Add Test User

To add a new test user to Kubeflow, you can start from a clean slate by removing the current setup with these commands (or skip ahead to generating the password hash):

# Remain in this folder in subsequent steps
cd manifests

# Delete the entire installation
kustomize build example | kubectl delete -f -

# Delete the remaining namespace
kubectl delete ns kubeflow-user-example-com

# Run this in another terminal if deleting a particular namespace hangs
kubectl get namespace "kubeflow-user-example-com" -o json \
  | tr -d "\n" | sed "s/\"finalizers\": \[[^]]\+\]/\"finalizers\": []/" \
  | kubectl replace --raw /api/v1/namespaces/kubeflow-user-example-com/finalize -f -

while ! kustomize build example | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 20; done

Generate the password hash with these commands (getpass prompts for the password without echoing it):

# Install passlib
pip install passlib

# Generate the password hash
python3 -c 'from passlib.hash import bcrypt; import getpass; print(bcrypt.using(rounds=12, ident="2y").hash(getpass.getpass()))'

kubeflow-test-example-com-password
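Before pasting the hash anywhere, a quick shape check can catch copy-and-paste mistakes. The sketch below only checks the format (prefix, two-digit cost, and 53 trailing characters, 60 characters in total); it does not verify the hash against a password. is_bcrypt_hash is a hypothetical helper name.

```shell
# Hypothetical helper: succeed only if the argument looks like a bcrypt hash
# ($2a$, $2b$ or $2y$ prefix, two-digit cost, 53 salt-and-digest characters).
is_bcrypt_hash() {
  printf '%s\n' "$1" | grep -Eq '^\$2[aby]\$[0-9]{2}\$[./A-Za-z0-9]{53}$'
}

# Example (single quotes stop the shell from expanding the $ signs):
#   is_bcrypt_hash '<paste hash here>' && echo "hash format looks fine"
```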

Add the hash as a password entry for this test user (DEX_TEST_PASSWORD, the key referenced in the next step) by updating the following file:

vi common/dex/base/dex-passwords.yaml

kubeflow-dex-passwords
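As a rough sketch, and assuming the Secret in dex-passwords.yaml keeps its values under a stringData block with two-space indentation (check your copy of the file, as this layout is my assumption), the line to add could be produced like this. dex_password_entry is a hypothetical helper name; DEX_TEST_PASSWORD matches the key used in the config map entry below.

```shell
# Hypothetical helper: print the YAML line for the test user's password hash.
# Assumes the target block is indented by two spaces (an assumption about the file).
dex_password_entry() {
  printf '  DEX_TEST_PASSWORD: %s\n' "$1"
}

# Example (single quotes keep the $ signs in the hash intact):
#   dex_password_entry '<paste hash here>'
```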

Next, add the new user by updating the file common/dex/base/config-map.yaml with the following entry under staticPasswords:

- email: test@example.com
  hashFromEnv: DEX_TEST_PASSWORD
  username: test

kubeflow-test-example-com-username

Finally, update the default setting in apps/centraldashboard/upstream/base/params.env with the following:

CD_REGISTRATION_FLOW=true
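If you prefer not to edit the file by hand, the change can be scripted. This is a minimal sketch that assumes params.env is a plain KEY=value file with one setting per line; enable_registration_flow is a hypothetical helper name.

```shell
# Hypothetical helper: set CD_REGISTRATION_FLOW=true in a params.env-style file,
# replacing an existing line or appending one if the key is absent.
enable_registration_flow() {
  file=$1
  if grep -q '^CD_REGISTRATION_FLOW=' "$file"; then
    sed 's/^CD_REGISTRATION_FLOW=.*/CD_REGISTRATION_FLOW=true/' "$file" > "$file.tmp" &&
      mv "$file.tmp" "$file"
  else
    echo 'CD_REGISTRATION_FLOW=true' >> "$file"
  fi
}

# Example: enable_registration_flow apps/centraldashboard/upstream/base/params.env
```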

Reinstall Kubeflow with the same kustomize build command used during installation, then apply the updated Dex files and restart Dex:

# Execute these to replace the defaults (otherwise test@example.com will not be recognised):
kubectl apply -f common/dex/base/dex-passwords.yaml -n auth
kubectl apply -f common/dex/base/config-map.yaml -n auth
kubectl rollout restart deployment dex -n auth
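The restart returns immediately, so logging in too early may still hit the old Dex pod. One way to wait, sketched here with a hypothetical helper name restart_and_wait and an arbitrary 120 second timeout, is to follow the restart with kubectl rollout status:

```shell
# Hypothetical helper: restart a deployment and block until the rollout finishes.
# Arguments: deployment name, namespace.
restart_and_wait() {
  kubectl rollout restart deployment "$1" -n "$2" &&
    kubectl rollout status deployment "$1" -n "$2" --timeout=120s
}

# Example: restart_and_wait dex auth
```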

From Windows, use the same port forwarding command to log in as the newly created test user and grant access:

kubeflow-test-example-com-login

With the registration flow enabled, you will be greeted with:

kubeflow-test-example-com-welcome

And this:

kubeflow-test-example-com-namespace

That’s all there is to it! Now you have your own new test user namespace.

kubeflow-test-example-com-dashboard

Troubleshooting

Too many open files error

You may see a "too many open files" error in several pods, for example:

In the admission-webhook-deployment pod:

kubeflow-error-too-many-open-files

In the ml-pipeline pod:

kubeflow-error-too-many-open-files-2

In the training-operator pod:

kubeflow-error-too-many-open-files-3

You can temporarily raise the inotify limits max_user_instances and max_user_watches, then redeploy the affected pods:

# default returns a value of 128
cat /proc/sys/fs/inotify/max_user_instances
echo 2280 | sudo tee /proc/sys/fs/inotify/max_user_instances

# default returns a value of around 121865, depending on the current Ubuntu host
cat /proc/sys/fs/inotify/max_user_watches
echo 1255360 | sudo tee /proc/sys/fs/inotify/max_user_watches

For a permanent fix, add these lines to /etc/sysctl.conf (they take effect after a reboot or after running sudo sysctl -p):

fs.inotify.max_user_instances = 2280
fs.inotify.max_user_watches = 1255360

kubeflow-sysctl-increase-user-settings
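To check whether a host still needs the change, the current values can be compared against the targets used above. A minimal sketch, with below_limit as a hypothetical helper name:

```shell
# Hypothetical helper: succeed if the integer stored in FILE is below TARGET,
# meaning the limit still needs to be raised.
below_limit() {
  current=$(cat "$1")
  [ "$current" -lt "$2" ]
}

# Example:
#   below_limit /proc/sys/fs/inotify/max_user_instances 2280 && echo "raise max_user_instances"
#   below_limit /proc/sys/fs/inotify/max_user_watches 1255360 && echo "raise max_user_watches"
```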

No Namespaces

If you see no namespaces in your Kubeflow dashboard, it could be due to initial errors during installation. Here is my first Kubeflow dashboard without any namespaces:

kubeflow-dashboard-without-namespace

Without namespaces, the LAUNCH button remains disabled when trying to create a new notebook.

To resolve this, I followed these steps:

# Expected result, namespaces "kubeflow-user-example-com" not found
kubectl get ns kubeflow-user-example-com

# Uninstall kubeflow
kustomize build example | kubectl delete -f -

# Once all Kubeflow namespaces have been deleted, reinstall Kubeflow
while ! kustomize build example | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 20; done
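Namespace deletion can take a few minutes, so it may help to poll instead of guessing when it is safe to reinstall. The sketch below treats a failing kubectl get ns as "namespace gone", which is an assumption: the command also fails if the cluster is unreachable. wait_ns_gone is a hypothetical helper name, and the namespaces in the example are only illustrations.

```shell
# Hypothetical helper: poll until `kubectl get ns NAME` fails (namespace gone),
# checking every DELAY seconds for at most TRIES checks.
wait_ns_gone() {
  name=$1
  tries=$2
  delay=$3
  while kubectl get ns "$name" >/dev/null 2>&1; do
    tries=$((tries - 1))
    if [ "$tries" -le 0 ]; then
      echo "namespace $name still exists" >&2
      return 1
    fi
    sleep "$delay"
  done
}

# Example: wait_ns_gone kubeflow 30 10 && wait_ns_gone istio-system 30 10
```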

Jwks doesn't have key to match kid or alg from Jwt

If you encounter the Jwks doesn’t have key to match kid or alg from Jwt error, try logging in using incognito mode. If the error persists, clear your browser cache.

kubeflow-jwks-doesnt-have-key-to-match-kid

This is a post in the Machine Learning Operations series.