Integrating MLflow and Kubeflow on Talos

October 20, 2024 - 6 mins read

In this post, I will detail the installation of MLflow and Kubeflow on my Talos Homelab cluster.

Preparation

I have decided to reinitialize my homelab. You can follow similar steps in your own environment.

Talos Setup

As outlined in my previous Talos Linux setup, here is my updated control.patch file:

machine:
    network:
      hostname: control
    install:
        disk: /dev/nvme0n1
        image: ghcr.io/siderolabs/installer:v1.7.6
        wipe: true
    kubelet:
        defaultRuntimeSeccompProfileEnabled: false
cluster:
    apiServer:
        admissionControl:
            - name: PodSecurity
              configuration:
                apiVersion: pod-security.admission.config.k8s.io/v1alpha1
                defaults:
                    audit: privileged  
                    audit-version: latest
                    enforce: privileged  
                    enforce-version: latest
                    warn: privileged  
                    warn-version: latest
                exemptions:
                    namespaces: [] # Apply to all namespaces
                    runtimeClasses: []
                    usernames: []
                kind: PodSecurityConfiguration

I encountered an issue (time query error with server "17.253.60.125") while setting up the latest Talos v1.8.1, which I resolved by pointing the node at a different time server:

# Edit the control node's machine config
talosctl edit machineconfig -n 192.168.68.115

# Set the following in the editor
machine:
  time:
    disabled: false
    servers:
      - time.cloudflare.com

For my first worker node, here’s the worker-1.patch:

machine:
  network:
    hostname: worker-1
  install:
      disk: /dev/nvme0n1
      image: ghcr.io/siderolabs/installer:v1.7.6
      wipe: true
  kubelet:
    extraMounts:
      - destination: /var/mnt
        type: bind
        source: /var/mnt
        options:
          - bind
          - rw

The installation steps remain unchanged:

# Single master node
talosctl gen config homelab https://192.168.68.115:6443
talosctl disks --insecure -n 192.168.68.115
talosctl machineconfig patch controlplane.yaml --patch @control.patch --output control.yaml
talosctl apply-config --insecure -n 192.168.68.115 --file control.yaml
talosctl bootstrap --nodes 192.168.68.115 --endpoints 192.168.68.115 --talosconfig talosconfig

# Worker nodes
talosctl machineconfig patch worker.yaml --patch @worker-1.patch --output worker-1.yaml
talosctl apply-config --insecure -n 192.168.68.117 --file worker-1.yaml

Local Path Provisioner

Local-path will serve as the default StorageClass for ReadWriteOnce volumes. Start by downloading the manifest:

curl https://raw.githubusercontent.com/rancher/local-path-provisioner/v0.0.28/deploy/local-path-storage.yaml -O

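Edit local-path-storage.yaml to make local-path the default StorageClass and to store volumes under /var/mnt, then apply it with kubectl apply -f local-path-storage.yaml: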

---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-path
  annotations:
        storageclass.kubernetes.io/is-default-class: "true" # around line 120
---
kind: ConfigMap
apiVersion: v1
metadata:
  name: local-path-config
  namespace: local-path-storage
data: # below section around line 131
  config.json: |-
    {
            "nodePathMap":[
            {
                    "node":"DEFAULT_PATH_FOR_NON_LISTED_NODES",
                    "paths":["/var/mnt"]
            }
            ]
    }
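As I understand the provisioner's config, a node without its own nodePathMap entry falls back to the DEFAULT_PATH_FOR_NON_LISTED_NODES entry, so every node in this cluster ends up using /var/mnt (the same path bind-mounted in worker-1.patch). Here is a minimal Python sketch of that lookup; the paths_for_node helper is my own illustration, not part of the provisioner:

```python
import json

# The config.json payload from the ConfigMap above
CONFIG_JSON = """
{
    "nodePathMap": [
        {
            "node": "DEFAULT_PATH_FOR_NON_LISTED_NODES",
            "paths": ["/var/mnt"]
        }
    ]
}
"""

DEFAULT_KEY = "DEFAULT_PATH_FOR_NON_LISTED_NODES"


def paths_for_node(config: dict, node_name: str) -> list:
    """Return the storage paths for a node, falling back to the default entry."""
    entries = {entry["node"]: entry["paths"] for entry in config["nodePathMap"]}
    if node_name in entries:
        return entries[node_name]
    return entries.get(DEFAULT_KEY, [])


config = json.loads(CONFIG_JSON)
worker_paths = paths_for_node(config, "worker-1")
print(worker_paths)
```

Parsing the JSON this way is also a quick check that the edited config.json is still valid before applying the manifest.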

NFS

To support ReadWriteMany access modes, install the NFS subdir external provisioner and point it at your NFS server and export path:

helm repo add nfs-subdir-external-provisioner https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner/
helm repo update

helm install nfs-subdir-external-provisioner nfs-subdir-external-provisioner/nfs-subdir-external-provisioner \
    --set nfs.server=192.168.68.111 \
    --set nfs.path=/mnt/public

MetalLB

To install MetalLB, execute the following:

curl https://raw.githubusercontent.com/metallb/metallb/v0.14.8/config/manifests/metallb-native.yaml -O
kubectl apply -f metallb-native.yaml

kubectl apply -f metallb-ipaddresspool.yaml
kubectl apply -f metallb-l2advertisement.yaml

metallb-ipaddresspool.yaml example:

apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: first-pool
  namespace: metallb-system
spec:
  addresses:
  - 192.168.68.220-192.168.68.240

metallb-l2advertisement.yaml example:

apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: first-advert
  namespace: metallb-system
spec:
  ipAddressPools:
  - first-pool
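Before applying the pool, it can help to check how many addresses the range covers and that none of them collide with hosts already on the network. Here is a small Python sketch of that check; expand_pool is a hypothetical helper of mine, and the reserved list holds the control, worker, and NFS addresses used earlier in this post:

```python
import ipaddress


def expand_pool(range_spec: str) -> list:
    """Expand a 'start-end' address range into individual IPv4 addresses."""
    start_text, end_text = range_spec.split("-")
    start = ipaddress.IPv4Address(start_text.strip())
    end = ipaddress.IPv4Address(end_text.strip())
    if end < start:
        raise ValueError(f"range end {end} is before start {start}")
    return [ipaddress.IPv4Address(value) for value in range(int(start), int(end) + 1)]


pool = expand_pool("192.168.68.220-192.168.68.240")

# Addresses already in use in this post: control node, worker-1, NFS server
reserved = ["192.168.68.115", "192.168.68.117", "192.168.68.111"]
conflicts = [addr for addr in reserved if ipaddress.IPv4Address(addr) in pool]

print(len(pool), pool[0], pool[-1], conflicts)
```

The range should also sit outside your router's DHCP scope, which this sketch cannot check for you.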

Kubeflow

To install Kubeflow, follow the steps from my previous Kubeflow setup:

git clone https://github.com/kubeflow/manifests.git

cd manifests
while ! kustomize build example | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 20; done
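The loop above keeps re-running the apply until it exits cleanly. As I understand it, the first passes usually fail because some custom resources cannot be created until their CRDs have been registered. The same retry pattern, sketched in Python with a simulated apply step (retry_until_success and the fake outcomes are illustrative only):

```python
import time
from typing import Callable, Optional


def retry_until_success(
    action: Callable[[], bool],
    delay_seconds: float = 20.0,
    max_attempts: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call action until it returns True; return the number of attempts used."""
    attempts = 0
    while True:
        attempts += 1
        if action():
            return attempts
        if max_attempts is not None and attempts >= max_attempts:
            raise RuntimeError(f"gave up after {attempts} attempts")
        print("Retrying to apply resources")
        sleep(delay_seconds)


# Simulated apply that fails twice before succeeding (no real cluster involved)
outcomes = iter([False, False, True])
attempts_used = retry_until_success(lambda: next(outcomes), sleep=lambda _: None)
print(attempts_used)
```

Unlike the shell one-liner, this version accepts an optional max_attempts so a genuinely broken manifest does not loop forever.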

mlflow-kubeflow-namespace-pods

MLflow

MLflow is an open-source platform designed to streamline the machine learning lifecycle, ensuring that all phases are manageable and reproducible.

I installed MLflow on my Talos Homelab cluster with the Bitnami Helm chart:

helm install mlflow oci://registry-1.docker.io/bitnamicharts/mlflow --namespace mlflow --create-namespace

# Sample output
# CHART NAME: mlflow
# CHART VERSION: 2.0.2
# APP VERSION: 2.17.0
# 
# ** Please be patient while the chart is being deployed **
# You didn't specify any entrypoint to your code.
# To run it, you can either deploy again using the `source.launchCommand` option to specify your entrypoint, or execute it manually by jumping into the pods:
# 
# 1. Get the running pods
#     kubectl get pods --namespace mlflow -l "app.kubernetes.io/name=mlflow,app.kubernetes.io/instance=mlflow"
# 
# 2. Get into a pod
#     kubectl exec -ti [POD_NAME] bash
# 
# 3. Execute your script as you would normally do.
# MLflow Tracking Server can be accessed through the following DNS name from within your cluster:
# 
#     mlflow-tracking.mlflow.svc.cluster.local (port 80)
# 
# To access your MLflow site from outside the cluster follow the steps below:
# 
# 1. Get the MLflow URL by running these commands:
# 
#   NOTE: It may take a few minutes for the LoadBalancer IP to be available.
#         Watch the status with: 'kubectl get svc --namespace mlflow -w mlflow-tracking'
# 
#    export SERVICE_IP=$(kubectl get svc --namespace mlflow mlflow-tracking --template "{{ range (index .status.loadBalancer.ingress 0) }}{{ . }}{{ end }}")
#    echo "MLflow URL: http://$SERVICE_IP/"
# 
# 2. Open a browser and access MLflow using the obtained URL.
# 3. Login with the following credentials below to see your blog:
# 
#   echo Username: $(kubectl get secret --namespace mlflow mlflow-tracking -o jsonpath="{ .data.admin-user }" | base64 -d)
#   echo Password: $(kubectl get secret --namespace mlflow mlflow-tracking -o jsonpath="{.data.admin-password }" | base64 -d)

Launching MLflow

Using K9s, you can check the external IP that MetalLB assigned to the mlflow-tracking service.

mlflow-namespace-svc

Navigate to http://192.168.68.220 and log in with the credentials printed by:

echo Username: $(kubectl get secret --namespace mlflow mlflow-tracking -o jsonpath="{ .data.admin-user }" | base64 -d)
echo Password: $(kubectl get secret --namespace mlflow mlflow-tracking -o jsonpath="{.data.admin-password }" | base64 -d)
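Kubernetes stores Secret values base64-encoded, which is why both commands pipe through base64 -d. The decoding step looks like this in Python; the payload below is a made-up stand-in for the .data field of the mlflow-tracking secret, not real output from my cluster:

```python
import base64


def decode_secret_data(data: dict) -> dict:
    """Decode every base64-encoded value in a Secret's .data mapping."""
    return {key: base64.b64decode(value).decode("utf-8") for key, value in data.items()}


# Hypothetical payload, encoded here only so the example is self-contained
example_data = {
    "admin-user": base64.b64encode(b"user").decode("ascii"),
    "admin-password": base64.b64encode(b"example-password").decode("ascii"),
}

decoded = decode_secret_data(example_data)
print(decoded["admin-user"])
```

Base64 is an encoding, not encryption, so anyone who can read the secret can recover the password.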

mlflow-experiments-page

Launching Jupyter Notebook

For the default Kubeflow installation, port-forward the istio-ingressgateway service (in the istio-system namespace) to local port 8080, then open http://localhost:8080:

mlflow-kubeflow-port-forward

I created a new notebook using the default jupyter-scipy:v1.9.1 image.

mlflow-kubeflow-new-notebook

Getting Started with MLflow

Following the official MLflow Tracking Quickstart, here are the steps:

Install MLflow:

pip install mlflow==2.14.0rc0

mlflow-install-terminal

Set the Tracking Server URI:

import mlflow

mlflow.set_tracking_uri(uri="http://mlflow-tracking.mlflow")
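The short hostname mlflow-tracking.mlflow follows the service.namespace pattern, which cluster DNS resolves from pods in other namespaces; the Helm output earlier shows the fully qualified form, mlflow-tracking.mlflow.svc.cluster.local, on port 80. A small sketch of how those names are put together (service_url is a helper I made up for illustration, and cluster.local is assumed to be the cluster domain):

```python
def service_url(
    service: str,
    namespace: str,
    port: int = 80,
    fully_qualified: bool = False,
    cluster_domain: str = "cluster.local",
) -> str:
    """Build the in-cluster HTTP URL for a Kubernetes Service."""
    host = f"{service}.{namespace}"
    if fully_qualified:
        host = f"{host}.svc.{cluster_domain}"
    # Port 80 is the HTTP default, so it can be left out of the URL
    return f"http://{host}" if port == 80 else f"http://{host}:{port}"


short_uri = service_url("mlflow-tracking", "mlflow")
full_uri = service_url("mlflow-tracking", "mlflow", fully_qualified=True)
print(short_uri)
print(full_uri)
```

Either form should work as the tracking URI from a notebook pod; from outside the cluster, use the MetalLB address instead.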

Train a model and log metadata:

import mlflow
from mlflow.models import infer_signature

import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score


# Load the Iris dataset
X, y = datasets.load_iris(return_X_y=True)

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Define the model hyperparameters
params = {
    "solver": "lbfgs",
    "max_iter": 1000,
    "multi_class": "auto",
    "random_state": 8888,
}

# Train the model
lr = LogisticRegression(**params)
lr.fit(X_train, y_train)

# Predict on the test set
y_pred = lr.predict(X_test)

# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
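With its default settings, accuracy_score returns the fraction of predictions that match the true labels. To make that concrete, here is the same calculation by hand on a few made-up labels (these are not the Iris results):

```python
def accuracy(y_true: list, y_pred: list) -> float:
    """Fraction of predictions that equal the corresponding true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy of empty inputs")
    matches = sum(1 for truth, guess in zip(y_true, y_pred) if truth == guess)
    return matches / len(y_true)


# Hypothetical labels: four of the five predictions are correct
y_true_demo = [0, 1, 2, 2, 1]
y_pred_demo = [0, 1, 2, 1, 1]
demo_accuracy = accuracy(y_true_demo, y_pred_demo)
print(demo_accuracy)
```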

Log the model and metadata to MLflow:

# Set our tracking server uri for logging
mlflow.set_tracking_uri(uri="http://mlflow-tracking.mlflow")

# Create a new MLflow Experiment
mlflow.set_experiment("MLflow Quickstart")

# Start an MLflow run
with mlflow.start_run():
    # Log the hyperparameters
    mlflow.log_params(params)

    # Log the accuracy metric
    mlflow.log_metric("accuracy", accuracy)

    # Set a tag that we can use to remind ourselves what this run was for
    mlflow.set_tag("Training Info", "Basic LR model for iris data")

    # Infer the model signature
    signature = infer_signature(X_train, lr.predict(X_train))

    # Log the model
    model_info = mlflow.sklearn.log_model(
        sk_model=lr,
        artifact_path="iris_model",
        signature=signature,
        input_example=X_train,
        registered_model_name="tracking-quickstart",
    )

To authenticate with MLflow, create a ~/.mlflow/credentials file with the username and password retrieved earlier, then run the code above:

mkdir -p ~/.mlflow
echo "[mlflow]" > ~/.mlflow/credentials
echo "mlflow_tracking_username = user" >> ~/.mlflow/credentials
echo "mlflow_tracking_password = 39VpDZdVLr" >> ~/.mlflow/credentials
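The credentials file is a plain INI file with a single [mlflow] section. If you would rather create it from the notebook itself, something along these lines should work. This is a sketch: write_mlflow_credentials is my own helper, the demo writes to a temporary directory rather than the real home directory, and the password is a placeholder:

```python
import configparser
import os
import tempfile
from pathlib import Path


def write_mlflow_credentials(path: Path, username: str, password: str) -> None:
    """Write an INI credentials file with an [mlflow] section, readable only by the owner."""
    # interpolation=None lets passwords contain '%' without configparser errors
    config = configparser.ConfigParser(interpolation=None)
    config["mlflow"] = {
        "mlflow_tracking_username": username,
        "mlflow_tracking_password": password,
    }
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w") as handle:
        config.write(handle)
    os.chmod(path, 0o600)


# Demo against a temporary directory; use Path.home() / ".mlflow" / "credentials" for real
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / ".mlflow" / "credentials"
    write_mlflow_credentials(demo_path, "user", "example-password")
    written = demo_path.read_text()

readback = configparser.ConfigParser(interpolation=None)
readback.read_string(written)
print(written)
```

MLflow also reads the MLFLOW_TRACKING_USERNAME and MLFLOW_TRACKING_PASSWORD environment variables, which avoids leaving a password on disk.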

mlflow-credentials-file-and-log-to-mlflow

You should see the new experiment logged in MLflow:

mlflow-quickstart-log

Load the model for inference:

# Load the model back for predictions as a generic Python Function model
loaded_model = mlflow.pyfunc.load_model(model_info.model_uri)

predictions = loaded_model.predict(X_test)

iris_feature_names = datasets.load_iris().feature_names

result = pd.DataFrame(X_test, columns=iris_feature_names)
result["actual_class"] = y_test
result["predicted_class"] = predictions

result[:4]

The model reaches 100% accuracy on this test split, so the first four rows should look like this:

mlflow-load-model-and-for-inference

This concludes the installation and setup of MLflow and Kubeflow on a Talos Homelab cluster. You can now train models in Kubeflow notebooks and track their parameters, metrics, and artifacts in MLflow.

This is a post in the Machine Learning Operations series.
Other posts in this series:

October 20, 2024 - Integrating MLflow and Kubeflow on Talos
July 20, 2024 - Building Your First Kubeflow Pipeline: A Step-by-Step Guide
June 30, 2024 - Setting Up and Using KServe with Kubeflow
June 24, 2024 - Setting Up Kubeflow on Kubernetes: A Step-by-Step Guide