In this post, I will detail the installation of MLflow and Kubeflow on my Talos Homelab cluster.
Preparation
I have decided to reinitialize my homelab. You can follow similar steps in your own environment.
Talos Setup
As outlined in my previous Talos Linux setup, here is my updated control.patch file:
machine:
  network:
    hostname: control
  install:
    disk: /dev/nvme0n1
    image: ghcr.io/siderolabs/installer:v1.7.6
    wipe: true
  kubelet:
    defaultRuntimeSeccompProfileEnabled: false
cluster:
  apiServer:
    admissionControl:
      - name: PodSecurity
        configuration:
          apiVersion: pod-security.admission.config.k8s.io/v1alpha1
          defaults:
            audit: privileged
            audit-version: latest
            enforce: privileged
            enforce-version: latest
            warn: privileged
            warn-version: latest
          exemptions:
            namespaces: [] # Apply to all namespaces
            runtimeClasses: []
            usernames: []
          kind: PodSecurityConfiguration
I encountered an issue (time query error with server “17.253.60.125”) while setting up the latest Talos v1.8.1, which I resolved with:
# Edit control node
talosctl edit machineconfig -n 192.168.68.115
machine:
  time:
    disabled: false
    servers:
      - time.cloudflare.com
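After applying the edit, you can confirm that the node is syncing against the new server:
# Check the node's time sync status; the query error should be gone
talosctl time -n 192.168.68.115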
For my first worker node, here’s the worker-1.patch:
machine:
  network:
    hostname: worker-1
  install:
    disk: /dev/nvme0n1
    image: ghcr.io/siderolabs/installer:v1.7.6
    wipe: true
  kubelet:
    extraMounts:
      - destination: /var/mnt
        type: bind
        source: /var/mnt
        options:
          - bind
          - rw
The installation steps remain unchanged:
# Single master node
talosctl gen config homelab https://192.168.68.115:6443
talosctl get disks --insecure -n 192.168.68.115
talosctl machineconfig patch controlplane.yaml --patch @control.patch --output control.yaml
talosctl apply-config --insecure -n 192.168.68.115 --file control.yaml
talosctl bootstrap --nodes 192.168.68.115 --endpoints 192.168.68.115 --talosconfig talosconfig
# Worker nodes
talosctl machineconfig patch worker.yaml --patch @worker-1.patch --output worker-1.yaml
talosctl apply-config --insecure -n 192.168.68.117 --file worker-1.yaml
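With both nodes configured, fetch the kubeconfig so the kubectl commands in the rest of this post target the new cluster:
# Merge the cluster's kubeconfig and verify that the nodes join
talosctl kubeconfig --nodes 192.168.68.115 --endpoints 192.168.68.115 --talosconfig talosconfig
kubectl get nodes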
Local Path Provisioner
Local-path will serve as the default storageClass for ReadWriteOnce access modes. Follow these steps:
curl https://raw.githubusercontent.com/rancher/local-path-provisioner/v0.0.28/deploy/local-path-storage.yaml -O
Edit the local-path-storage.yaml file to set it as the default:
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-path
  annotations:
    storageclass.kubernetes.io/is-default-class: "true" # around line 120
---
kind: ConfigMap
apiVersion: v1
metadata:
  name: local-path-config
  namespace: local-path-storage
data: # below section around line 131
  config.json: |-
    {
      "nodePathMap":[
        {
          "node":"DEFAULT_PATH_FOR_NON_LISTED_NODES",
          "paths":["/var/mnt"]
        }
      ]
    }
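Then apply the manifest and confirm that local-path is now the default storage class:
kubectl apply -f local-path-storage.yaml
# local-path should be marked "(default)" in the output
kubectl get storageclass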
NFS
To support ReadWriteMany access modes, follow these steps:
helm repo add nfs-subdir-external-provisioner https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner/
helm repo update
helm install nfs-subdir-external-provisioner nfs-subdir-external-provisioner/nfs-subdir-external-provisioner \
--set nfs.server=192.168.68.111 \
--set nfs.path=/mnt/public
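To verify that ReadWriteMany provisioning works, you can create a small throwaway PVC; the chart's default storage class name is nfs-client (confirm with kubectl get storageclass if you have overridden it):
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs-test # illustrative name
spec:
  storageClassName: nfs-client
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Gi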
MetalLB
To install MetalLB, execute the following:
curl https://raw.githubusercontent.com/metallb/metallb/v0.14.8/config/manifests/metallb-native.yaml -O
kubectl apply -f metallb-native.yaml
kubectl apply -f metallb-ipaddresspool.yaml
kubectl apply -f metallb-l2advertisement.yaml
metallb-ipaddresspool.yaml example:
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: first-pool
  namespace: metallb-system
spec:
  addresses:
    - 192.168.68.220-192.168.68.240
metallb-l2advertisement.yaml example:
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: first-advert
  namespace: metallb-system
spec:
  ipAddressPools:
    - first-pool
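Once both resources are applied, any Service of type LoadBalancer should receive an address from the 192.168.68.220-240 range. A quick throwaway check (the lb-test name is just for illustration):
kubectl create deployment lb-test --image=nginx
kubectl expose deployment lb-test --type=LoadBalancer --port=80
# EXTERNAL-IP should come from the MetalLB pool
kubectl get svc lb-test
kubectl delete svc/lb-test deployment/lb-test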
Kubeflow
To install Kubeflow, follow the steps from my previous Kubeflow setup:
git clone https://github.com/kubeflow/manifests.git
cd manifests
while ! kustomize build example | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 20; done
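The loop retries until the Kubeflow CRDs are registered and all resources apply cleanly; afterwards, wait for the pods to settle:
kubectl get pods -n kubeflow
kubectl get pods -n istio-system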

MLflow
MLflow is an open-source platform designed to streamline the machine learning lifecycle, ensuring that all phases are manageable and reproducible.
To install MLflow on my Talos HomeLab cluster:
helm install mlflow oci://registry-1.docker.io/bitnamicharts/mlflow --namespace mlflow --create-namespace
# Sample output
# CHART NAME: mlflow
# CHART VERSION: 2.0.2
# APP VERSION: 2.17.0
#
# ** Please be patient while the chart is being deployed **
# You didn't specify any entrypoint to your code.
# To run it, you can either deploy again using the `source.launchCommand` option to specify your entrypoint, or execute it manually by jumping into the pods:
#
# 1. Get the running pods
# kubectl get pods --namespace mlflow -l "app.kubernetes.io/name=mlflow,app.kubernetes.io/instance=mlflow"
#
# 2. Get into a pod
# kubectl exec -ti [POD_NAME] bash
#
# 3. Execute your script as you would normally do.
# MLflow Tracking Server can be accessed through the following DNS name from within your cluster:
#
# mlflow-tracking.mlflow.svc.cluster.local (port 80)
#
# To access your MLflow site from outside the cluster follow the steps below:
#
# 1. Get the MLflow URL by running these commands:
#
# NOTE: It may take a few minutes for the LoadBalancer IP to be available.
# Watch the status with: 'kubectl get svc --namespace mlflow -w mlflow-tracking'
#
# export SERVICE_IP=$(kubectl get svc --namespace mlflow mlflow-tracking --template "{{ range (index .status.loadBalancer.ingress 0) }}{{ . }}{{ end }}")
# echo "MLflow URL: http://$SERVICE_IP/"
#
# 2. Open a browser and access MLflow using the obtained URL.
# 3. Login with the following credentials below to see your blog:
#
# echo Username: $(kubectl get secret --namespace mlflow mlflow-tracking -o jsonpath="{ .data.admin-user }" | base64 -d)
# echo Password: $(kubectl get secret --namespace mlflow mlflow-tracking -o jsonpath="{.data.admin-password }" | base64 -d)
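MetalLB handed the tracking Service the first free address in the pool, 192.168.68.220, which matches the URL used below. If you want a predictable address, the Bitnami chart appears to expose a tracking.service.loadBalancerIP value (verify the exact key with helm show values oci://registry-1.docker.io/bitnamicharts/mlflow):
# Assumed values key; confirm against the chart's values.yaml first
helm upgrade mlflow oci://registry-1.docker.io/bitnamicharts/mlflow \
  --namespace mlflow --reuse-values \
  --set tracking.service.loadBalancerIP=192.168.68.220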
Launching MLflow
Using K9s, you can check the external IP exposed via MetalLB.

Navigate to http://192.168.68.220 and log in with:
echo Username: $(kubectl get secret --namespace mlflow mlflow-tracking -o jsonpath="{ .data.admin-user }" | base64 -d)
echo Password: $(kubectl get secret --namespace mlflow mlflow-tracking -o jsonpath="{.data.admin-password }" | base64 -d)

Launching Jupyter Notebook
For the default Kubeflow installation, port-forward the istio-ingressgateway to port 8080:
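kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80
The Kubeflow dashboard is then reachable at http://localhost:8080.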

I created a new notebook using the default jupyter-scipy:v1.9.1 image.

Getting Started with MLflow
Following the official MLflow Tracking Quickstart, here are the steps:
- Install MLflow:
pip install mlflow==2.14.0rc0

- Set the Tracking Server URI:
import mlflow
mlflow.set_tracking_uri(uri="http://mlflow-tracking.mlflow")
- Train a model and log metadata:
import mlflow
from mlflow.models import infer_signature

import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Load the Iris dataset
X, y = datasets.load_iris(return_X_y=True)

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Define the model hyperparameters
params = {
    "solver": "lbfgs",
    "max_iter": 1000,
    "multi_class": "auto",
    "random_state": 8888,
}

# Train the model
lr = LogisticRegression(**params)
lr.fit(X_train, y_train)

# Predict on the test set
y_pred = lr.predict(X_test)

# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
- Log the model and metadata to MLflow:
# Set our tracking server uri for logging
mlflow.set_tracking_uri(uri="http://mlflow-tracking.mlflow")

# Create a new MLflow Experiment
mlflow.set_experiment("MLflow Quickstart")

# Start an MLflow run
with mlflow.start_run():
    # Log the hyperparameters
    mlflow.log_params(params)

    # Log the loss metric
    mlflow.log_metric("accuracy", accuracy)

    # Set a tag that we can use to remind ourselves what this run was for
    mlflow.set_tag("Training Info", "Basic LR model for iris data")

    # Infer the model signature
    signature = infer_signature(X_train, lr.predict(X_train))

    # Log the model
    model_info = mlflow.sklearn.log_model(
        sk_model=lr,
        artifact_path="iris_model",
        signature=signature,
        input_example=X_train,
        registered_model_name="tracking-quickstart",
    )
- To authenticate with the tracking server, create a ~/.mlflow/credentials file before running the code above:
mkdir -p ~/.mlflow
echo "[mlflow]" > ~/.mlflow/credentials
echo "mlflow_tracking_username = user" >> ~/.mlflow/credentials
echo "mlflow_tracking_password = 39VpDZdVLr" >> ~/.mlflow/credentials

You should see the new experiment logged in MLflow:

- Load the model for inference:
# Load the model back for predictions as a generic Python Function model
loaded_model = mlflow.pyfunc.load_model(model_info.model_uri)
predictions = loaded_model.predict(X_test)
iris_feature_names = datasets.load_iris().feature_names
result = pd.DataFrame(X_test, columns=iris_feature_names)
result["actual_class"] = y_test
result["predicted_class"] = predictions
result[:4]
With a commendable 100% accuracy, your predictions should look like this:

This concludes the installation and setup of MLflow and Kubeflow in your Talos HomeLab cluster. You can now effectively manage your machine learning lifecycle, leveraging both platforms for optimal productivity.