In this post, I will detail the installation of MLflow and Kubeflow on my Talos Homelab cluster.
Preparation
I have decided to reinitialize my homelab. You can follow similar steps in your own environment.
Talos Setup
As outlined in my previous Talos Linux setup, here is my updated control.patch file:
machine:
  network:
    hostname: control
  install:
    disk: /dev/nvme0n1
    image: ghcr.io/siderolabs/installer:v1.7.6
    wipe: true
  kubelet:
    defaultRuntimeSeccompProfileEnabled: false
cluster:
  apiServer:
    admissionControl:
      - name: PodSecurity
        configuration:
          apiVersion: pod-security.admission.config.k8s.io/v1alpha1
          defaults:
            audit: privileged
            audit-version: latest
            enforce: privileged
            enforce-version: latest
            warn: privileged
            warn-version: latest
          exemptions:
            namespaces: [] # Apply to all namespaces
            runtimeClasses: []
            usernames: []
          kind: PodSecurityConfiguration
I encountered an issue (time query error with server “17.253.60.125”) while setting up the latest Talos v1.8.1, which I resolved with:
# Edit the control node's machine config
talosctl edit machineconfig -n 192.168.68.115
machine:
  time:
    disabled: false
    servers:
      - time.cloudflare.com
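Afterwards, you can verify that the node syncs against the new server:
talosctl time -n 192.168.68.115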
For my first worker node, here’s the worker-1.patch:
machine:
  network:
    hostname: worker-1
  install:
    disk: /dev/nvme0n1
    image: ghcr.io/siderolabs/installer:v1.7.6
    wipe: true
  kubelet:
    extraMounts:
      - destination: /var/mnt
        type: bind
        source: /var/mnt
        options:
          - bind
          - rw
The installation steps remain unchanged:
# Single master node
talosctl gen config homelab https://192.168.68.115:6443
talosctl disks --insecure -n 192.168.68.115
talosctl machineconfig patch controlplane.yaml --patch @control.patch --output control.yaml
talosctl apply-config --insecure -n 192.168.68.115 --file control.yaml
talosctl bootstrap --nodes 192.168.68.115 --endpoints 192.168.68.115 --talosconfig talosconfig
# Worker nodes
talosctl machineconfig patch worker.yaml --patch @worker-1.patch --output worker-1.yaml
talosctl apply-config --insecure -n 192.168.68.117 --file worker-1.yaml
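Finally, fetch the kubeconfig so that kubectl can talk to the new cluster:
talosctl kubeconfig --nodes 192.168.68.115 --endpoints 192.168.68.115 --talosconfig talosconfig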
Local Path Provisioner
The local-path provisioner will serve as the default StorageClass for ReadWriteOnce access modes. First, download the manifest:
curl https://raw.githubusercontent.com/rancher/local-path-provisioner/v0.0.28/deploy/local-path-storage.yaml -O
Edit the local-path-storage.yaml file to set it as the default:
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-path
  annotations:
    storageclass.kubernetes.io/is-default-class: "true" # around line 120
---
kind: ConfigMap
apiVersion: v1
metadata:
  name: local-path-config
  namespace: local-path-storage
data: # below section around line 131
  config.json: |-
    {
      "nodePathMap":[
        {
          "node":"DEFAULT_PATH_FOR_NON_LISTED_NODES",
          "paths":["/var/mnt"]
        }
      ]
    }
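Then apply the manifest and confirm that local-path shows up as the default StorageClass:
kubectl apply -f local-path-storage.yaml
kubectl get storageclass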
NFS
To support ReadWriteMany access modes, install the nfs-subdir-external-provisioner:
helm repo add nfs-subdir-external-provisioner https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner/
helm repo update
helm install nfs-subdir-external-provisioner nfs-subdir-external-provisioner/nfs-subdir-external-provisioner \
--set nfs.server=192.168.68.111 \
--set nfs.path=/mnt/public
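To verify that ReadWriteMany provisioning works, you can apply a small test claim. The nfs-test name here is just an example, and nfs-client is the chart's default StorageClass name (adjust it if you overrode the default):
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs-test # example claim, safe to delete afterwards
spec:
  storageClassName: nfs-client # chart default
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
EOF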
MetalLB
To install MetalLB, execute the following:
curl https://raw.githubusercontent.com/metallb/metallb/v0.14.8/config/manifests/metallb-native.yaml -O
kubectl apply -f metallb-native.yaml
kubectl apply -f metallb-ipaddresspool.yaml
kubectl apply -f metallb-l2advertisement.yaml
metallb-ipaddresspool.yaml example:
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: first-pool
  namespace: metallb-system
spec:
  addresses:
    - 192.168.68.220-192.168.68.240
metallb-l2advertisement.yaml example:
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: first-advert
  namespace: metallb-system
spec:
  ipAddressPools:
    - first-pool
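Before relying on the address pool, it is worth a quick sanity check that the MetalLB controller and speaker pods are running:
kubectl get pods -n metallb-system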
Kubeflow
To install Kubeflow, follow the steps from my previous Kubeflow setup:
git clone https://github.com/kubeflow/manifests.git
cd manifests
while ! kustomize build example | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 20; done
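The loop may retry several times while CRDs register. Once it completes, wait for all pods to become ready before continuing:
kubectl get pods -n kubeflow -w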
MLflow
MLflow is an open-source platform designed to streamline the machine learning lifecycle, ensuring that all phases are manageable and reproducible.
To install MLflow on my Talos Homelab cluster:
helm install mlflow oci://registry-1.docker.io/bitnamicharts/mlflow --namespace mlflow --create-namespace
# Sample output
# CHART NAME: mlflow
# CHART VERSION: 2.0.2
# APP VERSION: 2.17.0
#
# ** Please be patient while the chart is being deployed **
# You didn't specify any entrypoint to your code.
# To run it, you can either deploy again using the `source.launchCommand` option to specify your entrypoint, or execute it manually by jumping into the pods:
#
# 1. Get the running pods
# kubectl get pods --namespace mlflow -l "app.kubernetes.io/name=mlflow,app.kubernetes.io/instance=mlflow"
#
# 2. Get into a pod
# kubectl exec -ti [POD_NAME] bash
#
# 3. Execute your script as you would normally do.
# MLflow Tracking Server can be accessed through the following DNS name from within your cluster:
#
# mlflow-tracking.mlflow.svc.cluster.local (port 80)
#
# To access your MLflow site from outside the cluster follow the steps below:
#
# 1. Get the MLflow URL by running these commands:
#
# NOTE: It may take a few minutes for the LoadBalancer IP to be available.
# Watch the status with: 'kubectl get svc --namespace mlflow -w mlflow-tracking'
#
# export SERVICE_IP=$(kubectl get svc --namespace mlflow mlflow-tracking --template "{{ range (index .status.loadBalancer.ingress 0) }}{{ . }}{{ end }}")
# echo "MLflow URL: http://$SERVICE_IP/"
#
# 2. Open a browser and access MLflow using the obtained URL.
# 3. Login with the following credentials below:
#
# echo Username: $(kubectl get secret --namespace mlflow mlflow-tracking -o jsonpath="{ .data.admin-user }" | base64 -d)
# echo Password: $(kubectl get secret --namespace mlflow mlflow-tracking -o jsonpath="{.data.admin-password }" | base64 -d)
Launching MLflow
Using K9s, you can check the external IP exposed via MetalLB.
Navigate to http://192.168.68.220 and log in with:
echo Username: $(kubectl get secret --namespace mlflow mlflow-tracking -o jsonpath="{.data.admin-user}" | base64 -d)
echo Password: $(kubectl get secret --namespace mlflow mlflow-tracking -o jsonpath="{.data.admin-password}" | base64 -d)
Launching Jupyter Notebook
For the default Kubeflow installation, port-forward the istio-ingressgateway to port 8080:
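kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80
Then open http://localhost:8080 and sign in with the default credentials from the Kubeflow manifests README.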
I created a new notebook using the default jupyter-scipy:v1.9.1 image.
Getting Started with MLflow
Following the official MLflow Tracking Quickstart, here are the steps:
- Install MLflow:
pip install mlflow==2.14.0rc0
- Set the Tracking Server URI:
import mlflow
mlflow.set_tracking_uri(uri="http://mlflow-tracking.mlflow")
- Train a model and log metadata:
import mlflow
from mlflow.models import infer_signature
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
# Load the Iris dataset
X, y = datasets.load_iris(return_X_y=True)
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Define the model hyperparameters
params = {
    "solver": "lbfgs",
    "max_iter": 1000,
    "multi_class": "auto",
    "random_state": 8888,
}
# Train the model
lr = LogisticRegression(**params)
lr.fit(X_train, y_train)
# Predict on the test set
y_pred = lr.predict(X_test)
# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
- Log the model and metadata to MLflow:
# Set our tracking server uri for logging
mlflow.set_tracking_uri(uri="http://mlflow-tracking.mlflow")
# Create a new MLflow Experiment
mlflow.set_experiment("MLflow Quickstart")
# Start an MLflow run
with mlflow.start_run():
    # Log the hyperparameters
    mlflow.log_params(params)
    # Log the loss metric
    mlflow.log_metric("accuracy", accuracy)
    # Set a tag that we can use to remind ourselves what this run was for
    mlflow.set_tag("Training Info", "Basic LR model for iris data")
    # Infer the model signature
    signature = infer_signature(X_train, lr.predict(X_train))
    # Log the model
    model_info = mlflow.sklearn.log_model(
        sk_model=lr,
        artifact_path="iris_model",
        signature=signature,
        input_example=X_train,
        registered_model_name="tracking-quickstart",
    )
- To authenticate with MLflow, create a ~/.mlflow/credentials file before running the code above:
mkdir -p ~/.mlflow
echo "[mlflow]" > ~/.mlflow/credentials
echo "mlflow_tracking_username = user" >> ~/.mlflow/credentials
echo "mlflow_tracking_password = 39VpDZdVLr" >> ~/.mlflow/credentials
You should see the new experiment logged in the MLflow UI.
- Load the model for inference:
# Load the model back for predictions as a generic Python Function model
loaded_model = mlflow.pyfunc.load_model(model_info.model_uri)
predictions = loaded_model.predict(X_test)
iris_feature_names = datasets.load_iris().feature_names
result = pd.DataFrame(X_test, columns=iris_feature_names)
result["actual_class"] = y_test
result["predicted_class"] = predictions
result[:4]
With a commendable 100% accuracy on this split, the predicted classes should match the actual classes for every row.
This concludes the installation and setup of MLflow and Kubeflow on your Talos Homelab cluster. You can now manage your machine learning lifecycle effectively, leveraging both platforms.