Deploying an OpenAI-Compatible llama.cpp Server with K3S
At the start of my week-long Christmas break, I extend the concepts from my previous post to set up an OpenAI-compatible server in my Home Lab.
Technical Setup
After iterating on a sample Dockerfile, I reinstalled my Ubuntu server and applied the necessary adjustments. The setup commands, reflecting my Home Lab’s new IP address (192.168.68.115), are:
sudo apt update && sudo apt upgrade -y
# Install docker
sudo apt install docker.io
sudo usermod -aG docker pi
# Install Anaconda
curl -O https://repo.anaconda.com/archive/Anaconda3-2023.09-0-Linux-x86_64.sh
chmod +x Anaconda3-2023.09-0-Linux-x86_64.sh
./Anaconda3-2023.09-0-Linux-x86_64.sh
# Init conda
source /home/pi/anaconda3/bin/activate
conda init
conda create -n docker-llama python
conda activate docker-llama
The corresponding Dockerfile features:
FROM python:3-slim-bullseye
# We need to set the host to 0.0.0.0 to allow outside access
ENV HOST 0.0.0.0
COPY ./phi-2.Q4_K_M.gguf .
# Install the package
RUN apt update && apt install -y libopenblas-dev ninja-build build-essential pkg-config
RUN pip install --upgrade pip
RUN python -m pip install --no-cache-dir --upgrade pip pytest cmake scikit-build setuptools fastapi uvicorn sse-starlette pydantic-settings starlette-context
RUN CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS" pip install --no-cache-dir --force-reinstall llama_cpp_python==0.2.24 --verbose
# Run the server
CMD ["python3", "-m", "llama_cpp.server", "--model", "/phi-2.Q4_K_M.gguf"]
For Microsoft’s Phi-2 model, I downloaded the GGUF file from TheBloke’s repository on Hugging Face:
wget https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf
Docker Image Build and Run
The image, packaged with Microsoft’s Phi2 model, is built using:
docker build . -t llama-microsoft-phi2:v0.2.24
To run the image:
docker run -p 8000:8000 --rm -it llama-microsoft-phi2:v0.2.24
To resolve the “failed to mlock” warning, add --cap-add IPC_LOCK like so:
docker run --cap-add IPC_LOCK -p 8000:8000 --rm -it llama-microsoft-phi2:v0.2.24
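A quick smoke test from the Docker host, assuming the server is listening on the default port 8000 (the prompt below is just an example):
# List the loaded model
curl http://localhost:8000/v1/models
# OpenAI-compatible completion request
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"prompt": "The capital of France is", "max_tokens": 16}'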
Docker Image Push and Deployment
The first step is to configure Docker on the new Ubuntu server to trust the local (insecure) registry:
sudo vi /etc/docker/daemon.json
Insert the following content into daemon.json:
{
  "insecure-registries": [
    "192.168.68.115:30500"
  ]
}
Configure Docker options:
sudo vi /etc/default/docker
Add the line:
DOCKER_OPTS="--config-file=/etc/docker/daemon.json"
Restart Docker:
sudo systemctl restart docker
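After the restart, the registry should appear under “Insecure Registries” in the daemon info:
docker info | grep -A 3 "Insecure Registries"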
Tag and push the image to the home lab:
docker image ls
docker tag llama-microsoft-phi2:v0.2.24 192.168.68.115:30500/llama-microsoft-phi2:v0.2.24
docker push 192.168.68.115:30500/llama-microsoft-phi2:v0.2.24
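To verify the push, the registry’s standard catalog endpoint should now list the repository:
curl http://192.168.68.115:30500/v2/_catalog
# Roughly: {"repositories":["llama-microsoft-phi2"]}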
If pushes of larger image layers fail with retry errors, bypass any proxy for the registry host:
sudo mkdir -p /etc/systemd/system/docker.service.d
sudo vi /etc/systemd/system/docker.service.d/http-proxy.conf
Add to http-proxy.conf:
[Service]
Environment="NO_PROXY=localhost,127.0.0.1,192.168.68.115"
Reload Docker:
sudo systemctl daemon-reload
sudo systemctl restart docker
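To confirm the drop-in took effect, inspect the environment of the Docker service:
sudo systemctl show --property=Environment docker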
Joining Home Lab as a K3S Node
Join the new server to the cluster as an agent node:
curl -sfL https://get.k3s.io | K3S_URL=https://192.168.68.115:6443 K3S_TOKEN=K10e848701b18977c63d7abfce920cf66c0f19bdd18a40862b2e7a14b89c4eb2742::server:ac92f2b7ccebbb46bf241bdaea3c99bf sh -
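From the K3S server, confirm that the new node has joined (kc being an alias for kubectl, as used later in this post):
kc get no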
Configure the insecure registry for the K3S agent:
sudo vi /etc/systemd/system/k3s-agent.service
Change ExecStart of k3s-agent.service to:
ExecStart=/usr/local/bin/k3s \
    agent \
    --docker \
    --insecure-registry=http://192.168.68.115:30500
# Restart k3s agent after the change
sudo systemctl daemon-reload
sudo systemctl restart k3s-agent
Deployment to K3S in Home Lab
Create a llama-phi2.yaml for the deployment (the IPC_LOCK capability resolves the “failed to mlock” warning):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-phi2
  namespace: llm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llama-phi2
  template:
    metadata:
      labels:
        app: llama-phi2
      name: llama-phi2
    spec:
      containers:
      - name: llama-phi2
        image: 192.168.68.115:30500/llama-microsoft-phi2:v0.2.24
        imagePullPolicy: IfNotPresent
        resources:
          requests:
            memory: "6Gi"
          limits:
            memory: "6Gi"
        ports:
        - containerPort: 8000
        securityContext:
          capabilities:
            add:
            - IPC_LOCK
      imagePullSecrets:
      - name: regcred
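Note that the llm namespace and the regcred pull secret are assumed to already exist from the previous post. If they do not, they can be created along these lines (the username and password are placeholders for whatever your registry uses):
kubectl create namespace llm
kubectl -n llm create secret docker-registry regcred \
  --docker-server=192.168.68.115:30500 \
  --docker-username=<user> \
  --docker-password=<password>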
Deploy using:
kca llama-phi2.yaml
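Here kca is assumed to be a shell alias for kubectl apply -f; without the alias, the equivalent command plus a quick check that the pod comes up would be:
kubectl apply -f llama-phi2.yaml
kubectl -n llm get pods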
For service exposure, create llama-phi2-svc.yaml:
apiVersion: v1
kind: Service
metadata:
  name: llama-phi2-svc
  namespace: llm
spec:
  selector:
    app: llama-phi2
  type: NodePort
  ports:
  - name: http
    protocol: TCP
    port: 80
    targetPort: 8000
    nodePort: 30000
Apply to the K3S cluster:
kca llama-phi2-svc.yaml
Access the llama-phi2 server through nodePort 30000.
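For example, a chat completion request against the NodePort (using the server’s IP here; any node IP in the cluster should work, and the prompt is just an illustration):
curl http://192.168.68.115:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Explain what a Home Lab is in one sentence."}], "max_tokens": 64}'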
If you want a straightforward label-based match, you may use a nodeSelector to run the pod on a specific host:
spec:
  nodeSelector:
    hostname: alien
    # Alternatively you may consider the default kubernetes.io/hostname: alien
    # By using the default label, we need not set the label ourselves
To label the node, you may use this:
kc label no alien hostname=alien
# Check current labels
kc get no --show-labels
# Re-deploy with the latest yaml change
kca llama-phi2.yaml
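After the redeploy, confirm the pod landed on the intended host:
kc get po -n llm -o wide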
Troubleshooting
If the K3S server schedules the pod onto a brand new worker node, you may face a similar “Illegal instruction (core dumped)” issue.
To keep the container running so you can get in and debug, add the following command override to your yaml file:
spec:
  containers:
  - name: llama-phi2
    image: 192.168.68.115:30500/llama-microsoft-phi2:v0.2.24
    command: ["/bin/sh", "-c", "tail -f /dev/null"]
    # ... rest of the container spec ...
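With this override in place the container stays up, and you can open a shell inside it, either from Portainer (as I did below) or with a kubectl equivalent such as:
kc exec -it -n llm deploy/llama-phi2 -- /bin/bash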
From the console of the pod (I used Portainer), try starting the server manually to reproduce the error:
# python3 -m llama_cpp.server --model /phi-2.Q4_K_M.gguf
Illegal instruction (core dumped)
From within the console, you may debug the python code:
# Use GDB to analyze the crash
apt-get install gdb
gdb python3
# Within GNU gdb, issue command; type quit to exit gdb
(gdb) run -m llama_cpp.server --verbose
This is the error which I faced:
warning: Error disabling address space randomization: Operation not permitted
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7f5e5b9a7700 (LWP 305)]
[New Thread 0x7f5e5b1a6700 (LWP 306)]
[New Thread 0x7f5e529a5700 (LWP 307)]
Thread 1 "python3" received signal SIGILL, Illegal instruction.
0x00007f5e5e21d34c in std::vector<std::pair<unsigned int, unsigned int>, std::allocator<std::pair<unsigned int, unsigned int> > >::vector(std::initializer_list<std::pair<unsigned int, unsigned int> >, std::allocator<std::pair<unsigned int, unsigned int> > const&) ()
from /usr/local/lib/python3.12/site-packages/llama_cpp/libllama.so
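To dig further at this point, a backtrace shows the full call chain into libllama.so:
(gdb) bt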
What I found out was that the image built on the Ubuntu server caused the above issue. After switching to my Windows machine and rebuilding the image, the error was resolved! The likely explanation is that llama.cpp is compiled for the build host’s CPU by default, so a binary built on one machine can use instructions that another node’s CPU does not support.
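If you hit the same problem, a quick check is to compare the CPU instruction flags on the machine that built the image and on the node that runs it; a flag present on one but missing on the other (typically an AVX variant) explains the SIGILL:
# Run on both the build host and the worker node, then compare the output
grep -o 'avx[^ ]*' /proc/cpuinfo | sort -u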