In the relentless pursuit of optimal disk space and lightning-fast inference speeds, I embarked on an exciting upgrade journey by integrating the formidable Lexar NM790 M.2 2280 PCIe SSD. This blog post unfolds in two parts: the first chronicles the migration of my Windows 11 installation to this powerhouse SSD, while the second unveils how GPU acceleration boosted inference speed for my Langchain4j application.


Part 1: Seamless OS Migration with Clonezilla

Amidst a sea of software promising seamless disk cloning, I found solace in the reliability of Clonezilla, a robust open-source tool for disk imaging and cloning.

After securely installing the Lexar SSD into the second M.2 slot of my PC, I tapped into the power of Clonezilla. Following the Clone your SSD or Hard Drive guide, I navigated through the process, opting for the -k1 option so that Clonezilla creates the partition table proportionally on the target disk.
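Before and after the clone, it's worth confirming that Windows sees both drives as expected; a quick sanity check from an elevated PowerShell prompt (disk numbers will vary by machine):

Get-Disk                      # list physical disks with size, partition style, and operational status
Get-Partition -DiskNumber 1   # inspect the partition layout of the target SSD (adjust the disk number)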

The result? A seamlessly cloned SSD, now serving as the boot drive and offering a tangible boost in inferencing speed as all my LLM models found their new home on this high-performance disk.


Part 2: GPU Acceleration Unleashed

Step 1: Installing CUDA for Windows

My journey continued with the installation of CUDA for Microsoft Windows. Armed with the NVIDIA CUDA Toolkit, I confirmed the successful installation using two simple PowerShell commands: nvcc --version and nvidia-smi.

[Screenshot: nvcc --version output]

[Screenshot: nvidia-smi output]
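For reference, these are the two checks (exact versions in the output will differ by driver and toolkit release):

nvcc --version   # prints the CUDA compiler (toolkit) version
nvidia-smi       # shows the detected GPU, driver version, and supported CUDA version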

Step 2: NVIDIA Container Toolkit Magic

The next leg of this GPU odyssey involved the installation of the NVIDIA Container Toolkit. Delving into the realm of Windows Subsystem for Linux (WSL), I executed a series of commands to seamlessly integrate this toolkit into my environment.

# Setting up the APT repository and configuring it
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list \
  && \
    sudo apt-get update

# Install the NVIDIA Container Toolkit package
sudo apt-get install -y nvidia-container-toolkit
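
Before moving on, a quick sanity check that the toolkit's CLI is available can save debugging later:

# Confirm the NVIDIA Container Toolkit CLI is on the PATH
nvidia-ctk --version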

Configuring Docker for this newfound power involved a few more commands:

# Configuring the container runtime
sudo nvidia-ctk runtime configure --runtime=docker

# Restarting docker daemon
sudo systemctl restart docker

# If docker.service is not found, reinstall Docker and reconfigure the runtime
sudo apt-get --reinstall install docker.io
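
To confirm that the runtime registration took effect, Docker's info output should now list nvidia among the available runtimes:

# Look for 'nvidia' in the Runtimes line of the output
sudo docker info | grep -i runtimes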

From the Docker Desktop settings (under Docker Engine), I added the following fragment to the engine's JSON configuration, applied the changes, and restarted Docker Desktop:

  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
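
That fragment gets merged into whatever is already in the Docker Engine JSON; a minimal complete configuration containing just this change would look like:

{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}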

[Screenshot: Docker Engine settings in Docker Desktop]

The grand test unfolded with a sample workload command:

sudo docker run --rm --runtime=nvidia --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi

[Screenshot: running the sample workload with Docker]
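
Incidentally, once "default-runtime": "nvidia" is set, the explicit --runtime flag becomes redundant; this shorter form should behave identically:

sudo docker run --rm --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi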

Be vigilant! Without the Docker Desktop setting, you might encounter the ominous error depicted below:

[Screenshot: "invalid runtime name: nvidia" error]


The Grand Finale

Bringing it all together, the magic command to launch the LocalAI image in a GPU Docker container was unveiled:

docker run -p 8080:8080 -v c:/local-ai-models:/models -ti --rm quay.io/go-skynet/local-ai:v1.40.0-cublas-cuda12 --models-path /models --context-size 2000 --threads 8 --debug=true

[Screenshot: LocalAI running in a GPU-enabled Docker container]

And behold, as we queried the WizardLM model, the GPU acceleration became readily apparent.
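
LocalAI exposes an OpenAI-compatible REST API on the mapped port, so a test query can be as simple as the following ("wizardlm" is a placeholder here; the model name must match a model file under c:/local-ai-models):

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "wizardlm", "messages": [{"role": "user", "content": "What is GPU acceleration?"}]}'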

[Screenshot: model loaded in the GPU-backed LocalAI container]

Compared with the earlier setup, the results spoke for themselves: our inference speed had ascended to new heights!

[Screenshot: improved inference speed with GPU acceleration]


Parting Thoughts

In this journey, we've showcased the synergy of SSD storage and GPU acceleration, breathing new life into the WizardLM model and witnessing a remarkable surge in inference speed. A testament to the relentless pursuit of tech optimization, this transformation opens new doors for Langchain4j enthusiasts, inviting you to explore the possibilities of elevated performance. Dive into the future with Langchain4j and redefine what's possible!