Preparation

(30 mins)

LLaMA is a collection of foundation language models ranging from 7B to 65B parameters.

In this guide, I will be following Georgi Gerganov’s llama.cpp, which implements inference of the LLaMA model in pure C/C++.

I will be setting this up on an Ubuntu machine with 32 GB of RAM.

[Screenshot: HP system information]
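
If you want to confirm your own machine’s specs from the terminal first, these standard commands report the Ubuntu release, CPU core count, and memory:

lsb_release -a   # Ubuntu release
nproc            # number of CPU cores
free -h          # total and available RAM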

To prepare the build system, I installed these packages:

sudo apt install git cmake build-essential python3 python3-pip
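
To verify the toolchain is in place before building, you can print each tool’s version; any reasonably recent gcc and CMake should work for this tag:

git --version
cmake --version
gcc --version
python3 --version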
  1. Clone and build the C/C++ code:
git clone https://github.com/ggerganov/llama.cpp -b b1680
cd llama.cpp

# Check that the b1680 tag is checked out
git describe --tags

# using CMake
mkdir build
cd build
cmake ..

[Screenshot: CMake configuration output]

cmake --build . --config Release

[Screenshot: CMake Release build output]
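
Once the build finishes, the binaries land in build/bin. A quick sanity check is to go back to the repository root and print the main example’s help text, which lists all supported options:

cd ..                     # back to the llama.cpp root
ls build/bin              # built binaries, including 'main'
./build/bin/main --help   # lists the supported options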

  2. For Microsoft’s Phi-2 model, I downloaded the GGUF-format weights from Hugging Face:
wget https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf
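
Since the file is a few gigabytes, it is worth confirming the download completed intact. Hugging Face lists the size and SHA-256 checksum for each file on the model page, so you can compare against those values:

ls -lh phi-2.Q4_K_M.gguf      # compare the size with the one listed on the model page
sha256sum phi-2.Q4_K_M.gguf   # compare with the checksum shown on Hugging Face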
  3. That’s all! With this setup, you can now run a LLaMA-family (ChatGPT-like) model on your local machine without worrying about exposing your prompts or data.

Running the LLaMA model locally

(5 mins)

Microsoft’s Phi-2 Model

This is an example of a few-shot interaction:

./build/bin/main -m ~/phi-2.Q4_K_M.gguf -n 128 --repeat_penalty 1.0 --color -i -r "User:" -f prompts/chat-with-bob.txt

[Screenshot: Phi-2 interactive inference session]
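
A few of the flags are worth spelling out: -n 128 caps the number of generated tokens, -i enables interactive mode, -r "User:" hands control back to you whenever the model emits that reverse prompt, and -f seeds the conversation from a prompt file. prompts/chat-with-bob.txt ships with llama.cpp; a minimal prompt file in the same style (the wording below is illustrative, not the exact file contents) looks like:

Transcript of a dialog where User interacts with a helpful assistant named Bob.

User: Hello, Bob.
Bob: Hello! How may I help you today?
User: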

This is a sample response to my prompt:

how to create a chatgpt like app in flutter, give me the step by step code starting with the main.dart

[Screenshot: Phi-2’s ChatGPT-style response]


Running with OpenBLAS (optional)

(2 mins)

OpenBLAS is an optimized Basic Linear Algebra Subprograms (BLAS) library. You can install it with:

sudo apt-get install libopenblas-dev

Rebuild llama.cpp with OpenBLAS enabled:

cmake .. -DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS
cmake --build . --config Release

# If you see an error similar to the one below, install pkg-config with the
# command that follows, then rerun the two cmake commands above:
# CMake Error at /usr/share/cmake-3.22/Modules/FindPackageHandleStandardArgs.cmake:230 (message):
#   Could NOT find PkgConfig (missing: PKG_CONFIG_EXECUTABLE)
sudo apt-get install pkg-config

[Screenshot: CMake configuration output with OpenBLAS]

[Screenshot: CMake Release build output with OpenBLAS]
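
To confirm the rebuilt binary actually links against OpenBLAS rather than silently falling back to the plain build, you can inspect its shared-library dependencies:

# Run from the llama.cpp root; a line mentioning libopenblas means BLAS is linked in
ldd build/bin/main | grep -i blas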

With the same prompt, this is a sample response, showing some speed improvement:

[Screenshot: Phi-2 response with the OpenBLAS build]
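
To put a number on the improvement, llama.cpp prints a timing summary (the llama_print_timings lines) at the end of every run; comparing the tokens-per-second figures between the plain and OpenBLAS builds gives a direct measurement. A short non-interactive run works well for this, and in my understanding BLAS mainly accelerates prompt processing, so the gain shows up most clearly in the prompt eval line:

./build/bin/main -m ~/phi-2.Q4_K_M.gguf -p "Write a haiku about Ubuntu." -n 64 2>&1 \
  | grep llama_print_timings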