Unleashing Text Generation with NVIDIA Jetson Orin NX

August 17, 2024 - 2 mins read

Navigating through the NVIDIA Jetson AI Lab has been an exhilarating experience, showcasing the potential of generative AI powered by NVIDIA® Jetson™. With a plethora of labs to explore, it’s challenging to cover everything in a limited time. In this post, I’ll focus on labs related to text generation.

Preparation

If you followed my Jetson Orin NX flashing guide, you might have noticed that no browser is pre-installed. I recommend installing Brave, a browser that blocks ads and conserves data. To install it, simply run:

sudo snap install brave

If you run nvidia-smi expecting to see a GPU listed, you’ll find none. This is because Jetson devices use an integrated GPU (iGPU) that connects directly to the memory controller, rather than a discrete GPU on the PCIe bus. For monitoring GPU usage, I recommend using the built-in Jetson Power GUI.
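For a terminal alternative to the Power GUI, the tegrastats utility that ships with Jetson Linux prints a status line that includes the GPU load as a GR3D_FREQ field. The sketch below is only an illustration: the helper name gpu_util_from_line is made up for this post, the sample line is hand-written rather than captured from a device, and the field layout is assumed to look like "GR3D_FREQ 37%", which may differ between JetPack releases.

```shell
# Extract the GPU load percentage from one line of tegrastats output.
# Assumption: the line contains a field shaped like "GR3D_FREQ 37%".
# gpu_util_from_line is a hypothetical helper, not part of any NVIDIA tool.
gpu_util_from_line() {
  printf '%s\n' "$1" | sed -n 's/.*GR3D_FREQ \([0-9][0-9]*\)%.*/\1/p'
}

# Hand-written sample line, used only to show the parsing step.
sample='RAM 3000/15523MB CPU [5%@729] GR3D_FREQ 37% cpu@45C'
gpu_util_from_line "$sample"
```

On the device itself, something like `tegrastats --interval 1000 | while read -r line; do gpu_util_from_line "$line"; done` should print one value per second while a model is generating.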

Text Generation - WebUI

Following the text-generation-webui guide, clone the jetson-containers repository and install its utilities, which automatically pull and start the appropriate container:

git clone https://github.com/dusty-nv/jetson-containers
bash jetson-containers/install.sh

jetson-containers run $(autotag text-generation-webui)

jetson-text-text-generation-webui

Llama-2-7b-Chat-GGUF

To start, I used TheBloke/Llama-2-7b-Chat-GGUF with the single-file model quantization llama-2-7b-chat.Q4_K_M.gguf:

jetson-text-download-llama-2-7b-chat

After downloading the model, I loaded it:

jetson-text-load-llama-2-7b-chat

Once set up, I was able to start chatting locally. On my Orin NX 16GB, the typical rate was 4.15 tokens per second:

jetson-text-llama-2-7b-chat-ui

jetson-text-llama-2-7b-chat-logs

Sheared-LLaMA-1.3B-ShareGPT

Next, I experimented with princeton-nlp/Sheared-LLaMA-1.3B-ShareGPT, with the Model Loader set to Transformers and the same input prompt as before:

jetson-text-sheared-llama-1-3b-sharegpt-ui

jetson-text-sheared-llama-1-3b-sharegpt-logs

On average, this setup yielded a higher inference rate of around 8.48 tokens per second without compromising chat quality.

Sheared-LLaMA-1.3B is a model pruned and further pre-trained from meta-llama/Llama-2-7b-hf.
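To put the two measured rates side by side, here is a small arithmetic sketch. The 4.15 and 8.48 tokens per second figures come from the runs above; the 200-token reply length is an arbitrary illustrative value, and the function names are made up for this example.

```shell
# Ratio between two tokens-per-second figures, rounded to two decimals.
speedup() {
  awk -v fast="$1" -v slow="$2" 'BEGIN { printf "%.2f\n", fast / slow }'
}

# Seconds needed to generate a reply of a given length at a given rate.
seconds_for_tokens() {
  awk -v tokens="$1" -v rate="$2" 'BEGIN { printf "%.1f\n", tokens / rate }'
}

speedup 8.48 4.15              # Sheared-LLaMA 1.3B vs Llama-2 7B -> 2.04
seconds_for_tokens 200 4.15    # 200-token reply on the 7B model   -> 48.2
seconds_for_tokens 200 8.48    # same reply on the 1.3B model      -> 23.6
```

In other words, the smaller model ran about twice as fast in these runs, which would turn a wait of roughly 48 seconds for a 200-token reply into roughly 24 seconds.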

Jetson Examples

The jetson-examples repository by Seeed Studio offers a straightforward, one-line command deployment for running Vision AI and Generative AI models on the NVIDIA Jetson platform.

To install the package, run:

pip3 install jetson-examples

jetson-text-install-jetson-examples

Before proceeding to the next section, add the directory where pip installed the reComputer command to your PATH by adding this line to your .bashrc file:

export PATH=/home/pi/.local/bin:$PATH
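If you would rather script that step, the sketch below appends the export line only when it is not already present, so running it twice does not duplicate the entry. The helper name add_path_line is hypothetical, and the demonstration writes to a scratch file instead of the real .bashrc.

```shell
# Append a PATH export for a directory to a shell rc file,
# but only if that exact line is not already there.
# add_path_line is a hypothetical helper name used for illustration.
add_path_line() {
  rc_file="$1"
  dir="$2"
  line="export PATH=$dir:\$PATH"
  touch "$rc_file"
  grep -qxF -- "$line" "$rc_file" || printf '%s\n' "$line" >> "$rc_file"
}

# Demonstrate on a scratch file rather than the real ~/.bashrc.
demo_rc=$(mktemp)
add_path_line "$demo_rc" "/home/pi/.local/bin"
add_path_line "$demo_rc" "/home/pi/.local/bin"   # second call changes nothing
cat "$demo_rc"
```

On the device you would call it as `add_path_line "$HOME/.bashrc" "$HOME/.local/bin"` and then open a new terminal or run `source ~/.bashrc` so the reComputer command is found.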

Text (LLM) - Llama3

To run Llama3, use the one-line command:

reComputer run llama3

Here is an example of the container in action:

jetson-text-recomputer-run-llama3

Text (LLM) - Sheared-LLaMA-2.7B-ShareGPT

For a speed comparison, run Sheared-LLaMA with the one-line command:

reComputer run Sheared-LLaMA-2.7B-ShareGPT

Here is another example of the container in action:

jetson-text-recomputer-run-sheared-llama-2-7b-sharegpt-logs

jetson-text-recomputer-run-sheared-llama-2-7b-sharegpt

Inference Server - Ollama

To run the Ollama inference server, use the command:

reComputer run ollama

jetson-text-recomputer-run-ollama

Once the Ollama inference server is running, you can interact with it by executing commands such as ollama run llama3 to start a chat session:

jetson-text-ollama-run-llama3
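Besides the interactive chat session, Ollama also serves an HTTP API, by default on port 11434. The sketch below shows one way to send a single prompt to its /api/generate endpoint. It assumes the server started above is reachable at localhost:11434 from the shell you are using (how the reComputer container maps its ports is not covered here) and that the llama3 model has already been pulled. The helper names are made up for this example.

```shell
# Build the JSON body for Ollama's /api/generate endpoint.
# Only backslashes and double quotes are escaped; prompts containing
# newlines or control characters need a proper JSON encoder instead.
ollama_payload() {
  model="$1"
  prompt=$(printf '%s' "$2" | sed 's/\\/\\\\/g; s/"/\\"/g')
  printf '{"model": "%s", "prompt": "%s", "stream": false}\n' "$model" "$prompt"
}

# Send the request and print the raw JSON reply.
# Assumption: the Ollama server listens on localhost:11434.
ollama_generate() {
  curl -s http://localhost:11434/api/generate -d "$(ollama_payload "$1" "$2")"
}

# Print the payload only; nothing is sent to the server here.
ollama_payload llama3 'Why is the sky "blue"?'
```

With the server running, `ollama_generate llama3 "Why is the sky blue?"` should return a JSON object whose response field holds the generated text.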

There’s much more to explore, but I’ll stop here for now. Enjoy your journey with Jetson!

This is a post in the NVIDIA Jetson Orin NX series.
Other posts in this series:

August 25, 2024 - Audio Generation with NVIDIA Jetson Orin NX
August 17, 2024 - Unleashing Text Generation with NVIDIA Jetson Orin NX
August 9, 2024 - Exploring NVIDIA Jetson Orin NX: Flashing and Setup Guide