Running llama.cpp on Linux: A CPU and NVIDIA GPU Guide

Discover the process of acquiring, compiling, and running the llama.cpp code in a Linux environment in this detailed post.

Whether you’re excited about working with language models or simply wish to gain hands-on experience, this step-by-step tutorial helps you get started with llama.cpp, available on GitHub.

To get started, clone the llama.cpp repository from GitHub by opening a terminal and executing the following commands:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

These commands download the repository and navigate into the newly cloned directory.

Working with llama.cpp requires language models. These can be obtained from a couple of sources, and llama.cpp can run different models, not just LLaMA.

Download the models and place them in a directory. By default, this is the models directory inside the cloned repo, but a different path can be specified with the --model flag when running the model.
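As a rough sketch, assuming the 7B weights were downloaded to ~/Downloads/llama-7B (a hypothetical path), moving them into the repo’s models directory could look like this:

# hypothetical download location; adjust to wherever your weights actually are
mkdir -p models/7B
mv ~/Downloads/llama-7B/* models/7B/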

Two methods will be explained for building llama.cpp: using only the CPU or leveraging the power of a GPU (in this case, NVIDIA).

The first method requires nothing more than the make command run inside the cloned repository, which compiles the code using only the CPU.
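For example, from inside the cloned repository (the -j flag simply parallelizes the build and can be omitted):

make -j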

For GPU-based compilation, installation of the NVIDIA CUDA toolkit is necessary. Although some distributions like Pop!_OS provide their own versions of the toolkit, downloading it from the official NVIDIA site is recommended.

Note
Proceed with this step at your own risk. If unsure, stick with the CPU-only method. If you do go ahead, prefer the deb (network) installation guide so you get the latest drivers and the latest CUDA version (if that fits your needs).

After installing the CUDA toolkit, a system reboot is required. Everything is set up correctly if nvidia-smi and nvcc --version execute without error.
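For example, a quick sanity check after the reboot:

nvidia-smi
nvcc --version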

To build the code with CUDA support, execute the following command inside the llama.cpp directory:

make clean && LLAMA_CUBLAS=1 make -j

This enables offloading computation to the GPU at run time via the --n-gpu-layers flag.

Before running llama.cpp, it’s a good idea to set up an isolated Python environment. This can be achieved with Conda, a popular package and environment manager for Python. To install Conda, either follow the official installation instructions or run the following script:

curl -sL "https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh" > "Miniconda3.sh"
bash Miniconda3.sh

If you’d prefer to disable the base Conda environment that activates each time you open a terminal, run conda config --set auto_activate_base false.

With the build complete, it’s time to run llama.cpp. Start by creating a new Conda environment and activating it:

conda create -n llama-cpp python=3.10.9
conda activate llama-cpp

Next, install the necessary Python packages from the requirements.txt file:

python3 -m pip install -r requirements.txt

The model can now be converted to fp16 and quantized to make it smaller, more performant, and runnable on consumer hardware:

python3 convert.py <MODELS_DIRECTORY>/7B/
./quantize <MODELS_DIRECTORY>/7B/ggml-model-f16.bin <MODELS_DIRECTORY>/7B/ggml-model-q4_0.bin q4_0

Finally, run the model. If you built the project using only the CPU, do not use the --n-gpu-layers flag. If you built it with CUDA support, use this flag to offload computation to the GPU. Set the number of layers to offload based on your VRAM capacity, increasing it gradually until you find a sweet spot. To offload everything to the GPU, set the number to a very high value (like 15000):

./main --color --interactive --model <MODELS_DIRECTORY>/7B/ggml-model-q4_0.bin --n-predict 512 --repeat_penalty 1.0 --n-gpu-layers 15000 --reverse-prompt "User:" --in-prefix " " -f prompts/chat-with-bob.txt

To find out more about the available flags and what they do, pass --help to the different binaries in the repo, or check the README.md in the llama.cpp repository. For example, to see all available quantization methods, run ./quantize --help.
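The same works for the main binary used throughout this post:

./main --help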

The model’s behavior can be steered with the --prompt flag (or, as in the command above, with -f/--file pointing at a prompt file). For example, providing an instruction in the chat-with-bob.txt file and using it as the prompt makes the model follow that instruction when generating text.
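As a minimal sketch of using --prompt directly (the instruction text below is only illustrative, not the contents of the bundled chat-with-bob.txt file):

./main --model <MODELS_DIRECTORY>/7B/ggml-model-q4_0.bin --n-predict 256 --prompt "You are Bob, a concise and helpful assistant. User: How do I list files in a directory? Bob:"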