Running llama.cpp on Linux: A CPU and NVIDIA GPU Guide
Discover the process of acquiring, compiling, and executing the llama.cpp code in a Linux environment in this detailed post.
Whether you’re excited about working with language models or simply wish to gain hands-on experience, this step-by-step tutorial helps you get started with llama.cpp, available on GitHub.
Getting the llama.cpp Code
To get started, clone the llama.cpp repository from GitHub by opening a terminal and executing the following commands:
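```
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
```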
These commands download the repository and navigate into the newly cloned directory.
Downloading the Models
Working with llama.cpp requires language models, which can be downloaded from a couple of different sources; you can run different models, not just LLaMA. Download the models and place them in a directory. By default, this is the models directory inside the cloned repo. However, a different directory can be specified with the --model flag when running the model.
Building llama.cpp
Two methods will be explained for building llama.cpp: using only the CPU, or leveraging the power of a GPU (in this case, NVIDIA).
Method 1: CPU Only
This method only requires using the make command inside the cloned repository. This command compiles the code using only the CPU.
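For example, from the root of the repository:

```
# Compile using all available CPU cores
make -j$(nproc)
```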
Method 2: NVIDIA GPU
For GPU-based compilation, installation of the NVIDIA CUDA toolkit is necessary. Although some distributions like Pop!_OS provide their own versions of the toolkit, downloading it from the official NVIDIA site is recommended.
After installing the CUDA toolkit, a system reboot is required. Everything is set up correctly if nvidia-smi and nvcc --version execute without error.
To build the code with CUDA support, run make with a CUDA flag inside the llama.cpp directory.
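The flag name has changed between releases (older Makefile-based versions use LLAMA_CUBLAS, newer ones use LLAMA_CUDA or the GGML_CUDA CMake option), so treat this as a sketch for an older checkout:

```
# Remove any previous CPU-only build artifacts, then rebuild with the CUDA (cuBLAS) backend
make clean
LLAMA_CUBLAS=1 make -j$(nproc)
```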
This enables offloading computations to the GPU when running the model using the --n-gpu-layers flag.
Setting up a Python Environment with Conda
Before running llama.cpp, it’s a good idea to set up an isolated Python environment. This can be achieved using Conda, a popular package and environment manager for Python. To install Conda, either follow the official installation instructions or run an installer script.
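For example, with the Miniconda installer for x86_64 Linux:

```
# Download and run the Miniconda installer (adjust for your architecture)
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
```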
If you’d prefer to disable the base Conda environment that activates each time you open a terminal, run conda config --set auto_activate_base false.
Running the Model
With the build complete, it’s time to run llama.cpp. Start by creating a new Conda environment and activating it:
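```
# The environment name and Python version are arbitrary examples
conda create -n llama-cpp python=3.10
conda activate llama-cpp
```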
Next, install the necessary Python packages from the requirements.txt file:
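```
python3 -m pip install -r requirements.txt
```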
The model can now be converted to fp16 and quantized to make it smaller, more performant, and runnable on consumer hardware.
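The exact script, file names, and extensions depend on the model and the llama.cpp version; as a sketch, assuming LLaMA weights placed under models/7B/:

```
# Convert the original weights to an fp16 model file (paths are illustrative)
python3 convert.py models/7B/

# Quantize the fp16 file down to 4 bits (q4_0) to shrink it further
./quantize models/7B/ggml-model-f16.gguf models/7B/ggml-model-q4_0.gguf q4_0
```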
Finally, run the model. If you built the project using only the CPU, do not use the --n-gpu-layers flag. If you used an NVIDIA GPU, utilize this flag to offload computations to the GPU. Set the number of layers to offload based on your VRAM capacity, increasing the number gradually until you find a sweet spot. To offload everything to the GPU, set the number to a very high value (like 15000).
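A typical invocation, assuming the quantized model from the previous step (adjust the path, prompt, and layer count to your setup), looks something like this:

```
# --n-gpu-layers only applies to the CUDA build; omit it for a CPU-only build
./main -m models/7B/ggml-model-q4_0.gguf \
  --prompt "Building a website can be done in 10 simple steps:" \
  --n-gpu-layers 15000
```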
Understanding the Flags
To find out more about the available flags and their function, run --help on the different binaries in the repo, or check the README.md in the llama.cpp repository. For example, to see all available quantization methods, run ./quantize --help.
Setting the Model’s Behavior
The model’s behavior can be set by using the --prompt flag. For example, providing an instruction in the chat-with-bob.txt file and using it as a prompt makes the model follow this instruction when generating text.
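A sketch of what this could look like, using the chat-with-bob.txt example that ships in the repository’s prompts/ directory (--file reads the prompt from a file, and the interactive flags keep the conversation going):

```
# Load the instruction prompt from a file and chat interactively
./main -m models/7B/ggml-model-q4_0.gguf \
  --file prompts/chat-with-bob.txt \
  --interactive-first \
  --reverse-prompt "User:"
```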
If you find this post helpful, please consider supporting the blog. Your contributions help sustain the development and sharing of great content. Your support is greatly appreciated!
Buy Me a Coffee