LLaMA Performance Benchmarking with llama.cpp on an NVIDIA RTX 3070 Ti

Explore how the LLaMA language model from Meta AI performs in various benchmarks using llama.cpp on a well-equipped desktop configuration.

Series - LLM Evaluations

In our constant pursuit of knowledge and efficiency, it’s crucial to understand how artificial intelligence (AI) models perform under different configurations and hardware. By comparing the four original versions of the model (7B, 13B, 30B, 65B) under varying conditions, this post aims to provide valuable insights into model performance and resource utilization on this particular hardware.

The test machine is a desktop with 32GB of RAM, powered by an AMD Ryzen 9 5900X CPU and an NVIDIA RTX 3070 Ti GPU with 8GB of VRAM. The models were tested using the Q4_0 quantization method, which significantly reduces model size at the cost of some output quality.
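To put those size savings in perspective, here is a rough back-of-the-envelope sketch (an estimate, not measured file sizes): Q4_0 stores weights in blocks of 32 four-bit values plus one fp16 scale, roughly 18 bytes per 32 weights, versus 2 bytes per weight for fp16.

```python
# Rough weight-storage estimate: fp16 (16 bits/weight) vs. Q4_0
# (~4.5 bits/weight: 32 four-bit weights + one fp16 scale = 18 bytes/block).
FP16_BITS = 16
Q4_0_BITS = 18 * 8 / 32  # 4.5 bits per weight

def approx_size_gb(params_billions, bits_per_weight):
    """Approximate weight storage in GB, ignoring small non-quantized tensors."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

for n in (7, 13, 30, 65):
    print(f"{n}B: fp16 ~{approx_size_gb(n, FP16_BITS):.0f} GB, "
          f"Q4_0 ~{approx_size_gb(n, Q4_0_BITS):.1f} GB")
```

Note that by this estimate the 65B Q4_0 weights alone land in the ~37 GB range, already beyond the test machine's 32GB of RAM, which helps explain the throughput numbers reported below.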

Each test followed the same procedure: environment setup, model conversion, quantization, and interaction. Below is the command sequence used for the tests; update the variables with your own values if you are reproducing this yourself:

# Set environment variables (example values; adjust for your setup)
MODEL_DIRECTORY=~/models
MODEL_SIZE=7B
GPU_LAYERS_OFFLOAD=35

# Convert the model to fp16
python3 convert.py $MODEL_DIRECTORY/$MODEL_SIZE

# Quantize the model to Q4_0
./quantize $MODEL_DIRECTORY/$MODEL_SIZE/ggml-model-f16.bin $MODEL_DIRECTORY/$MODEL_SIZE/ggml-model-q4_0.bin q4_0

# Interact with the model
./main --color --interactive --model $MODEL_DIRECTORY/$MODEL_SIZE/ggml-model-q4_0.bin --n-predict 128 --repeat_penalty 1.0 --n-gpu-layers $GPU_LAYERS_OFFLOAD --reverse-prompt "User:" --in-prefix " " --prompt "Transcript of a dialog, where the User interacts with an Assistant named Tony. Tony is an all knowing being, he is kind, honest, and helpful. He is an experienced software engineer as well. He never fails to respond to the User with utmost precision and knowledge."

This process was repeated for each of the four model sizes, and the tests were conducted both with and without GPU layer offloading.

With default cuBLAS GPU acceleration (no layers offloaded), the 7B model clocked in at approximately 9.8 tokens per second. With full offloading of all 35 layers, however, this figure jumped to 33.9 tokens per second.

The 13B version, using default cuBLAS GPU acceleration, returned approximately 5.3 tokens per second. With partial offloading of 26 out of 43 layers (limited by VRAM), the speed increased to 9.3 tokens per second.

The 30B model achieved roughly 2.2 tokens per second using default cuBLAS GPU acceleration. Despite offloading 14 out of 63 layers (limited by VRAM), the speed only slightly improved to 2.7 tokens per second.

The largest 65B version returned just 0.08 tokens per second using default cuBLAS GPU acceleration. Offloading 5 out of 83 layers (limited by VRAM) led to a negligible improvement, clocking in at approximately 0.09 tokens per second.

Initial findings suggest that layer offloading significantly boosts performance for smaller models. However, as model size increases, the benefits of offloading diminish due to the limitations imposed by available VRAM.
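The diminishing returns are easy to see when the reported figures above are expressed as speedup ratios:

```python
# Reported throughput (tokens/s): default cuBLAS vs. with layer offloading.
results = {
    "7B":  (9.8, 33.9),   # all 35 layers offloaded
    "13B": (5.3, 9.3),    # 26/43 layers
    "30B": (2.2, 2.7),    # 14/63 layers
    "65B": (0.08, 0.09),  # 5/83 layers
}
for size, (base, offloaded) in results.items():
    print(f"{size}: {offloaded / base:.1f}x speedup from offloading")
```

The speedup falls from roughly 3.5x on the 7B model, where every layer fits in VRAM, to barely above 1x on the 65B model, where only a handful of layers can be offloaded.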