LLaMa Performance Benchmarking with llama.cpp on NVIDIA 3070 Ti
Explore how the LLaMa language model from Meta AI performs in various benchmarks using llama.cpp on an advanced desktop configuration.
In our constant pursuit of knowledge and efficiency, it’s crucial to understand how artificial intelligence (AI) models perform under different configurations and hardware. By comparing the four original LLaMa model sizes (7B, 13B, 30B, 65B) under varying conditions, this post aims to provide useful insight into model performance and resource utilization on this particular hardware.
Test Setup
The test machine is a desktop with 32GB of RAM, powered by an AMD Ryzen 9 5900X CPU and an NVIDIA RTX 3070 Ti GPU with 8GB of VRAM. The models were tested using the Q4_0 quantization method, which significantly reduces model size at the cost of some quality loss.
The Testing Procedure
Each test followed the same procedure: environment setup, model conversion, quantization, and interaction with the model. If you are reproducing the tests yourself, make sure to update the paths and variables below with your own values.
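The exact commands depend on where your weights live, so treat the following as a minimal sketch of the usual llama.cpp workflow from that period rather than a verbatim transcript: the model directory (models/7B/ here), output filenames, and prompt are placeholders to adapt to your own setup, and the weights are assumed to be laid out as described in the llama.cpp README.

    # Build llama.cpp with cuBLAS support.
    git clone https://github.com/ggerganov/llama.cpp
    cd llama.cpp
    make LLAMA_CUBLAS=1

    # Convert the original PyTorch checkpoints to GGML format
    # (assumes the LLaMa weights and tokenizer sit under ./models/).
    python3 convert.py models/7B/ --outtype f16

    # Quantize the converted model with the Q4_0 method.
    ./quantize models/7B/ggml-model-f16.bin models/7B/ggml-model-q4_0.bin q4_0

    # Start an interactive session against the quantized model.
    ./main -m models/7B/ggml-model-q4_0.bin -n 128 -i -p "Your prompt here"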
This process was repeated for each of the four model sizes, and the tests were conducted both with and without GPU layer offloading.
Performance of 7B Version
With default cuBLAS GPU acceleration, the 7B model clocked in at approximately 9.8 tokens per second. However, with full offloading of all 35 layers, this figure jumped to 33.9 tokens per second.
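For context, the only difference between those two runs is the --n-gpu-layers (-ngl) flag. A hedged example of the two invocations, again with placeholder model paths and prompt:

    # Default cuBLAS acceleration: no layers offloaded
    # (prompt processing still uses the GPU via cuBLAS).
    ./main -m models/7B/ggml-model-q4_0.bin -n 128 -p "Your prompt here"

    # Full offload: all 35 layers of the 7B model placed in VRAM.
    ./main -m models/7B/ggml-model-q4_0.bin -n 128 -p "Your prompt here" --n-gpu-layers 35

The same pattern applies to the larger models below, with the -ngl value capped by whatever fits in the 8GB of VRAM.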
Performance of 13B Version
The 13B version, using default cuBLAS GPU acceleration, returned approximately 5.3 tokens per second. With partial offloading of 26 out of 43 layers (limited by VRAM), the speed increased to 9.3 tokens per second.
Performance of 30B Version
The 30B model achieved roughly 2.2 tokens per second using default cuBLAS GPU acceleration. Offloading 14 out of 63 layers (limited by VRAM) only slightly improved the speed, to 2.7 tokens per second.
Performance of 65B Version
The largest 65B version returned just 0.08 tokens per second using default cuBLAS GPU acceleration. Offloading 5 out of 83 layers (limited by VRAM) led to a negligible improvement, clocking in at approximately 0.09 tokens per second.
Initial Analysis
Initial findings suggest that layer offloading significantly boosts performance for smaller models. However, as model size increases, the benefit of offloading diminishes, because an ever smaller fraction of the layers fits in the 8GB of available VRAM.
If you find this post helpful, please consider supporting the blog. Your contributions help sustain the development and sharing of great content. Your support is greatly appreciated!