Llama-cpp: Difference between revisions
Notes about AMD ROCm |
improve styles and added troubleshooting |
||
| Line 19: | Line 19: | ||
==== in NixOS ==== | ==== in NixOS ==== | ||
After enable Unfree software in NixOS add CUDA to your packages | After enable Unfree software in NixOS add CUDA to your packages<syntaxhighlight lang="nixos"> | ||
{ | |||
< | environment.systemPackages = [ | ||
environment.systemPackages = [ | (pkgs.llama-cpp.override { cudaSupport = true; }) | ||
]; | |||
]; | } | ||
</ | </syntaxhighlight>And do a switch to the new configuration | ||
And do a switch to the new configuration | |||
sudo nixos-rebuild switch | sudo nixos-rebuild switch | ||
==== in a shell ==== | ==== in a shell ==== | ||
If you want take the CUDA package for a spin, before adding it to your system, you can open it in a shell: | If you want take the CUDA package for a spin, before adding it to your system, you can open it in a shell:<syntaxhighlight lang="bash"> | ||
< | |||
export NIXPKGS_ALLOW_UNFREE=1 | export NIXPKGS_ALLOW_UNFREE=1 | ||
nix shell --impure --expr '(import (builtins.getFlake "nixpkgs") {}).llama-cpp.override { cudaSupport = true; }' | nix shell --impure --expr '(import (builtins.getFlake "nixpkgs") {}).llama-cpp.override { cudaSupport = true; }' | ||
</ | </syntaxhighlight> | ||
=== BLAS Support === | === BLAS Support === | ||
BLAS support is automatically enabled if none of the GPU accelerators are enabled. You can still manually enable it in your nix configuration by doing: | BLAS support is automatically enabled if none of the GPU accelerators are enabled. You can still manually enable it in your nix configuration by doing:<syntaxhighlight lang="nixos"> | ||
{ | |||
< | environment.systemPackages = [ | ||
environment.systemPackages = [ | (pkgs.llama-cpp.override { blasSupport = true; }) | ||
]; | |||
]; | } | ||
</ | </syntaxhighlight> | ||
=== AMD ROCm === | === AMD ROCm === | ||
| Line 143: | Line 139: | ||
Once you've made <code>llama-cpp</code> available in your system. You can use <code>llama-cli</code>, which is a straightforward to use tool. | Once you've made <code>llama-cpp</code> available in your system. You can use <code>llama-cli</code>, which is a straightforward to use tool. | ||
In your shell: | In your shell:<syntaxhighlight lang="bash"> | ||
llama-cli \ | |||
< | -hf bartowski/Qwen_Qwen3-Coder-Next-GGUF:Q4_K_M \ | ||
llama-cli \ | --temp 1.0 --top-p 0.95 --top-k 40 \ | ||
-p "briefly explain journalctl in one paragraph" | |||
</syntaxhighlight> | |||
</ | |||
== llama-server == | == llama-server == | ||
| Line 156: | Line 150: | ||
<code>llama-server</code> runs a server, and it can run models on demand. It's quite similar to [[Ollama]]. | <code>llama-server</code> runs a server, and it can run models on demand. It's quite similar to [[Ollama]]. | ||
You can manually start the server from your terminal, it's usage, is not that different from <code>llama-cli</code>, but we are going to see the integration with NixOS as a service. | You can manually start the server from your terminal, it's usage, is not that different from <code>llama-cli</code>, but we are going to see the integration with NixOS as a service.<syntaxhighlight lang="nixos"> | ||
{ | |||
services.llama-cpp = { | |||
enable = true; | |||
package = pkgs.llama-cpp-vulkan; | |||
# Takes care of downloading if model not present | |||
modelsPreset = { | |||
"Qwen3-Coder-Next" = { | |||
hf-repo = "unsloth/Qwen3-Coder-Next-GGUF"; | |||
hf-file = "Qwen3-Coder-Next-UD-Q4_K_XL.gguf"; | |||
alias = "unsloth/Qwen3-Coder-Next"; | |||
temp = "1.0"; | |||
top-p = "0.95"; | |||
top-k = "40"; | |||
}; | |||
}; | |||
}; | |||
And do a switch to the new configuration | } | ||
</syntaxhighlight>And do a switch to the new configuration | |||
<pre> | <pre> | ||
sudo nixos-rebuild switch | sudo nixos-rebuild switch | ||
</pre> | </pre> | ||
=== Troubleshooting === | |||
==== Failed to create //.cache for shader cache ==== | |||
This is a known issue ([https://github.com/NixOS/nixpkgs/issues/441531 441531]), until it gets fixed, you can add to your conf:<syntaxhighlight lang="nix"> | |||
systemd.services.llama-cpp = { | |||
environment = { | |||
XDG_CACHE_HOME = "/var/cache/llama-cpp"; | |||
MESA_SHADER_CACHE_DIR = "/var/cache/llama-cpp"; | |||
}; | |||
}; | |||
</syntaxhighlight> | |||
Revision as of 06:50, 22 May 2026
The llama-cpp package in nixpkgs contains several tools provided by the llama.cpp repository.
A non-exhaustive example includes: llama-cli, llama-server, and llama-bench
The package comes in 3 flavors:
llama-cpp: the umbrella package, it uses the CPU if it doesn't find any GPU. On Mac Sillicon, it automatically detects that it should use the Metal backend. And for NVIDIA CUDA, you need to enable cudaSupport and unfree packages.llama-cpp-rocm: for AMD ROCm software stack. Under the shell, it's justllama-cppwith rocmSupport enabled.llama-cpp-vulkan: for Vulkan, which works with multiple CPU's and GPU's. Under the shell, it's justllama-cppwith vulkanSupport enabled. In some situations, it may perform even better than ROCm.
You can install any of the 3 in your system depending on your configuration. If your system is not covered by one of those packages, you can probably still install llama-cpp and with some customization make it fit your system
Customization
Nvidia CUDA
Nvidia CUDA contains Unfree software, so you have to enable it first, either in your NixOS configuration or via environmental variables.
in NixOS
After enable Unfree software in NixOS add CUDA to your packages
{
environment.systemPackages = [
(pkgs.llama-cpp.override { cudaSupport = true; })
];
}
And do a switch to the new configuration
sudo nixos-rebuild switch
in a shell
If you want take the CUDA package for a spin, before adding it to your system, you can open it in a shell:
export NIXPKGS_ALLOW_UNFREE=1
nix shell --impure --expr '(import (builtins.getFlake "nixpkgs") {}).llama-cpp.override { cudaSupport = true; }'
BLAS Support
BLAS support is automatically enabled if none of the GPU accelerators are enabled. You can still manually enable it in your nix configuration by doing:
{
environment.systemPackages = [
(pkgs.llama-cpp.override { blasSupport = true; })
];
}
AMD ROCm
Sometimes, ROCm might not be using the correct GPU architecture, or you simply want to try a different one, because it might work better. To tell ROCm which GPU architecture to use, you can use the HSA_OVERRIDE_GFX_VERSION environmental variable.
E.g:
export HSA_OVERRIDE_GFX_VERSION='11.5.1'
| Arch | Version | Example card |
|---|---|---|
| RDNA 3 APU | 11.0.0 | e.g: 780M |
| Strix Point | 11.5.0 | e.g: 880M |
| Strix Halo | 11.5.1 | e.g: Radeon 8060S |
| RDNA 4 "Navi 48" | 12.0.1 | e.g: Radeon RX 9070 XT |
Models
When usage llama-cli or llama-server, you can tune the parameters of the model.
Open models, usually include a card in their model page explaining how to optimize the parameters for different tasks.
For example, Qwen3-Coder-Next-GGUF reads:
To achieve optimal performance, we recommend the following sampling parameters: temperature=1.0, top_p=0.95, top_k=40.
And more general purpose models, may include multiple options sets for different purposes, e.g: chat, coding, agent, etc.
Does it run on your machine?
To calculate how much RAM you are going to need, the first signal is the number of parameters. You can make a loose estimate that 80B params will require 80GB RAM.
A better signal, is the size of the model. An 80B params model may be quantized, and the final size might be 60GB. By adding a 15% for context buffer, you would need an aprox. of 69 GB RAM.
When your GPU doesn't have enough RAM, with llama-cli or llama-server you can offload some of it to your system's RAM, by using the flag -ngl. Read the cli reference
Performance
Performance is bottle-necked by memory bandwidth. You can keep an eye for new developments and breakthroughs. But otherwise, it's hard to squeeze more performance.
| System | Memory bandwith | Est. Tokens/Sec (8B, Q4) | Notes |
|---|---|---|---|
| Nvidia RTX 5090 | 1792 GB/s | ~310 – 330 t/s | |
| Nvidia RTX 4090 | 1008 GB/s | ~180 – 200 t/s | |
| Apple M3 Ultra | 800 GB/s | ~145 – 155 t/s | |
| Radeon RX 9070 XT | 640 GB/s | ~110 – 125 t/s | |
| Strix Halo (AI Max+ 395+) | 256 GB/s | ~45 – 50 t/s | |
| Strix Point (HX 370) | 89 – 136 GB/s | ~12 – 25 t/s | Depends on the type of RAM used |
llama-cli
Once you've made llama-cpp available in your system. You can use llama-cli, which is a straightforward to use tool.
In your shell:
llama-cli \
-hf bartowski/Qwen_Qwen3-Coder-Next-GGUF:Q4_K_M \
--temp 1.0 --top-p 0.95 --top-k 40 \
-p "briefly explain journalctl in one paragraph"
llama-server
llama-server runs a server, and it can run models on demand. It's quite similar to Ollama.
You can manually start the server from your terminal, it's usage, is not that different from llama-cli, but we are going to see the integration with NixOS as a service.
{
services.llama-cpp = {
enable = true;
package = pkgs.llama-cpp-vulkan;
# Takes care of downloading if model not present
modelsPreset = {
"Qwen3-Coder-Next" = {
hf-repo = "unsloth/Qwen3-Coder-Next-GGUF";
hf-file = "Qwen3-Coder-Next-UD-Q4_K_XL.gguf";
alias = "unsloth/Qwen3-Coder-Next";
temp = "1.0";
top-p = "0.95";
top-k = "40";
};
};
};
}
And do a switch to the new configuration
sudo nixos-rebuild switch
Troubleshooting
Failed to create //.cache for shader cache
This is a known issue (441531), until it gets fixed, you can add to your conf:
systemd.services.llama-cpp = {
environment = {
XDG_CACHE_HOME = "/var/cache/llama-cpp";
MESA_SHADER_CACHE_DIR = "/var/cache/llama-cpp";
};
};