Llama-cpp: Difference between revisions

Latest revision as of 08:20, 22 May 2026

The llama-cpp package in nixpkgs contains several tools provided by the llama.cpp repository. A non-exhaustive example includes: llama-cli, llama-server, and llama-bench

The package comes in 3 flavors:

llama-cpp: the umbrella package, it uses the CPU if it doesn't find any GPU. On Mac Sillicon, it automatically detects that it should use the Metal backend. And for NVIDIA CUDA, you need to enable cudaSupport and unfree packages.
llama-cpp-rocm: for AMD ROCm software stack. Under the shell, it's just llama-cpp with rocmSupport enabled.
llama-cpp-vulkan: for Vulkan, which works with multiple CPU's and GPU's. Under the shell, it's just llama-cpp with vulkanSupport enabled. In some situations, it may perform even better than ROCm.

You can install any of the 3 in your system depending on your configuration. If your system is not covered by one of those packages, you can probably still install llama-cpp and with some customization make it fit your system

Customization

Nvidia CUDA

Nvidia CUDA contains Unfree software, so you have to enable it first, either in your NixOS configuration or via environmental variables.

in NixOS

After enable Unfree software in NixOS add CUDA to your packages

{
  environment.systemPackages = [
    (pkgs.llama-cpp.override { cudaSupport = true; })
  ];
}

And do a switch to the new configuration

sudo nixos-rebuild switch

in a shell

If you want take the CUDA package for a spin, before adding it to your system, you can open it in a shell:

export NIXPKGS_ALLOW_UNFREE=1
nix shell --impure --expr '(import (builtins.getFlake "nixpkgs") {}).llama-cpp.override { cudaSupport = true; }'

BLAS Support

BLAS support is automatically enabled if none of the GPU accelerators are enabled. You can still manually enable it in your nix configuration by doing:

{
  environment.systemPackages = [
    (pkgs.llama-cpp.override { blasSupport = true; })
  ];
}

AMD ROCm

Sometimes, ROCm might not be using the correct GPU architecture, or you simply want to try a different one, because it might work better. To tell ROCm which GPU architecture to use, you can use the HSA_OVERRIDE_GFX_VERSION environmental variable.

E.g:

export HSA_OVERRIDE_GFX_VERSION='11.5.1'


Arch	Version	Example card
RDNA 3 APU	11.0.0	780M
Strix Point	11.5.0	880M
Strix Halo	11.5.1	Radeon 8060S
RDNA 4 "Navi 48"	12.0.1	Radeon RX 9070 XT

Models

When usage llama-cli or llama-server, you can tune the parameters of the model.

Open models, usually include a card in their model page explaining how to optimize the parameters for different tasks.

For example, Qwen3-Coder-Next-GGUF reads:

To achieve optimal performance, we recommend the following sampling parameters: temperature=1.0, top_p=0.95, top_k=40.

And more general purpose models, may include multiple options sets for different purposes, e.g: chat, coding, agent, etc.

How much RAM do I need?

To calculate how much RAM you are going to need, the first signal is the number of parameters. You can make a loose estimate that 80B params will require 80GB RAM.

A better signal, is the size of the model. An 80B params model may be quantized, and the final size might be 60GB. By adding a 15% for context buffer, you would need an aprox. of 69 GB RAM.

When your GPU doesn't have enough RAM, with llama-cli or llama-server you can offload some of it to your system's RAM, by using the flag -ngl. Read the cli reference

What are Mixture of Experts (MoE)?

MoE refers to models with a high param count (e.g: 35B), but they only activate a small amount during usage (e.g: 3B).

This ends up improving performance, but the model still needs to fit in RAM.

For example, Qwen3.6-35B-A3B has 35B param count, but only 3B are active.

Performance

Performance is bottle-necked by memory bandwidth. You can keep an eye for new developments and breakthroughs. But otherwise, it's hard to squeeze more performance from a model.

Don't confuse GPU's memory bandwidth with GPU's memory (RAM).


System	Memory bandwith	Est. Tokens/Sec (8B, Q4)	Notes
Nvidia RTX 5090	1792 GB/s	~310 – 330 t/s
Nvidia RTX 4090	1008 GB/s	~180 – 200 t/s
Apple M3 Ultra	800 GB/s	~145 – 155 t/s
Radeon RX 9070 XT	640 GB/s	~110 – 125 t/s
Strix Halo (AI Max+ 395+)	256 GB/s	~45 – 50 t/s
Strix Point (HX 370)	89 – 136 GB/s	~12 – 25 t/s	Depends on the type of RAM used

llama-cli

Once you've made llama-cpp available in your system. You can use llama-cli, which is a straightforward to use tool.

In your shell:

llama-cli \ 
  -hf bartowski/Qwen_Qwen3-Coder-Next-GGUF:Q4_K_M \ 
  --temp 1.0 --top-p 0.95 --top-k 40 \ 
  -p "briefly explain journalctl in one paragraph"

llama-server

llama-server runs a server, and it can run models on demand. It's quite similar to Ollama.

You can manually start the server from your terminal, it's usage, is not that different from llama-cli, but we are going to see the integration with NixOS as a service.

{
  services.llama-cpp = {
    enable = true;
    package = pkgs.llama-cpp-vulkan;
    # Takes care of downloading if model not present
    modelsPreset = {
      "Qwen3-Coder-Next" = {
        hf-repo = "unsloth/Qwen3-Coder-Next-GGUF";
        hf-file = "Qwen3-Coder-Next-UD-Q4_K_XL.gguf";
        alias = "unsloth/Qwen3-Coder-Next";
        temp = "1.0";
        top-p = "0.95";
        top-k = "40";
      };
    };
  };
}

And do a switch to the new configuration

sudo nixos-rebuild switch

Troubleshooting

Failed to create //.cache for shader cache

This is a known issue (441531), until it gets fixed, you can add to your conf:

{
  systemd.services.llama-cpp = {
    environment = {
      XDG_CACHE_HOME = "/var/cache/llama-cpp";
      MESA_SHADER_CACHE_DIR = "/var/cache/llama-cpp";
    };
  };
}

@@ Line 46: / Line 46: @@
 === AMD ROCm ===
-Sometimes, ROCm might not be using the correct GPU architecture, or you simply want to try a different one, because it might work better. To tell ROCm which GPU architecture to use, you can use the HSA_OVERRIDE_GFX_VERSION environmental variable.
+Sometimes, ROCm might not be using the correct GPU architecture, or you simply want to try a different one, because it might work better. To tell ROCm which GPU architecture to use, you can use the <code>HSA_OVERRIDE_GFX_VERSION</code> environmental variable.
 E.g:<syntaxhighlight lang="bash">
@@ Line 59: / Line 59: @@
 |RDNA 3 APU
 |11.0.0
-|e.g: 780M
+|780M
 |-
 |Strix Point
 |11.5.0
-|e.g: 880M
+|880M
 |-
 |Strix Halo
 |11.5.1
-|e.g: Radeon 8060S
+|Radeon 8060S
 |-
 |RDNA 4 "Navi 48"
 |12.0.1
-|e.g: Radeon RX 9070 XT
+|Radeon RX 9070 XT
 |}
-Models
+== Models ==
 When usage <code>llama-cli</code> or <code>llama-server</code>, you can tune the parameters of the model.
@@ Line 86: / Line 86: @@
 And more general purpose models, may include multiple options sets for different purposes, e.g: chat, coding, agent, etc.
-=== Does it run on your machine? ===
+=== How much RAM do I need? ===
 To calculate how much RAM you are going to need, the first signal is the number of parameters. You can make a loose estimate that 80B params will require 80GB RAM.
@@ Line 92: / Line 92: @@
 A better signal, is the size of the model. An 80B params model may be quantized, and the final size might be 60GB. By adding a 15% for context buffer, you would need an aprox. of 69 GB RAM.
-When your GPU doesn't have enough RAM, with <code>llama-cli</code> or <code>llama-server</code> you can offload some of it to your system's RAM, by using the flag <code>-ngl</code>. Read the [https://github.com/ggml-org/llama.cpp/blob/master/docs/multi-gpu.md#command-line-arguments-reference cli reference]
+When your '''GPU doesn't have enough RAM''', with <code>llama-cli</code> or <code>llama-server</code> you can '''offload''' some of it to your system's RAM, by using the flag <code>-ngl</code>. Read the [https://github.com/ggml-org/llama.cpp/blob/master/docs/multi-gpu.md#command-line-arguments-reference cli reference]
+=== What are Mixture of Experts (MoE)? ===
+MoE refers to models with a high param count (e.g: 35B), but they only activate a small amount during usage (e.g: 3B).
+This ends up improving performance, but the model still needs to fit in RAM.
+For example, <code>Qwen3.6-35B-A3B</code> has 35B param count, but only 3B are active.
 === Performance ===
-Performance is bottle-necked by memory bandwidth. You can keep an eye for new developments and breakthroughs. But otherwise, it's hard to squeeze more performance.
+Performance is '''bottle-necked by memory bandwidth'''. You can keep an eye for new developments and breakthroughs. But otherwise, it's hard to squeeze more performance from a model.
+Don't confuse GPU's memory bandwidth with GPU's memory (RAM).
 {| class="wikitable"
 |+
@@ Line 178: / Line 187: @@
 ==== Failed to create //.cache for shader cache ====
-This is a known issue ([https://github.com/NixOS/nixpkgs/issues/441531 441531]), until it gets fixed, you can add to your conf:<syntaxhighlight lang="nix">
+This is a known issue ([https://github.com/NixOS/nixpkgs/issues/441531 441531]), until it gets fixed, you can add to your conf:<syntaxhighlight lang="nixos">
-systemd.services.llama-cpp = {
+{
-  environment = {
+  systemd.services.llama-cpp = {
-    XDG_CACHE_HOME = "/var/cache/llama-cpp";
+    environment = {
-    MESA_SHADER_CACHE_DIR = "/var/cache/llama-cpp";
+      XDG_CACHE_HOME = "/var/cache/llama-cpp";
+      MESA_SHADER_CACHE_DIR = "/var/cache/llama-cpp";
+    };
    };
-};
+}
 </syntaxhighlight>