Jump to content

Llama-cpp: Difference between revisions

From Official NixOS Wiki
Woile (talk | contribs)
Add llama-cpp with examples and cli usage
 
Woile (talk | contribs)
new heading MoE and styling improvements
 
(7 intermediate revisions by the same user not shown)
Line 6: Line 6:


* <code>llama-cpp</code>: the umbrella package, it uses the CPU if it doesn't find any GPU. On Mac Sillicon, it automatically detects that it should use the Metal backend. And for NVIDIA CUDA, you need to enable cudaSupport and unfree packages.
* <code>llama-cpp</code>: the umbrella package, it uses the CPU if it doesn't find any GPU. On Mac Sillicon, it automatically detects that it should use the Metal backend. And for NVIDIA CUDA, you need to enable cudaSupport and unfree packages.
* <code>llama-cpp-rocm</code>: for [https://en.wikipedia.org/wiki/ROCm AMD ROCm] software stack. Under the shell, it's just <code>llama-cpp</code> with <code>rocmSupport = true</code>.
* <code>llama-cpp-rocm</code>: for [https://en.wikipedia.org/wiki/ROCm AMD ROCm] software stack. Under the shell, it's just <code>llama-cpp</code> with rocmSupport enabled.
* <code>llama-cpp-vulkan</code>: for [https://en.wikipedia.org/wiki/Vulkan Vulkan], which works with multiple CPU's and GPU's. Under the shell, it's just <code>llama-cpp</code> with <code>vulkanSupport = true</code>. In some situations, it may perform even better than ROCm.
* <code>llama-cpp-vulkan</code>: for [https://en.wikipedia.org/wiki/Vulkan Vulkan], which works with multiple CPU's and GPU's. Under the shell, it's just <code>llama-cpp</code> with vulkanSupport enabled. In some situations, it may perform even better than ROCm.


You can install any of the 3 in your system depending on your configuration. If your system is not covered by one of those packages, you can probably still install <code>llama-cpp</code> and with some customization make it fit your system
You can install any of the 3 in your system depending on your configuration. If your system is not covered by one of those packages, you can probably still install <code>llama-cpp</code> and with some customization make it fit your system


== customization ==
== Customization ==


=== Nvidia CUDA ===
=== Nvidia CUDA ===
Line 17: Line 17:
Nvidia CUDA contains [[Unfree software]], so you have to enable it first, either in your NixOS configuration or via environmental variables.
Nvidia CUDA contains [[Unfree software]], so you have to enable it first, either in your NixOS configuration or via environmental variables.


==== NixOS ====
==== in NixOS ====


After enable Unfree software in NixOS add CUDA to your packages
After enable Unfree software in NixOS add CUDA to your packages<syntaxhighlight lang="nixos">
{
  environment.systemPackages = [
    (pkgs.llama-cpp.override { cudaSupport = true; })
  ];
}
</syntaxhighlight>And do a switch to the new configuration
sudo nixos-rebuild switch


<pre>
==== in a shell ====
environment.systemPackages = [
  (pkgs.llama-cpp.override { cudaSupport = true; })
];
</pre>
 
And do a switch to the new configuration
 
<pre>
sudo nixos-rebuild switch
</pre>
 
==== Creating a shell ====
 
If you want take the CUDA package for a spin, before adding it to your system, you can open it in a shell:


<pre>
If you want take the CUDA package for a spin, before adding it to your system, you can open it in a shell:<syntaxhighlight lang="bash">
export NIXPKGS_ALLOW_UNFREE=1
export NIXPKGS_ALLOW_UNFREE=1
nix shell --impure --expr '(import (builtins.getFlake "nixpkgs") {}).llama-cpp.override { cudaSupport = true; }'
nix shell --impure --expr '(import (builtins.getFlake "nixpkgs") {}).llama-cpp.override { cudaSupport = true; }'
</pre>
</syntaxhighlight>


=== BLAS Support ===
=== BLAS Support ===


BLAS support is automatically enabled if none of the GPU accelerators are enabled. You can still manually enable it in your nix configuration by doing:
BLAS support is automatically enabled if none of the GPU accelerators are enabled. You can still manually enable it in your nix configuration by doing:<syntaxhighlight lang="nixos">
{
  environment.systemPackages = [
    (pkgs.llama-cpp.override { blasSupport = true; })
  ];
}
</syntaxhighlight>


<pre>
=== AMD ROCm ===
environment.systemPackages = [
Sometimes, ROCm might not be using the correct GPU architecture, or you simply want to try a different one, because it might work better. To tell ROCm which GPU architecture to use, you can use the <code>HSA_OVERRIDE_GFX_VERSION</code> environmental variable.
  (pkgs.llama-cpp.override { blasSupport = true; })
];
</pre>


== models ==
E.g:<syntaxhighlight lang="bash">
export HSA_OVERRIDE_GFX_VERSION='11.5.1'
</syntaxhighlight>
{| class="wikitable"
|+
!Arch
!Version
!Example card
|-
|RDNA 3 APU
|11.0.0
|780M
|-
|Strix Point
|11.5.0
|880M
|-
|Strix Halo
|11.5.1
|Radeon 8060S
|-
|RDNA 4 "Navi 48"
|12.0.1
|Radeon RX 9070 XT
|}


== Models ==
When usage <code>llama-cli</code> or <code>llama-server</code>, you can tune the parameters of the model.  
When usage <code>llama-cli</code> or <code>llama-server</code>, you can tune the parameters of the model.  


Line 65: Line 86:


And more general purpose models, may include multiple options sets for different purposes, e.g: chat, coding, agent, etc.
And more general purpose models, may include multiple options sets for different purposes, e.g: chat, coding, agent, etc.
=== Does it run on your machine? ===
=== How much RAM do I need? ===


To calculate how much RAM you are going to need, the first signal is the number of parameters. You can make a loose estimate that 80B params will require 80GB RAM.
To calculate how much RAM you are going to need, the first signal is the number of parameters. You can make a loose estimate that 80B params will require 80GB RAM.
Line 71: Line 92:
A better signal, is the size of the model. An 80B params model may be quantized, and the final size might be 60GB. By adding a 15% for context buffer, you would need an aprox. of 69 GB RAM.
A better signal, is the size of the model. An 80B params model may be quantized, and the final size might be 60GB. By adding a 15% for context buffer, you would need an aprox. of 69 GB RAM.


When your GPU doesn't have enough RAM, with <code>llama-cli</code> or <code>llama-server</code> you can offload some of it to your system's RAM, by using the flag <code>-ngl</code>. Read the [https://github.com/ggml-org/llama.cpp/blob/master/docs/multi-gpu.md#command-line-arguments-reference cli reference]
When your '''GPU doesn't have enough RAM''', with <code>llama-cli</code> or <code>llama-server</code> you can '''offload''' some of it to your system's RAM, by using the flag <code>-ngl</code>. Read the [https://github.com/ggml-org/llama.cpp/blob/master/docs/multi-gpu.md#command-line-arguments-reference cli reference]
 
=== What are Mixture of Experts (MoE)? ===
MoE refers to models with a high param count (e.g: 35B), but they only activate a small amount during usage (e.g: 3B).
 
This ends up improving performance, but the model still needs to fit in RAM.
 
For example, <code>Qwen3.6-35B-A3B</code> has 35B param count, but only 3B are active.


=== Performance ===
=== Performance ===


Performance is bottle-necked by memory bandwidth. You can keep an eye for new developments and breakthroughs. But otherwise, it's hard to squeeze more performance.
Performance is '''bottle-necked by memory bandwidth'''. You can keep an eye for new developments and breakthroughs. But otherwise, it's hard to squeeze more performance from a model.  


* Nvidia RTX 5090 | 1792 GB/s
Don't confuse GPU's memory bandwidth with GPU's memory (RAM).
* Nvidia RTX 4090 | 1008 GB/s
{| class="wikitable"
* Apple M3 Ultra | 800 GB/s
|+
* Radeon RX 9070 XT | 640 GB/s
!System
* Strix Halo (AI Max+ 395+) | 256 GB/s
!Memory bandwith
* Strix Point (HX 370) | 89 – 136 GB/s
!Est. Tokens/Sec (8B, Q4)
!Notes
|-
|Nvidia RTX 5090
|1792 GB/s
|~310 – 330 t/s
|
|-
|Nvidia RTX 4090
|1008 GB/s
|~180 – 200 t/s
|
|-
|Apple M3 Ultra
|800 GB/s
|~145 – 155 t/s
|
|-
|Radeon RX 9070 XT
|640 GB/s
|~110 – 125 t/s
|
|-
|Strix Halo (AI Max+ 395+)
|256 GB/s
|~45 – 50 t/s
|
|-
|Strix Point (HX 370)
|89 – 136 GB/s
|~12 – 25 t/s
|Depends on the type of RAM used
|}


== llama-cli ==  
== llama-cli ==  


Once you've made <code>llama-cpp</code> available in your system. You can use <code>llama-cli</code>, a straightforward to use tool.  
Once you've made <code>llama-cpp</code> available in your system. You can use <code>llama-cli</code>, which is a straightforward to use tool.  


In your shell:
In your shell:<syntaxhighlight lang="bash">
 
llama-cli \  
<pre>
  -hf bartowski/Qwen_Qwen3-Coder-Next-GGUF:Q4_K_M \  
llama-cli \
  --temp 1.0 --top-p 0.95 --top-k 40 \  
    -hf bartowski/Qwen_Qwen3-Coder-Next-GGUF:Q4_K_M \
  -p "briefly explain journalctl in one paragraph"
    --temp 1.0 --top-p 0.95 --top-k 40 \
</syntaxhighlight>
    -p "briefly explain journalctl in one paragraph"
</pre>


== llama-server ==
== llama-server ==
Line 101: Line 159:
<code>llama-server</code> runs a server, and it can run models on demand. It's quite similar to [[Ollama]].
<code>llama-server</code> runs a server, and it can run models on demand. It's quite similar to [[Ollama]].


You can manually start the server from your terminal, it's usage, is not that different from <code>llama-cli</code>, but we are going to see the integration with NixOS as a service.
You can manually start the server from your terminal, it's usage, is not that different from <code>llama-cli</code>, but we are going to see the integration with NixOS as a service.<syntaxhighlight lang="nixos">
 
{
<pre>
  services.llama-cpp = {
services.llama-cpp = {
    enable = true;
  enable = true;
    package = pkgs.llama-cpp-vulkan;
  package = pkgs.llama-cpp-vulkan;
    # Takes care of downloading if model not present
 
    modelsPreset = {
  # Takes care of downloading if model not present
      "Qwen3-Coder-Next" = {
  modelsPreset = {
        hf-repo = "unsloth/Qwen3-Coder-Next-GGUF";
    "Qwen3-Coder-Next" = {
        hf-file = "Qwen3-Coder-Next-UD-Q4_K_XL.gguf";
      hf-repo = "unsloth/Qwen3-Coder-Next-GGUF";
        alias = "unsloth/Qwen3-Coder-Next";
      hf-file = "Qwen3-Coder-Next-UD-Q4_K_XL.gguf";
        temp = "1.0";
      alias = "unsloth/Qwen3-Coder-Next";
        top-p = "0.95";
      temp = "1.0";
        top-k = "40";
      top-p = "0.95";
      };
      top-k = "40";
     };
     };
   };
   };
};
}
</pre>


And do a switch to the new configuration
</syntaxhighlight>And do a switch to the new configuration


<pre>
<pre>
sudo nixos-rebuild switch
sudo nixos-rebuild switch
</pre>
</pre>
=== Troubleshooting ===
==== Failed to create //.cache for shader cache ====
This is a known issue ([https://github.com/NixOS/nixpkgs/issues/441531 441531]), until it gets fixed, you can add to your conf:<syntaxhighlight lang="nixos">
{
  systemd.services.llama-cpp = {
    environment = {
      XDG_CACHE_HOME = "/var/cache/llama-cpp";
      MESA_SHADER_CACHE_DIR = "/var/cache/llama-cpp";
    };
  };
}
</syntaxhighlight>

Latest revision as of 08:20, 22 May 2026

The llama-cpp package in nixpkgs contains several tools provided by the llama.cpp repository. A non-exhaustive example includes: llama-cli, llama-server, and llama-bench

The package comes in 3 flavors:

  • llama-cpp: the umbrella package, it uses the CPU if it doesn't find any GPU. On Mac Sillicon, it automatically detects that it should use the Metal backend. And for NVIDIA CUDA, you need to enable cudaSupport and unfree packages.
  • llama-cpp-rocm: for AMD ROCm software stack. Under the shell, it's just llama-cpp with rocmSupport enabled.
  • llama-cpp-vulkan: for Vulkan, which works with multiple CPU's and GPU's. Under the shell, it's just llama-cpp with vulkanSupport enabled. In some situations, it may perform even better than ROCm.

You can install any of the 3 in your system depending on your configuration. If your system is not covered by one of those packages, you can probably still install llama-cpp and with some customization make it fit your system

Customization

Nvidia CUDA

Nvidia CUDA contains Unfree software, so you have to enable it first, either in your NixOS configuration or via environmental variables.

in NixOS

After enable Unfree software in NixOS add CUDA to your packages

{
  environment.systemPackages = [
    (pkgs.llama-cpp.override { cudaSupport = true; })
  ];
}

And do a switch to the new configuration

sudo nixos-rebuild switch

in a shell

If you want take the CUDA package for a spin, before adding it to your system, you can open it in a shell:

export NIXPKGS_ALLOW_UNFREE=1
nix shell --impure --expr '(import (builtins.getFlake "nixpkgs") {}).llama-cpp.override { cudaSupport = true; }'

BLAS Support

BLAS support is automatically enabled if none of the GPU accelerators are enabled. You can still manually enable it in your nix configuration by doing:

{
  environment.systemPackages = [
    (pkgs.llama-cpp.override { blasSupport = true; })
  ];
}

AMD ROCm

Sometimes, ROCm might not be using the correct GPU architecture, or you simply want to try a different one, because it might work better. To tell ROCm which GPU architecture to use, you can use the HSA_OVERRIDE_GFX_VERSION environmental variable.

E.g:

export HSA_OVERRIDE_GFX_VERSION='11.5.1'
Arch Version Example card
RDNA 3 APU 11.0.0 780M
Strix Point 11.5.0 880M
Strix Halo 11.5.1 Radeon 8060S
RDNA 4 "Navi 48" 12.0.1 Radeon RX 9070 XT

Models

When usage llama-cli or llama-server, you can tune the parameters of the model.

Open models, usually include a card in their model page explaining how to optimize the parameters for different tasks.

For example, Qwen3-Coder-Next-GGUF reads:

To achieve optimal performance, we recommend the following sampling parameters: temperature=1.0, top_p=0.95, top_k=40.

And more general purpose models, may include multiple options sets for different purposes, e.g: chat, coding, agent, etc.

How much RAM do I need?

To calculate how much RAM you are going to need, the first signal is the number of parameters. You can make a loose estimate that 80B params will require 80GB RAM.

A better signal, is the size of the model. An 80B params model may be quantized, and the final size might be 60GB. By adding a 15% for context buffer, you would need an aprox. of 69 GB RAM.

When your GPU doesn't have enough RAM, with llama-cli or llama-server you can offload some of it to your system's RAM, by using the flag -ngl. Read the cli reference

What are Mixture of Experts (MoE)?

MoE refers to models with a high param count (e.g: 35B), but they only activate a small amount during usage (e.g: 3B).

This ends up improving performance, but the model still needs to fit in RAM.

For example, Qwen3.6-35B-A3B has 35B param count, but only 3B are active.

Performance

Performance is bottle-necked by memory bandwidth. You can keep an eye for new developments and breakthroughs. But otherwise, it's hard to squeeze more performance from a model.

Don't confuse GPU's memory bandwidth with GPU's memory (RAM).

System Memory bandwith Est. Tokens/Sec (8B, Q4) Notes
Nvidia RTX 5090 1792 GB/s ~310 – 330 t/s
Nvidia RTX 4090 1008 GB/s ~180 – 200 t/s
Apple M3 Ultra 800 GB/s ~145 – 155 t/s
Radeon RX 9070 XT 640 GB/s ~110 – 125 t/s
Strix Halo (AI Max+ 395+) 256 GB/s ~45 – 50 t/s
Strix Point (HX 370) 89 – 136 GB/s ~12 – 25 t/s Depends on the type of RAM used

llama-cli

Once you've made llama-cpp available in your system. You can use llama-cli, which is a straightforward to use tool.

In your shell:

llama-cli \ 
  -hf bartowski/Qwen_Qwen3-Coder-Next-GGUF:Q4_K_M \ 
  --temp 1.0 --top-p 0.95 --top-k 40 \ 
  -p "briefly explain journalctl in one paragraph"

llama-server

llama-server runs a server, and it can run models on demand. It's quite similar to Ollama.

You can manually start the server from your terminal, it's usage, is not that different from llama-cli, but we are going to see the integration with NixOS as a service.

{
  services.llama-cpp = {
    enable = true;
    package = pkgs.llama-cpp-vulkan;
    # Takes care of downloading if model not present
    modelsPreset = {
      "Qwen3-Coder-Next" = {
        hf-repo = "unsloth/Qwen3-Coder-Next-GGUF";
        hf-file = "Qwen3-Coder-Next-UD-Q4_K_XL.gguf";
        alias = "unsloth/Qwen3-Coder-Next";
        temp = "1.0";
        top-p = "0.95";
        top-k = "40";
      };
    };
  };
}

And do a switch to the new configuration

sudo nixos-rebuild switch

Troubleshooting

Failed to create //.cache for shader cache

This is a known issue (441531), until it gets fixed, you can add to your conf:

{
  systemd.services.llama-cpp = {
    environment = {
      XDG_CACHE_HOME = "/var/cache/llama-cpp";
      MESA_SHADER_CACHE_DIR = "/var/cache/llama-cpp";
    };
  };
}