Llama-cpp: Difference between revisions

(9 intermediate revisions by the same user not shown)

@@ Line 19: / Line 19: @@
 ==== in NixOS ====
-After enable Unfree software in NixOS add CUDA to your packages
+After enabling unfree software in NixOS, add the CUDA-enabled package to your system packages:<syntaxhighlight lang="nixos">{
+  environment.systemPackages = [
-<pre>
+    (pkgs.llama-cpp.override { cudaSupport = true; })
-environment.systemPackages = [
+  ];
-  (pkgs.llama-cpp.override { cudaSupport = true; })
+}</syntaxhighlight>Then switch to the new configuration:
-];
-</pre>
-And do a switch to the new configuration
   sudo nixos-rebuild switch
 ==== in a shell ====
-If you want take the CUDA package for a spin, before adding it to your system, you can open it in a shell:
+If you want to take the CUDA package for a spin before adding it to your system, you can open it in a shell:<syntaxhighlight lang="bash">
-<pre>
 export NIXPKGS_ALLOW_UNFREE=1
 nix shell --impure --expr '(import (builtins.getFlake "nixpkgs") {}).llama-cpp.override { cudaSupport = true; }'
-</pre>
+</syntaxhighlight>
 === BLAS Support ===
-BLAS support is automatically enabled if none of the GPU accelerators are enabled. You can still manually enable it in your nix configuration by doing:
+BLAS support is automatically enabled if none of the GPU accelerators are enabled. You can still manually enable it in your nix configuration by doing:<syntaxhighlight lang="nixos">
+{
-<pre>
+  environment.systemPackages = [
-environment.systemPackages = [
+    (pkgs.llama-cpp.override { blasSupport = true; })
-  (pkgs.llama-cpp.override { blasSupport = true; })
+  ];
-];
+}
-</pre>
+</syntaxhighlight>
 === AMD ROCm ===
-Sometimes, ROCm might not be using the correct GPU architecture, or you simply want to try a different one, because it might work better. To tell ROCm which GPU architecture to use, you can use the HSA_OVERRIDE_GFX_VERSION environmental variable.
+Sometimes ROCm does not pick the correct GPU architecture, or you simply want to try a different one because it might work better. To tell ROCm which GPU architecture to use, set the <code>HSA_OVERRIDE_GFX_VERSION</code> environment variable.
 E.g:<syntaxhighlight lang="bash">
@@ Line 63: / Line 57: @@
 |RDNA 3 APU
 |11.0.0
-|e.g: 780M
+|780M
 |-
 |Strix Point
 |11.5.0
-|e.g: 880M
+|880M
 |-
 |Strix Halo
 |11.5.1
-|e.g: Radeon 8060S
+|Radeon 8060S
 |-
 |RDNA 4 "Navi 48"
 |12.0.1
-|e.g: Radeon RX 9070 XT
+|Radeon RX 9070 XT
 |}
-Models
+== Models ==
 When using <code>llama-cli</code> or <code>llama-server</code>, you can tune the parameters of the model.
@@ Line 90: / Line 84: @@
 More general-purpose models may include multiple option sets for different purposes, e.g. chat, coding, or agent use.
-=== Does it run on your machine? ===
+=== How much RAM do I need? ===
 To calculate how much RAM you are going to need, the first signal is the number of parameters. You can make a loose estimate that 80B params will require 80GB RAM.
@@ Line 96: / Line 90: @@
 A better signal is the size of the model file. An 80B-parameter model may be quantized, so the final size might be 60 GB. Adding 15% for the context buffer, you would need approximately 69 GB of RAM.
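The size-based estimate above can be sketched as a small calculation. This is only an illustration: the variable names are made up for this example, and the 60 GB size and 15% margin are the example figures from the paragraph, not measured values.

```shell
# Rough RAM estimate: model file size plus a margin for the context buffer.
# Both figures below are the example values from the text, not measurements.
model_size_gb=60
context_margin_percent=15

# Integer arithmetic: 60 * (100 + 15) / 100
required_gb=$(( model_size_gb * (100 + context_margin_percent) / 100 ))
echo "Estimated RAM needed: ${required_gb} GB"
```

Replace <code>model_size_gb</code> with the size of the GGUF file you plan to download to get a figure for your own model.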
-When your GPU doesn't have enough RAM, with <code>llama-cli</code> or <code>llama-server</code> you can offload some of it to your system's RAM, by using the flag <code>-ngl</code>. Read the [https://github.com/ggml-org/llama.cpp/blob/master/docs/multi-gpu.md#command-line-arguments-reference cli reference]
+When your '''GPU doesn't have enough RAM''', <code>llama-cli</code> and <code>llama-server</code> let you keep part of the model in your system's RAM by '''offloading''' only some of the layers to the GPU with the <code>-ngl</code> flag. See the [https://github.com/ggml-org/llama.cpp/blob/master/docs/multi-gpu.md#command-line-arguments-reference CLI reference].
+=== What are Mixture of Experts (MoE)? ===
+MoE refers to models with a high parameter count (e.g. 35B) that activate only a small subset of those parameters for each token (e.g. 3B).
+This improves inference speed, but the whole model still needs to fit in RAM.
+For example, <code>Qwen3.6-35B-A3B</code> has 35B parameters, but only 3B are active.
 === Performance ===
-Performance is bottle-necked by memory bandwidth. You can keep an eye for new developments and breakthroughs. But otherwise, it's hard to squeeze more performance.
+Performance is '''bottlenecked by memory bandwidth'''. You can keep an eye out for new developments and breakthroughs, but otherwise it's hard to squeeze more performance from a model.
+Don't confuse the GPU's memory bandwidth with the amount of memory (RAM) it has.
 {| class="wikitable"
 |+
@@ Line 143: / Line 146: @@
 Once you've made <code>llama-cpp</code> available in your system, you can use <code>llama-cli</code>, which is a straightforward tool.
-In your shell:
+In your terminal try one of these (if they don't work, check you are running the latest llama-cpp version):<syntaxhighlight lang="bash"># LFM2.5-8B-A1B - Requires 8GB VRAM
+llama-cli \
+  -hf unsloth/LFM2.5-8B-A1B-GGUF:UD-Q4_K_XL \
+  --temp 0.2 --top-p 0.95 --top-k 80 \
+  --repeat-penalty 1.05 \
+  -p "briefly explain journalctl in one paragraph"
-<pre>
+# Qwen3-Coder-Next - Requires 56GB VRAM
 llama-cli \
-    -hf bartowski/Qwen_Qwen3-Coder-Next-GGUF:Q4_K_M \
+  -hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_M \
-    --temp 1.0 --top-p 0.95 --top-k 40 \
+  --temp 1.0 --top-p 0.95 --top-k 40 \
-    -p "briefly explain journalctl in one paragraph"
+  -p "briefly explain journalctl in one paragraph"</syntaxhighlight>
-</pre>
 == llama-server ==
-<code>llama-server</code> runs a server, and it can run models on demand. It's quite similar to [[Ollama]].
+<code>llama-server</code> runs a server that can load models on demand. It exposes an OpenAI-compatible API, which makes it quite similar to [[Ollama]].
+You can manually start the server from your terminal; its usage is not that different from <code>llama-cli</code>.
+Try any of these models <syntaxhighlight lang="bash"># LFM2.5-8B-A1B - Requires 8GB VRAM
+llama-server \
+  -hf unsloth/LFM2.5-8B-A1B-GGUF:UD-Q4_K_XL \
+  --temp 0.2 --top-p 0.95 --top-k 80 \
+  --repeat-penalty 1.05
+# Qwen3-Coder-Next - Requires 56GB VRAM
+llama-server \
+  -hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_M \
+  --temp 1.0 --top-p 0.95 --top-k 40</syntaxhighlight>
+Alternatively, you can '''enable the NixOS service''' for llama-cpp, which runs the server.{{Warning|Note that the service is called llama-cpp, not llama-server}}<syntaxhighlight lang="nixos">{
+  services.llama-cpp = {
+    enable = true;
+    package = pkgs.llama-cpp-vulkan;
+    # package = (pkgs.llama-cpp.override { cudaSupport = true; });
+    # package = pkgs.llama-cpp-rocm;
-You can manually start the server from your terminal, it's usage, is not that different from <code>llama-cli</code>, but we are going to see the integration with NixOS as a service.
+    # Takes care of downloading if model not present
- services.llama-cpp = {
+    modelsPreset = {
-   enable = true;
+      # Requires 8GB VRAM
-   package = pkgs.llama-cpp-vulkan;
+      "LFM2.5-8B-A1B" = {
+        hf-repo = "unsloth/LFM2.5-8B-A1B-GGUF";
-   # Takes care of downloading if model not present
+        hf-file = "LFM2.5-8B-A1B-UD-Q4_K_XL.gguf";
-   modelsPreset = {
+        alias = "unsloth/LFM2.5-8B-A1B-GGUF";
-     "Qwen3-Coder-Next" = {
+        temp = "0.2";
-       hf-repo = "unsloth/Qwen3-Coder-Next-GGUF";
+        repeat-penalty = "1.05";
-       hf-file = "Qwen3-Coder-Next-UD-Q4_K_XL.gguf";
+        top-k = "80";
-       alias = "unsloth/Qwen3-Coder-Next";
+      };
-       temp = "1.0";
+      # Requires 56GB VRAM
-       top-p = "0.95";
+      "Qwen3-Coder-Next" = {
-       top-k = "40";
+        hf-repo = "unsloth/Qwen3-Coder-Next-GGUF";
-     };
+        hf-file = "Qwen3-Coder-Next-UD-Q4_K_XL.gguf";
-   };
+        alias = "unsloth/Qwen3-Coder-Next";
- };
+        temp = "1.0";
-And do a switch to the new configuration
+        top-p = "0.95";
+        top-k = "40";
+      };
+    };
+  };
+}</syntaxhighlight>Then switch to the new configuration:
 <pre>
 sudo nixos-rebuild switch
 </pre>
+=== Web UI ===
+The llama-cpp service includes a web interface where you can chat. To access it, navigate to http://localhost:8080, or to the port configured in <code>services.llama-cpp.port</code>.
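Besides the web interface, the same address serves the OpenAI-compatible API mentioned earlier. The following is a minimal sketch of a request from the terminal, assuming the server is running on the default port. <code>LLAMA_PORT</code>, <code>BASE_URL</code> and <code>ask_llama</code> are hypothetical names introduced for this example, and with several model presets configured you may also need to add a <code>"model"</code> field to the request body (an assumption, check the server documentation).

```shell
# Hypothetical helper names; the default port mirrors the URL given above.
LLAMA_PORT="${LLAMA_PORT:-8080}"
BASE_URL="http://localhost:${LLAMA_PORT}"

# Send a single user message to the OpenAI-compatible chat endpoint.
# The prompt is interpolated into JSON as-is, so it must not contain
# double quotes or backslashes.
ask_llama() {
  local prompt="$1"
  curl --silent --fail "${BASE_URL}/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d "{\"messages\": [{\"role\": \"user\", \"content\": \"${prompt}\"}]}"
}
```

With the server running, <code>ask_llama "briefly explain journalctl in one paragraph"</code> should print the JSON response.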
+=== Troubleshooting ===
+==== Failed to create //.cache for shader cache ====
+This is a known issue ([https://github.com/NixOS/nixpkgs/issues/441531 441531]). Until it gets fixed, you can add the following to your configuration:<syntaxhighlight lang="nixos">
+{
+  systemd.services.llama-cpp = {
+    environment = {
+      XDG_CACHE_HOME = "/var/cache/llama-cpp";
+      MESA_SHADER_CACHE_DIR = "/var/cache/llama-cpp";
+    };
+  };
+}
+</syntaxhighlight>
+==== Migration to nixos-unstable (RFC42) ====
+Since [https://github.com/NixOS/rfcs/blob/master/rfcs/0042-config-option.md RFC42] was approved, services are being migrated to use <code>.settings</code>, including <code>llama-cpp</code>. This is already the case for <code>nixos-unstable</code>. If you are using unstable, this is how you can migrate your service:<syntaxhighlight lang="diff">{
+  services.llama-cpp = {
+    enable = true;
+    package = pkgs.llama-cpp-vulkan;
+-    port = 8083;
+-    modelsPreset = {
++    settings.port = 8083;
++    settings.models-preset = (pkgs.formats.ini { }).generate "models-preset.ini" {
+      "Qwen3-Coder-Next" = {
+        hf-repo = "unsloth/Qwen3-Coder-Next-GGUF";
+        hf-file = "Qwen3-Coder-Next-UD-Q4_K_XL.gguf";
+        alias = "unsloth/Qwen3-Coder-Next";
+        temp = "1.0";
+        top-p = "0.95";
+        top-k = "40";
+      };
+    };
+  };
+}</syntaxhighlight>{{Note|TODO: When current unstable becomes stable, remove this troubleshooting entry and update the <code>llama-server</code> section}}