Llama-cpp: Difference between revisions

Woile (talk | contribs)
Notes about AMD ROCm
Woile (talk | contribs)
new heading MoE and styling improvements
 
(3 intermediate revisions by the same user not shown)
Line 19: Line 19:
==== in NixOS ====
==== in NixOS ====


After enable Unfree software in NixOS add CUDA to your packages
After enable Unfree software in NixOS add CUDA to your packages<syntaxhighlight lang="nixos">
 
{
<pre>
  environment.systemPackages = [
environment.systemPackages = [
    (pkgs.llama-cpp.override { cudaSupport = true; })
  (pkgs.llama-cpp.override { cudaSupport = true; })
  ];
];
}
</pre>
</syntaxhighlight>And do a switch to the new configuration
 
And do a switch to the new configuration
  sudo nixos-rebuild switch
  sudo nixos-rebuild switch


==== in a shell ====
==== in a shell ====


If you want take the CUDA package for a spin, before adding it to your system, you can open it in a shell:
If you want take the CUDA package for a spin, before adding it to your system, you can open it in a shell:<syntaxhighlight lang="bash">
 
<pre>
export NIXPKGS_ALLOW_UNFREE=1
export NIXPKGS_ALLOW_UNFREE=1
nix shell --impure --expr '(import (builtins.getFlake "nixpkgs") {}).llama-cpp.override { cudaSupport = true; }'
nix shell --impure --expr '(import (builtins.getFlake "nixpkgs") {}).llama-cpp.override { cudaSupport = true; }'
</pre>
</syntaxhighlight>


=== BLAS Support ===
=== BLAS Support ===


BLAS support is automatically enabled if none of the GPU accelerators are enabled. You can still manually enable it in your nix configuration by doing:
BLAS support is automatically enabled if none of the GPU accelerators are enabled. You can still manually enable it in your nix configuration by doing:<syntaxhighlight lang="nixos">
 
{
<pre>
  environment.systemPackages = [
environment.systemPackages = [
    (pkgs.llama-cpp.override { blasSupport = true; })
  (pkgs.llama-cpp.override { blasSupport = true; })
  ];
];
}
</pre>
</syntaxhighlight>


=== AMD ROCm ===
=== AMD ROCm ===
Sometimes, ROCm might not be using the correct GPU architecture, or you simply want to try a different one, because it might work better. To tell ROCm which GPU architecture to use, you can use the HSA_OVERRIDE_GFX_VERSION environmental variable.
Sometimes, ROCm might not be using the correct GPU architecture, or you simply want to try a different one, because it might work better. To tell ROCm which GPU architecture to use, you can use the <code>HSA_OVERRIDE_GFX_VERSION</code> environmental variable.


E.g:<syntaxhighlight lang="bash">
E.g:<syntaxhighlight lang="bash">
Line 63: Line 59:
|RDNA 3 APU
|RDNA 3 APU
|11.0.0
|11.0.0
|e.g: 780M
|780M
|-
|-
|Strix Point
|Strix Point
|11.5.0
|11.5.0
|e.g: 880M
|880M
|-
|-
|Strix Halo
|Strix Halo
|11.5.1
|11.5.1
|e.g: Radeon 8060S
|Radeon 8060S
|-
|-
|RDNA 4 "Navi 48"
|RDNA 4 "Navi 48"
|12.0.1
|12.0.1
|e.g: Radeon RX 9070 XT
|Radeon RX 9070 XT
|}
|}
Models


== Models ==
When usage <code>llama-cli</code> or <code>llama-server</code>, you can tune the parameters of the model.  
When usage <code>llama-cli</code> or <code>llama-server</code>, you can tune the parameters of the model.  


Line 90: Line 86:


And more general purpose models, may include multiple options sets for different purposes, e.g: chat, coding, agent, etc.
And more general purpose models, may include multiple options sets for different purposes, e.g: chat, coding, agent, etc.
=== Does it run on your machine? ===
=== How much RAM do I need? ===


To calculate how much RAM you are going to need, the first signal is the number of parameters. You can make a loose estimate that 80B params will require 80GB RAM.
To calculate how much RAM you are going to need, the first signal is the number of parameters. You can make a loose estimate that 80B params will require 80GB RAM.
Line 96: Line 92:
A better signal, is the size of the model. An 80B params model may be quantized, and the final size might be 60GB. By adding a 15% for context buffer, you would need an aprox. of 69 GB RAM.
A better signal, is the size of the model. An 80B params model may be quantized, and the final size might be 60GB. By adding a 15% for context buffer, you would need an aprox. of 69 GB RAM.


When your GPU doesn't have enough RAM, with <code>llama-cli</code> or <code>llama-server</code> you can offload some of it to your system's RAM, by using the flag <code>-ngl</code>. Read the [https://github.com/ggml-org/llama.cpp/blob/master/docs/multi-gpu.md#command-line-arguments-reference cli reference]
When your '''GPU doesn't have enough RAM''', with <code>llama-cli</code> or <code>llama-server</code> you can '''offload''' some of it to your system's RAM, by using the flag <code>-ngl</code>. Read the [https://github.com/ggml-org/llama.cpp/blob/master/docs/multi-gpu.md#command-line-arguments-reference cli reference]
 
=== What are Mixture of Experts (MoE)? ===
MoE refers to models with a high param count (e.g: 35B), but they only activate a small amount during usage (e.g: 3B).
 
This ends up improving performance, but the model still needs to fit in RAM.
 
For example, <code>Qwen3.6-35B-A3B</code> has 35B param count, but only 3B are active.


=== Performance ===
=== Performance ===


Performance is bottle-necked by memory bandwidth. You can keep an eye for new developments and breakthroughs. But otherwise, it's hard to squeeze more performance.
Performance is '''bottle-necked by memory bandwidth'''. You can keep an eye for new developments and breakthroughs. But otherwise, it's hard to squeeze more performance from a model.
 
Don't confuse GPU's memory bandwidth with GPU's memory (RAM).
{| class="wikitable"
{| class="wikitable"
|+
|+
Line 143: Line 148:
Once you've made <code>llama-cpp</code> available in your system. You can use <code>llama-cli</code>, which is a straightforward to use tool.  
Once you've made <code>llama-cpp</code> available in your system. You can use <code>llama-cli</code>, which is a straightforward to use tool.  


In your shell:
In your shell:<syntaxhighlight lang="bash">
 
llama-cli \  
<pre>
  -hf bartowski/Qwen_Qwen3-Coder-Next-GGUF:Q4_K_M \  
llama-cli \
  --temp 1.0 --top-p 0.95 --top-k 40 \  
    -hf bartowski/Qwen_Qwen3-Coder-Next-GGUF:Q4_K_M \
  -p "briefly explain journalctl in one paragraph"
    --temp 1.0 --top-p 0.95 --top-k 40 \
</syntaxhighlight>
    -p "briefly explain journalctl in one paragraph"
</pre>


== llama-server ==
== llama-server ==
Line 156: Line 159:
<code>llama-server</code> runs a server, and it can run models on demand. It's quite similar to [[Ollama]].
<code>llama-server</code> runs a server, and it can run models on demand. It's quite similar to [[Ollama]].


You can manually start the server from your terminal, it's usage, is not that different from <code>llama-cli</code>, but we are going to see the integration with NixOS as a service.
You can manually start the server from your terminal, it's usage, is not that different from <code>llama-cli</code>, but we are going to see the integration with NixOS as a service.<syntaxhighlight lang="nixos">
services.llama-cpp = {
{
  enable = true;
  services.llama-cpp = {
  package = pkgs.llama-cpp-vulkan;
    enable = true;
    package = pkgs.llama-cpp-vulkan;
  # Takes care of downloading if model not present
    # Takes care of downloading if model not present
  modelsPreset = {
    modelsPreset = {
    "Qwen3-Coder-Next" = {
      "Qwen3-Coder-Next" = {
      hf-repo = "unsloth/Qwen3-Coder-Next-GGUF";
        hf-repo = "unsloth/Qwen3-Coder-Next-GGUF";
      hf-file = "Qwen3-Coder-Next-UD-Q4_K_XL.gguf";
        hf-file = "Qwen3-Coder-Next-UD-Q4_K_XL.gguf";
      alias = "unsloth/Qwen3-Coder-Next";
        alias = "unsloth/Qwen3-Coder-Next";
      temp = "1.0";
        temp = "1.0";
      top-p = "0.95";
        top-p = "0.95";
      top-k = "40";
        top-k = "40";
    };
      };
  };
    };
};
  };
And do a switch to the new configuration
}
 
</syntaxhighlight>And do a switch to the new configuration


<pre>
<pre>
sudo nixos-rebuild switch
sudo nixos-rebuild switch
</pre>
</pre>
=== Troubleshooting ===
==== Failed to create //.cache for shader cache ====
This is a known issue ([https://github.com/NixOS/nixpkgs/issues/441531 441531]), until it gets fixed, you can add to your conf:<syntaxhighlight lang="nixos">
{
  systemd.services.llama-cpp = {
    environment = {
      XDG_CACHE_HOME = "/var/cache/llama-cpp";
      MESA_SHADER_CACHE_DIR = "/var/cache/llama-cpp";
    };
  };
}
</syntaxhighlight>