Llama-cpp: Difference between revisions
titles improved |
new heading MoE and styling improvements |
||
| (4 intermediate revisions by the same user not shown) | |||
| Line 19: | Line 19: | ||
==== in NixOS ==== | ==== in NixOS ==== | ||
After enable Unfree software in NixOS add CUDA to your packages | After enable Unfree software in NixOS add CUDA to your packages<syntaxhighlight lang="nixos"> | ||
{ | |||
< | environment.systemPackages = [ | ||
environment.systemPackages = [ | (pkgs.llama-cpp.override { cudaSupport = true; }) | ||
]; | |||
]; | } | ||
</ | </syntaxhighlight>And do a switch to the new configuration | ||
sudo nixos-rebuild switch | |||
And do a switch to the new configuration | |||
sudo nixos-rebuild switch | |||
==== in a shell ==== | ==== in a shell ==== | ||
If you want take the CUDA package for a spin, before adding it to your system, you can open it in a shell: | If you want take the CUDA package for a spin, before adding it to your system, you can open it in a shell:<syntaxhighlight lang="bash"> | ||
< | |||
export NIXPKGS_ALLOW_UNFREE=1 | export NIXPKGS_ALLOW_UNFREE=1 | ||
nix shell --impure --expr '(import (builtins.getFlake "nixpkgs") {}).llama-cpp.override { cudaSupport = true; }' | nix shell --impure --expr '(import (builtins.getFlake "nixpkgs") {}).llama-cpp.override { cudaSupport = true; }' | ||
</ | </syntaxhighlight> | ||
=== BLAS Support === | === BLAS Support === | ||
BLAS support is automatically enabled if none of the GPU accelerators are enabled. You can still manually enable it in your nix configuration by doing: | BLAS support is automatically enabled if none of the GPU accelerators are enabled. You can still manually enable it in your nix configuration by doing:<syntaxhighlight lang="nixos"> | ||
{ | |||
environment.systemPackages = [ | |||
(pkgs.llama-cpp.override { blasSupport = true; }) | |||
]; | |||
} | |||
</syntaxhighlight> | |||
=== AMD ROCm === | |||
Sometimes, ROCm might not be using the correct GPU architecture, or you simply want to try a different one, because it might work better. To tell ROCm which GPU architecture to use, you can use the <code>HSA_OVERRIDE_GFX_VERSION</code> environmental variable. | |||
< | E.g:<syntaxhighlight lang="bash"> | ||
export HSA_OVERRIDE_GFX_VERSION='11.5.1' | |||
</syntaxhighlight> | |||
{| class="wikitable" | |||
|+ | |||
!Arch | |||
!Version | |||
!Example card | |||
|- | |||
|RDNA 3 APU | |||
|11.0.0 | |||
|780M | |||
|- | |||
|Strix Point | |||
|11.5.0 | |||
|880M | |||
|- | |||
|Strix Halo | |||
|11.5.1 | |||
|Radeon 8060S | |||
|- | |||
|RDNA 4 "Navi 48" | |||
|12.0.1 | |||
|Radeon RX 9070 XT | |||
|} | |||
== Models == | == Models == | ||
When usage <code>llama-cli</code> or <code>llama-server</code>, you can tune the parameters of the model. | When usage <code>llama-cli</code> or <code>llama-server</code>, you can tune the parameters of the model. | ||
| Line 65: | Line 86: | ||
And more general purpose models, may include multiple options sets for different purposes, e.g: chat, coding, agent, etc. | And more general purpose models, may include multiple options sets for different purposes, e.g: chat, coding, agent, etc. | ||
=== | === How much RAM do I need? === | ||
To calculate how much RAM you are going to need, the first signal is the number of parameters. You can make a loose estimate that 80B params will require 80GB RAM. | To calculate how much RAM you are going to need, the first signal is the number of parameters. You can make a loose estimate that 80B params will require 80GB RAM. | ||
| Line 71: | Line 92: | ||
A better signal, is the size of the model. An 80B params model may be quantized, and the final size might be 60GB. By adding a 15% for context buffer, you would need an aprox. of 69 GB RAM. | A better signal, is the size of the model. An 80B params model may be quantized, and the final size might be 60GB. By adding a 15% for context buffer, you would need an aprox. of 69 GB RAM. | ||
When your GPU doesn't have enough RAM, with <code>llama-cli</code> or <code>llama-server</code> you can offload some of it to your system's RAM, by using the flag <code>-ngl</code>. Read the [https://github.com/ggml-org/llama.cpp/blob/master/docs/multi-gpu.md#command-line-arguments-reference cli reference] | When your '''GPU doesn't have enough RAM''', with <code>llama-cli</code> or <code>llama-server</code> you can '''offload''' some of it to your system's RAM, by using the flag <code>-ngl</code>. Read the [https://github.com/ggml-org/llama.cpp/blob/master/docs/multi-gpu.md#command-line-arguments-reference cli reference] | ||
=== What are Mixture of Experts (MoE)? === | |||
MoE refers to models with a high param count (e.g: 35B), but they only activate a small amount during usage (e.g: 3B). | |||
This ends up improving performance, but the model still needs to fit in RAM. | |||
For example, <code>Qwen3.6-35B-A3B</code> has 35B param count, but only 3B are active. | |||
=== Performance === | === Performance === | ||
Performance is bottle-necked by memory bandwidth. You can keep an eye for new developments and breakthroughs. But otherwise, it's hard to squeeze more performance. | Performance is '''bottle-necked by memory bandwidth'''. You can keep an eye for new developments and breakthroughs. But otherwise, it's hard to squeeze more performance from a model. | ||
Don't confuse GPU's memory bandwidth with GPU's memory (RAM). | |||
{| class="wikitable" | {| class="wikitable" | ||
|+ | |+ | ||
| Line 118: | Line 148: | ||
Once you've made <code>llama-cpp</code> available in your system. You can use <code>llama-cli</code>, which is a straightforward to use tool. | Once you've made <code>llama-cpp</code> available in your system. You can use <code>llama-cli</code>, which is a straightforward to use tool. | ||
In your shell: | In your shell:<syntaxhighlight lang="bash"> | ||
llama-cli \ | |||
< | -hf bartowski/Qwen_Qwen3-Coder-Next-GGUF:Q4_K_M \ | ||
llama-cli \ | --temp 1.0 --top-p 0.95 --top-k 40 \ | ||
-p "briefly explain journalctl in one paragraph" | |||
</syntaxhighlight> | |||
</ | |||
== llama-server == | == llama-server == | ||
| Line 131: | Line 159: | ||
<code>llama-server</code> runs a server, and it can run models on demand. It's quite similar to [[Ollama]]. | <code>llama-server</code> runs a server, and it can run models on demand. It's quite similar to [[Ollama]]. | ||
You can manually start the server from your terminal, it's usage, is not that different from <code>llama-cli</code>, but we are going to see the integration with NixOS as a service. | You can manually start the server from your terminal, it's usage, is not that different from <code>llama-cli</code>, but we are going to see the integration with NixOS as a service.<syntaxhighlight lang="nixos"> | ||
{ | |||
services.llama-cpp = { | |||
enable = true; | |||
package = pkgs.llama-cpp-vulkan; | |||
# Takes care of downloading if model not present | |||
modelsPreset = { | |||
"Qwen3-Coder-Next" = { | |||
hf-repo = "unsloth/Qwen3-Coder-Next-GGUF"; | |||
hf-file = "Qwen3-Coder-Next-UD-Q4_K_XL.gguf"; | |||
alias = "unsloth/Qwen3-Coder-Next"; | |||
temp = "1.0"; | |||
top-p = "0.95"; | |||
top-k = "40"; | |||
}; | |||
}; | |||
}; | |||
And do a switch to the new configuration | } | ||
</syntaxhighlight>And do a switch to the new configuration | |||
<pre> | <pre> | ||
sudo nixos-rebuild switch | sudo nixos-rebuild switch | ||
</pre> | </pre> | ||
=== Troubleshooting === | |||
==== Failed to create //.cache for shader cache ==== | |||
This is a known issue ([https://github.com/NixOS/nixpkgs/issues/441531 441531]), until it gets fixed, you can add to your conf:<syntaxhighlight lang="nixos"> | |||
{ | |||
systemd.services.llama-cpp = { | |||
environment = { | |||
XDG_CACHE_HOME = "/var/cache/llama-cpp"; | |||
MESA_SHADER_CACHE_DIR = "/var/cache/llama-cpp"; | |||
}; | |||
}; | |||
} | |||
</syntaxhighlight> | |||