Ollama: Difference between revisions

From NixOS Wiki
Klinger (talk | contribs)
m Add standalone amd override hint
(8 intermediate revisions by 6 users not shown)
Revision as of 22:03, 19 March 2025

Ollama is an open-source framework for running large language models locally. It aims to simplify running and managing these models across different operating systems.

Setup

You can add Ollama to your system configuration in two ways.

As a standalone package:

environment.systemPackages = [ pkgs.ollama ];

As a systemd service:

services.ollama = {
  enable = true;
  # Optional: load models on startup
  loadModels = [ ... ];
};

Configuration of GPU acceleration

The following values can be used for acceleration:

  • false: disable GPU, only use CPU
  • "rocm": supported by most modern AMD GPUs
  • "cuda": supported by most modern NVIDIA GPUs


Example: Enable GPU acceleration for NVIDIA graphics cards

As a standalone package:

environment.systemPackages = [
  (pkgs.ollama.override {
    acceleration = "cuda";
  })
];

As a systemd service:

services.ollama = {
  enable = true;
  acceleration = "cuda";
};

To find out whether a model is running on the CPU or the GPU, you can either look at the logs of

$ ollama serve

and search for "looking for compatible GPUs" and "new model will fit in available VRAM in single GPU, loading",

or, while a model is answering, run the following in another terminal:

$ ollama ps
NAME         ID              SIZE      PROCESSOR    UNTIL
gemma3:4b    c0494fe00251    4.7 GB    100% GPU     4 minutes from now

In this example the PROCESSOR column shows "100% GPU", meaning the model runs entirely on the GPU.
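
For scripting, the PROCESSOR column can be pulled out of that output. This is only a sketch: the function name processor_column is hypothetical, and the field positions assume the column layout shown above (a two-word SIZE followed by a two-word PROCESSOR value).

```shell
# Print the model name and the PROCESSOR value for each loaded model.
# Reads `ollama ps`-style output on stdin and skips the header line.
# Assumption: fields are NAME, ID, SIZE (2 words), PROCESSOR (2 words), ...
processor_column() {
  awk 'NR > 1 && NF >= 6 { print $1, $5, $6 }'
}
```

Usage: ollama ps | processor_column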

Usage via CLI

Download a model and run interactive prompt

Example: Download and run the Mistral LLM as an interactive prompt

$ ollama run mistral

For other models see the Ollama library (https://ollama.ai/library).

Send a prompt to Ollama

Example: To download and run codellama with 13 billion parameters in the "instruct" variant and send a prompt:

$ ollama run codellama:13b-instruct "Write an extended Python program with a typical structure. It should print the numbers 1 to 10 to standard output."

See usage and speed statistics

Add "--verbose" to see statistics after each prompt:

$ ollama run codellama:13b-instruct --verbose "Write an extended Python program..."
...
total duration:       50.302071991s
load duration:        50.912267ms
prompt eval count:    49 token(s)
prompt eval duration: 4.654s
prompt eval rate:     10.53 tokens/s <- how fast it processed your input prompt
eval count:           182 token(s)
eval duration:        45.595s
eval rate:            3.99 tokens/s  <- how fast it printed a response
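
Each rate is the token count divided by the corresponding duration. A minimal sketch that reproduces the two rates from the numbers above; the helper name rate is hypothetical.

```shell
# Compute tokens per second from a token count and a duration in seconds.
rate() {
  awk -v count="$1" -v seconds="$2" 'BEGIN { printf "%.2f\n", count / seconds }'
}

rate 49 4.654     # prompt eval rate
rate 182 45.595   # eval rate
```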

Usage via web API

Other software can use the web API (default: http://localhost:11434) to query Ollama. This works well, for example, in IntelliJ IDEs with the "ProxyAI" and "Ollama Commit Summarizer" plugins.
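
The API can also be queried by hand with curl. The following is a sketch, not a complete client: the function names are hypothetical, the model "mistral" is assumed to be downloaded already, and the prompt is only a placeholder. It uses the /api/tags and /api/generate endpoints of the Ollama API.

```shell
# Default address of the Ollama web API.
OLLAMA_URL="http://localhost:11434"

# List the models that are available locally.
list_models() {
  curl --silent --fail "$OLLAMA_URL/api/tags"
}

# Send one prompt to the "mistral" model.
# "stream": false returns the whole answer as a single JSON object.
generate_once() {
  curl --silent --fail "$OLLAMA_URL/api/generate" \
    --data '{"model": "mistral", "prompt": "Why is the sky blue?", "stream": false}'
}
```

Calling list_models or generate_once requires a running Ollama server, either the systemd service or "ollama serve".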

Alternatively, after enabling "open-webui", a web portal is available at http://localhost:8080/:

services.open-webui.enable = true;

Troubleshooting

AMD GPU with open source driver

In certain cases Ollama might not allow your system to use GPU acceleration if it cannot be sure your GPU/driver is compatible.

However, you can attempt to force-enable the usage of your GPU by overriding the LLVM target (see https://github.com/ollama/ollama/blob/main/docs/gpu.md#overrides).

You can get the LLVM target for your GPU from the logs or by running rocminfo:

$ nix-shell -p "rocmPackages.rocminfo" --run "rocminfo" | grep "gfx"
Name:                    gfx1031

In this example the LLVM target is "gfx1031", that is, version "10.3.1". You can then override that value for the Ollama systemd service:

services.ollama = {
  enable = true;
  acceleration = "rocm";
  environmentVariables = {
    HCC_AMDGPU_TARGET = "gfx1031"; # used to be necessary, but doesn't seem to be anymore
  };
  # results in environment variable "HSA_OVERRIDE_GFX_VERSION=10.3.1"
  rocmOverrideGfx = "10.3.1";
};

or by setting an environment variable in front of the standalone command:

$ HSA_OVERRIDE_GFX_VERSION=10.3.1 ollama serve

If there are still errors, you can try a similar value from the list at https://github.com/ollama/ollama/blob/main/docs/gpu.md#overrides.
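
The conversion from LLVM target to version string used above can be sketched in shell. The helper name gfx_to_version is hypothetical, and the rule (last character is the stepping, the one before it is the minor version, the rest is the major version) is an assumption inferred from the "gfx1031" to "10.3.1" example. It is meant only for purely numeric targets.

```shell
# Convert an LLVM target such as "gfx1031" into a version string "10.3.1".
# Assumption: <major><minor><stepping>, with one character each for minor
# and stepping; targets with letters are not handled.
gfx_to_version() {
  target=${1#gfx}                      # strip the "gfx" prefix
  stepping=${target#"${target%?}"}     # last character
  rest=${target%?}                     # everything but the last character
  minor=${rest#"${rest%?}"}            # last character of the remainder
  major=${rest%?}                      # leading digits
  printf '%s.%s.%s\n' "$major" "$minor" "$stepping"
}

gfx_to_version gfx1031
```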