Kubernetes

1 Master and 1 Node

Assumptions:

  • Master and Node are on the same network (in this example 10.1.1.0/24)
  • IP of the Master: 10.1.1.2
  • IP of the first Node: 10.1.1.3

Caveats:

  • this was only tested on 20.09pre215024.e97dfe73bba (Nightingale) (unstable)
  • this is probably not best practice
    • for a production-grade cluster you shouldn't use easyCerts

Master

Add to your configuration.nix:

{ config, pkgs, ... }:
let
  kubeMasterIP = "10.1.1.2";
  kubeMasterHostname = "api.kube";
  kubeMasterAPIServerPort = 443;
in
{
  # resolve master hostname
  networking.extraHosts = "${kubeMasterIP} ${kubeMasterHostname}";

  # packages for administration tasks
  environment.systemPackages = with pkgs; [
    kompose
    kubectl
    kubernetes
  ];

  services.kubernetes = {
    roles = ["master" "node"];
    masterAddress = kubeMasterHostname;
    easyCerts = true;
    apiserver = {
      securePort = kubeMasterAPIServerPort;
      advertiseAddress = kubeMasterIP;
    };

    # use coredns
    addons.dns.enable = true;

    # needed if you use swap
    kubelet.extraOpts = "--fail-swap-on=false";
  };
}

Apply your config (e.g. nixos-rebuild switch).

Link your kubeconfig to your home directory:

ln -s /etc/kubernetes/cluster-admin.kubeconfig ~/.kube/config
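
Note that ~/.kube has to exist first (mkdir -p ~/.kube). Alternatively, the KUBECONFIG environment variable can point kubectl at the file directly (a sketch, assuming a bash-like shell):

export KUBECONFIG=/etc/kubernetes/cluster-admin.kubeconfig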

Now, executing kubectl cluster-info should yield something like this:

Kubernetes master is running at https://10.1.1.2
CoreDNS is running at https://10.1.1.2/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy

To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.

Because the master also has the node role, it should show up in kubectl get nodes:

NAME       STATUS   ROLES    AGE   VERSION
direwolf   Ready    <none>   41m   v1.16.6-beta.0

Node

Add to your configuration.nix:

{ config, pkgs, ... }:
let
  kubeMasterIP = "10.1.1.2";
  kubeMasterHostname = "api.kube";
  kubeMasterAPIServerPort = 443;
in
{
  # resolve master hostname
  networking.extraHosts = "${kubeMasterIP} ${kubeMasterHostname}";

  # packages for administration tasks
  environment.systemPackages = with pkgs; [
    kompose
    kubectl
    kubernetes
  ];

  services.kubernetes = let
    api = "https://${kubeMasterHostname}:${kubeMasterAPIServerPort}";
  in
  {
    roles = ["node"];
    masterAddress = kubeMasterHostname;
    easyCerts = true;

    # point kubelet and other services to kube-apiserver
    kubelet.kubeconfig.server = api;
    apiserverAddress = api;

    # use coredns
    addons.dns.enable = true;

    # needed if you use swap
    kubelet.extraOpts = "--fail-swap-on=false";
  };
}

Apply your config (e.g. nixos-rebuild switch).

Following the approach used in the NixOS tests, make your node join the cluster:

# on the master, grab the apitoken
cat /var/lib/kubernetes/secrets/apitoken.secret

# on the node, join the node with
echo TOKEN | nixos-kubernetes-node-join
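
If the node can reach the master over SSH, both steps can be combined into one (a sketch; assumes root@10.1.1.2 may read the secret):

ssh root@10.1.1.2 cat /var/lib/kubernetes/secrets/apitoken.secret | nixos-kubernetes-node-join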

After that, you should see your new node using kubectl get nodes:

NAME       STATUS   ROLES    AGE    VERSION
direwolf   Ready    <none>   62m    v1.16.6-beta.0
drake      Ready    <none>   102m   v1.16.6-beta.0
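
As a quick smoke test that workloads get scheduled onto the cluster, you can start a throwaway deployment (the name and image here are arbitrary examples):

kubectl create deployment hello --image=nginx
kubectl get pods -o wide    # the NODE column shows where the pod landed
kubectl delete deployment hello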


N Masters (HA)

This section needs expansion: how to set this up?

Troubleshooting

systemctl status kubelet
systemctl status kube-apiserver
kubectl get nodes
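
For any of these services, the systemd journal is usually more informative than the status output alone:

journalctl -u kubelet -f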

Join Cluster not working

If you face issues while running the nixos-kubernetes-node-join script:

Restarting certmgr...
Job for certmgr.service failed because a timeout was exceeded.
See "systemctl status certmgr.service" and "journalctl -xe" for details.

Investigate with journalctl -u certmgr:

... certmgr: loading from config file /nix/store/gj7qr7lp6wakhiwcxdpxwbpamvmsifhk-certmgr.yaml
... manager: loading certificates from /nix/store/4n41ikm7322jxg7bh0afjpxsd4b2idpv-certmgr.d
... manager: loading spec from /nix/store/4n41ikm7322jxg7bh0afjpxsd4b2idpv-certmgr.d/flannelClient.json
... [ERROR] cert: failed to fetch remote CA: failed to parse rootCA certs

In this case, cfssl could be overloaded.

Restarting cfssl on the master node should help: systemctl restart cfssl
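
After the restart, confirm cfssl is healthy on the master and retry the join on the node (TOKEN as above):

# on the master
systemctl status cfssl

# on the node
echo TOKEN | nixos-kubernetes-node-join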

DNS issues

Check if coredns is running via kubectl get pods -n kube-system:

NAME                       READY   STATUS    RESTARTS   AGE
coredns-577478d784-bmt5s   1/1     Running   2          163m
coredns-577478d784-bqj65   1/1     Running   2          163m

Start an interactive test pod with kubectl run curl --restart=Never --image=radial/busyboxplus:curl -i --tty:

If you don't see a command prompt, try pressing enter.
[ root@curl:/ ]$ nslookup google.com
Server:    10.0.0.254
Address 1: 10.0.0.254 kube-dns.kube-system.svc.cluster.local

Name:      google.com
Address 1: 2a00:1450:4016:803::200e muc12s04-in-x0e.1e100.net
Address 2: 172.217.23.14 lhr35s01-in-f14.1e100.net
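
Cluster-internal names should resolve from the same pod as well (assuming the default cluster domain cluster.local):

[ root@curl:/ ]$ nslookup kubernetes.default.svc.cluster.local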

If DNS is still not working, restarting services sometimes helps:

systemctl restart kube-proxy flannel kubelet

Reset to a clean state

Sometimes it helps to reset all instances to a clean state:

  • comment out kubernetes-related code in configuration.nix
  • nixos-rebuild switch
  • clean up the filesystem (stop the relevant services first; see the sketch below)
    • rm -rf /var/lib/kubernetes/ /var/lib/etcd/ /var/lib/cfssl/ /var/lib/kubelet/
    • rm -rf /etc/kube-flannel/ /etc/kubernetes/
  • uncomment the kubernetes-related code again
  • nixos-rebuild switch
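
A minimal sketch of the cleanup, assuming the service names used elsewhere on this page (master-only services such as kube-apiserver, cfssl and etcd exist only on the master; ignore "not loaded" errors on nodes):

# stop kubernetes-related services so they don't recreate state while you delete it
systemctl stop kubelet kube-proxy flannel certmgr kube-apiserver cfssl etcd

# wipe persistent state
rm -rf /var/lib/kubernetes/ /var/lib/etcd/ /var/lib/cfssl/ /var/lib/kubelet/
rm -rf /etc/kube-flannel/ /etc/kubernetes/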

Miscellaneous

NVIDIA

You can use NVIDIA's k8s-device-plugin.

Make nvidia-docker your default docker runtime:

virtualisation.docker = {
  enable = true;

  # use nvidia as the default runtime
  enableNvidia = true;
  extraOptions = "--default-runtime=nvidia";
};
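
You can verify that the default runtime changed (the exact output format varies across Docker versions):

docker info | grep -i runtime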

Apply their DaemonSet:

kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta4/nvidia-device-plugin.yml
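
Once the plugin pods are running, nodes with a GPU should advertise the nvidia.com/gpu resource:

kubectl describe nodes | grep -i nvidia.com/gpu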

/dev/shm

Some applications need enough shared memory to work properly. Create a new volumeMount for your Deployment:

...
volumeMounts:
- mountPath: /dev/shm
  name: dshm
...

and mark its medium as Memory:

...
volumes:
- name: dshm
  emptyDir:
    medium: Memory
...
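
You can confirm the mount from inside a running pod; with medium: Memory it shows up as tmpfs (POD stands in for your pod's name, and the image is assumed to ship df):

kubectl exec -it POD -- df -h /dev/shm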

Tooling

There are various community projects that aim to make it easier to work with Kubernetes and Nix together:
