Deploying Rancher Server (standalone) behind Traefik with Let’s Encrypt for both

In this article I show how to solve a tricky problem. Rancher Server can either use ACME (Let’s Encrypt) to obtain its own certificates or generate self-signed ones. To use ACME certificates, Rancher usually needs exclusive access to ports 80 and 443. So if the IP address is also to be used for other services (behind a proxy like Traefik), Rancher is commonly set up with self-signed certificates. The drawback of that setup is that the “kubeconfig” files generated by Rancher also use these self-signed certificates, so users have to edit the files manually.

The solution is to have Traefik pass Rancher’s HTTPS traffic through untouched via TCP/SNI proxying, and to forward the HTTP traffic for Rancher’s ACME challenge to Rancher as well.

The Set-Up

I will create a docker-compose based deployment on one server with a single IP address. Rancher will be reachable at https://rancher.example.com, while other services (the Traefik dashboard, as an example) will be reachable at https://traefik.example.com. Both will have valid Let’s Encrypt certificates: the one for Rancher is obtained by Rancher itself, the other by Traefik.

Setting up Traefik

version: "3.5"

services:
  traefik:
    image: library/traefik:v2.4.5
    networks:
      - services
    restart: unless-stopped
    command: |
      --log.level=INFO
      --api.dashboard=true
      --providers.docker=true
      --providers.docker.exposedbydefault=false
      --entrypoints.http.address=:80
      --entrypoints.https.address=:443
      --certificatesresolvers.default.acme.httpchallenge=true
      --certificatesresolvers.default.acme.email=letsencrypt@example.com
      --certificatesresolvers.default.acme.storage=/data/acme.json
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - "/var/run/docker.sock:/var/run/docker.sock:ro"
      - "./data:/data"
    labels:
      traefik.enable: "true"
      traefik.http.routers.dashboard.rule: Host(`traefik.example.com`)
      traefik.http.routers.dashboard.service: api@internal
      traefik.http.routers.dashboard.tls.certresolver: default
      traefik.http.routers.dashboard.middlewares: auth
      traefik.http.middlewares.auth.basicauth.users: |
         user:$$apr1$$MfPlizqB$$6ltBrKl3/fs9GIewgzIiu0

networks:
  services:
    name: services

This will deploy Traefik and expose its dashboard on https://traefik.example.com. The dashboard is protected by the username “user” with the password “password”. Password entries can be created easily and securely using http://aspirine.org/htpasswd_en.html. Ensure that each $ is escaped as $$, because docker-compose would otherwise treat it as the start of a variable.
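
As a small illustration of the $ escaping, a label with two users might look roughly like this. The SALT/HASH parts are placeholders rather than real htpasswd output, and the second user name “admin” is made up for the example. Several entries can be separated by commas.

```yaml
# Sketch only: SALT1/HASH1 etc. are placeholders, not valid htpasswd hashes.
# htpasswd prints:  user:$apr1$SALT1$HASH1
# compose needs:    user:$$apr1$$SALT1$$HASH1   (every $ doubled)
labels:
  traefik.http.middlewares.auth.basicauth.users: "user:$$apr1$$SALT1$$HASH1,admin:$$apr1$$SALT2$$HASH2"
```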

Along with Traefik, a network “services” is created. It will be used for all services that need to be exposed by Traefik, because Traefik cannot reach services on networks it is not attached to. See here for a more detailed explanation.
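
To make the role of the shared network more concrete, here is a minimal sketch of a separate compose file whose service joins “services” and is published by Traefik. The service name “whoami”, the traefik/whoami demo image and the host name whoami.example.com are assumptions for illustration only; they are not part of the setup described in this article.

```yaml
version: "3.0"
services:
  whoami:
    # traefik/whoami is a tiny demo web server listening on port 80
    image: traefik/whoami
    networks:
      - services
    restart: unless-stopped
    labels:
      traefik.enable: "true"
      traefik.http.routers.whoami.rule: Host(`whoami.example.com`)
      traefik.http.routers.whoami.entrypoints: https
      # Certificate is requested by Traefik through the "default" resolver
      traefik.http.routers.whoami.tls.certresolver: default
      traefik.http.services.whoami.loadbalancer.server.port: "80"

networks:
  services:
    # Re-use the network created by the Traefik compose file
    external:
      name: services
```

Because the network is declared as external, this file has to be started after the Traefik deployment has created it.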

The problem is that Traefik consumes all traffic to /.well-known/acme-challenge/ and there’s no easy way to override this for a particular domain. The solution is to move Traefik’s built-in challenge handler out of the way onto a separate “acme” entrypoint and to create a default rule for /.well-known/acme-challenge/ that can later be overridden by a higher-priority rule.

version: "3.5"
services:
  traefik:
    image: library/traefik:v2.4.5
    networks:
      - services
    restart: unless-stopped
    command: |
      --log.level=INFO
      --api.dashboard=true
      --providers.docker=true
      --providers.docker.exposedbydefault=false
      --entrypoints.http.address=:80
      --entrypoints.acme.address=:81
      --entrypoints.https.address=:443
      --certificatesresolvers.default.acme.httpchallenge=true
      --certificatesresolvers.default.acme.httpchallenge.entrypoint=acme
      --certificatesresolvers.default.acme.email=letsencrypt@example.com
      --certificatesresolvers.default.acme.storage=/data/acme.json
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - "/var/run/docker.sock:/var/run/docker.sock:ro"
      - "./data:/data"
    labels:
      traefik.enable: "true"
      traefik.http.routers.dashboard.rule: Host(`traefik.example.com`)
      traefik.http.routers.dashboard.service: api@internal
      traefik.http.routers.dashboard.tls.certresolver: default
      traefik.http.routers.dashboard.middlewares: auth
      traefik.http.middlewares.auth.basicauth.users: |
        user:$$apr1$$MfPlizqB$$6ltBrKl3/fs9GIewgzIiu0

      traefik.http.routers.acme.entrypoints: http
      traefik.http.routers.acme.rule: PathPrefix(`/.well-known/acme-challenge/`)
      traefik.http.routers.acme.service: acme-http@internal
      traefik.http.routers.acme.priority: "1000"  # high, but can still be overridden

      traefik.http.routers.http2https.entrypoints: http
      traefik.http.routers.http2https.rule: PathPrefix(`/`)
      traefik.http.routers.http2https.priority: "900"  # below the acme router
      traefik.http.routers.http2https.middlewares: http2https
      traefik.http.middlewares.http2https.redirectscheme.scheme: https

networks:
  services:
    name: services

In line 14, a new entrypoint “acme” on port 81 is created. Traefik’s default ACME challenge handler is bound to this entrypoint (lines 16-17). This moves the “consume everything on /.well-known/acme-challenge/” rule out of the way, but it also breaks ACME challenges, because they are now answered only on port 81, which is not published.

In lines 35-38, a new router is created that binds /.well-known/acme-challenge/ on the “http” entrypoint to the internal service “acme-http”. The router has a very high priority (1000), but it can later be overridden by a router with an even higher priority. This re-enables the ACME challenge functionality.

To test it, stop everything, delete (or rename) acme.json and start it again. A new certificate should be created. You can also open http://traefik.example.com/.well-known/acme-challenge/x and check that Traefik logs the access to the unknown challenge.

I have created another router (lines 40-44) that redirects everything from HTTP to HTTPS. Since its priority (900) is lower than that of the ACME router, it applies only to non-ACME traffic on HTTP. Any service can override it by using a priority between 901 and 999.
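
As a sketch of such an override, the following compose file keeps one host reachable over plain HTTP instead of being redirected. The service name “whoami”, the traefik/whoami demo image and the host name whoami.example.com are made up for this example. The priority 950 is an arbitrary value from the 901-999 range.

```yaml
version: "3.0"
services:
  whoami:
    # Demo web server listening on port 80 (assumption for illustration)
    image: traefik/whoami
    networks:
      - services
    restart: unless-stopped
    labels:
      traefik.enable: "true"
      # Priority 950 beats http2https (900) but stays below acme (1000)
      traefik.http.routers.whoami-http.entrypoints: http
      traefik.http.routers.whoami-http.rule: Host(`whoami.example.com`)
      traefik.http.routers.whoami-http.priority: "950"
      traefik.http.services.whoami.loadbalancer.server.port: "80"

networks:
  services:
    external:
      name: services
```

With this in place, requests to http://whoami.example.com/ should be answered directly by the container, while requests below /.well-known/acme-challenge/ on that host are still handled by Traefik’s ACME router, since 1000 is higher than 950.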

Deploying Rancher

version: "3.0"
services:
  rancher:
    image: rancher/rancher:v2.5.5
    networks:
      - services
    restart: unless-stopped
    privileged: true
    command: |
      --acme-domain rancher.example.com
    volumes:
      - "./data:/var/lib/rancher"
    labels:
      traefik.enable: "true"
      traefik.http.routers.rancher-http.rule: Host(`rancher.example.com`)
      traefik.http.routers.rancher-http.entryPoints: http
      traefik.http.routers.rancher-http.priority: "1001"
      traefik.http.services.rancher-http.loadBalancer.server.port: "80"

      traefik.tcp.routers.rancher-https.rule: HostSNI(`rancher.example.com`)
      traefik.tcp.routers.rancher-https.entryPoints: https
      traefik.tcp.routers.rancher-https.tls.passthrough: "true"
      traefik.tcp.services.rancher-https.loadBalancer.server.port: "443"

networks:
  services:
    external:
      name: services

This will start Rancher Server and configure it to obtain a Let’s Encrypt certificate for rancher.example.com (lines 9-10).

The router in lines 15-18 sends all HTTP traffic for that domain to Rancher. Since its priority is 1001 (one higher than the ACME router above), the ACME traffic for that domain is sent to Rancher as well.

The router in lines 20-23 sends raw TCP traffic on port 443 (HTTPS) to Rancher so that Rancher can terminate TLS itself. The special HostSNI rule uses the Server Name Indication (SNI) extension of TLS to select only traffic for this particular domain.

Testing everything again

For testing I prefer to use “curl -i http://something” to actually see what’s returned. The browser’s web inspector will do the same job.

  • http://rancher.example.com should redirect to https://rancher.example.com. The redirection is done by Rancher (you can tell from the 302 response, which contains an HTML body with a link to the HTTPS page).
  • http://traefik.example.com should redirect to https://traefik.example.com. The redirection is done by Traefik (the 302 response has no HTML body and is text/plain).
  • http://rancher.example.com/.well-known/acme-challenge/x is handled by Rancher. It returns a 404 with the text “acme/autocert: certificate cache miss”.
  • http://traefik.example.com/.well-known/acme-challenge/x is handled by Traefik. It blocks for a while and produces log output on Traefik (like “Error getting challenge for token retrying in 13.516432903s”).
  • https://rancher.example.com and https://traefik.example.com should both have valid certificates.
  • When you export a “kubeconfig” file from Rancher, it should not contain any certificate info.

 
