# tenants-wildcard Certificate(s) — one per VERIFIED domain in
# tenant.domains[] (#320.C). The primary entry keeps the canonical
# `tenants-wildcard` / `tenants-wildcard-tls` names so existing
# instances (whose IngressRoute references that exact secret) keep
# serving without re-deploy. Each non-primary domain gets its own
# Certificate + Secret named after the root with `.` → `-`, so the
# cluster ends up with N TLS Secrets — one per tenant domain — and
# instances can pick the right one based on their host.
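#
# A minimal sketch of a two-domain values block, to make the naming
# rule concrete. The hostnames are hypothetical; the keys (root,
# wildcardHost, primary, verified) are the ones this template reads:
#
#   tenant:
#     domains:
#       - root: example.com
#         wildcardHost: "*.example.com"
#         primary: true
#         verified: true
#       - root: example.org
#         wildcardHost: "*.example.org"
#         primary: false
#         verified: true
#
# With these values the primary entry would render as
# `tenants-wildcard` / `tenants-wildcard-tls`, and the second entry
# as `tenants-wildcard-example-org` / `tenants-wildcard-example-org-tls`.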
#
# Legacy fallback: when tenant.domains[] is empty (a chart consumer
# from before #320.A), synthesize a single entry from the scalar
# tenant.wildcardHost so this template stays one-pass.
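#
# As an illustration (hypothetical hostnames), a legacy values file
# that sets only the scalars:
#
#   tenant:
#     domain: example.com
#     wildcardHost: "*.example.com"
#
# is handled as if it had supplied this single entry:
#
#   tenant:
#     domains:
#       - root: example.com
#         wildcardHost: "*.example.com"
#         primary: true
#         verified: true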
#
# Verified=false entries are skipped on purpose — that's the safety
# valve called out in #320.A. A half-configured add-domain (root set,
# DNS not yet pointed) waits in the data layer; the chart doesn't
# try to issue and stall the whole sync.
#
# DNS-01 takes 30-90 s on a fast day, 5-10 min on a slow one
# (Cloudflare zone propagation + LE order processing). Until Slice
# 2B.1 (2026-05-04) the wildcard Certificate's Ready status gated
# the entire Argo Application's Health, meaning Connect Server
# sat at "Provisioning…" for the full 5-10 min before substrate
# became "Ready", even though all the BASE infra (longhorn,
# cert-manager, traefik, registry) was up within ~30 s.
#
# The annotation `argocd.argoproj.io/sync-options: SkipHealthCheck=true`
# below tells Argo "still sync this resource, but don't include
# its Ready status when computing the parent Application's Health".
# Result: substrate becomes Ready in ~30 s; the wildcard issues in
# the background.
#
# Tradeoff: an instance deployed inside the first ~5 min after
# Connect references a Secret (`tenants-wildcard-tls`) that doesn't
# exist yet — its IngressRoute is healthy but TLS is unavailable.
# Slice 2B.2 will plumb a per-host HTTP-01 fallback so the very
# first deploy is also fast. Until then the operator should know:
# Substrate Ready ≠ wildcard ready. Watch for the Secret to appear
# (`kubectl -n tenants get secret tenants-wildcard-tls`) before the
# first deploy on a fresh cluster.
{{- $domains := .Values.tenant.domains | default (list) }}
{{- if and (eq (len $domains) 0) .Values.tenant.wildcardHost }}
{{- $domains = list (dict
      "root" .Values.tenant.domain
      "wildcardHost" .Values.tenant.wildcardHost
      "primary" true
      "verified" true) }}
{{- end }}
{{- range $i, $d := $domains }}
{{- if and $d.verified $d.wildcardHost }}
{{- $suffix := "" }}
{{- if not $d.primary }}
{{- $suffix = printf "-%s" (replace "." "-" $d.root) }}
{{- end }}
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: {{ printf "tenants-wildcard%s" $suffix | quote }}
  namespace: tenants
  labels:
    app.kubernetes.io/managed-by: cluster-platform-v3
    odoosky.io/domain-root: {{ $d.root | quote }}
    {{- if $d.primary }}
    odoosky.io/domain-primary: "true"
    {{- end }}
  annotations:
    # Slice 2B.1: substrate Ready in ~30 s. Argo will still
    # sync this Certificate (cert-manager will issue it via
    # DNS-01 in the background), but its Ready condition does
    # NOT gate the parent Application's Health calculation. So
    # the cluster-platform-v3 App flips Healthy as soon as the
    # base components (longhorn + cert-manager + traefik +
    # registry) are up, instead of waiting 5-10 min for LE to
    # finish the wildcard issuance.
    argocd.argoproj.io/sync-options: SkipHealthCheck=true
    # Slice 2B.1.1: wave 2, applied AFTER the ClusterIssuer
    # (wave 1) which depends on cert-manager (wave 0 default).
    # Argo enforces strict wave ordering with health-gating
    # between waves, so the Certificate never lands before its
    # ClusterIssuer exists or before cert-manager-webhook is
    # accepting admission requests. Eliminates the retries=2
    # exponential-backoff penalty observed on demo-server105.
    argocd.argoproj.io/sync-wave: "2"
spec:
  secretName: {{ printf "tenants-wildcard%s-tls" $suffix | quote }}
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  commonName: {{ $d.wildcardHost | quote }}
  dnsNames:
    - {{ $d.wildcardHost | quote }}
  # Renew 30 days before expiry. Let's Encrypt certs are 90-day, so
  # this gives cert-manager a 30-day window to retry if Cloudflare
  # has a bad day during renewal.
  renewBefore: 720h
{{- end }}
{{- end }}