cluster-platform-v3/templates/tenants-wildcard-cert.yaml
pro-777 7e3280aa26 feat(slice 2B.3): chart Restore half — injectedWildcards conditional (0.5.7)
Add the chart-side machinery that lets Tower bypass the cert-manager
Certificate path on Reconnect by injecting a Vault-stashed wildcard
cert directly as a kubernetes.io/tls Secret.

values.yaml:
  certManager.injectedWildcards: []
    Each entry: { root, primary, crt, key }. Empty list = legacy ACME-only.

templates/tenants-wildcard-cert.yaml:
  Build $injectedRoots index from injectedWildcards[]; per-domain
  Certificate is skipped when its root has an injected entry.

templates/tenants-wildcard-secret.yaml (NEW):
  Per injected entry, render kubernetes.io/tls Secret using the same
  name the cert path would have produced (tenants-wildcard-tls primary,
  tenants-wildcard-<root-as-dashes>-tls non-primary). Sync-wave 2 to
  match the cert path's timing. Label odoosky.io/wildcard-source=
  vault-injected so harvester can skip them.

Verified via helm template + self-signed dummy cert:
  - Pure injection: 0 Certificate, 1 Secret (correct name + base64)
  - Pure ACME: 1 Certificate, 0 Secret (status quo)
  - Mixed (2 domains, 1 injected): 1 Certificate + 1 Secret

Inert without Tower wiring — existing clusters render identically to
0.5.6 because injectedWildcards defaults to []. Pushed first as the
foundation layer for the upcoming Tower restore + harvester slices.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 14:27:30 +03:00

# tenants-wildcard Certificate(s) — one per VERIFIED domain in
# tenant.domains[] (#320.C). The primary entry keeps the canonical
# `tenants-wildcard` / `tenants-wildcard-tls` names so existing
# instances (whose IngressRoute references that exact secret) keep
# serving without re-deploy. Each non-primary domain gets its own
# Certificate + Secret named after the root with `.` → `-`, so the
# cluster ends up with N TLS Secrets — one per tenant domain — and
# instances can pick the right one based on their host.
#
# Legacy fallback: when tenant.domains[] is empty (a chart consumer
# from before #320.A), synthesize a single entry from the scalar
# tenant.wildcardHost so this template stays one-pass.
#
# Verified=false entries are skipped on purpose — that's the safety
# valve called out in #320.A. A half-configured add-domain (root set,
# DNS not yet pointed) waits in the data layer; the chart doesn't
# try to issue and stall the whole sync.
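#
# Illustrative values sketch (not copied from the chart's values.yaml;
# example.com / example.org are placeholder roots, and the "*.<root>"
# form of wildcardHost is an assumption). The keys are the ones this
# template reads: root, wildcardHost, primary, verified.
#
#   tenant:
#     domains:
#       - root: example.com
#         wildcardHost: "*.example.com"
#         primary: true
#         verified: true     # Certificate tenants-wildcard, Secret tenants-wildcard-tls
#       - root: example.org
#         wildcardHost: "*.example.org"
#         primary: false
#         verified: false    # skipped until verified becomes true
#
# Once example.org is verified, it would render as Certificate
# tenants-wildcard-example-org with Secret
# tenants-wildcard-example-org-tls.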
#
# DNS-01 takes 30-90 s on a fast day, 5-10 min on a slow one
# (Cloudflare zone propagation + LE order processing). Until Slice
# 2B.1 (2026-05-04) the wildcard Certificate's Ready status gated
# the entire Argo Application's Health — meaning Connect Server
# sat at "Provisioning…" for the full 5-10 min before substrate
# became "Ready", even though all the BASE infra (longhorn,
# cert-manager, traefik, registry) was up within ~30 s.
#
# The annotation `argocd.argoproj.io/sync-options: SkipHealthCheck=true`
# below tells Argo "still sync this resource, but don't include
# its Ready status when computing the parent Application's Health".
# Result: substrate becomes Ready in ~30 s; the wildcard issues in
# the background.
#
# Tradeoff: an instance deployed inside the first ~5 min after
# Connect references a Secret (`tenants-wildcard-tls`) that doesn't
# exist yet — its IngressRoute is healthy but TLS is unavailable.
# Slice 2B.2 will plumb a per-host HTTP-01 fallback so the very
# first deploy is also fast. Until then the operator should know:
# Substrate Ready ≠ wildcard ready. Watch for the Secret to appear
# (`kubectl -n tenants get secret tenants-wildcard-tls`) before the
# first deploy on a fresh cluster.
{{- $domains := .Values.tenant.domains | default (list) }}
{{- if and (eq (len $domains) 0) .Values.tenant.wildcardHost }}
{{- $domains = list (dict
      "root" .Values.tenant.domain
      "wildcardHost" .Values.tenant.wildcardHost
      "primary" true
      "verified" true) }}
{{- end }}
{{/* Slice 2B.3 — index of roots that have a Vault-stashed cert
injected via certManager.injectedWildcards[]. We skip the
Certificate resource entirely for those; the sibling
tenants-wildcard-secret.yaml renders the kubernetes.io/tls
Secret directly so no ACME order is placed. */}}
{{- $injectedRoots := dict }}
{{- range .Values.certManager.injectedWildcards | default (list) }}
{{- if and .root .crt .key }}
{{- $_ := set $injectedRoots .root true }}
{{- end }}
{{- end }}
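{{- /* Illustrative sketch only: the root is a placeholder and the
       crt/key values are elided. Whether they carry PEM text or
       pre-encoded base64 is decided by tenants-wildcard-secret.yaml,
       which is not shown here. An entry shaped like

         certManager:
           injectedWildcards:
             - root: example.com
               primary: true
               crt: <certificate material>
               key: <private key material>

       adds "example.com" to $injectedRoots, so the loop below
       renders no Certificate for that root. An entry missing root,
       crt, or key is ignored by the guard above and its root stays
       on the ACME path. */}}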
{{- range $i, $d := $domains }}
{{- if and $d.verified $d.wildcardHost (not (hasKey $injectedRoots $d.root)) }}
{{- $suffix := "" }}
{{- if not $d.primary }}
{{- $suffix = printf "-%s" (replace "." "-" $d.root) }}
{{- end }}
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: {{ printf "tenants-wildcard%s" $suffix | quote }}
  namespace: tenants
  labels:
    app.kubernetes.io/managed-by: cluster-platform-v3
    odoosky.io/domain-root: {{ $d.root | quote }}
    {{- if $d.primary }}
    odoosky.io/domain-primary: "true"
    {{- end }}
  annotations:
    # Slice 2B.1: substrate Ready in ~30 s. Argo will still
    # sync this Certificate (cert-manager will issue it via
    # DNS-01 in the background), but its Ready condition does
    # NOT gate the parent Application's Health calculation. So
    # the cluster-platform-v3 App flips Healthy as soon as the
    # base components (longhorn + cert-manager + traefik +
    # registry) are up, instead of waiting 5-10 min for LE to
    # finish the wildcard issuance.
    argocd.argoproj.io/sync-options: SkipHealthCheck=true
    # Slice 2B.1.1, wave 2: apply AFTER the ClusterIssuer
    # (wave 1) which depends on cert-manager (wave 0 default).
    # Argo enforces strict wave ordering with health-gating
    # between waves, so the Certificate never lands before its
    # ClusterIssuer exists or before cert-manager-webhook is
    # accepting admission requests. Eliminates the retries=2
    # exponential-backoff penalty observed on demo-server105.
    argocd.argoproj.io/sync-wave: "2"
spec:
  secretName: {{ printf "tenants-wildcard%s-tls" $suffix | quote }}
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  commonName: {{ $d.wildcardHost | quote }}
  dnsNames:
    - {{ $d.wildcardHost | quote }}
  # Renew 30 days before expiry. Let's Encrypt certs are 90-day, so
  # this gives cert-manager a 30-day window to retry if Cloudflare
  # has a bad day during renewal.
  renewBefore: 720h
{{- end }}
{{- end }}