cluster-platform-v3/templates/cluster-issuer.yaml
OdooSky v3 d602063448 chart 0.7.3 — slug-suffix per-tenant ClusterIssuer (qsoft2 SSL fix)
cluster-issuer.yaml: name → letsencrypt-prod-{{ .Values.tenant.slug }};
hard-pin apiTokenSecretRef.name to
cloudflare-api-token-{{ .Values.tenant.slug }} so it matches the
ESO-created Secret. The ACME account key is also slug-suffixed for
tenant isolation. Before 0.7.3 the unsuffixed letsencrypt-prod did not
match the name instance.go:504 stamps into per-instance Certificates
(letsencrypt-prod-<slug>), so cert-manager logged 'Referenced
ClusterIssuer not found' and erp2 kept serving the Traefik default cert.

tenants-wildcard-cert.yaml: issuerRef.name → letsencrypt-prod-{{ $.Values.tenant.slug }}
to match the renamed ClusterIssuer.

values.yaml: secrets.cloudflareTokenSecret block deprecated (the chart
no longer reads it; kept for back-compat with external overrides).

Diagnosed in the qsoft2 migrate test 2026-05-09.
2026-05-09 21:30:36 +03:00

{{- if and .Values.tenant.domain .Values.tenant.slug }}
# letsencrypt-prod-<slug> ClusterIssuer — DNS-01 challenge via Cloudflare,
# scoped to THIS tenant via the per-tenant CF token Secret. The
# `letsencrypt-prod-<slug>` naming MUST match tenantClusterIssuerName()
# in backend/cmd/api/tenant_substrate.go — the per-instance overlay
# renderer in instance.go:504 stamps that exact name into every
# Certificate's issuerRef. Pre-0.7.3 charts used the unsuffixed name
# `letsencrypt-prod`, which broke for any instance asking for the
# slugged form (the qsoft2 migrate test on 2026-05-09 surfaced this:
# erp2's Certificate referenced letsencrypt-prod-qsoft, the chart only
# rendered letsencrypt-prod, cert-manager logged "Referenced ClusterIssuer
# not found", erp2 served the Traefik default cert forever).
#
# Multi-zone: the solver has NO `selector.dnsZones` restriction. The
# tenant's Cloudflare token typically covers many zones (a tenant with
# 41 owned domains is normal); we let cert-manager pick whichever zone
# matches the requested host. The token's access is the natural
# boundary — if it can't write a zone, the challenge fails loudly.
#
# Earlier the solver was scoped to `.Values.tenant.domain` only, which
# made instances on ANY other tenant-owned domain unable to issue (the
# `app.havari.me` symptom on a tenant whose primary domain is
# `4th.online`). Dropping the selector unifies single-zone and
# multi-zone tenants under one issuer.
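#
# For reference, a sketch of the zone-scoped solver shape that was
# dropped. This is an assumption reconstructed from the description
# above, not copied from the pre-change template; <tenant.domain>
# stands for the rendered .Values.tenant.domain and the Secret name
# is a placeholder:
#
#   solvers:
#     - selector:
#         dnsZones:
#           - <tenant.domain>    # only hosts under this zone matched
#       dns01:
#         cloudflare:
#           apiTokenSecretRef:
#             name: <cloudflare token Secret>
#             key: api-token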
#
# The cloudflare-api-token-<slug> Secret is now chart-managed via the
# ESO ExternalSecret in cloudflare-api-token-externalsecret.yaml (which
# pulls the token from OpenBao at v3/tenants/<id>/cloudflare-token).
# Naming kept symmetric with that template.
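#
# Rough shape of that ExternalSecret, as a sketch only. The apiVersion,
# namespace, store kind/name, and remoteRef property are assumptions;
# the target Secret name, the `api-token` key, and the OpenBao path are
# the ones this file already relies on:
#
#   apiVersion: external-secrets.io/v1beta1    # assumption
#   kind: ExternalSecret
#   metadata:
#     name: cloudflare-api-token-<slug>
#     namespace: cert-manager                  # assumption
#   spec:
#     secretStoreRef:
#       kind: ClusterSecretStore               # assumption
#       name: openbao                          # hypothetical
#     target:
#       name: cloudflare-api-token-<slug>
#     data:
#       - secretKey: api-token
#         remoteRef:
#           key: v3/tenants/<id>/cloudflare-token
#           property: token                    # hypothetical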
#
# Sync wave 1 (Slice 2B.1.1, 2026-05-04). cert-manager itself
# installs at the default wave 0; Argo waits for ALL wave-0
# resources (cert-manager Deployments + webhook Service) to be
# Healthy before applying wave 1. Without the wave annotation we hit
# a race: Argo applied this ClusterIssuer in the same wave as the
# cert-manager Deployments → cert-manager-webhook wasn't Ready yet →
# the admission webhook rejected the resource → Argo backed off
# exponentially (30-90 s) before retrying. retries=2 was the smoking
# gun in the demo-server105 timing analysis (3 min to ready instead
# of ~45 s).
#
# Note ordering: ClusterIssuer at wave 1, Certificate at wave 2
# (in tenants-wildcard-cert.yaml) — Certificate references the
# ClusterIssuer by name, so the resource graph also reflects the
# logical dependency.
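#
# Sketch of the wave-2 side, for orientation only. The Certificate
# name, secretName, and dnsNames below are hypothetical and not copied
# from tenants-wildcard-cert.yaml; only the issuerRef name and the wave
# number are taken from this change:
#
#   apiVersion: cert-manager.io/v1
#   kind: Certificate
#   metadata:
#     name: tenants-wildcard                 # hypothetical
#     annotations:
#       argocd.argoproj.io/sync-wave: "2"
#   spec:
#     secretName: tenants-wildcard-tls       # hypothetical
#     dnsNames:
#       - "*.<tenant.domain>"                # hypothetical
#     issuerRef:
#       kind: ClusterIssuer
#       name: letsencrypt-prod-<slug>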
#
# Multi-tenant clusters (visiting tenants on a host tenant's cluster)
# remain a known gap (Item #9 follow-up): the ESO ExternalSecret loop
# only iterates the cluster-owner tenant. When a future deploy lands a
# non-owner tenant on a cluster, that tenant's CF Secret + Issuer must
# be applied out-of-band until this template grows a `Values.tenants[]`
# loop and Tower's onboarding code populates it.
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod-{{ .Values.tenant.slug }}
  annotations:
    argocd.argoproj.io/sync-wave: "1"
  labels:
    app.kubernetes.io/managed-by: cluster-platform-v3
    odoosky.io/tenant: {{ .Values.tenant.id | quote }}
spec:
  acme:
    email: {{ required "acme.email is required" .Values.acme.email | quote }}
    server: {{ .Values.acme.server | quote }}
    privateKeySecretRef:
      # Slug-suffixed so each tenant has its own ACME account key:
      # cleaner isolation if a tenant rotates / audits, and avoids
      # implicit shared state if two tenants ever land on one cluster.
      name: letsencrypt-prod-account-key-{{ .Values.tenant.slug }}
    solvers:
      - dns01:
          cloudflare:
            apiTokenSecretRef:
              name: cloudflare-api-token-{{ .Values.tenant.slug }}
              key: api-token
{{- end }}