Backend SSE handler already accepts ?tenantId= as an additive filter
on top of canSeeOp (Phase G stays load-bearing); frontend now passes
the global tenant filter chip's value through both NotificationBell
and ActivityTab. Watcher clears + restarts the stream when the
super-admin switches tenant context. Dismissed Set is user-level
and survives the switch.

#2 Dismiss / Clear failed:
- Per-row [×] (text 'Dismiss', shown on hover) on terminal ops
- 'Clear' button in dropdown header when any dismissable rows exist
- Dismissed IDs persisted to localStorage (tower_dismissed_ops)
- Pruned during the 30s sweep when the underlying op falls out
  of the recent window, which keeps storage from growing forever
- badgeCount + failedCount filter dismissed entries so the red
  pill clears the moment the operator acks the failure
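
A minimal sketch of the persist / prune / count logic described in the list above. Only the storage key tower_dismissed_ops comes from these notes; the Op shape, the KeyValueStore interface and the helper names are hypothetical, and the real component may be structured differently.

```typescript
// Minimal storage surface so the sketch also runs outside a browser;
// window.localStorage satisfies this interface.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

// Hypothetical op shape; only id and status matter for this sketch.
interface Op {
  id: string;
  status: "running" | "succeeded" | "failed";
}

const STORAGE_KEY = "tower_dismissed_ops"; // key named in the notes above

function loadDismissed(store: KeyValueStore): Set<string> {
  try {
    const raw = store.getItem(STORAGE_KEY);
    const parsed: unknown = raw ? JSON.parse(raw) : [];
    const ids = Array.isArray(parsed)
      ? parsed.filter((x): x is string => typeof x === "string")
      : [];
    return new Set<string>(ids);
  } catch {
    return new Set<string>(); // corrupt JSON: start clean instead of crashing
  }
}

function saveDismissed(store: KeyValueStore, dismissed: Set<string>): void {
  store.setItem(STORAGE_KEY, JSON.stringify(Array.from(dismissed)));
}

// Drop dismissed IDs whose op has fallen out of the recent window.
function pruneDismissed(dismissed: Set<string>, recentOps: Op[]): Set<string> {
  const live = new Set<string>(recentOps.map((op) => op.id));
  return new Set<string>(Array.from(dismissed).filter((id) => live.has(id)));
}

// Failed ops the operator has not acknowledged yet (drives the red pill).
function failedCount(ops: Op[], dismissed: Set<string>): number {
  return ops.filter((op) => op.status === "failed" && !dismissed.has(op.id)).length;
}
```

In this sketch the 30s sweep would call pruneDismissed and then saveDismissed, so the stored list never outgrows the recent window.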

#4 Toast popups:
- useToast composable + Toaster.vue mounted at App.vue
- Triggered from NotificationBell.upsert on terminal transition
  when the op was running ≥30s AND the bell isn't open
- Success: 5s auto-dismiss; Failure: sticky until clicked away
- Click-through links to the per-instance Activity timeline
  (#op_<id>) for any op carrying instanceCode
- Stack of 3, oldest drops on overflow
- No external dep; hand-rolled to match v3's component style
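
The toast rules in the list above can be sketched as three small pure functions. This is an illustration under assumptions, not the useToast source: the Toast shape and the function names are hypothetical, and only the #op_<id> fragment is built because the route in front of it is not stated here.

```typescript
type ToastKind = "success" | "failure";

// Hypothetical toast shape for this sketch.
interface Toast {
  kind: ToastKind;
  message: string;
  autoDismissMs: number | null; // null means sticky until clicked away
  anchor?: string; // "#op_<id>", set only when the op carries instanceCode
}

const MAX_TOASTS = 3;
const MIN_RUNTIME_MS = 30000; // only ops that ran for at least 30s get a toast
const SUCCESS_DISMISS_MS = 5000;

// Toast only for long-running ops, and only while the bell is closed.
function shouldToast(startedAtMs: number, finishedAtMs: number, bellOpen: boolean): boolean {
  return finishedAtMs - startedAtMs >= MIN_RUNTIME_MS && !bellOpen;
}

function makeToast(kind: ToastKind, message: string, opId: string, instanceCode?: string): Toast {
  const toast: Toast = {
    kind,
    message,
    autoDismissMs: kind === "success" ? SUCCESS_DISMISS_MS : null,
  };
  if (instanceCode) {
    toast.anchor = "#op_" + opId;
  }
  return toast;
}

// Append the new toast; once the stack exceeds three, the oldest drops.
function pushToast(stack: Toast[], toast: Toast): Toast[] {
  const next = stack.concat([toast]);
  return next.length > MAX_TOASTS ? next.slice(next.length - MAX_TOASTS) : next;
}
```

Keeping pushToast pure (it returns a new array) makes it easy to assign the result to a reactive ref in a composable.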

NotificationBell + ActivityTab opened EventSource without auth
(native EventSource API can't set Authorization headers). Phase G's
canSeeOp guard correctly dropped every event for the resulting
anonymous viewer, leaving the bell silent except for the one-shot
backfill on mount.

Backend: claimsFromRequest now falls back to ?token= query param
when the Authorization header is absent. HTTPS-only ingress means
the token stays inside the TLS tunnel; the 15-min access-token TTL
bounds any leakage if it ever surfaces in browser history or proxy
logs.

Frontend: streamOperation + streamAllOperations append the access
token via streamURL(). Plus token-expiry-aware reconnect: on
EventSource error, debounce 5s, close, run authFetch('/me') to let
the 0.61.18 refresh path renew the access token, then re-open with
a fresh streamURL. Without this, the native auto-reconnect would
loop forever with the now-stale token after 15 min.
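
A rough sketch of the reconnect sequence described above, with the browser pieces injected so it runs anywhere. Assumptions: the signature of streamURL, the ReconnectDeps interface and the connect helper are hypothetical; open stands in for creating an EventSource, renewToken for the authFetch('/me') round trip, and schedule for setTimeout. What happens when renewal fails is not stated in these notes, so the sketch simply leaves the stream closed.

```typescript
interface StreamHandle {
  close(): void;
}

// Hypothetical seams standing in for EventSource, authFetch('/me') and setTimeout.
interface ReconnectDeps {
  open(url: string, onError: () => void): StreamHandle;
  renewToken(): Promise<void>;
  schedule(fn: () => void, ms: number): void;
}

const RECONNECT_DEBOUNCE_MS = 5000;

// EventSource cannot send an Authorization header, so the token rides in the URL.
function streamURL(path: string, token: string): string {
  return path + "?token=" + encodeURIComponent(token);
}

function connect(path: string, getToken: () => string, deps: ReconnectDeps): void {
  let pending = false;
  let handle = deps.open(streamURL(path, getToken()), onError);

  function onError(): void {
    if (pending) return; // collapse a burst of errors into one reconnect
    pending = true;
    deps.schedule(() => {
      handle.close(); // stop the native auto-reconnect from reusing the stale URL
      deps.renewToken().then(
        () => {
          pending = false;
          // Re-open with a URL built from the freshly renewed token.
          handle = deps.open(streamURL(path, getToken()), onError);
        },
        () => {
          pending = false; // assumption: renewal failed, leave the stream closed
        }
      );
    }, RECONNECT_DEBOUNCE_MS);
  }
}
```

Reading the token through getToken at open time, rather than capturing it once, is what lets the re-opened stream pick up the renewed value.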

The grep -oE for instanceCode matches BOTH provenance.instanceCode
AND the v2 root mirror, returning two lines. sed processed each line,
so the resulting SOURCE_CODE shell variable was multi-line, which
made the directory check fail (-d "/var/lib/odoo/filestore/odoo16
[newline]odoo16") → rename branch silently skipped → Odoo with
db_name=odoo16v2 looks at /var/lib/odoo/filestore/odoo16v2/, finds
nothing, returns 500 on every asset.

Added head -1 to the pipe so SOURCE_CODE is single-line, plus an
echo so the rename branch's path is visible in Job logs even when
it short-circuits.

Phase C made instance-create tenant-aware for Cloudflare DNS, but
migrate.go and templates_deploy.go kept using the legacy global
*cloudflareClient (zone=odoosky.org). Result: a tenant migrate to
4th.online silently created the A record under odoosky.org as a
literal subdomain ('odoo16v2.tenants.4th.online.odoosky.org' →
right IP). Tower logged 'DNS A record set' successfully because
the API accepted the call, but the actual hostname the user
browses to was never published to the right zone.

Both flows now use cfResolver.clientFor(tenantID, fqdn) to find
the tenant's CF token + correct zone. If no token covers the
domain, the op fails with a clear 'configure tenant CF token'
message instead of silently writing to the wrong zone.
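
To illustrate the "does any token cover this domain" check, here is a sketch of suffix-based zone matching. It is written in TypeScript purely for illustration; the backend is Go and cfResolver.clientFor may work differently. TenantZone, tokenRef, zoneFor and recordName are all hypothetical names.

```typescript
// Hypothetical pairing of a Cloudflare zone with the token that manages it.
interface TenantZone {
  zone: string; // e.g. "4th.online", assumed lowercase
  tokenRef: string; // opaque reference to the tenant's CF token
}

function normalizeHost(fqdn: string): string {
  return fqdn.toLowerCase().replace(/\.$/, ""); // drop a trailing root dot
}

// Pick the most specific zone that covers the hostname, or fail loudly.
function zoneFor(fqdn: string, zones: TenantZone[]): TenantZone {
  const host = normalizeHost(fqdn);
  const covering = zones
    .filter((z) => host === z.zone || host.endsWith("." + z.zone))
    .sort((a, b) => b.zone.length - a.zone.length);
  if (covering.length === 0) {
    throw new Error("no Cloudflare token covers " + host + ": configure tenant CF token");
  }
  return covering[0];
}

// Record name relative to the zone apex ("@" for the apex itself).
function recordName(fqdn: string, zone: string): string {
  const host = normalizeHost(fqdn);
  return host === zone ? "@" : host.slice(0, host.length - zone.length - 1);
}
```

Matching on "." + zone rather than a bare suffix keeps a host such as x.my4th.online from being treated as part of 4th.online, and throwing on an empty match is what turns the old silent wrong-zone write into a visible failure.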

MigrateDrawer hardcoded '<option value="">Platform server (default)</option>' as
the first picker entry, regardless of tenant scope. A tenant operator
saw it as a selectable default, and selecting it (or just leaving
the picker on its empty default) sent the migrate to the platform
cluster, which the operator has no business deploying to.

Now: removed the hardcoded option. Auto-pick the first deployable
non-platform server on load (matches DeployInstanceDrawer pattern).
Picker shows 'Pick a server…' as a disabled placeholder when nothing
is selected.
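
The auto-pick rule can be sketched as a single lookup. The Server fields (isPlatform, deployable) and the function name are hypothetical; the real drawer may derive these flags differently.

```typescript
// Hypothetical server shape for this sketch.
interface Server {
  id: string;
  name: string;
  isPlatform: boolean;
  deployable: boolean;
}

// First deployable non-platform server, or "" so the disabled
// 'Pick a server…' placeholder stays selected.
function defaultServerId(servers: Server[]): string {
  const first = servers.find((s) => s.deployable && !s.isPlatform);
  return first ? first.id : "";
}
```

Returning an empty string rather than falling back to the platform server means a submit with nothing selected can be rejected by form validation instead of silently targeting the platform cluster.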

The export Job was using a stale platform Secret (s3-backup-creds)
and hardcoded bucket/endpoint, so bundles landed in odoosky-v3-backups
while Tower's verify (tenant-scoped after Phase F) read from the
tenant's bucket. Result: 'bundle missing from S3 after job
succeeded' even though the upload itself worked.

Same bug existed in import. Both fixed: keys+region+endpoint+bucket
now come from Tower's resolver view of the tenant, passed directly
into the Job env.

Plus: BackupsView crashed on r.backups.runs.length when runs is
null. Added the missing null guard.

Frontend authFetch was bouncing every 401 straight to /login,
ignoring the 30-day refresh-token cookie the backend already issues.
Result: with a 15-min access-token TTL, the operator was kicked to
login after every 15 min of idle.

Now: on 401, authFetch silently calls /api/auth/refresh, retries
the original request once with the new access token, and only
bounces to /login if refresh ALSO fails (refresh cookie expired or
revoked). Concurrent 401s coalesce onto a single in-flight refresh
to avoid rotating the refresh-token jti N times in a burst.
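
A sketch of the refresh-and-retry flow with single-flight coalescing, using injected functions instead of fetch so it is self-contained. Assumptions: createAuthFetch and its deps are hypothetical names; send stands in for the actual HTTP call, refresh for the POST to /api/auth/refresh, and toLogin for the redirect to /login.

```typescript
interface HttpResponse {
  status: number;
}

// Hypothetical seams for the HTTP call, the refresh endpoint and the redirect.
interface AuthDeps {
  initialToken: string;
  send(url: string, accessToken: string): Promise<HttpResponse>;
  refresh(): Promise<string>; // resolves to the new access token
  toLogin(): void;
}

function createAuthFetch(deps: AuthDeps) {
  let accessToken = deps.initialToken;
  let inFlight: Promise<string> | null = null;

  // Every caller that hits a 401 during a burst shares this one promise,
  // so the refresh endpoint is called once, not N times.
  function refreshOnce(): Promise<string> {
    if (inFlight === null) {
      inFlight = deps.refresh().then(
        (token) => {
          accessToken = token;
          inFlight = null;
          return token;
        },
        (err) => {
          inFlight = null;
          throw err;
        }
      );
    }
    return inFlight;
  }

  async function authFetch(url: string): Promise<HttpResponse> {
    const first = await deps.send(url, accessToken);
    if (first.status !== 401) return first;
    try {
      await refreshOnce();
    } catch {
      deps.toLogin(); // refresh cookie expired or revoked
      return first;
    }
    // Retry exactly once; a second 401 is returned to the caller as-is.
    return deps.send(url, accessToken);
  }

  return { authFetch, refreshOnce };
}
```

Clearing inFlight in both the success and failure branches matters: otherwise one failed refresh would pin every later 401 to the same rejected promise.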

handleUpdateAddons was fully synchronous, including the BuildKit
Jobs per addon image. cetmix_tower auto-pulls 8 deps × ~30-90s each
on a fresh cluster = 5-15 min. Reverse proxy timeout (60-120s) cuts
the request mid-build, browser shows 'Saving' forever, drawer
eventually closes on the timeout error, AND the cancelled context
kills the goroutine mid-flight so values.yaml never gets committed.

Now: handler validates inline (immediate feedback on bad input),
spawns an addon-stage op, returns 202 + opId in <1s. The goroutine
runs phases (resolve → build → commit → refresh) with a fresh
context that survives client disconnect. Operator watches it in the
bell + Activity tab, can keep working in another tab. This is the
same pattern Deploy/Migrate/Apply already use.

Two bugs the calm re-review surfaced:
1. substrateGateRejection used ListApplications (instance-only
   selector), the same bug we fixed in handleListServers but missed
   in the gate function. Result: the BACKEND gate was a pass-through
   for every preparing server. Frontend disable held the line in
   normal use, but the non-bypassable layer wasn't actually
   non-bypassable. Now uses ListClusterPlatformApps.
2. PlatformAppBadge had its own Argo→state mapping that classified
   Health=Degraded as 'Failed' immediately. Backend's
   deriveSubstrateStatus is forgiving within the budget (Degraded
   inside 10-min window reads as preparing). The two parallel
   logics meant the badge could show red 'Failed' while the gate
   considered it 'preparing'. Refactored badge to consume the
   backend's substrateStatus enum directly: single source of
   truth, no more disagreement.

DNS-01 against Cloudflare reliably takes 6-8 min for first-issue
(Cloudflare TXT write + global propagation + LE polling). The 5-min
budget red-flagged normal installs. Bumped to 10 min.

Also: Argo flickers to Degraded during retries while waiting on the
Certificate hook. Within the budget window we now classify Degraded
as 'preparing' instead of 'degraded', and only declare degraded
after the budget fully expires. Stops the badge from showing
'Failed' during a healthy install.
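
The budget rule can be sketched as a small classifier. This is TypeScript for illustration only; deriveSubstrateStatus lives in the Go backend and its exact rules are not shown here. Assumptions: the 'ready' value, the handling of Progressing, Suspended, Missing and Unknown, and the function signature are all guesses; only the 10-min budget and the Degraded-reads-as-preparing behaviour come from these notes.

```typescript
// Argo CD application health values.
type ArgoHealth = "Healthy" | "Progressing" | "Degraded" | "Suspended" | "Missing" | "Unknown";

// 'preparing', 'degraded' and 'unknown' appear in the notes; 'ready' is assumed.
type SubstrateStatus = "ready" | "preparing" | "degraded" | "unknown";

const BUDGET_MS = 10 * 60 * 1000; // first-issue DNS-01 takes 6-8 min, so allow 10

function deriveStatus(health: ArgoHealth, installStartedMs: number, nowMs: number): SubstrateStatus {
  if (health === "Healthy") return "ready";
  if (health === "Degraded" || health === "Progressing") {
    // Degraded flickers during Certificate-hook retries are expected
    // inside the budget, so they read as preparing until it expires.
    const withinBudget = nowMs - installStartedMs < BUDGET_MS;
    return withinBudget ? "preparing" : "degraded";
  }
  return "unknown"; // assumption: Suspended / Missing / Unknown are not classified
}
```

With the old 5-min budget, a Degraded reading at minute 7 of a normal install would already have been flagged; under this rule it stays 'preparing' until minute 10.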

ListApplications uses selector odoosky.io/component=instance which
hides cluster-platform apps. Separate ListClusterPlatformApps method
selects on app.kubernetes.io/managed-by=tower then filters by
-platform suffix, so no backfill is needed for existing servers.

Without this, substrateStatus + the Phase B4 PlatformAppBadge both
silently fell to 'unknown' / invisible because the lookup map was
empty.