Autonomous Agent Orchestration Platform for Multi-Brand Operations

By Amar Kumar

A proposed autonomous agent orchestration platform for a US-based operator running multiple brands: AI video production, real-estate technology, and regulated medical-compliance clinics. Instead of one-off gigs and copy-paste prompts, the architecture would treat the agent harness (routing, skills, MCP, hooks, observability) as the product, with Claude Code, Kimi, GLM, Minimax, and Qwen coordinated like sections of an orchestra rather than treated as competing chatbots.

Proposed outcome: One standing platform where overnight planner / generator / evaluator workflows, Supabase data pipelines, custom MCP servers, and operator dashboards share a model router, security layer, and observability stack — so weekly briefs ship without rebuilding the harness each time.

Scenario

This brief describes a proposed solution — not a delivered engagement. It maps a recurring pattern: a multi-brand operator that needs a long-term AI engineering function, not project-by-project contractors.

Brands: AI video production (avatars, character animation, real-estate walkthroughs); AI-powered real-estate technology (MLS, HUD, public-record ingestion); network of medical compliance clinics (regulated workflows, operator tooling)
Owner profile: Direct access to ownership, fast-moving weekly briefs, infrastructure budget (Supabase, Vercel, Cloudflare, Runpod), paid access to frontier and open coding models
Operating model: 30–40 hrs/week standing capacity; harness and platform amortized across brands; lead engineer grows into team lead over time
Scope: Multi-agent overnight jobs, data pipelines, internal dashboards, custom MCP when off-the-shelf tools miss, WordPress/Next.js surfaces, voice/video automation (Whisper, fal.ai, Seedance-class models)

Problem

Treating AI as “better autocomplete” does not scale across three regulated and media-heavy brands. Common failure modes:

Single-shot prompts — no planner/evaluator loop; jobs stall silently overnight with no dead-letter or escalation
Model chaos — engineers pick models ad hoc; no router rules for orchestrator brain vs primary coder vs fallback; cost and quality drift
Siloed stacks — each brand reinvents Supabase schemas, MCP servers, and deploy paths; no shared observability or security baseline
Contract churn — gig hunters finish a brief and leave; institutional knowledge walks out; the harness never compounds
Regulated blind spots — medical compliance and public-record pipelines need prompt-injection guards, audit trails, and human-review gates; chat wrappers skip this
Operator gap — non-technical staff across brands lack dashboards to trigger, monitor, and approve agent output

Requirements

Functional

Model router — explicit rules: orchestrator brain, primary coder, fallback; task-type and cost/latency aware
Overnight multi-agent workflows — planner → generator → evaluator graphs with retries, checkpoints, and morning digest
Real-estate data pipeline — MLS feeds, HUD records, county/public sources → normalized Supabase tables with webhook triggers downstream
Custom MCP servers — brand-specific tools (CRM, MLS bridge, clinic scheduling, media render queue) exposed uniformly to Claude Code and headless agents
Internal operator UI — job queue, approval inbox, run replay, brand switcher
Media automation — Whisper transcription, FFmpeg transforms, fal.ai / Seedance-class video generation in batch pipelines
Web surfaces — WordPress and Next.js sites generated and updated through AI-native workflows with human publish gates

Non-functional

Observability — trace every agent run (Langfuse or Helicone); token/cost attribution per brand and workflow
Security — mitigate “lethal trifecta” (untrusted input + tool access + external comms); sandbox MCP tools; secrets in vault
Reliability — idempotent ETL, dead-letter queues, alert on hung stdio bridges or stuck planner loops
Maintainability — skills, slash commands, hooks, and router config in git; reproducible dev environment (Claude Code / Cursor parity)
Compliance-ready — PHI-adjacent clinic flows get audit logs, PII redaction in traces, optional human-in-the-loop nodes
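Cost attribution per brand and workflow can be illustrated with a small rollup over trace records. This is a minimal sketch, not Langfuse's API: the record fields (`brand`, `workflow`, `model`, `tokens`) and the per-1k-token prices are assumptions made up for the example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def cost_by_brand_workflow(
    traces: List[dict], price_per_1k: Dict[str, float]
) -> Dict[Tuple[str, str], float]:
    """Sum token cost per (brand, workflow) pair."""
    totals: Dict[Tuple[str, str], float] = defaultdict(float)
    for t in traces:
        rate = price_per_1k[t["model"]]  # hypothetical price table
        totals[(t["brand"], t["workflow"])] += t["tokens"] / 1000 * rate
    return dict(totals)


# Hypothetical trace records and prices, for illustration only.
sample_traces = [
    {"brand": "real-estate", "workflow": "etl", "model": "open", "tokens": 4000},
    {"brand": "real-estate", "workflow": "etl", "model": "frontier", "tokens": 2000},
    {"brand": "video", "workflow": "render", "model": "frontier", "tokens": 1000},
]
sample_prices = {"open": 0.25, "frontier": 0.5}
rollup = cost_by_brand_workflow(sample_traces, sample_prices)
```

A rollup like this would feed the per-brand budget lines that the router's cost caps depend on.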

Architecture

Four layers: a control plane (router, scheduler, observability), a harness (Claude Code + MCP + skills/hooks), brand workflows (LangGraph graphs per domain), and data + media (Supabase, object storage, Runpod workers).

flowchart TB
  classDef control fill:#ede9fe,stroke:#7c3aed,color:#5b21b6
  classDef harness fill:#dbeafe,stroke:#2563eb,color:#1e3a8a
  classDef workflow fill:#f1f5f9,stroke:#64748b,color:#334155
  classDef data fill:#ccfbf1,stroke:#0d9488,color:#115e59
  classDef ext fill:#f8fafc,stroke:#475569,color:#334155

  RT["Model Router\norchestrator · coder · fallback"]:::control
  SCH["Scheduler\nTemporal / cron"]:::control
  OBS["Observability\nLangfuse traces"]:::control
  RT --> SCH
  SCH --> OBS

  CC["Claude Code harness\nskills · hooks · subagents"]:::harness
  MCP["MCP server fleet\nMLS · CRM · media · WP"]:::harness
  CC --> MCP
  RT --> CC

  RE["Real-estate graph\nplanner · ETL · evaluator"]:::workflow
  VI["Video graph\ngen · QA · render"]:::workflow
  CL["Clinic graph\ncompliance · docs"]:::workflow
  CC --> RE
  CC --> VI
  CC --> CL

  SB[("Supabase\nproperties · jobs · audit")]:::data
  S3["Object storage\nmedia · exports"]:::data
  RP["Runpod / VPS\nFFmpeg · GPU"]:::data
  RE --> SB
  VI --> S3
  VI --> RP
  CL --> SB

  MLS["MLS / HUD / public APIs"]:::ext
  FAL["fal.ai · Whisper"]:::ext
  WP["WordPress / Next.js"]:::ext
  RE --> MLS
  VI --> FAL
  CL --> WP

Platform architecture — shared control plane and harness, brand-specific LangGraph workflows, Supabase + media workers

sequenceDiagram
  autonumber
  participant SCH as Scheduler
  participant RT as Model Router
  participant PL as Planner Agent
  participant GEN as Generator Agent
  participant EV as Evaluator Agent
  participant MCP as MCP Tools
  participant SB as Supabase
  participant OBS as Langfuse
  SCH->>RT: enqueue overnight job
  RT->>PL: route orchestrator model
  PL->>SB: read backlog + context
  PL->>OBS: trace plan_id
  PL->>GEN: task graph + constraints
  RT->>GEN: route primary coder
  GEN->>MCP: tool calls (ETL, render, publish)
  MCP->>SB: upsert rows / artifacts
  GEN->>EV: candidate output
  RT->>EV: route evaluator model
  alt pass
    EV->>SB: mark complete + notify
  else fail
    EV->>PL: requeue with critique
  end
  EV->>OBS: score + cost rollup

Overnight sequence — router picks models per role; evaluator gates promotion to production
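The control flow of the overnight sequence can be sketched in plain Python, without LangGraph, to show how the evaluator gate and the requeue path fit together. All names here (`run_overnight_job`, `Verdict`, `JobResult`, `max_attempts`) are hypothetical; a real build would presumably express the same loop as checkpointed graph nodes.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Verdict:
    passed: bool
    critique: str = ""


@dataclass
class JobResult:
    status: str                      # "complete" or "dead_letter"
    attempts: int
    output: Optional[str] = None
    critiques: List[str] = field(default_factory=list)


def run_overnight_job(
    brief: str,
    plan: Callable[[str, List[str]], str],
    generate: Callable[[str], str],
    evaluate: Callable[[str], Verdict],
    max_attempts: int = 3,
) -> JobResult:
    """Planner -> generator -> evaluator loop with a bounded retry budget."""
    critiques: List[str] = []
    for attempt in range(1, max_attempts + 1):
        task_plan = plan(brief, critiques)   # planner sees prior critiques
        candidate = generate(task_plan)
        verdict = evaluate(candidate)
        if verdict.passed:
            return JobResult("complete", attempt, candidate, critiques)
        critiques.append(verdict.critique)   # requeue with critique
    # Retry budget exhausted: park the job for human inspection.
    return JobResult("dead_letter", max_attempts, None, critiques)
```

Passing the three agents in as callables keeps the loop testable with stubs before any model or MCP tool is wired in.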

Component map by platform layer (major services per tier)

End-to-end flow

Weekly brief → Router + harness → Overnight graph → Evaluator gate → Operator inbox → Production

From ownership brief to production — shared harness, brand workflow, human approval where required

Illustrative model routing mix by agent role (% of routed calls in a typical week)

Indicative standing-engineering capacity split across brand domains (% of weekly hours)

Recommended stack

Recommendation: Claude Code (or Cursor-equivalent harness) as the daily driver; LangGraph for durable overnight graphs; Supabase for operational data and realtime operator UI; a YAML-driven model router with cost caps; Temporal or a cron service (Supabase pg_cron, Cloudflare Workers cron) for schedules; Langfuse for traces.

Layer	Technology	Why
Daily harness	Claude Code + skills/hooks/MCP	Subagents, repo-aware edits, repeatable slash commands — the “framework” layer
Model router	Custom router service + YAML rules	Explicit orchestrator / coder / fallback; routes Claude, Kimi K2, GLM, Minimax, Qwen by task type and budget
Overnight orchestration	LangGraph + Python 3.11	Checkpointed planner/generator/evaluator graphs with human-in-the-loop nodes
Scheduling	Temporal Cloud or Supabase pg_cron	Reliable overnight runs, retries, visibility into stuck workflows
Data plane	Supabase (Postgres + Edge Functions)	MLS/HUD normalized schema, webhooks, Row Level Security per brand
MCP fleet	TypeScript + Python MCP servers	Uniform tool surface for harness and headless agents; versioned in monorepo
Media workers	Runpod / VPS + FFmpeg + fal.ai	GPU bursts for video; CPU workers for transcode and Whisper batch
Web	Next.js on Vercel + WordPress REST	Fast marketing surfaces; existing WP estates stay integrated via MCP
Observability	Langfuse (self-hosted or cloud)	Trace spans per agent, prompt/version tags, cost by brand
Secrets	Cloudflare Workers secrets / Doppler	Central rotation; no keys in agent prompts or repos

Why not a single model everywhere? Orchestration benefits from a strong reasoning model; bulk codegen and ETL transforms can run on open models at lower cost; evaluators may use a different model to reduce self-confirmation bias. The router encodes these rules explicitly instead of “pick what feels best.”

Why not n8n-only? Multi-step agent QA, MCP tool auth, and checkpointed overnight graphs outgrow visual chains. Use n8n only for lightweight webhook fan-out (Slack, email digests).

Agent & component design

Model router — routing rules (example)

Role	Default model	Fallback	Rule
Orchestrator brain	Claude Sonnet / Opus class	GPT-4.1	Planning, decomposition, tool-selection — always highest reasoning tier under daily cost cap
Primary coder	Claude Code default	Kimi K2 or Qwen Coder	Repo edits and MCP tool loops; switch to open model when task tag is `bulk_etl` or token estimate > 80k
Evaluator	Different family than generator	GLM or Minimax	Structured rubric JSON; reject if generator and evaluator share same model ID
Embeddings / classify	Small open model	Hosted embed API	Router pre-step; never burn frontier tokens on routing labels
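The rules in the table could be encoded roughly as follows. This is a sketch under assumptions: the model ID strings are placeholders, the evaluator default is arbitrarily set to one of the listed fallbacks, and daily cost caps are left out for brevity.

```python
from dataclasses import dataclass

# Placeholder model IDs standing in for the table's entries (assumptions).
ROUTES = {
    "orchestrator": {"default": "claude-opus-class", "fallback": "gpt-4.1"},
    "coder": {"default": "claude-code-default", "fallback": "kimi-k2"},
    "evaluator": {"default": "glm", "fallback": "minimax"},
}
BULK_TOKEN_THRESHOLD = 80_000  # "token estimate > 80k" from the table


@dataclass(frozen=True)
class Task:
    role: str
    tag: str = ""
    token_estimate: int = 0


def pick_model(task: Task, primary_available: bool = True) -> str:
    """Apply the coder switch rule first, then default/fallback selection."""
    route = ROUTES[task.role]
    if task.role == "coder" and (
        task.tag == "bulk_etl" or task.token_estimate > BULK_TOKEN_THRESHOLD
    ):
        return route["fallback"]  # open model for bulk or very large tasks
    return route["default"] if primary_available else route["fallback"]


def check_evaluator_pairing(generator_model: str, evaluator_model: str) -> None:
    """Reject a run where generator and evaluator share the same model ID."""
    if generator_model == evaluator_model:
        raise ValueError("generator and evaluator must not share a model ID")
```

In the proposed build the `ROUTES` mapping would live in the YAML router config rather than in code, so rule changes show up as reviewable diffs.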

1 — Planner agent

Input: weekly brief JSON, backlog tables, prior run critiques
Output: DAG of tasks with model hints, MCP tool list, acceptance criteria, ETA
QA gate: plan must reference existing schemas/MCP versions — no greenfield table names without migration stub
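The planner QA gate could be a plain validation pass over the plan before any generator runs. The plan shape below (`tasks`, `tables`, `tools`, `acceptance`, `migration_stubs`) is a hypothetical JSON layout chosen for this sketch, not a defined contract.

```python
from typing import Dict, List, Set


def validate_plan(
    plan: Dict,
    known_tables: Set[str],
    known_tools: Dict[str, str],
) -> List[str]:
    """Return a list of violations; an empty list means the plan passes the gate."""
    errors: List[str] = []
    stubs = set(plan.get("migration_stubs", []))
    for task in plan.get("tasks", []):
        # New table names are allowed only with a migration stub.
        for table in task.get("tables", []):
            if table not in known_tables and table not in stubs:
                errors.append(f"{task['id']}: table '{table}' has no migration stub")
        # Tools must exist at the version the plan references.
        for tool, version in task.get("tools", {}).items():
            if known_tools.get(tool) != version:
                errors.append(f"{task['id']}: unknown tool or version '{tool}@{version}'")
        if not task.get("acceptance"):
            errors.append(f"{task['id']}: missing acceptance criteria")
    return errors
```

Returning all violations at once, rather than failing on the first, gives the planner a complete critique to work from on the next attempt.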

2 — Generator agent (domain variants)

Real-estate: pull MLS/HUD deltas, map to Supabase, trigger property webhooks
Video: script → voice (TTS; Whisper for transcription) → fal.ai render → FFmpeg concat → storage URL
Clinic ops: generate compliance checklists and operator docs; never auto-send external comms without approval node
Web: Next.js/WordPress component drafts via MCP publish tools (draft-only by default)

3 — Evaluator agent

Rubric scores: schema valid, test pass, diff size bounds, policy (no PII in logs, no prompt injection markers)
On fail: structured critique back to planner with max retry budget; dead-letter after N attempts
Emits Langfuse score events for weekly quality trends
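A rubric of this kind might reduce to a set of named boolean checks that serialize to JSON. The sketch below is illustrative only: the `Candidate` fields, the 500-line diff bound, and the two regex patterns are assumptions, and real PII or injection detection would need to be far broader.

```python
import re
from dataclasses import dataclass
from typing import List

# Illustrative patterns only; real policy checks would be broader.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
INJECTION_RE = re.compile(r"ignore (all )?previous instructions", re.IGNORECASE)


@dataclass
class Candidate:
    schema_valid: bool
    tests_passed: bool
    diff_lines: int
    log_text: str


def score(candidate: Candidate, max_diff_lines: int = 500) -> dict:
    """Evaluate the rubric and return a JSON-serializable result."""
    checks = {
        "schema_valid": candidate.schema_valid,
        "tests_passed": candidate.tests_passed,
        "diff_within_bounds": candidate.diff_lines <= max_diff_lines,
        "no_pii_in_logs": EMAIL_RE.search(candidate.log_text) is None,
        "no_injection_markers": INJECTION_RE.search(candidate.log_text) is None,
    }
    failed: List[str] = [name for name, ok in checks.items() if not ok]
    return {"passed": not failed, "checks": checks, "failed": failed}
```

The `failed` list doubles as the structured critique sent back to the planner, and each check name could map to one score event in the trace.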

4 — MCP server fleet (shared)

mls-bridge — RESO/RETS or vendor API → staged rows
supabase-ops — typed CRUD with RLS context per brand
media-queue — enqueue Runpod jobs, poll status
wordpress-mcp — draft posts, media upload, meta fields
clinic-docs — template fill, PDF export, audit log write

5 — Security envelope

Untrusted web content never flows directly into tools with write access — sanitize + allowlist domains
MCP tools scoped per brand via JWT claims; read-only tools default
Hooks block commits containing secrets; pre-tool-call hook strips injection patterns from fetched HTML
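The allowlist and pre-tool-call sanitizer could look roughly like this. The domain names and patterns are hypothetical placeholders; pattern stripping reduces exposure but does not guarantee that injected instructions are caught, so write-capable tools would still need the scoping described above.

```python
import re
from urllib.parse import urlparse

ALLOWED_DOMAINS = {"hud.gov", "example-mls.test"}  # hypothetical allowlist

# Illustrative patterns; stripping is a mitigation, not a guarantee.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions[^.]*\.?", re.IGNORECASE),
    re.compile(r"<script\b.*?</script>", re.IGNORECASE | re.DOTALL),
]


def is_allowed(url: str) -> bool:
    """True if the URL's host is an allowlisted domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in ALLOWED_DOMAINS)


def sanitize_fetched(url: str, html: str) -> str:
    """Refuse non-allowlisted sources, then strip known injection patterns."""
    if not is_allowed(url):
        raise PermissionError(f"domain not on allowlist: {url}")
    for pattern in INJECTION_PATTERNS:
        html = pattern.sub("[removed]", html)
    return html
```

Matching on the parsed hostname rather than a substring of the URL avoids lookalike hosts such as `hud.gov.evil.com` passing the check.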

Suggested phase timeline (weeks) for platform foundation through first overnight production graph

Implementation plan

Phase 1 — Harness & router foundation (week 1–2)

Monorepo layout: router/, mcp/, graphs/, skills/. Claude Code skills for deploy, test, and trace replay. YAML router with three roles and cost caps. Langfuse project per brand. Dev Supabase with RLS skeleton.

Risk: Model API variance — stub adapters early. Rollback: manual Claude Code sessions without overnight scheduler until router stable.

Phase 2 — MCP server fleet v1 (week 3–4)

Ship supabase-ops and wordpress-mcp; stub mls-bridge with sample feed. Document tool contracts in OpenAPI-style markdown. Integration tests that run headless against local Supabase.

Risk: MLS vendor access delays — use public HUD/sample RESO sandbox. Rollback: generators write to staging schema only.

Phase 3 — Real-estate ETL graph (week 5–6)

LangGraph overnight job: ingest → normalize → dedupe → webhook emit. Idempotent upserts on natural keys (APN, listing ID). Operator dashboard: last run, row counts, error samples.
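Idempotent upserts on natural keys can be shown with an in-memory SQLite table standing in for Supabase Postgres, since both accept the `INSERT ... ON CONFLICT ... DO UPDATE` form used here. The table layout and the choice of a composite `(apn, listing_id)` key are assumptions for the sketch.

```python
import sqlite3


def upsert_listings(conn: sqlite3.Connection, rows: list) -> None:
    """Insert new listings or update existing ones, keyed on (apn, listing_id)."""
    conn.executemany(
        """
        INSERT INTO properties (apn, listing_id, price)
        VALUES (:apn, :listing_id, :price)
        ON CONFLICT (apn, listing_id) DO UPDATE SET price = excluded.price
        """,
        rows,
    )
    conn.commit()


# In-memory stand-in for the Supabase table (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE properties ("
    " apn TEXT NOT NULL, listing_id TEXT NOT NULL, price INTEGER,"
    " PRIMARY KEY (apn, listing_id))"
)
batch = [{"apn": "123-456", "listing_id": "L1", "price": 500000}]
upsert_listings(conn, batch)
upsert_listings(conn, batch)  # replaying the same batch adds no rows
```

Because a replayed batch leaves the table unchanged, a retried or resumed overnight run does not need to track which rows it already wrote.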

Risk: Silent hang on subprocess stdin — spawn MCP and ETL children with piped stdio and watchdog timeouts. Rollback: disable webhooks; keep tables updating.
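For a one-shot ETL child, the watchdog could be as simple as the standard library's `subprocess.run` with a timeout and a closed stdin. This is a sketch under that assumption; `run_with_watchdog` is a hypothetical helper, and a long-lived MCP stdio server would instead keep its stdin pipe open and need a separate heartbeat check.

```python
import subprocess
import sys
from typing import List


def run_with_watchdog(cmd: List[str], timeout_s: float = 30.0) -> dict:
    """Run a child that cannot block on stdin and is killed if it overruns."""
    try:
        proc = subprocess.run(
            cmd,
            stdin=subprocess.DEVNULL,   # child never waits on an inherited terminal
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            text=True,
            timeout=timeout_s,          # child is killed, then TimeoutExpired is raised
            check=False,
        )
    except subprocess.TimeoutExpired:
        return {"status": "hung", "returncode": None, "stdout": ""}
    status = "ok" if proc.returncode == 0 else "failed"
    return {"status": status, "returncode": proc.returncode, "stdout": proc.stdout}


# A child that tries to read stdin returns immediately instead of hanging.
result = run_with_watchdog([sys.executable, "-c", "import sys; print(len(sys.stdin.read()))"])
```

A `hung` status is what the scheduler would alert on, separately from an ordinary non-zero exit.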

Phase 4 — Video & media pipeline (week 7–8)

Whisper batch transcription, fal.ai render queue MCP, FFmpeg concat worker on Runpod. Evaluator checks duration, resolution, and brand template compliance. Object storage URLs written to Supabase.
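The duration and resolution checks could run over metadata of the kind a probe tool such as ffprobe reports. The limits in `RenderSpec` and the metadata keys are hypothetical, and brand template compliance is left out because it depends on assets not described here.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class RenderSpec:
    # Hypothetical brand template limits.
    min_duration_s: float = 15.0
    max_duration_s: float = 90.0
    width: int = 1920
    height: int = 1080


def check_render(meta: Dict[str, float], spec: RenderSpec = RenderSpec()) -> List[str]:
    """Return a list of problems; an empty list means the render passes."""
    problems: List[str] = []
    if not spec.min_duration_s <= meta["duration_s"] <= spec.max_duration_s:
        problems.append("duration out of range")
    if (meta["width"], meta["height"]) != (spec.width, spec.height):
        problems.append("resolution mismatch")
    return problems
```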

Phase 5 — Operator console & clinic workflows (week 9–10)

Next.js internal app: job inbox, approve/reject, trace deep-link to Langfuse. Clinic graph with human-in-the-loop on any external-facing output. Audit table is append-only (no updates or deletes).

Risk: Regulated content — default deny publish without operator click. Rollback: draft-only mode across all publish MCP tools.

Phase 6 — Hardening & runbooks (week 11–12)

On-call runbook: stuck planner, runaway token spend, MCP OOM, ETL duplicate keys. Load test overnight queue. Playbook for adding a new brand tenant (RLS policy + Langfuse project + router budget line).

Reporting & ops

Signal	Source	Cadence
Agent traces, latency, token cost	Langfuse dashboards	Real-time; daily Slack rollup
Overnight job pass/fail rate	Supabase `job_runs`	Per run; weekly trend
ETL freshness (MLS/HUD)	Supabase `ingest_watermarks`	Alert if > SLA hours stale
Media queue depth	Runpod + `media_jobs`	Alert on queue > N or failure rate spike
Router fallback frequency	Router logs	Weekly — indicates primary model outages or cost cap hits
Evaluator rejection reasons	Langfuse scores + `critiques` JSON	Weekly engineering retro input
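The ETL freshness alert reduces to comparing each source's last-ingest watermark against its SLA. A minimal sketch, assuming watermarks are stored as timezone-aware timestamps keyed by source name; the function name and the SLA values are made up for the example.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional


def stale_sources(
    watermarks: Dict[str, datetime],
    sla_hours: Dict[str, float],
    now: Optional[datetime] = None,
) -> List[str]:
    """Return sources whose last ingest is older than their SLA, sorted by name."""
    now = now or datetime.now(timezone.utc)
    stale = []
    for source, last_ingest in watermarks.items():
        if now - last_ingest > timedelta(hours=sla_hours[source]):
            stale.append(source)
    return sorted(stale)


# Hypothetical watermarks and SLAs, checked at a fixed point in time.
check_time = datetime(2025, 1, 2, 12, tzinfo=timezone.utc)
marks = {
    "mls": datetime(2025, 1, 2, 9, tzinfo=timezone.utc),   # 3 hours old
    "hud": datetime(2025, 1, 1, 6, tzinfo=timezone.utc),   # 30 hours old
}
alerts = stale_sources(marks, {"mls": 6, "hud": 24}, now=check_time)
```

Passing `now` explicitly keeps the check deterministic in tests; in a scheduled job it would default to the current UTC time.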

Morning digest to ownership: completed overnight jobs, items awaiting approval, cost vs budget, and any dead-letter entries with one-click trace links. On-call rotation would use PagerDuty or Slack escalation only for SLA breaches (ETL stale, zero successful overnight runs, runaway spend).

Proposed deliverables

Following the phased plan, a build would ship these artifacts:

YAML-driven model router with orchestrator, coder, and evaluator roles plus cost caps and fallback rules
Claude Code skill pack: deploy, test, trace-replay, and brand-context switchers
MCP server monorepo (Supabase, MLS bridge, WordPress, media queue, clinic docs) with contract tests
LangGraph overnight graphs for real-estate ETL, video generation, and clinic document workflows
Supabase schema: multi-tenant RLS, job audit tables, ingest watermarks, operator approval queue
Next.js operator console with inbox, approvals, and Langfuse deep links
Runpod FFmpeg/Whisper worker images and fal.ai integration with evaluator rubrics
Security hooks (injection sanitizer, secret scanner) and runbooks for stuck agents and ETL failures

Effort estimate

Indicative effort for platform foundation through first production overnight graphs across two brand workflows (assumes MLS sandbox or sample feeds available, Supabase/Vercel/Runpod accounts provisioned):

Scope	Hours (range)
Platform foundation (phases 1–6)	280–360 hrs
Standing weekly engineering (post-foundation)	30–40 hrs/week ongoing
Platform maintenance (router tuning, MCP upgrades)	12–20 hrs/month

The ongoing weekly hours reflect the operating model: recurring briefs across brands, not a one-off handoff. Initial platform build is a one-time investment; subsequent briefs reuse the harness.

Glossary

Term	Meaning
Agent harness	Skills, hooks, MCP, subagents, and router config around the LLM — the durable product layer
Claude Code	Anthropic’s agentic coding environment with repo context and tool use
MCP	Model Context Protocol — standard for exposing tools and data sources to agents
LangGraph	Library for checkpointed multi-step agent workflows with branches and retries
Model router	Service that picks orchestrator, coder, and evaluator models from explicit rules
Lethal trifecta	Risk pattern: untrusted input + privileged tools + external communications without guards
RESO / RETS	Real-estate data standards and legacy MLS transport protocols
Dead-letter queue	Storage for jobs that exhausted retries — requires human inspection
Langfuse	Open-source LLM observability — traces, scores, prompt versioning, cost attribution