Autonomous Agent Orchestration Platform for Multi-Brand Operations
A proposed autonomous agent orchestration platform for a US-based operator running multiple brands — AI video production, real-estate technology, and regulated medical-compliance clinics. Instead of one-off gigs and copy-paste prompts, the architecture would treat the agent harness (routing, skills, MCP, hooks, observability) as the product, with Claude Code, Kimi, GLM, Minimax, and Qwen routed like an orchestra rather than competing chatbots.
Proposed outcome: One standing platform where overnight planner / generator / evaluator workflows, Supabase data pipelines, custom MCP servers, and operator dashboards share a model router, security layer, and observability stack — so weekly briefs ship without rebuilding the harness each time.
Scenario
This brief describes a proposed solution — not a delivered engagement. It maps a recurring pattern: a multi-brand operator that needs a long-term AI engineering function, not project-by-project contractors.
- Brands: AI video production (avatars, character animation, real-estate walkthroughs); AI-powered real-estate technology (MLS, HUD, public-record ingestion); network of medical compliance clinics (regulated workflows, operator tooling)
- Owner profile: Direct access to ownership, fast-moving weekly briefs, infrastructure budget (Supabase, Vercel, Cloudflare, Runpod), paid access to frontier and open coding models
- Operating model: 30–40 hrs/week standing capacity; harness and platform amortized across brands; lead engineer grows into team lead over time
- Scope: Multi-agent overnight jobs, data pipelines, internal dashboards, custom MCP when off-the-shelf tools miss, WordPress/Next.js surfaces, voice/video automation (Whisper, fal.ai, Seedance-class models)
Problem
Treating AI as “better autocomplete” does not scale across three regulated and media-heavy brands. Common failure modes:
- Single-shot prompts — no planner/evaluator loop; jobs stall silently overnight with no dead-letter or escalation
- Model chaos — engineers pick models ad hoc; no router rules for orchestrator brain vs primary coder vs fallback; cost and quality drift
- Siloed stacks — each brand reinvents Supabase schemas, MCP servers, and deploy paths; no shared observability or security baseline
- Contract churn — gig hunters finish a brief and leave; institutional knowledge walks out; the harness never compounds
- Regulated blind spots — medical compliance and public-record pipelines need prompt-injection guards, audit trails, and human-review gates; chat wrappers skip this
- Operator gap — non-technical staff across brands lack dashboards to trigger, monitor, and approve agent output
Requirements
Functional
- Model router — explicit rules: orchestrator brain, primary coder, fallback; task-type and cost/latency aware
- Overnight multi-agent workflows — planner → generator → evaluator graphs with retries, checkpoints, and morning digest
- Real-estate data pipeline — MLS feeds, HUD records, county/public sources → normalized Supabase tables with webhook triggers downstream
- Custom MCP servers — brand-specific tools (CRM, MLS bridge, clinic scheduling, media render queue) exposed uniformly to Claude Code and headless agents
- Internal operator UI — job queue, approval inbox, run replay, brand switcher
- Media automation — Whisper transcription, FFmpeg transforms, fal.ai / Seedance-class video generation in batch pipelines
- Web surfaces — WordPress and Next.js sites generated and updated through AI-native workflows with human publish gates
Non-functional
- Observability — trace every agent run (Langfuse or Helicone); token/cost attribution per brand and workflow
- Security — mitigate “lethal trifecta” (untrusted input + tool access + external comms); sandbox MCP tools; secrets in vault
- Reliability — idempotent ETL, dead-letter queues, alert on hung stdio bridges or stuck planner loops
- Maintainability — skills, slash commands, hooks, and router config in git; reproducible dev environment (Claude Code / Cursor parity)
- Compliance-ready — PHI-adjacent clinic flows get audit logs, PII redaction in traces, optional human-in-the-loop nodes
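The PII-redaction requirement above can be illustrated with a minimal sketch. The patterns and the `redact` helper are assumptions for illustration (plain email addresses, US-style SSN and phone formats); a real clinic flow would likely need a vetted redaction approach reviewed by compliance staff before any trace leaves the process.

```python
import re

# Hypothetical patterns: illustrative only, not an exhaustive PII catalogue.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}


def redact(text: str) -> str:
    """Replace each matched span with a bracketed label before tracing."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


if __name__ == "__main__":
    print(redact("Call 555-123-4567 or mail jane@clinic.example"))
```

Redaction of this kind would run on prompt and completion text before it is attached to a trace span, so the observability store never holds the raw values.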
Architecture
Four layers: a control plane (router, scheduler, observability), a harness (Claude Code + MCP + skills/hooks), brand workflows (LangGraph graphs per domain), and data + media (Supabase, object storage, Runpod workers).
Figure: Platform architecture (shared control plane and harness, brand-specific LangGraph workflows, Supabase + media workers)
Figure: Overnight sequence (router picks models per role; evaluator gates promotion to production)
Figure: Component map by platform layer (major services per tier)
End-to-end flow
Figure: From ownership brief to production (shared harness, brand workflow, human approval where required)
Figure: Illustrative model routing mix by agent role (% of routed calls in a typical week)
Figure: Indicative standing-engineering capacity split across brand domains (% of weekly hours)
Recommended stack
Recommendation: Claude Code (or Cursor-equivalent harness) as the daily driver; LangGraph for durable overnight graphs; Supabase for operational data and realtime operator UI; a YAML-driven model router with cost caps; Temporal or a cron scheduler (Supabase pg_cron or Cloudflare Workers cron) for schedules; Langfuse for traces.
| Layer | Technology | Why |
|---|---|---|
| Daily harness | Claude Code + skills/hooks/MCP | Subagents, repo-aware edits, repeatable slash commands — the “framework” layer |
| Model router | Custom router service + YAML rules | Explicit orchestrator / coder / fallback; routes Claude, Kimi K2, GLM, Minimax, Qwen by task type and budget |
| Overnight orchestration | LangGraph + Python 3.11 | Checkpointed planner/generator/evaluator graphs with human-in-the-loop nodes |
| Scheduling | Temporal Cloud or Supabase pg_cron | Reliable overnight runs, retries, visibility into stuck workflows |
| Data plane | Supabase (Postgres + Edge Functions) | MLS/HUD normalized schema, webhooks, Row Level Security per brand |
| MCP fleet | TypeScript + Python MCP servers | Uniform tool surface for harness and headless agents; versioned in monorepo |
| Media workers | Runpod / VPS + FFmpeg + fal.ai | GPU bursts for video; CPU workers for transcode and Whisper batch |
| Web | Next.js on Vercel + WordPress REST | Fast marketing surfaces; existing WP estates stay integrated via MCP |
| Observability | Langfuse (self-hosted or cloud) | Trace spans per agent, prompt/version tags, cost by brand |
| Secrets | Cloudflare Workers secrets / Doppler | Central rotation; no keys in agent prompts or repos |
Why not a single model everywhere? Orchestration benefits from a strong reasoning model; bulk codegen and ETL transforms can run on open models at lower cost; evaluators may use a different model to reduce self-confirmation bias. The router encodes these rules explicitly instead of “pick what feels best.”
Why not n8n-only? Multi-step agent QA, MCP tool auth, and checkpointed overnight graphs outgrow visual chains. Use n8n only for lightweight webhook fan-out (Slack, email digests).
Agent & component design
Model router — routing rules (example)
| Role | Default model | Fallback | Rule |
|---|---|---|---|
| Orchestrator brain | Claude Sonnet / Opus class | GPT-4.1 | Planning, decomposition, tool-selection — always highest reasoning tier under daily cost cap |
| Primary coder | Claude Code default | Kimi K2 or Qwen Coder | Repo edits and MCP tool loops; switch to open model when task tag is bulk_etl or token estimate > 80k |
| Evaluator | Different family than generator | GLM or Minimax | Structured rubric JSON; reject if generator and evaluator share same model ID |
| Embeddings / classify | Small open model | Hosted embed API | Router pre-step; never burn frontier tokens on routing labels |
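The routing rules in the table can be sketched as a small lookup function. This is a minimal sketch under stated assumptions: the model IDs, the `pick_model` name, and the 50 USD daily cap are hypothetical placeholders, and the evaluator defaults simply pick one of the two fallback families named in the table. In the proposed design these rules would live in YAML; a Python dict stands in here so the example stays self-contained.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RoleRule:
    default: str
    fallback: str


# Hypothetical model IDs standing in for the table's model classes.
RULES = {
    "orchestrator": RoleRule(default="claude-opus", fallback="gpt-4.1"),
    "coder": RoleRule(default="claude-code-default", fallback="kimi-k2"),
    "evaluator": RoleRule(default="glm", fallback="minimax"),
}
BULK_TOKEN_THRESHOLD = 80_000  # from the "token estimate > 80k" rule


def pick_model(
    role: str,
    task_tag: str = "",
    token_estimate: int = 0,
    spent_usd: float = 0.0,
    daily_cap_usd: float = 50.0,  # assumed cap, not from the source
    generator_model: Optional[str] = None,
) -> str:
    """Return the model ID for a role, applying the table's switch rules."""
    rule = RULES[role]
    if role == "orchestrator" and spent_usd >= daily_cap_usd:
        return rule.fallback  # cost cap reached: step down from the top tier
    if role == "coder" and (
        task_tag == "bulk_etl" or token_estimate > BULK_TOKEN_THRESHOLD
    ):
        return rule.fallback  # bulk work goes to the open model
    if role == "evaluator" and rule.default == generator_model:
        return rule.fallback  # never grade output with the same model ID
    return rule.default
```

For example, `pick_model("coder", task_tag="bulk_etl")` returns the open-model fallback, while a plain `pick_model("coder")` stays on the default. Logging each fallback decision would feed the "router fallback frequency" signal listed under reporting.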
1 — Planner agent
- Input: weekly brief JSON, backlog tables, prior run critiques
- Output: DAG of tasks with model hints, MCP tool list, acceptance criteria, ETA
- QA gate: plan must reference existing schemas/MCP versions — no greenfield table names without migration stub
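The planner's output and QA gate could look roughly like the sketch below. The `Task` fields, `validate_plan`, and `execution_order` are hypothetical names chosen for illustration; the only library behavior relied on is the standard `graphlib.TopologicalSorter`, which orders a dependency graph and raises `CycleError` if the plan contains a cycle.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter
from typing import List, Optional, Set


@dataclass
class Task:
    task_id: str
    model_hint: str
    tools: List[str]
    tables: List[str]
    depends_on: List[str] = field(default_factory=list)
    migration_stub: Optional[str] = None  # path to a stub, if a table is new


def validate_plan(
    tasks: List[Task], known_tables: Set[str], known_tools: Set[str]
) -> List[str]:
    """QA gate: flag unknown MCP tools and new tables lacking a migration stub."""
    errors = []
    for t in tasks:
        for tool in t.tools:
            if tool not in known_tools:
                errors.append(f"{t.task_id}: unknown MCP tool {tool!r}")
        for table in t.tables:
            if table not in known_tables and not t.migration_stub:
                errors.append(f"{t.task_id}: table {table!r} has no migration stub")
    return errors


def execution_order(tasks: List[Task]) -> List[str]:
    """Topologically sort task IDs so dependencies run first."""
    graph = {t.task_id: set(t.depends_on) for t in tasks}
    return list(TopologicalSorter(graph).static_order())
```

A plan would be rejected when `validate_plan` returns a non-empty list, before any generator call spends tokens on it.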
2 — Generator agent (domain variants)
- Real-estate: pull MLS/HUD deltas, map to Supabase, trigger property webhooks
- Video: script → voice (TTS, with Whisper for transcription) → fal.ai render → FFmpeg concat → storage URL
- Clinic ops: generate compliance checklists and operator docs; never auto-send external comms without approval node
- Web: Next.js/WordPress component drafts via MCP publish tools (draft-only by default)
3 — Evaluator agent
- Rubric scores: schema valid, test pass, diff size bounds, policy (no PII in logs, no prompt injection markers)
- On fail: structured critique back to planner with max retry budget; dead-letter after N attempts
- Emits Langfuse score events for weekly quality trends
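The fail path described above (structured critique, bounded retries, then dead-letter) can be sketched as a plain loop. This is a simplified illustration: `Verdict`, `run_with_retries`, and the result dictionary are hypothetical, and the critique is handed straight back to the generator callable rather than routed through a separate planner step.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Verdict:
    passed: bool
    critique: str = ""


def run_with_retries(
    generate: Callable[[List[str]], str],
    evaluate: Callable[[str], Verdict],
    max_attempts: int = 3,
) -> dict:
    """Generate, evaluate, and retry with accumulated critiques.

    Returns a 'promoted' result on the first pass, or a 'dead_letter'
    record holding every critique once the retry budget is spent.
    """
    critiques: List[str] = []
    for attempt in range(1, max_attempts + 1):
        output = generate(critiques)  # prior critiques inform the next draft
        verdict = evaluate(output)
        if verdict.passed:
            return {"status": "promoted", "attempts": attempt, "output": output}
        critiques.append(verdict.critique)
    return {"status": "dead_letter", "attempts": max_attempts, "critiques": critiques}
```

Keeping the critiques on the dead-letter record gives the human reviewer the full rejection history without replaying the run.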
4 — MCP server fleet (shared)
- mls-bridge — RESO/RETS or vendor API → staged rows
- supabase-ops — typed CRUD with RLS context per brand
- media-queue — enqueue Runpod jobs, poll status
- wordpress-mcp — draft posts, media upload, meta fields
- clinic-docs — template fill, PDF export, audit log write
5 — Security envelope
- Untrusted web content never flows directly into tools with write access — sanitize + allowlist domains
- MCP tools scoped per brand via JWT claims; tools are read-only by default
- Hooks block commits containing secrets; pre-tool-call hook strips injection patterns from fetched HTML
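A pre-tool-call hook of the kind listed above might combine a domain allowlist with pattern stripping, as in this minimal sketch. The domains, the two patterns, and the function names are assumptions for illustration. Pattern stripping alone is a weak defense against prompt injection, so it is shown here as one layer alongside the allowlist and read-only defaults, not as a complete guard.

```python
import re
from urllib.parse import urlparse

# Hypothetical allowlist; real entries would come from per-brand config.
ALLOWED_DOMAINS = {"hud.gov", "mls-vendor.example"}

# Illustrative patterns only; real injection attempts vary widely.
INJECTION_PATTERNS = [
    re.compile(r"ignore (?:all )?(?:previous|prior) instructions", re.IGNORECASE),
    re.compile(r"<script\b.*?</script>", re.IGNORECASE | re.DOTALL),
]


def is_allowed(url: str) -> bool:
    """True if the URL's host is an allowlisted domain or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in ALLOWED_DOMAINS)


def sanitize(html: str) -> str:
    """Replace known injection markers with a neutral placeholder."""
    for pattern in INJECTION_PATTERNS:
        html = pattern.sub("[removed]", html)
    return html


def prepare_for_tool(url: str, fetched_html: str) -> str:
    """Gate fetched content before it reaches any tool with write access."""
    if not is_allowed(url):
        raise PermissionError(f"domain not on allowlist: {url}")
    return sanitize(fetched_html)
```

The suffix check requires a leading dot, so a lookalike host such as `hud.gov.evil.example` does not pass as a subdomain of `hud.gov`.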
Figure: Suggested phase timeline (weeks), from platform foundation through the first overnight production graph
Implementation plan
Phase 1 — Harness & router foundation (week 1–2)
Monorepo layout: router/, mcp/, graphs/, skills/. Claude Code skills for deploy, test, and trace replay. YAML router with three roles and cost caps. Langfuse project per brand. Dev Supabase with RLS skeleton.
Risk: Model API variance — stub adapters early. Rollback: manual Claude Code sessions without overnight scheduler until router stable.
Phase 2 — MCP server fleet v1 (week 3–4)
Ship supabase-ops and wordpress-mcp; stub mls-bridge with sample feed. Document tool contracts in OpenAPI-style markdown. Integration tests that run headless against local Supabase.
Risk: MLS vendor access delays — use public HUD/sample RESO sandbox. Rollback: generators write to staging schema only.
Phase 3 — Real-estate ETL graph (week 5–6)
LangGraph overnight job: ingest → normalize → dedupe → webhook emit. Idempotent upserts on natural keys (APN, listing ID). Operator dashboard: last run, row counts, error samples.
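Idempotent upserts on a natural key can be illustrated as follows. This sketch uses Python's built-in `sqlite3` as a stand-in so it runs anywhere; the proposed data plane is Supabase Postgres, which supports a similar `INSERT ... ON CONFLICT ... DO UPDATE` form. The `listings` table and its columns are hypothetical.

```python
import sqlite3
from typing import Dict, List


def make_db() -> sqlite3.Connection:
    """In-memory stand-in for the normalized listings table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE listings ("
        " listing_id TEXT PRIMARY KEY,"  # natural key from the feed
        " apn TEXT NOT NULL,"
        " price INTEGER NOT NULL)"
    )
    return conn


def upsert_listings(conn: sqlite3.Connection, rows: List[Dict]) -> int:
    """Insert new rows and update existing ones; returns the table row count.

    Re-running the same batch leaves the table unchanged, which is what
    makes a retried overnight run safe.
    """
    conn.executemany(
        "INSERT INTO listings (listing_id, apn, price)"
        " VALUES (:listing_id, :apn, :price)"
        " ON CONFLICT(listing_id) DO UPDATE SET"
        " apn = excluded.apn, price = excluded.price",
        rows,
    )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM listings").fetchone()[0]
```

Because the key comes from the source feed and not from a generated ID, a job that crashes midway can simply be restarted from its last checkpoint without a separate dedupe pass for rows it already wrote.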
Risk: Silent hang on subprocess stdin — spawn MCP and ETL children with piped stdio and watchdog timeouts. Rollback: disable webhooks; keep tables updating.
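The watchdog mitigation can be sketched with the standard `subprocess` module. The `run_child` helper and its result dictionary are hypothetical; the behavior relied on is that `subprocess.run` with `input=` pipes stdin, and that on `timeout` it kills the child and raises `TimeoutExpired`.

```python
import subprocess
import sys
from typing import List


def run_child(cmd: List[str], payload: str = "", timeout_s: float = 30.0) -> dict:
    """Run a child with piped stdio and a hard timeout.

    Passing `input` closes the child's stdin after the payload is written,
    so a child waiting on stdin sees EOF and cannot block forever on it.
    """
    try:
        proc = subprocess.run(
            cmd,
            input=payload,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return {"status": "timed_out", "stdout": "", "returncode": None}
    status = "ok" if proc.returncode == 0 else "failed"
    return {"status": status, "stdout": proc.stdout, "returncode": proc.returncode}


if __name__ == "__main__":
    # A child that would hang for 10 seconds is cut off after half a second.
    hung = [sys.executable, "-c", "import time; time.sleep(10)"]
    print(run_child(hung, timeout_s=0.5)["status"])
```

A `timed_out` result would be the trigger for the alert on hung stdio bridges named in the reliability requirement.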
Phase 4 — Video & media pipeline (week 7–8)
Whisper batch transcription, fal.ai render queue MCP, FFmpeg concat worker on Runpod. Evaluator checks duration, resolution, and brand template compliance. Object storage URLs written to Supabase.
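The evaluator checks named above could be expressed as a small rubric function. This is a sketch under assumptions: `RenderSpec`, the metadata keys, and the failure codes are hypothetical, and the metadata dictionary is assumed to have been extracted from the rendered file by an earlier step.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class RenderSpec:
    min_duration_s: float
    max_duration_s: float
    width: int
    height: int
    template_id: str


def check_render(meta: Dict, spec: RenderSpec) -> List[str]:
    """Return a list of failure codes; an empty list means the render passes."""
    failures = []
    duration = meta.get("duration_s", 0.0)
    if not spec.min_duration_s <= duration <= spec.max_duration_s:
        failures.append("duration")
    if (meta.get("width"), meta.get("height")) != (spec.width, spec.height):
        failures.append("resolution")
    if meta.get("template_id") != spec.template_id:
        failures.append("template")
    return failures
```

Returning codes instead of a single boolean lets the rejection reasons be aggregated for the weekly retro.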
Phase 5 — Operator console & clinic workflows (week 9–10)
Next.js internal app: job inbox, approve/reject, trace deep-link to Langfuse. Clinic graph with human-in-the-loop on any external-facing output. Audit table is append-only and immutable.
Risk: Regulated content — default deny publish without operator click. Rollback: draft-only mode across all publish MCP tools.
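One way to enforce an append-only audit table at the database level is with triggers that abort any update or delete. The sketch below uses the built-in `sqlite3` module as a runnable stand-in; the table layout and helper names are hypothetical, and in Supabase Postgres the equivalent would more likely be handled with restricted grants, RLS policies, or Postgres triggers.

```python
import sqlite3

SCHEMA = """
CREATE TABLE audit_log (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    brand TEXT NOT NULL,
    actor TEXT NOT NULL,
    action TEXT NOT NULL,
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TRIGGER audit_no_update BEFORE UPDATE ON audit_log
BEGIN SELECT RAISE(ABORT, 'audit_log is append-only'); END;
CREATE TRIGGER audit_no_delete BEFORE DELETE ON audit_log
BEGIN SELECT RAISE(ABORT, 'audit_log is append-only'); END;
"""


def open_audit_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def record(conn: sqlite3.Connection, brand: str, actor: str, action: str) -> int:
    """Append one audit row and return its ID."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT INTO audit_log (brand, actor, action) VALUES (?, ?, ?)",
            (brand, actor, action),
        )
    return cur.lastrowid


def try_tamper(conn: sqlite3.Connection, sql: str) -> bool:
    """Attempt a mutating statement; returns False if the triggers block it."""
    try:
        with conn:
            conn.execute(sql)
    except sqlite3.DatabaseError:
        return False
    return True
```

Every approve or reject click in the operator console would call something like `record`, so the approval history survives even if application code is later changed.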
Phase 6 — Hardening & runbooks (week 11–12)
On-call runbook: stuck planner, runaway token spend, MCP OOM, ETL duplicate keys. Load test overnight queue. Playbook for adding a new brand tenant (RLS policy + Langfuse project + router budget line).
Reporting & ops
| Signal | Source | Cadence |
|---|---|---|
| Agent traces, latency, token cost | Langfuse dashboards | Real-time; daily Slack rollup |
| Overnight job pass/fail rate | Supabase job_runs | Per run; weekly trend |
| ETL freshness (MLS/HUD) | Supabase ingest_watermarks | Alert if > SLA hours stale |
| Media queue depth | Runpod + media_jobs | Alert on queue > N or failure rate spike |
| Router fallback frequency | Router logs | Weekly — indicates primary model outages or cost cap hits |
| Evaluator rejection reasons | Langfuse scores + critiques JSON | Weekly engineering retro input |
Morning digest to ownership: completed overnight jobs, items awaiting approval, cost vs budget, and any dead-letter entries with one-click trace links. On-call rotation would use PagerDuty or Slack escalation only for SLA breaches (ETL stale, zero successful overnight runs, runaway spend).
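The ETL freshness signal in the table above reduces to comparing each source's watermark against an SLA window. A minimal sketch follows; `stale_sources` and the shape of the watermark mapping are assumptions, standing in for rows read from the `ingest_watermarks` table.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional


def stale_sources(
    watermarks: Dict[str, datetime],
    sla_hours: float,
    now: Optional[datetime] = None,
) -> List[str]:
    """Return source names whose last successful ingest is older than the SLA.

    `watermarks` maps a source name to the timezone-aware UTC timestamp
    of its most recent successful ingest.
    """
    now = now or datetime.now(timezone.utc)
    limit = timedelta(hours=sla_hours)
    return sorted(name for name, ts in watermarks.items() if now - ts > limit)


if __name__ == "__main__":
    now = datetime(2025, 1, 2, 6, 0, tzinfo=timezone.utc)
    marks = {"mls": now - timedelta(hours=2), "hud": now - timedelta(hours=30)}
    print(stale_sources(marks, sla_hours=24, now=now))
```

A non-empty result would both raise the SLA alert and appear as a line item in the morning digest.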
Proposed deliverables
Following the phased plan, a build would ship these artifacts:
- YAML-driven model router with orchestrator, coder, and evaluator roles plus cost caps and fallback rules
- Claude Code skill pack: deploy, test, trace-replay, and brand-context switchers
- MCP server monorepo (Supabase, MLS bridge, WordPress, media queue, clinic docs) with contract tests
- LangGraph overnight graphs for real-estate ETL, video generation, and clinic document workflows
- Supabase schema: multi-tenant RLS, job audit tables, ingest watermarks, operator approval queue
- Next.js operator console with inbox, approvals, and Langfuse deep links
- Runpod FFmpeg/Whisper worker images and fal.ai integration with evaluator rubrics
- Security hooks (injection sanitizer, secret scanner) and runbooks for stuck agents and ETL failures
Effort estimate
Indicative effort for platform foundation through first production overnight graphs across two brand workflows (assumes MLS sandbox or sample feeds available, Supabase/Vercel/Runpod accounts provisioned):
| Scope | Hours (range) |
|---|---|
| Platform foundation (phases 1–6) | 280–360 hrs |
| Standing weekly engineering (post-foundation) | 30–40 hrs/week ongoing |
| Platform maintenance (router tuning, MCP upgrades) | 12–20 hrs/month |
The ongoing weekly hours reflect the operating model: recurring briefs across brands, not a one-off handoff. Initial platform build is a one-time investment; subsequent briefs reuse the harness.
Glossary
| Term | Meaning |
|---|---|
| Agent harness | Skills, hooks, MCP, subagents, and router config around the LLM — the durable product layer |
| Claude Code | Anthropic’s agentic coding environment with repo context and tool use |
| MCP | Model Context Protocol — standard for exposing tools and data sources to agents |
| LangGraph | Library for checkpointed multi-step agent workflows with branches and retries |
| Model router | Service that picks orchestrator, coder, and evaluator models from explicit rules |
| Lethal trifecta | Risk pattern: untrusted input + privileged tools + external communications without guards |
| RESO / RETS | RESO: the organization behind current real-estate data standards (Data Dictionary, Web API); RETS: the legacy MLS transport protocol those standards replace |
| Dead-letter queue | Storage for jobs that exhausted retries — requires human inspection |
| Langfuse | Open-source LLM observability — traces, scores, prompt versioning, cost attribution |