Marketing Attribution Pipeline — S3, BigQuery, and Looker Studio
A proposed marketing attribution pipeline extension for a team that already moves lead and case-intake data from Lead Docket through AWS S3 into BigQuery, surfaced in Looker Studio via a scheduled Cloud Function refresh. The build would add daily Google Ads and Meta campaign ingestion, a matching layer tying each lead to its driving campaign (gclid, UTMs, source mapping), and pre-aggregated BigQuery views at campaign and ad-group grain — wired into the existing Looker dashboard with consistent styling.
Proposed outcome: One law-firm client dashboard showing which campaigns drive leads, cost per lead, lead quality, and signed/referred case breakdown by campaign — with unmatched leads handled explicitly, not dropped.
Scenario
This brief describes a proposed solution — not a delivered engagement. It maps a recurring pattern in legal marketing analytics: intake data already pipelines to BigQuery, but ad spend lives in separate silos.
- Organization: Analytics or marketing ops team serving law-firm clients; attribution strategy, KPI definitions, and matching logic already scoped internally
- Existing pipeline: Lead Docket → AWS S3 → BigQuery → Looker Studio (Cloud Function triggered by Cloud Scheduler)
- Extension scope: Google Ads + Meta spend/clicks/impressions joined to intake data for one client initially, reusable for additional clients
- Owner profile: Contract data engineer owning the build against a provided spec — strong SQL, ETL, and Looker experience
- Attribution inputs: gclid for Google; UTMs + client-provided source-to-campaign mapping for both platforms
- Downstream consumers: Client stakeholders reviewing campaign ROI, cost per lead, and signed/referred case rates in Looker Studio
Problem
The intake pipeline works — leads land in BigQuery and appear in Looker. But marketing spend data lives in Google Ads and Meta Ads Manager, so the team cannot answer basic attribution questions without manual exports and spreadsheet joins.
- No spend-to-lead join — campaign cost and lead volume require manual reconciliation across platforms each week
- Fragmented click identifiers — gclid on Google leads, fbclid on Meta, UTMs inconsistently captured in Lead Docket intake forms
- Lead quality invisible at campaign grain — signed cases and referred outcomes exist in intake data but are not tied back to ad spend
- Existing pipeline undocumented — extending S3 → BigQuery without an audit risks breaking the intake refresh the client already depends on
- Looker performance — blending raw ad and lead tables at query time is slow; pre-aggregated marts at campaign grain are required
- Unmatched leads — not every lead carries a clean gclid or UTM; dropping them inflates ROI; including them requires an explicit unmatched bucket with visible match rate
Requirements
Functional
- Pipeline audit — document existing S3 → BigQuery → Looker flow, schemas, schedules, and Cloud Function logic before any changes
- Google Ads ingestion — daily load of spend, clicks, impressions, campaign/ad-group metadata, click IDs where available
- Meta ingestion — daily load of equivalent Meta campaign metrics via Marketing API or connector
- Attribution matching layer — join leads to campaigns via gclid (Google), UTMs + source mapping (both platforms); graceful unmatched handling
- BigQuery modeling — clean views at campaign and ad-group grain; pre-aggregated for Looker speed
- Lead quality metrics — signed cases, referred cases, and quality flags broken down by campaign per scoped KPI definitions
- Looker Studio extension — new pages on existing dashboard matching styling and naming conventions
- Documentation & handoff — pipeline diagram, schema docs, runbook for daily ops
Non-functional
- Consistency — ad data lands on the same daily schedule and naming pattern as existing intake loads
- Idempotency — daily jobs safe to re-run without duplicate rows (MERGE upsert pattern)
- Observability — row counts, match rates, and ingestion failures logged and alertable
- Cost control — partitioned BigQuery tables, clustered on date and campaign ID
- Extensibility — client-agnostic attribution SQL; client-specific rules in config mapping tables
Architecture
Three layers: ingestion (existing S3 intake unchanged + new ad platform loads), attribution warehouse (matching + marts in BigQuery), and reporting (Looker Studio on pre-aggregated views).
Pipeline architecture — existing S3 intake path unchanged; new ad platform loads join through an attribution matching layer into pre-aggregated marts for Looker
Attribution matching — priority cascade from gclid exact match through UTM mapping to explicit unmatched bucket; no silent drops
Component map by pipeline tier (major services per layer)
End-to-end flow
Daily cycle — ad spend and intake leads converge in BigQuery marts before surfacing in Looker Studio
Indicative lead match method distribution after attribution layer (typical legal lead-gen mix)
Typical daily pipeline duration by stage — median vs p95 (illustrative)
Recommended stack
Recommendation: keep the existing S3 → Cloud Function → BigQuery intake path untouched; add Google Ads BigQuery Data Transfer (or API) and a Meta Marketing API Cloud Function; model attribution in BigQuery SQL views with config mapping tables; connect Looker Studio to pre-aggregated marts only.
| Layer | Technology | Why |
|---|---|---|
| Intake landing | AWS S3 (existing) | Lead Docket export path already proven; no migration needed |
| Intake load | Cloud Function + Cloud Scheduler (existing) | Matches current refresh pattern; audit before extending |
| Google Ads | BigQuery Data Transfer Service | Zero-maintenance daily campaign metrics; supplement with API for click-level detail if needed |
| Meta Ads | Marketing API → Cloud Function → BigQuery | No native BQ transfer for Meta; scheduled pull with pagination and rate-limit handling |
| Warehouse | BigQuery | Already in use; partitioned raw tables, view-based marts, scheduled query refresh |
| Attribution logic | BigQuery SQL + mapping tables | Transparent, version-controlled joins; client updates source map without redeploying SQL |
| Reporting | Looker Studio (existing dashboard) | Client already uses it; extend with new pages on mart views, not raw blends |
| Orchestration | Cloud Scheduler | Daily cadence after ad platform close; chain: ingest → match → mart refresh |
| Monitoring | Cloud Logging + optional Slack webhook | Row-count checks, match-rate alerts, zero-row ingestion failures |
Why mapping tables over hardcoded CASE statements? Client source-to-campaign rules will change as they add landing pages and rename campaigns. A dim_source_campaign_map table lets ops update mappings in a spreadsheet import without touching SQL.
Why pre-aggregated marts? Looker Studio blends on wide raw tables are slow and error-prone. mart_campaign_daily at the correct grain gives sub-second dashboard loads and prevents accidental many-to-many spend inflation.
Attribution matching logic (simplified SQL pattern for fact_lead_attribution):
SELECT
l.lead_id,
l.created_at,
COALESCE(g.campaign_id, u.campaign_id, s.campaign_id) AS campaign_id,
CASE
WHEN g.campaign_id IS NOT NULL THEN 'gclid'
WHEN u.campaign_id IS NOT NULL THEN 'utm_source_map'
WHEN s.campaign_id IS NOT NULL THEN 'source_fallback'
ELSE 'unmatched'
END AS match_method
FROM fact_leads l
LEFT JOIN dim_gclid_campaign g ON l.gclid = g.gclid
LEFT JOIN dim_source_campaign_map u
ON l.utm_source = u.source AND l.utm_campaign = u.utm_campaign
LEFT JOIN dim_source_campaign_map s
ON l.utm_source = s.source AND s.utm_campaign IS NULL
WHERE l.client_id = @client_id
Component design
1 — Pipeline auditor & documenter
- Input: existing S3 buckets, Cloud Function source, BigQuery datasets, Looker data sources
- Output: architecture diagram, schema catalog, schedule matrix, dependency map
- QA gate: stakeholder sign-off before any schema changes to intake tables
2 — Google Ads ingestion job
- Input: Google Ads account credentials, daily incremental date range
- Output:
raw_google_adspartitioned table — campaign_id, ad_group_id, date, spend, clicks, impressions - Schedule: daily after midnight account timezone; idempotent MERGE on date + campaign + ad_group
3 — Meta ingestion job
- Input: Meta Marketing API token, ad account ID
- Output:
raw_meta_adswith parallel schema to Google for downstream union - Schedule: same daily window; exponential backoff on rate limits; batch by campaign for large accounts
4 — Attribution matching engine
- Input:
fact_leads,raw_google_ads,raw_meta_ads,dim_source_campaign_map - Matching priority: (1) gclid exact match, (2) UTM campaign + source via mapping table, (3) source-only fallback, (4) explicit
unmatchedbucket - Output:
fact_lead_attributionwithmatch_method,match_confidence,campaign_id,platform - Edge cases: duplicate gclids, expired click IDs, partial UTMs — each gets explicit match_method, never silent drop
5 — Campaign mart views
- Grain: campaign-day and ad-group-day pre-aggregates
- Metrics: spend, clicks, impressions, leads, cost per lead, signed cases, referred cases, match rate
- Output:
mart_campaign_daily,mart_ad_group_daily— Looker connects here only
6 — Looker Studio layer
| Page | Metrics | Source |
|---|---|---|
| Campaign ROI | Spend, leads, CPL, impressions, CTR | mart_campaign_daily |
| Lead quality | Signed cases, referred cases, conversion rate by campaign | mart + intake join |
| Ad group detail | Same metrics at ad-group grain | mart_ad_group_daily |
| Unmatched leads | Leads with no campaign match; source/UTM breakdown | fact_lead_attribution |
| Match rate trend | Daily % gclid / UTM / unmatched over time | fact_lead_attribution |
Indicative build effort distribution by phase (% of total hours)
Implementation plan
Phase 1 — Audit & baseline (week 1)
Document existing S3 → BigQuery → Looker pipeline. Export current schemas, review Cloud Function code, inventory Looker data sources. Establish dev/staging datasets separate from production intake tables.
Risk: Undocumented Cloud Function edge cases — pair with whoever built the intake load in week 1. Rollback: read-only audit; no production changes.
Phase 2 — Ad platform ingestion (week 2–3)
Configure Google Ads BigQuery Data Transfer (or API job) into raw_google_ads. Build Meta Marketing API Cloud Function into raw_meta_ads. Wire Cloud Scheduler, row-count monitoring, 7-day backfill validation against platform UI totals.
Risk: Meta API pagination on large accounts — batch by campaign ID. Rollback: disable new schedulers; raw ad tables isolated from intake pipeline.
Phase 3 — Attribution matching layer (week 4–5)
Load client source-to-campaign mapping into dim_source_campaign_map. Build fact_lead_attribution with match priority logic. Publish match-rate KPI dashboard (% gclid, % UTM, % source fallback, % unmatched).
Risk: Low gclid capture rate in Lead Docket — quantify before finalizing match priority; UTM fallback may become primary path. Rollback: attribution views only; intake tables untouched.
Phase 4 — BigQuery marts (week 6)
Build pre-aggregated mart_campaign_daily and mart_ad_group_daily. Partition by date, cluster by campaign_id. Scheduled query refresh chained after daily ingestion completes. Grain tests to prevent spend inflation from many-to-many joins.
Risk: Cartesian join inflating spend — enforce unique grain assertions in scheduled test query. Rollback: drop mart views; raw and attribution tables remain queryable.
Phase 5 — Looker Studio & handoff (week 7–8)
Extend existing dashboard with campaign ROI, cost per lead, signed/referred breakdown, and unmatched leads pages — matching styling and naming conventions. Documentation, runbook, recorded walkthrough for internal team.
Risk: Looker blend row limits on wide joins — connect only to mart views, never raw blends. Rollback: hide new pages until client validates numbers against platform UI.
Reporting & ops
| Signal | Source | Cadence |
|---|---|---|
| Ingestion row counts (Google, Meta, intake) | Cloud Function logs | Daily; alert on zero rows |
| Attribution match rate | fact_lead_attribution | Daily Looker tile; weekly trend review |
| Unmatched lead volume | match_method = 'unmatched' | Weekly — indicates UTM capture gaps |
| Cost per lead by campaign | mart_campaign_daily | Real-time in Looker |
| Signed case rate by campaign | mart + intake quality flags | Weekly client report |
| Spend reconciliation vs platform UI | raw tables vs Ads Manager totals | Weekly during first month; monthly after |
| Pipeline job failures | Cloud Scheduler + error reporting | Immediate Slack alert |
Ops cadence would include a weekly 15-minute pipeline check (row counts, match rate, unmatched trend) and a monthly review with the client stakeholder (campaign ROI, signed case rates, mapping table updates needed). On-call alerts only for zero-row ingestion or mart refresh failure — not per-lead events.
Proposed deliverables
Following the phased plan, a build would ship these artifacts:
- Pipeline audit document with architecture diagram, schema catalog, and schedule matrix
- Daily Google Ads ingestion into BigQuery (Data Transfer or API)
- Daily Meta Ads ingestion Cloud Function with rate-limit handling
dim_source_campaign_mapconfig table and spreadsheet loaderfact_lead_attributionmatching layer with explicit unmatched bucket and match_method columnmart_campaign_dailyandmart_ad_group_dailypre-aggregated views- Looker Studio pages on existing dashboard — campaign ROI, lead quality, unmatched report, match rate trend
- Runbook: daily ops, re-run failed jobs, update source mappings, onboard additional client
- Recorded handoff walkthrough for internal team
Effort estimate
Indicative effort for phases 1–5 (assumes access to existing pipeline, Google/Meta API credentials, and client mapping spreadsheet):
| Scope | Hours (range) |
|---|---|
| Phases 1–5 (full build) | 120–180 hrs |
| Ongoing maintenance (mapping updates, API changes) | 4–8 hrs/month |
| Additional law-firm client (reuse marts + mapping) | 20–40 hrs each |
At 15–25 hrs/week, delivery would land in roughly 6–10 weeks. Milestone-based pricing aligned to phases 1–5 de-risks the engagement — each phase ships a testable increment (ingestion validated before attribution, attribution validated before Looker pages go live).
Glossary
| Term | Meaning |
|---|---|
| gclid | Google Click Identifier — ties a lead session to a specific Google Ads click |
| fbclid | Meta Click Identifier — equivalent click token for Facebook/Instagram ads |
| UTM parameters | Tracking tags (utm_source, utm_campaign, utm_medium) appended to landing page URLs |
| Lead Docket | Legal intake platform capturing lead and case data exported to S3 |
| Attribution matching | Logic linking a lead record to the ad campaign that drove it |
| Match rate | Percentage of leads successfully tied to a campaign vs the unmatched bucket |
| Mart | Pre-aggregated BigQuery view at reporting grain (campaign-day, ad-group-day) |
| MERGE | BigQuery upsert — insert new rows, update existing on composite key match |
| CPL | Cost per lead — campaign spend divided by attributed lead count |
| Looker Studio | Google's BI dashboard tool; connects to BigQuery views for client-facing reports |