Marketing Attribution Pipeline — S3, BigQuery, and Looker Studio

By Amar Kumar

A proposed marketing attribution pipeline extension for a team that already moves lead and case-intake data from Lead Docket through AWS S3 into BigQuery, surfaced in Looker Studio via a scheduled Cloud Function refresh. The build would add daily Google Ads and Meta campaign ingestion, a matching layer tying each lead to its driving campaign (gclid, UTMs, source mapping), and pre-aggregated BigQuery views at campaign and ad-group grain — wired into the existing Looker dashboard with consistent styling.

Proposed outcome: One law-firm client dashboard showing which campaigns drive leads, cost per lead, lead quality, and signed/referred case breakdown by campaign — with unmatched leads handled explicitly, not dropped.

Scenario

This brief describes a proposed solution — not a delivered engagement. It maps a recurring pattern in legal marketing analytics: intake data already pipelines to BigQuery, but ad spend lives in separate silos.

Organization: Analytics or marketing ops team serving law-firm clients; attribution strategy, KPI definitions, and matching logic already scoped internally
Existing pipeline: Lead Docket → AWS S3 → BigQuery → Looker Studio (Cloud Function triggered by Cloud Scheduler)
Extension scope: Google Ads + Meta spend/clicks/impressions joined to intake data for one client initially, reusable for additional clients
Owner profile: Contract data engineer owning the build against a provided spec — strong SQL, ETL, and Looker experience
Attribution inputs: gclid for Google; UTMs + client-provided source-to-campaign mapping for both platforms
Downstream consumers: Client stakeholders reviewing campaign ROI, cost per lead, and signed/referred case rates in Looker Studio

Problem

The intake pipeline works — leads land in BigQuery and appear in Looker. But marketing spend data lives in Google Ads and Meta Ads Manager, so the team cannot answer basic attribution questions without manual exports and spreadsheet joins.

No spend-to-lead join — campaign cost and lead volume require manual reconciliation across platforms each week
Fragmented click identifiers — gclid on Google leads, fbclid on Meta, UTMs inconsistently captured in Lead Docket intake forms
Lead quality invisible at campaign grain — signed cases and referred outcomes exist in intake data but are not tied back to ad spend
Existing pipeline undocumented — extending S3 → BigQuery without an audit risks breaking the intake refresh the client already depends on
Looker performance — blending raw ad and lead tables at query time is slow; pre-aggregated marts at campaign grain are required
Unmatched leads — not every lead carries a clean gclid or UTM; dropping them inflates ROI; including them requires an explicit unmatched bucket with visible match rate

Requirements

Functional

Pipeline audit — document existing S3 → BigQuery → Looker flow, schemas, schedules, and Cloud Function logic before any changes
Google Ads ingestion — daily load of spend, clicks, impressions, campaign/ad-group metadata, click IDs where available
Meta ingestion — daily load of equivalent Meta campaign metrics via Marketing API or connector
Attribution matching layer — join leads to campaigns via gclid (Google), UTMs + source mapping (both platforms); graceful unmatched handling
BigQuery modeling — clean views at campaign and ad-group grain; pre-aggregated for Looker speed
Lead quality metrics — signed cases, referred cases, and quality flags broken down by campaign per scoped KPI definitions
Looker Studio extension — new pages on existing dashboard matching styling and naming conventions
Documentation & handoff — pipeline diagram, schema docs, runbook for daily ops

Non-functional

Consistency — ad data lands on the same daily schedule and naming pattern as existing intake loads
Idempotency — daily jobs safe to re-run without duplicate rows (MERGE upsert pattern)
Observability — row counts, match rates, and ingestion failures logged and alertable
Cost control — partitioned BigQuery tables, clustered on date and campaign ID
Extensibility — client-agnostic attribution SQL; client-specific rules in config mapping tables

Architecture

Three layers: ingestion (existing S3 intake unchanged + new ad platform loads), attribution warehouse (matching + marts in BigQuery), and reporting (Looker Studio on pre-aggregated views).

flowchart TB classDef ingest fill:#ede9fe,stroke:#7c3aed,color:#5b21b6 classDef bq fill:#dbeafe,stroke:#2563eb,color:#1e3a8a classDef report fill:#ccfbf1,stroke:#0d9488,color:#115e59 classDef ext fill:#f8fafc,stroke:#475569,color:#334155 LD["Lead Docket\nintake platform"]:::ext S3["AWS S3\nlead exports"]:::ingest GADS["Google Ads\nAPI / BQ transfer"]:::ext META["Meta Marketing API"]:::ext CF1["Cloud Function\nintake load — existing"]:::ingest CF2["Cloud Function\nMeta daily pull"]:::ingest SCHED["Cloud Scheduler\ndaily trigger"]:::ingest RAW["raw_leads · raw_google_ads\nraw_meta_ads"]:::bq MATCH["fact_lead_attribution\ngclid · UTM · source map"]:::bq MART["mart_campaign_daily\nmart_ad_group_daily"]:::bq MAP["dim_source_campaign_map\nclient config"]:::bq LOOKER["Looker Studio\ncampaign ROI dashboard"]:::report LD --> S3 S3 --> CF1 SCHED --> CF1 SCHED --> CF2 GADS --> RAW CF2 --> META META --> RAW CF1 --> RAW RAW --> MATCH MAP --> MATCH MATCH --> MART MART --> LOOKER

Pipeline architecture — existing S3 intake path unchanged; new ad platform loads join through an attribution matching layer into pre-aggregated marts for Looker

sequenceDiagram autonumber participant Lead as fact_leads participant Match as Attribution SQL participant Gclid as gclid lookup participant UTM as UTM + source map participant Out as fact_lead_attribution participant Mart as mart_campaign_daily Lead->>Match: new lead row Match->>Gclid: gclid present? alt gclid match Gclid->>Out: platform=google, match_method=gclid else UTM + source match Match->>UTM: utm_campaign + utm_source UTM->>Out: match_method=utm_source_map else source-only fallback UTM->>Out: match_method=source_fallback else no match Match->>Out: match_method=unmatched end Out->>Mart: aggregate by campaign + date

Attribution matching — priority cascade from gclid exact match through UTM mapping to explicit unmatched bucket; no silent drops

Component map by pipeline tier (major services per layer)

End-to-end flow

Ingest ads → Match leads → Build marts → Looker views → Client review

Daily cycle — ad spend and intake leads converge in BigQuery marts before surfacing in Looker Studio

Indicative lead match method distribution after attribution layer (typical legal lead-gen mix)

Typical daily pipeline duration by stage — median vs p95 (illustrative)

Recommended stack

Recommendation: keep the existing S3 → Cloud Function → BigQuery intake path untouched; add Google Ads BigQuery Data Transfer (or API) and a Meta Marketing API Cloud Function; model attribution in BigQuery SQL views with config mapping tables; connect Looker Studio to pre-aggregated marts only.

Layer	Technology	Why
Intake landing	AWS S3 (existing)	Lead Docket export path already proven; no migration needed
Intake load	Cloud Function + Cloud Scheduler (existing)	Matches current refresh pattern; audit before extending
Google Ads	BigQuery Data Transfer Service	Zero-maintenance daily campaign metrics; supplement with API for click-level detail if needed
Meta Ads	Marketing API → Cloud Function → BigQuery	No native BQ transfer for Meta; scheduled pull with pagination and rate-limit handling
Warehouse	BigQuery	Already in use; partitioned raw tables, view-based marts, scheduled query refresh
Attribution logic	BigQuery SQL + mapping tables	Transparent, version-controlled joins; client updates source map without redeploying SQL
Reporting	Looker Studio (existing dashboard)	Client already uses it; extend with new pages on mart views, not raw blends
Orchestration	Cloud Scheduler	Daily cadence after ad platform close; chain: ingest → match → mart refresh
Monitoring	Cloud Logging + optional Slack webhook	Row-count checks, match-rate alerts, zero-row ingestion failures

Why mapping tables over hardcoded CASE statements? Client source-to-campaign rules will change as they add landing pages and rename campaigns. A dim_source_campaign_map table lets ops update mappings in a spreadsheet import without touching SQL.

Why pre-aggregated marts? Looker Studio blends on wide raw tables are slow and error-prone. mart_campaign_daily at the correct grain gives sub-second dashboard loads and prevents accidental many-to-many spend inflation.

Attribution matching logic (simplified SQL pattern for fact_lead_attribution):

SELECT
  l.lead_id,
  l.created_at,
  COALESCE(g.campaign_id, u.campaign_id, s.campaign_id) AS campaign_id,
  CASE
    WHEN g.campaign_id IS NOT NULL THEN 'gclid'
    WHEN u.campaign_id IS NOT NULL THEN 'utm_source_map'
    WHEN s.campaign_id IS NOT NULL THEN 'source_fallback'
    ELSE 'unmatched'
  END AS match_method
FROM fact_leads l
LEFT JOIN dim_gclid_campaign g ON l.gclid = g.gclid
LEFT JOIN dim_source_campaign_map u
  ON l.utm_source = u.source AND l.utm_campaign = u.utm_campaign
LEFT JOIN dim_source_campaign_map s
  ON l.utm_source = s.source AND s.utm_campaign IS NULL
WHERE l.client_id = @client_id

Component design

1 — Pipeline auditor & documenter

Input: existing S3 buckets, Cloud Function source, BigQuery datasets, Looker data sources
Output: architecture diagram, schema catalog, schedule matrix, dependency map
QA gate: stakeholder sign-off before any schema changes to intake tables

2 — Google Ads ingestion job

Input: Google Ads account credentials, daily incremental date range
Output: raw_google_ads partitioned table — campaign_id, ad_group_id, date, spend, clicks, impressions
Schedule: daily after midnight account timezone; idempotent MERGE on date + campaign + ad_group

3 — Meta ingestion job

Input: Meta Marketing API token, ad account ID
Output: raw_meta_ads with parallel schema to Google for downstream union
Schedule: same daily window; exponential backoff on rate limits; batch by campaign for large accounts

4 — Attribution matching engine

Input: fact_leads, raw_google_ads, raw_meta_ads, dim_source_campaign_map
Matching priority: (1) gclid exact match, (2) UTM campaign + source via mapping table, (3) source-only fallback, (4) explicit unmatched bucket
Output: fact_lead_attribution with match_method, match_confidence, campaign_id, platform
Edge cases: duplicate gclids, expired click IDs, partial UTMs — each gets explicit match_method, never silent drop

5 — Campaign mart views

Grain: campaign-day and ad-group-day pre-aggregates
Metrics: spend, clicks, impressions, leads, cost per lead, signed cases, referred cases, match rate
Output: mart_campaign_daily, mart_ad_group_daily — Looker connects here only

6 — Looker Studio layer

Page	Metrics	Source
Campaign ROI	Spend, leads, CPL, impressions, CTR	`mart_campaign_daily`
Lead quality	Signed cases, referred cases, conversion rate by campaign	mart + intake join
Ad group detail	Same metrics at ad-group grain	`mart_ad_group_daily`
Unmatched leads	Leads with no campaign match; source/UTM breakdown	`fact_lead_attribution`
Match rate trend	Daily % gclid / UTM / unmatched over time	`fact_lead_attribution`

Indicative build effort distribution by phase (% of total hours)

Implementation plan

Phase 1 — Audit & baseline (week 1)

Document existing S3 → BigQuery → Looker pipeline. Export current schemas, review Cloud Function code, inventory Looker data sources. Establish dev/staging datasets separate from production intake tables.

Risk: Undocumented Cloud Function edge cases — pair with whoever built the intake load in week 1. Rollback: read-only audit; no production changes.

Phase 2 — Ad platform ingestion (week 2–3)

Configure Google Ads BigQuery Data Transfer (or API job) into raw_google_ads. Build Meta Marketing API Cloud Function into raw_meta_ads. Wire Cloud Scheduler, row-count monitoring, 7-day backfill validation against platform UI totals.

Risk: Meta API pagination on large accounts — batch by campaign ID. Rollback: disable new schedulers; raw ad tables isolated from intake pipeline.

Phase 3 — Attribution matching layer (week 4–5)

Load client source-to-campaign mapping into dim_source_campaign_map. Build fact_lead_attribution with match priority logic. Publish match-rate KPI dashboard (% gclid, % UTM, % source fallback, % unmatched).

Risk: Low gclid capture rate in Lead Docket — quantify before finalizing match priority; UTM fallback may become primary path. Rollback: attribution views only; intake tables untouched.

Phase 4 — BigQuery marts (week 6)

Build pre-aggregated mart_campaign_daily and mart_ad_group_daily. Partition by date, cluster by campaign_id. Scheduled query refresh chained after daily ingestion completes. Grain tests to prevent spend inflation from many-to-many joins.

Risk: Cartesian join inflating spend — enforce unique grain assertions in scheduled test query. Rollback: drop mart views; raw and attribution tables remain queryable.

Phase 5 — Looker Studio & handoff (week 7–8)

Extend existing dashboard with campaign ROI, cost per lead, signed/referred breakdown, and unmatched leads pages — matching styling and naming conventions. Documentation, runbook, recorded walkthrough for internal team.

Risk: Looker blend row limits on wide joins — connect only to mart views, never raw blends. Rollback: hide new pages until client validates numbers against platform UI.

Reporting & ops

Signal	Source	Cadence
Ingestion row counts (Google, Meta, intake)	Cloud Function logs	Daily; alert on zero rows
Attribution match rate	`fact_lead_attribution`	Daily Looker tile; weekly trend review
Unmatched lead volume	match_method = 'unmatched'	Weekly — indicates UTM capture gaps
Cost per lead by campaign	`mart_campaign_daily`	Real-time in Looker
Signed case rate by campaign	mart + intake quality flags	Weekly client report
Spend reconciliation vs platform UI	raw tables vs Ads Manager totals	Weekly during first month; monthly after
Pipeline job failures	Cloud Scheduler + error reporting	Immediate Slack alert

Ops cadence would include a weekly 15-minute pipeline check (row counts, match rate, unmatched trend) and a monthly review with the client stakeholder (campaign ROI, signed case rates, mapping table updates needed). On-call alerts only for zero-row ingestion or mart refresh failure — not per-lead events.

Proposed deliverables

Following the phased plan, a build would ship these artifacts:

Pipeline audit document with architecture diagram, schema catalog, and schedule matrix
Daily Google Ads ingestion into BigQuery (Data Transfer or API)
Daily Meta Ads ingestion Cloud Function with rate-limit handling
dim_source_campaign_map config table and spreadsheet loader
fact_lead_attribution matching layer with explicit unmatched bucket and match_method column
mart_campaign_daily and mart_ad_group_daily pre-aggregated views
Looker Studio pages on existing dashboard — campaign ROI, lead quality, unmatched report, match rate trend
Runbook: daily ops, re-run failed jobs, update source mappings, onboard additional client
Recorded handoff walkthrough for internal team

Effort estimate

Indicative effort for phases 1–5 (assumes access to existing pipeline, Google/Meta API credentials, and client mapping spreadsheet):

Scope	Hours (range)
Phases 1–5 (full build)	120–180 hrs
Ongoing maintenance (mapping updates, API changes)	4–8 hrs/month
Additional law-firm client (reuse marts + mapping)	20–40 hrs each

At 15–25 hrs/week, delivery would land in roughly 6–10 weeks. Milestone-based pricing aligned to phases 1–5 de-risks the engagement — each phase ships a testable increment (ingestion validated before attribution, attribution validated before Looker pages go live).

Glossary

Term	Meaning
gclid	Google Click Identifier — ties a lead session to a specific Google Ads click
fbclid	Meta Click Identifier — equivalent click token for Facebook/Instagram ads
UTM parameters	Tracking tags (utm_source, utm_campaign, utm_medium) appended to landing page URLs
Lead Docket	Legal intake platform capturing lead and case data exported to S3
Attribution matching	Logic linking a lead record to the ad campaign that drove it
Match rate	Percentage of leads successfully tied to a campaign vs the unmatched bucket
Mart	Pre-aggregated BigQuery view at reporting grain (campaign-day, ad-group-day)
MERGE	BigQuery upsert — insert new rows, update existing on composite key match
CPL	Cost per lead — campaign spend divided by attributed lead count
Looker Studio	Google's BI dashboard tool; connects to BigQuery views for client-facing reports