Marketing Attribution Pipeline — S3, BigQuery, and Looker Studio

By Amar Kumar

A proposed marketing attribution pipeline extension for a team that already moves lead and case-intake data from Lead Docket through AWS S3 into BigQuery, surfaced in Looker Studio via a scheduled Cloud Function refresh. The build would add daily Google Ads and Meta campaign ingestion, a matching layer tying each lead to its driving campaign (gclid, UTMs, source mapping), and pre-aggregated BigQuery views at campaign and ad-group grain — wired into the existing Looker dashboard with consistent styling.

Proposed outcome: One law-firm client dashboard showing which campaigns drive leads, cost per lead, lead quality, and signed/referred case breakdown by campaign — with unmatched leads handled explicitly, not dropped.

Scenario

This brief describes a proposed solution — not a delivered engagement. It maps a recurring pattern in legal marketing analytics: intake data already pipelines to BigQuery, but ad spend lives in separate silos.

Problem

The intake pipeline works — leads land in BigQuery and appear in Looker. But marketing spend data lives in Google Ads and Meta Ads Manager, so the team cannot answer basic attribution questions without manual exports and spreadsheet joins.

Requirements

Functional

Non-functional

Architecture

Three layers: ingestion (existing S3 intake unchanged + new ad platform loads), attribution warehouse (matching + marts in BigQuery), and reporting (Looker Studio on pre-aggregated views).

flowchart TB classDef ingest fill:#ede9fe,stroke:#7c3aed,color:#5b21b6 classDef bq fill:#dbeafe,stroke:#2563eb,color:#1e3a8a classDef report fill:#ccfbf1,stroke:#0d9488,color:#115e59 classDef ext fill:#f8fafc,stroke:#475569,color:#334155 LD["Lead Docket\nintake platform"]:::ext S3["AWS S3\nlead exports"]:::ingest GADS["Google Ads\nAPI / BQ transfer"]:::ext META["Meta Marketing API"]:::ext CF1["Cloud Function\nintake load — existing"]:::ingest CF2["Cloud Function\nMeta daily pull"]:::ingest SCHED["Cloud Scheduler\ndaily trigger"]:::ingest RAW["raw_leads · raw_google_ads\nraw_meta_ads"]:::bq MATCH["fact_lead_attribution\ngclid · UTM · source map"]:::bq MART["mart_campaign_daily\nmart_ad_group_daily"]:::bq MAP["dim_source_campaign_map\nclient config"]:::bq LOOKER["Looker Studio\ncampaign ROI dashboard"]:::report LD --> S3 S3 --> CF1 SCHED --> CF1 SCHED --> CF2 GADS --> RAW CF2 --> META META --> RAW CF1 --> RAW RAW --> MATCH MAP --> MATCH MATCH --> MART MART --> LOOKER

Pipeline architecture — existing S3 intake path unchanged; new ad platform loads join through an attribution matching layer into pre-aggregated marts for Looker

sequenceDiagram autonumber participant Lead as fact_leads participant Match as Attribution SQL participant Gclid as gclid lookup participant UTM as UTM + source map participant Out as fact_lead_attribution participant Mart as mart_campaign_daily Lead->>Match: new lead row Match->>Gclid: gclid present? alt gclid match Gclid->>Out: platform=google, match_method=gclid else UTM + source match Match->>UTM: utm_campaign + utm_source UTM->>Out: match_method=utm_source_map else source-only fallback UTM->>Out: match_method=source_fallback else no match Match->>Out: match_method=unmatched end Out->>Mart: aggregate by campaign + date

Attribution matching — priority cascade from gclid exact match through UTM mapping to explicit unmatched bucket; no silent drops

Component map by pipeline tier (major services per layer)

End-to-end flow

Daily cycle — ad spend and intake leads converge in BigQuery marts before surfacing in Looker Studio

Indicative lead match method distribution after attribution layer (typical legal lead-gen mix)

Typical daily pipeline duration by stage — median vs p95 (illustrative)

Recommendation: keep the existing S3 → Cloud Function → BigQuery intake path untouched; add Google Ads BigQuery Data Transfer (or API) and a Meta Marketing API Cloud Function; model attribution in BigQuery SQL views with config mapping tables; connect Looker Studio to pre-aggregated marts only.

LayerTechnologyWhy
Intake landingAWS S3 (existing)Lead Docket export path already proven; no migration needed
Intake loadCloud Function + Cloud Scheduler (existing)Matches current refresh pattern; audit before extending
Google AdsBigQuery Data Transfer ServiceZero-maintenance daily campaign metrics; supplement with API for click-level detail if needed
Meta AdsMarketing API → Cloud Function → BigQueryNo native BQ transfer for Meta; scheduled pull with pagination and rate-limit handling
WarehouseBigQueryAlready in use; partitioned raw tables, view-based marts, scheduled query refresh
Attribution logicBigQuery SQL + mapping tablesTransparent, version-controlled joins; client updates source map without redeploying SQL
ReportingLooker Studio (existing dashboard)Client already uses it; extend with new pages on mart views, not raw blends
OrchestrationCloud SchedulerDaily cadence after ad platform close; chain: ingest → match → mart refresh
MonitoringCloud Logging + optional Slack webhookRow-count checks, match-rate alerts, zero-row ingestion failures

Why mapping tables over hardcoded CASE statements? Client source-to-campaign rules will change as they add landing pages and rename campaigns. A dim_source_campaign_map table lets ops update mappings in a spreadsheet import without touching SQL.

Why pre-aggregated marts? Looker Studio blends on wide raw tables are slow and error-prone. mart_campaign_daily at the correct grain gives sub-second dashboard loads and prevents accidental many-to-many spend inflation.

Attribution matching logic (simplified SQL pattern for fact_lead_attribution):

SELECT
  l.lead_id,
  l.created_at,
  COALESCE(g.campaign_id, u.campaign_id, s.campaign_id) AS campaign_id,
  CASE
    WHEN g.campaign_id IS NOT NULL THEN 'gclid'
    WHEN u.campaign_id IS NOT NULL THEN 'utm_source_map'
    WHEN s.campaign_id IS NOT NULL THEN 'source_fallback'
    ELSE 'unmatched'
  END AS match_method
FROM fact_leads l
LEFT JOIN dim_gclid_campaign g ON l.gclid = g.gclid
LEFT JOIN dim_source_campaign_map u
  ON l.utm_source = u.source AND l.utm_campaign = u.utm_campaign
LEFT JOIN dim_source_campaign_map s
  ON l.utm_source = s.source AND s.utm_campaign IS NULL
WHERE l.client_id = @client_id

Component design

1 — Pipeline auditor & documenter

2 — Google Ads ingestion job

3 — Meta ingestion job

4 — Attribution matching engine

5 — Campaign mart views

6 — Looker Studio layer

PageMetricsSource
Campaign ROISpend, leads, CPL, impressions, CTRmart_campaign_daily
Lead qualitySigned cases, referred cases, conversion rate by campaignmart + intake join
Ad group detailSame metrics at ad-group grainmart_ad_group_daily
Unmatched leadsLeads with no campaign match; source/UTM breakdownfact_lead_attribution
Match rate trendDaily % gclid / UTM / unmatched over timefact_lead_attribution

Indicative build effort distribution by phase (% of total hours)

Implementation plan

Phase 1 — Audit & baseline (week 1)

Document existing S3 → BigQuery → Looker pipeline. Export current schemas, review Cloud Function code, inventory Looker data sources. Establish dev/staging datasets separate from production intake tables.

Risk: Undocumented Cloud Function edge cases — pair with whoever built the intake load in week 1. Rollback: read-only audit; no production changes.

Phase 2 — Ad platform ingestion (week 2–3)

Configure Google Ads BigQuery Data Transfer (or API job) into raw_google_ads. Build Meta Marketing API Cloud Function into raw_meta_ads. Wire Cloud Scheduler, row-count monitoring, 7-day backfill validation against platform UI totals.

Risk: Meta API pagination on large accounts — batch by campaign ID. Rollback: disable new schedulers; raw ad tables isolated from intake pipeline.

Phase 3 — Attribution matching layer (week 4–5)

Load client source-to-campaign mapping into dim_source_campaign_map. Build fact_lead_attribution with match priority logic. Publish match-rate KPI dashboard (% gclid, % UTM, % source fallback, % unmatched).

Risk: Low gclid capture rate in Lead Docket — quantify before finalizing match priority; UTM fallback may become primary path. Rollback: attribution views only; intake tables untouched.

Phase 4 — BigQuery marts (week 6)

Build pre-aggregated mart_campaign_daily and mart_ad_group_daily. Partition by date, cluster by campaign_id. Scheduled query refresh chained after daily ingestion completes. Grain tests to prevent spend inflation from many-to-many joins.

Risk: Cartesian join inflating spend — enforce unique grain assertions in scheduled test query. Rollback: drop mart views; raw and attribution tables remain queryable.

Phase 5 — Looker Studio & handoff (week 7–8)

Extend existing dashboard with campaign ROI, cost per lead, signed/referred breakdown, and unmatched leads pages — matching styling and naming conventions. Documentation, runbook, recorded walkthrough for internal team.

Risk: Looker blend row limits on wide joins — connect only to mart views, never raw blends. Rollback: hide new pages until client validates numbers against platform UI.

Reporting & ops

SignalSourceCadence
Ingestion row counts (Google, Meta, intake)Cloud Function logsDaily; alert on zero rows
Attribution match ratefact_lead_attributionDaily Looker tile; weekly trend review
Unmatched lead volumematch_method = 'unmatched'Weekly — indicates UTM capture gaps
Cost per lead by campaignmart_campaign_dailyReal-time in Looker
Signed case rate by campaignmart + intake quality flagsWeekly client report
Spend reconciliation vs platform UIraw tables vs Ads Manager totalsWeekly during first month; monthly after
Pipeline job failuresCloud Scheduler + error reportingImmediate Slack alert

Ops cadence would include a weekly 15-minute pipeline check (row counts, match rate, unmatched trend) and a monthly review with the client stakeholder (campaign ROI, signed case rates, mapping table updates needed). On-call alerts only for zero-row ingestion or mart refresh failure — not per-lead events.

Proposed deliverables

Following the phased plan, a build would ship these artifacts:

Effort estimate

Indicative effort for phases 1–5 (assumes access to existing pipeline, Google/Meta API credentials, and client mapping spreadsheet):

ScopeHours (range)
Phases 1–5 (full build)120–180 hrs
Ongoing maintenance (mapping updates, API changes)4–8 hrs/month
Additional law-firm client (reuse marts + mapping)20–40 hrs each

At 15–25 hrs/week, delivery would land in roughly 6–10 weeks. Milestone-based pricing aligned to phases 1–5 de-risks the engagement — each phase ships a testable increment (ingestion validated before attribution, attribution validated before Looker pages go live).

Glossary

TermMeaning
gclidGoogle Click Identifier — ties a lead session to a specific Google Ads click
fbclidMeta Click Identifier — equivalent click token for Facebook/Instagram ads
UTM parametersTracking tags (utm_source, utm_campaign, utm_medium) appended to landing page URLs
Lead DocketLegal intake platform capturing lead and case data exported to S3
Attribution matchingLogic linking a lead record to the ad campaign that drove it
Match ratePercentage of leads successfully tied to a campaign vs the unmatched bucket
MartPre-aggregated BigQuery view at reporting grain (campaign-day, ad-group-day)
MERGEBigQuery upsert — insert new rows, update existing on composite key match
CPLCost per lead — campaign spend divided by attributed lead count
Looker StudioGoogle's BI dashboard tool; connects to BigQuery views for client-facing reports