Production Property-Records Search Platform for NYC Open Data

By Amar Kumar

A proposed property records search platform for a New York real-estate services company that already pulls municipal data from NYC's open-data ecosystem. The build would wrap the existing data engine in a production layer — an internal ordering portal, a reconciliation pipeline across 80+ official datasets with verify-links on every result line, human-in-the-loop review before finalization, and branded PDF delivery tracked by internal file number.

Proposed outcome: A daily-use internal product that replaces third-party vendor searches for NYC municipal/departmental records — turning raw open data into certified, client-ready reports with one-click source verification on every line item.

Scenario

This brief describes a proposed solution — not a delivered engagement. It maps a pattern common among title and settlement shops: proven internal data pulls, but no production portal to order, verify, and deliver searches in-house.

Organization: Established NYC real-estate services company with two decades of title and settlement experience and a serious internal automation stack (AI pipelines, CRM intelligence, financial systems)
Existing asset: Working data engine that already queries NYC open-data sources — the build extends this engine, not replaces it
Owner profile: Expert full-stack engineer (or tight lead + helper) owning backend pipeline and internal portal UI; long-term multi-phase partnership
Geography: NYC five boroughs for MVP; surrounding counties in later phases (fundamentally different data model — more intake automation than direct API pulls)
Users: Production team ordering searches against properties and internal file numbers, reviewing results, signing off before delivery
Liability context: Search outputs carry certification implications — auto-publish without human gate is unacceptable

Problem

The company pays third-party vendors for municipal and title searches. The underlying data is largely free from official NYC and NY State sources, and internal proof-of-concept pulls already work. What is missing is the production layer around that data.

Vendor cost and latency — recurring fees for data accessible directly; turnaround tied to external vendor queues
No unified workflow — orders, status, review, and delivery scattered across email, vendor portals, and spreadsheets
Messy government data — 80+ datasets with inconsistent schemas, pagination, rate limits, stale records, and occasional errors
Provenance gap — vendor PDFs rarely expose live links back to the official source record for line-by-line verification
Property key fragmentation — datasets join on address, BBL (borough/block/lot), parcel ID, or agency-specific identifiers inconsistently
Liability without gates — publishing a search without human review and confidence scoring creates certification risk
No file-number spine — internal title-production software tracks files; search results must map to that identifier for write-back in later phases

Requirements

Functional (Phase 1 MVP)

Ordering portal — order a municipal/departmental search against a property and internal file number; select search types (COO, violations across DOB/HPD/ECB, tax/water/sewer, fire, housing, zoning, environmental, etc.)
BBL resolution spine — address → geocode → borough/block/lot as the canonical property key tying every dataset
Data pipeline — resilient async jobs querying Socrata/SODA, ArcGIS REST, and agency record systems; normalize into one result set per search
Verify-link on every line — each result item includes a live URL to the official source record (non-negotiable product requirement)
Reconciliation — merge duplicates, flag conflicts between sources, preserve raw provenance
Status dashboard — searches grouped under files with green / needs-review / problem states
Human-in-the-loop review — confidence scoring + sign-off gate before finalization
Branded PDF reports — dynamic generation stored and retrievable by file number
Audit trail — who ordered, when data was fetched, reviewer sign-off, PDF version history
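
The dashboard's three states could be derived per file from the searches grouped under it. The sketch below shows one possible rule set, not a specification: the `fetchFailed` and `awaitingReview` flags, and the choice that "problem" outranks "needs-review", are assumptions made for illustration.

```typescript
type FileState = "green" | "needs-review" | "problem";

// Hypothetical per-search summary; the field names are illustrative.
interface SearchSummary {
  searchId: string;
  fetchFailed: boolean;    // at least one dataset fetch errored out
  awaitingReview: boolean; // has lines waiting in the review queue
}

// Worst state wins: a failed fetch outranks a pending review.
function fileState(searches: SearchSummary[]): FileState {
  if (searches.some((s) => s.fetchFailed)) return "problem";
  if (searches.some((s) => s.awaitingReview)) return "needs-review";
  return "green";
}
```

Computing the state from search rows at read time, rather than storing it on the file, keeps the dashboard consistent with the registry after a re-run.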

Functional (later phases)

Title-chain module — recorded documents, judgments, liens, lis pendens, tax warrants, estate/probate, owner-name searches
Write-back into existing title-production software via its API
Slack-based approval and notification workflows
Surrounding county expansion where no unified open-data API exists

Non-functional

Reliability — 80+ external APIs must not silently break; health checks, schema drift alerts, circuit breakers
Provenance — immutable fetch metadata (source API, query, timestamp, raw payload hash) per result line
Security — role-scoped portal access; secrets outside repo; R2 signed URLs for PDFs
Performance — queue-backed parallel fetches; search completes in minutes, not hours
Observability — per-dataset success rates, latency, error taxonomy
Extensibility — new datasets added via config + adapter module, not forked pipeline code
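
One way to meet the extensibility requirement is to describe each dataset in a config entry and select adapters by search type at run time. The following is a minimal sketch under that assumption; the `AdapterConfig` fields, the adapter ids, and the search-type names are hypothetical and do not come from the existing engine.

```typescript
type AdapterFamily = "soda" | "arcgis" | "engine";

interface AdapterConfig {
  id: string;            // hypothetical adapter id
  family: AdapterFamily; // which shared client handles the fetch
  searchTypes: string[]; // search types this dataset contributes to
  enabled: boolean;      // feature flag: false switches off this dataset only
}

const ADAPTERS: AdapterConfig[] = [
  { id: "dob-violations", family: "soda", searchTypes: ["violations"], enabled: true },
  { id: "hpd-violations", family: "soda", searchTypes: ["violations", "housing"], enabled: true },
  { id: "zoning-layers", family: "arcgis", searchTypes: ["zoning"], enabled: false },
];

// Adapters to run for an order: enabled, and matching a requested search type.
function adaptersFor(requested: string[], registry: AdapterConfig[]): AdapterConfig[] {
  return registry.filter(
    (a) => a.enabled && a.searchTypes.some((t) => requested.includes(t)),
  );
}

const selected = adaptersFor(["housing", "zoning"], ADAPTERS).map((a) => a.id);
console.log(selected); // ["hpd-violations"]
```

Because `zoning-layers` is disabled, a zoning request simply skips it while the other datasets keep running, which is the behavior a per-adapter feature flag is meant to give.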

Architecture

Three tiers: a portal + workflow layer (React + Workers API), an async data plane (Queues + fetch workers wrapping the existing engine), and a provenance + delivery layer (D1 registry, R2 PDFs, review queue).

flowchart TB
  classDef portal fill:#ede9fe,stroke:#7c3aed,color:#5b21b6
  classDef worker fill:#dbeafe,stroke:#2563eb,color:#1e3a8a
  classDef data fill:#ccfbf1,stroke:#0d9488,color:#115e59
  classDef ext fill:#f8fafc,stroke:#475569,color:#334155
  UI["React portal\norder · dashboard · review"]:::portal
  API["Workers API\nauth · workflow · registry"]:::portal
  UI --> API
  BBL["BBL resolver\nGeoclient · PLUTO"]:::worker
  ORCH["Fetch orchestrator\nQueue fan-out"]:::worker
  RECON["Reconcile + provenance\nnormalize · dedupe · verify URL"]:::worker
  PDF["PDF generator\nbranded · per-line links"]:::worker
  API --> BBL
  BBL --> ORCH
  ORCH --> RECON
  RECON --> PDF
  D1[("D1 registry\nfiles · searches · lines")]:::data
  R2["R2 storage\nPDFs · raw snapshots"]:::data
  CRON["Cron health checks\nschema drift alerts"]:::data
  API --> D1
  RECON --> D1
  PDF --> R2
  CRON --> ORCH
  ENG["Existing data engine\nadapter, do not rebuild"]:::worker
  ORCH --> ENG
  SODA["NYC Open Data\nSODA · ArcGIS"]:::ext
  ACRIS["ACRIS · agency systems"]:::ext
  ENG --> SODA
  ENG --> ACRIS

Platform architecture — portal and Workers API orchestrate BBL resolution, queue-backed fetches via the existing engine, reconciliation with provenance, and PDF delivery

sequenceDiagram
  autonumber
  participant User as Production team
  participant Portal as React portal
  participant API as Workers API
  participant BBL as BBL resolver
  participant Q as Queues
  participant Adp as Dataset adapters
  participant Gov as NYC open data
  participant Rev as Review queue
  participant PDF as PDF service
  User->>Portal: order search (file #, address, types)
  Portal->>API: POST /searches
  API->>BBL: resolve address to BBL
  BBL->>API: BBL + PLUTO attrs
  API->>Q: enqueue N dataset jobs
  loop parallel fetches
    Q->>Adp: consume job
    Adp->>Gov: SODA / ArcGIS / agency API
    Gov->>Adp: raw records
    Adp->>API: result lines + provenance
  end
  API->>API: reconcile + confidence score
  alt needs review
    API->>Rev: queue low-confidence lines
    Rev->>User: notify reviewer
    User->>Rev: sign off
  end
  API->>PDF: generate branded report
  PDF->>API: R2 URL + version
  API->>Portal: status delivered

Search execution — parallel queue fetches, reconciliation with confidence scoring, mandatory review gate, then PDF delivery with verify-links

Chart: Component map by platform tier (major services per layer)

End-to-end flow

Order → BBL resolve → Fetch 80+ sources → Reconcile → Review → PDF deliver

Municipal search lifecycle — every result line carries a verify-link; nothing finalizes without reviewer sign-off

Chart: Indicative phase-1 dataset coverage by category (% of 80+ source adapters)

Chart: Typical search lifecycle, median vs p95 duration by stage (illustrative)

Recommended stack

Recommendation: Cloudflare Workers for compute, D1 for the relational registry, R2 for PDFs and raw snapshots, Queues for parallel dataset fetches, and a React portal on Cloudflare Pages — wrapping the existing data engine via adapter, not rebuild.

Layer	Technology	Why
Compute	Cloudflare Workers	Edge-native, consistent with client's newer systems; fast cold start for API routes and webhook handlers
Database	D1 (SQLite)	Search orders, file registry, result lines, provenance, review state — relational model fits audit and dashboard queries
Object storage	R2	PDF reports, raw fetch snapshots, large JSON payloads; no egress fees to Workers
Async jobs	Queues + Cron Triggers	Parallel dataset pulls; scheduled health checks and schema drift detection
Frontend	React (Vite) on Pages	Internal portal — order form, file-grouped dashboard, review UI with verify-link previews
PDF	React-PDF or Browser Rendering API	Branded templates with clickable verify-link per result line
Geocoding	NYC Geoclient / GeoSearch + PLUTO	Official NYC address → BBL resolution; tax lot attributes for cross-dataset joins
Existing engine	Worker service binding or HTTP adapter	Preserve proven pull logic — wrap, do not rewrite from scratch
Auth	Cloudflare Access	SSO for internal production team; no custom auth to maintain

Why Cloudflare over AWS Lambda? The organization has standardized on Cloudflare for newer systems, and Workers reach D1, R2, and Queues through native bindings, so metadata lookups stay on one platform with a single deployment and credentials model. AWS Lambda + DynamoDB + S3 remains viable for teams with deep existing investment, but stack consistency reduces ops surface here.

Why D1 over DynamoDB? Search results are inherently relational (files → searches → result_lines → provenance_records). SQL filters for the status dashboard and certification audit reports are simpler than single-table Dynamo patterns.

Provenance record shape (every result line would carry this metadata in D1):

interface ProvenanceRecord {
  id: string;
  result_line_id: string;
  source_api: string;       // e.g. "data.cityofnewyork.us/resource/..."
  source_record_id: string; // agency-native ID for URL rebuild
  source_url: string;       // live verify-link (required)
  query_params: string;     // JSON: BBL, address, date range used
  fetched_at: string;       // ISO timestamp
  payload_hash: string;     // SHA-256 of raw response
  adapter_version: string;  // pin schema expectations
}
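
A validation pass before each insert could enforce the fields marked as required above. This is a sketch under stated assumptions, not the engine's behavior: `validateProvenance` is a hypothetical helper, and the https-only rule and lowercase-hex digest format are choices made for illustration.

```typescript
interface ProvenanceRecord {
  id: string;
  result_line_id: string;
  source_api: string;
  source_record_id: string;
  source_url: string;
  query_params: string;
  fetched_at: string;
  payload_hash: string;
  adapter_version: string;
}

// Returns a list of problems; an empty list means the record may be stored.
function validateProvenance(rec: ProvenanceRecord): string[] {
  const problems: string[] = [];
  // Assumption: verify-links must be absolute https URLs.
  if (!/^https:\/\/\S+$/.test(rec.source_url)) {
    problems.push("source_url must be an absolute https URL");
  }
  // A SHA-256 digest is 32 bytes, which is 64 hex characters.
  if (!/^[0-9a-f]{64}$/.test(rec.payload_hash)) {
    problems.push("payload_hash must be 64 lowercase hex characters");
  }
  if (isNaN(Date.parse(rec.fetched_at))) {
    problems.push("fetched_at must be a parseable timestamp");
  }
  try {
    JSON.parse(rec.query_params);
  } catch {
    problems.push("query_params must be valid JSON");
  }
  if (rec.source_record_id.trim() === "") {
    problems.push("source_record_id is required to rebuild links");
  }
  return problems;
}
```

Returning every problem at once, instead of throwing on the first, gives the review UI a complete list to show next to the offending line.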

Component design

1 — Order & workflow service

Input: file number, address or BBL, search type checklist, requester ID
Output: search_id, status transitions (queued → fetching → reconciling → review → approved → delivered)
QA gate: validate file number format; reject duplicate in-flight search for same file + property + type
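
The status transitions and the duplicate check can be sketched as a small state table plus a lookup key. The strictly linear lifecycle is a simplifying assumption (a real workflow would likely need failure and re-run paths), and `SearchOrder`, `canTransition`, and `isDuplicateInFlight` are hypothetical names.

```typescript
type SearchStatus =
  | "queued" | "fetching" | "reconciling" | "review" | "approved" | "delivered";

// Each status may only advance to the next one in the lifecycle.
const NEXT_STATUS: Record<SearchStatus, SearchStatus | null> = {
  queued: "fetching",
  fetching: "reconciling",
  reconciling: "review",
  review: "approved",
  approved: "delivered",
  delivered: null,
};

function canTransition(from: SearchStatus, to: SearchStatus): boolean {
  return NEXT_STATUS[from] === to;
}

interface SearchOrder {
  fileNumber: string;
  bbl: string;
  searchType: string;
  status: SearchStatus;
}

// Duplicate key: same file + property + search type.
function orderKey(o: Pick<SearchOrder, "fileNumber" | "bbl" | "searchType">): string {
  return [o.fileNumber, o.bbl, o.searchType].join("|");
}

// A search counts as in flight until it has been delivered.
function isDuplicateInFlight(
  existing: SearchOrder[],
  incoming: Omit<SearchOrder, "status">,
): boolean {
  const key = orderKey(incoming);
  return existing.some((o) => o.status !== "delivered" && orderKey(o) === key);
}
```

Since `review` can only advance to `approved`, a search cannot reach `delivered` without passing through the sign-off state.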

2 — Property key resolver

Input: street address + borough, or raw BBL
Output: normalized BBL, lat/lon, PLUTO attributes (building class, units, land use)
Tools: Geoclient API, PLUTO dataset join, fallback manual BBL entry flagged for reviewer
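
Whatever the geocoder returns, the resolver needs one canonical BBL form to join on. A BBL is conventionally written as 10 digits: 1 for the borough (1 Manhattan, 2 Bronx, 3 Brooklyn, 4 Queens, 5 Staten Island), 5 for the block, and 4 for the lot. The sketch below normalizes manual input to that form; `parseBbl` and `formatBbl` are hypothetical helpers, and the set of accepted separators is an assumption.

```typescript
interface Bbl {
  borough: number; // 1..5
  block: number;
  lot: number;
}

// Accepts "1001230045" as well as separated forms like "1-123-45" or "1/00123/0045".
function parseBbl(raw: string): Bbl | null {
  const text = raw.trim();
  const match =
    /^([1-5])(\d{5})(\d{4})$/.exec(text) ??
    /^([1-5])[-\/ ](\d{1,5})[-\/ ](\d{1,4})$/.exec(text);
  if (match === null) return null;
  return { borough: Number(match[1]), block: Number(match[2]), lot: Number(match[3]) };
}

// Canonical 10-digit key: borough + zero-padded block (5) + zero-padded lot (4).
function formatBbl(b: Bbl): string {
  return `${b.borough}${String(b.block).padStart(5, "0")}${String(b.lot).padStart(4, "0")}`;
}

const parsed = parseBbl("1-123-45");
console.log(parsed === null ? "invalid" : formatBbl(parsed)); // "1001230045"
```

Input that fails to parse returns `null` rather than a guess, which matches the fallback of flagging a manual BBL entry for the reviewer.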

3 — Dataset fetch orchestrator

Input: BBL + enabled dataset adapters for search type
Output: raw payloads per adapter; fetch metadata logged to D1
Pattern: Queue consumer per adapter family (SODA batch, ArcGIS REST, engine-backed) with shared retry, backoff, and circuit-breaker config
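
The shared retry, backoff, and circuit-breaker config could look roughly like the following. This is a minimal sketch with the clock passed in for testability; the defaults (500 ms base, 30 s cap) and the `CircuitBreaker` class are illustrative assumptions, and a production version would probably add jitter to the backoff.

```typescript
// Exponential backoff with a ceiling; `attempt` is 0-based.
function backoffMs(attempt: number, baseMs = 500, capMs = 30000): number {
  return Math.min(capMs, baseMs * Math.pow(2, attempt));
}

// Minimal circuit breaker; callers supply the current time in milliseconds.
class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;
  private readonly threshold: number;
  private readonly cooldownMs: number;

  constructor(threshold: number, cooldownMs: number) {
    this.threshold = threshold;
    this.cooldownMs = cooldownMs;
  }

  // Closed: always allow. Open: allow probe calls once the cooldown has elapsed.
  allow(nowMs: number): boolean {
    if (this.openedAt === null) return true;
    return nowMs - this.openedAt >= this.cooldownMs;
  }

  // Any success closes the circuit and clears the failure count.
  recordSuccess(): void {
    this.failures = 0;
    this.openedAt = null;
  }

  // Consecutive failures at or past the threshold open (or re-open) the circuit.
  recordFailure(nowMs: number): void {
    this.failures += 1;
    if (this.failures >= this.threshold) this.openedAt = nowMs;
  }
}
```

Keeping one breaker per dataset means a failing agency endpoint stops consuming retries without blocking the other adapters in the same family.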

4 — Reconciliation & provenance engine

Input: raw adapter results
Output: normalized result_lines with source_url, source_api, fetched_at, confidence_factors
Rules: dedupe by agency record ID; flag address mismatches against BBL; never drop provenance on merge
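
The merge rule can be sketched as a pass that groups raw lines by agency record ID and keeps every contributing source. The types below are assumptions for illustration (only `source_api`, `source_url`, and `fetched_at` come from the provenance shape), and comparing descriptions is a stand-in for whatever field-level conflict test the real rules use.

```typescript
interface SourceRef {
  source_api: string;
  source_url: string;
  fetched_at: string;
}

interface RawLine {
  agency: string;
  agencyRecordId: string;
  description: string;
  source: SourceRef;
}

interface ResultLine {
  agency: string;
  agencyRecordId: string;
  description: string;
  sources: SourceRef[]; // every contributing fetch is kept
  conflict: boolean;    // true when sources disagree on the description
}

function reconcile(raw: RawLine[]): ResultLine[] {
  const merged: ResultLine[] = [];
  const byKey = new Map<string, ResultLine>();
  for (const line of raw) {
    // Scope the record ID by agency so IDs from different systems cannot collide.
    const key = `${line.agency}:${line.agencyRecordId}`;
    const existing = byKey.get(key);
    if (existing === undefined) {
      const fresh: ResultLine = {
        agency: line.agency,
        agencyRecordId: line.agencyRecordId,
        description: line.description,
        sources: [line.source],
        conflict: false,
      };
      byKey.set(key, fresh);
      merged.push(fresh);
    } else {
      existing.sources.push(line.source); // never drop provenance on merge
      if (existing.description !== line.description) existing.conflict = true;
    }
  }
  return merged;
}
```

A merged line keeps the first description it saw and raises `conflict` instead of silently picking a winner, leaving the decision to the reviewer.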

5 — Confidence scorer + review queue

Input: reconciled result set
Output: per-line and per-search confidence score; route conflicts and low-confidence items to review inbox
QA gate: no PDF generation until reviewer sign-off — search outputs carry liability
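
A first version of the scorer could be a plain penalty model. Everything numeric below is a placeholder: the penalty weights, the threshold of 80, and the signal names are assumptions chosen for illustration, not calibrated values. The one rule taken directly from the design is that the PDF gate depends on sign-off, not on the score.

```typescript
interface LineSignals {
  lineId: string;
  conflict: boolean;        // sources disagree
  addressMismatch: boolean; // record address does not match the BBL
  linkVerified: boolean;    // verify-link resolved when checked
}

// Hypothetical penalty weights, in points out of 100.
const PENALTY = { conflict: 40, addressMismatch: 30, linkUnverified: 30 };
const REVIEW_THRESHOLD = 80; // assumed cut-off for the review inbox

function scoreLine(s: LineSignals): number {
  let score = 100;
  if (s.conflict) score -= PENALTY.conflict;
  if (s.addressMismatch) score -= PENALTY.addressMismatch;
  if (!s.linkVerified) score -= PENALTY.linkUnverified;
  return Math.max(0, score);
}

// Per-search score is the weakest line; an empty result set scores 100 here.
function scoreSearch(lines: LineSignals[]): number {
  return lines.reduce((min, l) => Math.min(min, scoreLine(l)), 100);
}

function linesNeedingReview(lines: LineSignals[]): string[] {
  return lines.filter((l) => scoreLine(l) < REVIEW_THRESHOLD).map((l) => l.lineId);
}

interface SignOff {
  reviewerId: string;
  signedAt: string;
}

// The gate ignores the score: without a sign-off there is no PDF.
function canGeneratePdf(signOff: SignOff | null): boolean {
  return signOff !== null;
}
```

Integer points avoid floating-point comparisons at the threshold, and taking the minimum for the per-search score means one weak line is enough to pull the whole search into review.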

6 — PDF report generator

Input: approved result set + branding template
Output: PDF in R2; verify-link rendered as clickable URL per line
Versioning: new PDF version on re-run; prior versions retained for audit
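
Versioning can be kept simple by never overwriting an object: each re-run writes a new R2 key and a new `pdf_artifacts` row. The key layout below (`reports/<file>/<search>/v<n>.pdf`) and the `nextPdfArtifact` helper are hypothetical, shown only to make the retain-prior-versions rule concrete.

```typescript
interface PdfArtifact {
  r2Key: string;
  version: number;
  generatedAt: string; // ISO timestamp
}

// Derives the next version from the rows already stored for this search.
function nextPdfArtifact(
  fileNumber: string,
  searchId: string,
  prior: PdfArtifact[],
  now: Date,
): PdfArtifact {
  const latest = prior.reduce((max, a) => Math.max(max, a.version), 0);
  const version = latest + 1;
  return {
    // The file number is encoded so characters like "/" cannot alter the key path.
    r2Key: `reports/${encodeURIComponent(fileNumber)}/${searchId}/v${version}.pdf`,
    version,
    generatedAt: now.toISOString(),
  };
}
```

Because earlier keys are never reused, an audit can always retrieve the exact PDF that a given sign-off applied to.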

7 — File & search registry

Table	Purpose
`files`	Internal file numbers; link to title-production system ID (for post-MVP write-back)
`searches`	Order metadata, BBL, status, requester, timestamps
`result_lines`	Normalized findings with category, severity, verify-link
`provenance_records`	Immutable fetch metadata per line
`review_events`	Reviewer actions, notes, sign-off timestamps
`pdf_artifacts`	R2 keys, version, generated_at

Chart: Indicative MVP effort distribution by phase (% of total hours)

Implementation plan

Phase 1 — Foundation & engine wrap (week 1–3)

Provision Workers, D1 schema, R2 buckets, Queues. HTTP adapter wrapping existing data engine with health endpoint. BBL resolver service with PLUTO join. Basic REST API: create search, get status, list by file number.

Risk: Undocumented engine APIs — schedule pairing sessions with internal team in week 1. Rollback: read-only portal against engine without write path.

Phase 2 — Dataset adapter library (week 4–7)

Implement shared adapter interface; port highest-priority datasets (DOB violations, HPD, ECB, tax/lien, COO, zoning, environmental). Shared SODA client with pagination, rate-limit handling, schema version pins. Cron health checks per dataset with Slack alerts on failure.
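
For the shared SODA client, pagination comes down to building `$limit` / `$offset` requests and stopping on a short page. The sketch below only constructs URLs and decides whether to continue; it performs no network calls. `$where`, `$order`, `$limit`, and `$offset` are standard SODA query parameters, while the dataset id `abcd-1234` and the `bbl` column name used in the usage line are placeholders, not real identifiers.

```typescript
interface SodaPage {
  limit: number;
  offset: number;
}

function sodaPageUrl(domain: string, datasetId: string, where: string, page: SodaPage): string {
  const params: Array<[string, string]> = [
    ["$where", where],
    ["$order", ":id"], // stable ordering so consecutive pages do not overlap
    ["$limit", String(page.limit)],
    ["$offset", String(page.offset)],
  ];
  const query = params.map(([k, v]) => `${k}=${encodeURIComponent(v)}`).join("&");
  return `https://${domain}/resource/${datasetId}.json?${query}`;
}

// A short page means the result set is exhausted; otherwise advance the offset.
function nextPage(page: SodaPage, rowsReturned: number): SodaPage | null {
  if (rowsReturned < page.limit) return null;
  return { limit: page.limit, offset: page.offset + page.limit };
}

// Placeholder dataset id and column name, for illustration only.
const firstUrl = sodaPageUrl("data.cityofnewyork.us", "abcd-1234", "bbl='1001230045'", {
  limit: 1000,
  offset: 0,
});
```

When a page comes back exactly full, `nextPage` asks for one more page; that request returns fewer rows than the limit (possibly zero) and ends the loop.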

Risk: API schema drift — pin expected columns; alert on missing fields before silent empty results. Rollback: disable adapter via feature flag without stopping other datasets.
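
Pinning expected columns can be checked with a comparison between the pinned list and the keys seen in a sample of rows. This is a sketch under two assumptions: `detectDrift` is a hypothetical helper, and the check samples several rows because SODA JSON responses can omit fields whose value is null, so a single row is not reliable evidence of a missing column.

```typescript
interface DriftReport {
  missing: string[];     // pinned columns absent from every sampled row
  unexpected: string[];  // observed columns that are not pinned
  inconclusive: boolean; // true when there were no rows to inspect
}

function detectDrift(pinned: string[], rows: Array<Record<string, unknown>>): DriftReport {
  // Zero rows cannot prove drift; report that explicitly instead of guessing.
  if (rows.length === 0) return { missing: [], unexpected: [], inconclusive: true };

  // Union of keys across the sample, in first-seen order.
  const observed = new Set<string>();
  for (const row of rows) {
    for (const key of Object.keys(row)) observed.add(key);
  }

  const missing = pinned.filter((c) => !observed.has(c));
  const unexpected: string[] = [];
  observed.forEach((c) => {
    if (!pinned.includes(c)) unexpected.push(c);
  });
  return { missing, unexpected, inconclusive: false };
}
```

A renamed column shows up as one `missing` entry paired with one `unexpected` entry, which is the signal worth alerting on before the adapter starts returning silently empty results.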

Phase 3 — Portal UI (week 8–10)

React portal: order form with address autocomplete, file-grouped dashboard (green / needs-review / problem), search detail view with result lines and verify-link previews. Cloudflare Access for SSO.

Risk: UX friction slows production team adoption — embed reviewers in weekly UAT. Rollback: API-only mode for power users until UI polished.

Phase 4 — Reconciliation & provenance (week 11–13)

Normalization rules, conflict detection, immutable provenance store. Verify-link validator (HTTP HEAD check that source URL resolves). Conflict UI showing side-by-side source records.

Risk: Source URLs change format — store agency record ID separately so links can be rebuilt. Rollback: show raw links even if validator fails; flag for manual review.
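
Storing the agency record ID separately makes it possible to regenerate a link from a per-source template when the stored URL stops resolving. The sketch below illustrates that idea and the rollback rule; the template, the `example-agency` key, and the `records.example.gov` host are invented placeholders, not real agency URL formats.

```typescript
// Hypothetical templates keyed by source; real agency URL formats will differ.
const LINK_TEMPLATES = new Map<string, string>([
  ["example-agency", "https://records.example.gov/violations/{id}"],
]);

function rebuildVerifyLink(sourceKey: string, recordId: string): string | null {
  const template = LINK_TEMPLATES.get(sourceKey);
  if (template === undefined) return null;
  return template.replace("{id}", encodeURIComponent(recordId));
}

interface LinkDecision {
  url: string;
  needsReview: boolean;
}

// `storedUrlOk` is the result of the validator's earlier check on the stored URL.
function chooseVerifyLink(
  storedUrl: string,
  storedUrlOk: boolean,
  sourceKey: string,
  recordId: string,
): LinkDecision {
  if (storedUrlOk) return { url: storedUrl, needsReview: false };
  const rebuilt = rebuildVerifyLink(sourceKey, recordId);
  // Flag the line either way: a rebuilt link has not been validated yet,
  // and with no template the raw stored link is shown as-is.
  return { url: rebuilt ?? storedUrl, needsReview: true };
}
```

When a source changes its URL format, only the template entry needs updating; stored provenance rows stay untouched.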

Phase 5 — Review gate & PDF delivery (week 14–16)

Review queue UI, sign-off workflow, confidence scoring v1. Branded PDF template with per-line verify links and file number header. R2 storage + retrieval by file number from portal.

Risk: PDF layout breaks on long violation lists — paginate with continuation sheets. Rollback: HTML report fallback until PDF template stable.

Phase 6 — Hardening & title-chain prep (week 17–20)

Load testing on 80+ parallel fetches, ops dashboards, runbooks, on-call alerts for dataset outages. Document adapter-add guide for internal team. Spike ACRIS and title-chain data sources for the post-MVP roadmap; draft a write-back API spec for the title-production software.

Risk: Silent pipeline degradation over time — mandatory weekly dataset health report. Rollback: vendor fallback path for critical search types until confidence KPIs met.

Reporting & ops

Signal	Source	Cadence
Search completion time	`searches` timestamps	Real-time dashboard
Per-dataset fetch success rate	Adapter Worker logs	Daily; alert if below 95%
Review queue depth	`review_events`	Alert if > N items or > 24h unreviewed
Verify-link broken rate	Link validator cron job	Weekly trend report
API schema drift	Health check cron	Immediate Slack alert
PDF generation failures	Worker error logs	Real-time alert
Vendor vs in-house cost	Order volume × vendor rate	Monthly leadership summary
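
The daily success-rate alert in the table reduces to a threshold check per dataset. The sketch assumes counts have already been aggregated from the adapter logs; `FetchStats` and `datasetsBelowTarget` are hypothetical names, and skipping datasets with zero attempts is a design choice rather than a stated requirement.

```typescript
interface FetchStats {
  dataset: string;
  ok: number;     // successful fetches in the reporting window
  failed: number; // failed fetches in the reporting window
}

// Returns the datasets whose success rate is strictly below `minPercent`.
function datasetsBelowTarget(stats: FetchStats[], minPercent = 95): string[] {
  return stats
    .filter((s) => {
      const total = s.ok + s.failed;
      // Integer comparison avoids floating-point edge cases at exactly 95%.
      return total > 0 && s.ok * 100 < minPercent * total;
    })
    .map((s) => s.dataset);
}
```

A dataset with no attempts in the window is left out of this alert; it would be better surfaced by the scheduled health check than by a rate computed over zero samples.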

Ops cadence would include a weekly 30-minute pipeline standup (failed datasets, review backlog, schema changes) and a monthly review with leadership (search volume, turnaround time, vendor cost avoided, readiness for the post-MVP title-chain and write-back work). The on-call rotation would cover production Worker errors and dataset-wide outages only, not per-search alerts.

Proposed deliverables

Following the phased plan, a build would ship these artifacts:

Cloudflare Workers API with D1 schema, R2 buckets, and Queue topology documented
Adapter library for 80+ NYC open-data sources with shared provenance model
BBL resolution service (Geoclient + PLUTO join)
HTTP adapter layer wrapping the existing data engine (no from-scratch rebuild)
React internal portal — order form, file-grouped status dashboard, review UI, PDF download
Human-in-the-loop review queue with confidence scoring and sign-off audit trail
Branded PDF report generator with verify-link on every result line
Ops dashboard and per-dataset health monitoring with schema drift alerts
Runbooks: add new dataset adapter, handle API outage, reviewer workflow, credential rotation
Integration specification for title-production write-back (post-MVP phase)

Effort estimate

Indicative effort for MVP through phase 6 (assumes access to existing engine, API credentials for NYC data sources, and production team available for UAT):

Scope	Hours (range)
MVP build (implementation phases 1–5)	280–380 hrs
Phase 6 hardening & title-chain spike	40–60 hrs
Title-chain module (later phase)	120–180 hrs
Write-back + Slack workflows (later phase)	60–90 hrs
Ongoing maintenance (adapter updates, health monitoring)	10–16 hrs/month

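At 25–30 hrs/week, the 280–380 hour MVP range works out to roughly 9–15 weeks (280 ÷ 30 ≈ 9.3; 380 ÷ 25 = 15.2), which fits inside the 16-week window that phases 1–5 occupy in the implementation plan. Milestone-based pricing aligned to phases 1–5 would de-risk the engagement for both parties: each phase ships a testable increment before the next begins.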

Glossary

Term	Meaning
BBL	Borough-Block-Lot — NYC's canonical parcel identifier tying datasets together
PLUTO	Primary Land Use Tax Lot Output — NYC tax lot dataset with building and land use attributes
ACRIS	Automated City Register Information System — recorded documents (deeds, mortgages, etc.)
SODA / Socrata	Socrata Open Data API (SODA), the API platform behind NYC Open Data and most of its municipal datasets
COO	Certificate of Occupancy — DOB document confirming legal use of a building
HPD / DOB / ECB	Department of Housing Preservation and Development, Department of Buildings, Environmental Control Board; the agencies behind the major violation systems
Provenance	Metadata tracing each result line to its official source fetch (API, query, timestamp, hash)
Verify-link	Live URL to the official government record — one-click human confirmation of any line item
HITL	Human-in-the-loop — reviewer sign-off gate before a search is finalized and delivered
Geoclient	NYC official geocoding service for address → BBL resolution