Production Property-Records Search Platform for NYC Open Data

By Amar Kumar

A proposed property records search platform for a New York real-estate services company that already pulls municipal data from NYC's open-data ecosystem. The build would wrap the existing data engine in a production layer — an internal ordering portal, a reconciliation pipeline across 80+ official datasets with verify-links on every result line, human-in-the-loop review before finalization, and branded PDF delivery tracked by internal file number.

Proposed outcome: A daily-use internal product that replaces third-party vendor searches for NYC municipal/departmental records — turning raw open data into certified, client-ready reports with one-click source verification on every line item.

Scenario

This brief describes a proposed solution — not a delivered engagement. It maps a pattern common among title and settlement shops: proven internal data pulls, but no production portal to order, verify, and deliver searches in-house.

Problem

The company pays third-party vendors for municipal and title searches. The underlying data is largely free from official NYC and NY State sources, and internal proof-of-concept pulls already work. What is missing is the production layer around that data.

Requirements

Functional (Phase 1 MVP)

Functional (later phases)

Non-functional

Architecture

Three tiers: a portal + workflow layer (React + Workers API), an async data plane (Queues + fetch workers wrapping the existing engine), and a provenance + delivery layer (D1 registry, R2 PDFs, review queue).

flowchart TB
  classDef portal fill:#ede9fe,stroke:#7c3aed,color:#5b21b6
  classDef worker fill:#dbeafe,stroke:#2563eb,color:#1e3a8a
  classDef data fill:#ccfbf1,stroke:#0d9488,color:#115e59
  classDef ext fill:#f8fafc,stroke:#475569,color:#334155

  UI["React portal\norder · dashboard · review"]:::portal
  API["Workers API\nauth · workflow · registry"]:::portal
  UI --> API

  BBL["BBL resolver\nGeoclient · PLUTO"]:::worker
  ORCH["Fetch orchestrator\nQueue fan-out"]:::worker
  RECON["Reconcile + provenance\nnormalize · dedupe · verify URL"]:::worker
  PDF["PDF generator\nbranded · per-line links"]:::worker
  API --> BBL
  BBL --> ORCH
  ORCH --> RECON
  RECON --> PDF

  D1[("D1 registry\nfiles · searches · lines")]:::data
  R2["R2 storage\nPDFs · raw snapshots"]:::data
  CRON["Cron health checks\nschema drift alerts"]:::data
  API --> D1
  RECON --> D1
  PDF --> R2
  CRON --> ORCH

  ENG["Existing data engine\nadapter (do not rebuild)"]:::worker
  ORCH --> ENG

  SODA["NYC Open Data\nSODA · ArcGIS"]:::ext
  ACRIS["ACRIS · agency systems"]:::ext
  ENG --> SODA
  ENG --> ACRIS

Platform architecture — portal and Workers API orchestrate BBL resolution, queue-backed fetches via the existing engine, reconciliation with provenance, and PDF delivery

sequenceDiagram
  autonumber
  participant User as Production team
  participant Portal as React portal
  participant API as Workers API
  participant BBL as BBL resolver
  participant Q as Queues
  participant Adp as Dataset adapters
  participant Gov as NYC open data
  participant Rev as Review queue
  participant PDF as PDF service

  User->>Portal: order search (file #, address, types)
  Portal->>API: POST /searches
  API->>BBL: resolve address to BBL
  BBL->>API: BBL + PLUTO attrs
  API->>Q: enqueue N dataset jobs
  loop parallel fetches
    Q->>Adp: consume job
    Adp->>Gov: SODA / ArcGIS / agency API
    Gov->>Adp: raw records
    Adp->>API: result lines + provenance
  end
  API->>API: reconcile + confidence score
  alt needs review
    API->>Rev: queue low-confidence lines
    Rev->>User: notify reviewer
    User->>Rev: sign off
  end
  API->>PDF: generate branded report
  PDF->>API: R2 URL + version
  API->>Portal: status delivered

Search execution — parallel queue fetches, reconciliation with confidence scoring, mandatory review gate, then PDF delivery with verify-links
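The "enqueue N dataset jobs" step in the sequence above could be sketched as a pure function that turns one order into one queue message per dataset. The message shape, the dataset keys, and the function name are assumptions made for illustration; they are not the existing engine's API.

```typescript
// Hypothetical queue message: one per (search, dataset) pair.
interface DatasetJob {
  search_id: string;
  bbl: string;        // 10-digit parcel key returned by the BBL resolver
  dataset_id: string; // hypothetical adapter key, e.g. "dob_violations"
  attempt: number;    // starts at 1; a retry policy could increment it
}

// Build the fan-out for one search. Duplicate dataset keys are dropped so a
// dataset is never fetched twice for the same order.
function buildDatasetJobs(searchId: string, bbl: string, datasetIds: string[]): DatasetJob[] {
  const queued: string[] = [];
  const jobs: DatasetJob[] = [];
  for (const datasetId of datasetIds) {
    if (queued.indexOf(datasetId) !== -1) continue; // already queued
    queued.push(datasetId);
    jobs.push({ search_id: searchId, bbl: bbl, dataset_id: datasetId, attempt: 1 });
  }
  return jobs;
}

const exampleJobs = buildDatasetJobs("search-001", "1000010001", [
  "dob_violations",
  "hpd_violations",
  "dob_violations", // duplicate, dropped
]);
```

In a Worker, the resulting array would be handed to the Queues producer; keeping the fan-out itself pure makes it easy to unit-test without a queue.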

Component map by platform tier (major services per layer)

End-to-end flow

Municipal search lifecycle — every result line carries a verify-link; nothing finalizes without reviewer sign-off

Indicative phase-1 dataset coverage by category (% of 80+ source adapters)

Typical search lifecycle — median vs p95 duration by stage (illustrative)

Recommendation: Cloudflare Workers for compute, D1 for the relational registry, R2 for PDFs and raw snapshots, Queues for parallel dataset fetches, and a React portal on Cloudflare Pages — wrapping the existing data engine via adapter, not rebuild.

| Layer | Technology | Why |
| --- | --- | --- |
| Compute | Cloudflare Workers | Edge-native, consistent with client's newer systems; fast cold start for API routes and webhook handlers |
| Database | D1 (SQLite) | Search orders, file registry, result lines, provenance, review state; relational model fits audit and dashboard queries |
| Object storage | R2 | PDF reports, raw fetch snapshots, large JSON payloads; no egress fees to Workers |
| Async jobs | Queues + Cron Triggers | Parallel dataset pulls; scheduled health checks and schema drift detection |
| Frontend | React (Vite) on Pages | Internal portal: order form, file-grouped dashboard, review UI with verify-link previews |
| PDF | React-PDF or Browser Rendering API | Branded templates with clickable verify-link per result line |
| Geocoding | NYC Geoclient / GeoSearch + PLUTO | Official NYC address → BBL resolution; tax lot attributes for cross-dataset joins |
| Existing engine | Worker service binding or HTTP adapter | Preserve proven pull logic: wrap, do not rewrite from scratch |
| Auth | Cloudflare Access | SSO for internal production team; no custom auth to maintain |

Why Cloudflare over AWS Lambda? The organization has standardized on Cloudflare for newer systems — Workers, D1, R2, and Queues co-locate without cross-service latency for metadata joins. AWS Lambda + DynamoDB + S3 remains viable for teams with deep existing investment, but stack consistency reduces ops surface here.

Why D1 over DynamoDB? Search results are inherently relational (files → searches → result_lines → provenance_records). SQL filters for the status dashboard and certification audit reports are simpler than single-table Dynamo patterns.

Provenance record shape (every result line would carry this metadata in D1):

interface ProvenanceRecord {
  id: string;
  result_line_id: string;
  source_api: string;       // e.g. "data.cityofnewyork.us/resource/..."
  source_record_id: string; // agency-native ID for URL rebuild
  source_url: string;       // live verify-link (required)
  query_params: string;     // JSON: BBL, address, date range used
  fetched_at: string;       // ISO timestamp
  payload_hash: string;     // SHA-256 of raw response
  adapter_version: string;  // pin schema expectations
}
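
A minimal sketch of a pre-write check for this record shape, assuming records are validated before they are stored. The specific rules (https-only links, lowercase hex hashes) and the function name are illustrative assumptions, not requirements stated in this brief.

```typescript
interface ProvenanceRecord {
  id: string;
  result_line_id: string;
  source_api: string;
  source_record_id: string;
  source_url: string;
  query_params: string;
  fetched_at: string;
  payload_hash: string;
  adapter_version: string;
}

// Returns a list of problems; an empty list means the record looks well-formed.
function validateProvenance(r: ProvenanceRecord): string[] {
  const problems: string[] = [];
  if (!/^https:\/\//.test(r.source_url)) {
    problems.push("source_url must be an https URL"); // assumption: https only
  }
  if (r.source_record_id.length === 0) {
    problems.push("source_record_id is required to rebuild links");
  }
  if (!/^[0-9a-f]{64}$/.test(r.payload_hash)) {
    problems.push("payload_hash must be 64 hex characters (SHA-256)");
  }
  if (isNaN(Date.parse(r.fetched_at))) {
    problems.push("fetched_at must be a parseable timestamp");
  }
  try {
    JSON.parse(r.query_params);
  } catch (e) {
    problems.push("query_params must be valid JSON");
  }
  return problems;
}
```

Returning every problem at once, rather than throwing on the first, would let a review UI show a reviewer all the gaps in a single pass.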

Component design

1 — Order & workflow service

2 — Property key resolver

3 — Dataset fetch orchestrator

4 — Reconciliation & provenance engine

5 — Confidence scorer + review queue

6 — PDF report generator

7 — File & search registry

| Table | Purpose |
| --- | --- |
| files | Internal file numbers; link to title-production system ID (phase 2 write-back) |
| searches | Order metadata, BBL, status, requester, timestamps |
| result_lines | Normalized findings with category, severity, verify-link |
| provenance_records | Immutable fetch metadata per line |
| review_events | Reviewer actions, notes, sign-off timestamps |
| pdf_artifacts | R2 keys, version, generated_at |
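
To illustrate why a relational registry suits the status dashboard, here is a sketch of a parameterized query builder over these tables. The column names (`file_number`, `file_id`, `search_id`, `status`) are assumptions for illustration; the brief names the tables but not their columns.

```typescript
interface DashboardFilter {
  fileNumber?: string;
  status?: string;
}

interface SqlQuery {
  sql: string;
  params: string[]; // bound positionally to the "?" placeholders
}

// Builds one file-grouped dashboard query; filters are optional and are
// always passed as bound parameters, never concatenated into the SQL text.
function buildDashboardQuery(f: DashboardFilter): SqlQuery {
  const where: string[] = [];
  const params: string[] = [];
  if (f.fileNumber !== undefined) {
    where.push("f.file_number = ?");
    params.push(f.fileNumber);
  }
  if (f.status !== undefined) {
    where.push("s.status = ?");
    params.push(f.status);
  }
  const sql =
    "SELECT f.file_number, s.id AS search_id, s.status, COUNT(l.id) AS line_count " +
    "FROM files f " +
    "JOIN searches s ON s.file_id = f.id " +
    "LEFT JOIN result_lines l ON l.search_id = s.id " +
    (where.length > 0 ? "WHERE " + where.join(" AND ") + " " : "") +
    "GROUP BY f.file_number, s.id, s.status " +
    "ORDER BY f.file_number";
  return { sql: sql, params: params };
}

const needsReviewQuery = buildDashboardQuery({ status: "needs-review" });
```

The returned `sql` and `params` would then be passed to whatever prepared-statement interface the registry exposes.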

Indicative MVP effort distribution by phase (% of total hours)

Implementation plan

Phase 1 — Foundation & engine wrap (week 1–3)

Provision Workers, D1 schema, R2 buckets, Queues. HTTP adapter wrapping existing data engine with health endpoint. BBL resolver service with PLUTO join. Basic REST API: create search, get status, list by file number.

Risk: Undocumented engine APIs — schedule pairing sessions with internal team in week 1. Rollback: read-only portal against engine without write path.
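
Since every downstream join hangs off the BBL, the resolver would likely need a strict format and parse helper. The sketch below assumes the standard 10-digit form (1-digit borough code 1 to 5, 5-digit block, 4-digit lot); the type and function names are illustrative.

```typescript
interface Bbl {
  borough: number; // 1 Manhattan, 2 Bronx, 3 Brooklyn, 4 Queens, 5 Staten Island
  block: number;   // 0..99999
  lot: number;     // 0..9999
}

function zeroPad(n: number, width: number): string {
  let s = String(n);
  while (s.length < width) s = "0" + s;
  return s;
}

function isIntInRange(n: number, min: number, max: number): boolean {
  return Math.floor(n) === n && n >= min && n <= max;
}

// Formats a parcel key as the 10-digit string used to join datasets.
function formatBbl(b: Bbl): string {
  if (!isIntInRange(b.borough, 1, 5)) throw new Error("borough must be 1..5");
  if (!isIntInRange(b.block, 0, 99999)) throw new Error("block must be 0..99999");
  if (!isIntInRange(b.lot, 0, 9999)) throw new Error("lot must be 0..9999");
  return String(b.borough) + zeroPad(b.block, 5) + zeroPad(b.lot, 4);
}

// Parses a 10-digit BBL string; returns null when the input is malformed.
function parseBbl(s: string): Bbl | null {
  const m = /^([1-5])(\d{5})(\d{4})$/.exec(s);
  if (m === null) return null;
  return { borough: Number(m[1]), block: Number(m[2]), lot: Number(m[3]) };
}

const exampleBbl = formatBbl({ borough: 1, block: 1, lot: 1 }); // "1000010001"
```

Rejecting malformed keys at this boundary would keep a bad geocoder response from fanning out into dozens of empty dataset fetches.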

Phase 2 — Dataset adapter library (week 4–7)

Implement shared adapter interface; port highest-priority datasets (DOB violations, HPD, ECB, tax/lien, COO, zoning, environmental). Shared SODA client with pagination, rate-limit handling, schema version pins. Cron health checks per dataset with Slack alerts on failure.
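
The pagination part of a shared SODA client could look like the sketch below. It relies on the standard SODA query parameters `$where`, `$order`, `$limit`, and `$offset`; the resource ID `xxxx-xxxx` is a placeholder and the `bbl` column name is an assumption, since column names vary by dataset.

```typescript
interface SodaPage {
  limit: number;
  offset: number;
}

// Builds the URL for one page of results. Ordering by the system field :id
// keeps paging stable between requests.
function buildSodaUrl(resourceUrl: string, where: string, page: SodaPage): string {
  const params = [
    "$where=" + encodeURIComponent(where),
    "$order=" + encodeURIComponent(":id"),
    "$limit=" + page.limit,
    "$offset=" + page.offset,
  ];
  return resourceUrl + "?" + params.join("&");
}

// A short page means the dataset is exhausted; otherwise advance the offset.
function nextPage(page: SodaPage, rowsReturned: number): SodaPage | null {
  if (rowsReturned < page.limit) return null;
  return { limit: page.limit, offset: page.offset + page.limit };
}

const firstPageUrl = buildSodaUrl(
  "https://data.cityofnewyork.us/resource/xxxx-xxxx.json", // placeholder dataset ID
  "bbl='1000010001'",                                      // hypothetical column name
  { limit: 1000, offset: 0 }
);
```

A fetch loop would call `nextPage` after each response and stop when it returns `null`; rate-limit handling would wrap that loop.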

Risk: API schema drift — pin expected columns; alert on missing fields before silent empty results. Rollback: disable adapter via feature flag without stopping other datasets.
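
One way to pin expected columns and catch drift before it produces silent empty results is sketched below. It assumes the health check compares a small sample of rows against the adapter's pinned column list; it takes the union of keys across the sample because JSON responses may omit fields that are empty in a given row. All names are illustrative.

```typescript
interface DriftReport {
  missing: string[];    // pinned columns absent from every sampled row
  unexpected: string[]; // columns present in the sample but not pinned
  ok: boolean;          // true when nothing pinned is missing
}

function detectSchemaDrift(
  expectedColumns: string[],
  sampleRows: Array<Record<string, unknown>>
): DriftReport {
  // Union of keys across the sample, so one sparse row does not raise an alert.
  const actual: string[] = [];
  for (const row of sampleRows) {
    for (const key of Object.keys(row)) {
      if (actual.indexOf(key) === -1) actual.push(key);
    }
  }
  const missing = expectedColumns.filter(function (c) { return actual.indexOf(c) === -1; });
  const unexpected = actual.filter(function (c) { return expectedColumns.indexOf(c) === -1; });
  return { missing: missing, unexpected: unexpected, ok: missing.length === 0 };
}

const driftExample = detectSchemaDrift(
  ["bbl", "violation_id", "status"], // hypothetical pinned columns
  [
    { bbl: "1000010001", violation_id: "V1" },
    { bbl: "1000010001", violation_id: "V2", disposition: "x" },
  ]
);
```

A report with a non-empty `missing` list would trigger the alert and, if needed, the per-adapter feature flag; `unexpected` columns are informational only.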

Phase 3 — Portal UI (week 8–10)

React portal: order form with address autocomplete, file-grouped dashboard (green / needs-review / problem), search detail view with result lines and verify-link previews. Cloudflare Access for SSO.

Risk: UX friction slows production team adoption — embed reviewers in weekly UAT. Rollback: API-only mode for power users until UI polished.
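
The file-grouped dashboard needs a rule for rolling several searches up into one file-level color. A minimal sketch, assuming worst-status-wins precedence (problem over needs-review over green); that precedence is an assumption, not something this brief specifies.

```typescript
type SearchStatus = "green" | "needs-review" | "problem";

// Higher rank means more urgent.
const STATUS_RANK: Record<SearchStatus, number> = {
  "green": 0,
  "needs-review": 1,
  "problem": 2,
};

// A file takes the most urgent status among its searches.
// A file with no searches yet reports "green" in this sketch.
function rollupFileStatus(statuses: SearchStatus[]): SearchStatus {
  return statuses.reduce<SearchStatus>(function (worst, s) {
    return STATUS_RANK[s] > STATUS_RANK[worst] ? s : worst;
  }, "green");
}

const fileStatus = rollupFileStatus(["green", "needs-review", "green"]); // "needs-review"
```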

Phase 4 — Reconciliation & provenance (week 11–13)

Normalization rules, conflict detection, immutable provenance store. Verify-link validator (HTTP HEAD check that source URL resolves). Conflict UI showing side-by-side source records.
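
A sketch of how normalization, de-duplication, and conflict detection might fit together. It assumes two lines describe the same finding when they share an agency record ID, and that they conflict when their normalized statuses differ; real rules would be per-dataset, and all names here are hypothetical.

```typescript
interface RawLine {
  source_api: string;
  source_record_id: string;
  status: string;
}

interface Reconciled {
  unique: RawLine[];      // one line per record with no disagreement
  conflicts: RawLine[][]; // groups whose sources disagree on status
}

function normalizeStatus(s: string): string {
  return s.trim().toUpperCase();
}

function reconcileLines(lines: RawLine[]): Reconciled {
  const keys: string[] = [];
  const groups: RawLine[][] = [];
  for (const line of lines) {
    const key = line.source_record_id.trim();
    let i = keys.indexOf(key);
    if (i === -1) {
      keys.push(key);
      groups.push([]);
      i = keys.length - 1;
    }
    // Same record and same normalized status: a duplicate, keep the first.
    const isDuplicate = groups[i].some(function (g) {
      return normalizeStatus(g.status) === normalizeStatus(line.status);
    });
    if (!isDuplicate) groups[i].push(line);
  }
  return {
    unique: groups.filter(function (g) { return g.length === 1; }).map(function (g) { return g[0]; }),
    conflicts: groups.filter(function (g) { return g.length > 1; }),
  };
}

const reconciled = reconcileLines([
  { source_api: "source-a", source_record_id: "A-1", status: "open" },
  { source_api: "source-b", source_record_id: "A-1", status: "OPEN " },  // duplicate
  { source_api: "source-a", source_record_id: "B-2", status: "OPEN" },
  { source_api: "source-b", source_record_id: "B-2", status: "CLOSED" }, // conflict
]);
```

Each entry in `conflicts` carries the disagreeing source records together, which is the shape a side-by-side conflict view would need.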

Risk: Source URLs change format — store agency record ID separately so links can be rebuilt. Rollback: show raw links even if validator fails; flag for manual review.
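
The mitigation and rollback above can be sketched as two small helpers: one rebuilds a verify-link from the stored agency record ID, the other maps a HEAD response to a link state. The `{id}` template convention and the choice of which status codes count as "unknown" are assumptions for illustration.

```typescript
type LinkState = "ok" | "broken" | "unknown";

// Rebuilds a verify-link from a per-source URL template and the stored
// agency record ID, so links survive a change in URL format.
function rebuildVerifyLink(template: string, recordId: string): string {
  return template.split("{id}").join(encodeURIComponent(recordId));
}

// Maps the outcome of a HEAD request to a link state. A null status means
// the request failed before any response arrived (timeout, DNS, and so on).
function classifyHeadStatus(status: number | null): LinkState {
  if (status === null) return "unknown";
  if (status >= 200 && status < 400) return "ok"; // success or redirect
  // Some servers reject HEAD or throttle automated checks; that does not
  // prove the record is gone, so these go to manual review instead.
  if (status === 403 || status === 405 || status === 429) return "unknown";
  return "broken";
}

const rebuiltLink = rebuildVerifyLink("https://agency.example/records/{id}", "V 123/45");
```

Only `ok` links would pass silently; `unknown` and `broken` links would still be shown in the report but flagged for manual review, matching the rollback described above.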

Phase 5 — Review gate & PDF delivery (week 14–16)

Review queue UI, sign-off workflow, confidence scoring v1. Branded PDF template with per-line verify links and file number header. R2 storage + retrieval by file number from portal.

Risk: PDF layout breaks on long violation lists — paginate with continuation sheets. Rollback: HTML report fallback until PDF template stable.
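
A sketch of the review gate, assuming two conditions must both hold before a PDF is generated: every low-confidence line has been reviewed, and a reviewer has signed off on the search as a whole. The 0.8 threshold and all field names are hypothetical.

```typescript
interface ReviewableLine {
  id: string;
  confidence: number; // 0..1, produced by the confidence scorer
  reviewed: boolean;  // true once a reviewer has handled this line
}

interface SearchForReview {
  lines: ReviewableLine[];
  signed_off_by: string | null; // reviewer ID, or null before sign-off
}

const REVIEW_THRESHOLD = 0.8; // hypothetical cut-off for "low confidence"

// Lines that still block finalization.
function linesNeedingReview(search: SearchForReview, threshold: number): ReviewableLine[] {
  return search.lines.filter(function (l) {
    return l.confidence < threshold && !l.reviewed;
  });
}

// Nothing finalizes without sign-off, and sign-off alone is not enough
// while low-confidence lines remain unreviewed.
function canFinalize(search: SearchForReview, threshold: number): boolean {
  return search.signed_off_by !== null && linesNeedingReview(search, threshold).length === 0;
}

const pendingSearch: SearchForReview = {
  lines: [
    { id: "l1", confidence: 0.95, reviewed: false },
    { id: "l2", confidence: 0.55, reviewed: false },
  ],
  signed_off_by: null,
};
const blockedLines = linesNeedingReview(pendingSearch, REVIEW_THRESHOLD); // only l2
```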

Phase 6 — Hardening & title-chain prep (week 17–20)

Load testing on 80+ parallel fetches, ops dashboards, runbooks, on-call alerts for dataset outages. Document adapter-add guide for internal team. Spike ACRIS and title-chain data sources for phase 2 roadmap; draft write-back API spec for title-production software.

Risk: Silent pipeline degradation over time — mandatory weekly dataset health report. Rollback: vendor fallback path for critical search types until confidence KPIs met.

Reporting & ops

| Signal | Source | Cadence |
| --- | --- | --- |
| Search completion time | searches timestamps | Real-time dashboard |
| Per-dataset fetch success rate | Adapter Worker logs | Daily; alert if below 95% |
| Review queue depth | review_events | Alert if > N items or > 24h unreviewed |
| Verify-link broken rate | Link validator cron job | Weekly trend report |
| API schema drift | Health check cron | Immediate Slack alert |
| PDF generation failures | Worker error logs | Real-time alert |
| Vendor vs in-house cost | Order volume × vendor rate | Monthly leadership summary |

Ops cadence would include a weekly 30-minute pipeline standup (failed datasets, review backlog, schema changes) and a monthly review with leadership (search volume, turnaround time, vendor cost avoided, phase 2 readiness). The on-call rotation would cover production Worker errors and dataset-wide outages only, not per-search alerts.
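
The daily fetch-success check in the table above could be computed along these lines. The 95% threshold comes from the table; the counter shape and the decision to skip datasets with no attempts are assumptions.

```typescript
interface FetchStats {
  dataset_id: string;
  attempts: number;
  successes: number;
}

// Returns the IDs of datasets whose success rate fell below the threshold.
// Datasets with no attempts in the window are skipped here; a real report
// might list them separately, since "no traffic" is not the same as "healthy".
function datasetsBelowThreshold(stats: FetchStats[], threshold: number): string[] {
  return stats
    .filter(function (s) {
      return s.attempts > 0 && s.successes / s.attempts < threshold;
    })
    .map(function (s) { return s.dataset_id; });
}

const alertIds = datasetsBelowThreshold(
  [
    { dataset_id: "dataset-a", attempts: 100, successes: 96 },
    { dataset_id: "dataset-b", attempts: 100, successes: 94 },
    { dataset_id: "dataset-c", attempts: 0, successes: 0 },
  ],
  0.95
);
```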

Proposed deliverables

Following the phased plan, a build would ship these artifacts:

Effort estimate

Indicative effort for MVP through phase 6 (assumes access to existing engine, API credentials for NYC data sources, and production team available for UAT):

| Scope | Hours (range) |
| --- | --- |
| Phase 1 MVP (implementation phases 1–5) | 280–380 hrs |
| Implementation phase 6: hardening & title-chain spike | 40–60 hrs |
| Title-chain module (later phase) | 120–180 hrs |
| Write-back + Slack workflows (later phase) | 60–90 hrs |
| Ongoing maintenance (adapter updates, health monitoring) | 10–16 hrs/month |

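At 25–30 hrs/week, the 280–380 hours for phases 1–5 would take roughly 9–15 weeks (280 ÷ 30 ≈ 9.3; 380 ÷ 25 = 15.2), which fits inside the 16-week window of the implementation plan. Milestone-based pricing aligned to phases 1–5 would de-risk the engagement for both parties: each phase ships a testable increment before the next begins.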

Glossary

| Term | Meaning |
| --- | --- |
| BBL | Borough-Block-Lot: NYC's canonical parcel identifier tying datasets together |
| PLUTO | Primary Land Use Tax Lot Output: NYC tax lot dataset with building and land use attributes |
| ACRIS | Automated City Register Information System: recorded documents (deeds, mortgages, etc.) |
| SODA / Socrata | Socrata Open Data API: the API platform behind NYC Open Data, used by most municipal datasets |
| COO | Certificate of Occupancy: DOB document confirming legal use of a building |
| HPD / DOB / ECB | Department of Housing Preservation and Development, Department of Buildings, Environmental Control Board: major violation systems |
| Provenance | Metadata tracing each result line to its official source fetch (API, query, timestamp, hash) |
| Verify-link | Live URL to the official government record, giving one-click human confirmation of any line item |
| HITL | Human-in-the-loop: reviewer sign-off gate before a search is finalized and delivered |
| Geoclient | NYC official geocoding service for address → BBL resolution |