Production Property-Records Search Platform for NYC Open Data
A proposed property records search platform for a New York real-estate services company that already pulls municipal data from NYC's open-data ecosystem. The build would wrap the existing data engine in a production layer — an internal ordering portal, a reconciliation pipeline across 80+ official datasets with verify-links on every result line, human-in-the-loop review before finalization, and branded PDF delivery tracked by internal file number.
Proposed outcome: A daily-use internal product that replaces third-party vendor searches for NYC municipal/departmental records — turning raw open data into certified, client-ready reports with one-click source verification on every line item.
Scenario
This brief describes a proposed solution — not a delivered engagement. It maps a pattern common among title and settlement shops: proven internal data pulls, but no production portal to order, verify, and deliver searches in-house.
- Organization: Established NYC real-estate services company with two decades of title and settlement experience and a serious internal automation stack (AI pipelines, CRM intelligence, financial systems)
- Existing asset: Working data engine that already queries NYC open-data sources; the build extends this engine rather than replacing it
- Owner profile: Expert full-stack engineer (or tight lead + helper) owning backend pipeline and internal portal UI; long-term multi-phase partnership
- Geography: NYC five boroughs for MVP; surrounding counties in later phases (fundamentally different data model — more intake automation than direct API pulls)
- Users: Production team ordering searches against properties and internal file numbers, reviewing results, signing off before delivery
- Liability context: Search outputs carry certification implications, so auto-publishing without a human review gate is unacceptable
Problem
The company pays third-party vendors for municipal and title searches. The underlying data is largely free from official NYC and NY State sources, and internal proof-of-concept pulls already work. What is missing is the production layer around that data.
- Vendor cost and latency — recurring fees for data accessible directly; turnaround tied to external vendor queues
- No unified workflow — orders, status, review, and delivery scattered across email, vendor portals, and spreadsheets
- Messy government data — 80+ datasets with inconsistent schemas, pagination, rate limits, stale records, and occasional errors
- Provenance gap — vendor PDFs rarely expose live links back to the official source record for line-by-line verification
- Property key fragmentation — datasets join on address, BBL (borough/block/lot), parcel ID, or agency-specific identifiers inconsistently
- Liability without gates — publishing a search without human review and confidence scoring creates certification risk
- No file-number spine — internal title-production software tracks files; search results must map to that identifier for write-back in later phases
Requirements
Functional (Phase 1 MVP)
- Ordering portal — order a municipal/departmental search against a property and internal file number; select search types (COO, violations across DOB/HPD/ECB, tax/water/sewer, fire, housing, zoning, environmental, etc.)
- BBL resolution spine — address → geocode → borough/block/lot as the canonical property key tying every dataset
- Data pipeline — resilient async jobs querying Socrata/SODA, ArcGIS REST, and agency record systems; normalize into one result set per search
- Verify-link on every line — each result item includes a live URL to the official source record (non-negotiable product requirement)
- Reconciliation — merge duplicates, flag conflicts between sources, preserve raw provenance
- Status dashboard — searches grouped under files with green / needs-review / problem states
- Human-in-the-loop review — confidence scoring + sign-off gate before finalization
- Branded PDF reports — dynamic generation stored and retrievable by file number
- Audit trail — who ordered, when data was fetched, reviewer sign-off, PDF version history
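The data pipeline requirement above depends on paging through SODA endpoints without dropping rows. A minimal sketch, assuming SODA's standard `$where`, `$limit`, `$offset`, and `$order` query parameters; the base URL, the `fetchPage` callback, the filter column, and the page-size and page-cap defaults are hypothetical placeholders rather than values from the existing engine:

```typescript
// Injected so the pager can be exercised without network access.
type PageFetcher = (url: string) => Promise<unknown[]>;

// Builds one page URL. Ordering by :id keeps paging stable between requests.
function sodaPageUrl(baseUrl: string, where: string, limit: number, offset: number): string {
  const params = new URLSearchParams({
    "$where": where,
    "$limit": String(limit),
    "$offset": String(offset),
    "$order": ":id",
  });
  return `${baseUrl}?${params.toString()}`;
}

// Pulls pages until a short page signals the end of the result set.
// maxPages is a safety cap (assumed value) so a misbehaving endpoint cannot loop forever.
async function fetchAllRows(
  baseUrl: string,
  where: string,
  fetchPage: PageFetcher,
  pageSize = 1000,
  maxPages = 50,
): Promise<unknown[]> {
  const rows: unknown[] = [];
  for (let page = 0; page < maxPages; page++) {
    const batch = await fetchPage(sodaPageUrl(baseUrl, where, pageSize, page * pageSize));
    rows.push(...batch);
    if (batch.length < pageSize) return rows;
  }
  throw new Error(`Exceeded ${maxPages} pages for ${baseUrl}`);
}
```

Throwing when the page cap is hit, rather than returning a truncated list, keeps a partial pull from being mistaken for a complete search.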
Functional (later phases)
- Title-chain module — recorded documents, judgments, liens, lis pendens, tax warrants, estate/probate, owner-name searches
- Write-back into existing title-production software via its API
- Slack-based approval and notification workflows
- Surrounding county expansion where no unified open-data API exists
Non-functional
- Reliability — 80+ external APIs must not silently break; health checks, schema drift alerts, circuit breakers
- Provenance — immutable fetch metadata (source API, query, timestamp, raw payload hash) per result line
- Security — role-scoped portal access; secrets outside repo; R2 signed URLs for PDFs
- Performance — queue-backed parallel fetches; search completes in minutes, not hours
- Observability — per-dataset success rates, latency, error taxonomy
- Extensibility — new datasets added via config + adapter module, not forked pipeline code
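The extensibility requirement above could be met with a small registry that selects adapters from configuration. This is a sketch only: `AdapterConfig`, `DatasetAdapter`, `AdapterRegistry`, and the field names are hypothetical, and the `enabled` flag stands in for whatever feature-flag mechanism the build adopts.

```typescript
type AdapterFamily = "soda" | "arcgis" | "engine";

// Per-dataset configuration; adding a dataset means adding one of these plus a module.
interface AdapterConfig {
  id: string;
  family: AdapterFamily;
  searchTypes: string[];   // search types this dataset contributes to
  enabled: boolean;        // feature flag: disable one dataset without stopping others
  adapterVersion: string;  // pins the schema expectations used at fetch time
}

interface DatasetAdapter {
  config: AdapterConfig;
  fetchByBbl(bbl: string): Promise<unknown[]>;
}

class AdapterRegistry {
  private readonly adapters = new Map<string, DatasetAdapter>();

  register(adapter: DatasetAdapter): void {
    if (this.adapters.has(adapter.config.id)) {
      throw new Error(`Duplicate adapter id: ${adapter.config.id}`);
    }
    this.adapters.set(adapter.config.id, adapter);
  }

  // Returns only enabled adapters that serve the requested search type.
  forSearchType(searchType: string): DatasetAdapter[] {
    return Array.from(this.adapters.values()).filter(
      (a) => a.config.enabled && a.config.searchTypes.includes(searchType),
    );
  }
}
```

The orchestrator would then iterate over `forSearchType(...)` instead of branching on dataset names, so pipeline code stays unchanged when a dataset is added or disabled.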
Architecture
Three tiers: a portal + workflow layer (React + Workers API), an async data plane (Queues + fetch workers wrapping the existing engine), and a provenance + delivery layer (D1 registry, R2 PDFs, review queue).
Platform architecture — portal and Workers API orchestrate BBL resolution, queue-backed fetches via the existing engine, reconciliation with provenance, and PDF delivery
Search execution — parallel queue fetches, reconciliation with confidence scoring, mandatory review gate, then PDF delivery with verify-links
Component map by platform tier (major services per layer)
End-to-end flow
Municipal search lifecycle — every result line carries a verify-link; nothing finalizes without reviewer sign-off
Indicative phase-1 dataset coverage by category (% of 80+ source adapters)
Typical search lifecycle — median vs p95 duration by stage (illustrative)
Recommended stack
Recommendation: Cloudflare Workers for compute, D1 for the relational registry, R2 for PDFs and raw snapshots, Queues for parallel dataset fetches, and a React portal on Cloudflare Pages — wrapping the existing data engine via adapter, not rebuild.
| Layer | Technology | Why |
|---|---|---|
| Compute | Cloudflare Workers | Edge-native, consistent with client's newer systems; fast cold start for API routes and webhook handlers |
| Database | D1 (SQLite) | Search orders, file registry, result lines, provenance, review state — relational model fits audit and dashboard queries |
| Object storage | R2 | PDF reports, raw fetch snapshots, large JSON payloads; no egress fees to Workers |
| Async jobs | Queues + Cron Triggers | Parallel dataset pulls; scheduled health checks and schema drift detection |
| Frontend | React (Vite) on Pages | Internal portal — order form, file-grouped dashboard, review UI with verify-link previews |
| PDF generation | React-PDF or Browser Rendering API | Branded templates with clickable verify-link per result line |
| Geocoding | NYC Geoclient / GeoSearch + PLUTO | Official NYC address → BBL resolution; tax lot attributes for cross-dataset joins |
| Existing engine | Worker service binding or HTTP adapter | Preserve proven pull logic — wrap, do not rewrite from scratch |
| Auth | Cloudflare Access | SSO for internal production team; no custom auth to maintain |
Why Cloudflare over AWS Lambda? The organization has standardized on Cloudflare for newer systems — Workers, D1, R2, and Queues co-locate without cross-service latency for metadata joins. AWS Lambda + DynamoDB + S3 remains viable for teams with deep existing investment, but stack consistency reduces ops surface here.
Why D1 over DynamoDB? Search results are inherently relational (files → searches → result_lines → provenance_records). SQL filters for the status dashboard and certification audit reports are simpler than single-table Dynamo patterns.
Provenance record shape (every result line would carry this metadata in D1):
    interface ProvenanceRecord {
      id: string;
      result_line_id: string;
      source_api: string;        // e.g. "data.cityofnewyork.us/resource/..."
      source_record_id: string;  // agency-native ID for URL rebuild
      source_url: string;        // live verify-link (required)
      query_params: string;      // JSON: BBL, address, date range used
      fetched_at: string;        // ISO timestamp
      payload_hash: string;      // SHA-256 of raw response
      adapter_version: string;   // pin schema expectations
    }
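One way the `payload_hash` and `fetched_at` fields could be populated at fetch time is sketched below. It assumes the Web Crypto API (`crypto.subtle`, `crypto.randomUUID`) is available, as it is in Workers and recent Node releases; `FetchContext`, `sha256Hex`, and `buildProvenance` are hypothetical names, not part of the existing engine.

```typescript
interface ProvenanceRecord {
  id: string;
  result_line_id: string;
  source_api: string;
  source_record_id: string;
  source_url: string;
  query_params: string;
  fetched_at: string;
  payload_hash: string;
  adapter_version: string;
}

// What an adapter would know at the moment it fetched (hypothetical shape).
interface FetchContext {
  result_line_id: string;
  source_api: string;
  source_record_id: string;
  source_url: string;
  query: Record<string, string>;
  adapter_version: string;
}

// Hex-encoded SHA-256 of the raw response text.
async function sha256Hex(text: string): Promise<string> {
  const digest = await crypto.subtle.digest("SHA-256", new TextEncoder().encode(text));
  return Array.from(new Uint8Array(digest))
    .map((b) => b.toString(16).padStart(2, "0"))
    .join("");
}

// Builds a frozen record; a missing verify-link is rejected outright.
async function buildProvenance(
  ctx: FetchContext,
  rawPayload: string,
  now: Date = new Date(),
): Promise<Readonly<ProvenanceRecord>> {
  if (!ctx.source_url) throw new Error("source_url (verify-link) is required");
  return Object.freeze({
    id: crypto.randomUUID(),
    result_line_id: ctx.result_line_id,
    source_api: ctx.source_api,
    source_record_id: ctx.source_record_id,
    source_url: ctx.source_url,
    query_params: JSON.stringify(ctx.query),
    fetched_at: now.toISOString(),
    payload_hash: await sha256Hex(rawPayload),
    adapter_version: ctx.adapter_version,
  });
}
```

Hashing the raw response before any normalization means a later audit can confirm that the stored snapshot in R2 is the payload the result line was derived from.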
Component design
1 — Order & workflow service
- Input: file number, address or BBL, search type checklist, requester ID
- Output: `search_id`, status transitions (queued→fetching→reconciling→review→approved→delivered)
- QA gate: validate file number format; reject duplicate in-flight search for same file + property + type
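The status transitions and the duplicate check could be sketched as follows. Two assumptions are made here that the brief does not state: transitions are strictly linear (failure and re-run paths are left out), and "in-flight" means any status short of `delivered`. The `SearchOrder` shape is hypothetical.

```typescript
type SearchStatus = "queued" | "fetching" | "reconciling" | "review" | "approved" | "delivered";

// Each status has exactly one legal successor; delivered is terminal.
const NEXT_STATUS: Record<SearchStatus, SearchStatus | null> = {
  queued: "fetching",
  fetching: "reconciling",
  reconciling: "review",
  review: "approved",
  approved: "delivered",
  delivered: null,
};

function canTransition(from: SearchStatus, to: SearchStatus): boolean {
  return NEXT_STATUS[from] === to;
}

interface SearchOrder {
  fileNumber: string;
  bbl: string;
  searchType: string;
  status: SearchStatus;
}

// True when an undelivered search already exists for the same file + property + type.
function isDuplicateInFlight(
  existing: SearchOrder[],
  candidate: Omit<SearchOrder, "status">,
): boolean {
  return existing.some(
    (o) =>
      o.status !== "delivered" &&
      o.fileNumber === candidate.fileNumber &&
      o.bbl === candidate.bbl &&
      o.searchType === candidate.searchType,
  );
}
```

Because `review` can only move to `approved`, no code path reaches `delivered` without passing through the review state.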
2 — Property key resolver
- Input: street address + borough, or raw BBL
- Output: normalized BBL, lat/lon, PLUTO attributes (building class, units, land use)
- Tools: Geoclient API, PLUTO dataset join, fallback manual BBL entry flagged for reviewer
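Normalizing a raw BBL before it is used as a join key might look like the sketch below. It relies on the standard 10-digit layout (1-digit borough code 1 to 5, 5-digit block, 4-digit lot); `parseBbl` and `formatBbl` are hypothetical helper names, and accepting dash, slash, or space separators is an assumption about how users would type the value.

```typescript
interface Bbl {
  borough: number; // 1 Manhattan, 2 Bronx, 3 Brooklyn, 4 Queens, 5 Staten Island
  block: number;
  lot: number;
}

// Accepts "1000130001" or separated forms such as "1-13-1"; returns null if invalid.
function parseBbl(raw: string): Bbl | null {
  const trimmed = raw.trim();
  const parts = trimmed.split(/[-\/\s]+/);
  let digits = trimmed;
  if (parts.length === 3 && parts.every((p) => /^\d+$/.test(p))) {
    // Left-pad block and lot so separated input maps onto the 10-digit form.
    digits = parts[0] + parts[1].padStart(5, "0") + parts[2].padStart(4, "0");
  }
  const m = /^([1-5])(\d{5})(\d{4})$/.exec(digits);
  if (!m) return null;
  return { borough: Number(m[1]), block: Number(m[2]), lot: Number(m[3]) };
}

// Canonical 10-digit string used as the cross-dataset key.
function formatBbl(bbl: Bbl): string {
  return `${bbl.borough}${String(bbl.block).padStart(5, "0")}${String(bbl.lot).padStart(4, "0")}`;
}
```

A `null` result would route the order to the manual-entry fallback rather than letting a malformed key reach the adapters.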
3 — Dataset fetch orchestrator
- Input: BBL + enabled dataset adapters for search type
- Output: raw payloads per adapter; fetch metadata logged to D1
- Pattern: Queue consumer per adapter family (SODA batch, ArcGIS REST, engine-backed) with shared retry, backoff, and circuit-breaker config
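The shared retry, backoff, and circuit-breaker config could take a shape like the one below. This is an illustrative sketch: the thresholds, cooldown, and delay values are assumed defaults, and the clock is passed in explicitly so the logic stays testable inside a queue consumer.

```typescript
// Exponential backoff with a cap: 500 ms, 1 s, 2 s, ... up to capMs.
function backoffDelayMs(attempt: number, baseMs = 500, capMs = 30_000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Opens after `threshold` consecutive failures, then allows a probe after `cooldownMs`.
class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;
  private readonly threshold: number;
  private readonly cooldownMs: number;

  constructor(threshold = 5, cooldownMs = 60_000) {
    this.threshold = threshold;
    this.cooldownMs = cooldownMs;
  }

  // Closed circuit always allows; open circuit allows only once the cooldown has elapsed.
  allowRequest(nowMs: number): boolean {
    if (this.openedAt === null) return true;
    return nowMs - this.openedAt >= this.cooldownMs;
  }

  recordSuccess(): void {
    this.failures = 0;
    this.openedAt = null;
  }

  // A failed probe after cooldown re-opens the circuit from the new timestamp.
  recordFailure(nowMs: number): void {
    this.failures += 1;
    if (this.failures >= this.threshold) this.openedAt = nowMs;
  }
}
```

Keeping one breaker per dataset, rather than per adapter family, would let a single failing source be isolated while the rest of the search proceeds.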
4 — Reconciliation & provenance engine
- Input: raw adapter results
- Output: normalized `result_lines` with `source_url`, `source_api`, `fetched_at`, `confidence_factors`
- Rules: dedupe by agency record ID; flag address mismatches against BBL; never drop provenance on merge
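The three rules above could be sketched as a single merge pass. The shapes here are hypothetical: `RawFinding.resolvedBbl` is assumed to be the BBL that the source record's own address resolves to (computed upstream), and a real key would likely combine agency and record ID to avoid collisions between agencies.

```typescript
interface RawFinding {
  agencyRecordId: string;
  sourceApi: string;
  sourceUrl: string;
  fetchedAt: string;
  resolvedBbl: string; // BBL the source record's address resolves to (assumed upstream step)
}

interface SourceRef {
  sourceApi: string;
  sourceUrl: string;
  fetchedAt: string;
}

interface ResultLine {
  agencyRecordId: string;
  sources: SourceRef[];  // every contributing source is retained on merge
  conflicts: string[];   // human-readable flags for the review queue
}

function reconcile(findings: RawFinding[], orderBbl: string): ResultLine[] {
  const byId = new Map<string, ResultLine>();
  for (const f of findings) {
    let line = byId.get(f.agencyRecordId);
    if (!line) {
      line = { agencyRecordId: f.agencyRecordId, sources: [], conflicts: [] };
      byId.set(f.agencyRecordId, line);
    }
    // Merge keeps provenance: duplicates add a source instead of being discarded.
    line.sources.push({ sourceApi: f.sourceApi, sourceUrl: f.sourceUrl, fetchedAt: f.fetchedAt });
    if (f.resolvedBbl !== orderBbl) {
      line.conflicts.push(`${f.sourceApi}: record resolves to BBL ${f.resolvedBbl}, order is ${orderBbl}`);
    }
  }
  return Array.from(byId.values());
}
```

Any line with a non-empty `conflicts` array would be a candidate for the side-by-side conflict view rather than being silently resolved.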
5 — Confidence scorer + review queue
- Input: reconciled result set
- Output: per-line and per-search confidence score; route conflicts and low-confidence items to review inbox
- QA gate: no PDF generation until reviewer sign-off — search outputs carry liability
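A first version of the routing and the sign-off gate might look like this. The 0.8 threshold, taking the search score as the minimum line score, and treating an empty result set as zero confidence (since it may indicate a silent adapter failure) are all assumptions for illustration; the names are hypothetical.

```typescript
interface ScoredLine {
  id: string;
  confidence: number; // 0..1
  hasConflict: boolean;
}

interface ReviewPlan {
  searchConfidence: number;
  reviewInbox: string[]; // line IDs a reviewer must look at individually
}

function planReview(lines: ScoredLine[], threshold = 0.8): ReviewPlan {
  const reviewInbox = lines
    .filter((l) => l.hasConflict || l.confidence < threshold)
    .map((l) => l.id);
  // The weakest line bounds the whole search; no lines at all scores zero.
  const searchConfidence =
    lines.length === 0 ? 0 : Math.min(...lines.map((l) => l.confidence));
  return { searchConfidence, reviewInbox };
}

interface SignOff {
  reviewerId: string;
  signedAt: string;
}

// The gate depends only on sign-off: a high score never bypasses the reviewer.
function canGeneratePdf(signOff: SignOff | null): boolean {
  return signOff !== null;
}
```

Scores only prioritize the reviewer's attention here; the PDF gate ignores them entirely, which matches the requirement that nothing finalizes without sign-off.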
6 — PDF report generator
- Input: approved result set + branding template
- Output: PDF in R2; verify-link rendered as clickable URL per line
- Versioning: new PDF version on re-run; prior versions retained for audit
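The versioning rule could be implemented by deriving the next version from the artifacts already recorded for a search. The R2 key layout (`reports/<file>/<search>/v<n>.pdf`) and the `PdfArtifact` field names are hypothetical, and the sketch assumes file numbers contain no path separators.

```typescript
interface PdfArtifact {
  fileNumber: string;
  searchId: string;
  version: number;
  r2Key: string;
  generatedAt: string;
}

// Computes the next artifact row; prior versions are never overwritten.
function nextPdfArtifact(
  existing: PdfArtifact[],
  fileNumber: string,
  searchId: string,
  now: Date = new Date(),
): PdfArtifact {
  const latest = existing
    .filter((a) => a.searchId === searchId)
    .reduce((max, a) => Math.max(max, a.version), 0);
  const version = latest + 1;
  return {
    fileNumber,
    searchId,
    version,
    r2Key: `reports/${fileNumber}/${searchId}/v${version}.pdf`,
    generatedAt: now.toISOString(),
  };
}
```

Prefixing keys with the file number keeps retrieval by file number a simple prefix listing in the bucket.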
7 — File & search registry
| Table | Purpose |
|---|---|
| `files` | Internal file numbers; link to title-production system ID (phase 2 write-back) |
| `searches` | Order metadata, BBL, status, requester, timestamps |
| `result_lines` | Normalized findings with category, severity, verify-link |
| `provenance_records` | Immutable fetch metadata per line |
| `review_events` | Reviewer actions, notes, sign-off timestamps |
| `pdf_artifacts` | R2 keys, version, generated_at |
Indicative MVP effort distribution by phase (% of total hours)
Implementation plan
Phase 1 — Foundation & engine wrap (week 1–3)
Provision Workers, D1 schema, R2 buckets, Queues. HTTP adapter wrapping existing data engine with health endpoint. BBL resolver service with PLUTO join. Basic REST API: create search, get status, list by file number.
Risk: Undocumented engine APIs — schedule pairing sessions with internal team in week 1. Rollback: read-only portal against engine without write path.
Phase 2 — Dataset adapter library (week 4–7)
Implement shared adapter interface; port highest-priority datasets (DOB violations, HPD, ECB, tax/lien, COO, zoning, environmental). Shared SODA client with pagination, rate-limit handling, schema version pins. Cron health checks per dataset with Slack alerts on failure.
Risk: API schema drift — pin expected columns; alert on missing fields before silent empty results. Rollback: disable adapter via feature flag without stopping other datasets.
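The column-pinning check described in the risk note could be sketched as below. The `SchemaPin` shape and column names are hypothetical. Keys are unioned across all sampled rows because some JSON APIs omit null-valued fields from individual rows, so a single row is not a reliable schema sample.

```typescript
interface SchemaPin {
  dataset: string;
  requiredColumns: string[];
}

interface DriftReport {
  dataset: string;
  missing: string[];    // pinned columns not seen in any sampled row
  emptySample: boolean; // no rows at all: cannot judge, but worth surfacing
}

function detectSchemaDrift(pin: SchemaPin, rows: Array<Record<string, unknown>>): DriftReport {
  const seen = new Set<string>();
  for (const row of rows) {
    for (const key of Object.keys(row)) seen.add(key);
  }
  // With no rows there is nothing to compare, so report the empty sample instead.
  const missing =
    rows.length === 0 ? [] : pin.requiredColumns.filter((c) => !seen.has(c));
  return { dataset: pin.dataset, missing, emptySample: rows.length === 0 };
}
```

A health-check cron could run this against a known-populated sample query per dataset and alert when `missing` is non-empty or the sample unexpectedly comes back empty.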
Phase 3 — Portal UI (week 8–10)
React portal: order form with address autocomplete, file-grouped dashboard (green / needs-review / problem), search detail view with result lines and verify-link previews. Cloudflare Access for SSO.
Risk: UX friction slows production team adoption — embed reviewers in weekly UAT. Rollback: API-only mode for power users until UI polished.
Phase 4 — Reconciliation & provenance (week 11–13)
Normalization rules, conflict detection, immutable provenance store. Verify-link validator (HTTP HEAD check that source URL resolves). Conflict UI showing side-by-side source records.
Risk: Source URLs change format — store agency record ID separately so links can be rebuilt. Rollback: show raw links even if validator fails; flag for manual review.
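The validator and the rebuild-from-record-ID mitigation could be sketched as follows. The URL template syntax (`{id}`), the `HeadFn` injection point, and the result shape are hypothetical; `headWithFetch` uses the standard `fetch` API with `method: "HEAD"`.

```typescript
// Injected so the validator can be tested without network access.
type HeadFn = (url: string) => Promise<{ ok: boolean; status: number }>;

interface LinkCheck {
  url: string;               // always kept, even when the check fails
  resolves: boolean;
  needsManualReview: boolean;
  detail: string;
}

// Rebuilds a verify-link from a per-agency template and the stored record ID.
function buildVerifyLink(template: string, recordId: string): string {
  return template.replace("{id}", encodeURIComponent(recordId));
}

async function checkVerifyLink(url: string, head: HeadFn): Promise<LinkCheck> {
  try {
    const res = await head(url);
    return {
      url,
      resolves: res.ok,
      needsManualReview: !res.ok,
      detail: `HTTP ${res.status}`,
    };
  } catch (err) {
    // Network errors do not hide the link; they flag it for a human.
    const detail = err instanceof Error ? err.message : String(err);
    return { url, resolves: false, needsManualReview: true, detail };
  }
}

// Default transport using the platform fetch.
const headWithFetch: HeadFn = async (url) => {
  const res = await fetch(url, { method: "HEAD", redirect: "follow" });
  return { ok: res.ok, status: res.status };
};
```

Some servers reject HEAD requests, so a production validator would probably need a GET fallback before marking a link as broken.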
Phase 5 — Review gate & PDF delivery (week 14–16)
Review queue UI, sign-off workflow, confidence scoring v1. Branded PDF template with per-line verify links and file number header. R2 storage + retrieval by file number from portal.
Risk: PDF layout breaks on long violation lists — paginate with continuation sheets. Rollback: HTML report fallback until PDF template stable.
Phase 6 — Hardening & title-chain prep (week 17–20)
Load testing on 80+ parallel fetches, ops dashboards, runbooks, on-call alerts for dataset outages. Document adapter-add guide for internal team. Spike ACRIS and title-chain data sources for phase 2 roadmap; draft write-back API spec for title-production software.
Risk: Silent pipeline degradation over time — mandatory weekly dataset health report. Rollback: vendor fallback path for critical search types until confidence KPIs met.
Reporting & ops
| Signal | Source | Cadence |
|---|---|---|
| Search completion time | searches timestamps | Real-time dashboard |
| Per-dataset fetch success rate | Adapter Worker logs | Daily; alert if below 95% |
| Review queue depth | review_events | Alert if > N items or > 24h unreviewed |
| Verify-link broken rate | Link validator cron job | Weekly trend report |
| API schema drift | Health check cron | Immediate Slack alert |
| PDF generation failures | Worker error logs | Real-time alert |
| Vendor vs in-house cost | Order volume × vendor rate | Monthly leadership summary |
Ops cadence would include a weekly 30-minute pipeline standup (failed datasets, review backlog, schema changes) and a monthly review with leadership (search volume, turnaround time, vendor cost avoided, phase 2 readiness). The on-call rotation would cover production Worker errors and dataset-wide outages only, not per-search alerts.
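The per-dataset success-rate signal from the table could be computed from fetch outcomes as sketched here. The `FetchOutcome` shape is hypothetical; the 95% default mirrors the alert threshold in the table, and "below" is read strictly, so a dataset at exactly 95% does not alert.

```typescript
interface FetchOutcome {
  dataset: string;
  ok: boolean;
}

interface DatasetHealth {
  dataset: string;
  attempts: number;
  successRate: number; // 0..1
}

// Returns datasets whose success rate is strictly below the threshold, worst first.
function datasetsBelowThreshold(outcomes: FetchOutcome[], threshold = 0.95): DatasetHealth[] {
  const tally = new Map<string, { ok: number; total: number }>();
  for (const o of outcomes) {
    const t = tally.get(o.dataset) ?? { ok: 0, total: 0 };
    t.total += 1;
    if (o.ok) t.ok += 1;
    tally.set(o.dataset, t);
  }
  return Array.from(tally.entries())
    .map(([dataset, t]) => ({ dataset, attempts: t.total, successRate: t.ok / t.total }))
    .filter((h) => h.successRate < threshold)
    .sort((a, b) => a.successRate - b.successRate);
}
```

Including `attempts` in the output lets the daily report distinguish a dataset that failed once out of two calls from one that is failing at volume.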
Proposed deliverables
Following the phased plan, a build would ship these artifacts:
- Cloudflare Workers API with D1 schema, R2 buckets, and Queue topology documented
- Adapter library for 80+ NYC open-data sources with shared provenance model
- BBL resolution service (Geoclient + PLUTO join)
- HTTP adapter layer wrapping the existing data engine (no from-scratch rebuild)
- React internal portal — order form, file-grouped status dashboard, review UI, PDF download
- Human-in-the-loop review queue with confidence scoring and sign-off audit trail
- Branded PDF report generator with verify-link on every result line
- Ops dashboard and per-dataset health monitoring with schema drift alerts
- Runbooks: add new dataset adapter, handle API outage, reviewer workflow, credential rotation
- Integration specification for title-production write-back (phase 2)
Effort estimate
Indicative effort for MVP through phase 6 (assumes access to existing engine, API credentials for NYC data sources, and production team available for UAT):
| Scope | Hours (range) |
|---|---|
| MVP (implementation phases 1–5) | 280–380 hrs |
| Phase 6 hardening & title-chain spike | 40–60 hrs |
| Title-chain module (later phase) | 120–180 hrs |
| Write-back + Slack workflows (later phase) | 60–90 hrs |
| Ongoing maintenance (adapter updates, health monitoring) | 10–16 hrs/month |
At 25–30 hrs/week, MVP delivery would land in roughly 10–14 weeks. Milestone-based pricing aligned to phases 1–5 would de-risk the engagement for both parties — each phase ships a testable increment before the next begins.
Glossary
| Term | Meaning |
|---|---|
| BBL | Borough-Block-Lot — NYC's canonical parcel identifier tying datasets together |
| PLUTO | Primary Land Use Tax Lot Output — NYC tax lot dataset with building and land use attributes |
| ACRIS | Automated City Register Information System — recorded documents (deeds, mortgages, etc.) |
| SODA / Socrata | Socrata Open Data API (SODA): the API platform behind NYC Open Data, used by most municipal datasets |
| COO | Certificate of Occupancy — DOB document confirming legal use of a building |
| HPD / DOB / ECB | Department of Housing Preservation and Development, Department of Buildings, Environmental Control Board: major violation systems |
| Provenance | Metadata tracing each result line to its official source fetch (API, query, timestamp, hash) |
| Verify-link | Live URL to the official government record — one-click human confirmation of any line item |
| HITL | Human-in-the-loop — reviewer sign-off gate before a search is finalized and delivered |
| Geoclient | NYC official geocoding service for address → BBL resolution |