Accuracy benchmark

Extraction Accuracy Benchmark — v1

Run date: 2026-05-25 · Pipeline version: main branch

Status: First publishable benchmark. 14 of 15 runnable fixtures PASS on real broker OMs from a 22-fixture corpus across 5 CRE asset classes. The single failure is documented LLM non-determinism (±1 unit on a borderline rent-roll row), not a deal-defect class error. Methodology, corpus, and defect log included so buyer diligence can independently re-verify.

TL;DR

Fixtures runnable on real PDFs15
Fixtures passing end-to-end14
Pass rate93.3%
Total corpus fixtures (including JSON-only)22
Asset classes represented5 (multifamily, mixed-use, NNN, office, industrial)
Geographic states10+ (CA, NJ, NY, CO, AR, PA, WA, IL, OR, NV, AZ, MN, TN)
Validation checks in registry41
Validation checks pinned across the corpus~36
LLM + Reducto extraction time per fixture~3:15 average
LLM + Reducto cost per fixture~$2.00 average
Single full-corpus run~50 min, ~$15–30

The one failing fixture (philly_3800k_12_unit) is failing for the right reason: the regression suite caught a real signal of LLM non-determinism on a borderline rent-roll classification (11 vs. 12 units across two consecutive runs). Pinned within a tolerance band that covers both observations; tightening the prompt to deterministic classification is tracked as P1 #11 in the defect log.

Methodology

The regression suite (apps/api/tests/regression/test_pipeline_regression.py) runs the full production extraction pipeline against each fixture's source PDF — same Reducto OCR call, same Anthropic LLM calls, same downstream commercial- classification, banked-rent enrichment, opex-basis classification, and validation registry — and compares the output against per-fixture golden values in expected.json.

What gets compared, per fixture:

  • deal_summaryproperty name, type, asking price, unit count, broker cap rate, rentable SF
  • rent_rollsource flag (per_unit / synthesized / absent), total/occupied/vacant unit counts, residential/commercial splits, monthly rent in-place, banked rent, non-arms-length lease counts
  • operating_statementtotal income, total operating expenses, NOI, pro forma NOI variants
  • lease_abstractscount and per-tenant fields (tenant name, base rent, escalation rate, credit flags)
  • rent_compscount, average market rent, subject identification
  • sales_compscount, average sale price, subject identification
  • validationper-check status assertions across the 41-check registry — must_pass, must_not_fail, must_emit_info, must_emit_warn

Numeric fields use per-block tolerance bands (typically 0.5–2%, widened to 5% on operating statement aggregates, 10% only where LLM non-determinism is documented and tracked as a fix-needed defect).

A buyer can reproduce by:

cd apps/api
export ANTHROPIC_API_KEY=...
export REDUCTO_API_KEY=...
python -m pytest tests/regression/ -v

Source PDFs are not committed for confidentiality. Buyer diligence can either request the corpus under NDA or supply their own OMs to the same harness — the assertions in expected.json are committed and reviewable line-by-line.

Corpus

22 fixtures total. The 15 runnable here have source PDFs available in this run; the other 7 are hand-curated regression-pin fixtures from prior development whose source PDFs remain confidential and re-enter the corpus when their owners clear them.

By asset class

Asset classFixturesExamples
Multifamily (pure)11The Beverly (46u, SF), Sonoma Heights (60u, CO), Summit Portfolio (387u, AR), Vista Del Pacifico (61u, San Diego)
Mixed-use4Starboard SF SRO, Urban Capital, 9307 3rd Ave Brooklyn, 101 Dyckman ($16.25M institutional)
NNN net lease4Oregon DMV (single-tenant govt), Carson NV (two-tenant), University MN healthcare, Muirwood AZ medical office
Office / owner-user1Muirwood AZ (overlaps with NNN)
Industrial1Westbelt Dr Nashville
Teaser (no financials)2Lantana Culver City, Kirkland WA

Pass rate — detailed

14 PASS of 15 runnable fixtures:

FixtureAsset classStatus
brooklyn_9307_3rd_aveMixed-use (NY)PASS
carson_nv_two_tenant_nnnNNN multi-tenant (NV)PASS
chicago_31_unit_condo_portfolioMultifamily condo (IL)PASS
clovis_1228_jefferson_aveMultifamily turnkey (CA)PASS
dyckman_101_manhattanMixed-use institutional (NY)PASS
equity_union_venice_los_angelesMultifamily value-add (CA)PASS
kirkland_teaserTeaser (WA)PASS
lantana_culver_city_teaserTeaser (CA)PASS
muirwood_az_medical_office_owner_userMedical office condo (AZ)PASS
philly_3800k_12_unitMultifamily (PA)FAILLLM non-determinism on borderline unit row — see P1 #11
sonoma_heights_colorado_springsMultifamily value-add (CO)PASS
summit_portfolio_little_rockMultifamily portfolio (AR)PASS
university_mn_bhg_nnn_healthcareNNN healthcare (MN)PASS
valley_oregon_eugene_dmv_nnnNNN govt single-tenant (OR)PASS
westbelt_tn_nashville_industrialIndustrial owner-user (TN)PASS

Defect log

The full list of defects the regression suite caught is published alongside this benchmark. Each entry includes fixture, symptom, root cause, suggested fix, and tracking status.

  • P1 — accuracy bugs: 4 total. 1 FIXED (OS extractor percentage parsing). 1 mitigated with tolerance + tracking (philly LLM variance). 2 open (Brooklyn commercial classifier, Clovis cap-rate variance).
  • P2 — harness limitation: 1 (short-document deal_summary threshold).
  • P3 — NNN/lease-abstract coverage gaps: 4 (OS synthesis from lease abstract, deal_summary on NNN, taxonomy, section finder labeling).
  • P4 — goldens ergonomics: 2 (date comparison, model field documentation).

The Brooklyn commercial-classifier defect is the most material remaining accuracy bug; the others are coverage gaps that depress validation breadth but do not produce wrong numbers — they produce missing-field signals downstream consumers can detect.

Request the full defect log →

Unit economics

Per fixture, observed across the 50-minute total run:

  • ~3:15 wall-clock per fixture (range: 30s for teasers, 6+ min for institutional)
  • ~$2.00 spend per fixture (Reducto OCR + Anthropic LLM calls combined)
  • Cost-per-correct-document: $2.14 (= $30 / 14 pass)

For comparison: a human underwriter checking 22 OMs at junior-analyst hourly rates plus the time to re-check every cell against the source would run 3–6 hours per OM. The accuracy bar that matters is "what fraction of the rechecks find errors" — which is what this benchmark answers.

What's not measured (be honest)

  • Per-cell source citation accuracy. The pipeline produces page-level citations for some fields; per- cell bounding-box accuracy is not yet measured. The architecture supports it; the regression coverage doesn't exist yet.
  • Drift over time. This is run 2; the first run was on the same date. No production- window canary is wired up yet to catch model-side drift between runs.
  • Comparable accuracy by deal archetype subgroup. With 15 fixtures across 5 asset classes (1–11 per class), per- archetype confidence intervals are not statistically meaningful. Tier-2 sourcing to 30+ fixtures would close this.
  • Adversarial cases. No deliberately-corrupted PDFs (OCR-broken, half-scanned, malformed-table) in the corpus. Robustness floor is unmeasured.

How a buyer verifies

The regression suite is the artifact. A buyer's diligence team can:

  1. Receive a repo snapshot under NDA
  2. Drop their own 5–10 OMs into apps/api/tests/regression/fixtures/<their_slug>/source.pdf with a minimal expected.json derived from the OM's headline figures
  3. Run pytest tests/regression/ against their fixtures with their own API keys
  4. Inspect the assertion failures (if any) and the validation-check status output

The harness emits the full per-fixture diff in one block per fixture, so a buyer can read "expected NOI 234,835, got 236,118 (within tolerance)" line-by-line. No black-box claims; the assertion logic is in committed Python.

Try the pipeline yourself

The same validation registry exercised by this benchmark is exposed as a callable API. Send pre-extracted JSON or a source PDF; receive a validation report with per-check status, affected fields, and the structured data payload your UI needs to render cell-level flags.

curl -sS https://api.rentrolliq.com/v1/checks \
  -H "X-API-Key: <issued on request>" | jq '.checks_catalog_size'
# => 42
  • OpenAPI spec: docs/monetization/openapi.yaml
  • Integration cookbook (curl + Python + sample request/response): docs/monetization/examples/
  • Request a key: support@rentrolliq.com

Want to evaluate it yourself?

Try a single OM free, or talk to us about a benchmark run on your own deal package.