
Public benchmark

Measured accuracy, published methodology

We're running parsr against an open fixture set. Here's how, what we measure, and how to run it yourself. First numbers publish 2026-06-06.

Methodology v1 published 2026-05-09. First measured results: 2026-06-06. We'd rather ship the methodology a month before the numbers than ship numbers without one.

01 / Fixture set

What we'll measure on

A benchmark is only as honest as the documents it's run on. Public document-AI benchmarks have a long history of being constructed around a single vendor's strengths — clean English-language US invoices, generated synthetic payslips, or scans of templates that don't look like the noisy reality of European retail banking.

parsr-benchmark-v1 is a 140-document fixture set drawn from real, anonymized production-shape PDFs across the four document types we ship today. The country distribution is intentionally EU-weighted because that's where parsr lives, and within each country we cover the formats we actually see in design-partner traffic.

Counts are deliberately small. A 140-document set is the largest we can hand-label to a ground truth that we'd defend in a review. We'd rather measure carefully on 140 fixtures than sloppily on 14,000.

50 fixtures

Bank statements

10 each from BE, DE, FR, NL, and the UK. Retail and SMB business statements; mix of monthly and quarterly periods.

30 fixtures

Payslips

Distributed across SD Worx, DATEV, Sage Paie, Loket, Sage Payroll, ADP. Country-specific deduction codes intact (RSZ, Lohnsteuer, CSG/CRDS, PAYE, FICA).

30 fixtures

Receipts

Five categories: restaurants, retail, travel, fuel, services. Mixed thermal-print and digital-PDF formats; non-Latin scripts excluded from v1.

30 fixtures

Invoices

SMB cloud invoices, ERP-rendered invoices, DACH and France/Benelux variants. Mix of B2B services and B2B goods.

Total: 140 anonymized real-world documents. Sourced from design-partner traffic with explicit partner consent and processed under GDPR Art. 6(1)(f) (legitimate interest); all identifying fields — names, account numbers, totals, dates tied to a person — are replaced with realistic-looking synthetic values before the file ever leaves the partner's tenant.

02 / Metrics

Five things we measure per fixture

One number is never enough. A parser can be field-perfect on the easy cases and silently wrong on the hard ones; it can validate to schema and still produce arithmetic that doesn't balance. Five metrics together tell the truth.

01 / Accuracy

Per-field accuracy

A field is correct if it matches the ground-truth string after type-aware normalization (whitespace collapse, case for VAT IDs, ISO date format, IBAN mod-97). Reported per field, then aggregated.
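To make the normalization concrete, here is a minimal Python sketch of what type-aware normalization can look like. The function names are ours for illustration; the published harness defines the canonical rules.

python — type normalization (illustrative)
import re
from datetime import date

def normalize_name(value: str) -> str:
    # Trim, collapse internal whitespace, compare case-insensitively.
    return re.sub(r"\s+", " ", value.strip()).lower()

def normalize_iban(value: str) -> str:
    # Strip spaces, uppercase, and require the ISO 13616 mod-97 check to pass.
    iban = value.replace(" ", "").upper()
    rearranged = iban[4:] + iban[:4]
    digits = "".join(str(int(ch, 36)) for ch in rearranged)  # A=10 ... Z=35
    if int(digits) % 97 != 1:
        raise ValueError(f"IBAN fails mod-97: {value!r}")
    return iban

def normalize_date(value: str) -> str:
    # Parse to an ISO 8601 day boundary; time-zone-naive.
    return date.fromisoformat(value[:10]).isoformat()

assert normalize_name("  Acme   GmbH ") == normalize_name("ACME GmbH")
assert normalize_iban("DE89 3704 0044 0532 0130 00") == "DE89370400440532013000"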

02 / Schema

Schema conformance rate

Share of responses that validate against the published JSON Schema for the doc type — bank_statement.v2, payslip.v1, receipt.v1, invoice.v1. Zero tolerance for missing required fields.
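As a sketch of how conformance can be scored with an off-the-shelf validator, assuming the published schemas are available as local JSON files (the path in the example is a placeholder):

python — schema conformance (illustrative)
import json
from pathlib import Path
from jsonschema import Draft202012Validator

def schema_conformance_rate(responses: list[dict], schema_path: str) -> float:
    # Share of responses with zero validation errors against the doc-type schema.
    schema = json.loads(Path(schema_path).read_text())
    validator = Draft202012Validator(schema)
    ok = sum(1 for r in responses if not list(validator.iter_errors(r)))
    return ok / len(responses) if responses else 0.0

# e.g. schema_conformance_rate(results, "schemas/bank_statement.v2.json")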

03 / Validators

Validator pass rate

Share of responses where the domain validator returns valid=true on a known-correct fixture. balance_chain for bank statements, net_pay_match for payslips, totals_reconcile for invoices and receipts.
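For a sense of what these validators check, here is an illustrative balance_chain sketch: the opening balance plus each transaction amount must reproduce the stated running balances and land on the closing balance. The field names are assumptions for the example, not the production schema.

python — balance_chain sketch (illustrative)
from decimal import Decimal

def balance_chain(statement: dict) -> bool:
    # Opening balance + each amount must match the stated running balance,
    # and the chain must end on the closing balance. Field names assumed.
    running = Decimal(statement["opening_balance"])
    for tx in statement["transactions"]:
        running += Decimal(tx["amount"])
        if running != Decimal(tx["balance_after"]):
            return False
    return running == Decimal(statement["closing_balance"])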

04 / Latency

Latency (p50, p95, p99)

Wall-clock time from POST to async result available. Measured client-side from the same EU region as the API endpoint to remove network noise. No retries counted; first attempt only.
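A sketch of how those percentiles can be computed from first-attempt timings; the submit and poll callables stand in for the actual API client and are not part of the published harness.

python — latency percentiles (illustrative)
import time
import statistics

def time_one(submit, poll, fixture_path: str) -> float:
    # Wall-clock from POST to async result available; first attempt only.
    start = time.monotonic()
    job_id = submit(fixture_path)       # POST the document
    while poll(job_id) is None:         # wait for the async result
        time.sleep(0.2)
    return (time.monotonic() - start) * 1000  # milliseconds

def percentiles(samples_ms: list[float]) -> dict:
    # Cut points 50, 95, 99 taken from 99 quantiles over all fixtures.
    q = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": q[49], "p95": q[94], "p99": q[98]}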

05 / Cost

Cost per page

Billed cents per page, computed from the same usage events that drive customer billing (billable=TRUE). Not a list-price calculation — the actual line item we'd invoice.
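The aggregation is deliberately simple. A sketch assuming each usage event carries a billable flag, a page count, and the billed cents (field names illustrative):

python — cost per page (illustrative)
def cost_per_page(usage_events: list[dict]) -> float:
    # Billed cents per page over billable usage events only.
    billable = [e for e in usage_events if e["billable"]]
    cents = sum(e["billed_cents"] for e in billable)
    pages = sum(e["pages"] for e in billable)
    return cents / pages if pages else 0.0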

03 / Field correctness

How we define “correct” per type

The single most important section on this page. Most benchmark disagreements turn out to be quiet disagreements about what “correct” means. Here's ours, in full.

Type               Match rule
String (name)      Trim + collapse whitespace; case-insensitive equality.
String (IBAN)      Strip spaces, uppercase; mod-97 check must pass; then equality.
String (VAT ID)    Uppercase, strip spaces and country-prefix separators; equality.
Date               Parse to ISO 8601 day boundary; then equality. Time-zone-naive.
Money amount       String equality of amount + ISO 4217 currency; tolerance 0 (no rounding).
Array of objects   Per-item match by index; no partial credit awarded.
Boolean            Strict equality.
Enum               Strict equality against the published enum values; case-sensitive.

When the rule above says “tolerance 0,” we mean it. A money amount that's off by one cent is wrong. A date in the right calendar week but the wrong day is wrong. We don't award partial credit at the field level because partial credit is how a benchmark starts lying to you.
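As a concrete reading of two of the rules above (our sketch, not the harness's canonical comparison):

python — strict field matching (illustrative)
def match_money(pred: dict, truth: dict) -> bool:
    # String equality of amount + ISO 4217 currency; one cent off is wrong.
    return pred["amount"] == truth["amount"] and pred["currency"] == truth["currency"]

def match_array(pred: list, truth: list, match_item) -> bool:
    # Per-item match by index; a missing or extra item fails the whole field.
    return len(pred) == len(truth) and all(match_item(p, t) for p, t in zip(pred, truth))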

04 / Reproducible

Run the benchmark yourself

When the open fixture set ships, you can run parsr against it from your own infra. We publish three things, all under MIT:

  • Fixture set — the 140 anonymized PDFs and their ground-truth JSON, published as a GitHub release. Forward link: /docs/benchmark/fixtures.
  • Eval harness — a small Python package that runs against any parsr-compatible API and computes the five metrics above. Forward link: /docs/benchmark/harness.
  • Reproducibility checksum — a SHA-256 over the fixture set plus harness version, embedded in every result. If you publish numbers, the checksum tells us (and you) that we're comparing the same run.
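On the checksum point, here is one way such a digest could be computed over file contents plus the harness version. The canonical recipe is defined by the harness itself, so treat this as a sketch of the idea rather than the exact algorithm.

python — reproducibility checksum (illustrative)
import hashlib
from pathlib import Path

def fixture_checksum(fixture_dir: str, harness_version: str) -> str:
    # SHA-256 over relative paths + file bytes + harness version, in sorted order.
    digest = hashlib.sha256()
    for path in sorted(Path(fixture_dir).rglob("*")):
        if path.is_file():
            digest.update(path.relative_to(fixture_dir).as_posix().encode())
            digest.update(path.read_bytes())
    digest.update(harness_version.encode())
    return "sha256:" + digest.hexdigest()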
shell — run the benchmark (illustrative)
git clone https://github.com/tryparsr/benchmark
cd benchmark

export PARSR_API_KEY=sk_eu_test_…   # any sandbox key works
python -m parsr_benchmark.run \
    --doc-type bank_statement \
    --fixtures fixtures/be/ \
    --output results/be.json

The snippet is illustrative — the harness ships alongside the first measured run on 2026-06-06. If you want early access for your own evaluation, email results@tryparsr.dev.

05 / What we'll publish

Per doc type, per country, per metric

Each fixture group (doc type × country) gets its own row. No single “parsr accuracy” headline — those numbers flatter the easy categories and hide the hard ones.

results/bank_statement/be.json (illustrative — pending 2026-06-06)
// Illustrative — pending first measurement run on 2026-06-06
{
  "fixture_set":        "parsr-benchmark-v1 (140 docs)",
  "doc_type":           "bank_statement",
  "country":            "BE",
  "n_fixtures":         10,
  "field_accuracy":     null,    // populated 2026-06-06
  "schema_conformance": null,    // populated 2026-06-06
  "validator_pass":     null,
  "latency_p50_ms":     null,
  "latency_p95_ms":     null,
  "latency_p99_ms":     null,
  "cost_per_page":      null,
  "methodology_v":      1,
  "harness_version":    "0.1.0",
  "fixture_checksum":   "sha256:…"
}

Every measured field in this snippet is null on purpose. We will not pre-fill these with synthetic numbers, even for layout demonstration. The first run that produces real values is scheduled for 2026-06-06; the result file format is frozen by methodology v1.

06 / Comparison framing

These are parsr's numbers — and we won't fake yours

We did not run the same fixtures against Mindee, Reducto, or DocuPipe. Their commercial terms generally prohibit competitive benchmarking, and we don't want to publish numbers we can't stand behind in a deposition. What we publish is parsr's numbers on parsr's fixture set — alongside the methodology and harness so anyone can reproduce them.

Where competitors publish their own numbers, we'll quote them as exactly that: quotes. We won't reframe them, average them, or set them next to ours as if they were measured the same way:

  • Mindee claims 90%+ accuracy on invoices across 50+ countries (Mindee's published number, on Mindee's evaluation set).
  • Reducto claims 99.24% accuracy on clinical documents (Reducto's published number, on Reducto's evaluation set — not bank statements, payslips, or invoices).
  • parsr will publish field-level accuracy on EU bank statements, payslips, receipts, and invoices based on this 140-fixture set, with methodology and harness open.

We invite independent benchmarks. Run parsr against your data. Email results@tryparsr.dev — we'll respect what you measure, even if it's lower than our published number, and we'll link to your write-up if you publish one.

07 / v1 numbers

What we've measured, what we're still projecting

The table mixes measured and projected numbers and labels each cell accordingly. Schema conformance, validator pass rates, and latency / cost are measured on a 21-document bootstrap set we ran on 2026-05-09. Field accuracy is still projected — the bootstrap is self-labeled (the LLM produced its own ground truth), so we can't honestly score field-level accuracy against it. The independent hand-labeled run on 2026-06-06 replaces the projected accuracy column with measured numbers.

Per specialist (N): schema conformance, validator pass rates, field accuracy, latency / cost.

bank_statement (N=4)
  Schema conformance: 100% (measured)
  Validator pass rates (measured): balance_chain 50% (2/4), date_monotonicity 100% (4/4)
  Field accuracy: 94–96% (projected)
  Latency / cost (projected): p50 ~3.2s / p95 ~9s, cost ~3.5¢

payslip (N=3)
  Schema conformance: 100% (measured)
  Validator pass rates (measured): net_pay_match 100% (3/3), field_completeness 100% (3/3)
  Field accuracy: 91–94% (projected)
  Latency / cost (projected): p50 ~2.8s / p95 ~7s, cost ~2.8¢

receipt (N=5)
  Schema conformance: 100% (measured)
  Validator pass rates (measured): totals_reconcile 80% (4/5), line_items_sum 80% (4/5), tax_jurisdiction_match 40% (2/5)
  Field accuracy: 89–93% (projected)
  Latency / cost (projected): p50 ~2.4s / p95 ~6s, cost ~2.2¢

invoice (N=9)
  Schema conformance: 100% (measured)
  Validator pass rates (measured): line_items_sum 100% (9/9), totals_reconcile 100% (9/9), vat_format_valid 56% (5/9)
  Field accuracy: 92–95% (projected)
  Latency / cost (projected): p50 ~3.0s / p95 ~8s, cost ~3.0¢

classify (N not stated)
  Schema conformance: measured, value not stated
  Validator pass rates: n/a (no validators)
  Field accuracy: 96–98% (projected)
  Latency / cost (projected): p50 ~1.2s / p95 ~3s, cost ~1.0¢

The 21-doc bootstrap mixes real third-party documents (Cloudflare, Hetzner, Exoscale, DigitalOcean, RunCloud, and Stripe-issued invoices, plus HSBC + Big Bank statements, a cross-jurisdiction payslip set, and grocery / digital receipts). All responses conformed to schema; the payslip and invoice arithmetic validators passed across the board; balance_chain at 50% on bank statements (n=4) is small-sample noise but worth tracking — multi-page extractions with table continuations are the failure mode. vat_format_valid at 56% on invoices reflects SaaS providers issuing US-style tax IDs that our EU/UK regex doesn't match — by design, since the validator checks EU VAT format compliance.
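To make the vat_format_valid point concrete, a deliberately trimmed-down sketch; the production validator covers more country formats than the handful shown here, and the example IDs are placeholders.

python — EU/UK VAT format vs US-style tax ID (illustrative)
import re

# Simplified patterns for a few EU/UK VAT formats (not exhaustive).
EU_UK_VAT = re.compile(r"^(?:DE\d{9}|FR[A-Z0-9]{2}\d{9}|BE0\d{9}|NL\d{9}B\d{2}|GB\d{9})$")

assert EU_UK_VAT.match("DE123456789")     # German format: DE + 9 digits
assert not EU_UK_VAT.match("12-3456789")  # US-style EIN fails, by design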

The bootstrap manifest lives at docs/internal/measurements/bootstrap-2026-05-09.json. Reproducible by anyone with API access: doppler run -- uv run python -m scripts.summarize_fixtures.

200 free pages. Run parsr on your data.

We'll cite your numbers if you publish them. Lower than ours is fine — “you measured it, we'll respect it” is the whole point of this page.