Public benchmark
We're running parsr against an open fixture set. Here's how, what we measure, and how to run it yourself. First numbers publish 2026-06-06.
01 / Fixture set
A benchmark is only as honest as the documents it's run on. Public document-AI benchmarks have a long history of being constructed around a single vendor's strengths — clean English-language US invoices, generated synthetic payslips, or scans of templates that don't look like the noisy reality of European retail banking.
parsr-benchmark-v1 is a 140-document fixture set drawn from real, anonymized production-shape PDFs across the four document types we ship today. The country distribution is intentionally EU-weighted because that's where parsr lives, and within each country we cover the formats we actually see in design-partner traffic.
Counts are deliberately small. A 140-document set is the largest we can hand-label to a ground truth that we'd defend in a review. We'd rather measure carefully on 140 fixtures than sloppily on 14,000.
50 fixtures
Bank statements
10 each from five major European markets — BE, DE, FR, NL, UK. Retail and SMB-business statements; mix of monthly and quarterly periods.
30 fixtures
Payslips
Distributed across SD Worx, DATEV, Sage Paie, Loket, Sage Payroll, ADP. Country-specific deduction codes intact (RSZ, Lohnsteuer, CSG/CRDS, PAYE, FICA).
30 fixtures
Receipts
Five categories: restaurants, retail, travel, fuel, services. Mixed thermal-print and digital-PDF formats; non-Latin scripts excluded from v1.
30 fixtures
Invoices
SMB cloud invoices, ERP-rendered invoices, DACH and France/Benelux variants. Mix of B2B services and B2B goods.
Total: 140 anonymized real-world documents. Sourced from design-partner traffic with explicit consent, processed under GDPR Art. 6(1)(f) (legitimate interest); all identifying fields — names, account numbers, totals, dates tied to a person — are replaced with realistic-looking synthetic values before the file ever leaves the partner's tenant.
02 / Metrics
One number is never enough. A parser can be field-perfect on the easy cases and silently wrong on the hard ones; it can validate to schema and still produce arithmetic that doesn't balance. Five metrics together tell the truth.
01 / Accuracy
A field is correct if it matches the ground-truth string after type-aware normalization (whitespace collapse, case for VAT IDs, ISO date format, IBAN mod-97). Reported per field, then aggregated.
02 / Schema
Share of responses that validate against the published JSON Schema for the doc type — bank_statement.v2, payslip.v1, receipt.v1, invoice.v1. Zero tolerance for missing required fields — see the sketch after the five metric cards.
03 / Validators
Share of responses where the domain validator returns valid=true on a known-correct fixture. balance_chain for bank statements, net_pay_match for payslips, totals_reconcile for invoices and receipts.
04 / Latency
Wall-clock time from POST to async result available. Measured client-side from the same EU region as the API endpoint to remove network noise. No retries counted; first attempt only.
05 / Cost
Billed cents per page, computed from the same usage events that drive customer billing (billable=TRUE). Not a list-price calculation — the actual line item we'd invoice.
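To make the schema-conformance number concrete, here's a minimal sketch of how a response can be scored against a published JSON Schema using the open-source jsonschema package. The file paths and response layout are assumptions for illustration — the actual scoring code ships with the harness on 2026-06-06.

# Minimal sketch: score schema conformance for one doc type.
# Paths and response layout are assumptions; the real harness
# (parsr_benchmark) may differ in detail.
import json
from pathlib import Path

from jsonschema import Draft202012Validator  # pip install jsonschema


def schema_conformance(schema_path: Path, response_dir: Path) -> float:
    """Share of responses that validate against the published schema."""
    schema = json.loads(schema_path.read_text())
    validator = Draft202012Validator(schema)

    responses = sorted(response_dir.glob("*.json"))
    passed = 0
    for path in responses:
        response = json.loads(path.read_text())
        # Zero tolerance: any error, including a missing required field,
        # fails the whole document.
        if not list(validator.iter_errors(response)):
            passed += 1
    return passed / len(responses) if responses else 0.0


if __name__ == "__main__":
    rate = schema_conformance(
        Path("schemas/bank_statement.v2.json"),  # hypothetical path
        Path("results/bank_statement/"),         # hypothetical path
    )
    print(f"schema_conformance: {rate:.1%}")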
03 / Field correctness
The single most important section on this page. Most benchmark disagreements turn out to be quiet disagreements about what “correct” means. Here's ours, in full.
| Type | Match rule |
|---|---|
| String (name) | Trim + collapse whitespace; case-insensitive equality. |
| String (IBAN) | Strip spaces, uppercase; mod-97 check must pass; then equality. |
| String (VAT ID) | Uppercase, strip spaces and country prefix separators; equality. |
| Date | Parse to ISO 8601 day boundary; then equality. Time-zone-naive. |
| Money amount | String equality of amount + ISO 4217 currency; tolerance 0 (no rounding). |
| Array of objects | Per-item match by index; no partial credit awarded. |
| Boolean | Strict equality. |
| Enum | Strict equality against the published enum values; case-sensitive. |
When the rule above says “tolerance 0,” we mean it. A money amount that's off by one cent is wrong. A date in the right calendar week but the wrong day is wrong. We don't award partial credit at the field level because partial credit is how a benchmark starts lying to you.
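To show how the table above reads in code, here's a rough sketch of type-aware matching for a few field types — names, IBANs, dates, and money amounts. The function names are ours for illustration, not the harness's; the IBAN rule is the standard mod-97 check.

# Sketch of the match rules above. Illustrative only; the benchmark
# harness may normalize differently in detail.
from datetime import datetime


def match_name(pred: str, truth: str) -> bool:
    # Trim + collapse whitespace; case-insensitive equality.
    norm = lambda s: " ".join(s.split()).lower()
    return norm(pred) == norm(truth)


def normalize_iban(s: str) -> str:
    # Strip spaces, uppercase.
    return s.replace(" ", "").upper()


def iban_mod97_ok(iban: str) -> bool:
    # Standard IBAN check: move the first 4 chars to the end, map A-Z to
    # 10-35, read the result as an integer, require remainder 1 mod 97.
    rearranged = iban[4:] + iban[:4]
    digits = "".join(str(int(c, 36)) for c in rearranged)
    return int(digits) % 97 == 1


def match_iban(pred: str, truth: str) -> bool:
    pred, truth = normalize_iban(pred), normalize_iban(truth)
    return iban_mod97_ok(pred) and pred == truth


def match_date(pred: str, truth: str) -> bool:
    # Parse to an ISO 8601 day boundary; time-zone-naive.
    to_day = lambda s: datetime.fromisoformat(s).date()
    return to_day(pred) == to_day(truth)


def match_money(pred: tuple[str, str], truth: tuple[str, str]) -> bool:
    # (amount, ISO 4217 currency); tolerance 0, so plain string equality.
    return pred == truth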
04 / Reproducible
When the open fixture set ships, you can run parsr against it from your own infra. We publish three things, all under MIT:
git clone https://github.com/tryparsr/benchmark
cd benchmark
export PARSR_API_KEY=sk_eu_test_… # any sandbox key works
python -m parsr_benchmark.run \
  --doc-type bank_statement \
  --fixtures fixtures/be/ \
  --output results/be.json
The snippet is illustrative — the harness ships alongside the first measured run on 2026-06-06. If you want early access for your own evaluation, email results@tryparsr.dev.
05 / What we'll publish
Each fixture group (doc type × country) gets its own row. No single “parsr accuracy” headline — those numbers flatter the easy categories and hide the hard ones.
// Illustrative — pending first measurement run on 2026-06-06
{
  "fixture_set": "parsr-benchmark-v1 (140 docs)",
  "doc_type": "bank_statement",
  "country": "BE",
  "n_fixtures": 10,
  "field_accuracy": null,        // populated 2026-06-06
  "schema_conformance": null,    // populated 2026-06-06
  "validator_pass": null,
  "latency_p50_ms": null,
  "latency_p95_ms": null,
  "latency_p99_ms": null,
  "cost_per_page": null,
  "methodology_v": 1,
  "harness_version": "0.1.0",
  "fixture_checksum": "sha256:…"
}
Every measured field in this snippet is null on purpose. We will not pre-fill these with synthetic numbers, even for layout demonstration. The first run that produces real values is scheduled for 2026-06-06; the result file format is frozen by methodology v1.
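If you build tooling on top of the result file, treat null as "not yet measured", never as zero. A small illustrative loader — the record shape follows the example above; the loader itself is not part of the harness:

# Sketch: load a published result record and refuse to report rows
# that haven't been measured yet. Field names follow the example above;
# this loader is illustrative, not shipped code.
import json

MEASURED_FIELDS = (
    "field_accuracy",
    "schema_conformance",
    "validator_pass",
    "latency_p50_ms",
    "latency_p95_ms",
    "latency_p99_ms",
    "cost_per_page",
)


def load_result(path: str) -> dict | None:
    with open(path) as f:
        record = json.load(f)

    if record.get("methodology_v") != 1:
        raise ValueError("unknown methodology version")

    # null means "not yet measured" — never render it as 0.
    if any(record.get(k) is None for k in MEASURED_FIELDS):
        return None
    return record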
06 / Comparison framing
We did not run the same fixtures against Mindee, Reducto, or DocuPipe. Their commercial terms generally prohibit competitive benchmarking, and we don't want to publish numbers we can't stand behind in a deposition. What we publish is parsr's numbers on parsr's fixture set — alongside the methodology and harness so anyone can reproduce them.
Where competitors publish their own numbers, we'll quote them as published. We won't reframe them, average them, or set them next to ours as if they were measured the same way.
We invite independent benchmarks. Run parsr against your data. Email results@tryparsr.dev — we'll respect what you measure, even if it's lower than our published number, and we'll link to your write-up if you publish one.
07 / v1 numbers
The table mixes measured and projected numbers and labels each per cell. Schema conformance, validator pass rates, and latency / cost are measured on a 21-document bootstrap set we ran on 2026-05-09. Field accuracy is still projected — the bootstrap is self-labeled (the LLM produced its own ground truth), so we can't honestly score field-level accuracy against it. The independent hand-labeled run on 2026-06-06 replaces the projected accuracy column with measured numbers.
| Specialist (N) | Schema | Validator pass rates | Field accuracy | Latency / cost |
|---|---|---|---|---|
| bank_statement N=4 | 100% measured | balance_chain: 50% (2/4) date_monotonicity: 100% (4/4) measured | 94–96% projected | p50 ~3.2s / p95 ~9s cost ~3.5¢ projected |
| payslip N=3 | 100% measured | net_pay_match: 100% (3/3) field_completeness: 100% (3/3) measured | 91–94% projected | p50 ~2.8s / p95 ~7s cost ~2.8¢ projected |
| receipt N=5 | 100% measured | totals_reconcile: 80% (4/5) line_items_sum: 80% (4/5) tax_jurisdiction_match: 40% (2/5) measured | 89–93% projected | p50 ~2.4s / p95 ~6s cost ~2.2¢ projected |
| invoice N=9 | 100% measured | line_items_sum: 100% (9/9) totals_reconcile: 100% (9/9) vat_format_valid: 56% (5/9) measured | 92–95% projected | p50 ~3.0s / p95 ~8s cost ~3.0¢ projected |
| classify N=— | — | n/a (no validators) | 96–98% projected | p50 ~1.2s / p95 ~3s cost ~1.0¢ projected |
The 21-doc bootstrap mixes real third-party documents (Cloudflare, Hetzner, Exoscale, DigitalOcean, RunCloud, and Stripe-issued invoices, plus HSBC and Big Bank statements, a cross-jurisdiction payslip set, and grocery and digital receipts). Every response conformed to schema; payslip and invoice arithmetic validators passed across the board; balance_chain at 50% on bank statements (n=4) is small-sample noise but worth tracking — multi-page extractions with table continuations are the failure mode. vat_format_valid at 56% on invoices reflects SaaS providers issuing US-style tax IDs that our EU/UK regex doesn't match — by design, since the validator checks EU VAT format compliance.
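For readers wondering what balance_chain actually asserts: the opening balance plus the signed transaction amounts must equal the closing balance — exactly the chain that breaks when a multi-page table continuation drops rows. A simplified sketch, with field names assumed for illustration:

# Simplified balance_chain check. Field names (opening_balance,
# closing_balance, transactions[].amount) are assumptions for this sketch;
# the production validator lives inside the API.
from decimal import Decimal


def balance_chain_ok(statement: dict) -> bool:
    opening = Decimal(statement["opening_balance"])
    closing = Decimal(statement["closing_balance"])
    total = sum(Decimal(t["amount"]) for t in statement["transactions"])
    # If a page-break continuation row was dropped, the chain won't close.
    return opening + total == closing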
The bootstrap manifest lives at docs/internal/measurements/bootstrap-2026-05-09.json. Reproducible by anyone with API access: doppler run -- uv run python -m scripts.summarize_fixtures.
We'll cite your numbers if you publish them. Lower than ours is fine — “you measured it, we'll respect it” is the whole point of this page.