Evaluation
Evaluation Infrastructure
The HIPE-OCRepair evaluation framework is designed to ensure fair, transparent, and reproducible assessment of OCR post-correction systems. It supports both LLM-based and traditional approaches, and evaluates submissions along two complementary dimensions: correction quality and robustness across heterogeneous historical data.
Participants will submit system outputs for two blind test sets in a standardized JSONL format. All evaluation scripts, baseline systems, and submission templates will be released publicly to ensure reproducibility.
Core Evaluation Metrics
HIPE-OCRepair uses metrics specifically suited to historical OCR and multilingual text, where spelling variation and OCR artifacts can make naïve word-based metrics misleading.
Character error rate (CER) is the primary evaluation measure for HIPE-OCRepair. It is preferred over word error rate (WER) because historical spelling variation and segmentation differences can artificially inflate word-level error penalties.
We compute:
- Corpus-level micro-averaged CER (global performance)
- Per-item CER (local performance)
Both are reported with confidence intervals for statistical robustness.
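For concreteness, the two aggregations can be computed along the lines of the following minimal Python sketch, using a hand-rolled character-level Levenshtein distance; the official scoring scripts may differ in implementation details.

```python
# Minimal sketch of the two CER aggregations; the official scripts may differ.

def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution
            ))
        prev = curr
    return prev[-1]

def per_item_cer(refs: list[str], hyps: list[str]) -> list[float]:
    """Local CER for each reference/hypothesis pair."""
    return [edit_distance(r, h) / max(len(r), 1) for r, h in zip(refs, hyps)]

def micro_cer(refs: list[str], hyps: list[str]) -> float:
    """Corpus-level micro-average: total edits over total reference characters."""
    edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    chars = sum(len(r) for r in refs)
    return edits / max(chars, 1)
```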
To evaluate how consistently a system improves the noisy input, we also compute an interpretable item-level score sᵢ:
- +1 if the corrected output has lower CER than the noisy input
- 0 if the CER is unchanged
- −1 if the output has higher CER (the system worsened the text)
Formally:
sᵢ = sign(CER_inputᵢ − CER_outputᵢ)
This metric highlights systems that are stable and reliable across diverse documents.
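A minimal sketch of this score, assuming per-item CER values for the noisy inputs and the system outputs are already available (e.g., from a function like per_item_cer above):

```python
# Sketch of the item-level stability score s_i from per-item CER values.
import math
from collections import Counter

def stability_scores(cer_input: list[float], cer_output: list[float]) -> list[int]:
    """s_i = sign(CER_input_i - CER_output_i): +1 improved, 0 unchanged, -1 degraded."""
    return [int(math.copysign(1, a - b)) if a != b else 0
            for a, b in zip(cer_input, cer_output)]

# Example: summarize how often a system helps, is neutral, or hurts.
counts = Counter(stability_scores([0.20, 0.10, 0.05], [0.12, 0.10, 0.09]))
print(counts[1], "improved,", counts[0], "unchanged,", counts[-1], "degraded")
```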
In addition, we report:
- WER (word error rate)
- CER/WER per dataset
- Confidence intervals for all reported scores
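The method for computing these confidence intervals is not fixed here; one common choice is a nonparametric bootstrap over items, sketched below for the micro-averaged CER. The function name and parameters are illustrative, not the official implementation.

```python
# Hypothetical sketch of a 95% bootstrap confidence interval for the
# micro-averaged CER; the official scripts may use a different method.
import random

def bootstrap_ci(edits: list[int], ref_lens: list[int],
                 n_resamples: int = 1000, alpha: float = 0.05,
                 seed: int = 0) -> tuple[float, float]:
    """Resample items with replacement and recompute the micro-averaged CER."""
    rng = random.Random(seed)
    items = list(zip(edits, ref_lens))
    stats = []
    for _ in range(n_resamples):
        sample = rng.choices(items, k=len(items))
        stats.append(sum(e for e, _ in sample) / max(sum(n for _, n in sample), 1))
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper
```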
Submission and Scoring
Submission Format
Participants fill in the placeholder for the system's post-corrected output in the provided JSONL files and submit the completed files.
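As an illustration, a submission file can be produced along the following lines; the field names ("ocr_text", "correction") are hypothetical, since the exact schema ships with the released JSONL files.

```python
# Illustrative sketch of filling in the submission placeholder.
# The field names "ocr_text" and "correction" are hypothetical.
import json

def write_submission(in_path: str, out_path: str, system) -> None:
    """Read the provided JSONL, run the system on each noisy input,
    and write one completed record per line."""
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            record = json.loads(line)
            record["correction"] = system(record["ocr_text"])  # fill placeholder
            fout.write(json.dumps(record, ensure_ascii=False) + "\n")
```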
Scoring
All submissions are evaluated with:
- official scoring scripts (Python)
- a public online leaderboard (Hugging Face)
Reproducibility and Resources
All datasets, scripts, and baseline systems will be released on:
- GitHub (evaluation scripts, scoring tools)
- Hugging Face (datasets + leaderboard)
- Zenodo (archived version for citation)
The leaderboard will remain active after ICDAR 2026 to support long-term benchmarking and community contributions.
Updates
📢 The official GitHub repository for HIPE-OCRepair evaluation scripts and starter code will be announced soon.
Please check back or join our HIPE-OCRepair 2026 mailing list for updates.