Evaluation
The HIPE-OCRepair evaluation framework is designed to ensure fair, transparent, and reproducible assessment of OCR post-correction systems. It supports both LLM-based and traditional approaches across multilingual historical data.
Evaluation period
As indicated in the Timeline, the evaluation period will run from Monday 6 April 2026 (9 a.m. CEST) to Wednesday 8 April 2026 (10 a.m. CEST).
The test sets will be released in the official data repository.
At the end of the evaluation period, participants submit their system outputs via email to the task organisers, as announced via the mailing list and explained in the Participation Guidelines.
Results will be published on Friday 10 April 2026 on the Results page and in a dedicated GitHub repository.
A Hugging Face leaderboard will be set up after the competition with the official rankings, and will remain active for future benchmarking and community contributions.
Metrics
Primary metric: cMER
All primary metrics are based on character-level Match Error Rate (cMER). Unlike standard CER, cMER is bounded in \([0, 1]\) because insertions are included in the denominator.
\[\text{cMER} = \frac{S + D + I}{H + S + D + I}\]
where \(H\) = hits, \(S\) = substitutions, \(D\) = deletions, and \(I\) = insertions at the character level. A cMER of 0.05 means the hypothesis and reference differ by 5% at the character level.
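To illustrate why the bound matters, the following minimal sketch compares MER with the usual CER definition, (S + D + I) / (H + S + D), on hand-picked alignment counts. The function names and the example counts are illustrative assumptions, not part of the official scorer.

```python
def cer(H: int, S: int, D: int, I: int) -> float:
    """Standard CER: errors divided by the reference length (H + S + D)."""
    return (S + D + I) / (H + S + D)


def mer(H: int, S: int, D: int, I: int) -> float:
    """Match Error Rate: errors divided by all aligned positions."""
    return (S + D + I) / (H + S + D + I)


# Hypothetical case: reference "ab", hypothesis "abcdef",
# i.e. 2 hits and 4 inserted characters.
counts = dict(H=2, S=0, D=0, I=4)
print(cer(**counts))  # exceeds 1 because insertions are not in the denominator
print(mer(**counts))  # stays within [0, 1]
```

A system that hallucinates long stretches of text can push CER above 1, whereas MER saturates towards 1.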
Normalization
Before scoring in the normalized setup, texts are preprocessed as follows:
- lowercased
- punctuation and other non-word characters replaced by spaces
- underscores replaced by spaces
- repeated whitespace collapsed
Evaluation is therefore case-insensitive and punctuation-insensitive, but remains sensitive to accented characters (for example, é and e are treated as different).
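The normalization steps above can be approximated as follows. This is a sketch under assumptions: the regular expression, the function name `normalize`, and the stripping of leading and trailing whitespace are illustrative choices and may differ from what the official scorer does.

```python
import re


def normalize(text: str) -> str:
    """Approximate the normalized setup described above (assumed details)."""
    text = text.lower()                   # case-insensitive
    text = re.sub(r"[^\w\s]", " ", text)  # punctuation / non-word chars -> space
    text = text.replace("_", " ")         # underscore counts as a word char in \w
    return " ".join(text.split())         # collapse whitespace (also strips ends)


print(normalize("Hello,   World_Café!"))
```

Because `\w` is Unicode-aware for Python `str` patterns, accented letters such as é are kept and remain distinct from their unaccented counterparts.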
Formal Metric Definitions
Each test dataset consists of a set of transcription units. All metrics are first computed at the level of individual transcription units and then aggregated across them.
cMER and wMER
For a single transcription unit, MER is: \(\mathrm{MER} = \frac{S + D + I}{H + S + D + I}\)
For character-level scoring this gives cMER; for word-level scoring (using jiwer.process_words(...) after normalization) this gives wMER.
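As an illustration of how the same formula yields cMER or wMER depending on the unit of alignment, here is a small pure-Python stand-in for the alignment step. It is not the scorer's implementation: `align_counts` and `mer` are hypothetical helpers, the value returned for two empty inputs is an assumption, and when several minimum-cost alignments exist the split between substitutions and deletion/insertion pairs (and hence the score) may differ from the official tool.

```python
def align_counts(ref, hyp):
    """Return (H, S, D, I) for one minimum-edit-distance alignment."""
    # prev[j] holds (cost, H, S, D, I) for ref[:i] aligned with hyp[:j].
    prev = [(j, 0, 0, 0, j) for j in range(len(hyp) + 1)]
    for i, r in enumerate(ref, start=1):
        curr = [(i, 0, 0, i, 0)]
        for j, h in enumerate(hyp, start=1):
            c, H, S, D, I = prev[j - 1]
            diag = (c, H + 1, S, D, I) if r == h else (c + 1, H, S + 1, D, I)
            c, H, S, D, I = prev[j]
            delete = (c + 1, H, S, D + 1, I)
            c, H, S, D, I = curr[j - 1]
            insert = (c + 1, H, S, D, I + 1)
            # Keep the cheapest option; ties favour the diagonal move.
            curr.append(min(diag, delete, insert, key=lambda t: t[0]))
        prev = curr
    return prev[-1][1:]


def mer(ref, hyp) -> float:
    """MER over any pair of sequences (characters or word lists)."""
    H, S, D, I = align_counts(ref, hyp)
    total = H + S + D + I
    return (S + D + I) / total if total else 0.0  # assumption for empty input


reference, hypothesis = "the quick fox", "the quick brown fox"
print(mer(reference, hypothesis))                  # character level (cMER)
print(mer(reference.split(), hypothesis.split()))  # word level (wMER)
```

Passing strings aligns characters, while passing token lists aligns words, so the two metrics differ only in the tokenization applied before alignment.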
Micro vs. macro aggregation
For a dataset with transcription units \(i = 1, \ldots, N\):
\[\mathrm{cMER}_{\mathrm{macro}} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{cMER}_{i}\]
\[\mathrm{cMER}_{\mathrm{micro}} = \frac{\sum_i S_i + \sum_i D_i + \sum_i I_i}{\sum_i H_i + \sum_i S_i + \sum_i D_i + \sum_i I_i}\]
wMER_macro and wMER_micro are defined analogously over word-level alignments.
In plain terms:
- Micro aggregation pools alignment counts before computing the rate, so longer transcription units contribute more weight. For example, a long newspaper article influences micro scores more than a short snippet.
- Macro aggregation averages transcription-unit-level scores, so every unit — long or short — contributes equally.
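The difference between the two aggregation modes can be seen on a toy dataset. The sketch below assumes that each transcription unit is already summarized by its alignment counts (H, S, D, I); the counts and function names are made up for illustration.

```python
def unit_mer(H: int, S: int, D: int, I: int) -> float:
    """MER of a single transcription unit."""
    return (S + D + I) / (H + S + D + I)


def macro_mer(units) -> float:
    """Average of per-unit MER values: every unit counts equally."""
    return sum(unit_mer(*u) for u in units) / len(units)


def micro_mer(units) -> float:
    """Pool the counts first: longer units weigh more."""
    H, S, D, I = (sum(col) for col in zip(*units))
    return (S + D + I) / (H + S + D + I)


# Hypothetical data: one long article (MER 0.1) and one short snippet (MER 0.5).
units = [(900, 50, 30, 20), (5, 5, 0, 0)]
print(macro_mer(units))  # midway between the two unit scores
print(micro_mer(units))  # dominated by the long article
```

Here the short, badly recognized snippet pulls the macro score up to 0.3, while the micro score stays close to the long article's 0.1.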
Preference-based metrics
In addition to MER, the scorer reports metrics that compare each system output to the original (uncorrected) OCR. These capture whether a system consistently improves over the OCR baseline, rather than producing large gains on some units while degrading others.
For each transcription unit, the preference score is:
- \(1\) if the system is better than the raw OCR
- \(0\) if both are equal
- \(-1\) if the system is worse than the raw OCR
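The per-unit rule, together with its average over units, can be sketched as below. The sketch assumes that "better" means a strictly lower MER on that unit and that scores are compared exactly, without a tolerance; the function names and the example values are hypothetical.

```python
def preference(system_mer: float, ocr_mer: float) -> int:
    """Return 1, 0 or -1 depending on how the system compares to raw OCR."""
    if system_mer < ocr_mer:  # assumption: lower MER means better
        return 1
    if system_mer > ocr_mer:
        return -1
    return 0


def pref_score_macro(pairs) -> float:
    """Mean preference over (system_mer, ocr_mer) pairs, one per unit."""
    return sum(preference(s, o) for s, o in pairs) / len(pairs)


# Hypothetical per-unit scores: improved, unchanged, degraded, improved.
pairs = [(0.02, 0.05), (0.05, 0.05), (0.10, 0.05), (0.01, 0.20)]
print(pref_score_macro(pairs))
```

The score lies in [-1, 1]: a value near 1 means the system improves almost every unit, while a value near 0 can hide a mix of gains and degradations.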
The reported metrics are macro averages over transcription units:
\[\mathrm{pref\_score\_cmer\_macro} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{pref}_{\mathrm{cmer}}(i)\]
\[\mathrm{pref\_score\_wmer\_macro} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{pref}_{\mathrm{wmer}}(i)\]
Rankings
Primary and secondary criteria
The primary ranking metric is cmer_micro (lower is better), computed separately for each dataset. The secondary ranking metric is pref_score_cmer_macro (higher is better), used as a tiebreaker: in case of identical primary scores, runs are ranked by this metric in descending order.
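A two-level ordering of this kind can be expressed as a single sort key. The run names and scores below are invented for illustration, and the exact-equality treatment of ties is an assumption about how "identical primary scores" is decided.

```python
# Hypothetical runs on one dataset.
runs = [
    {"run": "team-a_run1", "cmer_micro": 0.050, "pref_score_cmer_macro": 0.40},
    {"run": "team-b_run1", "cmer_micro": 0.042, "pref_score_cmer_macro": 0.10},
    {"run": "team-c_run1", "cmer_micro": 0.050, "pref_score_cmer_macro": 0.65},
]

# Primary key ascending (lower cMER is better); secondary key descending
# (higher preference score is better), hence the negation.
ranked = sorted(
    runs,
    key=lambda r: (r["cmer_micro"], -r["pref_score_cmer_macro"]),
)

for rank, r in enumerate(ranked, start=1):
    print(rank, r["run"], r["cmer_micro"], r["pref_score_cmer_macro"])
```

In this example the two runs tied at 0.050 are separated by their preference scores, with the higher one ranked first.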
Official competition ranking
The official competition ranking is a weighted mean of per-test-set cmer_micro across the 8 official test sets. The secondary criterion is the corresponding weighted mean of pref_score_cmer_macro.
The weighting scheme is chosen so that each language contributes equally regardless of how many test sets represent it:
- English and French: two test sets each, both with weight 1
- German: four test sets, with impresso-snippets at weight 1 and dta19-l0, dta19-l1, dta19-l2 at weight 1/3 each, so that the three DTA splits together carry the same total weight as any single other test set and German is not overrepresented in the overall ranking
Note that impresso-nzz and overproof-combined do not contribute to the official rankings because they were released publicly before the competition.
Please refer to the Participation Guidelines for full details.
Per-language rankings
In addition to the overall ranking, we report per-language rankings computed as a weighted mean of per-test-set cmer_micro over the official test sets for that language:
- English: mean over icdar2017 and impresso-snippets
- French: mean over icdar2017 and impresso-snippets
- German: impresso-snippets with weight 1; dta19-l0, dta19-l1, dta19-l2 with weight 1/3 each
The secondary criterion for per-language rankings is the corresponding weighted mean of pref_score_cmer_macro.
Scorer
System outputs can be scored locally with the HIPE-OCRepair-scorer, which can also be installed via pip.
Results
Evaluation results will be available on the results page and in the HIPE-OCRepair GitHub repository (publicly released after the competition), and will be presented at ICDAR 2026.
Submission and Scoring
Please refer to the Participation Guidelines for detailed instructions on formatting and submitting system outputs.
Reproducibility and Resources
All datasets, scripts, and baseline systems will be released on:
- GitHub — evaluation scripts and scoring tools (primary development repository)
- Hugging Face — datasets and leaderboard
- Zenodo — versioned archival snapshot for citation (with DOI)
The leaderboard will remain active after ICDAR 2026 to support long-term benchmarking and community contributions.
Updates
Join the HIPE-OCRepair 2026 mailing list for further updates.