Tasks & Data
Task Description
Data repository (training data and documentation)
Participation Guidelines
HIPE-OCRepair-scorer repository
Evaluation repository (for results after the competition)
Hugging Face Leaderboard (available soon)
The ICDAR 2026 Competition on LLM-Assisted OCR Post-Correction (HIPE-OCRepair) challenges participants to correct noisy OCR transcripts from multilingual historical documents.
Given an OCR segment and its metadata, systems must return a corrected version, without access to the source images.
The competition provides a multilingual benchmark with curated ground truth and a standardised evaluation protocol. Both generative and hybrid approaches are welcome, with a particular focus on LLM-assisted methods.
For each input text chunk (typically a paragraph-like unit), participants receive:
- the raw OCR hypothesis
- document-level metadata (language, publication title, type, date)
- OCR quality indicators (CER, WER, lexicon-based quality score)
Systems are evaluated on their ability to reduce character error rate (CER). More details on metrics and evaluation infrastructure can be found on the Evaluation page.
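As a rough illustration of the metric, the sketch below computes CER as the Levenshtein edit distance between a hypothesis and a reference, divided by the reference length. This is the textbook definition only: the function names are illustrative, and the official HIPE-OCRepair scorer may apply text normalisation or aggregation that is not reproduced here.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of character insertions, deletions and substitutions
    needed to turn `a` into `b`."""
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def cer(hypothesis: str, reference: str) -> float:
    """Character error rate of `hypothesis` against `reference`."""
    if not reference:
        # Convention chosen for this sketch only: an empty reference scores
        # 0.0 for an empty hypothesis and 1.0 otherwise.
        return 0.0 if not hypothesis else 1.0
    return levenshtein(hypothesis, reference) / len(reference)


reference = "les dépêches qui suivent"
ocr = "les dé pêches qui suivent"       # one spurious space
corrected = "les dépêches qui suivent"  # what a system should return

print(f"CER before correction: {cer(ocr, reference):.4f}")
print(f"CER after correction:  {cer(corrected, reference):.4f}")
```

A system improves on the input when the CER of its output is lower than the CER of the raw OCR hypothesis against the same reference.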
Datasets
The benchmark comprises multilingual OCR post-correction data drawn from several historical collections, covering newspapers and printed works in English, French, and German (17th–20th century). All datasets were processed through a unified curation pipeline that standardises format, document unit segmentation and, as far as possible, transcription conventions, to ensure comparability across languages, time periods, and document types.
The benchmark includes training data (segmentation and formatting harmonisation only; original GT retained) as well as fully curated, manually corrected development and test sets.
See the HIPE-OCRepair-2026-data repository to access the data and detailed documentation on the curation process.
Source Collections
| Dataset | Curation | Document Type | Languages | Period |
|---|---|---|---|---|
| DTA-19 (Deutsches Textarchiv) | medium | printed works | de | 17C–19C |
| impresso-nzz (Neue Zürcher Zeitung) | light | newspaper | de | 19C–20C |
| ICDAR-2017 (subsets) | substantial | newspaper | fr, de | 17C–20C |
| Overproof | substantial | newspaper | en | 19C–20C |
| Impresso snippets | newly transcribed | newspaper | en, lu, fr, de | 19C–20C |
Input/Output Format
Each dataset is provided as a JSONL file whose JSON documents follow the hipe-ocrepair.schema.json schema, with four top-level fields:
- document_metadata: provenance and contextual information.
- ocr_hypothesis: the OCR text to be corrected.
- ground_truth: the reference transcription (masked in test files).
- ocr_postcorrection_output: the field to be filled by participant systems.
See the Participation Guidelines for more information.
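A minimal sketch of how a system might process such a file: read one JSON document per line, fill ocr_postcorrection_output, and write the record back out. Only the four top-level field names come from the description above. The internal structure of each field is defined by hipe-ocrepair.schema.json and is not reproduced here, so the hypothetical `correct` callable simply receives whatever value ocr_hypothesis holds. The file names in the usage comment are placeholders.

```python
import json
from typing import Callable, Iterable, Iterator


def fill_outputs(
    lines: Iterable[str],
    correct: Callable[[object], object],
) -> Iterator[str]:
    """Yield one JSON line per input record with the output field filled in.

    `correct` is a hypothetical user-supplied function that maps the value of
    `ocr_hypothesis` to the value to store in `ocr_postcorrection_output`.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        record["ocr_postcorrection_output"] = correct(record["ocr_hypothesis"])
        # ensure_ascii=False keeps accented characters readable in the output
        yield json.dumps(record, ensure_ascii=False)


# Usage with placeholder file names and an identity baseline (no correction):
#
# with open("input.jsonl", encoding="utf-8") as src, \
#         open("output.jsonl", "w", encoding="utf-8") as dst:
#     for out_line in fill_outputs(src, correct=lambda hypothesis: hypothesis):
#         dst.write(out_line + "\n")
```

Passing the identity function as `correct` gives a "no correction" baseline, a useful sanity check before plugging in a real model.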
Realistic Example from Historical Data
Below is a simplified illustration of the four components of the HIPE-OCRepair benchmark: Original Ground Truth, Curated Ground Truth, OCR Output, and the Corrected Version a system should ideally produce.
| Original Ground Truth | Curated Ground Truth | OCR Output | Corrected Version |
|---|---|---|---|
| L’agence Havas nous transmet les dé pêches qui suivent… | L’agence Havas nous transmet les dépêches qui suivent… | L’agence Havas nous transmet les dé pêches qui suivent… | L’agence Havas nous transmet les dépêches qui suivent… |
| Deux bataillons d’infanterie de La Manoubaont été envoyés… | Deux bataillons d’infanterie de La Manouba ont été envoyés… | deux bataillons d’infanterie do La Manoubaont été envoyés… | Deux bataillons d’infanterie de La Manouba ont été envoyés… |
| Une vingtaine d’hommes sont descendus et se sont mis à la recherche du coupable… | Une vingtaine d’hommes… à la recherche du coupable… | une vingtaine d’hom mes… à la re cherche du coupable… | Une vingtaine d’hommes… à la recherche du coupable… |
| Après quelques instants de recherches… | Après quelques instants de recherches… | Après quelques ins-tan de recherches… | Après quelques instants de recherches… |
| Les émissaires annoncent que la révolte est au camp tunisien… | Les émissaires annoncent que… | émissaires… annon cent que… | Les émissaires annoncent que… |
| refusent d’obéir.béir. | …refusent d’obéir. | …refusent d’obéir.beir. | …refusent d’obéir. |
| Le général Ben Turquia s’efforçait de calmer les mutins… | …s’efforçait de calmer les mutins… | s’ef-.forçait de calmer les mutins… | …s’efforçait de calmer les mutins… |
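To inspect which characters differ between an OCR output and its corrected version, one option is a character-level diff. The sketch below uses Python's standard `difflib.SequenceMatcher`; the helper name `edit_spans` is illustrative and not part of any competition tooling. The alignment it reports is a heuristic and is not guaranteed to be the minimal edit script that a CER computation relies on.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def edit_spans(ocr: str, corrected: str) -> List[Tuple[str, str, str]]:
    """Return (operation, ocr_text, corrected_text) for every non-matching span.

    The operation is one of "replace", "delete" or "insert", as reported by
    difflib.SequenceMatcher.get_opcodes().
    """
    # autojunk=False disables difflib's popularity heuristic, which can hide
    # frequent characters (such as spaces) in long inputs.
    matcher = SequenceMatcher(a=ocr, b=corrected, autojunk=False)
    return [
        (tag, ocr[i1:i2], corrected[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


for op, before, after in edit_spans("les dé pêches", "les dépêches"):
    print(f"{op}: {before!r} -> {after!r}")
```

For the fragment above, the only reported span is the spurious space inside "dé pêches", which the corrected version removes.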
Questions?
Please post to the HIPE-OCRepair 2026 mailing list.