Task Description

📩 Data repository (training data and documentation)
đŸ—’ïž Participation Guidelines
📈 HIPE-OCRepair-scorer repository
📊 Evaluation repository (for resuts after the competition)
🏆 Hugging Face Leaderboard (available soon)

The ICDAR 2026 Competition on LLM-Assisted OCR Post-Correction (HIPE-OCRepair) challenges participants to correct noisy OCR transcripts from multilingual historical documents.

Given an OCR segment and its metadata, systems must return an improved corrected version — without access to the source images.

The competition provides a multilingual benchmark with curated ground truth and a standardised evaluation protocol. Both generative and hybrid approaches are welcome, with a particular focus on LLM-assisted methods.

For each input text chunk (typically a paragraph-like unit), participants receive:

  • the raw OCR hypothesis
  • document-level metadata (language, publication title, type, date)
  • OCR quality indicators (CER, WER, lexicon-based quality score)

Systems are evaluated on their ability to reduce character error rate (CER). More details on metrics and evaluation infrastructure can be found on the âžĄïž Evaluation page.


Datasets

The benchmark comprises multilingual OCR post-correction data drawn from several historical collections, covering newspapers and printed works in English, French, and German (17th–20th century). All datasets were processed through a unified curation pipeline that standardises format, document unit segmentation and, as far as possible, transcription conventions, to ensure comparability across languages, time periods, and document types.

The benchmark includes training data (segmentation and formatting harmonisation only; original GT retained) as well as fully curated, manually corrected development and test sets.

👉 See the HIPE-OCRepair-2026-data repository to access the data and detailed documentation on the curation process.

📚 Source Collections

Dataset Curation Document Type Languages Period
DTA-19 (Deutsches Textarchiv) medium printed works de 17C–19C
impresso-nzz (Neue ZĂŒrcher Zeitung) light newspaper de 19C–20C
ICDAR-2017 (subsets) substantial newspaper fr, de 17C–20C
Overproof substantial newspaper en 19C–20C
Impresso snippets newly transcribed newspaper en, lu, fr, de 19C–20C

Input/Output Format

Each dataset is provided as a JSONL file with JSON documents following the hipe-ocrepair.schema.json schema, with four top-level fields

  • document_metadata: provenance and contextual information.
  • ocr_hypothesis: the OCR text to be corrected
  • ground_truth: the reference transcription (masked in test files).
  • ocr_postcorrection_output: the field to be filled by participant systems.

See more information on the Participation Guidelines.


Realistic Example from Historical Data

Below is a simplified illustration of the four components of the HIPE-OCRepair benchmark:
Original Ground Truth, Curated Ground Truth, OCR Output, and the Corrected Version a system should ideally produce.

Original Ground Truth Curated Ground Truth OCR Output Corrected Version
L’agence Havas nous transmet les dĂ© pĂȘches qui suivent
 L’agence Havas nous transmet les dĂ©pĂȘches qui suivent
 L’agence Havas nous transmet les dĂ© pĂȘches qui suivent
 L’agence Havas nous transmet les dĂ©pĂȘches qui suivent

Deux bataillons d’infanterie de La Manoubaont Ă©tĂ© envoyĂ©s
 Deux bataillons d’infanterie de La Manouba ont Ă©tĂ© envoyĂ©s
 deux bataillons d’infanterie do La Manoubaont Ă©tĂ© envoyĂ©s
 Deux bataillons d’infanterie de La Manouba ont Ă©tĂ© envoyĂ©s

Une vingtaine d’hommes sont descendus et se sont mis à la recherche du coupable
 Une vingtaine d’hommes
 à la recherche du coupable
 une vingtaine d’hom mes
 à la re cherche du coupable
 Une vingtaine d’hommes
 à la recherche du coupable

Aprùs quelques instants de recherches
 Aprùs quelques instants de recherches
 Aprùs quelques ins-tan de recherches
 Aprùs quelques instants de recherches

Les émissaires annoncent que la révolte est au camp tunisien
 Les émissaires annoncent que
 émissaires
 annon cent que
 Les émissaires annoncent que

refusent d’obĂ©ir.bĂ©ir. 
refusent d’obĂ©ir. 
refusent d’obĂ©ir.beir. 
refusent d’obĂ©ir.
Le gĂ©nĂ©ral Ben Turquia s’efforçait de calmer les mutins
 
s’efforçait de calmer les mutins
 s’ef-.forçait de calmer les mutins
 
s’efforçait de calmer les mutins


Questions?

Please post to HIPE-OCRepair 2026 mailing list.