For a detailed description of the tasks and data, please refer to the HIPE-2022 Participation Guidelines and check the HIPE-2022-data repository.

Tasks

Task 1: Named Entity Recognition and Classification (NERC)

  • Subtask 1.1 - ‘NERC-Coarse’: recognition and classification of entity mentions according to coarse-grained types (task proposed for all languages and datasets).
  • Subtask 1.2 - ‘NERC-Fine’: recognition and classification of entity mentions according to fine-grained types (cf. column 2 in Table 2), plus the detection and classification of nested entities of depth 1 (task proposed for English, French and German for some datasets).

Task 2: Named Entity Linking (EL)

Linking of NE mentions to a unique referent in Wikidata or to a NIL node if the mention does not have a referent in the KB. The entity linking task includes two settings: with (EL only) and without (end-to-end EL) prior knowledge of mention boundaries.
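The difference between the two EL settings can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy knowledge base, the function names and the naive mention detector are hypothetical and are not part of the HIPE-2022 tooling or data format. The point is only that "EL only" receives mention boundaries as input, "end-to-end EL" has to find them itself, and a mention without a referent is mapped to NIL.

```python
NIL = "NIL"

# Hypothetical toy knowledge base: surface form -> Wikidata QID.
# A real system would query Wikidata or a candidate index instead.
TOY_KB = {"Paris": "Q90", "London": "Q84", "Berlin": "Q64"}


def link_mention(surface, kb=TOY_KB):
    """Return the QID for a mention, or NIL when the KB has no referent."""
    return kb.get(surface, NIL)


def el_only(tokens, gold_spans, kb=TOY_KB):
    """EL only: mention boundaries are given as (start, end) token offsets."""
    return [
        (start, end, link_mention(" ".join(tokens[start:end]), kb))
        for start, end in gold_spans
    ]


def end_to_end_el(tokens, detect_mentions, kb=TOY_KB):
    """End-to-end EL: boundaries come from the system's own detector."""
    return el_only(tokens, detect_mentions(tokens), kb)


def capitalised_tokens(tokens):
    """Deliberately naive detector: every capitalised token is a mention."""
    return [(i, i + 1) for i, tok in enumerate(tokens) if tok[:1].isupper()]


tokens = ["Le", "train", "de", "Paris", "arrive", "à", "Vevey"]
print(el_only(tokens, [(3, 4), (6, 7)]))
print(end_to_end_el(tokens, capitalised_tokens))
```

In the end-to-end call the naive detector also proposes "Le" as a mention, which shows how boundary errors propagate into the linking output when mention boundaries are not given.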

Data

HIPE-2022 datasets are based on six primary NE-annotated datasets assembled and prepared for the shared task. Primary datasets originate from several European cultural heritage projects, from the HIPE organizers’ previous research project, and from the previous HIPE-2020 campaign. Some are already published; others are released for the first time for HIPE-2022.

Primary datasets are composed of historical newspapers and classical commentaries covering ca. 200 years; they feature several languages and were annotated with different entity tag sets and according to different annotation guidelines.

The HIPE-2022 team assembles and prepares these primary datasets into HIPE-2022 release(s), each of which is a single package of consistently structured and homogeneously formatted files. Primary datasets undergo the following preparation steps:

  • conversion to the HIPE format (with correction of data inconsistencies and metadata consolidation);
  • rearrangement or composition of train and dev splits.

Important: To train their systems, teams may not use any data from the primary data projects other than the material available via the HIPE-2022 train/sample/dev sets released in the HIPE-2022-data repository. They may, however, use annotated data from any other project. The principles of trust and academic integrity apply.

Below is an overview table. For more information on version history, format, tagging scheme and mapping, see the generic HIPE-2022-data README and the participation guidelines.

HIPE-2022 data is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0).

| Dataset alias | README | Document type          | Languages      | Suitable for               | Project              | License         |
|---------------|--------|------------------------|----------------|----------------------------|----------------------|-----------------|
| ajmc          | link   | classical commentaries | de, fr, en     | NERC-Coarse, NERC-Fine, EL | AjMC                 | CC BY 4.0       |
| hipe2020      | link   | historical newspapers  | de, fr, en     | NERC-Coarse, NERC-Fine, EL | CLEF-HIPE-2020       | CC BY-NC-SA 4.0 |
| letemps       | link   | historical newspapers  | fr             | NERC-Coarse, NERC-Fine     | LeTemps              | CC BY-NC-SA 4.0 |
| topres19th    | link   | historical newspapers  | en             | NERC-Coarse, EL            | Living with Machines | CC BY-NC-SA 4.0 |
| newseye       | link   | historical newspapers  | de, fi, fr, sv | NERC-Coarse, NERC-Fine, EL | NewsEye              | CC BY 4.0       |
| sonar         | link   | historical newspapers  | de             | NERC-Coarse, EL            | SoNAR                | CC BY 4.0       |
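
When planning which datasets to train on, it can help to hold the overview table as data and filter it by task and language. The sketch below is only an illustration: the `Dataset` class and the `suitable_datasets` helper are hypothetical names, not part of the HIPE-2022-data repository, and the values are copied from the table above. Note that the table lists tasks per dataset, not per language, so a match here does not guarantee that a given task is annotated for every language of that dataset; the dataset READMEs remain the reference.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    alias: str
    document_type: str
    languages: tuple
    tasks: tuple


# Values transcribed from the overview table (hypothetical in-memory form).
DATASETS = [
    Dataset("ajmc", "classical commentaries", ("de", "fr", "en"),
            ("NERC-Coarse", "NERC-Fine", "EL")),
    Dataset("hipe2020", "historical newspapers", ("de", "fr", "en"),
            ("NERC-Coarse", "NERC-Fine", "EL")),
    Dataset("letemps", "historical newspapers", ("fr",),
            ("NERC-Coarse", "NERC-Fine")),
    Dataset("topres19th", "historical newspapers", ("en",),
            ("NERC-Coarse", "EL")),
    Dataset("newseye", "historical newspapers", ("de", "fi", "fr", "sv"),
            ("NERC-Coarse", "NERC-Fine", "EL")),
    Dataset("sonar", "historical newspapers", ("de",),
            ("NERC-Coarse", "EL")),
]


def suitable_datasets(task, language):
    """Return aliases of datasets listed for both the task and the language."""
    return [
        d.alias
        for d in DATASETS
        if task in d.tasks and language in d.languages
    ]


print(suitable_datasets("NERC-Fine", "fr"))
print(suitable_datasets("EL", "en"))
```

For example, `suitable_datasets("EL", "fi")` returns only `newseye`, the single dataset in the table that lists Finnish.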