Tasks & Data

For a detailed description of the tasks and data, please refer to the HIPE-2022 Participation Guidelines and check the HIPE-2022-data repository.

Tasks

Task 1: Named Entity Recognition and Classification (NERC)

Subtask 1.1 - ‘NERC-Coarse’: recognition and classification of entity mentions according to coarse-grained types (task proposed for all languages and datasets).
Subtask 1.2 - ‘NERC-Fine’: recognition and classification of entity mentions according to fine-grained types (cf. column 2 in Table 2), plus the detection and classification of nested entities of depth 1 (task proposed for English, French and German for some datasets).

Task 2: Named Entity Linking (EL)

Linking of NE mentions to a unique referent in Wikidata or to a NIL node if the mention does not have a referent in the KB. The entity linking task includes two settings: with (EL only) and without (end-to-end EL) prior knowledge of mention boundaries.
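The NIL fallback described above can be illustrated with a minimal sketch. It assumes the "EL only" setting, where the mention string is already given. The lookup table `TOY_KB` and the function `link_mention` are hypothetical names introduced here for illustration; a real system would query Wikidata and disambiguate candidates in context rather than use a fixed dictionary.

```python
# Marker returned when a mention has no referent in the knowledge base.
NIL = "NIL"

# Toy stand-in for a knowledge base (assumption for illustration only):
# lower-cased surface form -> Wikidata QID.
TOY_KB = {
    "paris": "Q90",   # Paris
    "berlin": "Q64",  # Berlin
}


def link_mention(mention, kb=TOY_KB):
    """Return the QID for a mention, or NIL if the toy KB has no entry."""
    key = mention.strip().lower()
    return kb.get(key, NIL)


if __name__ == "__main__":
    for mention in ["Paris", " Berlin ", "M. Dupont"]:
        print(mention.strip(), "->", link_mention(mention))
```

In the end-to-end setting, a mention detection step would have to run before a lookup of this kind, since mention boundaries are not provided.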

Data

HIPE-2022 datasets are based on six primary NE-annotated datasets assembled and prepared for the shared task. Primary datasets originate from several European cultural heritage projects, from the HIPE organizers’ previous research project, and from the previous HIPE-2020 campaign. Some are already published; others are released for the first time for HIPE-2022.

Primary datasets are composed of historical newspapers and classical commentaries covering ca. 200 years; they feature several languages and were annotated with different entity tag sets and according to different annotation guidelines.

The HIPE-2022 team assembles and prepares these primary datasets into HIPE-2022 release(s), each consisting of a single package of uniformly structured and homogeneously formatted files. Primary datasets undergo the following preparation steps:

conversion to the HIPE format (with correction of data inconsistencies and metadata consolidation);
rearrangement or composition of train and dev splits.

Important: To train their systems, teams may not use any data from the primary data projects other than the material available via the HIPE-2022 train/sample/dev sets released in the HIPE-2022-data repository. They may, however, use annotated data from any other project. The principles of trust and academic integrity apply.

The table below gives an overview. For more information on version history, format, tagging scheme and mapping, see the generic HIPE-2022-data README and the participation guidelines.

HIPE-2022 data is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Dataset alias	README	Document type	Languages	Suitable for	Project
ajmc	link	classical commentaries	de, fr, en	NERC-Coarse, NERC-Fine, EL	AjMC
hipe2020	link	historical newspapers	de, fr, en	NERC-Coarse, NERC-Fine, EL	CLEF-HIPE-2020
letemps	link	historical newspapers	fr	NERC-Coarse, NERC-Fine	LeTemps
topres19th	link	historical newspapers	en	NERC-Coarse, EL	Living with Machines
newseye	link	historical newspapers	de, fi, fr, sv	NERC-Coarse, NERC-Fine, EL	NewsEye
sonar	link	historical newspapers	de	NERC-Coarse, EL	SoNAR
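
As a convenience, the overview table can be encoded as a small data structure to look up which datasets suit a given task and language. The sketch below copies the aliases, languages and tasks from the table; the names `OVERVIEW` and `datasets_for` are hypothetical and not part of any HIPE-2022 tooling.

```python
# Overview table from above: (alias, document type, languages, suitable tasks).
OVERVIEW = [
    ("ajmc", "classical commentaries", {"de", "fr", "en"},
     {"NERC-Coarse", "NERC-Fine", "EL"}),
    ("hipe2020", "historical newspapers", {"de", "fr", "en"},
     {"NERC-Coarse", "NERC-Fine", "EL"}),
    ("letemps", "historical newspapers", {"fr"},
     {"NERC-Coarse", "NERC-Fine"}),
    ("topres19th", "historical newspapers", {"en"},
     {"NERC-Coarse", "EL"}),
    ("newseye", "historical newspapers", {"de", "fi", "fr", "sv"},
     {"NERC-Coarse", "NERC-Fine", "EL"}),
    ("sonar", "historical newspapers", {"de"},
     {"NERC-Coarse", "EL"}),
]


def datasets_for(task, language=None, overview=OVERVIEW):
    """Return dataset aliases suitable for a task, optionally for one language."""
    return [
        alias
        for alias, _doc_type, languages, tasks in overview
        if task in tasks and (language is None or language in languages)
    ]


if __name__ == "__main__":
    print(datasets_for("NERC-Fine"))
    print(datasets_for("EL", language="en"))
```

Note that the table lists languages and tasks per dataset only; whether a given task is offered for every language of a dataset should be checked in the dataset READMEs.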