Evaluation Infrastructure

The HIPE-2026 evaluation infrastructure is designed to ensure fair, transparent, and reproducible assessment of both accuracy- and efficiency-focused system submissions. Participants will submit predictions on two blind test sets in a standardized format. Official scoring scripts will compute the core classification metrics (Precision, Recall, Accuracy, and F1) for each relation type (at, isAt), macro-averaged at the document level so that scores remain comparable across documents.
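To make the metric concrete, a document-level, macro-averaged F1 for one relation type could be computed roughly as in the sketch below. This is a minimal illustration, not the official scoring script: the data layout (a document id mapped to a set of (head, relation, tail) triples) and all function names are assumptions.

```python
from typing import Dict, Set, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)

def prf(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall, and F1 from raw counts (0.0 when undefined)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def macro_document_f1(gold: Dict[str, Set[Triple]],
                      pred: Dict[str, Set[Triple]],
                      relation: str) -> float:
    """Macro-average F1 over documents for a single relation type."""
    f1_scores = []
    for doc_id, gold_triples in gold.items():
        g = {t for t in gold_triples if t[1] == relation}
        p = {t for t in pred.get(doc_id, set()) if t[1] == relation}
        if not g and not p:
            continue  # relation absent from this document; skip it
        tp, fp, fn = len(g & p), len(p - g), len(g - p)
        f1_scores.append(prf(tp, fp, fn)[2])
    return sum(f1_scores) / len(f1_scores) if f1_scores else 0.0

# Toy example with two documents and hypothetical entities.
gold = {
    "doc1": {("Smith", "at", "Paris")},
    "doc2": {("Jones", "at", "Lyon"), ("mill", "isAt", "Lyon")},
}
pred = {
    "doc1": {("Smith", "at", "Paris")},
    "doc2": {("mill", "isAt", "Lyon")},
}
print(macro_document_f1(gold, pred, "at"))    # (1.0 + 0.0) / 2 = 0.5
print(macro_document_f1(gold, pred, "isAt"))  # doc1 skipped -> 1.0
```

Macro-averaging at the document level gives every document equal weight, so a few dense documents with many relation instances cannot dominate the final score.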

Efficiency-related factors, such as model size, inference time, and hardware usage, will be collected via metadata forms submitted by participants. These indicators will be used to rank systems within the efficiency profile, which rewards cost-effective, scalable approaches.
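For orientation, such a metadata form might collect fields along the lines of the sketch below. Every field name here is hypothetical; the official form and its exact fields will be published by the organizers.

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class EfficiencyMetadata:
    """Hypothetical efficiency-profile metadata; all fields are illustrative."""
    system_name: str
    parameter_count_millions: float  # model size
    inference_time_seconds: float    # wall-clock time over the full test set
    hardware: str                    # e.g. "1x NVIDIA T4 16GB" or "CPU only"
    peak_memory_gb: float            # peak RAM/VRAM observed during inference

meta = EfficiencyMetadata(
    system_name="example-system",
    parameter_count_millions=110.0,
    inference_time_seconds=342.5,
    hardware="1x NVIDIA T4 16GB",
    peak_memory_gb=6.2,
)
print(json.dumps(asdict(meta), indent=2))  # serialize for submission
```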

All evaluation tools, baseline resources, and submission templates will be made publicly available to support reproducibility and broad participation.

📢 Announcement:
The official GitHub repository for HIPE-2026 evaluation scripts and starter code will be announced soon. Please check back or follow updates via the mailing list.