AjMC NE corpus

This dataset consists of named entity-annotated historical commentaries in the field of Classics. The annotated entities feature a few domain-specific entity types such as works, material objects (e.g. manuscripts) and bibliographic references, in addition to more universal named entities like persons, locations, organizations and dates.

Dataset profile


Document type	scholarly commentaries (19th century)
Languages	English, French, German, Ancient Greek, Latin
Annotation guidelines
Annotation tool	INCEpTION
Original format and tagging scheme	`HIPE TSV format, IOB`
Annotations	NERC, EL (towards Wikidata)
Version	`v0.4`
Related publication	A Named Entity-Annotated Corpus of 19th Century Classical Commentaries
License	CC BY 4.0

Entity tagset

List of annotated entities (coarse level):

Person* (pers)
Location (loc)
Organisation (org)
Date (date)
Work* (work)
Scope (scope)
Object* (object)

Entities marked with an asterisk (*) are further classified into sub-types. For example, a person entity can be a) a mythological entity (pers.myth), b) an author (pers.author), c) an editor (pers.editor) or d) other (pers.other). See the annotation guidelines for the full list of entity sub-types.
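
As an illustration of how coarse types and sub-types relate, the following minimal sketch splits an IOB tag such as `B-pers.author` into its position, coarse type and sub-type. The function name `parse_tag` and the `EntityTag` tuple are hypothetical helpers, not part of this repository, and the sketch assumes that `O` and `_` both mean "no entity".

```python
from typing import NamedTuple, Optional


class EntityTag(NamedTuple):
    position: Optional[str]  # "B" or "I"; None when the token is outside any entity
    coarse: Optional[str]    # e.g. "pers", "work"
    subtype: Optional[str]   # e.g. "author"; None for types without sub-types


def parse_tag(tag: str) -> EntityTag:
    """Split an IOB tag like 'B-pers.author' into (position, coarse, subtype)."""
    # Assumption: "O" and "_" both denote the absence of an annotation.
    if tag in ("O", "_", ""):
        return EntityTag(None, None, None)
    position, _, label = tag.partition("-")
    coarse, _, subtype = label.partition(".")
    return EntityTag(position, coarse, subtype or None)


print(parse_tag("B-pers.author"))
print(parse_tag("I-loc"))
```

Keeping the coarse type separate makes it easy to evaluate a model at the coarse level while still training on fine-grained labels.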

Format

This dataset comes in the CoNLL-like HIPE TSV format (for further details see the HIPE 2020 Task Participation Guidelines, p. 8). Sentence boundaries are indicated by the EndOfLine flag, contained in the MISC column, and correspond to manually identified linguistic sentences (see Guidelines, section 4). Hyphenated words were manually identified and re-composed (i.e. de-hyphenated).
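
The sentence segmentation described above can be recovered with a few lines of code. The sketch below is a rough illustration rather than an official loader: it assumes that the first non-comment line is a tab-separated header containing `TOKEN` and `MISC` columns, that metadata lines start with `#` and contain no tab, and that multiple MISC flags are joined with `|`. The sample uses a reduced, illustrative set of columns and invented tokens.

```python
from typing import Dict, List

# Illustrative sample only: real files contain more columns and metadata lines.
SAMPLE = (
    "TOKEN\tNE-COARSE-LIT\tMISC\n"
    "# hypothetical metadata line\n"
    "Ajax\tB-pers\t_\n"
    "speaks\tO\tNoSpaceAfter\n"
    ".\tO\tEndOfLine\n"
    "See\tO\t_\n"
    "Homer\tB-pers\tNoSpaceAfter\n"
    ".\tO\tEndOfLine\n"
)


def read_sentences(text: str, boundary_flag: str = "EndOfLine") -> List[List[Dict[str, str]]]:
    """Group token rows into sentences using a flag in the MISC column."""
    sentences: List[List[Dict[str, str]]] = []
    current: List[Dict[str, str]] = []
    header = None
    for line in text.splitlines():
        # Skip blank lines and metadata comments (assumed to contain no tab).
        if not line.strip() or (line.startswith("#") and "\t" not in line):
            continue
        fields = line.split("\t")
        if header is None:
            header = fields
            continue
        row = dict(zip(header, fields))
        current.append(row)
        # Assumption: several flags may be combined with "|".
        if boundary_flag in row.get("MISC", "").split("|"):
            sentences.append(current)
            current = []
    if current:  # keep trailing tokens that lack a final boundary flag
        sentences.append(current)
    return sentences


for sentence in read_sentences(SAMPLE):
    print([row["TOKEN"] for row in sentence])
```

Splitting on tabs by hand, instead of using a CSV reader with quoting enabled, avoids problems with tokens that are themselves quotation marks.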

Annotated data come in two flavours, corresponding to two different sets of tasks:

NER and EL: data contains annotations of universal entities, both coarse and fine grained, as well as entity links. See sample file (English).
Citation mining (files with _biblio prefix in the name): data contains annotations of bibliographic references to both primary and secondary sources, according to the taxonomy described in the Annotation Guidelines section 2.3.

NB: the two files are fully aligned, meaning that line n in both files refers to the same annotated token. As such, information from both files can be combined and used in multi-task learning scenarios.
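
Because of this alignment, the two annotation layers can be merged row by row. The following sketch shows one possible way to do so on rows that have already been parsed into dictionaries. The function `merge_aligned`, the output column name `BIBLIO`, the choice of `NE-COARSE-LIT` as the column holding the bibliographic tags, and the sample tags themselves are all assumptions made for illustration.

```python
from typing import Dict, List

Row = Dict[str, str]


def merge_aligned(
    ner_rows: List[Row],
    biblio_rows: List[Row],
    biblio_column: str = "NE-COARSE-LIT",  # assumed column holding the biblio tags
    merged_name: str = "BIBLIO",           # hypothetical name for the added column
) -> List[Row]:
    """Attach the bibliographic tag of each token to its NER/EL row."""
    if len(ner_rows) != len(biblio_rows):
        raise ValueError("files are not aligned: different number of token rows")
    merged: List[Row] = []
    for index, (ner, bib) in enumerate(zip(ner_rows, biblio_rows), start=1):
        # Sanity check: row n must describe the same token in both files.
        if ner["TOKEN"] != bib["TOKEN"]:
            raise ValueError(f"token mismatch at row {index}")
        combined = dict(ner)
        combined[merged_name] = bib[biblio_column]
        merged.append(combined)
    return merged


# Invented rows for illustration; the tag values are placeholders.
ner = [
    {"TOKEN": "Hom.", "NE-COARSE-LIT": "B-pers"},
    {"TOKEN": "Il.", "NE-COARSE-LIT": "B-work"},
]
biblio = [
    {"TOKEN": "Hom.", "NE-COARSE-LIT": "B-primary"},
    {"TOKEN": "Il.", "NE-COARSE-LIT": "I-primary"},
]
for row in merge_aligned(ner, biblio):
    print(row)
```

Checking the token text on every row is cheap and catches the case where one of the files was re-tokenized or edited independently.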

Related resources

Hucitlib Knowledge Base. Commentators make abundant use of very concise abbreviations when referring e.g. to ancient authors (pers.author) and their works (work.primlit). Such abbreviations constitute a substantial challenge, especially for entity linking. An external resource that can be used in this respect is the hucitlib knowledge base which is partially linked to Wikidata and provides abbreviations and variant names/titles for classical authors and their works.

Citation mining. The dataset Annotated References in the Historiography on Venice: 19th–21st centuries, despite originating from a slightly different domain (i.e. history of Venice), contains annotations of primary and secondary bibliographic references. The guidelines according to which it was annotated are compatible with our guidelines for bibliographic entities.

License

The digitized commentaries are available on the Internet Archive and are in the public domain. This annotated dataset is published under a Creative Commons CC BY license (v. 4.0).

Acknowledgements

Data in this repository were produced in the context of the Ajax Multi-Commentary project, funded by the Swiss National Science Foundation under Ambizione grant PZ00P1_186033.

Contributors: Carla Amaya (UNIL), Kevin Duc (UNIL), Sven Najem-Meyer (EPFL), Matteo Romanello (UNIL).

missing noisy OCR transcript in FR docs

2022-03-08 12:09:38,947 - root - ERROR - Transcript for noisy entity 4295 is missing in data/preparation/corpus/fr/retokenized/lestragdiesdeso00tourgoog_0068.xmi. Levenshtein distance cannot be computed and is set to 0.
2022-03-08 12:09:39,059 - root - ERROR - Transcript for noisy entity Odyssée, 1IL, 2614 is missing in data/preparation/corpus/fr/retokenized/lestragdiesdeso00tourgoog_0069.xmi. Levenshtein distance cannot be computed and is set to 0.
2022-03-08 12:09:39,785 - root - ERROR - Transcript for noisy entity τ. 785-786 is missing in data/preparation/corpus/fr/retokenized/lestragdiesdeso00tourgoog_0076.xmi. Levenshtein distance cannot be computed and is set to 0.
2022-03-08 12:09:39,785 - root - ERROR - Transcript for noisy entity OEd. Οοΐ. 484 is missing in data/preparation/corpus/fr/retokenized/lestragdiesdeso00tourgoog_0076.xmi. Levenshtein distance cannot be computed and is set to 0.
2022-03-08 12:09:39,918 - root - ERROR - Transcript for noisy entity 469 is missing in data/preparation/corpus/fr/retokenized/lestragdiesdeso00tourgoog_0077.xmi. Levenshtein distance cannot be computed and is set to 0.
2022-03-08 12:09:41,424 - root - ERROR - Transcript for noisy entity Euripide, Androm.: 1224 is missing in data/preparation/corpus/fr/retokenized/lestragdiesdeso00tourgoog_0090.xmi. Levenshtein distance cannot be computed and is set to 0.
2022-03-08 12:09:41,813 - root - ERROR - Transcript for noisy entity page 4150, remarque 4 is missing in data/preparation/corpus/fr/retokenized/lestragdiesdeso00tourgoog_0093.xmi. Levenshtein distance cannot be computed and is set to 0.
2022-03-08 12:09:45,359 - root - ERROR - Transcript for noisy entity tome IV, page 1084 is missing in data/preparation/corpus/fr/retokenized/lestragdiesdeso00tourgoog_0122.xmi. Levenshtein distance cannot be computed and is set to 0.
2022-03-08 12:09:47,705 - root - ERROR - Transcript for noisy entity pege 262 F is missing in data/preparation/corpus/fr/retokenized/lestragdiesdeso00tourgoog_0143.xmi. Levenshtein distance cannot be computed and is set to 0.
2022-03-08 12:09:47,824 - root - ERROR - Transcript for noisy entity vers650 is missing in data/preparation/corpus/fr/retokenized/lestragdiesdeso00tourgoog_0144.xmi. Levenshtein distance cannot be computed and is set to 0.
2022-03-08 12:09:48,057 - root - ERROR - Transcript for noisy entity 1376 is missing in data/preparation/corpus/fr/retokenized/lestragdiesdeso00tourgoog_0146.xmi. Levenshtein distance cannot be computed and is set to 0.
2022-03-08 12:09:48,186 - root - ERROR - Transcript for noisy entity 699 is missing in data/preparation/corpus/fr/retokenized/lestragdiesdeso00tourgoog_0147.xmi. Levenshtein distance cannot be computed and is set to 0.
2022-03-08 12:09:48,453 - root - ERROR - Transcript for noisy entity vers 4323 is missing in data/preparation/corpus/fr/retokenized/lestragdiesdeso00tourgoog_0149.xmi. Levenshtein distance cannot be computed and is set to 0.
2022-03-08 12:09:48,966 - root - ERROR - Transcript for noisy entity vers 4410 is missing in data/preparation/corpus/fr/retokenized/lestragdiesdeso00tourgoog_0154.xmi. Levenshtein distance cannot be computed and is set to 0.
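
The log above indicates that the preparation step compares the OCR surface form of a noisy entity with its transcript and falls back to a distance of 0 when the transcript is missing. The sketch below reproduces that behaviour under stated assumptions: the function names are hypothetical, the actual pipeline code is not shown in this document, and the corrected transcript used in the example is a guess.

```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def noisy_entity_distance(surface: str, transcript: Optional[str]) -> int:
    """Distance between OCR surface form and transcript; 0 if no transcript."""
    if not transcript:
        # Mirrors the fallback reported in the log above.
        logger.error("Transcript for noisy entity %s is missing; distance set to 0.", surface)
        return 0
    return levenshtein(surface, transcript)


# "page 262 F" is a hypothetical correction of the OCR string from the log.
print(noisy_entity_distance("pege 262 F", "page 262 F"))
print(noisy_entity_distance("vers650", None))
```

Note that a fallback of 0 cannot be told apart from a perfect match, so returning `None` for missing transcripts may be a safer choice downstream.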
