ajmc-ne-corpus's Introduction

AjMC NE corpus

This dataset consists of named entity-annotated historical commentaries in the field of Classics. The annotated entities feature a few domain-specific entity types such as works, material objects (e.g. manuscripts) and bibliographic references, in addition to more universal named entities like persons, locations, organizations and dates.

Dataset profile

Document type: scholarly commentaries (19C)
Languages: English, French, German, Ancient Greek, Latin
Annotation guidelines: DOI
Annotation tool: INCEpTION
Original format and tagging scheme: HIPE TSV format, IOB
Annotations: NERC, EL (towards Wikidata)
Version: v0.4
Related publication: A Named Entity-Annotated Corpus of 19th Century Classical Commentaries
License: CC BY-NC-SA 4.0

Entity tagset

List of annotated entities (coarse level):

  • Person* (pers)
  • Location (loc)
  • Organisation (org)
  • Date (date)
  • Work* (work)
  • Scope (scope)
  • Object* (object)

Entities marked with an asterisk (*) are further classified into sub-types. For example, a person entity can be a) a mythological entity (pers.myth), b) an author (pers.author), c) an editor (pers.editor) or d) other (pers.other). See the annotation guidelines for the full list of entity sub-types.
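
The relation between coarse types and sub-types can be sketched in a few lines of Python. This is only an illustration: the helper name `split_tag` is hypothetical, and it assumes that tags follow the `PREFIX-type.subtype` pattern seen in labels such as pers.author, combined with the IOB prefixes used in this dataset.

```python
from typing import Optional, Tuple


def split_tag(tag: str) -> Tuple[str, Optional[str], Optional[str]]:
    """Split an IOB tag into (prefix, coarse type, sub-type).

    Assumption: the sub-type, when present, follows the coarse type
    after a dot, e.g. "B-pers.author".
    """
    if tag in ("O", "_"):
        # Tokens outside any entity carry no type information.
        return tag, None, None
    prefix, _, label = tag.partition("-")
    coarse, _, subtype = label.partition(".")
    return prefix, coarse, subtype or None
```

With this helper, a fine-grained tag such as `B-pers.myth` can be reduced to its coarse type `pers` when only the coarse level is needed.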

Format

This dataset comes in the CoNLL-like HIPE TSV format (for further details see the HIPE 2020 Task Participation Guidelines, p. 8). Sentence boundaries are indicated by the EndOfLine flag, contained in the MISC column, and correspond to manually identified linguistic sentences (see Guidelines, section 4). Hyphenated words were manually identified and re-composed (i.e. de-hyphenated).
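
As a rough illustration of how the sentence boundaries can be used, the sketch below groups token rows into sentences. It is not part of the repository: the function name `iter_sentences` is hypothetical, and it assumes ten tab-separated columns per token row (token first, coarse tag second, MISC last), a header row starting with TOKEN, and "|"-separated flags in MISC.

```python
from typing import Iterable, Iterator, List, Tuple

N_COLUMNS = 10  # assumption: token, eight annotation columns, MISC


def iter_sentences(
    lines: Iterable[str], boundary_flag: str = "EndOfLine"
) -> Iterator[List[Tuple[str, str]]]:
    """Yield sentences as lists of (token, coarse tag) pairs."""
    sentence: List[Tuple[str, str]] = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip metadata lines, blank lines and the header row.
        if len(fields) != N_COLUMNS or fields[0] == "TOKEN":
            continue
        token, coarse_tag, misc = fields[0], fields[1], fields[-1]
        sentence.append((token, coarse_tag))
        # MISC holds "|"-separated flags; "_" means no flag is set.
        if boundary_flag in misc.split("|"):
            yield sentence
            sentence = []
    if sentence:
        # Trailing tokens that were not closed by a boundary flag.
        yield sentence
```

The function accepts any iterable of lines, so an open file handle can be passed directly. The boundary flag is a parameter in case a file uses a different flag name.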

Annotated data come in two flavours, corresponding to two different sets of tasks:

  1. NER and EL: data contains annotations of universal entities, both coarse and fine grained, as well as entity links. See sample file (English).
  2. Citation mining (files with _biblio prefix in the name): data contains annotations of bibliographic references to both primary and secondary sources, according to the taxonomy described in the Annotation Guidelines section 2.3.

NB: the two files are fully aligned: line n in both files refers to the same annotated token. Information from both files can therefore be combined and used in multi-task learning scenarios.
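
A minimal sketch of how the alignment could be exploited is given below. The names `token_rows` and `merge_aligned` are hypothetical, and the sketch assumes that both files use ten tab-separated columns with the token in the first column and the tag of interest in the second. Which column of the _biblio files holds the citation tags should be checked against the actual data.

```python
from itertools import zip_longest
from typing import Iterable, Iterator, List, Tuple


def token_rows(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield the fields of token rows, skipping metadata and header lines."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 10 and fields[0] != "TOKEN":
            yield fields


def merge_aligned(
    ner_lines: Iterable[str], biblio_lines: Iterable[str]
) -> Iterator[Tuple[str, str, str]]:
    """Yield (token, NER tag, citation tag) for each aligned token row."""
    pairs = zip_longest(token_rows(ner_lines), token_rows(biblio_lines))
    for index, (ner, bib) in enumerate(pairs, start=1):
        # A missing row on either side means the files are not aligned.
        if ner is None or bib is None:
            raise ValueError(f"files differ in length at token row {index}")
        if ner[0] != bib[0]:
            raise ValueError(f"token mismatch at token row {index}")
        yield ner[0], ner[1], bib[1]
```

Checking that the tokens match on every row is a cheap safeguard: it fails early if two files from different versions or languages are paired by mistake.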

Related resources

Hucitlib Knowledge Base. Commentators make abundant use of very concise abbreviations when referring, for example, to ancient authors (pers.author) and their works (work.primlit). Such abbreviations constitute a substantial challenge, especially for entity linking. An external resource that can be used in this respect is the hucitlib knowledge base, which is partially linked to Wikidata and provides abbreviations and variant names/titles for classical authors and their works.

Citation mining. The dataset Annotated References in the Historiography on Venice: 19th–21st centuries, despite originating from a slightly different domain (i.e. history of Venice), contains annotations of primary and secondary bibliographic references. The guidelines according to which it was annotated are compatible with our guidelines for bibliographic entities.

License

The digitized commentaries are available in the Internet Archive and released in the Public Domain. This annotated dataset is published under a Creative Commons CC BY license (v. 4.0).

Acknowledgements

Data in this repository were produced in the context of the Ajax Multi-Commentary project, funded by the Swiss National Science Foundation under an Ambizione grant PZ00P1_186033.

Contributors: Carla Amaya (UNIL), Kevin Duc (UNIL), Sven Najem-Meyer (EPFL), Matteo Romanello (UNIL).

ajmc-ne-corpus's Issues

pending tasks for full data release

  • fix tokenization #1
  • fix #5
  • add Levenshtein distance to all datasets (LED flag) – except the mask test files
  • update the stats notebook in HIPE-2022-data
  • output a list of entity surface / gold transcript / entity type / QID as a separate file (remove entities belonging to test set!)
  • fix #6
  • revise guidelines (accept changes etc.) + publish on Zenodo
  • update README for dataset in HIPE2022-data/documentation

prepare repo for public release

With a view to publishing a data paper and making the repository public, we need to:

  • write a proper README (license, grant number, contributors, DOI)
  • finish revision of data (e.g. some entity links for Sophocles not pointing to right entity)
  • finish curation on the basis of double annotation
    • DE
    • FR
    • EN
  • fix pending issues such as #9

prepare docs for double-annotation

  • all docs to be double-annotated have been selected (see document-selection spreadsheet)
  • all docs were double checked for OLR problems
  • remaining tasks:
    • annotate sentences and hyphenation on EN and DE docs
    • import docs into a new INCEpTION project
    • import KD's annotations into the dedicated project (only for FR docs)

improve retokenization of tokens w/ multiple punctuation marks

file lib/retokenization.py

Tokens that contain multiple punctuation marks are not tokenised correctly:

(	O	_	O	_	_	O	_	_	NoSpaceAfter
supr.	O	_	O	_	_	O	_	_	_
165	B-scope	_	B-scope	_	_	O	_	_	InPrimaryReference
foll.)	I-scope	_	I-scope	_	_	O	_	_	InPrimaryReference|NoSpaceAfter|Partial--4:5
,	O	_	O	_	_	O	_	_	_

The following tokens are expected:

(
supr
.
165
foll
.
)
,
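
One way to obtain the expected tokens is sketched below. This is not the logic in lib/retokenization.py, which is not shown here. The function names are hypothetical, and the rule applied (runs of word characters stay together, every other non-space character becomes its own token) is an assumption that may be too aggressive for some cases, such as ranges written with a hyphen.

```python
import re
from typing import Iterable, List

# A piece is either a run of word characters or a single
# non-word, non-space character (i.e. one punctuation mark).
_PIECES = re.compile(r"\w+|[^\w\s]")


def split_punctuation(token: str) -> List[str]:
    """Split one token so that each punctuation mark stands alone."""
    return _PIECES.findall(token)


def retokenize(tokens: Iterable[str]) -> List[str]:
    """Apply split_punctuation to every token and flatten the result."""
    return [piece for token in tokens for piece in split_punctuation(token)]
```

Applied to the five tokens of the example above, `retokenize` returns the eight expected tokens. The sketch only handles the token column: how entity tags and MISC flags should be propagated to the new tokens is a separate decision.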

comply with the final HIPE data format

Changes to implement:

  • naming of files (e.g. HIPE-2022-v1.0-ajmc-train-de.tsv)
  • move the dataset's version number to the document metadata, and remove it from the file name
  • add namespaces to document metadata (TBC)
  • change EndOfLine to EndOfSentence (because that's what it is)
  • add language metadata
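
The flag renaming in the list above could be done along these lines. This is a sketch only: the function name `rename_misc_flag` is hypothetical, and it assumes that the MISC column holds "|"-separated flags, with "_" standing for an empty value.

```python
def rename_misc_flag(
    misc: str, old: str = "EndOfLine", new: str = "EndOfSentence"
) -> str:
    """Rename one flag in a MISC value, leaving all other flags untouched."""
    # Compare whole flags rather than substrings, so that a flag which
    # merely contains the old name is not rewritten by accident.
    return "|".join(new if flag == old else flag for flag in misc.split("|"))
```

An empty MISC value ("_") passes through unchanged, since it never equals the old flag name.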

missing noisy OCR transcript in FR docs

  • 2022-03-08 12:09:38,947 - root - ERROR - Transcript for noisy entity 4295 is missing in data/preparation/corpus/fr/retokenized/lestragdiesdeso00tourgoog_0068.xmi. Levenshtein distance cannot be computed and is set to 0.
  • 2022-03-08 12:09:39,059 - root - ERROR - Transcript for noisy entity Odyssée, 1IL, 2614 is missing in data/preparation/corpus/fr/retokenized/lestragdiesdeso00tourgoog_0069.xmi. Levenshtein distance cannot be computed and is set to 0.
  • 2022-03-08 12:09:39,785 - root - ERROR - Transcript for noisy entity τ. 785-786 is missing in data/preparation/corpus/fr/retokenized/lestragdiesdeso00tourgoog_0076.xmi. Levenshtein distance cannot be computed and is set to 0.
  • 2022-03-08 12:09:39,785 - root - ERROR - Transcript for noisy entity OEd. Οοΐ. 484 is missing in data/preparation/corpus/fr/retokenized/lestragdiesdeso00tourgoog_0076.xmi. Levenshtein distance cannot be computed and is set to 0.
  • 2022-03-08 12:09:39,918 - root - ERROR - Transcript for noisy entity 469 is missing in data/preparation/corpus/fr/retokenized/lestragdiesdeso00tourgoog_0077.xmi. Levenshtein distance cannot be computed and is set to 0.
  • 2022-03-08 12:09:41,424 - root - ERROR - Transcript for noisy entity Euripide, Androm.: 1224 is missing in data/preparation/corpus/fr/retokenized/lestragdiesdeso00tourgoog_0090.xmi. Levenshtein distance cannot be computed and is set to 0.
  • 2022-03-08 12:09:41,813 - root - ERROR - Transcript for noisy entity page 4150, remarque 4 is missing in data/preparation/corpus/fr/retokenized/lestragdiesdeso00tourgoog_0093.xmi. Levenshtein distance cannot be computed and is set to 0.
  • 2022-03-08 12:09:45,359 - root - ERROR - Transcript for noisy entity tome IV, page 1084 is missing in data/preparation/corpus/fr/retokenized/lestragdiesdeso00tourgoog_0122.xmi. Levenshtein distance cannot be computed and is set to 0.
  • 2022-03-08 12:09:47,705 - root - ERROR - Transcript for noisy entity pege 262 F is missing in data/preparation/corpus/fr/retokenized/lestragdiesdeso00tourgoog_0143.xmi. Levenshtein distance cannot be computed and is set to 0.
  • 2022-03-08 12:09:47,824 - root - ERROR - Transcript for noisy entity vers650 is missing in data/preparation/corpus/fr/retokenized/lestragdiesdeso00tourgoog_0144.xmi. Levenshtein distance cannot be computed and is set to 0.
  • 2022-03-08 12:09:48,057 - root - ERROR - Transcript for noisy entity 1376 is missing in data/preparation/corpus/fr/retokenized/lestragdiesdeso00tourgoog_0146.xmi. Levenshtein distance cannot be computed and is set to 0.
  • 2022-03-08 12:09:48,186 - root - ERROR - Transcript for noisy entity 699 is missing in data/preparation/corpus/fr/retokenized/lestragdiesdeso00tourgoog_0147.xmi. Levenshtein distance cannot be computed and is set to 0.
  • 2022-03-08 12:09:48,453 - root - ERROR - Transcript for noisy entity vers 4323 is missing in data/preparation/corpus/fr/retokenized/lestragdiesdeso00tourgoog_0149.xmi. Levenshtein distance cannot be computed and is set to 0.
  • 2022-03-08 12:09:48,966 - root - ERROR - Transcript for noisy entity vers 4410 is missing in data/preparation/corpus/fr/retokenized/lestragdiesdeso00tourgoog_0154.xmi. Levenshtein distance cannot be computed and is set to 0.
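
For reference, the distance mentioned in these messages can be computed as sketched below. This is a plain, unnormalised Levenshtein distance written for illustration, not the code that produced the log. The function names are hypothetical, and whether the LED flag stores a raw or a normalised value is not stated here. Unlike the behaviour reported in the log, the second helper returns None when the transcript is missing, so that a missing transcript cannot be confused with an exact match.

```python
from typing import Optional


def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def distance_or_none(surface: str, transcript: Optional[str]) -> Optional[int]:
    """Return the distance, or None when no transcript is available."""
    if not transcript:
        return None
    return levenshtein(surface, transcript)
```

As a check, `levenshtein("kitten", "sitting")` gives 3 (two substitutions and one insertion).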
