ATLANTIS Dataset

This repository contains the ATLANTIS (coAsTaL mAritime NavigaTion InstructionS) dataset files and our benchmark results on this dataset.

This work is co-financed by the Shom and the IGN and is being carried out at the LASTIG, a research unit at Université Gustave Eiffel.

Under no circumstances should any part of this work be used for real-life navigation.

Overview

ATLANTIS is a French-language dataset for nested spatial entity and binary spatial relation extraction from text. The dataset is composed of extracts from 15 different volumes of the Instructions nautiques. The Instructions nautiques are a series of books produced and published by the French Naval Hydrographic and Oceanographic Service, the Shom. Each volume contains essential information for navigating safely in the coastal waters of a specific geographic area, including instructions for entering ports and descriptions of the coastal maritime environment. We extracted the text from the PDF documents using pdfminer.six. The volumes, which are written in French, cover coastal areas in Africa, Europe, North and South America, as well as in the Indian and Pacific Oceans.
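
For reference, a minimal sketch of this extraction step with pdfminer.six might look as follows (the file name is illustrative; the source PDFs are not distributed in this repository):

```python
# Minimal sketch of the text extraction step with pdfminer.six.
# The input path is illustrative: the Instructions nautiques PDFs
# are not distributed in this repository.
from pdfminer.high_level import extract_text

text = extract_text("instructions_nautiques_volume.pdf")
print(text[:500])  # inspect the first 500 characters of the extracted text
```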

The TextMine'24 Dataset is a subset of the ATLANTIS Dataset and contains nested spatial entity annotations. It was used as the benchmark dataset for the Défi TextMine 2024, a spatial entity recognition challenge hosted by the TextMine working group, which is part of EGC, the International Francophone Association for Knowledge Extraction and Management.

Annotation Scheme

We annotated our dataset by hand using the brat rapid annotation tool, which allows creating nested labelled annotations and directed labelled links between them. The source text was split on whitespace by brat, giving a dataset of 101,400 tokens.

Nested Spatial Entities

We use three labels to annotate unnamed and named nested spatial entities:

  • geographic feature refers to common nouns that represent types of spatial entities
  • name refers to pure proper nouns
  • geographic name refers to the full name associated with a geographic feature

The labels obey the following rules:

  • Any token can be annotated with zero or one geographic feature or name label, but never with both.
  • A token cannot be annotated only with a name label: a token annotated with a name label must also be annotated one or more times with a geographic name label.
  • Any token, already labelled or not, can be annotated zero or more times with a geographic name label.
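
For illustration, under this scheme a phrase such as « Le port de Brest » could be annotated in brat standoff format (tab-separated) roughly as follows; the character offsets are illustrative, and the exact label spellings used in the .ann files may differ:

```
T1	geographic feature 3 7	port
T2	name 11 16	Brest
T3	geographic name 3 16	port de Brest
```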

Binary Spatial Relations

We use 19 labels to annotate binary spatial relations:

  • is off the coast of refers to an entity that is off the coast of another (it is always referred to using the same words in French in our dataset: "au large de")
  • is marked by refers to an entity that is marked by another (usually a navigation mark such as a buoy or a beacon)
  • is an element of refers to an entity that is situated on or in another such that a bird's eye view shows the spatial footprint of one as being within or partly within the other
  • is [direction] of refers to an entity that lies in one of 16 compass directions with respect to another: the 16 relations include four that use the cardinal directions (N, E, S, W), four that use the intercardinal directions (NE, SE, SW, NW) and eight that use the secondary intercardinal directions (NNE, ENE, ESE, SSE, SSW, WSW, WNW, NNW)

All relation annotations must link two entity annotations, with the exception that name entity annotations cannot be part of any relation. All relation annotations must have a direction.
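
Continuing the illustrative example above, a relation annotation in the .ann files links two entity annotations, e.g. as follows (the identifiers and the exact label spelling are again illustrative; T9 stands for another geographic name annotation, not shown):

```
R1	is an element of Arg1:T3 Arg2:T9
```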

Repository Content

The atlantis_dataset_brat directory contains the .txt source files and the corresponding .ann annotation files for the entire dataset.

The atlantis_dataset.jsonl file contains the entire annotated dataset in JSONL format, following the format of the SCIERC dataset. We used this script, modified to use the fr_core_news_sm spaCy model for tokenisation, to convert the files from the brat standoff format to JSONL.
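
For orientation, a record in this SCIERC-style format holds one document per line, with document-level token indices and per-sentence lists of entity spans and relations. The record below is fabricated (the doc_key, tokens and indices are not taken from the dataset) and is shown on several lines for readability:

```
{"doc_key": "example",
 "sentences": [["Le", "port", "de", "Brest", "est", "vaste", "."]],
 "ner": [[[1, 1, "geographic feature"], [3, 3, "name"], [1, 3, "geographic name"]]],
 "relations": [[]]}
```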

The our_split_jsonl directory contains the train, development and test JSONL files that we used for our experiments. We split the atlantis_dataset.jsonl file into train, development and test parts according to an 80:10:10 ratio in terms of both the overall number of tokens and the number of entity labels. We also ensured that text covering each geographic area was present in all three parts. Our dataset of 101,400 tokens contains 16,777 entity labels (each of which can span one or more tokens) and 3,051 relation labels (each of which connects exactly two entity labels, with a direction). In total, 18,030 tokens are annotated with at least one entity label, which corresponds to almost one in five tokens.
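
As a sanity check, the split proportions can be recomputed from the JSONL files with a few lines of Python (the train/dev/test file names inside our_split_jsonl are an assumption):

```python
# Sketch: tally tokens and entity labels per split to check the 80:10:10 ratio.
# The file names inside our_split_jsonl/ are assumptions.
import json

for split in ("train", "dev", "test"):
    tokens = entities = 0
    with open(f"our_split_jsonl/{split}.jsonl", encoding="utf-8") as f:
        for line in f:
            doc = json.loads(line)
            tokens += sum(len(sent) for sent in doc["sentences"])
            entities += sum(len(spans) for spans in doc["ner"])
    print(split, tokens, entities)
```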

Dataset Composition

| | Train nb. | Dev nb. | Test nb. | Total nb. |
| --- | ---: | ---: | ---: | ---: |
| Tokens | 83,851 | 8,156 | 9,393 | 101,400 |
| Unlabelled tokens | 69,200 | 6,507 | 7,663 | 83,370 |
| Entity-labelled tokens | 14,651 | 1,649 | 1,730 | 18,030 |
| Entity labels | 13,582 | 1,476 | 1,719 | 16,777 |
| Relation labels | 2,507 | 222 | 322 | 3,051 |

Entity Label Count

| Label | Train nb. | Dev nb. | Test nb. | Total nb. |
| --- | ---: | ---: | ---: | ---: |
| geographic feature | 6,602 | 692 | 801 | 8,095 |
| name | 3,486 | 391 | 462 | 4,339 |
| geographic name | 3,494 | 393 | 456 | 4,343 |

Relation Label Count

| Label | Train nb. | Dev nb. | Test nb. | Total nb. |
| --- | ---: | ---: | ---: | ---: |
| is an element of | 1,300 | 109 | 190 | 1,599 |
| is marked by | 143 | 13 | 17 | 173 |
| is off the coast of | 21 | 1 | 1 | 23 |
| is N of | 84 | 9 | 8 | 101 |
| is NNE of | 46 | 1 | 3 | 50 |
| is NE of | 47 | 1 | 5 | 53 |
| is ENE of | 72 | 11 | 6 | 89 |
| is E of | 92 | 6 | 11 | 109 |
| is ESE of | 73 | 8 | 13 | 94 |
| is SE of | 42 | 4 | 1 | 47 |
| is SSE of | 51 | 10 | 3 | 64 |
| is S of | 84 | 8 | 12 | 104 |
| is SSW of | 86 | 6 | 7 | 99 |
| is SW of | 45 | 2 | 5 | 52 |
| is WSW of | 75 | 10 | 6 | 91 |
| is W of | 75 | 10 | 11 | 96 |
| is WNW of | 76 | 4 | 5 | 85 |
| is NW of | 32 | 2 | 8 | 42 |
| is NNW of | 63 | 7 | 10 | 80 |

Benchmark Results

We provide benchmark results for this dataset for three tasks: nested spatial entity extraction, binary spatial relation extraction, and end-to-end nested spatial entity and binary spatial relation extraction.

Our benchmark results were obtained using a supervised deep learning approach for automatic nested spatial entity and binary spatial relation extraction from text, which we presented at the 7th ACM SIGSPATIAL International Workshop on Geospatial Humanities, held in conjunction with ACM SIGSPATIAL 2023. Our approach applies PURE (the Princeton University Relation Extraction system), which was designed for flat, generic entity extraction and generic binary relation extraction, to the extraction of nested spatial entities and binary spatial relations. The advantage of extracting nested spatial entities and the spatial relations between them is that this captures more information that can aid entity disambiguation.

This work was granted access to the HPC resources of IDRIS under the allocation 2022-AD011013699 made by GENCI.

Hyperparameters

| Hyperparameter | Entity Model | Relation Model |
| --- | ---: | ---: |
| Learning rate | 1e-5 | 2e-5 |
| Task learning rate | 5e-4 | - |
| Train batch size | 16 | 32 |
| Training epochs | 100 | 10 |
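
As a rough guide, these settings map onto PURE's training scripts as sketched below. The flag names follow the public PURE repository; the task definition, data paths and encoder checkpoint are placeholders, and the scripts must be adapted to the ATLANTIS label set.

```sh
# Sketch only: flags follow PURE's run_entity.py / run_relation.py;
# paths and the encoder checkpoint are placeholders.
python run_entity.py \
    --do_train --do_eval --eval_test \
    --learning_rate 1e-5 --task_learning_rate 5e-4 \
    --train_batch_size 16 --num_epoch 100 \
    --context_window 248 \
    --data_dir our_split_jsonl \
    --model <encoder_checkpoint> \
    --output_dir entity_model

python run_relation.py \
    --do_train --do_eval --eval_test --eval_with_gold \
    --learning_rate 2e-5 --train_batch_size 32 \
    --num_train_epochs 10 --context_window 0 \
    --model <encoder_checkpoint> \
    --entity_output_dir entity_model \
    --output_dir relation_model
```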

Results

F1-Scores for Varying Context Window Sizes

Mean micro F1-score with standard deviation over five runs for varying context window ($W$) sizes, for: entity extraction, relation extraction from gold entity annotations, and end-to-end entity and relation extraction (e2e) from the best predicted entity annotations ($W$ = 248 for monolingual, $W$ = 200 for multilingual). For each task, the overall highest F1-score over all context window sizes and both base encoders is indicated with an asterisk (*).

| Context Window | Entity / Mono. | Entity / Multi. | Relation / Mono. | Relation / Multi. | e2e / Mono. | e2e / Multi. |
| --- | --- | --- | --- | --- | --- | --- |
| $W$=0 | 91.1 ± 0.3 | 91.9 ± 0.2 | 64.2* ± 2.2 | 63.2 ± 1.0 | 63.9* ± 2.2 | 63.2 ± 1.2 |
| $W$=50 | 92.1 ± 0.2 | 92.3 ± 0.3 | 64.2 ± 1.4 | 63.0 ± 1.7 | 63.8 ± 1.4 | 63.1 ± 1.7 |
| $W$=100 | 91.9 ± 0.2 | 92.3 ± 0.2 | 63.7 ± 0.7 | 62.9 ± 0.7 | 63.6 ± 0.7 | 62.9 ± 0.8 |
| $W$=150 | 91.9 ± 0.2 | 92.2 ± 0.2 | - | - | - | - |
| $W$=200 | 92.0 ± 0.2 | 92.3* ± 0.2 | - | - | - | - |
| $W$=248 | 92.2 ± 0.2 | 92.3 ± 0.2 | - | - | - | - |

Detailed Nested Entity Extraction Results

[Prec.|Rec.|F1] / [Mono.|Multi.] gives the mean [precision|recall|micro F1-score] for the [monolingual|multilingual] model over five runs for entity extraction, for each entity label, using the context window that gives the best overall results ($W$ = 248 for monolingual, $W$ = 200 for multilingual).

| Label | Prec. / Mono. | Prec. / Multi. | Rec. / Mono. | Rec. / Multi. | F1 / Mono. | F1 / Multi. |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| all entity labels | 94.6 | 95.2 | 89.8 | 89.6 | 92.2 | 92.3 |
| geographic feature | 94.1 | 94.4 | 95.8 | 95.1 | 95.0 | 94.8 |
| name | 97.7 | 97.4 | 78.0 | 78.4 | 86.7 | 86.9 |
| geographic name | 92.9 | 94.9 | 91.3 | 91.4 | 92.1 | 93.1 |

Detailed Binary Relation Extraction Results

[Prec.|Rec.|F1] / [Mono.|Multi.] gives the mean [precision|recall|micro F1-score] for the [monolingual|multilingual] model over five runs for relation extraction, for each relation label, from gold entity annotations, using the context window that gives the best overall results ($W$ = 0 for both the monolingual and the multilingual model).

| Label | Prec. / Mono. | Prec. / Multi. | Rec. / Mono. | Rec. / Multi. | F1 / Mono. | F1 / Multi. |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| all relation labels | 70.8 | 67.2 | 58.8 | 59.9 | 64.2 | 63.2 |
| is an element of | 70.5 | 67.9 | 60.2 | 60.3 | 64.9 | 63.8 |
| is marked by | 64.9 | 55.1 | 51.8 | 49.4 | 57.5 | 50.8 |
| is off the coast of | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
| is N of | 48.6 | 39.8 | 52.5 | 50.0 | 49.9 | 44.2 |
| is NNE of | 76.7 | 37.3 | 73.3 | 53.3 | 74.1 | 43.3 |
| is NE of | 95.0 | 100.0 | 64.0 | 60.0 | 76.1 | 75.0 |
| is ENE of | 83.0 | 93.3 | 63.3 | 83.3 | 71.6 | 87.9 |
| is E of | 62.5 | 57.1 | 45.5 | 41.8 | 52.6 | 48.1 |
| is ESE of | 73.4 | 71.1 | 55.4 | 66.2 | 62.6 | 67.8 |
| is SE of | 90.0 | 60.0 | 100.0 | 100.0 | 93.3 | 73.3 |
| is SSE of | 73.7 | 81.3 | 73.3 | 66.7 | 68.8 | 69.3 |
| is S of | 84.7 | 82.1 | 71.7 | 81.7 | 77.3 | 81.1 |
| is SSW of | 87.4 | 74.1 | 68.6 | 85.7 | 76.0 | 79.3 |
| is SW of | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| is WSW of | 92.0 | 83.6 | 66.7 | 70.0 | 77.1 | 75.3 |
| is W of | 79.3 | 80.4 | 65.5 | 60.0 | 71.3 | 68.0 |
| is WNW of | 100.0 | 89.3 | 88.0 | 88.0 | 93.3 | 87.9 |
| is NW of | 67.3 | 87.4 | 35.0 | 50.0 | 45.7 | 63.0 |
| is NNW of | 61.7 | 64.1 | 44.0 | 40.0 | 51.1 | 49.0 |

Detailed End-to-End Entity and Relation Extraction Results

[Prec.|Rec.|F1] / [Mono.|Multi.] gives the mean [precision|recall|micro F1-score] for the [monolingual|multilingual] model over five runs for end-to-end entity and relation extraction, for each relation label, from the best predicted entity annotations, using the context window that gives the best overall results ($W$ = 0 for both the monolingual and the multilingual model).

| Label | Prec. / Mono. | Prec. / Multi. | Rec. / Mono. | Rec. / Multi. | F1 / Mono. | F1 / Multi. |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| all relation labels | 70.2 | 67.3 | 58.8 | 59.8 | 63.9 | 63.2 |
| is an element of | 70.2 | 68.2 | 60.2 | 60.2 | 64.8 | 63.9 |
| is marked by | 61.3 | 55.1 | 51.8 | 49.4 | 56.1 | 50.8 |
| is off the coast of | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
| is N of | 47.7 | 41.8 | 52.5 | 50.0 | 49.7 | 45.2 |
| is NNE of | 76.7 | 39.3 | 73.3 | 53.3 | 74.1 | 44.8 |
| is NE of | 95.0 | 100.0 | 64.0 | 60.0 | 76.1 | 75.0 |
| is ENE of | 96.0 | 93.3 | 63.3 | 83.3 | 75.9 | 87.9 |
| is E of | 67.9 | 57.1 | 45.5 | 41.8 | 54.4 | 48.1 |
| is ESE of | 70.9 | 71.1 | 55.4 | 66.2 | 61.6 | 67.8 |
| is SE of | 90.0 | 60.0 | 100.0 | 100.0 | 93.3 | 73.3 |
| is SSE of | 71.7 | 81.3 | 73.3 | 66.7 | 67.1 | 69.3 |
| is S of | 84.7 | 82.1 | 71.7 | 81.7 | 77.3 | 81.1 |
| is SSW of | 87.4 | 74.1 | 68.6 | 85.7 | 76.0 | 79.3 |
| is SW of | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| is WSW of | 92.0 | 83.6 | 66.7 | 70.0 | 77.1 | 75.3 |
| is W of | 75.8 | 78.9 | 65.5 | 60.0 | 69.5 | 67.3 |
| is WNW of | 100.0 | 89.3 | 88.0 | 88.0 | 93.3 | 87.9 |
| is NW of | 80.0 | 87.4 | 35.0 | 50.0 | 47.0 | 63.0 |
| is NNW of | 63.5 | 64.1 | 44.0 | 40.0 | 51.8 | 49.0 |
