This repository contains the ATLANTIS (coAsTaL mAritime NavigaTion InstructionS) dataset files and our benchmark results on this dataset.
This work is co-financed by the Shom and the IGN and is being carried out at the LASTIG, a research unit at Université Gustave Eiffel.
Under no circumstances should any part of this work be used for real-life navigation.
ATLANTIS is a French-language dataset for nested spatial entity and binary spatial relation extraction from text. The dataset is composed of extracts from 15 different volumes of the Instructions nautiques. The Instructions nautiques are a series of books produced and published by the French Naval Hydrographic and Oceanographic Service, the Shom. Each volume contains essential information for navigating safely in the coastal waters of a specific geographic area, including instructions for entering ports and descriptions of the coastal maritime environment. We extracted the text from the PDF documents using pdfminer.six. The volumes, which are written in French, cover coastal areas in Africa, Europe, North and South America, as well as in the Indian and Pacific Oceans.
The TextMine'24 Dataset is a subset of the ATLANTIS Dataset and contains nested spatial entity annotations. It was used as the benchmark dataset for the Défi TextMine 2024, a spatial entity recognition challenge hosted by the TextMine working group which is part of EGC, the International Francophone Association for Knowledge Extraction and Management.
We annotated our dataset by hand using the brat rapid annotation tool, which allows creating nested labelled annotations and creating directed labelled links between them. The source text was split on whitespace by brat, giving a dataset of 101,400 tokens.
We use three labels to annotate unnamed and named nested spatial entities:
- geographic feature refers to common nouns that represent types of spatial entities
- name refers to pure proper nouns
- geographic name refers to the full name associated with a geographic feature
Any token can be annotated with zero or one geographic feature or name label. A token cannot be annotated with both a geographic feature and a name label. A token cannot be annotated only with a name label. A token annotated with a name label must also be annotated one or more times with a geographic name label. Any token, already labelled or not, can be annotated zero or more times with a geographic name label.
We use 19 labels to annotate binary spatial relations:
- is off the coast of refers to an entity that is off the coast of another (it is always referred to using the same words in French in our dataset: "au large de")
- is marked by refers to an entity that is marked by another (usually a navigation mark such as a buoy or a beacon)
- is an element of refers to an entity that is situated on or in another such that a bird's eye view shows the spatial footprint of one as being within or partly within the other
- is [direction] of refers to an entity that is in any one of 16 cardinal directions with respect to another: the 16 relations include four that use the cardinal directions (N, E, S, W), four that use the intercardinal directions (NE, SE, SW, NW) and eight that use the secondary intercardinal directions (NNE, ENE, ESE, SSE, SSW, WSW, NWN, NNW)
All relation annotations must link two entity annotations with the exception of name entity labels, which cannot be part of relation annotations. All relation annotations must have a direction.
The atlantis_dataset_brat directory contains the .txt source files and the corresponding .ann annotation files for the entire dataset.
The atlantis_dataset.jsonl file contains the entire annotated dataset in a JSONL format. The specific JSONL format is that of the SCIERC dataset. We used this script, modified to use the fr_core_news_sm spaCy model for tokenisation, to convert files from the brat standoff format to JSONL.
The our_split_jsonl directory contains the train, development and test JSONL files that we used for our experiments. We split the atlantis_dataset.jsonl file into three parts: train, development and test according to a 80:10:10 ratio in terms of overall number of tokens and numbers of entity labels. We also ensured that text covering each geographic area was present in all three parts. Our dataset of 101,400 tokens contains 16,777 entity labels (which can span one or more tokens) and 3,051 relation labels (which connect exactly two entity labels, with a direction). In total, 18,030 tokens are annotated with at least one entity label, which corresponds to almost one in five tokens.
Train nb. | Dev nb. | Test nb. | Total nb. | |
---|---|---|---|---|
Tokens | 83,851 | 8,156 | 9,393 | 101,400 |
Unlabelled tokens | 69,200 | 6,507 | 7,663 | 83,370 |
Entity-labelled tokens | 14,651 | 1,649 | 1,730 | 18,030 |
Entity labels | 13,582 | 1,476 | 1,719 | 16,777 |
Relation labels | 2,507 | 222 | 322 | 3,051 |
Label | Train nb. | Dev nb. | Test nb. | Total nb. |
---|---|---|---|---|
geographic feature | 6,602 | 692 | 801 | 8,095 |
name | 3,486 | 391 | 462 | 4,339 |
geographic name | 3,494 | 393 | 456 | 4,343 |
Label | Train nb. | Dev nb. | Test nb. | Total nb. |
---|---|---|---|---|
is an element of | 1,300 | 109 | 190 | 1,599 |
is marked by | 143 | 13 | 17 | 173 |
is off the coast of | 21 | 1 | 1 | 23 |
is N of | 84 | 9 | 8 | 101 |
is NNE of | 46 | 1 | 3 | 50 |
is NE of | 47 | 1 | 5 | 53 |
is ENE of | 72 | 11 | 6 | 89 |
is E of | 92 | 6 | 11 | 109 |
is ESE of | 73 | 8 | 13 | 94 |
is SE of | 42 | 4 | 1 | 47 |
is SSE of | 51 | 10 | 3 | 64 |
is S of | 84 | 8 | 12 | 104 |
is SSW of | 86 | 6 | 7 | 99 |
is SW of | 45 | 2 | 5 | 52 |
is WSW of | 75 | 10 | 6 | 91 |
is W of | 75 | 10 | 11 | 96 |
is WNW of | 76 | 4 | 5 | 85 |
is NW of | 32 | 2 | 8 | 42 |
is NNW of | 63 | 7 | 10 | 80 |
We provide benchmark results for this dataset for three tasks: nested spatial entity extraction, binary spatial relation extraction, and end-to-end nested spatial entity and binary spatial relation extraction.
Our benchmark results were obtained using a supervised deep learning approach for automatic nested spatial entity and binary spatial relation extraction from text that we presented at the 7th ACM SIGSPATIAL International Workshop on Geospatial Humanities, held in conjunction with ACM SIGSPATIAL 2023. Our approach involves applying PURE (the Princeton University Relation Extraction system), made for flat, generic entity extraction and generic binary relation extraction, to the extraction of nested, spatial entities and spatial binary relations. The advantage of extracting nested spatial entities and the spatial relations between them is that it captures more information that can aid entity disambiguation.
This work was granted access to the HPC resources of IDRIS under the allocation 2022-AD011013699 made by GENCI.
Hyperparameter | Entity Model | Relation Model |
---|---|---|
Learning rate | 1e-5 | 2e-5 |
Task learning rate | 5e-4 | - |
Train batch size | 16 | 32 |
Training epochs | 100 | 10 |
Mean micro F1-score with standard deviation over five runs for varying context window (
Context Window | Entity / Mono. | Entity / Multi. | Relation / Mono. | Relation / Multi. | e2e / Mono. | e2e / Multi. |
---|---|---|---|---|---|---|
|
91.1 ± 0.3 | 91.9 ± 0.2 | 64.2* ± 2.2 | 63.2 ± 1.0 | 63.9* ± 2.2 | 63.2 ± 1.2 |
|
92.1 ± 0.2 | 92.3 ± 0.3 | 64.2 ± 1.4 | 63.0 ± 1.7 | 63.8 ± 1.4 | 63.1 ± 1.7 |
|
91.9 ± 0.2 | 92.3 ± 0.2 | 63.7 ± 0.7 | 62.9 ± 0.7 | 63.6 ± 0.7 | 62.9 ± 0.8 |
|
91.9 ± 0.2 | 92.2 ± 0.2 | - | - | - | - |
|
92.0 ± 0.2 | 92.3* ± 0.2 | - | - | - | - |
|
92.2 ± 0.2 | 92.3 ± 0.2 | - | - | - | - |
[Prec.|Rec.|F1] / [Mono.|Multi] gives the mean [precision|recall|micro F1-score] for the [monolingual|multilingual] model over five runs for entity extraction for each entity label using the context window that gives the best overall results (
Label | Prec. / Mono. | Prec. / Multi. | Rec. / Mono. | Rec. / Multi. | F1 / Mono. | F1 / Multi. |
---|---|---|---|---|---|---|
all entity labels | 94.6 | 95.2 | 89.8 | 89.6 | 92.2 | 92.3 |
geographic feature | 94.1 | 94.4 | 95.8 | 95.1 | 95.0 | 94.8 |
name | 97.7 | 97.4 | 78.0 | 78.4 | 86.7 | 86.9 |
geographic name | 92.9 | 94.9 | 91.3 | 91.4 | 92.1 | 93.1 |
[Prec.|Rec.|F1] / [Mono.|Multi] gives the mean [precision|recall|micro F1-score] for the [monolingual|multilingual] model over five runs for relation extraction for each relation label from gold entity annotations using the context window that gives the best overall results (
Label | Prec. / Mono. | Prec. / Multi. | Rec. / Mono. | Rec. / Multi. | F1 / Mono. | F1 / Multi. |
---|---|---|---|---|---|---|
all relation labels | 70.8 | 67.2 | 58.8 | 59.9 | 64.2 | 63.2 |
is an element of | 70.5 | 67.9 | 60.2 | 60.3 | 64.9 | 63.8 |
is marked by | 64.9 | 55.1 | 51.8 | 49.4 | 57.5 | 50.8 |
is off the coast of | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
is N of | 48.6 | 39.8 | 52.5 | 50.0 | 49.9 | 44.2 |
is NNE of | 76.7 | 37.3 | 73.3 | 53.3 | 74.1 | 43.3 |
is NE of | 95.0 | 100.0 | 64.0 | 60.0 | 76.1 | 75.0 |
is ENE of | 83.0 | 93.3 | 63.3 | 83.3 | 71.6 | 87.9 |
is E of | 62.5 | 57.1 | 45.5 | 41.8 | 52.6 | 48.1 |
is ESE of | 73.4 | 71.1 | 55.4 | 66.2 | 62.6 | 67.8 |
is SE of | 90.0 | 60.0 | 100.0 | 100.0 | 93.3 | 73.3 |
is SSE of | 73.7 | 81.3 | 73.3 | 66.7 | 68.8 | 69.3 |
is S of | 84.7 | 82.1 | 71.7 | 81.7 | 77.3 | 81.1 |
is SSW of | 87.4 | 74.1 | 68.6 | 85.7 | 76.0 | 79.3 |
is SW of | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
is WSW of | 92.0 | 83.6 | 66.7 | 70.0 | 77.1 | 75.3 |
is W of | 79.3 | 80.4 | 65.5 | 60.0 | 71.3 | 68.0 |
is WNW of | 100.0 | 89.3 | 88.0 | 88.0 | 93.3 | 87.9 |
is NW of | 67.3 | 87.4 | 35.0 | 50.0 | 45.7 | 63.0 |
is NNW of | 61.7 | 64.1 | 44.0 | 40.0 | 51.1 | 49.0 |
[Prec.|Rec.|F1] / [Mono.|Multi] gives the mean [precision|recall|micro F1-score] for the [monolingual|multilingual] model over five runs for end-to-end entity and relation extraction for each relation label from the best predicted entity annotations using the context window that gives the best overall results (
Label | Prec. / Mono. | Prec. / Multi. | Rec. / Mono. | Rec. / Multi. | F1 / Mono. | F1 / Multi. |
---|---|---|---|---|---|---|
all relation labels | 70.2 | 67.3 | 58.8 | 59.8 | 63.9 | 63.2 |
is an element of | 70.2 | 68.2 | 60.2 | 60.2 | 64.8 | 63.9 |
is marked by | 61.3 | 55.1 | 51.8 | 49.4 | 56.1 | 50.8 |
is off the coast of | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
is N of | 47.7 | 41.8 | 52.5 | 50.0 | 49.7 | 45.2 |
is NNE of | 76.7 | 39.3 | 73.3 | 53.3 | 74.1 | 44.8 |
is NE of | 95.0 | 100.0 | 64.0 | 60.0 | 76.1 | 75.0 |
is ENE of | 96.0 | 93.3 | 63.3 | 83.3 | 75.9 | 87.9 |
is E of | 67.9 | 57.1 | 45.5 | 41.8 | 54.4 | 48.1 |
is ESE of | 70.9 | 71.1 | 55.4 | 66.2 | 61.6 | 67.8 |
is SE of | 90.0 | 60.0 | 100.0 | 100.0 | 93.3 | 73.3 |
is SSE of | 71.7 | 81.3 | 73.3 | 66.7 | 67.1 | 69.3 |
is S of | 84.7 | 82.1 | 71.7 | 81.7 | 77.3 | 81.1 |
is SSW of | 87.4 | 74.1 | 68.6 | 85.7 | 76.0 | 79.3 |
is SW of | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
is WSW of | 92.0 | 83.6 | 66.7 | 70.0 | 77.1 | 75.3 |
is W of | 75.8 | 78.9 | 65.5 | 60.0 | 69.5 | 67.3 |
is WNW of | 100.0 | 89.3 | 88.0 | 88.0 | 93.3 | 87.9 |
is NW of | 80.0 | 87.4 | 35.0 | 50.0 | 47.0 | 63.0 |
is NNW of | 63.5 | 64.1 | 44.0 | 40.0 | 51.8 | 49.0 |