JADOS

JADOS is a Japanese document-level text simplification dataset for the news and encyclopedia domains, as described in "A Document-Level Text Simplification Dataset for Japanese."

Data structure

Each domain dataset is provided in JSON format.

  • Mainichi corpus in the news domain: data/mainichi_corpus/mainichi_vX.X.X.json
  • Wikipedia corpus in the encyclopedia domain: data/wikipedia_corpus/wikipedia_vX.X.X.json
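
As a minimal sketch of reading one of these files, the helper below parses a JSON file with the standard library. The function name `load_corpus` is hypothetical, and whether the top level of each file is a list or a dict of entries is not stated here, so inspect the returned object before iterating.

```python
import json
from pathlib import Path


def load_corpus(path):
    """Parse one JADOS JSON file and return the decoded object.

    Hypothetical helper: the top-level container type (list or dict)
    is an assumption to verify against the actual file.
    """
    # The texts are Japanese, so read the file explicitly as UTF-8.
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)
```

Pass the path of the release you downloaded; the version suffix in the file name depends on that release.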

Mainichi corpus

Note: To obtain the Mainichi corpus source and target texts, please purchase the 毎日新聞記事データ集 and 毎日小学生新聞記事データ集 corpora from 2013 to 2020 [link].

Each entry consists of the following objects.

┌──── year
├──── source
│         ├──── id
│         └──── text
├──── target
│         ├──── id
│         └──── text
└──── annotations
          ├──── alignment_ids
          └──── simplification_labels
Key | Type | Description
year | str | Publication year of the Mainichi Japanese Daily Newspaper and Mainichi Elementary School Newspaper articles.
source | dict | Mainichi Japanese Daily Newspaper article information extracted from the data collection.
target | dict | Mainichi Elementary School Newspaper article information extracted from the data collection.
id | str | Article index number of the source/target article, extracted from the text annotated with the \C0\ tag in the data collection.
text | list[str|int] | List of starting positions (character counts) for each sentence of the full source/target article. For a source article, the first element stores "title"; replace "title" with the source article heading (the text to which the \T1\ tag is assigned). The full article was prepared as follows: (1) the full text was derived by concatenating all the text annotated with the \T2\ tag in the data collection; (2) the extracted full text was preprocessed using script/preprocess.py.
annotations | dict | Data manually annotated by a worker.
alignment_ids | list[str|int] | List of alignment IDs indicating the corresponding source text sentences annotated to each sentence in the target text. If there is nothing to be aligned, '' is stored. The value minus 1 corresponds to the index of the list.
simplification_labels | list[str] | List of simplification operations assigned to each sentence in the source document.
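
Because the Mainichi `text` field stores sentence start positions rather than the sentences themselves, the sentences have to be cut out of the purchased article text. The sketch below shows one way to do that. It assumes the positions are 0-based character offsets in ascending order into the preprocessed full text, which this README does not confirm; the function names and arguments are hypothetical.

```python
def split_by_starts(full_text, starts):
    """Cut full_text into sentences at the given start offsets.

    Assumption: offsets are 0-based character positions, ascending.
    """
    # Each sentence runs from its start to the next start (or the end).
    bounds = list(starts) + [len(full_text)]
    return [full_text[b:e] for b, e in zip(bounds, bounds[1:])]


def source_sentences(heading, full_text, text_field):
    """Rebuild the source sentence list for one entry.

    text_field[0] is the "title" placeholder, which is replaced by the
    article heading; the remaining elements are treated as offsets.
    """
    return [heading] + split_by_starts(full_text, text_field[1:])
```

If the offsets turn out to be 1-based or to count something other than characters of the preprocessed text, the slicing bounds need adjusting accordingly.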

Wikipedia corpus

Each entry consists of the following objects.

┌──── title
├──── class
├──── source_text      
└──── annotations
          ├──── target_text
          ├──── simplification_labels
          ├──── alignment_ids
          ├──── summarization_ids
          │
          ├──── target_text
          ├──── simplification_labels
          ├──── alignment_ids
          └──── summarization_ids
Key | Type | Description
title | str | Wikipedia article title.
class | str | Category of Wikipedia articles ("featured" or "good").
source_text | list[str] | List of a Wikipedia article split into sentences. The title is stored in the first element.
annotations | list | Data manually annotated and created by two workers.
target_text | list[str] | List of manually created simplified articles split into sentences.
simplification_labels | list[str] | List of simplification operations assigned to each sentence in the source document.
alignment_ids | list[str|int] | List of alignment IDs indicating the corresponding target_text sentences annotated to each sentence in source_text. If there is nothing to be aligned, '' is stored. The value minus 1 corresponds to the index of the list.
summarization_ids | list[list[int]] | List indicating which sentences from the source_text were retained during the extractive summarization process to create the target_text. It contains as many elements as the number of extractive summaries performed. If no summarization was conducted, it is an empty list.
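
The "value minus 1" rule for `alignment_ids` can be sketched as follows for a single Wikipedia annotation. This is an illustration under assumptions: it supposes `alignment_ids` has one element per element of `source_text`, and that each non-empty value is a single integer (or a string holding one integer). The function name `aligned_pairs` is hypothetical.

```python
def aligned_pairs(source_text, annotation):
    """Pair each source sentence with its aligned target sentence.

    Returns a list of (source_sentence, target_sentence_or_None).
    Assumption: each alignment ID is '' or one 1-based integer.
    """
    target_text = annotation["target_text"]
    pairs = []
    for src, aid in zip(source_text, annotation["alignment_ids"]):
        if aid == "":
            # '' marks a source sentence with nothing to align.
            pairs.append((src, None))
        else:
            # IDs are 1-based, so subtract 1 to get the list index.
            pairs.append((src, target_text[int(aid) - 1]))
    return pairs
```

Since `annotations` holds the work of two annotators, the function would be called once per element of that list, each with the same `source_text`.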

Note: Due to revisions made upon publication, the statistics in this dataset may differ from those presented in the paper.

Release Notes

Mainichi corpus

Version | Date | Updates
0.0.0 | May 2, 2024 | -

Wikipedia corpus

Version | Date | Updates
0.0.0 | May 2, 2024 | -
0.0.1 | June 17, 2024 | Corrected the source_text of the "0.999..." article. Sorted the summarization_ids.
0.1.0 | June 18, 2024 | Added the second paragraph of the Wikipedia article to each source_text under 150 characters and then simplified it.
0.0.2, 0.1.1 | July 6, 2024 | Corrected the title, source_text and target_text of some articles.
0.0.3, 0.1.2 | July 16, 2024 | Corrected the target_text and alignment_ids of some articles.

Citation

If you use the dataset, please cite:

Yoshinari Nagai, Teruaki Oka, Mamoru Komachi. A Document-Level Text Simplification Dataset for Japanese. The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), May 2024.
