Noun Compound Senses Dataset

This repository contains the Noun Compound Senses (NCS) dataset, used to assess the representation of idiomaticity in vector space models.

The NCS dataset has data for 280 and 180 noun compounds (NCs) in English and Portuguese, respectively, with different degrees of idiomaticity. For each compound, it contains 3 naturalistic corpus sentences and a neutral context (e.g., This is a/an NC.).

Due to copyright restrictions we do not release all the original (naturalistic) sentences. Instead, we include a script to obtain them from the ukWaC (Baroni et al., 2009) and brWaC (Wagner Filho et al., 2018) corpora (see below).

For all sentences in naturalistic and neutral contexts the dataset includes three variants (P1, P2, and P3) with the following characteristics:

P1: The original NC is replaced by a synonym (e.g., brain instead of gray matter).
P2: The original NC is replaced by its syntactic head and by its dependent, in two different sentences (e.g., gray and matter).
P3: Each component of the original NC is replaced by a synonym (e.g., alligator sobs instead of crocodile tears).
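The relationship between a sentence and its variants can be sketched in a few lines of Python. This is only an illustration of the P1, P2, and P3 definitions above, not the procedure used to build the dataset: the function name make_variants, the example sentence, and the P3 component synonyms are assumptions made for this sketch, and it handles only two-word compounds that occur exactly once in the sentence.

```python
from typing import Dict, List, Tuple


def make_variants(
    sentence: str,
    nc: str,
    synonym: str,
    component_synonyms: Tuple[str, str],
) -> Dict[str, List[str]]:
    """Return the original sentence plus its P1, P2, and P3 variants."""
    if sentence.count(nc) != 1:
        raise ValueError(f"expected exactly one occurrence of {nc!r}")
    components = nc.split()
    if len(components) != 2:
        raise ValueError("this sketch only handles two-word compounds")
    first, second = components
    return {
        "original": [sentence],
        # P1: the whole compound is replaced by a synonym.
        "P1": [sentence.replace(nc, synonym)],
        # P2: two sentences, one per component of the compound.
        "P2": [sentence.replace(nc, first), sentence.replace(nc, second)],
        # P3: each component is replaced by its own synonym.
        "P3": [sentence.replace(nc, " ".join(component_synonyms))],
    }


variants = make_variants(
    sentence="This is gray matter.",  # illustrative sentence, not a dataset item
    nc="gray matter",
    synonym="brain",
    component_synonyms=("grey", "substance"),  # invented placeholder values
)
for name, sentences in variants.items():
    print(name, sentences)
```

Under these definitions each context yields five sentences: the original, one for P1, two for P2, and one for P3.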

The NCS dataset contains a total of 5,620 test items for English, and 3,600 for Portuguese, and it is based on the NC Compositionality dataset (Cordeiro et al., 2019; Reddy et al., 2011).

Obtaining the sentences

Requirements

Python 3
Pandas
ukWaC corpus in XML format (tagged). The 25 files (UKWAC-1.xml to UKWAC-25.xml) should be concatenated into a single one (e.g., cat UKWAC*xml > UKWAC_full.xml).
brWaC corpus in .conll format (single file brwac.conll)
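Where cat is not available, the ukWaC concatenation step can be approximated in Python. The sketch below is an assumption-laden alternative, not part of the repository: the function names are hypothetical, and it joins the parts in numeric order (UKWAC-1, UKWAC-2, ..., UKWAC-25), whereas the shell glob in the cat example expands in lexicographic order. Whether the order matters to get_sentences.py is not stated here.

```python
import re
import shutil
from pathlib import Path
from typing import List


def part_number(path: Path) -> int:
    """Extract the numeric suffix from a name such as UKWAC-12.xml."""
    match = re.search(r"-(\d+)\.xml$", path.name)
    return int(match.group(1)) if match else -1


def concatenate_ukwac(src_dir: str, out_path: str) -> List[Path]:
    """Concatenate all UKWAC-<n>.xml files in src_dir into out_path."""
    parts = sorted(Path(src_dir).glob("UKWAC-*.xml"), key=part_number)
    if not parts:
        raise FileNotFoundError(f"no UKWAC-*.xml files found in {src_dir}")
    with open(out_path, "wb") as destination:
        for part in parts:
            with part.open("rb") as source:
                # Stream in chunks so the large corpus files never sit in memory.
                shutil.copyfileobj(source, destination)
    return parts


# Example call (paths are placeholders):
# concatenate_ukwac("path/to/ukwac", "UKWAC_full.xml")
```

The glob pattern uses a hyphen, so an existing UKWAC_full.xml in the same directory is not picked up as an input.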

Building the corpus

Use the script get_sentences.py to obtain the sentences from the WaC corpora:

python3 get_sentences.py --lang en --corpus UKWAC_full.xml

python3 get_sentences.py --lang pt --corpus brwac.conll

Each run should create an original_sents.csv file inside dataset/lang/naturalistic/, where lang is the language code passed to --lang (en or pt).
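After running the script, a quick sanity check can confirm that the output files exist and are not empty. The sketch below assumes that lang in the output path stands for the language code (en or pt), that the files are comma-separated, and that they are UTF-8 encoded; none of this is confirmed here, and the function name count_sentence_rows is hypothetical. The count includes a header row if the file has one.

```python
import csv
from pathlib import Path
from typing import Dict, Iterable


def count_sentence_rows(
    dataset_root: str = "dataset",
    langs: Iterable[str] = ("en", "pt"),
) -> Dict[str, int]:
    """Count the CSV rows in each language's original_sents.csv file."""
    counts = {}
    for lang in langs:
        # Assumed layout: dataset/<lang>/naturalistic/original_sents.csv
        path = Path(dataset_root) / lang / "naturalistic" / "original_sents.csv"
        if not path.is_file():
            raise FileNotFoundError(f"missing output file: {path}")
        with path.open(newline="", encoding="utf-8") as handle:
            counts[lang] = sum(1 for _ in csv.reader(handle))
    return counts


# Example call, run from the repository root:
# print(count_sentence_rows())
```

The same files can also be loaded with pandas.read_csv for closer inspection, since Pandas is already a requirement.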

Citation

If you use the Noun Compound Senses dataset, please cite the following paper:

Garcia, Marcos, Tiago Kramer Vieira, Carolina Scarton, Marco Idiart and Aline Villavicencio. 2021. Probing for idiomaticity in vector space models. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2021). Association for Computational Linguistics.

References

Baroni, Marco, Silvia Bernardini, Adriano Ferraresi and Eros Zanchetta. 2009. The WaCky wide web: a collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation, 43(3):209–226.

Cordeiro, Silvio, Aline Villavicencio, Marco Idiart and Carlos Ramisch. 2019. Unsupervised compositionality prediction of nominal compounds. Computational Linguistics, 45(1):1–57.

Reddy, Siva, Diana McCarthy and Suresh Manandhar. 2011. An empirical study on compositionality in compound nouns. In Fifth International Joint Conference on Natural Language Processing, IJCNLP 2011, Chiang Mai, Thailand, November 8-13, 2011, pages 210–218. The Association for Computer Linguistics.

Wagner Filho, Jorge Alberto, Rodrigo Wilkens, Marco Idiart and Aline Villavicencio. 2018. The brWaC Corpus: A New Open Resource for Brazilian Portuguese. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). ELRA.
