
The PreTENS shared task hosted at SemEval 2022 aims at focusing on semantic competence with specific attention on the evaluation of language models with respect to the recognition of appropriate taxonomic relations between two nominal arguments (i.e. cases where one is a supercategory of the other, or in extensional terms, one denotes a superset of the other).

Home Page: https://sites.google.com/view/semeval2022-pretens

License: MIT License


SemEval 2022 Task 3

Presupposed Taxonomies: Evaluating Neural Network Semantics (PreTENS)

⚠ New Notice: The evaluation phase has now ended; please check the data folder for the test data (with labels and sentence constructions). Updated dates are available on the task website. [Old News] CodaLab links are given below.

Paper Submission Rules and Deadlines

Will be updated soon

Submission Rules

  • Maximum submissions: 3 result submissions per subtask
  • Ranking: Two rankings per subtask - a per-language ranking and a global ranking
  • Results displayed/used in the leaderboard: All the measures given in the baseline script (Precision, Recall, F1, and macro-F1 for subtask 1; rho for subtask 2) will be shown, but the final ranking will be based on macro-F1 and rho.
  • Naming convention for the submission file: The result/submission file must be tab-separated (with headers: ID \t Labels/Score), named answer.tsv, and then compressed to a zip file following the naming convention <teamName_subtaskX_submissionNo.zip>, where X={1,2} and No={1,2,3}
  • Results selected for the leaderboard: Each team has 3 chances per subtask and can choose which of its results to show on the leaderboard. However, each team must submit at least one result to the board (the selected entry can be changed at any time during the competition). This is mainly so that participants attempting only selected languages are not penalized by the global-ranking score mechanism.
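The submission packaging above can be sketched with the standard library; the team name and the predictions below are placeholders, not real results:

```python
import csv
import zipfile

# Hypothetical (ID, predicted label) pairs standing in for real predictions.
predictions = [("en_0", 1), ("en_1", 0), ("en_2", 1)]

# Write the tab-separated submission file with the required header.
with open("answer.tsv", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["ID", "Labels"])
    writer.writerows(predictions)

# Compress it following the <teamName_subtaskX_submissionNo.zip> convention,
# here for a hypothetical team "myTeam", subtask 1, submission 1.
with zipfile.ZipFile("myTeam_subtask1_1.zip", "w") as zf:
    zf.write("answer.tsv")
```

For subtask 2 the same shape applies with a `Score` column of floats instead of binary labels.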

Tasks

PreTENS includes the two following sub-tasks:

  • a binary classification task: predicting the acceptability of sentences (A (1) vs. UA (0))
  • a regression task: predicting the degree of acceptability on a seven-point Likert scale

Data

The data comprise sentences in three languages: English, Italian, and French.

For each sub-task and each language:

  • The dataset will be split into a training set and a test set
  • Additionally, trial data (a small subset of the training set) is released to give participants a clear idea of the data and the expected formats.

For the binary-classification sub-task, the training and test sets will consist of ~5,000 and ~23,000 samples, respectively. For the regression sub-task, ~500 sentences will be provided for the training set and a larger set for the test set.

Sample/trial data for the evaluation campaign: data/trail

Data Format:

ID Sentence LABELS/SCORE

where LABEL is used for the binary classification task and SCORE for the regression task. SCORE represents the average of the scores (1-7) assigned by the annotators. Details of the scales and agreement will be elaborated/updated later. The LABEL (1/0) is assigned based on the regression score.
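A minimal sketch of parsing this tab-separated format; the second row below is an illustrative made-up sentence, not real task data:

```python
import csv
import io

# Illustrative stand-in for a training file in the ID / Sentence / Label format.
raw = (
    "ID\tSentence\tLabels\n"
    "en_0\tI would rather have Chianti than water .\t1\n"
    "en_1\tI would rather have beverages than water .\t0\n"
)

# DictReader picks up the header row, so columns can be accessed by name.
reader = csv.DictReader(io.StringIO(raw), delimiter="\t")
rows = list(reader)
labels = [int(r["Labels"]) for r in rows]
print(labels)  # -> [1, 0]
```

For the regression files, the last column would be read with `float(...)` instead.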

TEST DATA with scores/labels are now available.

The folder <data/test/official_test_set_with_labels> now includes a test file for each subtask in all three languages, with the labels/scores as well as the construction each sentence belongs to.

File Format: Subtask1

e.g.

ID Construction Sentence Labels

en_0 drather I would rather have Chianti than water . 1

Here constructions are: 'andtoo', 'butnot', 'comparatives', 'drather', 'except', 'generally', 'particular', 'prefer', 'type', 'unlike'

File Format: Subtask2

e.g.

ID Construction Sentence Scores

en_0 comparatives I like governors more than farmers. 5.83

Here constructions are:

'andtoo', 'butnot', 'comparatives', 'ingeneral', 'particular', 'type', 'unlike'

Evaluation Measures

The official evaluation metrics for the classification task are Precision, Recall, F1, and macro-F1 (see the subtask 1 starter code for more details).

For the regression task, we opt for MSE, RMSE, and Spearman correlation (rho) (see the subtask 2 starter code for more details).
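The two ranking metrics can be sketched in plain Python (the starter notebooks use scikit-learn/scipy equivalents); the inputs below are toy values, not task results:

```python
def f1_macro(y_true, y_pred, classes=(0, 1)):
    """Macro-averaged F1: mean of the per-class F1 scores."""
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(scores) / len(scores)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed values
    (this sketch does not handle ties)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


print(f1_macro([1, 0, 1, 1], [1, 0, 0, 1]))          # -> 0.7333...
print(spearman_rho([5.8, 2.1, 4.4], [5.5, 1.9, 4.0]))  # -> 1.0
```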

⚠ NOTICE: A separate baseline is defined for each sub-task: i) for the binary classification sub-task, a Linear Support Vector classifier using n-grams (up to trigrams) as input features; ii) for the regression sub-task, a Linear Support Vector regressor with the same n-gram features. Participants can run the evaluation system and obtain results using different cross-validation configurations on the training set. Due to the presence in the official test set of additional constructions with the same presuppositional constraints, we have found that applying the baseline methods to the official test set yields results that are 10% to 20% lower than on the training set. This highlights the importance of achieving a great deal of syntactic generality on this task. For this reason, we encourage participants to test different cross-validation configurations on the training set.
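The described classification baseline can be sketched with scikit-learn as follows; the sentences and labels below are illustrative toy data, not the released training set:

```python
# Sketch of the subtask 1 baseline: a linear SVM over word n-grams
# (unigrams to trigrams). Toy data only; see the official notebooks
# for the actual processing.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train_sentences = [
    "I would rather have Chianti than water .",
    "I would rather have beverages than water .",
    "I like dogs , and animals too .",
    "I like animals , and dogs too .",
]
train_labels = [1, 0, 1, 0]  # illustrative acceptability labels

baseline = make_pipeline(
    CountVectorizer(ngram_range=(1, 3)),  # n-grams up to trigrams
    LinearSVC(),
)
baseline.fit(train_sentences, train_labels)
preds = baseline.predict(train_sentences)
print(list(preds))
```

The regression baseline swaps `LinearSVC` for `LinearSVR` over the same features, and different cross-validation configurations can be tried with `sklearn.model_selection.cross_val_score` on the training set.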

To get participants started with the task, we provide baseline scripts showing how the data is processed, split, and finally evaluated for each sub-task.

Below are the baseline and starter code:

Subtask1: https://colab.research.google.com/drive/1wDFQnEfMkoJY99Bmv-CfsTsdwleCDg2f?usp=sharing

Subtask2: https://colab.research.google.com/drive/18KwrdyTsp3wOPcaB7pyFnqOSc3Te7p-X?usp=sharing

You can also find the corresponding notebooks in this git repository (SemEval_Task3_Baseline_subtask1.ipynb and SemEval_Task3_Baseline_subtask2.ipynb).

License

MIT

Useful links

Task Website

Participants Registration Form

Evaluation Platforms:

[Subtask1](https://codalab.lisn.upsaclay.fr/competitions/1292)

[Subtask2](https://codalab.lisn.upsaclay.fr/competitions/1290)

Mailing list: [email protected]

Organizers

Shammur Absar Chowdhury - Qatar Computing Research Institute, HBKU, Qatar

Dominique Brunato - Institute for Computational Linguistics "A. Zampolli" (CNR), Pisa, Italy

Cristiano Chesi - University School for Advanced Studies (IUSS), Pavia, Italy

Felice Dell'Orletta - Institute for Computational Linguistics "A. Zampolli" (CNR), Pisa, Italy

Simonetta Montemagni - Institute for Computational Linguistics "A. Zampolli" (CNR), Pisa, Italy

Giulia Venturi - Institute for Computational Linguistics "A. Zampolli" (CNR), Pisa, Italy

Roberto Zamparelli - Department of Psychology and Cognitive Science - University of Trento, Italy

For any queries: Contact: [email protected]

