
Multi-XScience

Dataset for the EMNLP 2020 paper, Multi-XScience: A Large-scale Dataset for Extreme Multi-document Summarization of Scientific Articles.

Authors: Yao Lu, Yue Dong, Laurent Charlin

Appendix: model implementation and evaluation details.

Dataset Statistics

| train / val / test examples | avg. document length | avg. summary length | avg. # references |
|---|---|---|---|
| 30,369 / 5,066 / 5,093 | 778.08 | 116.44 | 4.42 |

We also calculate the percentage of novel n-grams in the target summaries of previous datasets; three of them are single-document summarization datasets. Our dataset has the highest abstractiveness among all existing multi-document summarization datasets.

| Dataset | % of novel unigrams | % of novel bigrams | % of novel trigrams | % of novel 4-grams |
|---|---|---|---|---|
| CNN-DailyMail (single) | 17.00 | 53.91 | 71.98 | 80.29 |
| NY Times (single) | 22.64 | 55.59 | 71.93 | 80.16 |
| XSum (single) | 35.76 | 83.45 | 95.50 | 98.49 |
| WikiSum | 18.20 | 51.88 | 69.82 | 78.16 |
| Multi-News | 17.76 | 57.10 | 75.71 | 82.30 |
| Multi-XScience | 42.33 | 81.75 | 94.57 | 97.62 |
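
Here a novel n-gram is one that occurs in the target summary but nowhere in the source documents. A minimal sketch of how such a percentage can be computed (lowercasing and whitespace tokenization are our assumptions, not necessarily the paper's exact preprocessing):

```python
def ngram_set(tokens, n):
    """All n-grams occurring in a token sequence, as a set."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def novel_ngram_pct(source_text, summary_text, n):
    """Percentage of summary n-grams that never appear in the source."""
    source_ngrams = ngram_set(source_text.lower().split(), n)
    summary_tokens = summary_text.lower().split()
    summary_ngrams = [tuple(summary_tokens[i:i + n])
                      for i in range(len(summary_tokens) - n + 1)]
    if not summary_ngrams:
        return 0.0
    novel = sum(g not in source_ngrams for g in summary_ngrams)
    return 100.0 * novel / len(summary_ngrams)

print(novel_ngram_pct("the cat sat on the mat", "a cat sat on a mat", 2))  # 60.0
```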

Dataset Format

| key | description |
|---|---|
| aid | arXiv id (e.g. 2010.14235) |
| mid | Microsoft Academic Graph id |
| abstract | text of the paper's abstract |
| related_work | text of the related-work paragraph (the target summary) |
| ref_abstract | meta-information of the reference papers |
| ref_abstract.cite_N | meta-information of reference paper cite_N (special citation symbol) |
| ref_abstract.cite_N.mid | Microsoft Academic Graph id of reference paper cite_N |
| ref_abstract.cite_N.abstract | text of the abstract of reference paper cite_N |
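
Exactly how the files are serialized depends on the release; a minimal loading sketch, assuming one JSON object per line (the file name `train.json` is hypothetical):

```python
import json

# Hypothetical path; point this at the released train/val/test files.
with open("train.json") as f:
    examples = [json.loads(line) for line in f]

ex = examples[0]
print(ex["aid"], ex["mid"])      # arXiv id and MAG id of the query paper
print(ex["abstract"][:80])       # abstract of the query paper
for cite_tag, ref in ex["ref_abstract"].items():
    # cite_tag is the special citation symbol, e.g. "@cite_2"
    print(cite_tag, ref["mid"], ref["abstract"][:80])
```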

Extended Usage

Our dataset is aligned with the Microsoft Academic Graph (MAG). Anyone interested in the intersection of graphs and summarization can use it for exploration.
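
For example, since every record carries the MAG id of the query paper and of each reference, a citation edge list falls out of the schema directly. A minimal sketch, reusing the `examples` list from the loading sketch above:

```python
# Citation edge list: (citing paper's mid, cited paper's mid).
edges = [
    (ex["mid"], ref["mid"])
    for ex in examples                       # as loaded in the Dataset Format sketch
    for ref in ex["ref_abstract"].values()
    if ref["mid"]                            # skip references with a missing MAG id
]
print(f"{len(edges)} citation edges across {len(examples)} papers")
```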

Issues

Some doubts about the ROUGE-L results

The ROUGE-L results in your paper are relatively high, reaching 30.63 for the Pointer-Generator. My model comes close to your ROUGE-1 and ROUGE-2 scores, but its ROUGE-L is much worse, only 18.77.

So I wonder: is your ROUGE-L the summary-level or the sentence-level variant? Thanks.
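
For anyone hitting the same gap: the two conventions differ substantially, and Google's rouge-score package implements both. `rougeL` computes a single LCS over the whole text, while `rougeLsum` (intended to match the classic ROUGE-1.5.5 script's summarization setting) splits on newlines and aggregates per-sentence LCS matches, which typically yields much higher scores. A minimal sketch:

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL", "rougeLsum"], use_stemmer=True)

# rougeLsum expects sentences separated by "\n".
reference = "The model adapts to the user.\nIt outperforms the baseline."
prediction = "It outperforms the baseline.\nThe model adapts to the user."

# Reordered sentences hurt rougeL (one global LCS) far more than
# rougeLsum (per-sentence matching), so the two can diverge widely.
scores = scorer.score(reference, prediction)
print("rougeL   ", scores["rougeL"].fmeasure)
print("rougeLsum", scores["rougeLsum"].fmeasure)
```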

Could you provide me with all the system outputs of the baselines?

Hi,
I am using the Multi-XScience dataset for a scientific summarization task. I want to measure the performance of all the baselines mentioned in your paper (LEAD, TEXTRANK, LEXRANK, HIERSUMM, HIMAP, BERTABS, BART, SCIBERTABS) with the BERTScore metric. Could you please share all the system outputs, if possible and convenient?
Thanks a lot.
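
For reference, scoring a file of system outputs against gold summaries with the bert-score package looks roughly like this (the file names and the one-summary-per-line layout are assumptions):

```python
from bert_score import score

# One summary per line, candidates aligned with references.
with open("system_outputs.txt") as f:
    candidates = [line.strip() for line in f]
with open("gold_summaries.txt") as f:
    references = [line.strip() for line in f]

P, R, F1 = score(candidates, references, lang="en", verbose=True)
print(f"BERTScore F1: {F1.mean().item():.4f}")
```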

Human evaluation

Hello!
I would like to know if I can have access to the 25 randomly sampled examples and the human judges' scores.

Code of the model and evaluation

Hi, I am really interested in your work. Could you please also release the model code for the experiments (I want to know whether you changed the original models for this new setting, and how you fed your data to them), as well as your evaluation code?

The high proportion of novel unigrams

Thank you for sharing this dataset.
According to your statistics, 42.33% of the unigrams in the target summaries are novel. Isn't that too high for a summarization task?
I understand that authors tend to use new expressions when introducing others' previous work, so the proportions of novel bigrams, trigrams, and 4-grams in this dataset can reasonably be higher than in other datasets. But that many novel unigrams may not be very common, even in academic papers.
I worry that a large proportion of the information in the target summaries is not contained in the inputs, which may be beyond the scope of text summarization. The dataset-construction settings and the quality of the data sources may contribute to the high proportion of novel unigrams.
Besides, I found that 3,403 reference-paper abstracts in the test set are empty, and some of the abstracts are not real abstracts. I understand it is difficult to ensure the quality of data sources, and I appreciate your efforts in building this dataset.

Missing abstracts in dataset

There are some missing abstracts in the dataset. Is this a data-collection issue or an issue with the released dataset?

Example where the abstract field of @cite_0 is empty:

{
'aid': 'cs9903008', 
'mid': '2949261815', 
'abstract': "Recent technological advances have made it possible to build real-time, interactive spoken dialogue systems for a wide variety of applications. However, when users do not respect the limitations of such systems, performance typically degrades. Although users differ with respect to their knowledge of system limitations, and although different dialogue strategies make system limitations more apparent to users, most current systems do not try to improve performance by adapting dialogue behavior to individual users. This paper presents an empirical evaluation of TOOT, an adaptable spoken dialogue system for retrieving train schedules on the web. We conduct an experiment in which 20 users carry out 4 tasks with both adaptable and non-adaptable versions of TOOT, resulting in a corpus of 80 dialogues. The values for a wide range of evaluation measures are then extracted from this corpus. Our results show that adaptable TOOT generally outperforms non-adaptable TOOT, and that the utility of adaptation depends on TOOT's initial dialogue strategies.", 
'related_work': "In the area of spoken dialogue, @cite_2 has proposed a method for adapting initiative in form-filling dialogues. Whenever the system rejects a user's utterance, the system takes more initiative; whenever the user gives an over-informative answer, the system yields some initiative. While this method has the potential of being automated, the method has been neither fully implemented nor empirically evaluated. @cite_3 has evaluated strategies for dynamically deciding whether to confirm each user utterance during a task-oriented dialogue. Simulation results suggest that context-dependent adaptation strategies can improve performance, especially when the system has greater initiative. @cite_1 and @cite_0 have used reinforcement learning to adapt dialogue behavior over time such that system performance improves. We have instead focused on optimizing performance during a single dialogue.", 
'ref_abstract': {
   '@cite_0': {'mid': '200223693', 'abstract': ''}, 
   '@cite_1': {'mid': '2141839844', 'abstract': "This paper describes a novel method by which a dialogue agent can learn to choose an optimal dialogue strategy. While it is widely agreed that dialogue strategies should be formulated in terms of communicative intentions, there has been little work on automatically optimizing an agent's choices when there are multiple ways to realize a communicative intention. Our method is based on a combination of learning algorithms and empirical evaluation techniques. The learning component of our method is based on algorithms for reinforcement learning, such as dynamic programming and Q-learning. The empirical component uses the PARADISE evaluation framework (, 1997) to identify the important performance factors and to provide the performance function needed by the learning algorithm. We illustrate our method with a dialogue agent named ELVIS (EmaiL Voice Interactive System), that supports access to email over the phone. We show how ELVIS can learn to choose among alternate strategies for agent initiative, for reading messages, and for summarizing email folders."}, 
   '@cite_3': {'mid': '2063157598', 'abstract': 'As with human?human interaction, spoken human?computer dialog will contain situations where there is miscommunication. One natural strategy for reducing the impact of miscommunication is selective verification of the user utterance meanings. This paper reports on both context-independent and context-dependent strategies for utterance verification that show that the use of dialog context can be very helpful in selecting which utterances to verify. Simulations with data collected during experimental trials with the Circuit Fix-It Shop spoken natural language dialog system are used in the analysis. In addition, the performance of various selection strategies is measured separately for computer-controlled and user-controlled dialogs and general guidelines for selecting an appropriate strategy are presented.'}, 
   '@cite_2': {'mid': '1882353391', 'abstract': 'While user modelling has become a mature field with demonstrable research systems of great power, comparatively little progress has been made in the development of user modelling components for commercial software systems. The development of minimalist user modelling components, simplified to provide just enough assistance to a user through a pragmatic adaptive user interface, is seen by many as an important step toward this goal. This paper describes the development, implementation, and empirical evaluation of a minimalist user modelling component for TIMS, a complex commercial software system for financial management. The experimental results demonstrate that a minimalist user modelling component does improve the subjective measure of user satisfaction. Important issues and considerations for the development of user modelling components for commercial software systems are also discussed.'}
}
}
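
A quick way to quantify the issue for a split (a sketch; the file name `test.json` is hypothetical, one JSON object per line as in the loading sketch above):

```python
import json

with open("test.json") as f:
    test_examples = [json.loads(line) for line in f]

# Count reference entries whose abstract is an empty string.
empty = sum(
    1
    for ex in test_examples
    for ref in ex["ref_abstract"].values()
    if not ref["abstract"].strip()
)
print(f"{empty} empty reference abstracts in the test set")
```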

Document length

Are the specified lengths in tokens, words, or characters? Please consider updating the README for clarity.

Thank you for the dataset :)
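
Until the README pins this down, the statistics are easy to recompute under whatever tokenization you prefer; for example, with plain whitespace tokens (an assumption, not necessarily what the paper used):

```python
def n_tokens(text):
    """Length in whitespace-separated tokens (an assumed tokenization)."""
    return len(text.split())

# `examples` as loaded in the Dataset Format sketch above.
doc_lens = [
    n_tokens(ex["abstract"])
    + sum(n_tokens(r["abstract"]) for r in ex["ref_abstract"].values())
    for ex in examples
]
sum_lens = [n_tokens(ex["related_work"]) for ex in examples]
print(sum(doc_lens) / len(doc_lens), sum(sum_lens) / len(sum_lens))
```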

hiersumm

Hi, regarding HIERSUMM: how did you prepare your data? Did you use the authors' SentencePiece model, and how did you encode your text? I noticed they didn't release the data-preparation code. Thanks in advance.
Also, for evaluation, which do you think is better: replacing @cite_2 with a generic @cite token, or just removing it?
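
A minimal sketch of the two options being discussed, assuming citation tags always follow the @cite_N pattern from the schema above:

```python
import re

CITE_RE = re.compile(r"@cite_\d+")

def replace_cites(text):
    """Option 1: collapse every @cite_N tag into a generic @cite token."""
    return CITE_RE.sub("@cite", text)

def remove_cites(text):
    """Option 2: drop @cite_N tags entirely and tidy the whitespace."""
    return re.sub(r"\s+", " ", CITE_RE.sub("", text)).strip()

s = "In the area of spoken dialogue, @cite_2 has proposed a method."
print(replace_cites(s))  # "... dialogue, @cite has proposed a method."
print(remove_cites(s))   # "... dialogue, has proposed a method."
```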
