
MIMICS: A Large-Scale Data Collection for Search Clarification

Asking clarifying questions has been recognized as a major component of conversational information seeking systems. MIMICS is a collection of search clarification datasets for real search queries sampled from the Bing query logs. Each clarification in MIMICS consists of a clarifying question and up to five candidate answers. Here is an example:

  • query: headaches
  • question: What do you want to know about this medical condition?
  • candidate answers (options): symptom, treatment, causes, diagnosis, diet

MIMICS contains three datasets:

  • MIMICS-Click includes over 400k unique queries, their associated clarification panes, and the corresponding aggregated user interaction signals (i.e., clicks).
  • MIMICS-ClickExplore is an exploratory dataset that includes aggregated user interaction signals for over 60k unique queries, each with multiple clarification panes.
  • MIMICS-Manual includes over 2k unique real search queries. Each query-clarification pair in this dataset has been manually labeled by at least three trained annotators. It contains graded quality labels for the clarifying question, the candidate answer set, and the landing result page for each candidate answer.

MIMICS enables researchers to study a number of tasks related to search clarification, including clarification generation and selection, user engagement prediction for clarification, click models for clarification, and analysis of user interactions with search clarification. For more information on the datasets, clarification generation, and user interactions with clarification, refer to the paper cited in the Citation section below.

Data Format

The datasets are released in a tab-separated file format (TSV), with the header in the first row of each file. The column descriptions are given below. For more details, refer to the paper cited in the Citation section below.

MIMICS-Click and MIMICS-ClickExplore

| Column(s) | Description |
| --- | --- |
| query (string) | The query text. |
| question (string) | The clarifying question. |
| option_1, ..., option_5 (string) | Up to five candidate answers. |
| impression_level (string) | A three-level impression label (i.e., low, medium, or high). |
| engagement_level (integer) | A label in [0, 10] representing total user engagement. |
| option_cctr_1, ..., option_cctr_5 (real) | The conditional click probability on each candidate answer. |
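As a minimal sketch of working with this format, the TSV files can be parsed with Python's standard `csv` module. The inline sample below is illustrative (only a subset of columns, made-up values); with a real download you would pass an open file handle for the actual TSV instead:

```python
import csv
import io

def load_mimics_tsv(f):
    """Parse a MIMICS TSV stream into a list of row dictionaries keyed by column name."""
    return list(csv.DictReader(f, delimiter="\t"))

# Tiny inline sample mirroring part of the MIMICS-Click header; in practice,
# pass open("MIMICS-Click.tsv", encoding="utf-8") or similar.
sample = io.StringIO(
    "query\tquestion\toption_1\toption_2\timpression_level\tengagement_level\n"
    "headaches\tWhat do you want to know about this medical condition?\tsymptom\ttreatment\tlow\t3\n"
)
rows = load_mimics_tsv(sample)
print(rows[0]["query"], rows[0]["engagement_level"])
```

Note that `csv.DictReader` returns every field as a string, so numeric columns such as engagement_level and the option_cctr_* values need explicit conversion before use.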

MIMICS-Manual

| Column(s) | Description |
| --- | --- |
| query (string) | The query text. |
| question (string) | The clarifying question. |
| option_1, ..., option_5 (string) | Up to five candidate answers. |
| question_label (integer) | A three-level quality label for the clarifying question. |
| options_overall_label (integer) | A three-level quality label for the candidate answer set. |
| option_label_1, ..., option_label_5 (integer) | A three-level quality label for the landing result page of each candidate answer. |

Bing Web Search API Results for MIMICS Queries

To enable researchers to use search engine result page (SERP) information, we have also released the search results returned by Bing's Web Search API for all the queries in the MIMICS datasets. The SERP data can be downloaded from here (3.2GB compressed, 16GB decompressed). Each line in the file can be loaded as a JSON object and contains all the information returned by Bing's Web Search API.
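Since the dump is one JSON object per line, it can be read incrementally rather than loaded whole. A sketch, using a made-up one-line sample (the field names shown are illustrative, not guaranteed to match the actual API response schema):

```python
import json

def iter_serp_records(lines):
    """Yield one parsed JSON object per non-empty line of the SERP dump."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)

# Hypothetical sample line; in practice, iterate over the decompressed file,
# e.g. with open("serp_dump.jsonl", encoding="utf-8") as f: ...
sample = ['{"query": "headaches", "webPages": {"value": []}}']
for record in iter_serp_records(sample):
    print(record["query"])
```

Streaming line by line keeps memory use flat, which matters for a 16GB decompressed file.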

Citation

If you find MIMICS useful, please cite the following article:

Hamed Zamani, Gord Lueck, Everest Chen, Rodolfo Quispe, Flint Luu, and Nick Craswell. "MIMICS: A Large-Scale Data Collection for Search Clarification", In Proc. of CIKM 2020.

bibtex:

@inproceedings{mimics,
  title={MIMICS: A Large-Scale Data Collection for Search Clarification},
  author={Zamani, Hamed and Lueck, Gord and Chen, Everest and Quispe, Rodolfo and Luu, Flint and Craswell, Nick},
  booktitle = {Proceedings of the 29th ACM International Conference on Information and Knowledge Management},
  series = {CIKM '20},
  year={2020},
}

License

MIMICS is distributed under the MIT License. See the LICENSE file for more information.

Terms and Conditions

The MIMICS datasets are intended for non-commercial research purposes only, to promote advancement in the field of information retrieval and related areas, and are made available free of charge without extending any license or other intellectual property rights. The datasets are provided "as is" without warranty, and usage of the data carries risks since we may not own the underlying rights in the documents. We are not liable for any damages related to use of the datasets. Feedback is voluntarily given and can be used as we see fit. By using any of these datasets you automatically agree to abide by these terms and conditions. Upon violation of any of these terms, your rights to use the datasets will end automatically. If you have questions about use of the datasets or any research outputs in your products or services, we encourage you to undertake your own independent legal review. For other questions, please feel free to contact us at [email protected].

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.


mimics's Issues

Annotation handbook

Could you please share the annotation handbook and the algorithm mentioned in "Generating Clarifying Questions for Information Retrieval"? I fully understand if you can't share them due to sensitivity concerns.

About impression level

Hi, in MIMICS-Click and MIMICS-ClickExplore datasets, what's the meaning of impression_level? Can it denote the quality of the clarifying question?

About the intents of queries

In your paper "Analyzing and Learning from User Interactions for Search Clarification", you use two datasets for estimating the intents of each query. However, you did not release the intent data. Will you release the intent data to promote research on MIMICS?

Related to Conditional Click Probabilities

Hi,

Thanks for releasing a user interaction dataset related to conversational search; this is a great resource for people working on learning-to-rank (LTR) for conversational search.

I have gone through the dataset paper and I have one question; please pardon me if it was already described in the paper and I missed it.

Is the conditional click probability for each candidate its corresponding propensity score?
Context: I am thinking of modelling it in a contextual bandit setup, hence this question.

Kindly let me know.

Thanks
