Code Monkey home page Code Monkey logo

kiscovery's Introduction

Kiscovery

Exploring and Verbalizing Academic Ideas by Concept Co-occurrence

This is the official repository for the paper "Exploring and Verbalizing Academic Ideas by Concept Co-occurrence" by Yi Xu, Shuqian Sheng, Bo Xue, Luoyi Fu, Xinbing Wang, and Chenghu Zhou. The paper proposes a framework based on concept co-occurrence for academic idea inspiration, which has been integrated into a research assistant system DeepReport. The system can assist researchers in expediting the process of discovering new ideas.

DeepReport can be accessed at idea.acemap.cn.

DeepReport Manual

Statistics

The evolving concept co-occurrence graph and co-occurrence citation quintuple datasets used in this study are available for download in Huggingface.

Evolving Concept Co-occurrence Graph

To train our model for temporal link prediction, we first collect 240 essential and common queries from 19 disciplines and one special topic (COVID-19). Then, we enter these queries into the paper database to fetch the most relevant papers between 2000 and 2021 with Elasticsearch, a modern text retrieval engine that stores and retrieves papers. Afterward, we use information extraction tools including AutoPhrase to identify concepts. Only high-quality concepts that appear in our database will be preserved. Finally, we construct 240 evolving concept co-occurrence graphs, each containing 22 snapshots according to the co-occurrence relationship. The statistics of the concept co-occurrence graphs are provided in Appendix I.

Download from Huggingface

Download with git, and you should install git-lfs first

sudo apt-get install git-lfs
# OR
brew install git-lfs

git lfs install
git clone https://huggingface.co/datasets/Reacubeth/ConceptGraph

Co-occurrence Citation Quintuple

We construct and release a dataset of co-occurrence citation quintuples, which is used to train text generation model for idea verbalization. The process of identifying and processing concepts is similar to constructing the concept co-occurrence graph. Heuristic rules are adopted to filter redundant and noisy sentences, further improving the quality of the quintuples used for idea generation. More details of co-occurrence citation quintuples can be found in Appendix B, C, and J.

Co-occurrence Citation Quintuple

In mid-2023, our DeepReport system underwent a major update, encompassing both data and model improvements. On the data front, we introduced a new version of the quintuple data (V202306), resulting in enhanced quality and a larger-scale dataset. The statistical summary of the new quintuple data (V202306) is presented as follows:

Discipline Quintuple Concept Concept Pair Total $p$ Total $p_1$ & $p_2$
Art 7,510 2,671 5,845 2,770 7,060
History 5,287 2,198 4,654 2,348 5,764
Philosophy 45,752 4,773 25,935 16,896 29,942
Sociology 16,017 4,054 12,796 7,066 16,416
Political Science 67,975 6,105 42,411 26,198 53,933
Business 205,297 9,608 99,329 62,332 112,736
Geography 191,958 12,029 118,563 42,317 112,909
Engineering 506,635 16,992 249,935 137,164 273,894
Geology 365,183 13,795 190,002 98,991 222,358
Medicine 168,697 13,014 114,104 42,535 138,973
Economics 227,530 9,461 113,527 68,607 131,387
Physics 267,532 10,831 133,079 84,824 176,741
Biology 224,722 15,119 145,088 59,210 189,281
Mathematics 312,670 17,751 190,734 95,951 218,697
Psychology 476,342 9,512 194,038 115,725 212,180
Computer Science 531,654 16,591 244,567 151,809 238,091
Environmental Science 583,466 11,002 226,671 94,474 201,330
Materials Science 573,032 17,098 249,251 145,068 313,657
Chemistry 565,307 13,858 231,062 108,637 286,593
Total 5,342,566 206,462 2,591,591 1,362,922 2,941,942

Download from Huggingface

Download with git

sudo apt-get install git-lfs
# OR
brew install git-lfs

git lfs install
git clone https://huggingface.co/datasets/Reacubeth/Co-occurrenceCitationQuintuple

Ethics Statement

The datasets used in our research are collected through open-source approaches. The whole process is conducted legally, following ethical requirements. As for the Turing Test in our study, all participants are well informed about the purpose of experiments and the usage of test data, and we would not leak out or invade their privacy.

We see opportunities for researchers to apply the system to idea discovery, especially for interdisciplinary jobs. We encourage users to explore different combinations of subjects with the help of our system, making the most of its knowledge storage and thus maximizing the exploration ability of the system.

The main focus of the system is to provide a possible direction for future research, but the effect of human researchers will never be neglected.

The massive data from various disciplines behind the system makes it capable of viewing the knowledge of an area in a multi-dimensional perspective and thus helps promote the development of novel interdisciplinary. However, considering the risks of misinformation generated by NLP tools, the verbalization only contains possible insights into new ideas. Researchers must thoroughly consider whether an idea is feasible or leads to adverse societal effects.

Acknowledgements

We would like to express our deepest gratitude to the scientists involved in the Deep-time Digital Earth program, whose contributions have been incredibly valuable. Additionally, we extend our thanks to Zhongmou He, Jia Guo, Zijun Di, Pandong Ma, Shengling Zhu, Yanpeng Li, Qi Li, Jiaxin Ding and Tao Shi from the IIOT Research Center at Shanghai Jiao Tong University for their unwavering support during the development of our system. This work was supported by NSF China (No.42050105, 62020106005, 62061146002, 61960206002), Shanghai Pilot Program for Basic Research - Shanghai Jiao Tong University.

Citation

If you use our work in your research or publication, please cite us as follows:

@inproceedings{xu2023exploring,
  title={Exploring and Verbalizing Academic Ideas by Concept Co-occurrence},
  author={Xu, Yi and Sheng, Shuqian and Xue, Bo and Fu, Luoyi and Wang, Xinbing and Zhou, Chenghu},
  booktitle={Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL)},
  year={2023}
}

Please let us know if you have any questions or feedback. Thank you for your interest in our work!

kiscovery's People

Contributors

xyjigsaw avatar

Stargazers

LittleFish avatar  avatar susisheng avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.