godseye

Monitor biomedical trends visually across ALL of PubMed + bioRxiv

About

Introduction

Bioinformatics is one of the fastest-growing interdisciplinary fields. As new technologies emerge, new types of data come into the spotlight, creating the need for novel computational approaches and methodologies that can handle them. As a result, the community's interest in specific areas often shifts dramatically over a short period of time. Here we present a text mining approach to systematically identify trending topics in bioinformatics over time and space, as embodied in the abstracts and titles of journal articles.

Using keyword prominence and an efficient temporal segmentation algorithm, our method highlights trending topics in the bioinformatics literature, and can be helpful in predicting the ever-changing demands of the bioinformatics job market.

Data Sources

We combined the NCBI MEDLINE®/PubMed® database and the bioRxiv® database to extract the titles, abstracts, and author affiliations of all published papers and preprints, respectively.

Impact

By including bioRxiv, godseye can monitor the pulse of the biological research community with a delay of only a few days, which is the average time it takes for a submitted preprint to appear publicly on bioRxiv. In contrast, PubMed's delay is typically 3-18 months, the usual length of the peer-review cycle. Once a preprint is published in a peer-reviewed journal and thereby available on PubMed, godseye defers to the PubMed record when extracting information from the abstract and title (since these are often updated relative to the bioRxiv version of the paper).
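The precedence rule described above can be sketched as follows. This is an illustrative sketch only, not godseye's actual code: the `Record` type, its field names, and the `published_doi` link between a preprint and its journal version are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Record:
    """Hypothetical, simplified view of one paper or preprint."""
    source: str                          # "pubmed" or "biorxiv"
    doi: str
    title: str
    abstract: str
    # Assumption: a preprint carries the DOI of its journal version
    # once that version exists; None while it is still unpublished.
    published_doi: Optional[str] = None


def resolve_records(preprints: List[Record],
                    pubmed_records: List[Record]) -> List[Record]:
    """Keep every PubMed record, plus only those preprints that have
    no peer-reviewed counterpart among the PubMed records."""
    pubmed_dois = {r.doi for r in pubmed_records}
    resolved = list(pubmed_records)
    for preprint in preprints:
        if preprint.published_doi in pubmed_dois:
            continue  # the PubMed version takes precedence
        resolved.append(preprint)
    return resolved
```

With this rule, a preprint contributes its title and abstract only until its journal version is indexed, so each paper is counted once.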

Algorithms

Keyword prominence in a time range

We define the prominence of a keyword w as the fraction of journal abstracts in a given time range that contain w. A user-chosen threshold α then filters out keywords whose prominence is below α.
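A minimal sketch of this definition, assuming the abstracts for one time range are already available as plain strings. The function names and the simple lowercase alphanumeric tokenizer are illustrative assumptions, not part of godseye.

```python
import re
from typing import Dict, Iterable, List


def keyword_prominence(abstracts: List[str],
                       keywords: Iterable[str]) -> Dict[str, float]:
    """Return, for each keyword, the fraction of abstracts containing it."""
    keywords = [kw.lower() for kw in keywords]
    if not abstracts:
        return {kw: 0.0 for kw in keywords}
    counts = {kw: 0 for kw in keywords}
    for text in abstracts:
        # Assumption: a keyword "occurs" if it matches a whole token.
        tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
        for kw in keywords:
            if kw in tokens:
                counts[kw] += 1
    return {kw: counts[kw] / len(abstracts) for kw in keywords}


def prominent_keywords(abstracts: List[str],
                       keywords: Iterable[str],
                       alpha: float) -> Dict[str, float]:
    """Drop keywords whose prominence is below the threshold alpha."""
    prominence = keyword_prominence(abstracts, keywords)
    return {kw: p for kw, p in prominence.items() if p >= alpha}
```

Each abstract counts at most once per keyword, so repeated mentions within one abstract do not inflate the prominence.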

Temporal segmentation of the journal titles and abstracts

The main idea is to optimally segment the yearly data into smaller contiguous time ranges, in a way that maximizes the overlap of prominent keywords within each resulting temporal segment. The algorithm uses dynamic programming to efficiently compute an optimal segmentation. It takes as input the keyword frequency per year and the desired number of temporal segments n, and returns the optimal segments together with a list of prominent keywords in each segment. For more information on the original implementation of the algorithm, see Siy et al.
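The following is a sketch of how such a dynamic program could look. It is not the original implementation from Siy et al. The scoring rule is an assumption made for illustration: a segment scores the number of keywords that are prominent in every one of its years, multiplied by its length in years. The function name `segment_years` and the input layout (one set of already-filtered prominent keywords per year) are likewise hypothetical.

```python
from typing import List, Set, Tuple

Segment = Tuple[Tuple[int, int], List[str]]  # ((first_year, last_year), keywords)


def segment_years(year_keywords: List[Tuple[int, Set[str]]], n: int) -> List[Segment]:
    """Split chronologically sorted years into n contiguous segments
    that maximize the total (assumed) keyword-overlap score."""
    years = [year for year, _ in year_keywords]
    sets = [set(kws) for _, kws in year_keywords]
    m = len(sets)
    if not 1 <= n <= m:
        raise ValueError("n must be between 1 and the number of years")

    # shared[i][j]: keywords prominent in every year from index i to j.
    shared = [[set() for _ in range(m)] for _ in range(m)]
    for i in range(m):
        common = set(sets[i])
        for j in range(i, m):
            common = common & sets[j]
            shared[i][j] = common

    def score(i: int, j: int) -> int:
        # Assumed scoring rule: shared keywords times segment length.
        return len(shared[i][j]) * (j - i + 1)

    neg_inf = float("-inf")
    # best[k][j]: best score for splitting the first j years into k segments.
    best = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    back = [[0] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0
    for k in range(1, n + 1):
        for j in range(k, m + 1):
            for i in range(k - 1, j):
                if best[k - 1][i] == neg_inf:
                    continue
                candidate = best[k - 1][i] + score(i, j - 1)
                if candidate > best[k][j]:
                    best[k][j] = candidate
                    back[k][j] = i

    # Walk the back-pointers from the last segment to the first.
    segments: List[Segment] = []
    j = m
    for k in range(n, 0, -1):
        i = back[k][j]
        segments.append(((years[i], years[j - 1]), sorted(shared[i][j - 1])))
        j = i
    segments.reverse()
    return segments
```

For m years and n segments, the table has on the order of n·m² transitions, after an m² precomputation of the shared keyword sets. A different scoring rule can be substituted by changing `score` alone.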

Future plans

  • Modularize Python code with OOP methods
  • Analyze PDF/HTML contents of a PubMed or bioRxiv paper, not just its abstract, title, and author affiliations. For this task, consider integrating existing tools like fulltext, pdftools, and pubcrawl
  • Implement the dynamic programming algorithm to achieve optimal temporal segmentation. Also reconsider the other algorithm choices (beyond temporal segmentation), since more suitable alternatives may exist
  • Expand godseye into ML territory with snorkel
  • Implement a graph database to understand the relations among any set of keywords over a time period or geographical region (e.g., the spatial and temporal evolution of keyword co-occurrences)

Contact

You are welcome to:

Code of conduct

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.

Contributors (alphabetical by last name)

Attribution

This work is a hard fork of biotrends. It is an academic partnership between Drs. Ghersi and Khomtchouk at UNO and Stanford, respectively.

Citation

Coming soon!

Contributors

kasraavand, bohdan-khomtchouk
