ahmed-u-github / pubmed_scraper_gpt

This project forked from ybryan95/pubmed_scraper_gpt

This Python script retrieves and analyzes scientific literature from PubMed related to specific genes, creates a word cloud visualization of key terms in the abstracts, and saves the results in a Word document. It utilizes the OpenAI GPT-3 model, BioPython's Entrez, and the mygene.info API for its operation.

Jupyter Notebook 100.00%

pubmed_scraper_gpt's Introduction

PubMed Abstract Analysis and Report Generation

This tool automates finding, analyzing, and documenting biomedical research papers from PubMed, with a focus on specific genes and conditions such as autism. It uses OpenAI's GPT-4 language model together with biomedical NLP models for spaCy to search, analyze, and present the results. Note: This is an improved version of (https://github.com/ybryan95/PubMed_scraper_GPT); it incorporates the GPT model and requires an OpenAI API key to operate. If you want to stick to the free version, go to the link provided.

Table of Contents

  1. Installation
  2. Scripts
  3. Usage
  4. Key Functions
  5. Wordcloud
  6. Configuration
  7. Dependencies

Installation

The scripts in this repository utilize various Python libraries including:

  • Biopython: For extracting abstracts from PubMed.
  • nltk: For natural language processing tasks.
  • pandas: For data manipulation.
  • OpenAI's GPT: For context-based analysis.
  • matplotlib: For data visualization.

You can install these libraries with pip. The Dependencies section lists further packages (spaCy, requests, python-docx) that this command does not cover:

pip install biopython nltk pandas openai matplotlib

Scripts

This repository contains a single Python script that performs several operations:

  • Fetch abstracts related to a certain gene and its full name from PubMed.
  • Use OpenAI's GPT to analyze whether the gene is used in the context of a transcription factor.
  • Generate a word cloud to visualize the frequency of terms in the abstracts.
  • Generate a report document containing the gene, abstract, and corresponding word cloud.

Usage

After cloning the repository, update the TF2.xlsx file with your genes of interest. Then open the .ipynb file in Jupyter Notebook, making sure TF2.xlsx is in the same directory as the notebook.

This script takes a list of genes from an Excel file ('TF2.xlsx') and for each gene:

  1. Fetches the full name of the gene from the MyGene.info API.
  2. Forms a search query to be used in PubMed to fetch articles related to that gene and autism.
  3. Filters out any articles related to cancer or tumors.
  4. For each article found, it:
       • Fetches the title, publication date, and abstract.
       • Uses the OpenAI API to answer specific questions about the gene in the context of the abstract.
       • Determines the sentiment of OpenAI's response.
       • If the sentiment is positive, adds the article details to a dataframe.
  5. Generates a word cloud image for the abstract of each included article.
  6. Writes all the gene information, article details, and word cloud image to a Word document.

In the Word document output ('output.docx'), each gene is represented by two rows. The first row contains the gene name and the second row contains the article details and the word cloud image.
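The query-building and filtering steps in the workflow above could look roughly like the following minimal sketch. The function name `build_pubmed_query`, the `[Title/Abstract]` field tags, and the exact exclusion terms are assumptions made for illustration; the notebook's own `gene_to_search()` may construct its queries differently.

```python
def build_pubmed_query(symbol, full_name, condition="autism",
                       excluded=("cancer", "tumor")):
    """Return a PubMed search string for a gene and a condition.

    Hypothetical helper: the field tags and exclusion terms are
    assumptions, not taken from the notebook.
    """
    # Match either the gene symbol or its full name.
    gene_terms = [f"{symbol}[Title/Abstract]"]
    if full_name:
        gene_terms.append(f'"{full_name}"[Title/Abstract]')
    gene_clause = "(" + " OR ".join(gene_terms) + ")"

    # Require the condition of interest.
    query = f"{gene_clause} AND {condition}[Title/Abstract]"

    # Exclude unwanted topics such as cancer or tumors.
    if excluded:
        exclusion = " OR ".join(f"{term}[Title/Abstract]" for term in excluded)
        query += f" NOT ({exclusion})"
    return query


if __name__ == "__main__":
    print(build_pubmed_query("FOXP2", "forkhead box P2"))
```

A string built this way can be passed as the `term` argument of Biopython's `Entrez.esearch(db="pubmed", term=query)`, after setting `Entrez.email` as NCBI requests.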

Key Functions

The report generated by the script is a Word document with the following structure for each abstract:

  • Gene: The gene of interest.
  • Info: The abstract and related information including the URL, DOI, Title, and Year.
  • Wordcloud: An image of a word cloud representing the frequency of terms in the abstract.

A sample output.docx is included in this GitHub repository.

  • gene_fullName(): This function takes the gene abbreviation and returns the full gene name using the MyGene.info API.
  • gene_to_search(): It takes a gene and its full name as inputs and creates two search queries to be used in PubMed.
  • search_Query_GPT(): For each query, it fetches articles from PubMed and analyzes them to see if they should be included in the results.
  • article_Interest(): It determines whether an article should be included in the final results based on the sentiment of OpenAI's response.
  • generate_wordcloud(): It generates a word cloud image for the abstract of the included articles.
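As an illustration of the kind of lookup `gene_fullName()` performs, here is a minimal sketch using only the standard library. The helper names are hypothetical, and the endpoint, query parameters, and response shape (a `hits` list whose entries carry a `name` field) reflect the MyGene.info v3 query API as understood here; the notebook itself may call the service differently, for example through `requests`.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed MyGene.info v3 query endpoint.
MYGENE_QUERY_URL = "https://mygene.info/v3/query"


def build_mygene_url(symbol, species="human"):
    """Build a query URL asking only for the gene's full name."""
    params = {"q": f"symbol:{symbol}", "species": species, "fields": "name"}
    return f"{MYGENE_QUERY_URL}?{urlencode(params)}"


def extract_full_name(payload):
    """Return the 'name' of the first hit, or None if nothing matched."""
    hits = payload.get("hits", [])
    if not hits:
        return None
    return hits[0].get("name")


def fetch_full_name(symbol):
    """Query MyGene.info over the network and return the full gene name."""
    with urlopen(build_mygene_url(symbol), timeout=10) as response:
        return extract_full_name(json.load(response))
```

Keeping URL construction and response parsing separate from the network call makes both easy to check without contacting the service.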

Wordcloud

A word cloud is a visual representation of text data where the size of each word indicates its frequency or importance. Significant textual data points can be highlighted using a word cloud. Word clouds are widely used for analyzing data from social network websites.

The word cloud in the report is generated from the abstract and includes genes, anatomical structures, diseases, taxons, and stem cells. The word cloud provides a quick visual understanding of the main themes in the abstract.
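The frequency counting behind a word cloud can be sketched with the standard library alone. This is a simplification: it counts plain words rather than the biomedical entities the spaCy models extract, and the `term_frequencies` name and the small stopword set are assumptions for illustration.

```python
import re
from collections import Counter

# Small illustrative stopword set; a real run would use a fuller list.
STOPWORDS = frozenset(
    {"the", "a", "an", "of", "in", "and", "to", "is", "with", "for"}
)


def term_frequencies(text, stopwords=STOPWORDS):
    """Count how often each non-stopword term appears in the text."""
    # Lowercase, then keep alphanumeric tokens (hyphens allowed inside).
    words = re.findall(r"[a-z0-9][a-z0-9-]*", text.lower())
    return Counter(word for word in words if word not in stopwords)


if __name__ == "__main__":
    sample = "FOXP2 is linked to autism. FOXP2 variants in autism and speech."
    print(term_frequencies(sample).most_common(3))
```

If the image is rendered with the `wordcloud` package (an assumption about the notebook), a mapping like this can be handed to `WordCloud().generate_from_frequencies(...)`, which sizes each word by its count.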

Configuration

Before running the script, make sure to replace 'Your Open AI key' with your actual OpenAI API key.
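As an alternative to pasting the key into the notebook, the key could be read from an environment variable so that it never ends up in a saved file. The sketch below is an assumption about how one might do this, not how the notebook currently works; the `load_openai_key` helper and the `OPENAI_API_KEY` variable name are illustrative choices.

```python
import os

# Placeholder string the notebook asks you to replace.
PLACEHOLDER = "Your Open AI key"


def load_openai_key(env_var="OPENAI_API_KEY", environ=os.environ):
    """Return the API key from the environment, or raise if it is missing."""
    key = environ.get(env_var, "").strip()
    if not key or key == PLACEHOLDER:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your OpenAI API key."
        )
    return key
```

The returned value can then be assigned wherever the notebook currently contains the 'Your Open AI key' placeholder.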

Dependencies

Before running the code, make sure that you have the following dependencies installed: openai, nltk, biopython, spaCy, pandas, requests, matplotlib, and python-docx.

For the NLP task, it uses the following spaCy models, which are distributed by the scispaCy project: en_ner_bionlp13cg_md, en_ner_bc5cdr_md, and en_ner_craft_md.

pubmed_scraper_gpt's People

Contributors

ybryan95, ahmed-u-github
