This project automates finding, analyzing, and documenting biomedical research papers from PubMed, focusing on specific genes and conditions such as Autism. It combines OpenAI's GPT-4 language model with biomedical NLP libraries from spaCy to search, analyze, and present results in a comprehensive way. Note: this is an improved version of (https://github.com/ybryan95/PubMed_scraper_GPT) that incorporates the GPT model and requires an OpenAI API key to operate. If you want to stick to the free version, use the link above.
The scripts in this repository utilize various Python libraries including:
- Biopython: For extracting abstracts from PubMed.
- nltk: For natural language processing tasks.
- pandas: For data manipulation.
- OpenAI's GPT: For context-based analysis.
- matplotlib: For data visualization.
You can install all these libraries using pip (example below):

```
pip install biopython nltk pandas openai matplotlib
```
This repository contains a single Python script that performs several operations:
- Fetches abstracts related to a given gene (and its full name) from PubMed.
- Uses OpenAI's GPT to analyze whether the gene is used in the context of a transcription factor.
- Generates a word cloud to visualize the frequency of terms in the abstracts.
- Generates a report document containing the gene, abstract, and corresponding word cloud.
After cloning the repository, update the TF2.xlsx file with your genes of interest. Then open the .ipynb file in Jupyter Notebook and make sure TF2.xlsx is in the same directory as the notebook.
This script takes a list of genes from an Excel file ('TF2.xlsx') and for each gene:
- Fetches the full name of the gene from the MyGene.info API.
- Forms a search query to fetch PubMed articles related to that gene and Autism.
- Filters out any articles related to Cancer or Tumors.
- For each article found:
  - Fetches the title, publication date, and abstract.
  - Uses the OpenAI API to answer specific questions about the gene in the context of the abstract.
  - Determines the sentiment of OpenAI's response.
  - If the sentiment is positive, adds the article details to a dataframe.
- Generates a word cloud image for the abstracts of the included articles.
- Writes all the gene information, article details, and word cloud images to a Word document.

In the Word document output ('output.docx'), each gene is represented by two rows: the first row contains the gene name, and the second row contains the article details and the word cloud image.
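The Cancer/Tumor filtering step above can be sketched as a simple keyword check. The function name and keyword list here are illustrative assumptions, not necessarily the script's actual implementation:

```python
# Hypothetical sketch of the Cancer/Tumor exclusion filter described above.
# The keyword list is an assumption, not the script's exact terms.
EXCLUDED_TERMS = ("cancer", "tumor", "tumour")

def is_relevant(title, abstract):
    """Return True if the article mentions no excluded oncology terms."""
    text = f"{title} {abstract}".lower()
    return not any(term in text for term in EXCLUDED_TERMS)

print(is_relevant("SHANK3 and autism spectrum disorder", "A study of synaptic genes."))  # True
print(is_relevant("SHANK3 expression in breast cancer", "Tumor progression study."))     # False
```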
The report generated by the script is a Word document with the following structure for each abstract: Gene: The gene of interest. Info: The abstract and related information including the URL, DOI, Title, and Year. Wordcloud: An image of a word cloud representing the frequency of terms in the abstract.
A sample output.docx is available in this GitHub repository.
- gene_fullName(): This function takes the gene abbreviation and returns the full gene name using the MyGene.info API.
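A minimal sketch of how the MyGene.info lookup might work. The query endpoint is real, but the helper names and response handling are illustrative assumptions, not the script's actual code:

```python
import json
import urllib.request

# MyGene.info v3 query endpoint (real service); helper names are illustrative.
MYGENE_URL = "https://mygene.info/v3/query?q=symbol:{}&species=human&fields=name"

def parse_full_name(payload):
    """Extract the gene's full name from a MyGene.info query response dict."""
    hits = payload.get("hits", [])
    return hits[0].get("name") if hits else None

def gene_full_name(symbol):
    """Look up a gene symbol on MyGene.info (makes a network call)."""
    with urllib.request.urlopen(MYGENE_URL.format(symbol)) as resp:
        return parse_full_name(json.load(resp))

# Offline example with a response shaped like MyGene.info's:
sample = {"hits": [{"_id": "6662", "name": "SRY-box transcription factor 9"}]}
print(parse_full_name(sample))  # SRY-box transcription factor 9
```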
- gene_to_search(): It takes a gene and its full name as inputs and creates two search queries to be used in PubMed.
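The two search queries might be constructed along these lines; the exact query strings and field tags the script uses may differ:

```python
def gene_to_search(gene, full_name):
    """Build two PubMed queries: one using the symbol, one using the full name.

    The [Title/Abstract] field tags and the 'autism' term are assumptions
    about how the script scopes its searches.
    """
    q1 = f'("{gene}"[Title/Abstract]) AND (autism[Title/Abstract])'
    q2 = f'("{full_name}"[Title/Abstract]) AND (autism[Title/Abstract])'
    return q1, q2

q1, q2 = gene_to_search("SOX9", "SRY-box transcription factor 9")
print(q1)
print(q2)
```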
- search_Query_GPT(): For each query, it fetches articles from PubMed and analyzes them to see if they should be included in the results.
- article_Interest(): It determines whether an article should be included in the final results based on the sentiment of OpenAI's response.
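The inclusion decision could be as simple as checking whether GPT's answer is affirmative. This keyword heuristic is an assumption about how article_Interest() might behave, not its actual code:

```python
def article_interest(gpt_response):
    """Classify GPT's answer as positive (include article) or not (exclude).

    Hypothetical heuristic: treat answers starting with an affirmative word,
    or affirming the transcription-factor role, as positive sentiment.
    """
    text = gpt_response.strip().lower()
    if text.startswith(("yes", "indeed")):
        return True
    return "transcription factor" in text and "not" not in text

print(article_interest("Yes, SOX9 acts as a transcription factor in this study."))  # True
print(article_interest("No, the abstract does not discuss its regulatory role."))   # False
```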
- generate_wordcloud(): It generates a word cloud image for the abstract of the included articles.
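Generating the word cloud typically boils down to counting term frequencies in the abstract and handing them to the wordcloud package. The stopword list below is illustrative, not the script's actual list:

```python
import re
from collections import Counter

# Illustrative stopword list; the script likely uses nltk's full list.
STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "we", "this"}

def term_frequencies(abstract):
    """Count non-stopword terms in an abstract (lowercased)."""
    words = re.findall(r"[a-z0-9]+", abstract.lower())
    return Counter(w for w in words if w not in STOPWORDS)

freqs = term_frequencies("SOX9 regulates expression of genes in the developing brain.")
print(freqs.most_common(3))

# With the wordcloud package installed, the counts can then be rendered:
# from wordcloud import WordCloud
# WordCloud(width=800, height=400).generate_from_frequencies(freqs).to_file("cloud.png")
```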
A word cloud is a visual representation of text data where the size of each word indicates its frequency or importance. Significant textual data points can be highlighted using a word cloud. Word clouds are widely used for analyzing data from social network websites.
The word cloud in the report is generated from the abstract and includes genes, anatomical structures, diseases, taxons, and stem cells. The word cloud provides a quick visual understanding of the main themes in the abstract.
Before running the script, make sure to replace 'Your Open AI key' with your actual OpenAI API key.
Before running the code, make sure you have the following dependencies installed: openai, nltk, biopython, spaCy, pandas, requests, matplotlib, and python-docx.
For the NLP tasks, it uses the following spaCy models: en_ner_bionlp13cg_md, en_ner_bc5cdr_md, and en_ner_craft_md.
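With those scispaCy models installed, entity extraction for the word cloud could look like the sketch below. The label set shown is an assumption based on the categories listed earlier (genes, anatomical structures, diseases, taxons); check each model's documentation for its actual labels:

```python
# Pure helper: filter (text, label) pairs down to the categories of interest.
# The label names here are assumptions, not verified against each model.
WANTED_LABELS = {"GENE_OR_GENE_PRODUCT", "ORGAN", "TISSUE", "ORGANISM", "CANCER"}

def keep_entities(entities, wanted=WANTED_LABELS):
    """Keep entity texts whose NER label is in the wanted set."""
    return [text for text, label in entities if label in wanted]

# Offline example mimicking spaCy output; with the models installed you would do:
#   import spacy
#   nlp = spacy.load("en_ner_bionlp13cg_md")
#   entities = [(e.text, e.label_) for e in nlp(abstract).ents]
sample = [("SOX9", "GENE_OR_GENE_PRODUCT"), ("brain", "ORGAN"), ("2021", "DATE")]
print(keep_entities(sample))  # ['SOX9', 'brain']
```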