PubMed Scraping

This is an asynchronous PubMed scraper. It supports controlled concurrency and saves results in .parquet format for efficient data storage.

Output File Columns

| Column name    | Description                                                     |
|----------------|-----------------------------------------------------------------|
| url            | The URL of the source document.                                 |
| pmid           | The PubMed ID of the document.                                  |
| abstract       | The abstract or summary of the document.                        |
| keywords       | Keywords associated with the document.                          |
| published_date | The publication date of the document.                           |
| citation_doi   | The DOI (Digital Object Identifier) of the document citation.   |
| journal        | The journal in which the document was published.                |
| volume         | The volume of the journal in which the document was published.  |
| issue          | The issue number of the journal.                                |
| pages          | The page numbers of the document in the journal.                |

Table example

| url | pmid | abstract | keywords | published_date | citation_doi | journal | volume | issue | pages |
|-----|------|----------|----------|----------------|--------------|---------|--------|-------|-------|
| Example URL | 123456 | This is an example abstract for documentation purposes. | keyword1, keyword2 | 2023-09-21 | 10.12345/example | Example Journal | 42 | 3 | 101-120 |
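
Once the scraper has produced output, you could inspect a collected file with pandas, for instance. This is a minimal sketch: the file name below is hypothetical, and a parquet engine such as pyarrow must be installed:

```python
import pandas as pd

# Load one of the collected .parquet files (hypothetical file name)
# and inspect the columns described above.
df = pd.read_parquet("collected_data/articles.parquet")
print(df.columns.tolist())
print(df[["pmid", "journal", "published_date"]].head())
```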

Clone project

git clone https://github.com/FourierMourier/async_pubmed_scraper.git

Setup environment

Using conda

Create env

conda create --name {env} python=3.10

Activate env

conda activate {env}

Using venv

Create env

python -m venv venv

Activate env

Windows

venv\Scripts\activate.bat

POSIX

source venv/bin/activate

Install requirements

cd async_pubmed_scraper
pip install -r requirements.txt

There you go

User agents list

Before starting the scraping process, you'll need a list of user agents. You can download a user agents list from any source you like.
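
As an assumption about the expected file format (one user-agent string per line is a common convention, but it isn't spelled out here), an agents_list.txt might look like:

```text
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36
Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/117.0
```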

Run

To use it, copy the example config file main.example.yaml to main.yaml and adjust the paths:

# path to your user agents list file
user_agents_list_path: path/to/your/agents_list.txt
# directory where the collected PubMed results will be saved
output_dir: collected_data

Then run main.py:
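
python main.py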

Some implementation notes

Straightforward Approach and Its Pitfall: When it comes to implementation, you might be tempted to take the straightforward approach: create an asyncio.Task for each individual unit of work, such as checking a URL, and schedule it as soon as you encounter it. For instance, when processing a single results page, you might spawn a task for each article URL found on that page. With this approach, however, all tasks share the same aiohttp.ClientSession, and any server error can disrupt the whole page-wise pipeline, as the sketch below illustrates:

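Here is a minimal sketch of that naive pattern, using hypothetical helper names (fetch_article, process_page) rather than the repository's actual code:

```python
# Naive pattern: one task per article URL, created eagerly, all sharing
# a single aiohttp.ClientSession opened once at the top level.
import asyncio
import aiohttp

async def fetch_article(session: aiohttp.ClientSession, url: str) -> str:
    # A server error here raises through the shared session's task group
    # and can disrupt every other in-flight task in the pipeline.
    async with session.get(url) as resp:
        resp.raise_for_status()
        return await resp.text()

async def process_page(session: aiohttp.ClientSession, page_urls: list[str]) -> list[str]:
    # Tasks are created as soon as each URL is encountered; one failure
    # surfaces out of gather() and aborts the whole batch.
    tasks = [asyncio.create_task(fetch_article(session, u)) for u in page_urls]
    return await asyncio.gather(*tasks)

async def main(pages: list[list[str]]) -> None:
    async with aiohttp.ClientSession() as session:
        for page_urls in pages:
            await process_page(session, page_urls)
```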

Alternative Approach for Enhanced Clarity: Alternatively, you can wait for all tasks belonging to one logical stage (e.g., collecting all result pages) to finish before moving on to the next stage (collecting all article URLs to be parsed). In this scenario, each client session object is associated with a single group of tasks and isn't nested inside other tasks, which leads to a clearer process flow.
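
A minimal sketch of that staged alternative, again with hypothetical names (fetch, extract_article_urls) rather than the repository's actual code:

```python
import asyncio
import aiohttp

def extract_article_urls(html: str) -> list[str]:
    # Hypothetical stand-in for the real HTML parsing step.
    return []

async def fetch(session: aiohttp.ClientSession, url: str) -> str:
    async with session.get(url) as resp:
        resp.raise_for_status()
        return await resp.text()

async def main(page_urls: list[str]) -> list[str]:
    # Stage 1: fetch ALL result pages first.
    async with aiohttp.ClientSession() as session:
        pages = await asyncio.gather(*(fetch(session, u) for u in page_urls))

    article_urls: list[str] = []
    for html in pages:
        article_urls.extend(extract_article_urls(html))

    # Stage 2: only then fetch the article pages, with a session tied
    # to this single, homogeneous group of tasks.
    async with aiohttp.ClientSession() as session:
        return list(await asyncio.gather(*(fetch(session, u) for u in article_urls)))
```

The trade-off is lower peak concurrency between stages, but a failure stays contained within one group of tasks instead of rippling through nested ones.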

By the way,

  1. Without concurrency control via asyncio.Semaphore, you're very likely to get banned by PubMed.
  2. An asyncio.Lock object is used for safe session reopening (see the sketch after this list).
  3. Some pages genuinely do NOT have abstract content, so you might see something like "...doesn't have abstract-content selected!" in the terminal.
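
A minimal sketch of both safeguards together, assuming a hypothetical SessionManager wrapper and a made-up concurrency limit (this is not the repository's actual code):

```python
import asyncio
import aiohttp

MAX_CONCURRENT = 5  # assumed limit; tune it to stay polite to PubMed

class SessionManager:
    def __init__(self) -> None:
        self._session: aiohttp.ClientSession | None = None
        self._lock = asyncio.Lock()
        self._semaphore = asyncio.Semaphore(MAX_CONCURRENT)

    async def _get_session(self) -> aiohttp.ClientSession:
        # The lock ensures only one coroutine recreates a closed session;
        # the others then reuse the fresh one instead of racing.
        async with self._lock:
            if self._session is None or self._session.closed:
                self._session = aiohttp.ClientSession()
            return self._session

    async def fetch(self, url: str) -> str:
        async with self._semaphore:  # at most MAX_CONCURRENT requests in flight
            session = await self._get_session()
            async with session.get(url) as resp:
                resp.raise_for_status()
                return await resp.text()
```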

License

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.

Acknowledgments

This README was created with the assistance of OpenAI's ChatGPT (September 25 Version), a large language model.

