lorae / roundup

Web scraper which aggregates pre-print academic economics papers from 20+ sources; presents titles, abstracts, authors and hyperlinks on an online dashboard. Auto-updates daily.

Home Page: https://roundup.streamlit.app/

License: MIT License

Language: Python (100.00%)
Topics: economics, streamlit, macroeconomics, microeconomics, api-scraping, html-scraping, selenium, streamlit-dashboard, streamlit-webapp, web-scraping

roundup's Introduction

The website is active!

View it here: https://roundup.streamlit.app/

About

The purpose of this project is to regularly track and present the most recent working papers in economics. ('Working papers', also known as 'pre-print' papers, present academic research that has not yet been peer-reviewed.) Run remotely via GitHub Actions once daily at 6:40 AM EST, the project scrapes data from working paper publishers, compares the newly collected data to a historic database of working papers, and presents only the most recent entries on the project dashboard. The dashboard may be of use to those interested in the most active areas of current economics research, such as economists, policy-oriented researchers, and students. As of April 2024, the project incorporates data from 21 different sources.

How it works

The primary purpose of this repository is to maintain and improve the project dashboard. A detailed summary of the web scraping, GitHub Actions, and Streamlit components of the project follows.

Web Scraping

The web scrapers in this project gather eight pieces of information on each economics working paper:

  • Title
  • Author(s)
  • Abstract: A one-paragraph summary of the paper.
  • Date published: As reported by the website of origin. If the paper was posted and re-posted, the most recent date of publication is used.
  • estPubDate: Our best estimate of when the paper was posted, defined as the first date the paper was encountered by our web scrapers. This may sometimes differ from the official publication date by a few days.
  • URL: URLs to working paper landing pages, rather than direct links to PDFs, are preferred. However, with some sources, this is not possible, and PDF URLs are provided instead.
  • Paper ID number: According to each website's own numbering system.
  • Source: Name of the website where the paper was published.
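
For illustration, a single scraped record with these eight fields might look like the sketch below. The column names here are hypothetical placeholders; the actual schema is defined by the scraper classes themselves.

    import pandas as pd

    # Hypothetical record layout for one working paper; field names are
    # illustrative, not necessarily those used in the repository.
    example = pd.DataFrame([{
        "Title": "An Example Working Paper",
        "Authors": "A. Economist; B. Researcher",
        "Abstract": "A one-paragraph summary of the paper...",
        "Date": "2024-04-01",        # publication date reported by the source
        "estPubDate": "2024-04-02",  # first date our scrapers saw the paper
        "Link": "https://example.org/papers/12345",
        "Number": "12345",           # the source's own paper ID
        "Source": "EXAMPLE",
    }])
    print(example.iloc[0])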

A variety of methods are used to access web-based data, including PDF rendering, direct requests to hidden APIs, use of Selenium to read JavaScript-rendered content, HTML parsing, and more. A high-level summary of the methods used to scrape each website can be found in the Data Sources section of this document.

This project follows an object-oriented design. Each website has a bespoke scraping method, fetch_data(), defined in a website-specific class located in src/scraper/sites. To impose order on the 20+ web scrapers involved in this project, each website-specific scraper class is a child of the GenericScraper abstract base class, defined in src/scraper/generic_scraper.py.
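
A minimal sketch of that structure, assuming only what is described above, is shown below. Names other than GenericScraper and fetch_data() are hypothetical, and the real base class almost certainly defines more shared behavior.

    from abc import ABC, abstractmethod
    import pandas as pd

    class GenericScraper(ABC):
        # Sketch of the abstract base class; the actual implementation in
        # src/scraper/generic_scraper.py may define additional shared behavior.
        source = "GENERIC"  # hypothetical identifier attribute

        @abstractmethod
        def fetch_data(self) -> pd.DataFrame:
            """Return a uniformly structured data frame of working-paper entries."""
            raise NotImplementedError

    class ExampleScraper(GenericScraper):
        # Hypothetical site-specific child class.
        source = "EXAMPLE"

        def fetch_data(self) -> pd.DataFrame:
            # A real scraper would request and parse the source website here.
            return pd.DataFrame([{"Title": "An Example Paper", "Source": self.source}])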

All 20+ websites can be scraped by running the run_scraper.py script, located in the root directory. It instantiates each scraper class sequentially and saves the collected data in uniformly structured Pandas data frames, recording whether each scrape succeeded or failed in streamlit/scraper_status.txt. The per-site data frames are then combined into a single data frame containing all working paper entries scraped in that run.
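
The orchestration might look roughly like the sketch below. This is an assumed structure, not the actual contents of run_scraper.py, and the status-file format shown is a guess.

    import pandas as pd

    def run_all(scraper_classes):
        # Assumed orchestration loop: instantiate each scraper, collect its
        # data frame, and note success or failure for the dashboard sidebar.
        frames, status_lines = [], []
        for cls in scraper_classes:
            try:
                frames.append(cls().fetch_data())
                status_lines.append(f"{cls.__name__}: success")
            except Exception as exc:
                status_lines.append(f"{cls.__name__}: failure ({exc})")
        with open("streamlit/scraper_status.txt", "w") as f:
            f.write("\n".join(status_lines))
        return pd.concat(frames, ignore_index=True)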

The combined data frame of scrape results is then passed to methods of the HistoricDataComparer class from src/scraper/data_comparer.py. These methods identify novel (versus previously encountered) papers by comparing the newly scraped paper identifiers against the stored identifiers. Each truly novel entry is assigned an estimated publication date equal to the day it was first identified and is appended to data/historic-wp-data.csv and data/historic-wp-ids.txt, the files that maintain the record of all data encountered so far.
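
The novelty check amounts to a set-membership comparison along the following lines. This is only a sketch: it assumes the ID file stores a Python set literal (as described for the local output files below) and uses a hypothetical identifier column named "Number".

    import ast
    from datetime import date
    import pandas as pd

    def find_novel(scraped: pd.DataFrame, ids_path: str = "data/historic-wp-ids.txt") -> pd.DataFrame:
        # Load previously encountered identifiers (assumed to be stored as a Python set literal).
        with open(ids_path) as f:
            known_ids = ast.literal_eval(f.read())
        # Keep only papers whose identifiers have not been seen before.
        novel = scraped[~scraped["Number"].isin(known_ids)].copy()
        # The estimated publication date is the first date the paper was seen.
        novel["estPubDate"] = date.today().isoformat()
        return novel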

GitHub Actions Automated Workflow

The web scraping activity in this repository is fully automated through a GitHub Actions workflow. When .github/workflows/main.yml is triggered at 6:40 AM Eastern Standard Time every day, it runs the project's main script, run_scraper.py, which cycles through each of the 20+ web scraping modules and then through the data comparison methods that identify and store newly encountered data. The workflow concludes by committing its changes to data/historic-wp-data.csv and data/historic-wp-ids.txt under the alias actions-user. These commits are given the message "run (DD/MM/YYYY)", with "DD/MM/YYYY" populated by the day, month, and year the action was initiated.

Streamlit Dashboard

The streamlit_app.py script produces the project website, a user-friendly aggregation of the most recent economics research. The app draws primarily on data/historic-wp-data.csv to populate itself with information. Streamlit refreshes its data display roughly every 5 minutes, so within about 5 minutes of the daily GitHub Actions commit, changes to data/historic-wp-data.csv should be reflected on the website. The app also uses information from streamlit/scraper_status.txt to display active and inactive scrapers in the sidebar.
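
A minimal sketch of this pattern, assuming the periodic refresh is implemented with Streamlit's cache expiry (the actual app is more elaborate):

    import pandas as pd
    import streamlit as st

    @st.cache_data(ttl=300)  # re-read the CSV at most every ~5 minutes
    def load_data() -> pd.DataFrame:
        return pd.read_csv("data/historic-wp-data.csv")

    df = load_data()
    st.title("Recent economics working papers")
    # Show the newest entries first, using the estimated publication date.
    st.dataframe(df.sort_values("estPubDate", ascending=False))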

The GitHub Actions workflow starts at 6:40 AM EST and takes roughly 4-6 minutes to run. Allowing an additional 5 minutes for Streamlit to refresh, all newly scraped information should be reflected on the project dashboard by 7:00 AM EST daily.

Data Sources

Websites that are scraped for data, as of April 2024, are:

Name of website | Name of script | Scraping method
Bureau of Economic Analysis | src/scraper/sites/bea_scraper.py | Sends a GET request to the source's main page and parses the response using BeautifulSoup. A secondary GET request is made to each working paper's landing page and parsed using BeautifulSoup.
Becker Friedman Institute (at the University of Chicago) | src/scraper/sites/bfi_scraper.py | Sends a GET request to the source's main page and parses the response using BeautifulSoup. A secondary GET request is made to each working paper's landing page and parsed using BeautifulSoup.
Bank for International Settlements | src/scraper/sites/bis_scraper.py | Requests and parses the source's main RSS feed using feedparser.
Bank of England | src/scraper/sites/boe_scraper.py | Requests and parses the source's main RSS feed using feedparser. A secondary GET request is made to each working paper's landing page and parsed using BeautifulSoup.
European Central Bank | src/scraper/sites/ecb_scraper.py | Uses Selenium to access the source's main page and parses the output using BeautifulSoup.
Federal Reserve Bank of Atlanta | src/scraper/sites/fed_atlanta_scraper.py | Requests and parses the source's main RSS feed using feedparser.
Federal Reserve Board of Governors (of the United States): working papers | src/scraper/sites/fed_board_scraper.py | Sends a GET request to the source's main page and parses the response using BeautifulSoup.
Federal Reserve Board of Governors (of the United States): Fed Notes | src/scraper/sites/fed_board_notes_scraper.py | Sends GET requests to the source's main pages and parses the responses using BeautifulSoup.
Federal Reserve Bank of Boston | src/scraper/sites/fed_boston_scraper.py | Sends a GET request to the source's API and parses the JSON-formatted response. A secondary GET request is made to each working paper's landing page and parsed using BeautifulSoup.
Federal Reserve Bank of Chicago | src/scraper/sites/fed_chicago_scraper.py | Sends a GET request to the source's main page and parses the response using BeautifulSoup. A secondary GET request is made to each working paper's landing page and parsed using BeautifulSoup.
Federal Reserve Bank of Cleveland | src/scraper/sites/fed_cleveland_scraper.py | Uses Selenium to access the source's main page and parses the output using BeautifulSoup.
Federal Reserve Bank of Dallas | src/scraper/sites/fed_dallas_scraper.py | Sends a GET request to the source's main page and parses the response using BeautifulSoup. A secondary GET request retrieves each working paper's PDF, whose content is parsed using PyPDF2 and io.
Federal Reserve Bank of Kansas City | src/scraper/sites/fed_kansas_city_scraper.py | Sends a POST request to the source's API and parses the JSON response to get titles for each working paper entry. A secondary GET request is made to each working paper's landing page and parsed using BeautifulSoup.
Federal Reserve Bank of Minneapolis | src/scraper/sites/fed_minneapolis_scraper.py | Sends a GET request to the source's main page and parses the response using BeautifulSoup. A secondary GET request is made to each working paper's landing page and parsed using BeautifulSoup.
Federal Reserve Bank of New York | src/scraper/sites/fed_new_york_scraper.py | Sends a GET request to the source's main page and parses the response using BeautifulSoup; a similar GET request is also sent for the previous year's entries. A secondary GET request is made to each working paper's landing page and parsed using BeautifulSoup to extract abstracts.
Federal Reserve Bank of Philadelphia | src/scraper/sites/fed_philadelphia_scraper.py | Sends a GET request to the source's main page and parses the response using BeautifulSoup and then as JSON data. A secondary GET request is made to each working paper's landing page and parsed using BeautifulSoup.
Federal Reserve Bank of Richmond | src/scraper/sites/fed_richmond_scraper.py | Sends a GET request to the source's main page and parses the response using BeautifulSoup. A secondary GET request is made to each working paper's landing page and parsed using BeautifulSoup.
Federal Reserve Bank of San Francisco | src/scraper/sites/fed_san_francisco_scraper.py | Sends a GET request to the source's main page and parses the response using BeautifulSoup.
Federal Reserve Bank of St. Louis | src/scraper/sites/fed_st_louis_scraper.py | Sends a GET request to the source's main page and parses the response using BeautifulSoup.
International Monetary Fund | src/scraper/sites/imf_scraper.py | Sends a GET request to the source's main page and parses the response using BeautifulSoup. A secondary GET request is made to each working paper's landing page and parsed using BeautifulSoup.
National Bureau of Economic Research | src/scraper/sites/nber_scraper.py | Sends a GET request to the source's API and parses the JSON response. A secondary GET request is made to each working paper's landing page and parsed using BeautifulSoup.
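
Most of these scrapers follow the same two-step pattern: a GET request to the source's listing page, followed by a secondary GET request to each paper's landing page, both parsed with BeautifulSoup. A minimal sketch of that pattern is shown below; the URL and CSS selectors are placeholders, not those used by any actual scraper in the table.

    import requests
    from bs4 import BeautifulSoup

    LISTING_URL = "https://example.org/working-papers"  # placeholder URL

    def scrape_listing():
        # Step 1: fetch and parse the listing page.
        listing = BeautifulSoup(requests.get(LISTING_URL, timeout=30).text, "html.parser")
        papers = []
        for link in listing.select("a.paper-link"):  # placeholder selector
            landing_url = link["href"]
            # Step 2: fetch and parse each paper's landing page for details such as the abstract.
            landing = BeautifulSoup(requests.get(landing_url, timeout=30).text, "html.parser")
            papers.append({
                "Title": link.get_text(strip=True),
                "Abstract": landing.select_one("div.abstract").get_text(strip=True),  # placeholder selector
                "Link": landing_url,
            })
        return papers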

Getting Started

The web scrapers are run remotely at 6:40 AM EST daily via the project GitHub Actions workflow located in .github/workflows/main.yml. No additional action is required to initiate this process.

However, the web scrapers may also be run on your local machine. See below for instructions on running the project locally for the first time and on any subsequent run.

If running a local instance for the first time:

  1. Clone the repository:

    git clone https://github.com/lorae/roundup

  2. Set your working directory in the repository:

    cd your/path/to/roundup

  3. Create a virtual environment:

    python -m venv .venv

  4. Activate the virtual environment:

    • Using bash/zsh:

      source .venv/bin/activate

    • Using Windows PowerShell:

      .venv\Scripts\Activate.ps1

  5. Install dependencies:

    python -m pip install -r requirements.txt

  6. Run the script:

    python run_scraper.py

  7. View results:

    Results unique to your local run are stored in data/local_scrape_outcomes/. Three files are created with each run:

    • YYYY-MM-DD-HHMM-data.csv: Contains all novel working paper entries with metadata.

    • YYYY-MM-DD-HHMM-dashboard.html: Browser-viewable dashboard file with clickable links for each entry.

    • YYYY-MM-DD-HHMM-ids.txt: Unique identifiers of the novel entries, formatted as a Python set.

    Note that "YYYY-MM-DD-HHMM" will be populated with the day, hour and minute that you instantiated the HistoricDataComparer class, which contains the save_results method that produces these files.

    Results are also appended to data/historic-wp-data.csv and data/historic-wp-ids.txt with each run of run_scraper.py. However, these files are less convenient to view directly, as they contain every working paper ever encountered by the web scrapers.

    Note that the results in data/local_scrape_outcomes/ are local only and are ignored by Git (see .gitignore). The interactive project dashboard at https://roundup.streamlit.app will not reflect your local run unless you commit your changes to data/historic-wp-data.csv to the main branch of the repository.

If running a local instance again:

  1. Set your working directory in the repository:

    cd your/path/to/roundup

  2. Activate the virtual environment:

    • Using bash/zsh:

      source .venv/bin/activate

    • Using Windows PowerShell:

      .venv\Scripts\Activate.ps1

  3. Run the script:

    python run_scraper.py

  4. View results:

    As above.

Project Structure

The schematic below illustrates the basic file structure of the project.

roundup/
│
├── .gitignore
├── README.md # The file you are currently reading
├── run_scraper.py # Main project script
├── requirements.txt # Project dependencies
│
├── src/
│ ├── data_comparer.py # Defines `HistoricDataComparer` class for data comparison and saving
│ └── scraper/ # Web scraping modules
│   ├── generic_scraper.py # Defines `GenericScraper` class
│   ├── external_requests.py # Methods for accessing and parsing remote data
│   └── sites/ # Website-specific scraping modules
│     ├── bea_scraper.py
│     ├── bfi_scraper.py
│     ├── bis_scraper.py
│     ├── boe_scraper.py
│     ├── ecb_scraper.py
│     └── ...
│
├── data/
│ ├── historic-wp-ids.txt # Set of unique IDs for previously encountered data
│ ├── historic-wp-data.csv # Table of previously encountered data
│ └── local_scrape_outcomes/ # Holds results of local scrapes
│   ├── README.md
│   ├── YYYY-MM-DD-HHMM-dashboard.html
│   ├── YYYY-MM-DD-HHMM-data.csv
│   ├── YYYY-MM-DD-HHMM-ids.txt
│   └── ...
│
├──.github/workflows/
│ └── main.yml # Runs `Daily Run` GitHub Actions workflow
│
└── streamlit
  ├── app.py # Creates Streamlit website
  └── scraper_status.txt # Tracks active and inactive web scrapers

Web Scraping Disclaimer

This code accesses remote data via network requests. The scripts are programmed to make minimal and spaced-out requests to avoid putting undue load on the servers of the data sources. Our goal is to collect data responsibly without disrupting the services of the websites we access. The collected data is intended for academic and research purposes only.

All efforts have been made to ensure that the scraping activities comply with the terms of service and /robots.txt directives of the websites from which data is being collected. However, terms of service can change, and it is the sole responsibility of users of this code to ensure ongoing compliance. The developers of this project accept no liability for any misuse of the data or any damages that might arise from the use of this code.

By using this code, you agree to abide by these principles and to regularly check the terms of service of the websites you are scraping to ensure compliance.

roundup's People

Contributors

actions-user, eliasdourian, liamstojanovic, lorae, lorae-brookings

Forkers

liamstojanovic

roundup's Issues

repair IMF.py

The abstract data XPath index is not working. Revisit this script and potentially use an API instead.

repair Fed_Boston.py

Fed_Boston.py is not able to navigate to 2024 publications. Adjust the web scraper accordingly. Explore API options.

Repair ECB 'abstract' data collection

Current ECB 'Abstract' field contains keyword data, like this:

[screenshot]

Ensure ECBScraper class gathers correct Abstract data field, and edit existing database for already collected entries.

create global list of scraper IDs to be used in compare.py, streamlit_app.py, and runall.py

Currently, there are three lists of scraper IDs, each contained in compare.py, streamlit_app.py, or runall.py, which look like the following:

source_order = ['NBER', 'FED-BOARD', 'FED-BOARD-NOTES', 'FED-ATLANTA', 'FED-BOSTON', 'FED-CHICAGO', 'FED-CLEVELAND', 'FED-DALLAS', 'FED-KANSASCITY', 'FED-NEWYORK', 'FED-PHILADELPHIA', 'FED-RICHMOND', 'FED-SANFRANCISCO', 'FED-STLOUIS', 'BEA', 'BFI', 'BIS', 'BOE', 'ECB', 'IMF']

Integrating a new scraper module in the project involves updating all three lists, which is unintuitive and prone to producing bugs.

Resolving this issue would involve creating a global order of sources that can be called by all 3 scripts.

Investigate StreamLit handling of text strings with dollar signs

The following text appeared incorrectly on the StreamLit app:

"Two in five Americans have medical debt, nearly half of whom owe at least $2,500. Concerned by this burden, governments and private donors have undertaken large, high-profile efforts to relieve medical debt. We partnered with RIP Medical Debt to conduct two randomized experiments that relieved medical debt with a face value of $169 million for 83,401 people between 2018 and 2020. We track outcomes using credit reports, collections account data, and a multimodal survey. There are three sets of results. First, we find no impact of debt relief on credit access, utilization, and financial distress on average. Second, we estimate that debt relief causes a moderate but statistically significant reduction in payment of existing medical bills. Third, we find no effect of medical debt relief on mental health on average, with detrimental effects for some groups in pre-registered heterogeneity analysis."

[screenshot]

Something is causing the text to render in italics unintentionally.
