lorae / roundup

Web scraper which aggregates pre-print academic economics papers from 20+ sources; presents titles, abstracts, authors and hyperlinks on an online dashboard. Auto-updates daily.

Home Page: https://roundup.streamlit.app/

License: MIT License

Python 100.00%

economics streamlit macroeconomics microeconomics api-scraping html-scraping selenium streamlit-dashboard streamlit-webapp web-scraping

roundup's Issues

Improve `append_data_to_historic` exception handling

The function append_data_to_historic within the HistoricDataComparer class contains a try-except block. The except clause catches only FileNotFoundError, which can cause problems when the file is found but does not conform to the data standards assumed in the try block.

Add more graceful, generic error handling, plus a specific handler for the case where the file exists but does not conform to the data standards. Once this is complete, remove the print statements (e.g. "existing_df read", "column order mapped", etc).
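A minimal sketch of what this could look like, assuming the historic data lives in a CSV file. The column names in `EXPECTED_COLUMNS`, the `HistoricDataFormatError` exception, and the standalone function signature are hypothetical and only illustrate the split between "file missing" and "file present but malformed"; the project's actual schema and class layout may differ.

```python
import csv
from pathlib import Path

# Hypothetical column standard; the project's real schema may differ.
EXPECTED_COLUMNS = ["Title", "Author", "Link", "Abstract"]


class HistoricDataFormatError(Exception):
    """Raised when the historic file exists but its header is not the expected one."""


def append_data_to_historic(new_rows, historic_path):
    """Append new_rows (a list of dicts) to the CSV file at historic_path.

    A missing file is created with a header row. An existing file whose
    header differs from EXPECTED_COLUMNS raises HistoricDataFormatError.
    """
    path = Path(historic_path)
    try:
        with path.open(newline="", encoding="utf-8") as f:
            header = next(csv.reader(f), None)  # None when the file is empty
    except FileNotFoundError:
        header = None  # No historic file yet, so start a fresh one.
    else:
        if header is not None and header != EXPECTED_COLUMNS:
            raise HistoricDataFormatError(
                f"{path} has columns {header}, expected {EXPECTED_COLUMNS}"
            )

    with path.open("a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=EXPECTED_COLUMNS)
        if header is None:
            writer.writeheader()
        writer.writerows(new_rows)
```

Raising a dedicated exception lets the caller decide whether to abort the run or skip the source, and the progress print statements could then be replaced with `logging` calls at debug level.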

Create a Fed_Minneapolis.py web scraper

Create a Fed_KansasCity.py web scraper

Create IZA.py scraper

repair Fed_Boston.py

Fed_Boston.py is not able to navigate to 2024 publications. Adjust the web scraper accordingly. Explore API options.

rename roundup_scripts folder to src and historic folder to data

Introduce generic scraper parent class

Make the Streamlit app look nicer

Change the way entries display so that the hyperlinked title appears above the posted pubdate, estimated pubdate, authors, and abstract.

Repair ECB 'abstract' data collection

Current ECB 'Abstract' field contains keyword data, like this:

Ensure ECBScraper class gathers correct Abstract data field, and edit existing database for already collected entries.

Repair fed_kansas_city_scraper.py

Allow Selenium to run in GitHub Actions

I suspect it has to do with Selenium.

Define and implement uniform data structure for scrape results

Explicitly build all file paths into class level constants in data_comparer.py

implement an external database

introduce env setup to specify path for local run of runall script

Introduce generic logic to test whether scraper should also scrape prior year's entries

Should be based on whether the current month is January.

Current scrapers use a mix of logic, but have been migrating toward always scraping both the current and the prior year's entries. This may waste computing resources.
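The January rule could be captured in one shared helper, sketched below. The function name `years_to_scrape` and its optional `today` parameter are hypothetical; the parameter exists only so the behavior can be tested without depending on the real clock.

```python
from datetime import date


def years_to_scrape(today=None):
    """Return the publication years a scraper should visit.

    In January the prior year is included, since late-December papers may
    still be appearing; in every other month only the current year is used.
    """
    today = today or date.today()
    if today.month == 1:
        return [today.year, today.year - 1]
    return [today.year]
```

Each scraper could then loop over `years_to_scrape()` instead of carrying its own year logic.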

repair IMF.py

The XPath index for the abstract data is not working. Revisit this script and potentially use an API instead.

Investigate Streamlit handling of text strings with dollar signs

The following text appeared incorrectly on the Streamlit app:

"Two in five Americans have medical debt, nearly half of whom owe at least $2,500. Concerned by this burden, governments and private donors have undertaken large, high-profile efforts to relieve medical debt. We partnered with RIP Medical Debt to conduct two randomized experiments that relieved medical debt with a face value of $169 million for 83,401 people between 2018 and 2020. We track outcomes using credit reports, collections account data, and a multimodal survey. There are three sets of results. First, we find no impact of debt relief on credit access, utilization, and financial distress on average. Second, we estimate that debt relief causes a moderate but statistically significant reduction in payment of existing medical bills. Third, we find no effect of medical debt relief on mental health on average, with detrimental effects for some groups in pre-registered heterogeneity analysis."

Something is causing part of the text to render in italics unintentionally.
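One likely cause is that Streamlit's Markdown rendering treats text wrapped in dollar signs as a LaTeX expression, so everything between "$2,500" and "$169" would be typeset as math. A possible workaround, sketched below, is to backslash-escape dollar signs before the text reaches `st.markdown`. The helper name `escape_dollar_signs` is hypothetical, and whether this fully resolves the display issue in the app is an assumption that still needs checking.

```python
import re


def escape_dollar_signs(text: str) -> str:
    """Backslash-escape every dollar sign that is not already escaped."""
    # The negative lookbehind skips any "$" already preceded by a backslash,
    # so calling the function twice does not double-escape.
    return re.sub(r"(?<!\\)\$", r"\\$", text)
```

The abstract would then be passed through `escape_dollar_signs` at display time, leaving the stored data unchanged.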

Create IZA.py web scraper

Create a Fed_StLouis.py web scraper

Repair Fed_SanFrancisco.py

Script appears to not be collecting data. Issue may relate to Chromium driver.

Fix the way that BOE.py handles dates

It appears to be cutting the date string off, e.g. "Fri, 19 Jan 2" rather than "Fri, 19 Jan 2024".
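A truncation like this is consistent with slicing the raw string at a fixed character position, though that is only a guess about the cause. A sketch of a more robust approach, assuming the raw value starts with a date in the "Fri, 19 Jan 2024" form and may be followed by a time and zone, is to select the date by whitespace-separated tokens and let `datetime.strptime` validate it. The function name `parse_leading_date` is hypothetical.

```python
from datetime import datetime


def parse_leading_date(raw: str) -> str:
    """Parse a leading 'Fri, 19 Jan 2024' date and return it in ISO format.

    Raises ValueError when the leading tokens are not a complete date,
    which surfaces truncated strings instead of silently storing them.
    """
    # The first four tokens are the weekday, day, month, and year.
    date_part = " ".join(raw.split()[:4])
    return datetime.strptime(date_part, "%a, %d %b %Y").date().isoformat()
```

Note that `%a` and `%b` depend on the process locale, so this assumes English day and month abbreviations.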

repair fed_san_francisco_scraper.py

The scraper now appears to have trouble collecting author names from the website.

Create Jupyter web scraper module debugging tutorial

Add to a new directory called "docs" or something similar. Walk through the process of resolving a broken script and describe common issues.

It would be especially cool if the notebook could show code from specific scripts within the project without copy-pasting the code (e.g. dynamically pulling code from a different file). I am not sure if this concept has an already built-out method, but it would be nice, especially when producing call-outs to explain how the GenericScraper ABC relates to the concrete scraper classes.
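Python's standard library may already cover this: `inspect.getsource` returns the source text of an imported class, method, or function, so a notebook can show live code rather than a pasted copy. The sketch below wraps it in a hypothetical helper, `source_of`, and relies on the object having been loaded from a source file on disk.

```python
import inspect
import textwrap


def source_of(obj) -> str:
    """Return the current source code of a class, function, or method.

    textwrap.dedent removes the leading indentation that methods and
    nested definitions would otherwise carry.
    """
    return textwrap.dedent(inspect.getsource(obj))
```

In a notebook cell, something like `print(source_of(GenericScraper))` would show the parent class, and the same call on a concrete scraper's method would show how it implements the abstract interface. `IPython.display.Code(source_of(obj), language="python")` should add syntax highlighting.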

See if there is something wrong with Fed_Dallas.py

The scraper hasn't found a new working paper since Sept 2023.

create global list of scraper IDs to be used in compare.py, streamlit_app.py, and runall.py

Currently, there are three separate lists of scraper IDs, one each in compare.py, streamlit_app.py, and runall.py, which look like the following:

source_order = ['NBER', 'FED-BOARD', 'FED-BOARD-NOTES', 'FED-ATLANTA', 'FED-BOSTON', 'FED-CHICAGO', 'FED-CLEVELAND', 'FED-DALLAS', 'FED-KANSASCITY', 'FED-NEWYORK', 'FED-PHILADELPHIA', 'FED-RICHMOND', 'FED-SANFRANCISCO', 'FED-STLOUIS', 'BEA', 'BFI', 'BIS', 'BOE', 'ECB', 'IMF']

Integrating a new scraper module in the project involves updating all three lists, which is unintuitive and prone to producing bugs.

Resolving this issue would involve creating a global order of sources that can be called by all 3 scripts.
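One way to do this is a small shared module that the three scripts import. In the sketch below, the list contents come from the issue text, while the constant name `SOURCE_ORDER`, the helper `sort_by_source`, and the `"Source"` key on each entry are hypothetical.

```python
# Single source of truth for scraper IDs, in display order.
SOURCE_ORDER = (
    "NBER", "FED-BOARD", "FED-BOARD-NOTES", "FED-ATLANTA", "FED-BOSTON",
    "FED-CHICAGO", "FED-CLEVELAND", "FED-DALLAS", "FED-KANSASCITY",
    "FED-NEWYORK", "FED-PHILADELPHIA", "FED-RICHMOND", "FED-SANFRANCISCO",
    "FED-STLOUIS", "BEA", "BFI", "BIS", "BOE", "ECB", "IMF",
)


def sort_by_source(entries):
    """Sort entry dicts by the position of their 'Source' in SOURCE_ORDER.

    Entries with an unknown source are placed at the end; the sort is
    stable, so the original order within each source is preserved.
    """
    rank = {source: i for i, source in enumerate(SOURCE_ORDER)}
    return sorted(entries, key=lambda e: rank.get(e["Source"], len(SOURCE_ORDER)))
```

With this in place, adding a scraper means adding one ID to one tuple. A tuple rather than a list guards against one script accidentally mutating the shared order.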
