ETL-webscraper is an app that combines a web scraper, which scans the aboveboard distribution website and grabs its "recently added" release data, with a helper "cleaner" module that converts that data into the required format.
- Gather new releases data and save it to a .json file
- Clean, reformat and modify data
- Save cleaned data to an Excel Spreadsheet
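
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual cleaner: the field names (`artist`, `title`, `price`), the helpers `clean_release`, `load_releases` and `export_to_excel`, and any file paths passed to them are assumptions made for the example.

```python
import json
from pathlib import Path


def clean_release(raw: dict) -> dict:
    """Normalise one scraped record (field names are hypothetical)."""
    price_text = str(raw.get("price") or "0").strip().lstrip("£$€").strip()
    return {
        "artist": (raw.get("artist") or "").strip(),
        "title": (raw.get("title") or "").strip(),
        # Currency symbol removed above; an empty value falls back to 0.0
        "price": float(price_text or 0),
    }


def load_releases(json_path: str) -> list:
    """Read the scraper's .json output and clean every record."""
    records = json.loads(Path(json_path).read_text(encoding="utf-8"))
    return [clean_release(record) for record in records]


def export_to_excel(rows: list, xlsx_path: str) -> None:
    """Write cleaned rows to a spreadsheet (needs pandas and openpyxl)."""
    # Imported lazily so the cleaning helpers work without pandas installed
    import pandas as pd

    pd.DataFrame(rows).to_excel(xlsx_path, index=False)
```

Keeping the cleaning step as a plain function on dictionaries makes it easy to test without touching the website or writing a spreadsheet.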
ETL-webscraper uses a number of open source projects to work properly:
- [Node.js]
- Puppeteer
- Python3
- pandas
- NumPy
- openpyxl
You need a B2B account with aboveboarddist (KRD). You also need Microsoft Excel to open the output data file.
In ./CONFIG.py, set venv_path to the path of the venv you will use, and set the variable "discount" to your actual discount percentage as a string (e.g. 50% is "50").
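
As an illustration of the `discount` format, the sketch below shows how a percentage string could be turned into a price multiplier. Only the names `venv_path` and `discount` come from CONFIG.py; the placeholder path, the `discounted_price` helper and the rounding to two decimal places are assumptions, and the real cleaner may apply the discount differently.

```python
from decimal import Decimal, ROUND_HALF_UP

# Values as they might appear in CONFIG.py (the path is a placeholder).
venv_path = "/path/to/your/venv"
discount = "50"  # 50% discount, written as a string without the % sign


def discounted_price(list_price: str, discount_pct: str) -> Decimal:
    """Apply a percentage discount given as a string such as "50"."""
    multiplier = (Decimal(100) - Decimal(discount_pct)) / Decimal(100)
    # Round to whole pence/cents; the rounding rule is an assumption
    return (Decimal(list_price) * multiplier).quantize(
        Decimal("0.01"), rounding=ROUND_HALF_UP
    )
```

For example, `discounted_price("12.50", discount)` applies the 50% discount configured above.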
In ./modules/scraper, update the CONSTS.js file with your own username and password.
ETL-webscraper requires [Node.js] and Python3 to run.
Install Python3 dependencies...

    cd ETL-webscraper
    pip3 install -r requirements.txt

...and node dependencies...

    cd ./modules/scraper/
    npm i

...then return to the project root and run the application...

    cd ../..
    python3 scrape.py

The cleaned and transformed Excel file is exported to ./output/DATAFILE.xlsx.
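
For orientation, a run that scrapes first and cleans second could be orchestrated along these lines. This is a guess at the general shape, not the contents of scrape.py: the script names `index.js` and `modules/cleaner/clean.py`, the POSIX venv layout, and the helpers `build_steps` and `run_pipeline` are all hypothetical.

```python
import subprocess
from pathlib import Path


def build_steps(root: Path, venv_path: str) -> list[tuple[list[str], Path]]:
    """Return (command, working directory) pairs for each pipeline stage."""
    # Assumes a POSIX venv layout; on Windows the interpreter lives elsewhere
    python_bin = str(Path(venv_path) / "bin" / "python3")
    return [
        # Hypothetical scraper entry point, run from the scraper module
        (["node", "index.js"], root / "modules" / "scraper"),
        # Hypothetical cleaner script, run with the venv's interpreter
        ([python_bin, "modules/cleaner/clean.py"], root),
    ]


def run_pipeline(root: Path, venv_path: str) -> None:
    """Run each stage in order, stopping at the first failure."""
    for command, cwd in build_steps(root, venv_path):
        # check=True raises CalledProcessError on a non-zero exit status
        subprocess.run(command, cwd=cwd, check=True)
```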