This repository contains the scraping code used to collect Hindi-language text data for the RESPIN project at SPIRE Lab, IISc Bangalore.
Each folder contains the Python scripts for a particular method, along with the corresponding text files:
- manual_collection_scraping : Code for extracting text from manually collected links, which are scraped using the scrape_text.py script.
- scrapy_collection_scraping : Code for extracting text from links collected with the Scrapy framework, which are then scraped using the scrape_text.py script.
- text_statistics_saving : Scripts for computing statistics on the collected text, such as the number of words, the number of sentences, and the frequency of each word.
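
The kind of statistics described for text_statistics_saving can be sketched as follows. This is only an illustration, not the code in that folder: the function name `text_statistics`, the two regular expressions, and the rules they encode (sentences end at a danda or at `.`, `?`, `!`; a word is a run of Devanagari letters and vowel signs) are assumptions made for this example.

```python
import re
from collections import Counter

# Assumption: a sentence ends at a danda (।), a double danda (॥), or . ? !
SENTENCE_END = re.compile(r"[\u0964\u0965.?!]+")

# Assumption: a word is a run of Devanagari letters and signs.
# The ranges skip the dandas (U+0964, U+0965), the Devanagari digits
# (U+0966 to U+096F), and the abbreviation sign (U+0970).
WORD = re.compile(r"[\u0900-\u0963\u0971-\u097F]+")


def text_statistics(text):
    """Return sentence count, word count, unique word count, and word frequencies."""
    # Split on sentence terminators and drop empty fragments.
    sentences = [s.strip() for s in SENTENCE_END.split(text) if s.strip()]
    words = WORD.findall(text)
    frequency = Counter(words)
    return {
        "num_sentences": len(sentences),
        "num_words": len(words),
        "num_unique_words": len(frequency),
        "word_frequency": frequency,
    }


if __name__ == "__main__":
    sample = "यह एक वाक्य है। यह दूसरा वाक्य है।"
    stats = text_statistics(sample)
    print(stats["num_sentences"])  # 2
    print(stats["num_words"])      # 8
    print(stats["word_frequency"].most_common(3))
```

The word pattern deliberately ignores Latin-script tokens and numerals, so mixed-script text would need a broader pattern.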