scrapy-and-beautifulsoup-web-scraper-for-asp.net-webpages's Introduction

Web Scraper

Introduction

Scraping an ASPX form-based webpage is different from, and slightly more complex than, scraping the usual websites where you can generate a list of URLs to be scraped. These websites usually send state data in requests and responses in order to keep track of the client's UI state. The __VIEWSTATE field is passed around with each POST request that the browser makes to the server. The server then decodes and loads the client's UI state from this data, performs some processing, computes the new view state from the new values, and renders the resulting page with the new view state as a hidden field.
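To make the hidden-field mechanism concrete, here is a minimal sketch that pulls every hidden input out of a page. It uses the standard library's html.parser as a stand-in for BeautifulSoup so that it runs without extra installs. The sample HTML, its values, and the helper name hidden_fields are made up for illustration, and the presence of __EVENTVALIDATION on the target page is an assumption.

```python
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collect the name/value pair of every <input type="hidden"> tag."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        is_hidden = (attr.get("type") or "").lower() == "hidden"
        if is_hidden and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def hidden_fields(html):
    """Return a dict of hidden form fields found in the given HTML text."""
    parser = HiddenFieldParser()
    parser.feed(html)
    return parser.fields


# Made-up sample page; real view-state values are far longer.
SAMPLE = """
<form method="post" action="RPTApplicationSummary.aspx">
  <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="dDwtMTIz" />
  <input type="hidden" name="__EVENTVALIDATION" value="AbC123" />
  <input type="text" name="search" value="ignored" />
</form>
"""

state = hidden_fields(SAMPLE)
print(state["__VIEWSTATE"])
```

Whatever this returns for one response has to be sent back with the next POST, since the server treats it as the client's current UI state.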

In this project we'll be scraping data from this website - http://swachhbharaturban.gov.in/ihhl/RPTApplicationSummary.aspx

The goal is a CSV file in which the scraped data is saved under these headers, in this order:

State | District | ULB Name | Ward | No. of Applications Received | No. of Applications Not Verified | No. of Applications Verified | No. of Applications Approved | No. of Applications Approved having Aadhar No. | No. of Applications Rejected | No. of Applications Pullback | No. of Applications Closed | No. of Constructed Toilet Photo | No. of Commenced Toilet Photo | No. of Constructed Toilet Photo through Swachhalaya
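As a rough sketch of how that header row could be written, the snippet below uses Python's csv module. The function name write_report and the placeholder row are assumptions for illustration only. The column names are copied from the list above.

```python
import csv
import io

# Column order taken from the header list above.
HEADERS = [
    "State",
    "District",
    "ULB Name",
    "Ward",
    "No. of Applications Received",
    "No. of Applications Not Verified",
    "No. of Applications Verified",
    "No. of Applications Approved",
    "No. of Applications Approved having Aadhar No.",
    "No. of Applications Rejected",
    "No. of Applications Pullback",
    "No. of Applications Closed",
    "No. of Constructed Toilet Photo",
    "No. of Commenced Toilet Photo",
    "No. of Constructed Toilet Photo through Swachhalaya",
]


def write_report(stream, rows):
    """Write the header line followed by one CSV line per ward record."""
    # restval="" leaves a column empty when a record has no value for it.
    writer = csv.DictWriter(stream, fieldnames=HEADERS, restval="")
    writer.writeheader()
    writer.writerows(rows)


# An in-memory buffer keeps the example self-contained. For a real file,
# open it with open(path, "w", newline="") as the csv docs recommend.
buffer = io.StringIO()
placeholder_row = {
    "State": "Example State",
    "District": "Example District",
    "ULB Name": "Example ULB",
    "Ward": "1",
}
write_report(buffer, [placeholder_row])
lines = buffer.getvalue().splitlines()
print(lines[0])
```

Using DictWriter rather than a plain writer means each scraped record can be a dict keyed by column name, so the column order is fixed in one place.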

Background

Press F12 to open the browser's developer tools and switch to the Network tab.

Select a state from the list and you will see that a request to "RPTApplicationSummary.aspx" has been made. Clicking that request in the list shows its details, where you can see that your browser sent the state you selected along with the __VIEWSTATE data from the server's original response.

Selecting a district then sends another POST request to the server.

Finally, on selecting the ULB, you can see the wards under that particular ULB. That is the order of the data we are interested in scraping. Our spider has to simulate the user interaction of selecting State --> District --> ULB and submitting the form, and then scrape the resulting page.
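The POST body that the browser sends on each selection can be approximated as follows. This is only a sketch: __EVENTTARGET and __EVENTARGUMENT are the standard ASP.NET postback fields, but the dropdown name in STATE_FIELD, the option value "27", and the helper build_postback are hypothetical. The real field names should be read from the Network tab.

```python
from urllib.parse import urlencode

# Hypothetical control name; check the actual request in the Network tab.
STATE_FIELD = "ctl00$ContentPlaceHolder1$ddlState"


def build_postback(hidden, event_target, selections):
    """Combine server-issued hidden fields with the user's dropdown choices."""
    payload = dict(hidden)                    # __VIEWSTATE and friends, unchanged
    payload["__EVENTTARGET"] = event_target   # control that triggered the postback
    payload["__EVENTARGUMENT"] = ""
    payload.update(selections)                # e.g. the chosen state
    return payload


# Made-up view state standing in for the value scraped from the last response.
hidden = {"__VIEWSTATE": "dDwtMTIz"}
payload = build_postback(hidden, STATE_FIELD, {STATE_FIELD: "27"})
body = urlencode(payload)  # application/x-www-form-urlencoded request body
print(body)
```

The same helper would be called again for the district and ULB steps, each time with the hidden fields from the most recent response and with all selections made so far.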

Languages, Tools and Frameworks Employed

  • Python
    • Follow my tutorial here
  • Pip - package-management system used to install and manage software packages written in Python
    • Download get-pip.py to a folder on your computer.
    • Open a command prompt and navigate to the folder containing get-pip.py.
    • Run the following command:
    • $ python get-pip.py
    • Pip is now installed!
  • Scrapy framework - free and open-source web-crawling framework written in Python.
    • $ pip install Scrapy==1.3.3
  • BeautifulSoup from bs4 library
    • $ pip install beautifulsoup4

Methodology

Before creating the spider, here is a rough outline of the steps it will follow:

Fetch http://swachhbharaturban.gov.in/ihhl/RPTApplicationSummary.aspx

  • For each state found in the form's state list:
    • Issue a POST request to RPTApplicationSummary.aspx passing the selected state and the __VIEWSTATE value
    • For each district found in the resulting page:
      • Issue a POST request to RPTApplicationSummary.aspx passing the selected state, selected district and __VIEWSTATE value
      • For each ULB found in the resulting page:
        • Issue a POST request to RPTApplicationSummary.aspx passing the selected state, selected district, selected ULB and __VIEWSTATE value
        • Scrape the resulting page ward by ward, appending the data to a CSV file
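The traversal order above can be sketched in plain Python. This is not the Scrapy spider itself, which would express each step as a request with a callback. It is a sequential illustration in which fetch is a hypothetical callable that performs the GET or POST and returns the new view state, the options of the next dropdown, and any ward rows. The in-memory FAKE_SITE and its values are made up so the example runs offline.

```python
from collections import namedtuple

# What a parsed response is assumed to expose at each step.
Page = namedtuple("Page", "viewstate options rows")


def crawl(fetch):
    """Yield one record per ward, walking State -> District -> ULB."""
    start = fetch(None, ())  # initial GET, no view state yet
    for state in start.options:
        state_page = fetch(start.viewstate, (state,))
        for district in state_page.options:
            district_page = fetch(state_page.viewstate, (state, district))
            for ulb in district_page.options:
                ulb_page = fetch(district_page.viewstate, (state, district, ulb))
                for ward_row in ulb_page.rows:
                    yield (state, district, ulb) + tuple(ward_row)


# Made-up stand-in for the website, keyed by the selections made so far.
FAKE_SITE = {
    (): Page("vs0", ["S1"], []),
    ("S1",): Page("vs1", ["D1", "D2"], []),
    ("S1", "D1"): Page("vs2", ["U1"], []),
    ("S1", "D2"): Page("vs3", [], []),
    ("S1", "D1", "U1"): Page("vs4", [], [("Ward 1", "10"), ("Ward 2", "7")]),
}


def fake_fetch(viewstate, selections):
    """Ignore the view state and look the page up by its selections."""
    return FAKE_SITE[tuple(selections)]


records = list(crawl(fake_fetch))
for record in records:
    print(record)
```

Each call passes along the view state from the page just received, which mirrors the requirement that every POST carries the latest __VIEWSTATE value.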
