
Project: Top.gg, Custom ItemLoader

In this project we use Scrapy to scrape the top bots on Top.gg and build a dataset of all the bots listed on the site.

Example

https://top.gg/list/top?page=3

Targets

From each page we will scrape:

  • Bot name
  • Bot description
  • Bot image url
  • Number of servers
  • Number of votes
  • Its rank
  • URL for the bot's page on top.gg
  • Tags

From each bot's page (a sketch of the combined Item definition follows this list):

  • Bot's Website URL
  • Invite link
  • Support server
  • Creator
  • Long description
  • Prefix
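
Taken together, these fields map onto a single Scrapy Item. A minimal sketch of items.py is below; the class name TopGgItem is the one used later in the spider, but the individual field names are my own choices, not anything fixed by the site:

import scrapy

class TopGgItem(scrapy.Item):
    # listing-page fields
    name = scrapy.Field()
    description = scrapy.Field()
    image_url = scrapy.Field()
    servers = scrapy.Field()
    votes = scrapy.Field()
    rank = scrapy.Field()
    bot_url = scrapy.Field()
    tags = scrapy.Field()
    # bot-page fields
    website = scrapy.Field()
    invite_link = scrapy.Field()
    support_server = scrapy.Field()
    creator = scrapy.Field()
    long_description = scrapy.Field()
    prefix = scrapy.Field()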

Primary Concerns

This website uses Cloudflare DDoS protection. We must bypass this in order to scrape anything.

We are using cfscrape

https://github.com/Anorov/cloudflare-scrape

As of July 2020 this pull request works:

Anorov/cloudflare-scrape#373

pip install https://github.com/Sraq-Zit/cloudflare-scrape/archive/master.zip

Silly Mistakes

  • Forgetting the text argument when building a selector from raw HTML

    from scrapy.selector import Selector

    selector_obj = Selector(text=html)

Forgot to do:

  • Process the numerical fields into numbers (e.g. remove commas and cast to int); a processor sketch follows this list

  • Save the top.gg bot IDs
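
A minimal sketch of how the numeric cleanup could be done with ItemLoader processors. MapCompose, TakeFirst, and declaring processors in Field metadata are standard Scrapy; the to_int helper and the choice of fields are my own assumptions:

import scrapy
from scrapy.loader.processors import MapCompose, TakeFirst

def to_int(value):
    # '1,234' -> 1234; return None if the value is not numeric
    try:
        return int(value.replace(',', '').strip())
    except (ValueError, AttributeError):
        return None

class TopGgItem(scrapy.Item):
    # only the numeric fields are shown; the remaining fields stay plain Field()
    servers = scrapy.Field(input_processor=MapCompose(to_int),
                           output_processor=TakeFirst())
    votes = scrapy.Field(input_processor=MapCompose(to_int),
                         output_processor=TakeFirst())
    rank = scrapy.Field(input_processor=MapCompose(to_int),
                        output_processor=TakeFirst())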

Things Learned

https://stackoverflow.com/questions/54102498/load-item-fileds-with-itemloader-across-multiple-responses

I learned how to carry items over between requests using the request meta and a custom item loader class. Without this, our spider only yielded the items from the parse_summary method but not the items from the parse method.

Create loader.py in the top_gg folder

from scrapy.loader import ItemLoader as ScrapyItemLoader

class ItemLoader(ScrapyItemLoader):
    """Extended loader that supports re-binding to a new selector/response."""

    def reset(self, selector=None, response=None):
        # Point the loader at a new response (and/or selector) so that
        # add_css/add_xpath calls made in a later callback use the new page.
        if response is not None:
            if selector is None:
                selector = self.default_selector_class(response)
            self.selector = selector
            self.context.update(selector=selector, response=response)
        elif selector is not None:
            self.selector = selector
            self.context.update(selector=selector)

Import it into your spider and subclass it:

from ..loader import ItemLoader

class TheLoader(ItemLoader):
    pass

Create the loader object using the custom class, passing in the listing selector and the response:

loader = TheLoader(item=TopGgItem(), selector=listing, response=response)

Pass it along in the request meta:

            yield scrapy.Request(
                url=abs_url,
                cookies=token[0],
                headers={'User-Agent': token[1]},
                callback=self.parse_bot_page,
                meta={'loader': loader},
            )

And finally, reset the loader in the callback:

    def parse_bot_page(self, response):
        loader = response.meta['loader']
        # Rebind the ItemLoader to the new response.
        # Skipping the selector argument builds one from the response,
        # just like the ItemLoader constructor does:
        # loader.reset(selector=response.selector, response=response)
        loader.reset(response=response)
        # ... add the bot-page fields here, then yield the finished item
        yield loader.load_item()

Cloudflare scrape

Many websites are difficult to scrape because of Cloudflare's anti-bot challenge.

A package for solving these challenges is cloudflare-scrape:

https://github.com/Anorov/cloudflare-scrape

However, the master branch is outdated and did not work for this project. Our savior is a pull request by Sraq-Zit:

Anorov/cloudflare-scrape#373

To install the fixed cloudflare-scrape, install the package directly from his fork:

pip install https://github.com/Sraq-Zit/cloudflare-scrape/archive/master.zip

Now import cfscrape at the top of your spider module and follow this example:

    def start_requests(self):
        url = 'https://your-website-here.com'
        # Solve the Cloudflare challenge once; this returns the clearance
        # cookies and the user agent that was used to solve it.
        scraper = cfscrape.create_scraper()
        token = scraper.get_tokens(url)

        yield scrapy.Request(
            url=url,
            cookies=token[0],
            headers={'User-Agent': token[1]},
            callback=self.parse,
            meta={
                'currentPage': 1,
                'token': token,
            },
        )

get_tokens returns a tuple of two elements: a dictionary of cookies and the user agent that was used to solve the challenge. That same user agent must be sent with every subsequent request.

I also pass the token along in the request meta so the cookies and user agent carry over to subsequent requests.
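
For example, the parse callback can read the token back out of response.meta and reuse it when requesting the next listing page. This is only a sketch: the listing extraction is omitted and a real spider would also need a stopping condition:

    def parse(self, response):
        current_page = response.meta['currentPage']
        token = response.meta['token']

        # ... extract the bot listings from this page here ...

        # request the next listing page with the same cookies and user agent
        next_page = current_page + 1
        yield scrapy.Request(
            url=f'https://top.gg/list/top?page={next_page}',
            cookies=token[0],
            headers={'User-Agent': token[1]},
            callback=self.parse,
            meta={
                'currentPage': next_page,
                'token': token,
            },
        )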

Incorporating cloudflare scrape into middleware

https://github.com/clemfromspace/scrapy-cloudflare-middleware

Rather than calling cfscrape inside the spider, we can handle the Cloudflare challenge in a downloader middleware instead.
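
A settings sketch, assuming the middleware path documented in that repository's README; double-check the dotted path and priority against the version you install:

# settings.py
DOWNLOADER_MIDDLEWARES = {
    # path/priority as documented by scrapy-cloudflare-middleware
    'scrapy_cloudflare_middleware.middlewares.CloudFlareMiddleware': 560,
}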

Rotating proxies

https://github.com/TeamHG-Memex/scrapy-rotating-proxies
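
A settings sketch based on that project's README; the proxy list entries are placeholders, and the middleware names should be checked against the installed version:

# settings.py
ROTATING_PROXY_LIST = [
    'proxy1.example.com:8000',   # placeholder proxies
    'proxy2.example.com:8031',
]

DOWNLOADER_MIDDLEWARES = {
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}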

Custom proxy direct in spider

Add a proxy key to the meta of a request to make that request go through the proxy.

E.g.

        yield scrapy.Request(
            url=url,
            callback=self.parse,
            meta={
                'currentPage': current_page,
                # proxy URL used for this request only
                'proxy': 'http://192.178.1.1:8080',
            },
        )
