rotatinguseragentmiddleware's Introduction

RotatingUserAgentMiddleware

Downloader middleware for Scrapy that rotates user agent strings across requests.

Usage

  1. Create a Scrapy project with scrapy startproject <project_name>.
  2. Copy the content of rotatinguseragent.py (from this repo) into the middlewares.py file of your scraper project.
  3. In the settings.py file of your scraper project, add the following lines:
DOWNLOADER_MIDDLEWARES = {
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
    "<project_name>.middlewares.RotatingUserAgentsMiddleware": 500,
}
ROTATING_USER_AGENTS = ["user_agent_1", "user_agent_2", "user_agent_3", "user_agent_4"]
ROTATING_USER_AGENTS_SHUFFLE = False
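
To make the effect of these settings concrete, here is a minimal sketch of how a rotating downloader middleware could be written. It is not the code in rotatinguseragent.py: the class name RotatingUserAgentsSketch is hypothetical, it only handles the plain-list form of ROTATING_USER_AGENTS, and it assumes that ROTATING_USER_AGENTS_SHUFFLE means "shuffle the list once before rotating".

```python
import itertools
import random


class RotatingUserAgentsSketch:
    """Illustrative downloader middleware; NOT the code shipped in this repo."""

    def __init__(self, user_agents, shuffle=False):
        agents = list(user_agents)
        if not agents:
            raise ValueError("ROTATING_USER_AGENTS must contain at least one entry")
        if shuffle:
            # Assumption: shuffling happens once, before the rotation starts.
            random.shuffle(agents)
        # cycle() yields the agents in order and restarts after the last one.
        self._agents = itertools.cycle(agents)

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy builds middlewares through this hook and passes the crawler,
        # whose settings object exposes getlist() and getbool().
        settings = crawler.settings
        return cls(
            settings.getlist("ROTATING_USER_AGENTS"),
            settings.getbool("ROTATING_USER_AGENTS_SHUFFLE", False),
        )

    def process_request(self, request, spider=None):
        # Overwrite the header on every outgoing request with the next agent.
        request.headers["User-Agent"] = next(self._agents)
        return None  # None tells Scrapy to keep processing the request
```

Disabling the built-in UserAgentMiddleware in DOWNLOADER_MIDDLEWARES, as shown above, keeps it from setting a competing User-Agent header.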

Settings

The ROTATING_USER_AGENTS setting defines the user agents to rotate through. You can either specify a list of user agent strings directly or pass the path (as pathlib.Path or str) to a file that contains them. The file must either contain exactly one user agent per line or be a JSON file of the form [{"os": ..., "software": ..., "user_agent_string": ...}, ...].
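
The three accepted input forms can be illustrated with a small loader. This is a sketch under stated assumptions, not the repository's implementation: the function name load_user_agents is hypothetical, and telling the two file formats apart by attempting to parse the content as JSON is one possible strategy among several.

```python
import json
from pathlib import Path


def load_user_agents(source):
    """Return a list of user agent strings from a list or a file path."""
    # Form 1: the setting already is a list of user agent strings.
    if isinstance(source, (list, tuple)):
        return [str(agent) for agent in source]

    # Forms 2 and 3: the setting is a path (pathlib.Path or str) to a file.
    text = Path(source).read_text(encoding="utf-8")
    try:
        records = json.loads(text)
    except json.JSONDecodeError:
        records = None

    if isinstance(records, list):
        # JSON form: a list of objects with a "user_agent_string" key.
        return [record["user_agent_string"] for record in records]

    # Assumed fallback: plain text with one user agent per line.
    return [line.strip() for line in text.splitlines() if line.strip()]
```

Blank lines in the plain-text form are skipped here so that a trailing newline does not produce an empty user agent.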

Getting user agents

This repository also contains a simple crawler that collects a list of popular user agents from https://developers.whatismybrowser.com/useragents/explore/software_type_specific/web-browser/1 and the pages that follow it. You can filter the scraped user agents by passing the following arguments:

wanted_oss: A list of strings (passed as a single comma-separated string), at least one of which the OS column must contain. By default, all operating systems are included.

wanted_softwares: A list of strings (passed as a single comma-separated string), at least one of which the software column must contain. By default, all software is included.

pages: The number of pages to scrape for user agents.

Example: scrapy crawl wimb -O useragents_sample.json -a wanted_oss=ios,windows -a wanted_softwares=chrome,firefox -a pages=5
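
The filtering described by wanted_oss and wanted_softwares can be sketched as follows. This is an illustration, not the crawler's actual code. The helper names are hypothetical, and so are the case-insensitive substring matching and the row keys "os" and "software" (the latter borrowed from the JSON form in the Settings section).

```python
def parse_wanted(arg):
    """Split a comma-separated argument; None or '' means no filtering."""
    if not arg:
        return []
    return [part.strip().lower() for part in arg.split(",") if part.strip()]


def is_wanted(value, wanted):
    """True if no filter is set or any wanted term occurs in value."""
    if not wanted:
        return True
    value = value.lower()  # assumption: matching ignores case
    return any(term in value for term in wanted)


def filter_rows(rows, wanted_oss=None, wanted_softwares=None):
    """Keep rows whose OS and software columns both pass their filters."""
    oss = parse_wanted(wanted_oss)
    softwares = parse_wanted(wanted_softwares)
    return [
        row
        for row in rows
        if is_wanted(row["os"], oss) and is_wanted(row["software"], softwares)
    ]
```

With the arguments from the example command, a row passes only if its OS column contains "ios" or "windows" and its software column contains "chrome" or "firefox".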

Related projects

This repo was inspired by the following projects: https://github.com/scrapedia/scrapy-useragents https://github.com/rejoiceinhope/crawler-demo/tree/master/crawling-basic/scrapy_user_agents

rotatinguseragentmiddleware's People

Contributors

m0dd0

rotatinguseragentmiddleware's Issues

ebk module

ebk.middlewares.RotatingUserAgentsMiddleware

Hi, I was wondering what package this is. The module is missing, and I cannot find it anywhere.
