
PythonScrapyBasicSetup

Basic setup with random user agents and proxy addresses for Python Scrapy Framework.

Setup

1. Install Scrapy Framework
pip install Scrapy

Detailed installation guide

2. Install Beautiful Soup 4 for proxy middleware based on proxydocker lists
pip install beautifulsoup4

Detailed installation guide

3. Install Tor, Stem (controller library for Tor), and Privoxy (HTTP proxy server).
apt-get install tor python-stem privoxy

Hash a password with Tor:

tor --hash-password secretPassword

Then copy the hashed password and add it, together with the control port, to /etc/tor/torrc:

ControlPort 9051
HashedControlPassword 16:72C8ADB0E34F8DA1606BB154886604F708236C0D0A54557A07B00CAB73

Restart Tor:

sudo /etc/init.d/tor restart
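Once Tor is restarted, a fresh circuit can be requested through the control port configured above. A minimal stdlib-only sketch (the project itself relies on the Stem library, which wraps this protocol; the password and port are the values from /etc/tor/torrc):

```python
import socket

def renew_tor_identity(password="secretPassword", host="127.0.0.1", port=9051):
    """Ask Tor for a new circuit (NEWNYM) over its control port.

    Raw-socket sketch for illustration; in practice Stem's Controller
    handles authentication and signalling for you.
    """
    with socket.create_connection((host, port), timeout=10) as s:
        s.sendall(f'AUTHENTICATE "{password}"\r\n'.encode())
        if not s.recv(1024).startswith(b"250"):
            raise RuntimeError("Tor control-port authentication failed")
        s.sendall(b"SIGNAL NEWNYM\r\nQUIT\r\n")
```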

Enable Privoxy forwarding by adding the following line to /etc/privoxy/config:

forward-socks5 / localhost:9050 .

Restart Privoxy:

sudo /etc/init.d/privoxy restart

Both Tor and Privoxy should now be up and running (check with netstat -l). If you used a different password or control port, update settings.py accordingly.

If you get errors related to pyOpenSSL (check this issue), try downgrading the Twisted engine:

pip install Twisted==16.4.1

Usage

To see what it does, just run:

python run.py

The project contains three middleware classes in the middlewares directory. ProxyMiddleware downloads IP proxy addresses and randomly chooses one before each request is processed. TorMiddleware has a similar purpose, but relies on the Tor network. RandomUserAgentMiddleware downloads user agent strings, saves them into the 'USER_AGENT_LIST' settings list, and likewise selects one randomly before each request is processed. The middlewares are activated in the settings.py file. The project also contains two spiders for testing purposes, spiders/iptester.py and spiders/uatester.py. You can run them individually:

scrapy crawl UAtester
scrapy crawl IPtester
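The per-request user-agent rotation described above can be sketched as follows (a simplified, hypothetical version of the project's middleware; the real class populates its list from the downloaded user agent data):

```python
import random

class RandomUserAgentMiddleware:
    """Simplified sketch: pick a random User-Agent for every request."""

    def __init__(self, user_agent_list):
        self.user_agent_list = user_agent_list

    @classmethod
    def from_crawler(cls, crawler):
        # USER_AGENT_LIST is filled in by the middleware at startup
        return cls(crawler.settings.getlist("USER_AGENT_LIST"))

    def process_request(self, request, spider):
        # Returning None lets Scrapy continue processing the request
        if self.user_agent_list:
            request.headers["User-Agent"] = random.choice(self.user_agent_list)
```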

The run.py file is also a good example of how to include and run your spiders sequentially from one script.

If you have any questions or problems, feel free to create a new issue. Scrape responsibly!

Contributors

matejbasic

Issues

Connection refused, setup in Docker

I set up an image on Ubuntu following the instructions. When I run scrapy, it shows an error.

root@f70e0c962d44:/app/PythonScrapyBasicSetup# scrapy crawl UAtester
2019-05-23 15:05:22 [scrapy.utils.log] INFO: Scrapy 1.6.0 started (bot: PythonScrapyBasicSetup)
2019-05-23 15:05:22 [scrapy.utils.log] INFO: Versions: lxml 4.3.3.0, libxml2 2.9.9, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 19.2.0, Python 3.6.7 (default, Oct 22 2018, 11:32:17) - [GCC 8.2.0], pyOpenSSL 19.0.0 (OpenSSL 1.1.1b  26 Feb 2019), cryptography 2.6.1, Platform Linux-4.9.125-linuxkit-x86_64-with-Ubuntu-18.04-bionic
2019-05-23 15:05:22 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'PythonScrapyBasicSetup', 'CONCURRENT_REQUESTS': 32, 'COOKIES_ENABLED': False, 'DNS_TIMEOUT': 10, 'DOWNLOAD_TIMEOUT': 24, 'NEWSPIDER_MODULE': 'PythonScrapyBasicSetup.spiders', 'RETRY_HTTP_CODES': [500, 502, 503, 504], 'SPIDER_MODULES': ['PythonScrapyBasicSetup.spiders'], 'TELNETCONSOLE_ENABLED': False}
2019-05-23 15:05:22 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2019-05-23 15:05:22 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'PythonScrapyBasicSetup.middlewares.user_agent.RandomUserAgentMiddleware',
 'PythonScrapyBasicSetup.middlewares.proxy.TorProxyMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-05-23 15:05:22 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-05-23 15:05:22 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-05-23 15:05:22 [scrapy.core.engine] INFO: Spider opened
2019-05-23 15:05:22 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-05-23 15:05:22 [root] INFO: Using proxy: http://127.0.0.1:8118
2019-05-23 15:05:23 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://whatsmyuseragent.org/> (failed 1 times): Connection was refused by other side: 111: Connection refused.
2019-05-23 15:05:23 [root] INFO: Using proxy: http://127.0.0.1:8118
2019-05-23 15:05:23 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://whatsmyuseragent.org/> (failed 2 times): Connection was refused by other side: 111: Connection refused.
2019-05-23 15:05:23 [root] INFO: Using proxy: http://127.0.0.1:8118
2019-05-23 15:05:23 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://whatsmyuseragent.org/> (failed 3 times): Connection was refused by other side: 111: Connection refused.
2019-05-23 15:05:23 [scrapy.core.scraper] ERROR: Error downloading <GET http://whatsmyuseragent.org/>

And netstat -l shows:

Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 localhost:9050          0.0.0.0:*               LISTEN
tcp        0      0 localhost:9051          0.0.0.0:*               LISTEN
Active UNIX domain sockets (only servers)
Proto RefCnt Flags       Type       State         I-Node   Path
unix  2      [ ACC ]     STREAM     LISTENING     390389   /var/run/tor/socks
unix  2      [ ACC ]     STREAM     LISTENING     390392   /var/run/tor/control

Did I do anything wrong?

from spider.proxies import proxiespider

If I try to run 'run.py' I get:

File "run.py", line 8, in <module>
    from spiders.proxies import ProxieSpider
ImportError: No module named proxies

is this file missing or do I have to generate this somehow?

Source path fix for "data/user_agents.xml"

In the user_agent.py file (PythonScrapyBasicSetup/PythonScrapyBasicSetup/middlewares/user_agent.py), the source_path variable should point to "../data/user_agents.xml" instead of "data/user_agents.xml" to correctly reflect the tree structure. Small change but wanted to let you know.
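One way to make the path independent of the working directory entirely is to resolve it relative to the module file itself (a hypothetical helper for illustration, not the project's actual code):

```python
import os

def user_agents_path(module_file):
    """Build the path to data/user_agents.xml from the location of
    user_agent.py, one level up from the middlewares directory."""
    middlewares_dir = os.path.dirname(os.path.abspath(module_file))
    project_dir = os.path.dirname(middlewares_dir)
    return os.path.join(project_dir, "data", "user_agents.xml")
```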

'NoneType' object has no attribute 'tbody'

From a fresh Ubuntu install on my server, after following all the steps from the README file, I get the following error when I run:

python run.py

'NoneType' object has no attribute 'tbody'
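This kind of error typically comes from the parsing code assuming the scraped proxy-list page contains the expected table; when the site changes layout or the request fails, find() returns None and the subsequent .tbody access raises. A defensive guard might look like this (extract_proxy_rows is a hypothetical helper, not the project's actual function):

```python
def extract_proxy_rows(table):
    """Return the table's rows, or an empty list if the page did not
    contain the expected markup (e.g. the proxy site changed layout)."""
    if table is None or getattr(table, "tbody", None) is None:
        return []
    return table.tbody.find_all("tr")
```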
