matejbasic / pythonscrapybasicsetup
Basic setup with random user agents and IP addresses for Python Scrapy Framework.
License: MIT License
I set up an image on Ubuntu following the instructions. When I run Scrapy, it shows an error:
root@f70e0c962d44:/app/PythonScrapyBasicSetup# scrapy crawl UAtester
2019-05-23 15:05:22 [scrapy.utils.log] INFO: Scrapy 1.6.0 started (bot: PythonScrapyBasicSetup)
2019-05-23 15:05:22 [scrapy.utils.log] INFO: Versions: lxml 4.3.3.0, libxml2 2.9.9, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 19.2.0, Python 3.6.7 (default, Oct 22 2018, 11:32:17) - [GCC 8.2.0], pyOpenSSL 19.0.0 (OpenSSL 1.1.1b 26 Feb 2019), cryptography 2.6.1, Platform Linux-4.9.125-linuxkit-x86_64-with-Ubuntu-18.04-bionic
2019-05-23 15:05:22 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'PythonScrapyBasicSetup', 'CONCURRENT_REQUESTS': 32, 'COOKIES_ENABLED': False, 'DNS_TIMEOUT': 10, 'DOWNLOAD_TIMEOUT': 24, 'NEWSPIDER_MODULE': 'PythonScrapyBasicSetup.spiders', 'RETRY_HTTP_CODES': [500, 502, 503, 504], 'SPIDER_MODULES': ['PythonScrapyBasicSetup.spiders'], 'TELNETCONSOLE_ENABLED': False}
2019-05-23 15:05:22 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
2019-05-23 15:05:22 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'PythonScrapyBasicSetup.middlewares.user_agent.RandomUserAgentMiddleware',
'PythonScrapyBasicSetup.middlewares.proxy.TorProxyMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-05-23 15:05:22 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-05-23 15:05:22 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-05-23 15:05:22 [scrapy.core.engine] INFO: Spider opened
2019-05-23 15:05:22 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-05-23 15:05:22 [root] INFO: Using proxy: http://127.0.0.1:8118
2019-05-23 15:05:23 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://whatsmyuseragent.org/> (failed 1 times): Connection was refused by other side: 111: Connection refused.
2019-05-23 15:05:23 [root] INFO: Using proxy: http://127.0.0.1:8118
2019-05-23 15:05:23 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://whatsmyuseragent.org/> (failed 2 times): Connection was refused by other side: 111: Connection refused.
2019-05-23 15:05:23 [root] INFO: Using proxy: http://127.0.0.1:8118
2019-05-23 15:05:23 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://whatsmyuseragent.org/> (failed 3 times): Connection was refused by other side: 111: Connection refused.
2019-05-23 15:05:23 [scrapy.core.scraper] ERROR: Error downloading <GET http://whatsmyuseragent.org/>
And netstat -l shows:
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 0 localhost:9050 0.0.0.0:* LISTEN
tcp 0 0 localhost:9051 0.0.0.0:* LISTEN
Active UNIX domain sockets (only servers)
Proto RefCnt Flags Type State I-Node Path
unix 2 [ ACC ] STREAM LISTENING 390389 /var/run/tor/socks
unix 2 [ ACC ] STREAM LISTENING 390392 /var/run/tor/control
Did I do anything wrong?
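The log reports that the proxy at http://127.0.0.1:8118 refused the connection, while the listing above shows only ports 9050 and 9051 in the LISTEN state, so nothing appears to be listening on 8118. Port 8118 is the default port of Privoxy, an HTTP proxy often placed in front of Tor's SOCKS port; whether this setup expects Privoxy specifically is an assumption. A minimal sketch for checking which local ports accept connections (the helper name is_port_open is hypothetical, not part of the repository):

```python
import socket


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    # Ports taken from the log and the listing above.
    checks = [("HTTP proxy used by the middleware", 8118),
              ("Tor SOCKS", 9050),
              ("Tor control", 9051)]
    for label, port in checks:
        state = "open" if is_port_open("127.0.0.1", port) else "closed"
        print("%s (port %d): %s" % (label, port, state))
```

If 8118 is reported as closed while 9050 is open, the HTTP proxy that forwards to Tor is probably not installed or not running inside the container.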
On a fresh Ubuntu install on my server, after following all the steps in the README file, I get the following error when I run
python run.py
'NoneType' object has no attribute 'tbody'
In the user_agent.py file (PythonScrapyBasicSetup/PythonScrapyBasicSetup/middlewares/user_agent.py), the source_path variable should point to "../data/user_agents.xml" instead of "data/user_agents.xml" to correctly reflect the tree structure. Small change but wanted to let you know.
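One way to make a lookup like this independent of the working directory is to build the path from the module's own location rather than from a relative string. A minimal sketch, assuming the data directory sits next to the middlewares directory as the issue describes (the function name data_file_path is hypothetical and not taken from the repository):

```python
from pathlib import Path


def data_file_path(module_file, filename="user_agents.xml"):
    """Return <package>/data/<filename> for a module in <package>/middlewares/."""
    # parent is the middlewares directory, parent.parent is the package directory
    package_dir = Path(module_file).resolve().parent.parent
    return package_dir / "data" / filename


if __name__ == "__main__":
    # Inside user_agent.py this would be called as data_file_path(__file__).
    print(data_file_path(__file__))
```

With this approach the file is found whether the crawl is started from the project root or from another directory.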
If I try to run 'run.py', I get:
File "run.py", line 8, in <module>
from spiders.proxies import ProxieSpider
ImportError: No module named proxies
Is this file missing, or do I have to generate it somehow?
It would be great if we could specify which Tor exit nodes to use. Can we add this use case?
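Tor itself supports this through the ExitNodes and StrictNodes options in its torrc configuration file, independently of this project. A minimal sketch that generates those lines from a list of country codes (the helper exit_node_torrc_lines is hypothetical and not part of the repository; the lines still have to be added to torrc and Tor reloaded):

```python
def exit_node_torrc_lines(country_codes, strict=True):
    """Build torrc lines restricting Tor exit nodes to the given countries."""
    if not country_codes:
        raise ValueError("at least one country code is required")
    # Tor expects two-letter country codes wrapped in braces, e.g. {us}
    nodes = ",".join("{%s}" % code.lower() for code in country_codes)
    lines = ["ExitNodes " + nodes]
    if strict:
        # StrictNodes 1 tells Tor not to fall back to other exit nodes
        lines.append("StrictNodes 1")
    return lines


if __name__ == "__main__":
    for line in exit_node_torrc_lines(["us", "de"]):
        print(line)
```

Restricting exits to a small set of countries reduces the pool of available exit IP addresses, which works against the goal of rotating addresses, so the list is best kept broad.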