
tumblrspider's Introduction

TumblrSpider

A Python crawler written with Scrapy that scrapes the photos and videos posted by Tumblr users and downloads them to local disk.

Project Structure

  • Spider: tbr.py

    1. Uses one of Tumblr's endpoints, https://username.tumblr.com/api/read/json?start=0&num=200, to fetch the user's posts.
    2. Extracts the video or photo URLs from each post.
    3. If a post is a reblog, the user it was reblogged from is added to the crawl; a maximum crawl depth can be configured.
  • Middleware: middlewares.py

    1. Sets up a proxy. For certain reasons Tumblr cannot be reached directly, so a circumvention tool is required. With ssr in global mode the spider can crawl without a proxy; in PAC mode a local proxy must be added.
    2. Alternatively, a foreign proxy IP can be added directly.
  • Items: items.py

    1. Three fields: file_url, file_path and file_type.
  • Download pipelines: pipelines.py

    1. Scrapy supports two ways of downloading files: FilesPipeline or requests.
    2. TumblrspiderPipeline is the pipeline built on FilesPipeline.
    3. MyFilesPipeline is the pipeline built on requests.
    4. Under the same network conditions the former is faster than the latter, so using the first pipeline is sufficient.
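The spider's first two steps, fetching the JSON endpoint and pulling media URLs out of each post, can be sketched roughly as follows. The old v1 endpoint wrapped its JSON in a JavaScript assignment (`var tumblr_api_read = ...;`), and the field names `photo-url-1280`, `video-source` and `reblogged-from-name` are recollections of that API used here only against a synthetic sample; the endpoint itself has long been shut down, so treat all of this as an illustrative assumption rather than the project's exact parsing code:

```python
import json
import re

# Synthetic sample of the old v1 response body (an assumption for illustration).
SAMPLE = '''var tumblr_api_read = {"posts": [
  {"type": "photo", "photo-url-1280": "https://example.com/a.jpg",
   "reblogged-from-name": "someoneelse"},
  {"type": "video", "video-source": "https://example.com/b.mp4"}
]};'''

def parse_api_read(body):
    """Strip the JavaScript wrapper, then collect media URLs and the
    usernames of reblog sources (candidates for deeper crawling)."""
    match = re.search(r"var tumblr_api_read = (.*);\s*$", body, re.S)
    data = json.loads(match.group(1))
    media, reblog_sources = [], []
    for post in data.get("posts", []):
        if post.get("type") == "photo":
            media.append(post.get("photo-url-1280"))
        elif post.get("type") == "video":
            media.append(post.get("video-source"))
        if "reblogged-from-name" in post:
            reblog_sources.append(post["reblogged-from-name"])
    return media, reblog_sources
```

In the real spider the `start` and `num` query parameters would be incremented to page through all posts, and each name in `reblog_sources` would seed a new request until `max_depth` is reached.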

Project Dependencies

  • scrapy
  • requests
  • ssr (or another circumvention tool)
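The local-proxy middleware described above can be sketched as a plain downloader middleware that routes every request through the local client's port. The class body is a minimal sketch, not the project's actual middlewares.py, and the address 127.0.0.1:1080 is an assumption (a common ssr local port); adjust it to whatever your own client listens on:

```python
# Assumed local proxy address -- check your ssr client's settings.
LOCAL_PROXY = "http://127.0.0.1:1080"

class LocalProxyMiddleware:
    """Minimal sketch of a Scrapy downloader middleware that forces
    every request through a local proxy."""

    def process_request(self, request, spider):
        # Scrapy's built-in HttpProxyMiddleware honours request.meta['proxy'].
        request.meta["proxy"] = LOCAL_PROXY
        return None  # let the request continue through the download chain
```

With ssr in global mode (or on a machine abroad) this middleware can simply be disabled in settings.py and requests will go out directly.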

Usage

  • Make sure your machine can reach https://www.tumblr.com/
  • In ./tumblrSpider/tumblrSpider/spiders/tbr.py, fill start_urls with a seed user's homepage URL. max_depth sets the maximum crawl depth.
  • From the ./tumblrSpider directory, run scrapy crawl tbr
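For reference, the requests-style download that MyFilesPipeline performs boils down to fetching the item's file_url and writing the bytes to its file_path. The helper below is a hypothetical, self-contained sketch using the standard library's urllib in place of requests; the function name is invented for illustration:

```python
import os
from urllib.request import urlopen

def download_file(file_url, file_path):
    """Hypothetical helper: fetch file_url and save it at file_path,
    roughly what a requests-based pipeline does per item."""
    parent = os.path.dirname(file_path)
    if parent:
        # create the target directory tree on first use
        os.makedirs(parent, exist_ok=True)
    with urlopen(file_url) as resp, open(file_path, "wb") as out:
        out.write(resp.read())
    return file_path
```

FilesPipeline is usually faster in practice because Scrapy downloads through its own asynchronous engine, whereas a requests-style pipeline blocks on each file; that matches the README's recommendation to prefer the first pipeline.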

tumblrspider's People

Contributors

ice-tong, pray3

tumblrspider's Issues

Problem when running on a VPS, hoping for an answer

2018-12-04 11:05:45 [scrapy] INFO: Scrapy 1.0.3 started (bot: tumblrSpider)
2018-12-04 11:05:45 [scrapy] INFO: Optional features available: ssl, http11, boto
2018-12-04 11:05:45 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tumblrSpider.spiders', 'SPIDER_MODULES': ['tumblrSpider.spiders'], 'BOT_NAME': 'tumblrSpider'}
2018-12-04 11:05:45 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2018-12-04 11:05:45 [boto] DEBUG: Retrieving credentials from metadata server.
2018-12-04 11:05:45 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, LocalProxySpiderMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2018-12-04 11:05:45 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2018-12-04 11:05:45 [scrapy] INFO: Enabled item pipelines: TumblrspiderPipeline
2018-12-04 11:05:45 [scrapy] INFO: Spider opened
2018-12-04 11:05:45 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-12-04 11:05:45 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
myusername
2018-12-04 11:05:46 [scrapy] ERROR: Error downloading <GET https://myusername.tumblr.com/api/read/json?start=0&num=200>
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/twisted/internet/endpoints.py", line 555, in connect
timeout=self._timeout, bindAddress=self._bindAddress)
File "/usr/lib/python2.7/dist-packages/twisted/internet/posixbase.py", line 482, in connectTCP
c = tcp.Connector(host, port, factory, timeout, bindAddress, self)
File "/usr/lib/python2.7/dist-packages/twisted/internet/tcp.py", line 1165, in __init__
if abstract.isIPv6Address(host):
File "/usr/lib/python2.7/dist-packages/twisted/internet/abstract.py", line 522, in isIPv6Address
if '%' in addr:
TypeError: argument of type 'NoneType' is not iterable
2018-12-04 11:05:46 [scrapy] INFO: Closing spider (finished)
2018-12-04 11:05:46 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 1,
'downloader/exception_type_count/exceptions.TypeError': 1,
'downloader/request_bytes': 513,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 12, 4, 11, 5, 46, 172847),
'log_count/DEBUG': 2,
'log_count/ERROR': 1,
'log_count/INFO': 7,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2018, 12, 4, 11, 5, 45, 960112)}
2018-12-04 11:05:46 [scrapy] INFO: Spider closed (finished)


The project contains three nested tumblrSpider directories; should the command be run in the deepest one? Should I use Python 2 or 3? And if it runs on a VPS abroad where no proxy is needed, how should that be configured? I only changed the start URL and set the depth to 2, and it fails as above (on Python 2).
It has been a long time since I used Python, and this is making my head spin 😵. Thanks.

Tumblr now reachable directly

Tumblr can now be reached directly; how do I disable the proxy?
