Code Monkey home page Code Monkey logo

t66y-spider's Introduction

t66y-spider

A crawler for images, torrents and videos on t66y.com

草榴社区爬虫,主要收集達蓋爾的旗幟内的图片,也可用于其他板块内的种子,视频或者图片。

用法

git clone https://github.com/Nymphet/t66y-spider.git

cd t66ySpider

# 爬達蓋爾的旗幟的帖子,保存帖子标题和图片地址

scrapy crawl DaGaiEr

# 爬新時代的我們的帖子,保存帖子标题和图片地址

scrapy crawl XinShiDai

# 收集亚洲无码原创区帖子内的图片地址和种子链接,也保存帖子标题和帖子url

scrapy crawl YaZhouWuMa

# 亚洲无码转帖交流区,保存图片地址,种子链接,帖子标题,帖子url

scrapy crawl YaZhouWuMaZhuanTie

# 亚洲有码原创区,同上

scrapy crawl YaZhouYouMa

# 亚洲有码转帖交流区,同上

scrapy crawl YaZhouYouMaZhuanTie

scrapy 支持搭配 shadowsocks, tor 等等各种 socks 代理和各种 http 代理,如需使用代理,可在配置文件中设置, 也可以直接用 proxychains-ng 一类工具。例如使用 proxychians-ng 时,在每个命令前加上 proxychians4 即可

pipeline 还没写,需要把输出保存到文件的话加一个 -o 参数,详情可参考 scrapy 文档,默认格式为 jsonlines。命令格式参考下面这条

proxychains4 scrapy crawl DaGaiEr -o DaGaiErDeQiZhi.jsonlines

数据格式

默认格式为 jsonline,python 可以用 ast.literal_eval 读取。各爬虫抓的数据不同,可以查看 items.py 内的定义。目前主要有:

DaGaiEr     : 't_title'         : string, 标题
              't_image_list'    : list  , 大图源地址

XinShiDai   : 't_title'         : string, 标题
              't_image_list'    : list  , 大图源地址
              
YaZhouWuMa  : 't_title'         : string, 标题
              't_url'           : string, url
              't_image_list'    : list  , 大图源地址(会自动把 imgchili 和 imagetwist 的缩略图转为原图)
              't_torrent_list'  : list  , 种子地址

YaZhouYouMa : 't_title'         : string, 标题
              't_url'           : string, url
              't_image_list'    : list  , 大图源地址(会自动把 imgchili 和 imagetwist 的缩略图转为原图)
              't_torrent_list'  : list  , 种子地址

YaZhouWuMaZhuanTie 和 YaZhouYouMaZhuanTie 分别是上面两个板块的子板,直接使用父板块的数据格式。

其它

草榴社区的种子都是传到 rmdown 的,这个爬虫只抽取种子地址,不自动下载种子。如需自动下载,可调用 rmdown.pl

图片自动下载默认没有开启,如需开启,在所有文件中把所有 t_image_list 替换为 image_urls, 在 settings.py 中加入两行

ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1} 
IMAGES_STORE = '/存储目录'

t66y-spider's People

Contributors

nymphet avatar xihajuan2010 avatar

Watchers

chanqi4444 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.