
Author's note: this project is no longer maintained.

The code and the libraries it depends on are quite old; this project is no longer recommended as a resource for learning web scraping.


Findtrip Documentation

Introduction

Findtrip is a flight-ticket spider built with Scrapy. It currently covers two major Chinese ticket websites: Qunar and Ctrip.

Installation

Clone the code to your machine by running the following in your home directory:

git clone https://github.com/fankcoder/findtrip.git

For the required environment, see ./requirements.txt

This program uses Selenium + PhantomJS to simulate browser behavior when fetching data. PhantomJS can be downloaded from the address below (Firefox also works, but pages will load much more slowly):

http://npm.taobao.org/dist/phantomjs

Data is stored in MongoDB, so MongoDB must be installed to run the spider. Download it here:

https://www.mongodb.org/downloads

If you are only testing and do not need MongoDB, comment out the corresponding lines in settings.py:

'''
ITEM_PIPELINES = {
    'findtrip.pipelines.MongoDBPipeline': 300,
}

MONGODB_HOST = 'localhost' # Change in prod
MONGODB_PORT = 27017 # Change in prod
MONGODB_DATABASE = "findtrip" # Change in prod
MONGODB_COLLECTION = "qua"
MONGODB_USERNAME = "" # Change in prod
MONGODB_PASSWORD = "" # Change in prod
'''
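With the MongoDB pipeline disabled, Scrapy can still persist results through its built-in feed export. A minimal settings.py sketch, assuming the legacy FEED_URI/FEED_FORMAT setting names used by the old Scrapy releases this project targets:

```python
# settings.py -- alternative to the MongoDB pipeline: let Scrapy's feed
# export dump scraped items to a JSON file (one file per spider via %(name)s).
# FEED_URI / FEED_FORMAT are the setting names of older Scrapy releases.
ITEM_PIPELINES = {}                  # MongoDBPipeline disabled
FEED_URI = 'output/%(name)s.json'
FEED_FORMAT = 'json'
```

The same effect is available per run without touching settings.py: scrapy crawl Qua -o qua.json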

Running

Run all of the following commands from the findtrip/ directory, at the same level as the scrapy.cfg file.

To crawl Qunar only, run in a terminal:

scrapy crawl Qua

To crawl Ctrip only, run in a terminal:

scrapy crawl Ctrip

To crawl Qunar and Ctrip at the same time, run in a terminal:

scrapy crawlall

Sample JSON data

Qunar

[{"airports": ["NAY", "Nanyuan", "XMN", "Xiamen"], "company": ["China", "KN5927(73S)"], "site": "Qua", "flight_time": ["4:00", "PM", "7:00", "PM"], "passtime": ["3h"], "price": ["\u00a5", "689"]},
{"airports": ["PEK", "Beijing", "RIZ", "Rizhao", "RIZ", "Rizhao", "XMN", "Xiamen"], "company": ["Shandong", "SC4678(738)", "Same", "Shandong", "SC4678(738)"], "site": "Qua", "flight_time": ["3:20", "PM", "4:50", "PM", "45m", "5:35", "PM", "8:05", "PM"], "passtime": ["1h30m", "2h30m"], "price": ["\u00a5", "712"]},...]

Ctrip

[{"flight_time": [["10:30", "20:50"], ["12:15", "22:20"]], "price": ["\u00a5", "580"], "airports": [["\u9ad8\u5d0e\u56fd\u9645\u673a\u573aT4", "\u5357\u4eac\u7984\u53e3\u56fd\u9645\u673a\u573aT2"], ["\u5357\u4eac\u7984\u53e3\u56fd\u9645\u673a\u573aT2", "\u9996\u90fd\u56fd\u9645\u673a\u573aT2"]], "company": ["\u4e1c\u65b9\u822a\u7a7a", "MU2891", "\u4e1c\u65b9\u822a\u7a7a", "MU728"], "site": "Ctrip"},
{"flight_time": [["11:05", "17:55"], ["12:50", "19:50"]], "price": ["\u00a5", "610"], "airports": [["\u9ad8\u5d0e\u56fd\u9645\u673a\u573aT4", "\u5408\u80a5\u65b0\u6865\u56fd\u9645\u673a\u573a"], ["\u5408\u80a5\u65b0\u6865\u56fd\u9645\u673a\u573a", "\u9996\u90fd\u56fd\u9645\u673a\u573aT2"]], "company": ["\u4e1c\u65b9\u822a\u7a7a", "MU5169", "\u4e1c\u65b9\u822a\u7a7a", "MU5171"], "site": "Ctrip"},...]
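To show the shape of this data, here is a small sketch that parses one record from the Qunar sample above with the standard json module. Note that every field arrives as a list of strings, so some post-processing is needed before the data is usable:

```python
import json

# One record copied from the Qunar sample output above.
record = json.loads(
    '{"airports": ["NAY", "Nanyuan", "XMN", "Xiamen"], '
    '"company": ["China", "KN5927(73S)"], "site": "Qua", '
    '"flight_time": ["4:00", "PM", "7:00", "PM"], '
    '"passtime": ["3h"], "price": ["\\u00a5", "689"]}'
)

# The currency symbol and the amount are split across two list entries,
# and departure/arrival airports alternate code, name, code, name.
price = int(record["price"][1])
departure, arrival = record["airports"][0], record["airports"][2]
print(departure, arrival, price)  # NAY XMN 689
```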

findtrip's People

Contributors

fankcoder

findtrip's Issues

Database question

Can the database be switched to MySQL? If so, which parts of the code need to change? Thanks for the guidance.
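Any DB-API 2.0 driver can replace the MongoDB pipeline by implementing the same three pipeline methods. A sketch of the idea, using the standard library's sqlite3 so the example is self-contained; for MySQL, swap in a driver such as pymysql and change connect() plus the ? placeholders to %s. SQLPipeline and the flights table are illustrative names, not part of the project:

```python
import sqlite3

class SQLPipeline:
    """Illustrative SQL replacement for findtrip.pipelines.MongoDBPipeline.

    sqlite3 is used here only to keep the sketch self-contained; a MySQL
    driver (e.g. pymysql) exposes the same DB-API cursor/execute calls.
    """

    def open_spider(self, spider):
        # For MySQL: pymysql.connect(host=..., user=..., db=...)
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS flights (site TEXT, price TEXT)"
        )

    def process_item(self, item, spider):
        # price arrives as [currency_symbol, amount]; store the amount.
        self.conn.execute(
            "INSERT INTO flights (site, price) VALUES (?, ?)",
            (item["site"], item["price"][1]),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()
```

To activate it, point settings.py at the new class instead of the MongoDB one: ITEM_PIPELINES = {'findtrip.pipelines.SQLPipeline': 300}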

What is this error?

Traceback (most recent call last):
  File "e:\program files (x86)\python\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "e:\program files (x86)\python\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "E:\Program Files (x86)\Python\Scripts\scrapy.exe\__main__.py", line 9, in <module>
  File "e:\program files (x86)\python\lib\site-packages\scrapy\cmdline.py", line 129, in execute
    cmds = _get_commands_dict(settings, inproject)
  File "e:\program files (x86)\python\lib\site-packages\scrapy\cmdline.py", line 51, in _get_commands_dict
    cmds.update(_get_commands_from_module(cmds_module, inproject))
  File "e:\program files (x86)\python\lib\site-packages\scrapy\cmdline.py", line 30, in _get_commands_from_module
    for cmd in _iter_command_classes(module):
  File "e:\program files (x86)\python\lib\site-packages\scrapy\cmdline.py", line 20, in _iter_command_classes
    for module in walk_modules(module_name):
  File "e:\program files (x86)\python\lib\site-packages\scrapy\utils\misc.py", line 71, in walk_modules
    submod = import_module(fullpath)
  File "e:\program files (x86)\python\lib\importlib\__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 994, in _gcd_import
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 665, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 674, in exec_module
  File "<frozen importlib._bootstrap_external>", line 781, in get_code
  File "<frozen importlib._bootstrap_external>", line 741, in source_to_code
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "E:\Cache\1\findtrip\commands\crawlall.py", line 34
    spider_loader = self.crawler_process.spider_loader
                ^
TabError: inconsistent use of tabs and spaces in indentation

TabError: inconsistent use of tabs and spaces in indentation

After running scrapy crawlall, the following error appears:

Traceback (most recent call last):
  File "/usr/local/bin/scrapy", line 11, in <module>
    sys.exit(execute())
  File "/usr/local/lib/python3.6/dist-packages/scrapy/cmdline.py", line 125, in execute
    cmds = _get_commands_dict(settings, inproject)
  File "/usr/local/lib/python3.6/dist-packages/scrapy/cmdline.py", line 56, in _get_commands_dict
    cmds.update(_get_commands_from_module(cmds_module, inproject))
  File "/usr/local/lib/python3.6/dist-packages/scrapy/cmdline.py", line 33, in _get_commands_from_module
    for cmd in _iter_command_classes(module):
  File "/usr/local/lib/python3.6/dist-packages/scrapy/cmdline.py", line 22, in _iter_command_classes
    for module in walk_modules(module_name):
  File "/usr/local/lib/python3.6/dist-packages/scrapy/utils/misc.py", line 73, in walk_modules
    submod = import_module(fullpath)
  File "/usr/lib/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 994, in _gcd_import
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 665, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 674, in exec_module
  File "<frozen importlib._bootstrap_external>", line 781, in get_code
  File "<frozen importlib._bootstrap_external>", line 741, in source_to_code
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/root/findtrip/findtrip/findtrip/commands/crawlall.py", line 34
    spider_loader = self.crawler_process.spider_loader
                ^
TabError: inconsistent use of tabs and spaces in indentation
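The error itself is easy to reproduce and fix: Python 3 rejects source files that mix tabs and spaces in the indentation of a single block, which is what happened in commands/crawlall.py. A small demonstration:

```python
# Reproduce the TabError: the first body line is indented with a tab, the
# second with spaces, so Python 3 cannot compare the indentation levels.
bad = "def f():\n\tx = 1\n        return x\n"

try:
    compile(bad, "crawlall.py", "exec")
except TabError as e:
    print(e)  # same message as in the traceback above

# The fix is to normalise the file to spaces, e.g. with expandtabs()
# (or your editor's "convert tabs to spaces"); it then compiles cleanly.
fixed = bad.expandtabs(8)
compile(fixed, "crawlall.py", "exec")
```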

Commit hygiene

You've even committed log files to the repo, and .pyc files as well.
