ajax_crawler

A flexible web crawler built on Scrapy that can fetch most Ajax-driven pages as well as many other types of web pages.

Easy to use: to set up a new crawler, you only need to write a config file and run it.

Usage

  • Edit A Config File In The 'configs' Directory
cd configs
touch xxx.cfg
vim xxx.cfg

The config file should look like this:

[xxx] # crawler name; should be the same as the config file name.
allowed_domains = dianping.com # domain name; can be a list.
start_urls = http://www.dianping.com/search/category/2/45/g152 # start URL; should be a specific URL.
list_url_pattern = .*category/2/45/g152[p\d]* # list URL pattern; you can use regular expressions here.
list_restrict_xpaths = '<<//div[@class="page"]//a/@href>>' # list restrict xpaths; used to find item URLs.
list_content = list,item # the kinds of content that can be found within the list restrict xpaths.
item_url_pattern = .*shop/\d+ # item URL pattern; you can use regular expressions here.
item_restrict_xpaths = <<//div[@class="tit"]>> # item restrict xpaths; used to find item contents.
item_content = name,address,region,intro,phone_num,cover_image,hours,sport # the field names that can be found within the item_restrict_xpaths.
#item_incremental = yes # whether this crawler should be incremental (uses a cache)
item_name_xpaths = <<//h1[@class="shop-title"]/text()>> # each field's content is located by its item field xpaths
item_address_xpaths = <<//span[@itemprop="street-address"]/text()>>
item_region_xpaths = <<//span[@class="region"]/text()>>
item_phone_num_xpaths = <<//span[@itemprop="tel"]/text()>>
item_cover_image_xpaths = <<//img[@itemprop="photo"]/@src>>
item_hours_xpaths = <<//div[@class="desc-info"]//ul/li/span[@class="J_full-cont"]/text()>>
item_sport_xpaths = "羽毛球" # can also be a fixed string
download_delay = 5 # download delay, to reduce crawling frequency
#js_parser = on # whether to use WebKit to parse JS and re-render web pages
  • Then Just Run The Crawler
scrapy crawl xxx
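
To make the config format above more concrete, here is a minimal Python sketch of how such a file could be read. It is not taken from ajax_crawler's source: the function names `load_crawler_config` and `classify_url`, the returned dictionary layout, and the choices to treat `<<...>>` as an XPath delimiter, to strip `#` inline comments with `configparser`, and to test URL patterns with `re.match` are all assumptions made for illustration.

```python
import configparser
import re

# A shortened copy of the sample config, embedded so the sketch is self-contained.
SAMPLE_CFG = r'''
[xxx] # crawler name
allowed_domains = dianping.com # domain name
list_url_pattern = .*category/2/45/g152[p\d]* # list URL pattern
item_url_pattern = .*shop/\d+ # item URL pattern
item_content = name,address,sport
item_name_xpaths = <<//h1[@class="shop-title"]/text()>> # field xpath
item_address_xpaths = <<//span[@itemprop="street-address"]/text()>>
item_sport_xpaths = "羽毛球" # fixed string
download_delay = 5 # download delay
'''

# Assumption: an XPath value is whatever sits between << and >>.
XPATH_RE = re.compile(r"<<(.*?)>>")


def load_crawler_config(text, name):
    """Hypothetical helper: parse one crawler section into a plain dict."""
    parser = configparser.ConfigParser(
        inline_comment_prefixes=("#",),  # drop trailing "# ..." comments
        interpolation=None,              # keep any "%" in values literal
    )
    parser.read_string(text)
    section = parser[name]

    fields = {}
    for field in section.get("item_content", "").split(","):
        field = field.strip()
        if not field:
            continue
        raw = section.get(f"item_{field}_xpaths", "")
        match = XPATH_RE.search(raw)
        if match:
            fields[field] = ("xpath", match.group(1))
        else:
            # No << >> delimiters: treat the value as a fixed string.
            fields[field] = ("literal", raw.strip("\"'"))

    return {
        "allowed_domains": [d.strip() for d in section["allowed_domains"].split(",")],
        "list_url_pattern": re.compile(section["list_url_pattern"]),
        "item_url_pattern": re.compile(section["item_url_pattern"]),
        "download_delay": section.getfloat("download_delay", fallback=0.0),
        "fields": fields,
    }


def classify_url(cfg, url):
    """Hypothetical helper: label a URL as 'list', 'item', or None."""
    if cfg["item_url_pattern"].match(url):
        return "item"
    if cfg["list_url_pattern"].match(url):
        return "list"
    return None


cfg = load_crawler_config(SAMPLE_CFG, "xxx")
print(cfg["fields"]["name"])
print(classify_url(cfg, "http://www.dianping.com/shop/12345"))
```

Keeping each field as a `("xpath", ...)` or `("literal", ...)` pair mirrors the config's two kinds of value: an XPath to evaluate on the item page, or a fixed string such as the `item_sport_xpaths` entry.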

Contributors

  • heartfly