SideCrawl

SideCrawl is a simple web spider, extensible via modules and written with Goliath (EventMachine/Ruby). It gives you jQuery-like selectors on the server (via Nokogiri) so you can parse a large number of pages asynchronously.

Prerequisites

You need to have rvm.

Setup Instructions

$ rvm install 2.0
$ bundle install

Getting Started

Create module

To define the rules for extracting page elements, you need to create a module. SideCrawl uses sitemaps for crawling, but you can easily override this. See the example below:

# encoding: utf-8

module Amazon

  module WebsiteSetting
    def init
      @name = "Amazon"
      @description = "Amazon.com"
      @website_url = 'http://www.amazon.com'
      @sources = %w{
        http://www.amazon.com/sitemap_vendor_videos_us.xml
      }
    end
  end

  module PageSetting
    attr_accessor :name, :description, :pictures, :price

    def parse
      @name = @html_doc.at_css('#aiv-content-title').text.strip rescue nil
      @description = @html_doc.at_css('.dv-simple-synopsis').text.strip rescue nil
      @pictures = @html_doc.at_css('.dp-img-bracket img')[:src] rescue nil
      @price = @html_doc.at_css('.dv-button-inner').text.strip.scan(/[0-9]+/).join('.').to_f rescue nil
    end
  end

end
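
The price rule in `parse` is the least obvious line of the module, so here is a small sketch that isolates the same expression so you can try it on sample strings. The helper name `price_from` and the sample inputs are assumptions for illustration only; they are not part of SideCrawl.

```ruby
# Same expression as the @price line in PageSetting#parse, wrapped in a
# helper for experimenting. `price_from` is a hypothetical name.
def price_from(text)
  # Collect every run of digits, join the runs with '.', then parse as a float.
  text.strip.scan(/[0-9]+/).join('.').to_f
end

price_from(' $3.99 ')        # => 3.99
price_from('Rent HD $14.99') # => 14.99
price_from('$1,299.99')      # => 1.299 (the thousands separator is misread)
price_from('Free')           # => 0.0   (no digits found)
```

Note that the trailing `rescue nil` in the module only yields `nil` when the selector matches nothing (`at_css` returns `nil`, so calling `.text` raises). Text that is present but contains no digits produces `0.0`, not `nil`.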

Output

You can change the output format of a page simply by editing the view (written in RABL).

object @page

attributes :name, :description, :pictures, :price
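
To make the effect of the `attributes` line concrete, the sketch below mimics it in plain Ruby: it copies only the listed attributes from a page object and serializes them as JSON. This is a stand-in, not RABL itself. The `Page` struct, the `render_page` helper and the sample values are assumptions, and whether the real output is wrapped in a root key depends on the RABL configuration.

```ruby
require 'json'

# Hypothetical stand-in for a crawled page; SideCrawl's real page class
# is not shown in this README.
Page = Struct.new(:name, :description, :pictures, :price, :html_doc)

# Copy only the whitelisted attributes, as `attributes` does in the view.
def render_page(page, attributes)
  selected = attributes.each_with_object({}) do |attr, out|
    out[attr] = page.public_send(attr)
  end
  selected.to_json
end

page = Page.new('Example', 'A synopsis', 'http://example.com/a.jpg', 3.99, '<html/>')
puts render_page(page, %i[name description pictures price])
```

Anything not listed, such as `html_doc` here, is left out of the output. Adding or removing a field in the view changes the response in the same way.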

Environment variables

You can specify environment variables in the file .env

Variable            Description
PORT                Listening port
SERVER_URL          Server URL
RECEIVER_URL        Receiver server URL
TIMEOUT             Timeout
CONCURRENCY_SOURCE  Concurrency source
CONCURRENCY_PAGE    Concurrency page
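
Foreman loads the `.env` file into the process environment, so the variables can be read through `ENV`. A minimal sketch of doing that with fallbacks follows. The `load_settings` helper and every default value are assumptions made for illustration; SideCrawl's own defaults and the unit of `TIMEOUT` are not documented here.

```ruby
# Read SideCrawl's variables from the environment (or any hash-like object).
# `load_settings` and all default values are hypothetical.
def load_settings(env = ENV)
  {
    port:               Integer(env.fetch('PORT', '9000')),
    server_url:         env['SERVER_URL'],
    receiver_url:       env['RECEIVER_URL'],
    timeout:            Integer(env.fetch('TIMEOUT', '10')),
    concurrency_source: Integer(env.fetch('CONCURRENCY_SOURCE', '1')),
    concurrency_page:   Integer(env.fetch('CONCURRENCY_PAGE', '1'))
  }
end

settings = load_settings('PORT' => '8080', 'CONCURRENCY_PAGE' => '20')
puts settings[:port]             # 8080
puts settings[:concurrency_page] # 20
```

Using `Integer(...)` rather than `to_i` makes a malformed value such as `PORT=abc` fail loudly at startup instead of silently becoming `0`.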

Run sidecrawl

SideCrawl uses foreman. You can specify the number of each process type to run (e.g. web=8). Check out the foreman documentation for details.

$ foreman start web=4

Sidecrawl Guide

Sidecrawl has an API to show the results.

Crawling a website

You can run a crawl via a rake task. See the example below:

$ rake crawl['amazon']
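
The bracket syntax is how rake passes arguments to a task. Here is a minimal sketch of a task that receives one, assuming the argument is named `site`. The argument name, the `REQUESTED_SITES` constant and the task body are hypothetical; SideCrawl's actual Rakefile is not shown in this README.

```ruby
require 'rake'
extend Rake::DSL # lets `task` be called outside a Rakefile

REQUESTED_SITES = [] # hypothetical: stands in for starting a real crawl

# `rake crawl[amazon]` binds "amazon" to args[:site].
task :crawl, [:site] do |_task, args|
  site = args[:site]
  raise ArgumentError, 'usage: rake crawl[site]' if site.nil? || site.empty?
  REQUESTED_SITES << site
end
```

In bash the quotes in `crawl['amazon']` are removed by the shell before rake sees them. In zsh, unquoted square brackets are treated as a glob pattern, so quote the whole argument instead: `rake 'crawl[amazon]'`.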

Performance: MRI, JRuby, Rubinius

SideCrawl isn't tied to a single Ruby runtime - it is able to run on MRI Ruby, JRuby and Rubinius today. Depending on which platform you are working with, you will see different performance characteristics.

Contributors

rgaidot
