
Kimurai

UPD: I will soon have time to work on issues for the current 1.4 version, and I also plan to release a new 2.0 version with the https://github.com/twalpole/apparition engine.

Kimurai is a modern web scraping framework written in Ruby which works out of the box with Headless Chromium/Firefox, PhantomJS, or simple HTTP requests, and allows you to scrape and interact with JavaScript-rendered websites.

Kimurai is based on the well-known Capybara and Nokogiri gems, so you don't have to learn anything new. Let's see:

# github_spider.rb
require 'kimurai'

class GithubSpider < Kimurai::Base
  @name = "github_spider"
  @engine = :selenium_chrome
  @start_urls = ["https://github.com/search?q=Ruby%20Web%20Scraping"]
  @config = {
    user_agent: "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.36",
    before_request: { delay: 4..7 }
  }

  def parse(response, url:, data: {})
    response.xpath("//ul[@class='repo-list']/div//h3/a").each do |a|
      request_to :parse_repo_page, url: absolute_url(a[:href], base: url)
    end

    if next_page = response.at_xpath("//a[@class='next_page']")
      request_to :parse, url: absolute_url(next_page[:href], base: url)
    end
  end

  def parse_repo_page(response, url:, data: {})
    item = {}

    item[:owner] = response.xpath("//h1//a[@rel='author']").text
    item[:repo_name] = response.xpath("//h1/strong[@itemprop='name']/a").text
    item[:repo_url] = url
    item[:description] = response.xpath("//span[@itemprop='about']").text.squish
    item[:tags] = response.xpath("//div[@id='topics-list-container']/div/a").map { |a| a.text.squish }
    item[:watch_count] = response.xpath("//ul[@class='pagehead-actions']/li[contains(., 'Watch')]/a[2]").text.squish
    item[:star_count] = response.xpath("//ul[@class='pagehead-actions']/li[contains(., 'Star')]/a[2]").text.squish
    item[:fork_count] = response.xpath("//ul[@class='pagehead-actions']/li[contains(., 'Fork')]/a[2]").text.squish
    item[:last_commit] = response.xpath("//span[@itemprop='dateModified']/*").text

    save_to "results.json", item, format: :pretty_json
  end
end

GithubSpider.crawl!
Run: $ ruby github_spider.rb
I, [2018-08-22 13:08:03 +0400#15477] [M: 47377500980720]  INFO -- github_spider: Spider: started: github_spider
D, [2018-08-22 13:08:03 +0400#15477] [M: 47377500980720] DEBUG -- github_spider: BrowserBuilder (selenium_chrome): created browser instance
D, [2018-08-22 13:08:03 +0400#15477] [M: 47377500980720] DEBUG -- github_spider: BrowserBuilder (selenium_chrome): enabled `browser before_request delay`
D, [2018-08-22 13:08:03 +0400#15477] [M: 47377500980720] DEBUG -- github_spider: Browser: sleep 7 seconds before request...
D, [2018-08-22 13:08:10 +0400#15477] [M: 47377500980720] DEBUG -- github_spider: BrowserBuilder (selenium_chrome): enabled custom user-agent
D, [2018-08-22 13:08:10 +0400#15477] [M: 47377500980720] DEBUG -- github_spider: BrowserBuilder (selenium_chrome): enabled native headless_mode
I, [2018-08-22 13:08:10 +0400#15477] [M: 47377500980720]  INFO -- github_spider: Browser: started get request to: https://github.com/search?q=Ruby%20Web%20Scraping
I, [2018-08-22 13:08:26 +0400#15477] [M: 47377500980720]  INFO -- github_spider: Browser: finished get request to: https://github.com/search?q=Ruby%20Web%20Scraping
I, [2018-08-22 13:08:26 +0400#15477] [M: 47377500980720]  INFO -- github_spider: Info: visits: requests: 1, responses: 1
D, [2018-08-22 13:08:27 +0400#15477] [M: 47377500980720] DEBUG -- github_spider: Browser: driver.current_memory: 107968
D, [2018-08-22 13:08:27 +0400#15477] [M: 47377500980720] DEBUG -- github_spider: Browser: sleep 5 seconds before request...
I, [2018-08-22 13:08:32 +0400#15477] [M: 47377500980720]  INFO -- github_spider: Browser: started get request to: https://github.com/lorien/awesome-web-scraping
I, [2018-08-22 13:08:33 +0400#15477] [M: 47377500980720]  INFO -- github_spider: Browser: finished get request to: https://github.com/lorien/awesome-web-scraping
I, [2018-08-22 13:08:33 +0400#15477] [M: 47377500980720]  INFO -- github_spider: Info: visits: requests: 2, responses: 2
D, [2018-08-22 13:08:33 +0400#15477] [M: 47377500980720] DEBUG -- github_spider: Browser: driver.current_memory: 212542
D, [2018-08-22 13:08:33 +0400#15477] [M: 47377500980720] DEBUG -- github_spider: Browser: sleep 4 seconds before request...
I, [2018-08-22 13:08:37 +0400#15477] [M: 47377500980720]  INFO -- github_spider: Browser: started get request to: https://github.com/jaimeiniesta/metainspector

...

I, [2018-08-22 13:23:07 +0400#15477] [M: 47377500980720]  INFO -- github_spider: Browser: started get request to: https://github.com/preston/idclight
I, [2018-08-22 13:23:08 +0400#15477] [M: 47377500980720]  INFO -- github_spider: Browser: finished get request to: https://github.com/preston/idclight
I, [2018-08-22 13:23:08 +0400#15477] [M: 47377500980720]  INFO -- github_spider: Info: visits: requests: 140, responses: 140
D, [2018-08-22 13:23:08 +0400#15477] [M: 47377500980720] DEBUG -- github_spider: Browser: driver.current_memory: 204198
I, [2018-08-22 13:23:08 +0400#15477] [M: 47377500980720]  INFO -- github_spider: Browser: driver selenium_chrome has been destroyed

I, [2018-08-22 13:23:08 +0400#15477] [M: 47377500980720]  INFO -- github_spider: Spider: stopped: {:spider_name=>"github_spider", :status=>:completed, :environment=>"development", :start_time=>2018-08-22 13:08:03 +0400, :stop_time=>2018-08-22 13:23:08 +0400, :running_time=>"15m, 5s", :visits=>{:requests=>140, :responses=>140}, :error=>nil}
results.json
[
  {
    "owner": "lorien",
    "repo_name": "awesome-web-scraping",
    "repo_url": "https://github.com/lorien/awesome-web-scraping",
    "description": "List of libraries, tools and APIs for web scraping and data processing.",
    "tags": [
      "awesome",
      "awesome-list",
      "web-scraping",
      "data-processing",
      "python",
      "javascript",
      "php",
      "ruby"
    ],
    "watch_count": "159",
    "star_count": "2,423",
    "fork_count": "358",
    "last_commit": "4 days ago",
    "position": 1
  },

  ...

  {
    "owner": "preston",
    "repo_name": "idclight",
    "repo_url": "https://github.com/preston/idclight",
    "description": "A Ruby gem for accessing the freely available IDClight (IDConverter Light) web service, which convert between different types of gene IDs such as Hugo and Entrez. Queries are screen scraped from http://idclight.bioinfo.cnio.es.",
    "tags": [

    ],
    "watch_count": "6",
    "star_count": "1",
    "fork_count": "0",
    "last_commit": "on Apr 12, 2012",
    "position": 127
  }
]

Okay, that was easy. How about JavaScript-rendered websites with dynamic HTML? Let's scrape a page with infinite scroll:

# infinite_scroll_spider.rb
require 'kimurai'

class InfiniteScrollSpider < Kimurai::Base
  @name = "infinite_scroll_spider"
  @engine = :selenium_chrome
  @start_urls = ["https://infinite-scroll.com/demo/full-page/"]

  def parse(response, url:, data: {})
    posts_headers_path = "//article/h2"
    count = response.xpath(posts_headers_path).count

    loop do
      browser.execute_script("window.scrollBy(0,10000)") ; sleep 2
      response = browser.current_response

      new_count = response.xpath(posts_headers_path).count
      if count == new_count
        logger.info "> Pagination is done" and break
      else
        count = new_count
        logger.info "> Continue scrolling, current count is #{count}..."
      end
    end

    posts_headers = response.xpath(posts_headers_path).map(&:text)
    logger.info "> All posts from page: #{posts_headers.join('; ')}"
  end
end

InfiniteScrollSpider.crawl!
Run: $ ruby infinite_scroll_spider.rb
I, [2018-08-22 13:32:57 +0400#23356] [M: 47375890851320]  INFO -- infinite_scroll_spider: Spider: started: infinite_scroll_spider
D, [2018-08-22 13:32:57 +0400#23356] [M: 47375890851320] DEBUG -- infinite_scroll_spider: BrowserBuilder (selenium_chrome): created browser instance
D, [2018-08-22 13:32:57 +0400#23356] [M: 47375890851320] DEBUG -- infinite_scroll_spider: BrowserBuilder (selenium_chrome): enabled native headless_mode
I, [2018-08-22 13:32:57 +0400#23356] [M: 47375890851320]  INFO -- infinite_scroll_spider: Browser: started get request to: https://infinite-scroll.com/demo/full-page/
I, [2018-08-22 13:33:03 +0400#23356] [M: 47375890851320]  INFO -- infinite_scroll_spider: Browser: finished get request to: https://infinite-scroll.com/demo/full-page/
I, [2018-08-22 13:33:03 +0400#23356] [M: 47375890851320]  INFO -- infinite_scroll_spider: Info: visits: requests: 1, responses: 1
D, [2018-08-22 13:33:03 +0400#23356] [M: 47375890851320] DEBUG -- infinite_scroll_spider: Browser: driver.current_memory: 95463
I, [2018-08-22 13:33:05 +0400#23356] [M: 47375890851320]  INFO -- infinite_scroll_spider: > Continue scrolling, current count is 5...
I, [2018-08-22 13:33:18 +0400#23356] [M: 47375890851320]  INFO -- infinite_scroll_spider: > Continue scrolling, current count is 9...
I, [2018-08-22 13:33:20 +0400#23356] [M: 47375890851320]  INFO -- infinite_scroll_spider: > Continue scrolling, current count is 11...
I, [2018-08-22 13:33:26 +0400#23356] [M: 47375890851320]  INFO -- infinite_scroll_spider: > Continue scrolling, current count is 13...
I, [2018-08-22 13:33:28 +0400#23356] [M: 47375890851320]  INFO -- infinite_scroll_spider: > Continue scrolling, current count is 15...
I, [2018-08-22 13:33:30 +0400#23356] [M: 47375890851320]  INFO -- infinite_scroll_spider: > Pagination is done
I, [2018-08-22 13:33:30 +0400#23356] [M: 47375890851320]  INFO -- infinite_scroll_spider: > All posts from page: 1a - Infinite Scroll full page demo; 1b - RGB Schemes logo in Computer Arts; 2a - RGB Schemes logo; 2b - Masonry gets horizontalOrder; 2c - Every vector 2016; 3a - Logo Pizza delivered; 3b - Some CodePens; 3c - 365daysofmusic.com; 3d - Holograms; 4a - Huebee: 1-click color picker; 4b - Word is Flickity is good; Flickity v2 released: groupCells, adaptiveHeight, parallax; New tech gets chatter; Isotope v3 released: stagger in, IE8 out; Packery v2 released
I, [2018-08-22 13:33:30 +0400#23356] [M: 47375890851320]  INFO -- infinite_scroll_spider: Browser: driver selenium_chrome has been destroyed
I, [2018-08-22 13:33:30 +0400#23356] [M: 47375890851320]  INFO -- infinite_scroll_spider: Spider: stopped: {:spider_name=>"infinite_scroll_spider", :status=>:completed, :environment=>"development", :start_time=>2018-08-22 13:32:57 +0400, :stop_time=>2018-08-22 13:33:30 +0400, :running_time=>"33s", :visits=>{:requests=>1, :responses=>1}, :error=>nil}


Features

  • Scrape JavaScript-rendered websites out of the box
  • Supported engines: Headless Chrome, Headless Firefox, PhantomJS, or simple HTTP requests (mechanize gem)
  • Write spider code once, and use it with any supported engine later
  • All the power of Capybara: use methods like click_on, fill_in, select, choose, set, go_back, etc. to interact with web pages
  • Rich configuration: set default headers, cookies, delay between requests, enable proxy/user-agent rotation
  • Built-in helpers to make scraping easy, like save_to (save items to JSON, JSON Lines, or CSV formats) or unique? to skip duplicates
  • Automatically handle request errors
  • Automatically restart browsers when reaching a memory limit (memory control) or a requests limit
  • Easily schedule spiders within cron using Whenever (no need to know cron syntax)
  • Parallel scraping using the simple in_parallel method
  • Two modes: use a single file for a simple spider, or generate a Scrapy-like project
  • Convenient development mode with console, colorized logger and debugger (Pry, Byebug)
  • Automated server environment setup (for Ubuntu 18.04) and deploy using the kimurai setup and kimurai deploy commands (Ansible under the hood)
  • Command-line runner to run all project spiders one by one or in parallel


Installation

Kimurai requires Ruby version >= 2.5.0. Supported platforms: Linux and Mac OS X.

  1. If your system doesn't have an appropriate Ruby version, install it:
Ubuntu 18.04
# Install required packages for ruby-build
sudo apt update
sudo apt install git-core curl zlib1g-dev build-essential libssl-dev libreadline-dev libreadline6-dev libyaml-dev libxml2-dev libxslt1-dev libcurl4-openssl-dev libffi-dev

# Install rbenv and ruby-build
cd && git clone https://github.com/rbenv/rbenv.git ~/.rbenv
echo 'export PATH="$HOME/.rbenv/bin:$PATH"' >> ~/.bashrc
echo 'eval "$(rbenv init -)"' >> ~/.bashrc
exec $SHELL

git clone https://github.com/rbenv/ruby-build.git ~/.rbenv/plugins/ruby-build
echo 'export PATH="$HOME/.rbenv/plugins/ruby-build/bin:$PATH"' >> ~/.bashrc
exec $SHELL

# Install latest Ruby
rbenv install 2.5.3
rbenv global 2.5.3

gem install bundler
Mac OS X
# Install homebrew if you don't have it https://brew.sh/
# Install rbenv and ruby-build:
brew install rbenv ruby-build

# Add rbenv to bash so that it loads every time you open a terminal
echo 'if which rbenv > /dev/null; then eval "$(rbenv init -)"; fi' >> ~/.bash_profile
source ~/.bash_profile

# Install latest Ruby
rbenv install 2.5.3
rbenv global 2.5.3

gem install bundler
  2. Install the Kimurai gem: $ gem install kimurai

  3. Install browsers with webdrivers:

Ubuntu 18.04

Note: for Ubuntu 16.04-18.04, automatic installation is available using the setup command:

$ kimurai setup localhost --local --ask-sudo

It works using Ansible, so you need to install it first: $ sudo apt install ansible. You can check the playbooks it uses here.

If you chose automatic installation, you can skip the following and go to the "Getting to Know" part. In case you want to install everything manually:

# Install basic tools
sudo apt install -q -y unzip wget tar openssl

# Install xvfb (for virtual_display headless mode, in additional to native)
sudo apt install -q -y xvfb

# Install chromium-browser and firefox
sudo apt install -q -y chromium-browser firefox

# Install chromedriver (version 2.44)
# All versions located here https://sites.google.com/a/chromium.org/chromedriver/downloads
cd /tmp && wget https://chromedriver.storage.googleapis.com/2.44/chromedriver_linux64.zip
sudo unzip chromedriver_linux64.zip -d /usr/local/bin
rm -f chromedriver_linux64.zip

# Install geckodriver (version 0.23.0)
# All versions located here https://github.com/mozilla/geckodriver/releases/
cd /tmp && wget https://github.com/mozilla/geckodriver/releases/download/v0.23.0/geckodriver-v0.23.0-linux64.tar.gz
sudo tar -xvzf geckodriver-v0.23.0-linux64.tar.gz -C /usr/local/bin
rm -f geckodriver-v0.23.0-linux64.tar.gz

# Install PhantomJS (2.1.1)
# All versions located here http://phantomjs.org/download.html
sudo apt install -q -y chrpath libxft-dev libfreetype6 libfreetype6-dev libfontconfig1 libfontconfig1-dev
cd /tmp && wget https://bitbucket.org/ariya/phantomjs/downloads/phantomjs-2.1.1-linux-x86_64.tar.bz2
tar -xvjf phantomjs-2.1.1-linux-x86_64.tar.bz2
sudo mv phantomjs-2.1.1-linux-x86_64 /usr/local/lib
sudo ln -s /usr/local/lib/phantomjs-2.1.1-linux-x86_64/bin/phantomjs /usr/local/bin
rm -f phantomjs-2.1.1-linux-x86_64.tar.bz2
Mac OS X
# Install chrome and firefox
brew cask install google-chrome firefox

# Install chromedriver (latest)
brew cask install chromedriver

# Install geckodriver (latest)
brew install geckodriver

# Install PhantomJS (latest)
brew install phantomjs

Also, if you want to save scraped items to a database (using ActiveRecord, Sequel, or the MongoDB Ruby Driver/Mongoid), you need to install the database clients/servers:

Ubuntu 18.04

SQLite: $ sudo apt -q -y install libsqlite3-dev sqlite3.

If you want to connect to a remote database, you don't need a database server on the local machine (only a client):

# Install MySQL client
sudo apt -q -y install mysql-client libmysqlclient-dev

# Install Postgres client
sudo apt install -q -y postgresql-client libpq-dev

# Install MongoDB client
sudo apt install -q -y mongodb-clients

But if you want to save items to a local database, a database server is required as well:

# Install MySQL client and server
sudo apt -q -y install mysql-server mysql-client libmysqlclient-dev

# Install Postgres client and server
sudo apt install -q -y postgresql postgresql-contrib libpq-dev

# Install MongoDB client and server
# version 4.0 (check here https://docs.mongodb.com/manual/tutorial/install-mongodb-on-ubuntu/)
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 9DA31620334BD75D9DCB49F368818C72E52529D4
# for 16.04:
# echo "deb [ arch=amd64,arm64 ] https://repo.mongodb.org/apt/ubuntu xenial/mongodb-org/4.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-4.0.list
# for 18.04:
echo "deb [ arch=amd64 ] https://repo.mongodb.org/apt/ubuntu bionic/mongodb-org/4.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-4.0.list
sudo apt update
sudo apt install -q -y mongodb-org
sudo service mongod start
Mac OS X

SQLite: $ brew install sqlite3

# Install MySQL client and server
brew install mysql
# Start server if you need it: brew services start mysql

# Install Postgres client and server
brew install postgresql
# Start server if you need it: brew services start postgresql

# Install MongoDB client and server
brew install mongodb
# Start server if you need it: brew services start mongodb

Getting to Know

Interactive console

Before you get to know all of Kimurai's features, there is the $ kimurai console command: an interactive console where you can try and debug your scraping code very quickly, without having to run any spider (yes, it's like the Scrapy shell).

$ kimurai console --engine selenium_chrome --url https://github.com/vifreefly/kimuraframework
Show output
$ kimurai console --engine selenium_chrome --url https://github.com/vifreefly/kimuraframework

D, [2018-08-22 13:42:32 +0400#26079] [M: 47461994677760] DEBUG -- : BrowserBuilder (selenium_chrome): created browser instance
D, [2018-08-22 13:42:32 +0400#26079] [M: 47461994677760] DEBUG -- : BrowserBuilder (selenium_chrome): enabled native headless_mode
I, [2018-08-22 13:42:32 +0400#26079] [M: 47461994677760]  INFO -- : Browser: started get request to: https://github.com/vifreefly/kimuraframework
I, [2018-08-22 13:42:35 +0400#26079] [M: 47461994677760]  INFO -- : Browser: finished get request to: https://github.com/vifreefly/kimuraframework
D, [2018-08-22 13:42:35 +0400#26079] [M: 47461994677760] DEBUG -- : Browser: driver.current_memory: 201701

From: /home/victor/code/kimurai/lib/kimurai/base.rb @ line 189 Kimurai::Base#console:

    188: def console(response = nil, url: nil, data: {})
 => 189:   binding.pry
    190: end

[1] pry(#<Kimurai::Base>)> response.xpath("//title").text
=> "GitHub - vifreefly/kimuraframework: Modern web scraping framework written in Ruby which works out of box with Headless Chromium/Firefox, PhantomJS, or simple HTTP requests and allows to scrape and interact with JavaScript rendered websites"

[2] pry(#<Kimurai::Base>)> ls
Kimurai::Base#methods: browser  console  logger  request_to  save_to  unique?
instance variables: @browser  @config  @engine  @logger  @pipelines
locals: _  __  _dir_  _ex_  _file_  _in_  _out_  _pry_  data  response  url

[3] pry(#<Kimurai::Base>)> ls response
Nokogiri::XML::PP::Node#methods: inspect  pretty_print
Nokogiri::XML::Searchable#methods: %  /  at  at_css  at_xpath  css  search  xpath
Enumerable#methods:
  all?         collect         drop        each_with_index   find_all    grep_v    lazy    member?    none?      reject        slice_when  take_while  without
  any?         collect_concat  drop_while  each_with_object  find_index  group_by  many?   min        one?       reverse_each  sort        to_a        zip
  as_json      count           each_cons   entries           first       include?  map     min_by     partition  select        sort_by     to_h
  chunk        cycle           each_entry  exclude?          flat_map    index_by  max     minmax     pluck      slice_after   sum         to_set
  chunk_while  detect          each_slice  find              grep        inject    max_by  minmax_by  reduce     slice_before  take        uniq
Nokogiri::XML::Node#methods:
  <=>                   append_class       classes                 document?             has_attribute?      matches?          node_name=        processing_instruction?  to_str
  ==                    attr               comment?                each                  html?               name=             node_type         read_only?               to_xhtml
  >                     attribute          content                 elem?                 inner_html          namespace=        parent=           remove                   traverse
  []                    attribute_nodes    content=                element?              inner_html=         namespace_scopes  parse             remove_attribute         unlink
  []=                   attribute_with_ns  create_external_subset  element_children      inner_text          namespaced_key?   path              remove_class             values
  accept                before             create_internal_subset  elements              internal_subset     native_content=   pointer_id        replace                  write_html_to
  add_class             blank?             css_path                encode_special_chars  key?                next              prepend_child     set_attribute            write_to
  add_next_sibling      cdata?             decorate!               external_subset       keys                next=             previous          text                     write_xhtml_to
  add_previous_sibling  child              delete                  first_element_child   lang                next_element      previous=         text?                    write_xml_to
  after                 children           description             fragment?             lang=               next_sibling      previous_element  to_html                  xml?
  ancestors             children=          do_xinclude             get_attribute         last_element_child  node_name         previous_sibling  to_s
Nokogiri::XML::Document#methods:
  <<         canonicalize  collect_namespaces  create_comment  create_entity     decorate    document  encoding   errors   name        remove_namespaces!  root=  to_java  url       version
  add_child  clone         create_cdata        create_element  create_text_node  decorators  dup       encoding=  errors=  namespaces  root                slop!  to_xml   validate
Nokogiri::HTML::Document#methods: fragment  meta_encoding  meta_encoding=  serialize  title  title=  type
instance variables: @decorators  @errors  @node_cache

[4] pry(#<Kimurai::Base>)> exit
I, [2018-08-22 13:43:47 +0400#26079] [M: 47461994677760]  INFO -- : Browser: driver selenium_chrome has been destroyed
$

CLI options:

  • --engine (optional) the engine to use. The default is mechanize
  • --url (optional) the url to process. If url is omitted, the response and url objects inside the console will be nil (use the browser object to navigate to any webpage).
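
For example, you can start the console without a url and navigate manually. A minimal sketch (the engine and target url here are just examples):

$ kimurai console --engine mechanize

# Inside the console, response and url are nil, so use the browser object:
browser.visit("https://example.com/")
browser.current_response.at_xpath("//title").text
# => "Example Domain"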

Available engines

Kimurai supports the following engines and, for the most part, can switch between them without the need to rewrite any code:

  • :mechanize - a pure Ruby fake HTTP browser. Mechanize can't render JavaScript and doesn't know what the DOM is; it can only parse the original HTML code of a page. Because of this, mechanize is much faster, takes much less memory, and is in general much more stable than any real browser. Use mechanize when you can, i.e. when the website doesn't use JavaScript to render any meaningful parts of its structure. Still, because mechanize tries to mimic a real browser, it supports almost all of Capybara's methods for interacting with a web page (filling in forms, clicking buttons, checkboxes, etc.).
  • :poltergeist_phantomjs - the PhantomJS headless browser; it can render JavaScript. In general, PhantomJS is still faster than Headless Chrome (and Headless Firefox). PhantomJS has memory leaks, but Kimurai has a memory control feature, so you shouldn't consider this a problem. Also, some websites can recognize PhantomJS and block access. Like mechanize (and unlike the selenium engines), :poltergeist_phantomjs can freely rotate proxies and change headers on the fly (see the config section).
  • :selenium_chrome Chrome in headless mode driven by Selenium. A modern headless browser solution with proper JavaScript rendering.
  • :selenium_firefox Firefox in headless mode driven by Selenium. Usually takes more memory than the other drivers, but can sometimes be useful.

Tip: add the HEADLESS=false ENV variable before the command ($ HEADLESS=false ruby spider.rb) to run the browser in normal (not headless) mode and see its window (only for selenium-like engines). It works for the console command as well.

Minimum required spider structure

You can manually create a spider file, or use the generator instead: $ kimurai generate spider simple_spider

require 'kimurai'

class SimpleSpider < Kimurai::Base
  @name = "simple_spider"
  @engine = :selenium_chrome
  @start_urls = ["https://example.com/"]

  def parse(response, url:, data: {})
  end
end

SimpleSpider.crawl!

Where:

  • @name the name of the spider. You can omit the name if you use a single-file spider
  • @engine the engine for the spider
  • @start_urls an array of start urls to process one by one inside the parse method
  • The parse method is the starting method and should always be present in the spider class

Method arguments response, url and data

def parse(response, url:, data: {})
end
  • response (Nokogiri::HTML::Document object) contains the parsed HTML code of the processed webpage
  • url (String) the url of the processed webpage
  • data (Hash) used to pass data between requests
Example how to use data

Imagine that there is a product page which doesn't contain the product category. The category name is present only on the category page with pagination. This is a case where we can use data to pass the category name from the parse method to the parse_product method:

class ProductsSpider < Kimurai::Base
  @engine = :selenium_chrome
  @start_urls = ["https://example-shop.com/example-product-category"]

  def parse(response, url:, data: {})
    category_name = response.xpath("//path/to/category/name").text
    response.xpath("//path/to/products/urls").each do |product_url|
      # Merge category_name with current data hash and pass it next to parse_product method
      request_to(:parse_product, url: product_url[:href], data: data.merge(category_name: category_name))
    end

    # ...
  end

  def parse_product(response, url:, data: {})
    item = {}
    # Assign item's category_name from data[:category_name]
    item[:category_name] = data[:category_name]

    # ...
  end
end

You can query response using XPath or CSS selectors. Check the Nokogiri tutorials to understand how to work with response.

browser object

From any spider instance method, a browser object is available. It is a Capybara::Session object and is used to process requests and get the page response (via the current_response method). Usually you don't need to touch it directly, because there is response (see above) which contains the page response after it was loaded.

But if you need to interact with a page (like filling in form fields, clicking elements, checking checkboxes, etc.), browser is ready for you:

class GoogleSpider < Kimurai::Base
  @name = "google_spider"
  @engine = :selenium_chrome
  @start_urls = ["https://www.google.com/"]

  def parse(response, url:, data: {})
    browser.fill_in "q", with: "Kimurai web scraping framework"
    browser.click_button "Google Search"

    # Update response to current response after interaction with a browser
    response = browser.current_response

    # Collect results
    results = response.xpath("//div[@class='g']//h3/a").map do |a|
      { title: a.text, url: a[:href] }
    end

    # ...
  end
end

Check out the Capybara cheat sheets to see all available methods for interacting with the browser.

request_to method

To make a request and pass the response to a particular method, there is request_to. It requires a minimum of two arguments: :method_name and url:. An optional argument is data: (see above for what it's used for). Example:

class Spider < Kimurai::Base
  @engine = :selenium_chrome
  @start_urls = ["https://example.com/"]

  def parse(response, url:, data: {})
    # Process request to `parse_product` method with `https://example.com/some_product` url:
    request_to :parse_product, url: "https://example.com/some_product"
  end

  def parse_product(response, url:, data: {})
    puts "From page https://example.com/some_product !"
  end
end

Under the hood, request_to simply calls #visit (browser.visit(url)) and then calls the required method with arguments:

request_to
def request_to(handler, url:, data: {})
  request_data = { url: url, data: data }

  browser.visit(url)
  public_send(handler, browser.current_response, request_data)
end

request_to just makes things simpler, and without it we could do something like:

Check the code
class Spider < Kimurai::Base
  @engine = :selenium_chrome
  @start_urls = ["https://example.com/"]

  def parse(response, url:, data: {})
    url_to_process = "https://example.com/some_product"

    browser.visit(url_to_process)
    parse_product(browser.current_response, url: url_to_process)
  end

  def parse_product(response, url:, data: {})
    puts "From page https://example.com/some_product !"
  end
end

save_to helper

Sometimes all you need is to simply save scraped data to a file in a format like JSON or CSV. You can use save_to for that:

class ProductsSpider < Kimurai::Base
  @engine = :selenium_chrome
  @start_urls = ["https://example-shop.com/"]

  # ...

  def parse_product(response, url:, data: {})
    item = {}

    item[:title] = response.xpath("//title/path").text
    item[:description] = response.xpath("//desc/path").text.squish
    item[:price] = response.xpath("//price/path").text[/\d+/]&.to_f

    # Add each new item to the `scraped_products.json` file:
    save_to "scraped_products.json", item, format: :json
  end
end

Supported formats:

  • :json JSON
  • :pretty_json "pretty" JSON (JSON.pretty_generate)
  • :jsonlines JSON Lines
  • :csv CSV

Note: save_to requires data (item to save) to be a Hash.
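
For illustration, the same item hash could be written in any of the supported formats (the file names here are arbitrary):

# Assuming `item` is a Hash built inside a parse method:
save_to "results.json", item, format: :json                 # JSON array
save_to "results_pretty.json", item, format: :pretty_json   # pretty-printed JSON
save_to "results.jsonl", item, format: :jsonlines           # one JSON object per line
save_to "results.csv", item, format: :csv                   # CSV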

By default, save_to adds a position key to the item hash. You can disable it with position: false: save_to "scraped_products.json", item, format: :json, position: false.

How the helper works:

While the spider is running, each new item is appended to the file. At the next run, the helper first clears the contents of the file and then starts appending items to it again.

If you don't want the file to be cleared before each run, add the option append: true: save_to "scraped_products.json", item, format: :json, append: true

Skip duplicates

It's pretty common for websites to have duplicate pages, for example when an e-commerce shop lists the same products in different categories. To skip duplicates, there is a simple unique? helper:

class ProductsSpider < Kimurai::Base
  @engine = :selenium_chrome
  @start_urls = ["https://example-shop.com/"]

  def parse(response, url:, data: {})
    response.xpath("//categories/path").each do |category|
      request_to :parse_category, url: category[:href]
    end
  end

  # Check products for uniqueness using product url inside of parse_category:
  def parse_category(response, url:, data: {})
    response.xpath("//products/path").each do |product|
      # Skip url if it's not unique:
      next unless unique?(:product_url, product[:href])
      # Otherwise process it:
      request_to :parse_product, url: product[:href]
    end
  end

  # Or/and check products for uniqueness using product sku inside of parse_product:
  def parse_product(response, url:, data: {})
    item = {}
    item[:sku] = response.xpath("//product/sku/path").text.strip.upcase
    # Don't save product and return from method if there is already saved item with the same sku:
    return unless unique?(:sku, item[:sku])

    # ...
    save_to "results.json", item, format: :json
  end
end

The unique? helper works pretty simply:

# Check the string "http://example.com" in the scope `url` for the first time:
unique?(:url, "http://example.com")
# => true

# Try again:
unique?(:url, "http://example.com")
# => false

To check something for uniqueness, you need to provide a scope:

# `product_url` scope
unique?(:product_url, "http://example.com/product_1")

# `id` scope
unique?(:id, 324234232)

# `custom` scope
unique?(:custom, "Lorem Ipsum")

Automatically skip all duplicate request urls

It is possible to automatically skip all already visited urls when calling the request_to method, using the @config option skip_duplicate_requests: true. Also check the @config section for additional options of this setting.
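
A minimal sketch of a spider using this option (the urls and xpaths are placeholders):

class CategoriesSpider < Kimurai::Base
  @engine = :mechanize
  @start_urls = ["https://example-shop.com/"]
  @config = {
    # Urls already visited via request_to will be skipped automatically:
    skip_duplicate_requests: true
  }

  def parse(response, url:, data: {})
    # The same product can appear in several categories; repeated urls are skipped:
    response.xpath("//path/to/products/urls").each do |product|
      request_to :parse_product, url: product[:href]
    end
  end

  def parse_product(response, url:, data: {})
    # ...
  end
end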

storage object

The unique? method is just an alias for storage#unique?. Storage has several methods:

  • #all - returns the storage hash, where the keys are the existing scopes.
  • #include?(scope, value) - returns true if the value exists in the scope, and false if not
  • #add(scope, value) - adds the value to the scope
  • #unique?(scope, value) - the method described above; returns false if the value already exists in the scope, otherwise adds the value to the scope and returns true.
  • #clear! - resets the whole storage by deleting all values from all scopes.
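
A short sketch of using storage directly inside a spider method, assuming the storage object is accessible from spider instance methods as the unique? alias above implies (the :product_skus scope is just an example):

def parse_product(response, url:, data: {})
  sku = response.xpath("//product/sku/path").text.strip

  # Equivalent to `return unless unique?(:product_skus, sku)`:
  return if storage.include?(:product_skus, sku)
  storage.add(:product_skus, sku)

  # ...
end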

Handle request errors

It is quite common for some pages of a website being crawled to return a response code other than 200 OK. In such cases, the request_to method (or browser.visit) can raise an exception. Kimurai provides the skip_request_errors and retry_request_errors config options to handle such errors:

skip_request_errors

You can automatically skip certain errors while requesting a page using the skip_request_errors config option. If a raised error matches one of the errors in the list, the error will be caught and the request will be skipped. It is a good idea to skip errors like NotFound (404), etc.

The format of the option is an array whose elements are error classes and/or hashes. You can use the hash format for more flexibility:

@config = {
  skip_request_errors: [{ error: RuntimeError, message: "404 => Net::HTTPNotFound" }]
}

In this case, the provided message: will be compared with the full error message using String#include?. You can also use a regex instead: { error: RuntimeError, message: /404|403/ }.

retry_request_errors

You can automatically retry certain errors with a few attempts while requesting a page using the retry_request_errors config option. If a raised error matches one of the errors in the list, the error will be caught and the request will be processed again after a delay.

There are 3 attempts: the first with a 15-second delay, the second with a 30-second delay, and the third with a 45-second delay. If after 3 attempts there is still an exception, the exception will be raised. It is a good idea to retry errors like ReadTimeout, HTTPBadGateway, etc.

The format of the option is the same as for the skip_request_errors option.

If you would like to skip (not raise) the error after all retries are exhausted, you can specify the skip_on_failure: true option:

@config = {
  retry_request_errors: [{ error: RuntimeError, skip_on_failure: true }]
}
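
For illustration, both options can be combined in a single @config. The error classes and messages below are only examples, not a definitive list:

@config = {
  # Don't retry pages which are genuinely missing:
  skip_request_errors: [{ error: RuntimeError, message: "404 => Net::HTTPNotFound" }],
  # Retry transient errors, and skip the request after the 3rd failed attempt:
  retry_request_errors: [
    Net::ReadTimeout,
    { error: RuntimeError, message: /502|503/, skip_on_failure: true }
  ]
}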

Logging custom events

It is possible to save custom messages to the run_info hash using the add_event('Some message') method. This feature helps you keep track of important things that happened during crawling without checking the whole spider log (in case you're also logging these messages using logger). Example:

def parse_product(response, url:, data: {})
  unless response.at_xpath("//path/to/add_to_card_button")
    add_event("Product is sold") and return
  end

  # ...
end
...
I, [2018-11-28 22:20:19 +0400#7402] [M: 47156576560640]  INFO -- example_spider: Spider: new event (scope: custom): Product is sold
...
I, [2018-11-28 22:20:19 +0400#7402] [M: 47156576560640]  INFO -- example_spider: Spider: stopped: {:events=>{:custom=>{"Product is sold"=>1}}}

open_spider and close_spider callbacks

You can define .open_spider and .close_spider callbacks (class methods) to perform some action before the spider starts or after it has stopped:

require 'kimurai'

class ExampleSpider < Kimurai::Base
  @name = "example_spider"
  @engine = :selenium_chrome
  @start_urls = ["https://example.com/"]

  def self.open_spider
    logger.info "> Starting..."
  end

  def self.close_spider
    logger.info "> Stopped!"
  end

  def parse(response, url:, data: {})
    logger.info "> Scraping..."
  end
end

ExampleSpider.crawl!
Output
I, [2018-08-22 14:26:32 +0400#6001] [M: 46996522083840]  INFO -- example_spider: Spider: started: example_spider
I, [2018-08-22 14:26:32 +0400#6001] [M: 46996522083840]  INFO -- example_spider: > Starting...
D, [2018-08-22 14:26:32 +0400#6001] [M: 46996522083840] DEBUG -- example_spider: BrowserBuilder (selenium_chrome): created browser instance
D, [2018-08-22 14:26:32 +0400#6001] [M: 46996522083840] DEBUG -- example_spider: BrowserBuilder (selenium_chrome): enabled native headless_mode
I, [2018-08-22 14:26:32 +0400#6001] [M: 46996522083840]  INFO -- example_spider: Browser: started get request to: https://example.com/
I, [2018-08-22 14:26:34 +0400#6001] [M: 46996522083840]  INFO -- example_spider: Browser: finished get request to: https://example.com/
I, [2018-08-22 14:26:34 +0400#6001] [M: 46996522083840]  INFO -- example_spider: Info: visits: requests: 1, responses: 1
D, [2018-08-22 14:26:34 +0400#6001] [M: 46996522083840] DEBUG -- example_spider: Browser: driver.current_memory: 82415
I, [2018-08-22 14:26:34 +0400#6001] [M: 46996522083840]  INFO -- example_spider: > Scraping...
I, [2018-08-22 14:26:34 +0400#6001] [M: 46996522083840]  INFO -- example_spider: Browser: driver selenium_chrome has been destroyed
I, [2018-08-22 14:26:34 +0400#6001] [M: 46996522083840]  INFO -- example_spider: > Stopped!
I, [2018-08-22 14:26:34 +0400#6001] [M: 46996522083840]  INFO -- example_spider: Spider: stopped: {:spider_name=>"example_spider", :status=>:completed, :environment=>"development", :start_time=>2018-08-22 14:26:32 +0400, :stop_time=>2018-08-22 14:26:34 +0400, :running_time=>"1s", :visits=>{:requests=>1, :responses=>1}, :error=>nil}

Inside the open_spider and close_spider class methods, the run_info method is available and contains useful information about the spider's state:

    11: def self.open_spider
 => 12:   binding.pry
    13: end

[1] pry(example_spider)> run_info
=> {
  :spider_name=>"example_spider",
  :status=>:running,
  :environment=>"development",
  :start_time=>2018-08-05 23:32:00 +0400,
  :stop_time=>nil,
  :running_time=>nil,
  :visits=>{:requests=>0, :responses=>0},
  :error=>nil
}

Inside close_spider, run_info will be updated:

    15: def self.close_spider
 => 16:   binding.pry
    17: end

[1] pry(example_spider)> run_info
=> {
  :spider_name=>"example_spider",
  :status=>:completed,
  :environment=>"development",
  :start_time=>2018-08-05 23:32:00 +0400,
  :stop_time=>2018-08-05 23:32:06 +0400,
  :running_time=>6.214,
  :visits=>{:requests=>1, :responses=>1},
  :error=>nil
}

run_info[:status] helps to determine whether the spider finished successfully or failed (possible values: :completed, :failed):

class ExampleSpider < Kimurai::Base
  @name = "example_spider"
  @engine = :selenium_chrome
  @start_urls = ["https://example.com/"]

  def self.close_spider
    puts ">>> run info: #{run_info}"
  end

  def parse(response, url:, data: {})
    logger.info "> Scraping..."
    # Let's try to strip nil:
    nil.strip
  end
end
Output
I, [2018-08-22 14:34:24 +0400#8459] [M: 47020523644400]  INFO -- example_spider: Spider: started: example_spider
D, [2018-08-22 14:34:25 +0400#8459] [M: 47020523644400] DEBUG -- example_spider: BrowserBuilder (selenium_chrome): created browser instance
D, [2018-08-22 14:34:25 +0400#8459] [M: 47020523644400] DEBUG -- example_spider: BrowserBuilder (selenium_chrome): enabled native headless_mode
I, [2018-08-22 14:34:25 +0400#8459] [M: 47020523644400]  INFO -- example_spider: Browser: started get request to: https://example.com/
I, [2018-08-22 14:34:26 +0400#8459] [M: 47020523644400]  INFO -- example_spider: Browser: finished get request to: https://example.com/
I, [2018-08-22 14:34:26 +0400#8459] [M: 47020523644400]  INFO -- example_spider: Info: visits: requests: 1, responses: 1
D, [2018-08-22 14:34:26 +0400#8459] [M: 47020523644400] DEBUG -- example_spider: Browser: driver.current_memory: 83351
I, [2018-08-22 14:34:26 +0400#8459] [M: 47020523644400]  INFO -- example_spider: > Scraping...
I, [2018-08-22 14:34:26 +0400#8459] [M: 47020523644400]  INFO -- example_spider: Browser: driver selenium_chrome has been destroyed

>>> run info: {:spider_name=>"example_spider", :status=>:failed, :environment=>"development", :start_time=>2018-08-22 14:34:24 +0400, :stop_time=>2018-08-22 14:34:26 +0400, :running_time=>2.01, :visits=>{:requests=>1, :responses=>1}, :error=>"#<NoMethodError: undefined method `strip' for nil:NilClass>"}

F, [2018-08-22 14:34:26 +0400#8459] [M: 47020523644400] FATAL -- example_spider: Spider: stopped: {:spider_name=>"example_spider", :status=>:failed, :environment=>"development", :start_time=>2018-08-22 14:34:24 +0400, :stop_time=>2018-08-22 14:34:26 +0400, :running_time=>"2s", :visits=>{:requests=>1, :responses=>1}, :error=>"#<NoMethodError: undefined method `strip' for nil:NilClass>"}
Traceback (most recent call last):
        6: from example_spider.rb:19:in `<main>'
        5: from /home/victor/code/kimurai/lib/kimurai/base.rb:127:in `crawl!'
        4: from /home/victor/code/kimurai/lib/kimurai/base.rb:127:in `each'
        3: from /home/victor/code/kimurai/lib/kimurai/base.rb:128:in `block in crawl!'
        2: from /home/victor/code/kimurai/lib/kimurai/base.rb:185:in `request_to'
        1: from /home/victor/code/kimurai/lib/kimurai/base.rb:185:in `public_send'
example_spider.rb:15:in `parse': undefined method `strip' for nil:NilClass (NoMethodError)

Usage example: if the spider finished successfully, send the JSON file with scraped items to a remote FTP location; otherwise (if the spider failed), skip the incomplete results and send an email/notification to Slack about it:

Example

You can also use the additional methods completed? and failed?:

class Spider < Kimurai::Base
  @engine = :selenium_chrome
  @start_urls = ["https://example.com/"]

  def self.close_spider
    if completed?
      send_file_to_ftp("results.json")
    else
      send_error_notification(run_info[:error])
    end
  end

  def self.send_file_to_ftp(file_path)
    # ...
  end

  def self.send_error_notification(error)
    # ...
  end

  # ...

  def parse_item(response, url:, data: {})
    item = {}
    # ...

    save_to "results.json", item, format: :json
  end
end

KIMURAI_ENV

Kimurai has environments; the default is development. To provide a custom environment, pass the KIMURAI_ENV ENV variable before the command: $ KIMURAI_ENV=production ruby spider.rb. To access the current environment, there is the Kimurai.env method.

Usage example:

class Spider < Kimurai::Base
  @engine = :selenium_chrome
  @start_urls = ["https://example.com/"]

  def self.close_spider
    if failed? && Kimurai.env == "production"
      send_error_notification(run_info[:error])
    else
      # Do nothing
    end
  end

  # ...
end

Parallel crawling using in_parallel

Kimurai can process web pages concurrently with one single line: in_parallel(:parse_product, urls, threads: 3), where :parse_product is the method to process, urls is an array of urls to crawl, and threads: is the number of threads:

# amazon_spider.rb
require 'kimurai'

class AmazonSpider < Kimurai::Base
  @name = "amazon_spider"
  @engine = :mechanize
  @start_urls = ["https://www.amazon.com/"]

  def parse(response, url:, data: {})
    browser.fill_in "field-keywords", with: "Web Scraping Books"
    browser.click_on "Go"

    # Walk through pagination and collect products urls:
    urls = []
    loop do
      response = browser.current_response
      response.xpath("//li//a[contains(@class, 's-access-detail-page')]").each do |a|
        urls << a[:href].sub(/ref=.+/, "")
      end

      browser.find(:xpath, "//a[@id='pagnNextLink']", wait: 1).click rescue break
    end

    # Process all collected urls concurrently within 3 threads:
    in_parallel(:parse_book_page, urls, threads: 3)
  end

  def parse_book_page(response, url:, data: {})
    item = {}

    item[:title] = response.xpath("//h1/span[@id]").text.squish
    item[:url] = url
    item[:price] = response.xpath("(//span[contains(@class, 'a-color-price')])[1]").text.squish.presence
    item[:publisher] = response.xpath("//h2[text()='Product details']/following::b[text()='Publisher:']/following-sibling::text()[1]").text.squish.presence

    save_to "books.json", item, format: :pretty_json
  end
end

AmazonSpider.crawl!
Run: $ ruby amazon_spider.rb
I, [2018-08-22 14:48:37 +0400#13033] [M: 46982297486840]  INFO -- amazon_spider: Spider: started: amazon_spider
D, [2018-08-22 14:48:37 +0400#13033] [M: 46982297486840] DEBUG -- amazon_spider: BrowserBuilder (mechanize): created browser instance
I, [2018-08-22 14:48:37 +0400#13033] [M: 46982297486840]  INFO -- amazon_spider: Browser: started get request to: https://www.amazon.com/
I, [2018-08-22 14:48:38 +0400#13033] [M: 46982297486840]  INFO -- amazon_spider: Browser: finished get request to: https://www.amazon.com/
I, [2018-08-22 14:48:38 +0400#13033] [M: 46982297486840]  INFO -- amazon_spider: Info: visits: requests: 1, responses: 1

I, [2018-08-22 14:48:43 +0400#13033] [M: 46982297486840]  INFO -- amazon_spider: Spider: in_parallel: starting processing 52 urls within 3 threads
D, [2018-08-22 14:48:43 +0400#13033] [C: 46982320219020] DEBUG -- amazon_spider: BrowserBuilder (mechanize): created browser instance
I, [2018-08-22 14:48:43 +0400#13033] [C: 46982320219020]  INFO -- amazon_spider: Browser: started get request to: https://www.amazon.com/Practical-Web-Scraping-Data-Science/dp/1484235819/
D, [2018-08-22 14:48:44 +0400#13033] [C: 46982320189640] DEBUG -- amazon_spider: BrowserBuilder (mechanize): created browser instance
I, [2018-08-22 14:48:44 +0400#13033] [C: 46982320189640]  INFO -- amazon_spider: Browser: started get request to: https://www.amazon.com/Python-Web-Scraping-Cookbook-scraping/dp/1787285219/
D, [2018-08-22 14:48:44 +0400#13033] [C: 46982319187320] DEBUG -- amazon_spider: BrowserBuilder (mechanize): created browser instance
I, [2018-08-22 14:48:44 +0400#13033] [C: 46982319187320]  INFO -- amazon_spider: Browser: started get request to: https://www.amazon.com/Scraping-Python-Community-Experience-Distilled/dp/1782164367/
I, [2018-08-22 14:48:45 +0400#13033] [C: 46982320219020]  INFO -- amazon_spider: Browser: finished get request to: https://www.amazon.com/Practical-Web-Scraping-Data-Science/dp/1484235819/
I, [2018-08-22 14:48:45 +0400#13033] [C: 46982320219020]  INFO -- amazon_spider: Info: visits: requests: 4, responses: 2
I, [2018-08-22 14:48:45 +0400#13033] [C: 46982320219020]  INFO -- amazon_spider: Browser: started get request to: https://www.amazon.com/Web-Scraping-Python-Collecting-Modern/dp/1491910291/
I, [2018-08-22 14:48:46 +0400#13033] [C: 46982320189640]  INFO -- amazon_spider: Browser: finished get request to: https://www.amazon.com/Python-Web-Scraping-Cookbook-scraping/dp/1787285219/
I, [2018-08-22 14:48:46 +0400#13033] [C: 46982320189640]  INFO -- amazon_spider: Info: visits: requests: 5, responses: 3
I, [2018-08-22 14:48:46 +0400#13033] [C: 46982320189640]  INFO -- amazon_spider: Browser: started get request to: https://www.amazon.com/Web-Scraping-Python-Collecting-Modern/dp/1491985577/
I, [2018-08-22 14:48:46 +0400#13033] [C: 46982319187320]  INFO -- amazon_spider: Browser: finished get request to: https://www.amazon.com/Scraping-Python-Community-Experience-Distilled/dp/1782164367/
I, [2018-08-22 14:48:46 +0400#13033] [C: 46982319187320]  INFO -- amazon_spider: Info: visits: requests: 6, responses: 4
I, [2018-08-22 14:48:46 +0400#13033] [C: 46982319187320]  INFO -- amazon_spider: Browser: started get request to: https://www.amazon.com/Web-Scraping-Excel-Effective-Scrapes-ebook/dp/B01CMMJGZ8/

...

I, [2018-08-22 14:49:10 +0400#13033] [C: 46982320219020]  INFO -- amazon_spider: Info: visits: requests: 51, responses: 49
I, [2018-08-22 14:49:10 +0400#13033] [C: 46982320219020]  INFO -- amazon_spider: Browser: driver mechanize has been destroyed
I, [2018-08-22 14:49:11 +0400#13033] [C: 46982320189640]  INFO -- amazon_spider: Browser: finished get request to: https://www.amazon.com/Scraping-Ice-Life-Bill-Rayburn-ebook/dp/B00C0NF1L8/
I, [2018-08-22 14:49:11 +0400#13033] [C: 46982320189640]  INFO -- amazon_spider: Info: visits: requests: 51, responses: 50
I, [2018-08-22 14:49:11 +0400#13033] [C: 46982320189640]  INFO -- amazon_spider: Browser: started get request to: https://www.amazon.com/Instant-Scraping-Jacob-Ward-2013-07-26/dp/B01FJ1G3G4/
I, [2018-08-22 14:49:11 +0400#13033] [C: 46982319187320]  INFO -- amazon_spider: Browser: finished get request to: https://www.amazon.com/Php-architects-Guide-Scraping-Author/dp/B010DTKYY4/
I, [2018-08-22 14:49:11 +0400#13033] [C: 46982319187320]  INFO -- amazon_spider: Info: visits: requests: 52, responses: 51
I, [2018-08-22 14:49:11 +0400#13033] [C: 46982319187320]  INFO -- amazon_spider: Browser: started get request to: https://www.amazon.com/Ship-Tracking-Maritime-Domain-Awareness/dp/B001J5MTOK/
I, [2018-08-22 14:49:12 +0400#13033] [C: 46982320189640]  INFO -- amazon_spider: Browser: finished get request to: https://www.amazon.com/Instant-Scraping-Jacob-Ward-2013-07-26/dp/B01FJ1G3G4/
I, [2018-08-22 14:49:12 +0400#13033] [C: 46982320189640]  INFO -- amazon_spider: Info: visits: requests: 53, responses: 52
I, [2018-08-22 14:49:12 +0400#13033] [C: 46982320189640]  INFO -- amazon_spider: Browser: driver mechanize has been destroyed
I, [2018-08-22 14:49:12 +0400#13033] [C: 46982319187320]  INFO -- amazon_spider: Browser: finished get request to: https://www.amazon.com/Ship-Tracking-Maritime-Domain-Awareness/dp/B001J5MTOK/
I, [2018-08-22 14:49:12 +0400#13033] [C: 46982319187320]  INFO -- amazon_spider: Info: visits: requests: 53, responses: 53
I, [2018-08-22 14:49:12 +0400#13033] [C: 46982319187320]  INFO -- amazon_spider: Browser: driver mechanize has been destroyed

I, [2018-08-22 14:49:12 +0400#13033] [M: 46982297486840]  INFO -- amazon_spider: Spider: in_parallel: stopped processing 52 urls within 3 threads, total time: 29s
I, [2018-08-22 14:49:12 +0400#13033] [M: 46982297486840]  INFO -- amazon_spider: Browser: driver mechanize has been destroyed

I, [2018-08-22 14:49:12 +0400#13033] [M: 46982297486840]  INFO -- amazon_spider: Spider: stopped: {:spider_name=>"amazon_spider", :status=>:completed, :environment=>"development", :start_time=>2018-08-22 14:48:37 +0400, :stop_time=>2018-08-22 14:49:12 +0400, :running_time=>"35s", :visits=>{:requests=>53, :responses=>53}, :error=>nil}

books.json
[
  {
    "title": "Web Scraping with Python: Collecting More Data from the Modern Web2nd Edition",
    "url": "https://www.amazon.com/Web-Scraping-Python-Collecting-Modern/dp/1491985577/",
    "price": "$26.94",
    "publisher": "O'Reilly Media; 2 edition (April 14, 2018)",
    "position": 1
  },
  {
    "title": "Python Web Scraping Cookbook: Over 90 proven recipes to get you scraping with Python, micro services, Docker and AWS",
    "url": "https://www.amazon.com/Python-Web-Scraping-Cookbook-scraping/dp/1787285219/",
    "price": "$39.99",
    "publisher": "Packt Publishing - ebooks Account (February 9, 2018)",
    "position": 2
  },
  {
    "title": "Web Scraping with Python: Collecting Data from the Modern Web1st Edition",
    "url": "https://www.amazon.com/Web-Scraping-Python-Collecting-Modern/dp/1491910291/",
    "price": "$15.75",
    "publisher": "O'Reilly Media; 1 edition (July 24, 2015)",
    "position": 3
  },

  ...

  {
    "title": "Instant Web Scraping with Java by Ryan Mitchell (2013-08-26)",
    "url": "https://www.amazon.com/Instant-Scraping-Java-Mitchell-2013-08-26/dp/B01FEM76X2/",
    "price": "$35.82",
    "publisher": "Packt Publishing (2013-08-26) (1896)",
    "position": 52
  }
]

Note that the save_to and unique? helpers are thread-safe (protected by a Mutex) and can be freely used inside threads.

in_parallel can take additional options:

  • data: pass a custom data hash along with the urls: in_parallel(:method, urls, threads: 3, data: { category: "Scraping" })
  • delay: set a delay between requests: in_parallel(:method, urls, threads: 3, delay: 2). The delay can be an Integer, Float or Range (2..5). In the case of a Range, the delay will be chosen randomly for each request: rand(2..5) # => 3
  • engine: set a custom engine instead of the default one: in_parallel(:method, urls, threads: 3, engine: :poltergeist_phantomjs)
  • config: pass custom options to config (see the config section)
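
Putting these options together, a call might look like this (a sketch; urls and :parse_book_page come from the example spider above):

in_parallel(:parse_book_page, urls,
  threads: 3,
  data: { category: "Web Scraping Books" },
  delay: 2..5,
  engine: :poltergeist_phantomjs,
  config: { user_agent: "Mozilla/5.0 (X11; Linux x86_64)" }
)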

Active Support included

You can use all the power of the familiar Rails core-ext (Active Support) methods for scraping inside Kimurai. In particular, take a look at squish, truncate_words, titleize, remove, present? and presence.
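
For example (return values shown as comments):

"  Web   Scraping \n".squish                     # => "Web Scraping"
"Lorem ipsum dolor sit amet".truncate_words(3)   # => "Lorem ipsum dolor..."
"web scraping framework".titleize                # => "Web Scraping Framework"
"Price: $26.94".remove("Price: ")                # => "$26.94"
"".present?                                      # => false
"".presence || "N/A"                             # => "N/A"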

Schedule spiders using Cron

  1. Inside the spider directory, generate a Whenever config: $ kimurai generate schedule.
schedule.rb
### Settings ###
require 'tzinfo'

# Export current PATH to the cron
env :PATH, ENV["PATH"]

# Use 24 hour format when using `at:` option
set :chronic_options, hours24: true

# Use local_to_utc helper to setup execution time using your local timezone instead
# of server's timezone (which is probably and should be UTC, to check run `$ timedatectl`).
# Also maybe you'll want to set same timezone in kimurai as well (use `Kimurai.configuration.time_zone =` for that),
# to have spiders logs in a specific time zone format.
# Example usage of helper:
# every 1.day, at: local_to_utc("7:00", zone: "Europe/Moscow") do
#   crawl "google_spider.com", output: "log/google_spider.com.log"
# end
def local_to_utc(time_string, zone:)
  TZInfo::Timezone.get(zone).local_to_utc(Time.parse(time_string))
end

# Note: by default Whenever exports cron commands with :environment == "production".
# Note: Whenever can only append log data to a log file (>>). If you want
# to overwrite (>) log file before each run, pass lambda:
# crawl "google_spider.com", output: -> { "> log/google_spider.com.log 2>&1" }

# Project job types
job_type :crawl,  "cd :path && KIMURAI_ENV=:environment bundle exec kimurai crawl :task :output"
job_type :runner, "cd :path && KIMURAI_ENV=:environment bundle exec kimurai runner --jobs :task :output"

# Single file job type
job_type :single, "cd :path && KIMURAI_ENV=:environment ruby :task :output"
# Single with bundle exec
job_type :single_bundle, "cd :path && KIMURAI_ENV=:environment bundle exec ruby :task :output"

### Schedule ###
# Usage (check examples here https://github.com/javan/whenever#example-schedulerb-file):
# every 1.day do
  # Example to schedule a single spider in the project:
  # crawl "google_spider.com", output: "log/google_spider.com.log"

  # Example to schedule all spiders in the project using runner. Each spider will write
  # it's own output to the `log/spider_name.log` file (handled by a runner itself).
  # Runner output will be written to log/runner.log file.
  # Argument number it's a count of concurrent jobs:
  # runner 3, output:"log/runner.log"

  # Example to schedule single spider (without project):
  # single "single_spider.rb", output: "single_spider.log"
# end

### How to set a cron schedule ###
# Run: `$ whenever --update-crontab --load-file config/schedule.rb`.
# If you don't have whenever command, install the gem: `$ gem install whenever`.

### How to cancel a schedule ###
# Run: `$ whenever --clear-crontab --load-file config/schedule.rb`.

  2. Add the following code at the bottom of schedule.rb:
every 1.day, at: "7:00" do
  single "example_spider.rb", output: "example_spider.log"
end
  3. Run: $ whenever --update-crontab --load-file schedule.rb. Done!

You can check Whenever examples here. To cancel the schedule, run: $ whenever --clear-crontab --load-file schedule.rb.

Configuration options

You can configure several options using the configure block:

Kimurai.configure do |config|
  # Default logger has colored mode in development.
  # If you would like to disable it, set `colorize_logger` to false.
  # config.colorize_logger = false

  # Logger level for default logger:
  # config.log_level = :info

  # Custom logger:
  # config.logger = Logger.new(STDOUT)

  # Custom time zone (for logs):
  # config.time_zone = "UTC"
  # config.time_zone = "Europe/Moscow"

  # Provide custom chrome binary path (default is any available chrome/chromium in the PATH):
  # config.selenium_chrome_path = "/usr/bin/chromium-browser"
  # Provide custom selenium chromedriver path (default is "/usr/local/bin/chromedriver"):
  # config.chromedriver_path = "~/.local/bin/chromedriver"
end

Using Kimurai inside existing Ruby application

You can integrate Kimurai spiders (which are just Ruby classes) into an existing Ruby application like Rails or Sinatra, and run them using background jobs, for example, as in the sketch below.
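
A minimal sketch of wrapping a spider in a Rails background job (assumes ActiveJob; ExampleSpider is a spider class defined elsewhere in the app):

class ExampleSpiderJob < ApplicationJob
  queue_as :default

  def perform
    # Performs a full run of the spider; .crawl! returns run_info (see below)
    ExampleSpider.crawl!
  end
end

# Enqueue it from anywhere in the app:
# ExampleSpiderJob.perform_later

Check the following sections to understand how spiders run: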

.crawl! method

.crawl! (class method) performs a full run of a particular spider. This method returns run_info if the run was successful, or an exception if something went wrong.

class ExampleSpider < Kimurai::Base
  @name = "example_spider"
  @engine = :mechanize
  @start_urls = ["https://example.com/"]

  def parse(response, url:, data: {})
    title = response.xpath("//title").text.squish
  end
end

ExampleSpider.crawl!
# => { :spider_name => "example_spider", :status => :completed, :environment => "development", :start_time => 2018-08-22 18:20:16 +0400, :stop_time => 2018-08-22 18:20:17 +0400, :running_time => 1.216, :visits => { :requests => 1, :responses => 1 }, :items => { :sent => 0, :processed => 0 }, :error => nil }

You can't call .crawl! on a spider in a different thread if it's still running (because spider instances store some shared data in the @run_info class variable while crawling):

2.times do |i|
  Thread.new { p i, ExampleSpider.crawl! }
end # =>

# 1
# false

# 0
# {:spider_name=>"example_spider", :status=>:completed, :environment=>"development", :start_time=>2018-08-22 18:49:22 +0400, :stop_time=>2018-08-22 18:49:23 +0400, :running_time=>0.801, :visits=>{:requests=>1, :responses=>1}, :items=>{:sent=>0, :processed=>0}, :error=>nil}

So what if you don't care about stats and just want to process a request to a particular spider method and get the return value from this method? Use .parse! instead:

.parse!(:method_name, url:) method

.parse! (class method) creates a new spider instance and performs a request to the given method with the given url. The value returned from the method will be passed back:

class ExampleSpider < Kimurai::Base
  @name = "example_spider"
  @engine = :mechanize
  @start_urls = ["https://example.com/"]

  def parse(response, url:, data: {})
    title = response.xpath("//title").text.squish
  end
end

ExampleSpider.parse!(:parse, url: "https://example.com/")
# => "Example Domain"

Like .crawl!, the .parse! method takes care of the browser instance and kills it (browser.destroy_driver!) before returning the value. Unlike .crawl!, .parse! can be called from different threads at the same time:

urls = ["https://www.google.com/", "https://www.reddit.com/", "https://en.wikipedia.org/"]

urls.each do |url|
  Thread.new { p ExampleSpider.parse!(:parse, url: url) }
end # =>

# "Google"
# "Wikipedia, the free encyclopedia"
# "reddit: the front page of the internetHotHot"

Keep in mind that the save_to and unique? helpers are not thread-safe when using the .parse! method.
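
For example, here is a minimal sketch of running a spider from a background job (a hypothetical Sidekiq worker; the worker name and logging are illustrative, and any background job framework can be used the same way):

require 'sidekiq'

class ExampleCrawlWorker
  include Sidekiq::Worker
  sidekiq_options retry: false

  def perform
    # .crawl! performs a full run and returns run_info (or false if this spider is already running)
    run_info = ExampleSpider.crawl!
    logger.info "example_spider finished with status: #{run_info[:status]}" if run_info
  end
end

# Enqueue the job:
# ExampleCrawlWorker.perform_async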

Kimurai.list and Kimurai.find_by_name()

class GoogleSpider < Kimurai::Base
  @name = "google_spider"
end

class RedditSpider < Kimurai::Base
  @name = "reddit_spider"
end

class WikipediaSpider < Kimurai::Base
  @name = "wikipedia_spider"
end

# To get the list of all available spider classes:
Kimurai.list
# => {"google_spider"=>GoogleSpider, "reddit_spider"=>RedditSpider, "wikipedia_spider"=>WikipediaSpider}

# To find a particular spider class by its name:
Kimurai.find_by_name("reddit_spider")
# => RedditSpider
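
For example, a minimal sketch combining both helpers to run a spider chosen at runtime (assuming the spider classes are already loaded and fully defined with @start_urls and a parse method):

spider_class = Kimurai.find_by_name("reddit_spider")
spider_class.crawl! if spider_class

# Or run every registered spider one by one:
Kimurai.list.each_value { |spider_class| spider_class.crawl! }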

Automated server setup and deployment

EXPERIMENTAL

Setup

You can automatically set up the required environment for Kimurai on a remote server (currently only Ubuntu Server 18.04 is supported) using the $ kimurai setup command. setup will install: the latest Ruby with Rbenv, browsers with webdrivers, and in addition database clients (clients only) for MySQL, Postgres and MongoDB (so you can connect to a remote database from Ruby).

To perform remote server setup, Ansible is required on the desktop machine (to install: Ubuntu: $ sudo apt install ansible, Mac OS X: $ brew install ansible)

It's recommended to use a regular user to set up the server, not root. To create a new user, log in to the server with $ ssh root@your_server_ip, type $ adduser username to create a user, and $ gpasswd -a username sudo to add the new user to the sudo group.

Example:

$ kimurai setup [email protected] --ask-sudo --ssh-key-path path/to/private_key

CLI options:

  • --ask-sudo pass this option to ask for the sudo (user) password for system-wide installation of packages (apt install)
  • --ssh-key-path path/to/private_key authorization on the server using a private ssh key. You can omit it if the required key is already added to the keychain on your desktop (Ansible uses SSH agent forwarding)
  • --ask-auth-pass authorization on the server using the user password, an alternative to --ssh-key-path.
  • -p port_number custom port for the ssh connection (-p 2222)

You can check setup playbook here

Deploy

After a successful setup you can deploy a spider to the remote server using the $ kimurai deploy command. Each deploy performs several tasks: 1) pull the repo from the remote origin to the ~/repo_name user directory, 2) run bundle install, 3) update the crontab with whenever --update-crontab (to update the spider schedule from the schedule.rb file).

Before deploying, make sure that inside the spider directory you have: 1) a git repository with a remote origin (Bitbucket, GitHub, etc.), 2) a Gemfile, 3) schedule.rb inside the config subfolder (config/schedule.rb).

Example:

$ kimurai deploy [email protected] --ssh-key-path path/to/private_key --repo-key-path path/to/repo_private_key

CLI options: same as for the setup command (except --ask-sudo), plus

  • --repo-url provide a custom repo url (--repo-url [email protected]:username/repo_name.git), otherwise the current origin/master will be taken (output of $ git remote get-url origin)
  • --repo-key-path if the git repository is private, authorization is required to pull the code on the remote server. Use this option to provide a private repository SSH key. You can omit it if the required key is already added to the keychain on your desktop (same as with the --ssh-key-path option)

You can check deploy playbook here

Spider @config

Using @config you can set several options for a spider, like proxy, user-agent, default cookies/headers, delay between requests, browser memory control and so on:

class Spider < Kimurai::Base
  USER_AGENTS = ["Chrome", "Firefox", "Safari", "Opera"]
  PROXIES = ["2.3.4.5:8080:http:username:password", "3.4.5.6:3128:http", "1.2.3.4:3000:socks5"]

  @engine = :poltergeist_phantomjs
  @start_urls = ["https://example.com/"]
  @config = {
    headers: { "custom_header" => "custom_value" },
    cookies: [{ name: "cookie_name", value: "cookie_value", domain: ".example.com" }],
    user_agent: -> { USER_AGENTS.sample },
    proxy: -> { PROXIES.sample },
    window_size: [1366, 768],
    disable_images: true,
    restart_if: {
      # Restart browser if provided memory limit (in kilobytes) is exceeded:
      memory_limit: 350_000
    },
    before_request: {
      # Change user agent before each request:
      change_user_agent: true,
      # Change proxy before each request:
      change_proxy: true,
      # Clear all cookies and set default cookies (if provided) before each request:
      clear_and_set_cookies: true,
      # Process delay before each request:
      delay: 1..3
    }
  }

  def parse(response, url:, data: {})
    # ...
  end
end

All available @config options

@config = {
  # Custom headers, format: hash. Example: { "some header" => "some value", "another header" => "another value" }
  # Works only for :mechanize and :poltergeist_phantomjs engines (Selenium doesn't allow to set/get headers)
  headers: {},

  # Custom User Agent, format: string or lambda.
  # Use lambda if you want to rotate user agents before each run:
  # user_agent: -> { ARRAY_OF_USER_AGENTS.sample }
  # Works for all engines
  user_agent: "Mozilla/5.0 Firefox/61.0",

  # Custom cookies, format: array of hashes.
  # Format for a single cookie: { name: "cookie name", value: "cookie value", domain: ".example.com" }
  # Works for all engines
  cookies: [],

  # Proxy, format: string or lambda. Format of a proxy string: "ip:port:protocol:user:password"
  # `protocol` can be http or socks5. User and password are optional.
  # Use lambda if you want to rotate proxies before each run:
  # proxy: -> { ARRAY_OF_PROXIES.sample }
  # Works for all engines, but keep in mind that Selenium drivers don't support proxies
  # with authorization. Also, Mechanize doesn't support socks5 proxy format (only http)
  proxy: "3.4.5.6:3128:http:user:pass",

  # If enabled, browser will ignore any https errors. It's handy while using a proxy
  # with self-signed SSL cert (for example Crawlera or Mitmproxy)
  # Also, it will allow visiting webpages with an expired SSL certificate.
  # Works for all engines
  ignore_ssl_errors: true,

  # Custom window size, works for all engines
  window_size: [1366, 768],

  # Skip images downloading if true, works for all engines
  disable_images: true,

  # Selenium engines only: headless mode, `:native` or `:virtual_display` (default is :native)
  # Although native mode has a better performance, virtual display mode
  # sometimes can be useful. For example, some websites can detect (and block)
  # headless chrome, so you can use virtual_display mode instead
  headless_mode: :native,

  # This option tells the browser not to use a proxy for the provided list of domains or IP addresses.
  # Format: array of strings. Works only for :selenium_firefox and selenium_chrome
  proxy_bypass_list: [],

  # Option to provide custom SSL certificate. Works only for :poltergeist_phantomjs and :mechanize
  ssl_cert_path: "path/to/ssl_cert",

  # Inject some JavaScript code to the browser.
  # Format: array of strings, where each string is a path to JS file.
  # Works only for poltergeist_phantomjs engine (Selenium doesn't support JS code injection)
  extensions: ["lib/code_to_inject.js"],

  # Automatically skip duplicated (already visited) urls when using `request_to` method.
  # Possible values: `true` or `hash` with options.
  # In case of `true`, all visited urls will be added to the storage's scope `:requests_urls`,
  # and if a url is already present in this scope, the request will be skipped.
  # You can configure this setting by providing additional options as a hash:
  # `skip_duplicate_requests: { scope: :custom_scope, check_only: true }`, where:
  # `scope:` - use a custom scope instead of `:requests_urls`
  # `check_only:` - if true, the scope will only be checked for the url; the url will not
  # be added to the scope.
  # Works for all drivers
  skip_duplicate_requests: true,

  # Automatically skip provided errors while requesting a page.
  # If raised error matches one of the errors in the list, then this error will be caught,
  # and request will be skipped.
  # It is a good idea to skip errors like NotFound(404), etc.
  # Format: array where elements are error classes or/and hashes. You can use hash format
  # for more flexibility: `{ error: "RuntimeError", message: "404 => Net::HTTPNotFound" }`.
  # Provided `message:` will be compared with a full error message using `String#include?`. Also
  # you can use regex instead: `{ error: "RuntimeError", message: /404|403/ }`.
  skip_request_errors: [{ error: RuntimeError, message: "404 => Net::HTTPNotFound" }],

  # Automatically retry provided errors with a few attempts while requesting a page.
  # If raised error matches one of the errors in the list, then this error will be caught
  # and the request will be processed again within a delay. There are 3 attempts:
  # first: delay 15 sec, second: delay 30 sec, third: delay 45 sec.
  # If after 3 attempts there is still an exception, then the exception will be raised.
  # It is a good idea to try to retry errors like `ReadTimeout`, `HTTPBadGateway`, etc.
  # Format: same like for `skip_request_errors` option.
  retry_request_errors: [Net::ReadTimeout],

  # Handle page encoding while parsing html response using Nokogiri. There are two modes:
  # Auto (`:auto`): try to fetch the correct encoding from <meta http-equiv="Content-Type"> or <meta charset> tags.
  # Manual: set the required encoding explicitly, for example `encoding: "GB2312"`.
  # By default this option is unset.
  encoding: nil,

  # Restart browser if one of the options is true:
  restart_if: {
    # Restart browser if provided memory limit (in kilobytes) is exceeded (works for all engines)
    memory_limit: 350_000,

    # Restart browser if provided requests limit is exceeded (works for all engines)
    requests_limit: 100
  },

  # Perform several actions before each request:
  before_request: {
    # Change proxy before each request. The `proxy:` option above should be present
    # and have the lambda format. Works only for poltergeist and mechanize engines
    # (Selenium doesn't support proxy rotation).
    change_proxy: true,

    # Change user agent before each request. The `user_agent:` option above should be present
    # and have the lambda format. Works only for poltergeist and mechanize engines
    # (Selenium doesn't support getting/setting headers).
    change_user_agent: true,

    # Clear all cookies before each request, works for all engines
    clear_cookies: true,

    # If you want to clear all cookies + set custom cookies (`cookies:` option above should be presented)
    # use this option instead (works for all engines)
    clear_and_set_cookies: true,

    # Global option to set delay between requests.
    # Delay can be `Integer`, `Float` or `Range` (`2..5`). In case of a range,
    # delay number will be chosen randomly for each request: `rand(2..5) # => 3`
    delay: 1..3
  }
}

As you can see, most of the options are universal for any engine.

@config settings inheritance

Settings can be inherited:

class ApplicationSpider < Kimurai::Base
  @engine = :poltergeist_phantomjs
  @config = {
    user_agent: "Firefox",
    disable_images: true,
    restart_if: { memory_limit: 350_000 },
    before_request: { delay: 1..2 }
  }
end

class CustomSpider < ApplicationSpider
  @name = "custom_spider"
  @start_urls = ["https://example.com/"]
  @config = {
    before_request: { delay: 4..6 }
  }

  def parse(response, url:, data: {})
    # ...
  end
end

Here, the @config of CustomSpider will be deep-merged with the ApplicationSpider config, so CustomSpider will keep all inherited options with only the delay updated.
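
For illustration, the effective (deep-merged) @config of CustomSpider from the example above would look like this:

{
  user_agent: "Firefox",
  disable_images: true,
  restart_if: { memory_limit: 350_000 },
  before_request: { delay: 4..6 } # only the delay is overridden
}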

Project mode

Kimurai can work in project mode (like Scrapy). To generate a new project, run: $ kimurai generate project web_spiders (where web_spiders is the name of the project).

Structure of the project:

.
├── config/
│   ├── initializers/
│   ├── application.rb
│   ├── automation.yml
│   ├── boot.rb
│   └── schedule.rb
├── spiders/
│   └── application_spider.rb
├── db/
├── helpers/
│   └── application_helper.rb
├── lib/
├── log/
├── pipelines/
│   ├── validator.rb
│   └── saver.rb
├── tmp/
├── .env
├── Gemfile
├── Gemfile.lock
└── README.md
Description
  • config/ folder for configuration files
    • config/initializers Rails-like initializers to load custom code at start of framework
    • config/application.rb configuration settings for Kimurai (Kimurai.configure do block)
    • config/automation.yml specify some settings for setup and deploy
    • config/boot.rb loads framework and project
    • config/schedule.rb Cron schedule for spiders
  • spiders/ folder for spiders
    • spiders/application_spider.rb Base parent class for all spiders
  • db/ store here all database files (sqlite, json, csv, etc.)
  • helpers/ Rails-like helpers for spiders
    • helpers/application_helper.rb all methods inside the ApplicationHelper module will be available for all spiders (see the sketch after this list)
  • lib/ put here custom Ruby code
  • log/ folder for logs
  • pipelines/ folder for Scrapy-like pipelines. One file = one pipeline
    • pipelines/validator.rb example pipeline to validate item
    • pipelines/saver.rb example pipeline to save item
  • tmp/ folder for temporary files
  • .env file to store ENV variables for project and load them using Dotenv
  • Gemfile dependency file
  • README.md example project readme
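
A minimal sketch of the helpers/application_helper.rb mentioned above (the clean_price method is just an illustration, not part of the generated template):

# helpers/application_helper.rb
module ApplicationHelper
  # Any method defined here is available inside all project spiders:
  def clean_price(string)
    string.to_s.gsub(/[^\d.]/, "").to_f
  end
end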

Generate new spider

To generate a new spider in the project, run:

$ kimurai generate spider example_spider
      create  spiders/example_spider.rb

Command will generate a new spider class inherited from ApplicationSpider:

class ExampleSpider < ApplicationSpider
  @name = "example_spider"
  @start_urls = []
  @config = {}

  def parse(response, url:, data: {})
  end
end

Crawl

To run a particular spider in the project, run: $ bundle exec kimurai crawl example_spider. Don't forget to add bundle exec before the command to load the required environment.

List

To list all project spiders, run: $ bundle exec kimurai list

Parse

For project spiders you can use the $ kimurai parse command, which helps to debug spiders:

$ bundle exec kimurai parse example_spider parse_product --url https://example-shop.com/product-1

where example_spider is the spider to run, parse_product is the spider method to process, and --url is the url to open inside the processing method.

Pipelines, send_item method

You can use item pipelines to organize and store item-processing logic for all project spiders in one place (also check Scrapy's description of pipelines).

Imagine you have three spiders, each of which crawls a different e-commerce shop and saves only shoe items. For each spider, you want to save only items with the "shoe" category, a unique sku, a valid title/price, and existing images. To avoid code duplication between spiders, use pipelines:

Example

pipelines/validator.rb

class Validator < Kimurai::Pipeline
  def process_item(item, options: {})
    # Here you can validate item and raise `DropItemError`
    # if one of the validations failed. Examples:

    # Drop item if its category is not "shoe":
    if item[:category] != "shoe"
      raise DropItemError, "Wrong item category"
    end

    # Check item sku for uniqueness using the built-in unique? helper:
    unless unique?(:sku, item[:sku])
      raise DropItemError, "Item sku is not unique"
    end

    # Drop item if the title is shorter than 5 characters:
    if item[:title].size < 5
      raise DropItemError, "Item title is short"
    end

    # Drop item if price is not present
    unless item[:price].present?
      raise DropItemError, "item price is not present"
    end

    # Drop item if it doesn't contain any images:
    unless item[:images].present?
      raise DropItemError, "Item images are not present"
    end

    # Pass item to the next pipeline (if it wasn't dropped):
    item
  end
end

pipelines/saver.rb

class Saver < Kimurai::Pipeline
  def process_item(item, options: {})
    # Here you can save item to the database, send it to a remote API or
    # simply save item to a file format using `save_to` helper:

    # To get the name of current spider: `spider.class.name`
    save_to "db/#{spider.class.name}.json", item, format: :json

    item
  end
end

spiders/application_spider.rb

class ApplicationSpider < Kimurai::Base
  @engine = :selenium_chrome
  # Define pipelines (by order) for all spiders:
  @pipelines = [:validator, :saver]
end

spiders/shop_spider_1.rb

class ShopSpiderOne < ApplicationSpider
  @name = "shop_spider_1"
  @start_urls = ["https://shop-1.com"]

  # ...

  def parse_product(response, url:, data: {})
    # ...

    # Send item to pipelines:
    send_item item
  end
end

spiders/shop_spider_2.rb

class ShopSpiderTwo < ApplicationSpider
  @name = "shop_spider_2"
  @start_urls = ["https://shop-2.com"]

  def parse_product(response, url:, data: {})
    # ...

    # Send item to pipelines:
    send_item item
  end
end

spiders/shop_spider_3.rb

class ShopSpiderThree < ApplicationSpider
  @name = "shop_spider_3"
  @start_urls = ["https://shop-3.com"]

  def parse_product(response, url:, data: {})
    # ...

    # Send item to pipelines:
    send_item item
  end
end

When you start using pipelines, stats for items appear:

Example

pipelines/validator.rb

class Validator < Kimurai::Pipeline
  def process_item(item, options: {})
    if item[:star_count] < 10
      raise DropItemError, "Repository doesn't have enough stars"
    end

    item
  end
end

spiders/github_spider.rb

class GithubSpider < ApplicationSpider
  @name = "github_spider"
  @engine = :selenium_chrome
  @pipelines = [:validator]
  @start_urls = ["https://github.com/search?q=Ruby%20Web%20Scraping"]
  @config = {
    user_agent: "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.36",
    before_request: { delay: 4..7 }
  }

  def parse(response, url:, data: {})
    response.xpath("//ul[@class='repo-list']/div//h3/a").each do |a|
      request_to :parse_repo_page, url: absolute_url(a[:href], base: url)
    end

    if next_page = response.at_xpath("//a[@class='next_page']")
      request_to :parse, url: absolute_url(next_page[:href], base: url)
    end
  end

  def parse_repo_page(response, url:, data: {})
    item = {}

    item[:owner] = response.xpath("//h1//a[@rel='author']").text
    item[:repo_name] = response.xpath("//h1/strong[@itemprop='name']/a").text
    item[:repo_url] = url
    item[:description] = response.xpath("//span[@itemprop='about']").text.squish
    item[:tags] = response.xpath("//div[@id='topics-list-container']/div/a").map { |a| a.text.squish }
    item[:watch_count] = response.xpath("//ul[@class='pagehead-actions']/li[contains(., 'Watch')]/a[2]").text.squish.delete(",").to_i
    item[:star_count] = response.xpath("//ul[@class='pagehead-actions']/li[contains(., 'Star')]/a[2]").text.squish.delete(",").to_i
    item[:fork_count] = response.xpath("//ul[@class='pagehead-actions']/li[contains(., 'Fork')]/a[2]").text.squish.delete(",").to_i
    item[:last_commit] = response.xpath("//span[@itemprop='dateModified']/*").text

    send_item item
  end
end
$ bundle exec kimurai crawl github_spider

I, [2018-08-22 15:56:35 +0400#1358] [M: 47347279209980]  INFO -- github_spider: Spider: started: github_spider
D, [2018-08-22 15:56:35 +0400#1358] [M: 47347279209980] DEBUG -- github_spider: BrowserBuilder (selenium_chrome): created browser instance
I, [2018-08-22 15:56:40 +0400#1358] [M: 47347279209980]  INFO -- github_spider: Browser: started get request to: https://github.com/search?q=Ruby%20Web%20Scraping
I, [2018-08-22 15:56:44 +0400#1358] [M: 47347279209980]  INFO -- github_spider: Browser: finished get request to: https://github.com/search?q=Ruby%20Web%20Scraping
I, [2018-08-22 15:56:44 +0400#1358] [M: 47347279209980]  INFO -- github_spider: Info: visits: requests: 1, responses: 1
D, [2018-08-22 15:56:44 +0400#1358] [M: 47347279209980] DEBUG -- github_spider: Browser: driver.current_memory: 116182
D, [2018-08-22 15:56:44 +0400#1358] [M: 47347279209980] DEBUG -- github_spider: Browser: sleep 5 seconds before request...

I, [2018-08-22 15:56:49 +0400#1358] [M: 47347279209980]  INFO -- github_spider: Browser: started get request to: https://github.com/lorien/awesome-web-scraping
I, [2018-08-22 15:56:50 +0400#1358] [M: 47347279209980]  INFO -- github_spider: Browser: finished get request to: https://github.com/lorien/awesome-web-scraping
I, [2018-08-22 15:56:50 +0400#1358] [M: 47347279209980]  INFO -- github_spider: Info: visits: requests: 2, responses: 2
D, [2018-08-22 15:56:50 +0400#1358] [M: 47347279209980] DEBUG -- github_spider: Browser: driver.current_memory: 217432
D, [2018-08-22 15:56:50 +0400#1358] [M: 47347279209980] DEBUG -- github_spider: Pipeline: starting processing item through 1 pipeline...
I, [2018-08-22 15:56:50 +0400#1358] [M: 47347279209980]  INFO -- github_spider: Pipeline: processed: {"owner":"lorien","repo_name":"awesome-web-scraping","repo_url":"https://github.com/lorien/awesome-web-scraping","description":"List of libraries, tools and APIs for web scraping and data processing.","tags":["awesome","awesome-list","web-scraping","data-processing","python","javascript","php","ruby"],"watch_count":159,"star_count":2423,"fork_count":358,"last_commit":"4 days ago"}
I, [2018-08-22 15:56:50 +0400#1358] [M: 47347279209980]  INFO -- github_spider: Info: items: sent: 1, processed: 1
D, [2018-08-22 15:56:50 +0400#1358] [M: 47347279209980] DEBUG -- github_spider: Browser: sleep 6 seconds before request...

...

I, [2018-08-22 16:11:50 +0400#1358] [M: 47347279209980]  INFO -- github_spider: Browser: started get request to: https://github.com/preston/idclight
I, [2018-08-22 16:11:51 +0400#1358] [M: 47347279209980]  INFO -- github_spider: Browser: finished get request to: https://github.com/preston/idclight
I, [2018-08-22 16:11:51 +0400#1358] [M: 47347279209980]  INFO -- github_spider: Info: visits: requests: 140, responses: 140
D, [2018-08-22 16:11:51 +0400#1358] [M: 47347279209980] DEBUG -- github_spider: Browser: driver.current_memory: 211713

D, [2018-08-22 16:11:51 +0400#1358] [M: 47347279209980] DEBUG -- github_spider: Pipeline: starting processing item through 1 pipeline...
E, [2018-08-22 16:11:51 +0400#1358] [M: 47347279209980] ERROR -- github_spider: Pipeline: dropped: #<Kimurai::Pipeline::DropItemError: Repository doesn't have enough stars>, item: {:owner=>"preston", :repo_name=>"idclight", :repo_url=>"https://github.com/preston/idclight", :description=>"A Ruby gem for accessing the freely available IDClight (IDConverter Light) web service, which convert between different types of gene IDs such as Hugo and Entrez. Queries are screen scraped from http://idclight.bioinfo.cnio.es.", :tags=>[], :watch_count=>6, :star_count=>1, :fork_count=>0, :last_commit=>"on Apr 12, 2012"}

I, [2018-08-22 16:11:51 +0400#1358] [M: 47347279209980]  INFO -- github_spider: Info: items: sent: 127, processed: 12

I, [2018-08-22 16:11:51 +0400#1358] [M: 47347279209980]  INFO -- github_spider: Browser: driver selenium_chrome has been destroyed
I, [2018-08-22 16:11:51 +0400#1358] [M: 47347279209980]  INFO -- github_spider: Spider: stopped: {:spider_name=>"github_spider", :status=>:completed, :environment=>"development", :start_time=>2018-08-22 15:56:35 +0400, :stop_time=>2018-08-22 16:11:51 +0400, :running_time=>"15m, 16s", :visits=>{:requests=>140, :responses=>140}, :items=>{:sent=>127, :processed=>12}, :error=>nil}

Also, you can pass custom options to a pipeline from a particular spider if you want to change the pipeline behavior for that spider:

Example

spiders/custom_spider.rb

class CustomSpider < ApplicationSpider
  @name = "custom_spider"
  @start_urls = ["https://example.com"]
  @pipelines = [:validator]

  # ...

  def parse_item(response, url:, data: {})
    # ...

    # Pass custom option `skip_uniq_checking` for Validator pipeline:
    send_item item, validator: { skip_uniq_checking: true }
  end
end

pipelines/validator.rb

class Validator < Kimurai::Pipeline
  def process_item(item, options: {})

    # Do not check item sku for uniqueness if options[:skip_uniq_checking] is true
    if options[:skip_uniq_checking] != true
      raise DropItemError, "Item sku is not unique" unless unique?(:sku, item[:sku])
    end

    # Pass item to the next pipeline:
    item
  end
end

Runner

You can run project spiders one by one or in parallel using $ kimurai runner command:

$ bundle exec kimurai list
custom_spider
example_spider
github_spider

$ bundle exec kimurai runner -j 3
>>> Runner: started: {:id=>1533727423, :status=>:processing, :start_time=>2018-08-08 15:23:43 +0400, :stop_time=>nil, :environment=>"development", :concurrent_jobs=>3, :spiders=>["custom_spider", "github_spider", "example_spider"]}
> Runner: started spider: custom_spider, index: 0
> Runner: started spider: github_spider, index: 1
> Runner: started spider: example_spider, index: 2
< Runner: stopped spider: custom_spider, index: 0
< Runner: stopped spider: example_spider, index: 2
< Runner: stopped spider: github_spider, index: 1
<<< Runner: stopped: {:id=>1533727423, :status=>:completed, :start_time=>2018-08-08 15:23:43 +0400, :stop_time=>2018-08-08 15:25:11 +0400, :environment=>"development", :concurrent_jobs=>3, :spiders=>["custom_spider", "github_spider", "example_spider"]}

Each spider runs in a separate process. Spider logs are available in the log/ folder. Pass the -j option to specify how many spiders should be processed at the same time (default is 1).

You can provide additional arguments like --include or --exclude to specify which spiders to run:

# Run only custom_spider and example_spider:
$ bundle exec kimurai runner --include custom_spider example_spider

# Run all except github_spider:
$ bundle exec kimurai runner --exclude github_spider

Runner callbacks

You can perform custom actions before the runner starts and after the runner stops using config.runner_at_start_callback and config.runner_at_stop_callback. Check config/application.rb to see an example.
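
For example, a minimal sketch of such callbacks inside config/application.rb (assuming each callback is a lambda that receives the runner info hash shown in the runner output above):

Kimurai.configure do |config|
  config.runner_at_start_callback = lambda do |info|
    # For example, notify a monitoring service that the runner started:
    puts "Runner started, spiders: #{info[:spiders].join(', ')}"
  end

  config.runner_at_stop_callback = lambda do |info|
    puts "Runner stopped with status: #{info[:status]}"
  end
end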

Chat Support and Feedback

Will be updated

License

The gem is available as open source under the terms of the MIT License.

kimuraframework's People

Contributors

matias-eduardo, vifreefly, zhustec


kimuraframework's Issues

Selenium Chrome Heroku

Hi Everyone,

Is there any way to solve Heroku's problem with Selenium Chrome engine with Kimurai?

I need to use

@engine = :selenium_chrome

I use buildpacks for Chrome on Heroku, but I still get an error related to the file path of Chrome and couldn't find any way to define it in spider settings.

Error message from Heroku:

Selenium::WebDriver::Error::WebDriverError: not a file: "/usr/local/bin/chromedriver"

How do I click on something that isn't a link?

I'm trying to click on an SVG element – it is not a button or a link, so Capybara doesn't like it.

I can query for the element using element = response.css('.available') but I can't seem to click on it using

browser.click(element) – I think it would work if it were a capybara element, not a nokogiri element.

How do I get from the Capybara::Session to a Capybara::Page or Capybara::Node so I can execute .click on it?

How to set language?

I am unable to set the language for Selenium. According to my understanding, these two options are not supported:
@config = {
  headers: { "Accept-Language" => "de-DE" }
}

or

options.add_argument("--lang=de-DE")

uninitialized constant Downloader::MovieEnum

I created movie_enum.rb in ./lib/

It is not loaded when I start bundle exec kimurai crawl movie_spider

So, I added the following code to config/initializers/boot.rb under the pipelines require:

# require lib
Dir.glob(File.join("./lib", "*.rb"), &method(:require))

it works.

Can't run within a test suite that is using Capybara

This is an interesting challenge:

When attempting to exercise a crawler within the context of a test suite which also runs Capybara for system specs, we run into the problem that both the test suite and kimurai are trying to configure Capybara.

If Kimurai runs first then my system specs fail because, e.g., Kimurai is specifying xpath as the default selector.

If my system specs run first then the specs using kimurai fail because:

"Threadsafe setting cannot be changed once a session is created".

I wonder if these are just incompatible or if there's a way around this?

How to pass argument to Spider

Hi,

First of all, thanks for your hard work, Kimurai really helps me a lot. I'm using it in a Sinatra app and controlling it using web requests. But I just can't find a proper way to pass data (or args) to a spider.

I'm using proxies, and I have a proxy class with a .fetch method which returns a new proxy. What I want to do is pass this proxy to my spider. If I include my proxy class inside the spider class, the proxy.fetch method only works once. That's not what I want; I want every run to use a different proxy.

Is there way to pass some arguments when calling spider like ExampleSpider.crawl!(foo, bar) ?

Broken load balancers support

Hi,

I've been trying to set up a scraper for a website that is behind a load balancer. The thing is, out of 10 requests, there's 1 request that goes to a bad backend and stalls at SSL negotiation.

I can't find a way to reduce the Mechanize read timeout (same with selenium_chrome). From Stack Overflow, this can be done as in the following example:

agent = Mechanize.new { |a| a.log = Logger.new("mech.log") }
agent.keep_alive=false
agent.open_timeout=15
agent.read_timeout=15

Is there any way to pass those parameters to Mechanize?

How to download a file? Alternatively, how to pass custom opts to the driver?

Hi there. I'm testing kimurai to try and automate a daily download of a bank statement. I've already managed to get the login working from the console and clicking on the button which fires the download (using browser.click_on). The file gets downloaded but I haven't found any way to control where it gets downloaded.

I found this example on downloading a file with selenium but kimurai doesn't seem to have any "official" way for me to use any custom configuration on the driver.

Do you have any recommendations on how to proceed?

Include Docker support

It's easy to get up and running using Docker (no need to install a bunch of dependencies on a system that you don't know about).

I got Docker working using the following files:

#Dockerfile
FROM ruby:2.5.3-stretch
RUN gem install kimurai
RUN apt-get update && apt install -q -y git unzip wget tar openssl xvfb chromium \
                                        firefox-esr libsqlite3-dev sqlite3 mysql-client default-libmysqlclient-dev

RUN cd /tmp && \
    wget https://chromedriver.storage.googleapis.com/2.39/chromedriver_linux64.zip && \
    unzip chromedriver_linux64.zip -d /usr/local/bin && \
    rm -f chromedriver_linux64.zip

RUN cd /tmp && \
    wget https://github.com/mozilla/geckodriver/releases/download/v0.21.0/geckodriver-v0.21.0-linux64.tar.gz && \
    tar -xvzf geckodriver-v0.21.0-linux64.tar.gz -C /usr/local/bin && \
    rm -f geckodriver-v0.21.0-linux64.tar.gz

RUN apt install -q -y chrpath libxft-dev libfreetype6 libfreetype6-dev libfontconfig1 libfontconfig1-dev && \
    cd /tmp && \
    wget https://bitbucket.org/ariya/phantomjs/downloads/phantomjs-2.1.1-linux-x86_64.tar.bz2 && \
    tar -xvjf phantomjs-2.1.1-linux-x86_64.tar.bz2 && \
    mv phantomjs-2.1.1-linux-x86_64 /usr/local/lib && \
    ln -s /usr/local/lib/phantomjs-2.1.1-linux-x86_64/bin/phantomjs /usr/local/bin && \
    rm -f phantomjs-2.1.1-linux-x86_64.tar.bz2

RUN mkdir -p /app

ADD Gemfile /app

RUN cd /app && bundle install

ENTRYPOINT ["kimurai"]

And its docker-compose.yml:

# 'extends' is not supported in version 3
version: '2'

services:

  base:
    build: ./
    entrypoint: /bin/bash
    working_dir: /app
    volumes:
      - ./:/app

  irb:
    extends: base
    entrypoint: irb
    volumes:
      - ./:/app

  kimurai:
    extends: base
    entrypoint: bundle exec kimurai
    volumes:
      - ./:/app

  crawl:
    extends: kimurai
    command: crawl
    volumes:
      - ./:/app

Ruby 2.7.x obsolete warnings

Ruby 2.7.x triggers obsolete warnings for URI.escape. Could we consider using the well-maintained Addressable gem (https://github.com/sporkmonger/addressable) or using a different approach like CGI::escape or ERB::Util.url_encode?

I can create a pull request if that helps.

/usr/local/Cellar/rbenv/1.1.2/versions/2.7.1/lib/ruby/gems/2.7.0/gems/kimurai-1.4.0/lib/kimurai/base_helper.rb:7: warning: URI.escape is obsolete

Thanks.

helper is not loaded

I created google_spider.rb in ./helpers/

module GoogleHelper
  def time2int(time)
    time.to_i
  end
end

run console:

kimurai console --url https://www.google.com
[1] pry(#<ApplicationSpider>)> time2int(Time.now)

return:

NoMethodError: undefined method `time2int' for #<ApplicationSpider:0x00007fc655206ed0>
Did you mean?  timeout
from (pry):1:in `console'

How to limit the search depth level?

Like other scraping frameworks, e.g. Colly in Go:

c := colly.NewCollector(
		// MaxDepth is 1, so only the links on the scraped page
		// is visited, and no further links are followed
		colly.MaxDepth(1),
	)

request_to method throws argument error for Ruby 3.0

Hello,

First, thank you for maintaining this fantastic framework.

I set up a spider pretty much identically to the one in the README. I wrote a parse function with the same arguments as those specified in the README as well (response, url:, data: {})

In my first parse function I used the request_to method to route urls to a second parse function, which had the same arguments as the first.

I got the following error: wrong number of arguments (given 2, expected 1; required keyword: url) (ArgumentError)

I'm running Ruby 3.0.1.

I believe there may be an issue with the use of keyword arguments in the request_to method related to Ruby 3.0. The spider works fine when I visit the url using the browser object and call the second parse function directly.

This appears to be similar to the related issue with rbcat

I'm relatively new to Ruby, so I apologize in advance for any inaccuracies!

Running on Ubuntu 20.04 gives chromedriver error

I wrote a scraper that works just fine on macOS. When I run the same scraper on Ubuntu 20.04, I get an error:

/home/username/.rvm/gems/ruby-2.5.3/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/common/service.rb:104:in `start': already started: #<URI::HTTP http://127.0.0.1:9515/> "/usr/local/bin/chromedriver" (RuntimeError)

To make sure it wasn't just my own script, I tried running:

$ kimurai console --engine selenium_chrome --url https://github.com/vifreefly/kimuraframework

This also failed with the same error:

D, [2020-11-13 15:34:54 -0500#7086] [M: 47354020103620] DEBUG -- : BrowserBuilder (selenium_chrome): created browser instance
D, [2020-11-13 15:34:54 -0500#7086] [M: 47354020103620] DEBUG -- : BrowserBuilder (selenium_chrome): enabled native headless_mode
I, [2020-11-13 15:34:54 -0500#7086] [M: 47354020103620]  INFO -- : Browser: started get request to: https://github.com/vifreefly/kimuraframework
Traceback (most recent call last):
	22: from /home/username/.rvm/gems/ruby-2.5.3/bin/ruby_executable_hooks:24:in `<main>'
	21: from /home/username/.rvm/gems/ruby-2.5.3/bin/ruby_executable_hooks:24:in `eval'
	20: from /home/username/.rvm/gems/ruby-2.5.3/bin/kimurai:23:in `<main>'
	19: from /home/username/.rvm/gems/ruby-2.5.3/bin/kimurai:23:in `load'
	18: from /home/username/.rvm/gems/ruby-2.5.3/gems/kimurai-1.4.0/exe/kimurai:6:in `<top (required)>'
	17: from /home/username/.rvm/gems/ruby-2.5.3/gems/thor-0.20.3/lib/thor/base.rb:466:in `start'
	16: from /home/username/.rvm/gems/ruby-2.5.3/gems/thor-0.20.3/lib/thor.rb:387:in `dispatch'
	15: from /home/username/.rvm/gems/ruby-2.5.3/gems/thor-0.20.3/lib/thor/invocation.rb:126:in `invoke_command'
	14: from /home/username/.rvm/gems/ruby-2.5.3/gems/thor-0.20.3/lib/thor/command.rb:27:in `run'
	13: from /home/username/.rvm/gems/ruby-2.5.3/gems/kimurai-1.4.0/lib/kimurai/cli.rb:123:in `console'
	12: from /home/username/.rvm/gems/ruby-2.5.3/gems/kimurai-1.4.0/lib/kimurai/base.rb:201:in `request_to'
	11: from /home/username/.rvm/gems/ruby-2.5.3/gems/kimurai-1.4.0/lib/kimurai/capybara_ext/session.rb:52:in `visit'
	10: from /home/username/.rvm/gems/ruby-2.5.3/gems/kimurai-1.4.0/lib/kimurai/capybara_ext/session.rb:51:in `ensure in visit'
	 9: from /home/username/.rvm/gems/ruby-2.5.3/gems/kimurai-1.4.0/lib/kimurai/capybara_ext/driver/base.rb:16:in `current_memory'
	 8: from /home/username/.rvm/gems/ruby-2.5.3/gems/kimurai-1.4.0/lib/kimurai/capybara_ext/selenium/driver.rb:28:in `pid'
	 7: from /home/username/.rvm/gems/ruby-2.5.3/gems/kimurai-1.4.0/lib/kimurai/capybara_ext/selenium/driver.rb:32:in `port'
	 6: from /home/username/.rvm/gems/ruby-2.5.3/gems/capybara-3.13.2/lib/capybara/selenium/driver.rb:32:in `browser'
	 5: from /home/username/.rvm/gems/ruby-2.5.3/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver.rb:88:in `for'
	 4: from /home/username/.rvm/gems/ruby-2.5.3/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/common/driver.rb:46:in `for'
	 3: from /home/username/.rvm/gems/ruby-2.5.3/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/common/driver.rb:46:in `new'
	 2: from /home/username/.rvm/gems/ruby-2.5.3/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/chrome/driver.rb:40:in `initialize'
	 1: from /home/username/.rvm/gems/ruby-2.5.3/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/common/driver.rb:303:in `service_url'
/home/username/.rvm/gems/ruby-2.5.3/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/common/service.rb:104:in `start': already started: #<URI::HTTP http://127.0.0.1:9515/> "/usr/local/bin/chromedriver" (RuntimeError)

Again, this error only occurs on Ubuntu 20.04. When I try the same command on macOS, it works perfectly. I followed the Ubuntu 18.04 installation instructions in the README file, with the exception that I have installed the most recent versions of chromedriver and geckodriver like so:

cd /tmp && wget https://chromedriver.storage.googleapis.com/87.0.4280.20/chromedriver_linux64.zip
sudo unzip chromedriver_linux64.zip -d /usr/local/bin
cd /tmp && wget https://github.com/mozilla/geckodriver/releases/download/v0.28.0/geckodriver-v0.28.0-linux64.tar.gz
sudo tar -xvzf geckodriver-v0.28.0-linux64.tar.gz -C /usr/local/bin
rm -f geckodriver-v0.28.0-linux64.tar.gz

The reason I did this was because other people reported similar errors and suggested upgrading chromedriver to eliminate them. This did not fix the issue for me, but I figured that it's better to have more recent versions anyway.

When I looked in htop to see why this error was occurring, I noticed that on Ubuntu, there were multiple instances of chromedriver opening during this command, whereas in macOS only one instance of it opened. I am not sure why this is happening, but I suspect it is related to the error because it's apparently complaining about more than one chromedriver instance being open.

How to set encoding?

When the website is encoded with GB2312, the content of the website cannot be obtained normally.

I think it would be better to change

    def current_response(response_type = :html)
      case response_type
      when :html
        Nokogiri::HTML(body)
      when :json
        JSON.parse(body)
      end
    end

TO

    def current_response(response_type = :html)
      case response_type
      when :html
        Nokogiri::HTML(body,nil,@config[:encoding])
      when :json
        JSON.parse(body)
      end
    end

OR

    def current_response(response_type = :html)
      case response_type
      when :html
        Nokogiri::HTML(body.force_encoding("encoding"))
      when :json
        JSON.parse(body)
      end
    end

Pass response callback as block

The one thing that has bothered me about Scrapy is that callbacks can't be given inline to show the visual hierarchy of pages scraped. Ruby however has blocks. Could we do something like this?

request url: 'http://example.com' do |response|
    response.xpath("//a[@class='next_page']").each do |next_link|
        request url: next_link do |response2|
            #etc
        end
    end
end

Link here from archived project?

Hi, thanks for your work on this project, it's really nice to work with. One thing that keeps bothering me is that when you google 'kimurai' the top result is this archived project. Not sure why that happened, but it's probably a good idea to link here from there so people don't think the project is dead.

Kimurai in RoR with Devise - Get current_user in parse - Pass user info to parse - Completed status in `parse!`

I am trying the Kimurai framework for Rails with Devise.

Inside parse I want to:

  • get the current_user (devise session), or
  • somehow I can pass the user info to parse, or
  • I can use the parse! which can return the items array but in this case how can I know if the process completed ok or with errors? When using crawl! it returns the response which contains this info...

Any ideas on how can I do this?

Thank you

Error when installing on Linux

Error when installing on ubuntu
Logs:

root@f7f25d74ee8e:/Users/toan/Desktop/ruby/main-backend# kimurai setup localhost --local 

PLAY [all] *****************************************************************************************************************************************************************************

TASK [Gathering Facts] *****************************************************************************************************************************************************************
ok: [localhost]

TASK [Update apt cache] ****************************************************************************************************************************************************************
changed: [localhost]

TASK [Install base packages] ***********************************************************************************************************************************************************
[DEPRECATION WARNING]: Invoking "apt" only once while using a loop via squash_actions is deprecated. Instead of using a loop to supply multiple items and specifying `pkg: "{{ item 
}}"`, please use `pkg: ['xvfb', 'libsqlite3-dev', 'sqlite3', 'mongodb-clients', 'mysql-client', 'libmysqlclient-dev', 'postgresql-client', 'libpq-dev']` and remove the loop. This 
feature will be removed in version 2.11. Deprecation warnings can be disabled by setting deprecation_warnings=False in ansible.cfg.
failed: [localhost] (item=['xvfb', 'libsqlite3-dev', 'sqlite3', 'mongodb-clients', 'mysql-client', 'libmysqlclient-dev', 'postgresql-client', 'libpq-dev']) => {"changed": false, "item": ["xvfb", "libsqlite3-dev", "sqlite3", "mongodb-clients", "mysql-client", "libmysqlclient-dev", "postgresql-client", "libpq-dev"], "msg": "No package matching 'mongodb-clients' is available"}
        to retry, use: --limit @/usr/local/bundle/gems/kimurai-1.4.0/lib/kimurai/automation/setup.retry

PLAY RECAP *****************************************************************************************************************************************************************************
localhost                  : ok=2    changed=1    unreachable=0    failed=1   

How to skip SocketError?

I, [2018-10-23T12:10:35.535202 #301]  INFO -- : Info: items: sent: 28, processed: 26
D, [2018-10-23T12:10:35.535471 #301] DEBUG -- : Browser: sleep 0.4 seconds before request...
I, [2018-10-23T12:10:35.935690 #301]  INFO -- : Browser: started get request to: https://www.video.com/?q=xxx&p=9
I, [2018-10-23T12:11:05.967685 #301]  INFO -- : Info: visits: requests: 38, responses: 37
I, [2018-10-23T12:11:05.967887 #301]  INFO -- : Browser: driver mechanize has been destroyed

Spider: stopped: {:spider_name=>"videos_spider", :status=>:failed, :error=>"#<SocketError: Failed to open TCP connection to www.video.com:443 (getaddrinfo: Name or service not known)>", :environment=>"production", :start_time=>2018-10-23 04:31:46 +0000, :stop_time=>2018-10-23 12:11:05 +0000, :running_time=>"7h, 39m", :visits=>{:requests=>38, :responses=>37}, :items=>{:sent=>28, :processed=>26}, :events=>{:requests_errors=>{}, :drop_items_errors=>{"#<Kimurai::Pipeline::DropItemError: Item download error.>"=>2}, :custom=>{}}}

I want to skip this error and keep the spider crawling to the next page.

My spider's config:

@config = {
    skip_request_errors: [
      { error: RuntimeError, message: "404 => Net::HTTPNotFound" },
      { error: Net::HTTPNotFound, message: "404 => Net::HTTPNotFound" },
      { error: Down::ConnectionError, message: "Down::ConnectionError, Item Dropped." },
      { error: Net::OpenTimeout, message: "Net::OpenTimeout, Item Dropped." },
      { error: SocketError, message: "SocketError, Item Dropped." },
    ],
    before_request: {
      delay: 0.4
    }
  }

How to retry or skip Net::HTTP::Persistent::Error?

Hello
I sometimes get an error and want to skip it

Net::HTTP::Persistent::Error: too many connection resets (due to Net::ReadTimeout with #<TCPSocket:(closed)> - Net::ReadTimeout) after 0 requests on 47268817494080, last used 1585742984.5058694 seconds ago
/home/aleksandra/.rvm/gems/ruby-2.6.5/gems/rack-mini-profiler-2.0.1/lib/patches/net_patches.rb:9:in `block in request_with_mini_profiler'
/home/aleksandra/.rvm/gems/ruby-2.6.5/gems/rack-mini-profiler-2.0.1/lib/mini_profiler/profiling_methods.rb:39:in `step'
/home/aleksandra/.rvm/gems/ruby-2.6.5/gems/rack-mini-profiler-2.0.1/lib/patches/net_patches.rb:8:in `request_with_mini_profiler'
/home/aleksandra/.rvm/gems/ruby-2.6.5/gems/skylight-core-4.2.3/lib/skylight/core/probes/net_http.rb:55:in `block in request'
/home/aleksandra/.rvm/gems/ruby-2.6.5/gems/skylight-core-4.2.3/lib/skylight/core/fanout.rb:25:in `instrument'
/home/aleksandra/.rvm/gems/ruby-2.6.5/gems/skylight-core-4.2.3/lib/skylight/core/probes/net_http.rb:48:in `request'
/home/aleksandra/.rvm/gems/ruby-2.6.5/gems/net-http-persistent-3.1.0/lib/net/http/persistent.rb:964:in `block in request'
/home/aleksandra/.rvm/gems/ruby-2.6.5/gems/net-http-persistent-3.1.0/lib/net/http/persistent.rb:662:in `connection_for'
/home/aleksandra/.rvm/gems/ruby-2.6.5/gems/net-http-persistent-3.1.0/lib/net/http/persistent.rb:958:in `request'
/home/aleksandra/.rvm/gems/ruby-2.6.5/gems/mechanize-2.7.6/lib/mechanize/http/agent.rb:280:in `fetch'
/home/aleksandra/.rvm/gems/ruby-2.6.5/gems/mechanize-2.7.6/lib/mechanize.rb:464:in `get'
/home/aleksandra/.rvm/gems/ruby-2.6.5/gems/capybara-mechanize-1.11.0/lib/capybara/mechanize/browser.rb:131:in `process_remote_request'
/home/aleksandra/.rvm/gems/ruby-2.6.5/gems/capybara-mechanize-1.11.0/lib/capybara/mechanize/browser.rb:47:in `block (2 levels) in <class:Browser>'
/home/aleksandra/.rvm/gems/ruby-2.6.5/gems/capybara-3.31.0/lib/capybara/rack_test/browser.rb:68:in `process'
/home/aleksandra/.rvm/gems/ruby-2.6.5/gems/capybara-3.31.0/lib/capybara/rack_test/browser.rb:43:in `process_and_follow_redirects'
/home/aleksandra/.rvm/gems/ruby-2.6.5/gems/capybara-3.31.0/lib/capybara/rack_test/browser.rb:23:in `visit'
/home/aleksandra/.rvm/gems/ruby-2.6.5/gems/capybara-3.31.0/lib/capybara/rack_test/driver.rb:45:in `visit'
/home/aleksandra/.rvm/gems/ruby-2.6.5/gems/capybara-3.31.0/lib/capybara/session.rb:278:in `visit'
/home/aleksandra/.rvm/gems/ruby-2.6.5/gems/kimurai-1.4.0/lib/kimurai/capybara_ext/session.rb:21:in `visit'
/home/aleksandra/.rvm/gems/ruby-2.6.5/gems/kimurai-1.4.0/lib/kimurai/base.rb:201:in `request_to'
/home/aleksandra/.rvm/gems/ruby-2.6.5/gems/kimurai-1.4.0/lib/kimurai/base.rb:162:in `parse!'
/home/aleksandra/Projects/hub/app/models/job_ad.rb:80:in `block in scrape_job_ads'
/home/aleksandra/Projects/hub/app/models/job_ad.rb:78:in `each'
/home/aleksandra/Projects/hub/app/models/job_ad.rb:78:in `scrape_job_ads'
/home/aleksandra/Projects/hub/lib/tasks/scheduler.rake:78:in `block (2 levels) in <main>'
/home/aleksandra/.rvm/gems/ruby-2.6.5/gems/rake-13.0.1/exe/rake:27:in `<top (required)>'
/home/aleksandra/.rvm/gems/ruby-2.6.5/bin/ruby_executable_hooks:24:in `eval'
/home/aleksandra/.rvm/gems/ruby-2.6.5/bin/ruby_executable_hooks:24:in `<main>'

When I added this error to the @config:

    before_request: { delay: 120..180 },
    skip_request_errors: [{ error: Net::HTTP::Persistent::Error }],
    retry_request_errors: [
      { error: RuntimeError, message: '520', skip_on_failure: true }
    ]
  }

I got NameError: uninitialized constant Net::HTTP::Persistent.
Is it possible to skip this error and continue?

Crawl in Sidekiq - Selenium::WebDriver::Error::WebDriverError: not a file: "./bin/chromedriver

I try to run a crawler via a Sidekiq job on my DigitalOcean droplet, but it always fails with the error Selenium::WebDriver::Error::WebDriverError: not a file: "./bin/chromedriver". At the same time I can run crawl! via the rails console and it works well, and it also works well via Sidekiq on my local machine. I defined chromedriver_path in the Kimurai initializer: config.chromedriver_path = Rails.root.join('lib', 'webdrivers', 'chromedriver_83').to_s
Logs of the Sidekiq job, which I also started via the rails console with FekoCrawlWorker.perform_async:

Jun 29 19:43:26 aquacraft sidekiq[7201]: 2020-06-29T19:43:26.602Z 7201 TID-ou13yz8xx FekoCrawlWorker JID-7d134b4ee9407973d7803f0b INFO: start
Jun 29 19:43:26 aquacraft sidekiq[7201]: I, [2020-06-29 19:43:26 +0000#7201] [C: 70059979631140]  #033[36mINFO -- feko_spider:#033[0m Spider: started: feko_spider
Jun 29 19:43:26 aquacraft sidekiq[7201]: D, [2020-06-29 19:43:26 +0000#7201] [C: 70059979631140] #033[32mDEBUG -- feko_spider:#033[0m BrowserBuilder (selenium_chrome): created browser instance
Jun 29 19:43:26 aquacraft sidekiq[7201]: D, [2020-06-29 19:43:26 +0000#7201] [C: 70059979631140] #033[32mDEBUG -- feko_spider:#033[0m BrowserBuilder (selenium_chrome): enabled native headless_mode
Jun 29 19:43:26 aquacraft sidekiq[7201]: I, [2020-06-29 19:43:26 +0000#7201] [C: 70059979631140]  #033[36mINFO -- feko_spider:#033[0m Browser: started get request to: https://feko.com.ua/shop/category/kotly/gazovye-kotly331/page/1
Jun 29 19:43:26 aquacraft sidekiq[7201]: 2020-06-29 19:43:26 WARN Selenium [DEPRECATION] :driver_path is deprecated. Use :service with an instance of Selenium::WebDriver::Service instead.
Jun 29 19:43:26 aquacraft sidekiq[7201]: I, [2020-06-29 19:43:26 +0000#7201] [C: 70059979631140]  #033[36mINFO -- feko_spider:#033[0m Info: visits: requests: 1, responses: 0
Jun 29 19:43:26 aquacraft sidekiq[7201]: 2020-06-29 19:43:26 WARN Selenium [DEPRECATION] :driver_path is deprecated. Use :service with an instance of Selenium::WebDriver::Service instead.
Jun 29 19:43:26 aquacraft sidekiq[7201]: I, [2020-06-29 19:43:26 +0000#7201] [C: 70059979631140]  #033[36mINFO -- feko_spider:#033[0m Browser: driver selenium_chrome has been destroyed
Jun 29 19:43:26 aquacraft sidekiq[7201]: F, [2020-06-29 19:43:26 +0000#7201] [C: 70059979631140] #033[1;31mFATAL -- feko_spider:#033[0m Spider: stopped: {#033[35m:spider_name#033[0m=>#033[33m"feko_spider"#033[0m, #033[35m:status#033[0m=>:failed, #033[35m:error#033[0m=>#033[33m"#<Selenium::WebDriver::Error::WebDriverError: not a file: \"./bin/chromedriver\">"#033[0m, #033[35m:environment#033[0m=>#033[33m"development"#033[0m, #033[35m:start_time#033[0m=>#033[36m2020#033[0m-06-29 19:43:26 +0000, #033[35m:stop_time#033[0m=>#033[36m2020#033[0m-06-29 19:43:26 +0000, #033[35m:running_time#033[0m=>#033[33m"0s"#033[0m, #033[35m:visits#033[0m=>{#033[35m:requests#033[0m=>#033[36m1#033[0m, #033[35m:responses#033[0m=>#033[36m0#033[0m}, #033[35m:items#033[0m=>{#033[35m:sent#033[0m=>#033[36m0#033[0m, #033[35m:processed#033[0m=>#033[36m0#033[0m}, #033[35m:events#033[0m=>{#033[35m:requests_errors#033[0m=>{}, #033[35m:drop_items_errors#033[0m=>{}, #033[35m:custom#033[0m=>{}}}
Jun 29 19:43:26 aquacraft sidekiq[7201]: 2020-06-29T19:43:26.607Z 7201 TID-ou13yz8xx FekoCrawlWorker JID-7d134b4ee9407973d7803f0b INFO: fail: 0.006 sec
Jun 29 19:43:26 aquacraft sidekiq[7201]: 2020-06-29T19:43:26.608Z 7201 TID-ou13yz8xx WARN: {"context":"Job raised exception","job":{"class":"FekoCrawlWorker","args":[],"retry":false,"queue":"default","backtrace":true,"jid":"7d134b4ee9407973d7803f0b","created_at":1593459806.6006012,"enqueued_at":1593459806.6006787},"jobstr":"{\"class\":\"FekoCrawlWorker\",\"args\":[],\"retry\":false,\"queue\":\"default\",\"backtrace\":true,\"jid\":\"7d134b4ee9407973d7803f0b\",\"created_at\":1593459806.6006012,\"enqueued_at\":1593459806.6006787}"}
Jun 29 19:43:26 aquacraft sidekiq[7201]: 2020-06-29T19:43:26.608Z 7201 TID-ou13yz8xx WARN: Selenium::WebDriver::Error::WebDriverError: not a file: "./bin/chromedriver"
Jun 29 19:43:26 aquacraft sidekiq[7201]: 2020-06-29T19:43:26.608Z 7201 TID-ou13yz8xx WARN: /home/deploy/aquacraft/shared/bundle/ruby/2.6.0/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/common/platform.rb:136:in `assert_file'
Jun 29 19:43:26 aquacraft sidekiq[7201]: /home/deploy/aquacraft/shared/bundle/ruby/2.6.0/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/common/platform.rb:140:in `assert_executable'
Jun 29 19:43:26 aquacraft sidekiq[7201]: /home/deploy/aquacraft/shared/bundle/ruby/2.6.0/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/common/service.rb:138:in `binary_path'
Jun 29 19:43:26 aquacraft sidekiq[7201]: /home/deploy/aquacraft/shared/bundle/ruby/2.6.0/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/common/service.rb:94:in `initialize'
Jun 29 19:43:26 aquacraft sidekiq[7201]: /home/deploy/aquacraft/shared/bundle/ruby/2.6.0/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/common/service.rb:41:in `new'
Jun 29 19:43:26 aquacraft sidekiq[7201]: /home/deploy/aquacraft/shared/bundle/ruby/2.6.0/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/common/service.rb:41:in `chrome'
Jun 29 19:43:26 aquacraft sidekiq[7201]: /home/deploy/aquacraft/shared/bundle/ruby/2.6.0/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/common/driver.rb:299:in `service_url'
Jun 29 19:43:26 aquacraft sidekiq[7201]: /home/deploy/aquacraft/shared/bundle/ruby/2.6.0/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/chrome/driver.rb:40:in `initialize'
Jun 29 19:43:26 aquacraft sidekiq[7201]: /home/deploy/aquacraft/shared/bundle/ruby/2.6.0/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/common/driver.rb:46:in `new'
Jun 29 19:43:26 aquacraft sidekiq[7201]: /home/deploy/aquacraft/shared/bundle/ruby/2.6.0/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/common/driver.rb:46:in `for'
Jun 29 19:43:26 aquacraft sidekiq[7201]: /home/deploy/aquacraft/shared/bundle/ruby/2.6.0/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver.rb:88:in `for'
Jun 29 19:43:26 aquacraft sidekiq[7201]: /home/deploy/aquacraft/shared/bundle/ruby/2.6.0/gems/capybara-2.18.0/lib/capybara/selenium/driver.rb:23:in `browser'
Jun 29 19:43:26 aquacraft sidekiq[7201]: /home/deploy/aquacraft/shared/bundle/ruby/2.6.0/gems/kimurai-1.4.0/lib/kimurai/capybara_ext/selenium/driver.rb:32:in `port'
Jun 29 19:43:26 aquacraft sidekiq[7201]: /home/deploy/aquacraft/shared/bundle/ruby/2.6.0/gems/kimurai-1.4.0/lib/kimurai/capybara_ext/selenium/driver.rb:28:in `pid'
Jun 29 19:43:26 aquacraft sidekiq[7201]: /home/deploy/aquacraft/shared/bundle/ruby/2.6.0/gems/kimurai-1.4.0/lib/kimurai/capybara_ext/driver/base.rb:16:in `current_memory'
Jun 29 19:43:26 aquacraft sidekiq[7201]: /home/deploy/aquacraft/shared/bundle/ruby/2.6.0/gems/kimurai-1.4.0/lib/kimurai/capybara_ext/session.rb:51:in `ensure in visit'
Jun 29 19:43:26 aquacraft sidekiq[7201]: /home/deploy/aquacraft/shared/bundle/ruby/2.6.0/gems/kimurai-1.4.0/lib/kimurai/capybara_ext/session.rb:52:in `visit'
Jun 29 19:43:26 aquacraft sidekiq[7201]: /home/deploy/aquacraft/shared/bundle/ruby/2.6.0/gems/kimurai-1.4.0/lib/kimurai/base.rb:201:in `request_to'
Jun 29 19:43:26 aquacraft sidekiq[7201]: /home/deploy/aquacraft/shared/bundle/ruby/2.6.0/gems/kimurai-1.4.0/lib/kimurai/base.rb:128:in `block in crawl!'
Jun 29 19:43:26 aquacraft sidekiq[7201]: /home/deploy/aquacraft/shared/bundle/ruby/2.6.0/gems/kimurai-1.4.0/lib/kimurai/base.rb:124:in `each'
Jun 29 19:43:26 aquacraft sidekiq[7201]: /home/deploy/aquacraft/shared/bundle/ruby/2.6.0/gems/kimurai-1.4.0/lib/kimurai/base.rb:124:in `crawl!'
Jun 29 19:43:26 aquacraft sidekiq[7201]: /home/deploy/aquacraft/releases/20200627190630/app/workers/feko_crawl_worker.rb:9:in `perform'

Sidekiq worker code:

require 'sidekiq-scheduler'

class FekoCrawlWorker
  include Sidekiq::Worker

  sidekiq_options retry: false, backtrace: true, queue: 'default'

  def perform
    Crawlers::Feko.crawl!
  end
end
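
The exception itself says Selenium could not find the driver binary at the relative path ./bin/chromedriver. When the spider runs inside Sidekiq, that relative path is resolved against the worker process's working directory, which is usually not the directory that contains the driver. One workaround sketch (Rails.root here is an assumption; pointing the project config at an absolute chromedriver path avoids the problem entirely):

  def perform
    # Dir.chdir is process-wide, so only do this if no other concurrently
    # running job depends on the current working directory
    Dir.chdir(Rails.root) { Crawlers::Feko.crawl! }
  end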

Wrap the Nokogiri response to reduce boilerplate

Currently, a Nokogiri object is passed as an argument to a callback. This results in boilerplate, since common operations (extracting text, formatting results, and so on) have to be repeated in every callback.

If, instead of the raw response, a wrapper object were supplied, we could decorate it with some nice utility functions and let it carry the URL. Wrapping the object would allow defining custom selectors besides css and xpath, such as regex, or a composite of any of these. It would also remove the need to specify whether a single element or multiple elements should be extracted, similar to extract() and extract_first() on Scrapy's scrapy.Selector.

# in your scraper class
def parse_product_list_page(product_list, url:, data: {})
    product_ids = product_list.regex /"id":([0-9]+)\,/
end
#Page.rb
require 'forwardable'

class Page
  extend Forwardable

  def initialize(response, browser)
    @response = response
    @browser = browser
  end

  # get the current HTML page (fresh)
  def refresh
    @response = @browser.current_response
    self
  end

  #
  # extract methods
  #

  # general purpose entrypoint
  def extract(expression, multi: true, async: false)
    if async
      extract_on_ready(expression, multi: multi)
    elsif multi
      extract_all(expression)
    else
      extract_single(expression)
    end
  end

  # extract first element
  def extract_single(expression, **opts)
    extract_all(expression, **opts).first
  end

  # TODO: wrap results so we can apply a new expression on the subset
  def extract_all(expression, wrap: false)
    query = SelectorExpression.instance(expression)
    # self.send calls the delegated xpath() and css() functions, based on the type of the selector wrapper object ("expression"), which defaults to css
    Array(self.send(query.type, query.to_s))
  end

  def extract_on_ready(expression, multi: true, retries: 3, wait: 1, default: nil)
    retries.times do
      result = extract(expression, multi: multi, async: false)
      case result
      when Nokogiri::XML::Element
        return result
      when Nokogiri::XML::NodeSet, Array
        return result if !result.empty?
      end
      sleep wait
      refresh
    end
    default
  end

  #
  # Nokogiri wrapping
  #

  # delegate functions to the response object so this Page object responds to all classic parsing and selection functions
  def_delegators :@response, :xpath, :css, :text, :children

  def regex(selector)
    # accept either a Regexp or a pattern string
    pattern = selector.is_a?(Regexp) ? selector : Regexp.new(selector.to_s)
    @response.text.scan(pattern)
  end

end

Beyond this, it could be a consideration to also wrap the results of xpath() and css() calls, so we would have the same utility functions when doing a subquery:

page.xpath('//').css('.items').regex(/my-regex/)

in_parallel: undefined method `call' for "app":String (NoMethodError)

An error occurred when I used the in_parallel method

This is the example:

# amazon_spider.rb
require 'kimurai'

class AmazonSpider < Kimurai::Base
  @name = "amazon_spider"
  @engine = :mechanize
  @start_urls = ["https://www.amazon.com/"]

  def parse(response, url:, data: {})
    browser.fill_in "field-keywords", with: "Web Scraping Books"
    browser.click_on "Go"

    # Walk through pagination and collect products urls:
    urls = []
    loop do
      response = browser.current_response
      response.xpath("//li//a[contains(@class, 's-access-detail-page')]").each do |a|
        urls << a[:href].sub(/ref=.+/, "")
      end

      browser.find(:xpath, "//a[@id='pagnNextLink']", wait: 1).click rescue break
    end

    # Process all collected urls concurrently within 3 threads:
    in_parallel(:parse_book_page, urls, threads: 3)
  end

  def parse_book_page(response, url:, data: {})
    item = {}

    item[:title] = response.xpath("//h1/span[@id]").text.squish
    item[:url] = url
    item[:price] = response.xpath("(//span[contains(@class, 'a-color-price')])[1]").text.squish.presence
    item[:publisher] = response.xpath("//h2[text()='Product details']/following::b[text()='Publisher:']/following-sibling::text()[1]").text.squish.presence

    save_to "books.json", item, format: :pretty_json
  end
end

AmazonSpider.crawl!

This is the error info:

I, [2019-01-17 10:25:33 +0800#12757] [M: 47339757413960]  INFO -- amazon_spider: Spider: started: amazon_spider
D, [2019-01-17 10:25:34 +0800#12757] [M: 47339757413960] DEBUG -- amazon_spider: BrowserBuilder (mechanize): created browser instance
I, [2019-01-17 10:25:34 +0800#12757] [M: 47339757413960]  INFO -- amazon_spider: Browser: started get request to: https://www.amazon.com/
I, [2019-01-17 10:25:38 +0800#12757] [M: 47339757413960]  INFO -- amazon_spider: Browser: finished get request to: https://www.amazon.com/
I, [2019-01-17 10:25:38 +0800#12757] [M: 47339757413960]  INFO -- amazon_spider: Info: visits: requests: 1, responses: 1
I, [2019-01-17 10:25:48 +0800#12757] [M: 47339757413960]  INFO -- amazon_spider: Spider: in_parallel: starting processing 63 urls within 3 threads
D, [2019-01-17 10:25:48 +0800#12757] [C: 47339781353100] DEBUG -- amazon_spider: BrowserBuilder (mechanize): created browser instance
I, [2019-01-17 10:25:48 +0800#12757] [C: 47339781353100]  INFO -- amazon_spider: Browser: started get request to: /gp/slredirect/picassoRedirect.html/
I, [2019-01-17 10:25:48 +0800#12757] [C: 47339781353100]  INFO -- amazon_spider: Info: visits: requests: 2, responses: 1
I, [2019-01-17 10:25:48 +0800#12757] [C: 47339781353100]  INFO -- amazon_spider: Browser: driver mechanize has been destroyed
#<Thread:0x0000561c4db3dd18@/home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/kimurai-1.3.2/lib/kimurai/base.rb:295 run> terminated with exception (report_on_exception is true):
Traceback (most recent call last):
        14: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/kimurai-1.3.2/lib/kimurai/base.rb:301:in `block (2 levels) in in_parallel'
        13: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/kimurai-1.3.2/lib/kimurai/base.rb:301:in `each'
        12: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/kimurai-1.3.2/lib/kimurai/base.rb:309:in `block (3 levels) in in_parallel'
        11: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/kimurai-1.3.2/lib/kimurai/base.rb:197:in `request_to'
        10: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/kimurai-1.3.2/lib/kimurai/capybara_ext/session.rb:21:in `visit'
         9: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/capybara-3.12.0/lib/capybara/session.rb:265:in `visit'
         8: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/capybara-3.12.0/lib/capybara/rack_test/driver.rb:45:in `visit'
         7: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/capybara-3.12.0/lib/capybara/rack_test/browser.rb:23:in `visit'
         6: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/capybara-3.12.0/lib/capybara/rack_test/browser.rb:43:in `process_and_follow_redirects'
         5: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/capybara-3.12.0/lib/capybara/rack_test/browser.rb:65:in `process'
         4: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/capybara-mechanize-1.11.0/lib/capybara/mechanize/browser.rb:50:in `block (2 levels) in <class:Browser>'
         3: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/rack-test-1.1.0/lib/rack/test.rb:58:in `get'
         2: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/rack-test-1.1.0/lib/rack/test.rb:129:in `custom_request'
         1: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/rack-test-1.1.0/lib/rack/test.rb:266:in `process_request'
/home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/rack-test-1.1.0/lib/rack/mock_session.rb:29:in `request': undefined method `call' for "app":String (NoMethodError)
I, [2019-01-17 10:25:48 +0800#12757] [M: 47339757413960]  INFO -- amazon_spider: Browser: driver mechanize has been destroyed
F, [2019-01-17 10:25:48 +0800#12757] [M: 47339757413960] FATAL -- amazon_spider: Spider: stopped: {:spider_name=>"amazon_spider", :status=>:failed, :error=>"#<NoMethodError: undefined method `call' for \"app\":String>", :environment=>"development", :start_time=>2019-01-17 10:25:33 +0800, :stop_time=>2019-01-17 10:25:48 +0800, :running_time=>"15s", :visits=>{:requests=>2, :responses=>1}, :items=>{:sent=>0, :processed=>0}, :events=>{:requests_errors=>{}, :drop_items_errors=>{}, :custom=>{}}}
Traceback (most recent call last):
        14: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/kimurai-1.3.2/lib/kimurai/base.rb:301:in `block (2 levels) in in_parallel'
        13: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/kimurai-1.3.2/lib/kimurai/base.rb:301:in `each'
        12: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/kimurai-1.3.2/lib/kimurai/base.rb:309:in `block (3 levels) in in_parallel'
        11: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/kimurai-1.3.2/lib/kimurai/base.rb:197:in `request_to'
        10: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/kimurai-1.3.2/lib/kimurai/capybara_ext/session.rb:21:in `visit'
         9: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/capybara-3.12.0/lib/capybara/session.rb:265:in `visit'
         8: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/capybara-3.12.0/lib/capybara/rack_test/driver.rb:45:in `visit'
         7: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/capybara-3.12.0/lib/capybara/rack_test/browser.rb:23:in `visit'
         6: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/capybara-3.12.0/lib/capybara/rack_test/browser.rb:43:in `process_and_follow_redirects'
         5: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/capybara-3.12.0/lib/capybara/rack_test/browser.rb:65:in `process'
         4: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/capybara-mechanize-1.11.0/lib/capybara/mechanize/browser.rb:50:in `block (2 levels) in <class:Browser>'
         3: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/rack-test-1.1.0/lib/rack/test.rb:58:in `get'
         2: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/rack-test-1.1.0/lib/rack/test.rb:129:in `custom_request'
         1: from /home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/rack-test-1.1.0/lib/rack/test.rb:266:in `process_request'
/home/mc/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/rack-test-1.1.0/lib/rack/mock_session.rb:29:in `request': undefined method `call' for "app":String (NoMethodError)
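
Judging from the log above, the crash happens because the collected product links are relative (e.g. /gp/slredirect/picassoRedirect.html/): the mechanize engine is built on rack-test, which treats a relative path as a request to a local Rack app, and here that "app" is just a placeholder string, hence undefined method `call'. A likely fix (a sketch, not verified against this exact setup) is to make the URLs absolute before handing them to in_parallel:

# inside parse, when collecting the links:
response.xpath("//li//a[contains(@class, 's-access-detail-page')]").each do |a|
  urls << absolute_url(a[:href].sub(/ref=.+/, ""), base: url)
end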

absolute_url corrupts url escaping it

I have a URL like this: https://www.example.com/path?query_param=N%2CU. The absolute_url method:

def absolute_url(url, base:)
  return unless url
  URI.join(base, URI.escape(url)).to_s
end

escapes it again, so it becomes https://www.example.com/path?query_param=N%252CU, corrupting the URL and breaking the spider's link following. What about adding an argument to absolute_url in order to skip escaping?
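
A minimal sketch of the proposed signature (the escape: keyword is hypothetical, not part of Kimurai today):

def absolute_url(url, base:, escape: true)
  return unless url
  url = URI.escape(url) if escape
  URI.join(base, url).to_s
end

# usage: keep an already-encoded query string intact
absolute_url("/path?query_param=N%2CU", base: "https://www.example.com/", escape: false)
# => "https://www.example.com/path?query_param=N%2CU"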

Update Readme to include 'lsof' aptfile

It took me hours to figure this out so I want to help anyone else having trouble getting this running on Heroku.

Kimurai uses the lsof command, so you need to install the apt heroku buildpack to support lsof. Follow the directions described on the buildpack page. You basically need to create an Aptfile with the single line lsof and include it in the root folder along with adding the heroku buildpack. Can you add this to the docs? Thanks!
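
Concretely, the Aptfile in the project root only needs the single line below, plus the apt buildpack (https://github.com/heroku/heroku-buildpack-apt) added to the app:

lsof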

Skip request error after retry

I have a site that times out sometimes. I configured @config to retry those errors and to skip them if they still fail, since I would like the spider to keep going. However, it seems the skip_request_errors option drops the request immediately. Is there a way to make retry_request_errors and skip_request_errors work together, so that errors are only skipped once the retries have been exhausted?
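
For reference, the combination in question looks roughly like this (the error classes are placeholders for whatever the timeouts actually raise); today the skip rule appears to win immediately instead of only after the retries run out:

@config = {
  # retry transient errors a few times...
  retry_request_errors: [Net::ReadTimeout],
  # ...and, ideally, only skip the request once those retries are exhausted
  skip_request_errors: [{ error: Net::ReadTimeout }]
}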

Issues with using skip_request_errors

I am trying to use the configuration provided to skip 404 errors, but instead I am getting a RuntimeError raised. Perhaps this is the intended behaviour, but I was expecting to get false, an empty object, or something similar. Let me know if I misunderstood the functionality. Here is the configuration:

# frozen_string_literal: true

require 'kimurai'

module Spiders
  class Test < Kimurai::Base
    @name                = 'test_spider'
    @disable_images      = true
    @engine              = :mechanize
    @skip_request_errors = [
      { error: RuntimeError }
    ]

    def parse(response, url:, data: {})
    end
  end
end

If I then run it with Spiders::Test.parse!(:parse, url: 'https://google.com/asdfsdf'), I get back this error:

BrowserBuilder (mechanize): created browser instance
Browser: started get request to: https://google.com/asdfsdf
Browser: driver mechanize has been destroyed
Traceback (most recent call last):
        2: from (irb):2
        1: from (irb):2:in `rescue in irb_binding'
RuntimeError (Received the following error for a GET request to https://google.com/asdfsdf: '404 => Net::HTTPNotFound for https://google.com/asdfsdf -- unhandled response')

Am I doing something wrong, or is that the expected behaviour? I also tried this for the configuration:
{ error: RuntimeError, message: '404 => Net::HTTPNotFound' }

Create directories before saving item

It seems that currently nothing equivalent to mkdir -p is run when items are saved for the first time.

Errno::ENOENT: No such file or directory @ rb_sysopen - ./results/spidername/.categories.json
  /usr/local/bundle/gems/kimurai-1.2.0/lib/kimurai/base/saver.rb:64:in `initialize'
  /usr/local/bundle/gems/kimurai-1.2.0/lib/kimurai/base/saver.rb:64:in `open'
  /usr/local/bundle/gems/kimurai-1.2.0/lib/kimurai/base/saver.rb:64:in `save_to_pretty_json'
  /usr/local/bundle/gems/kimurai-1.2.0/lib/kimurai/base/saver.rb:31:in `block in save'
  /usr/local/bundle/gems/kimurai-1.2.0/lib/kimurai/base/saver.rb:23:in `synchronize'
  /usr/local/bundle/gems/kimurai-1.2.0/lib/kimurai/base/saver.rb:23:in `save'
  /usr/local/bundle/gems/kimurai-1.2.0/lib/kimurai/base.rb:236:in `save_to'
  /app/spiders/application_spider.rb:66:in `create_category'
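
Until the saver creates directories itself, a workaround sketch is to create the target directory before the first save_to call (the spider method and paths below are illustrative, based on the backtrace above):

require 'fileutils'

def create_category(response, url:, data: {})
  item = { name: response.xpath("//h1").text.squish } # illustrative item

  # work around the missing mkdir -p by creating the directory up front
  FileUtils.mkdir_p("./results/spidername")
  save_to "./results/spidername/categories.json", item, format: :pretty_json
end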

Setting desired_capabilities

I'm running a crawler with the selenium_firefox engine, but I got the following exception: Selenium::WebDriver::Error::InsecureCertificateError. As described at https://github.com/SeleniumHQ/selenium/wiki/Ruby-Bindings#ssl-certificates, this is something that needs to be configured on Firefox.

Is it possible to set the option desired_capabilities through Kimurai?
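
In plain selenium-webdriver (3.x) terms, the setting being asked about looks like the sketch below; whether and how Kimurai can pass it through to the driver it builds is exactly the open question here.

require 'selenium-webdriver'

# accept_insecure_certs avoids InsecureCertificateError on self-signed certificates
capabilities = Selenium::WebDriver::Remote::Capabilities.firefox(accept_insecure_certs: true)
driver = Selenium::WebDriver.for(:firefox, desired_capabilities: capabilities)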

Why class instead of instances?

Genuinely curious: it seems a bit unusual, as it's not straightforward to change the start_urls at runtime (if I understood correctly, class instance variables are not thread-safe, so if I change them at runtime they might wreak havoc in something like Sidekiq?).

Better support for testing

As I was writing tests for the scraper I've made, I realised it's not super straightforward at the moment. It would be great to improve on that front:

  1. Add testing section to the documentation, showcasing how to set it up and test in Rails for example
  2. Expand global configuration options. I would have liked to be able to disable the delay globally in the test environment, instead of doing this in every scraper I write: @config = { before_request: { delay: 1..2 } } unless Rails.env.test? (see the sketch after this list)
  3. Add automatic detection of the test environment. Currently I have to manually set it in the rails_helper: ENV['KIMURAI_ENV'] ||= 'test'
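
A minimal sketch of the workaround behind items 2 and 3, assuming a Rails test suite and an ApplicationSpider base class for shared config:

# spec/rails_helper.rb
ENV['KIMURAI_ENV'] ||= 'test'

# app/spiders/application_spider.rb
class ApplicationSpider < Kimurai::Base
  # apply the politeness delay everywhere except the test environment
  @config = { before_request: { delay: 1..2 } } unless ENV['KIMURAI_ENV'] == 'test'
end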

Support custom `max_retries`

I crawl a web site whose network is very unreliable; I have to refresh 5 times or even more to get a normal response. I tried to configure retry_request_errors, but found that it only retries 3 times.

def visit(visit_uri, delay: config.before_request[:delay], skip_request_options: false, max_retries: 3)

It would be great to support a custom max_retries.
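
For example, something along these lines (the max_retries key is hypothetical; today the limit is the hard-coded default in the visit signature above):

@config = {
  retry_request_errors: [Net::ReadTimeout],
  # hypothetical option: raise the retry limit from the hard-coded 3
  max_retries: 5
}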

Allow declarative page & item definitions

Though just using the Scrapy-like callbacks is easy and straightforward to code, it would be extra nice to have a higher level of abstraction so we could write scrapers declaratively. This would remove boilerplate, separate selector logic from page-navigation logic, and additionally allow graceful handling of unexpected or unsupported page types that would otherwise crash the scraper (absent explicit error handling).

For example, it would be nice to be able to do this:

class YourSpider < ApplicationSpider
  ...
  item :product do
    text    :name, '#ProdPageTitle'
    int     :ean, '#ProdPageProdCode' do |r|
      r[/([0-9]+)/]
    end
    async do
      array   :images, combi(css('#ShopProdImagesNew img.ShopProdThumbImg'), xpath('@src'))
      text    :description, '#ProdPageTabsWrap #tab1'
      custom  :specs, '#ProdPageProperties > span' do |r|
        r.to_a.in_groups_of(2).map{|s| {
          name: s[0].text,
          value: s[1].text
        }}
      end
    end
  end
end

The block contains invocations of (predefined) field types, each given a name, one or more selectors, and optionally a block for post-processing the Nokogiri result. Every field accepts an async argument that specifies whether the element is rendered by JavaScript. The async block marks every field inside it as async, meaning that browser.current_response is queried a few times with a timeout until extraction of the specified element succeeds (i.e. once the page has actually rendered):

def extract_on_ready(expression, multi: true, retries: 3, wait: 1, default: nil)
  retries.times do
    # the extract function determines the type of the selector expression so it knows
    # whether to call xpath() or css() on the Nokogiri object.
    result = extract(expression, multi: multi, async: false)
    case result
    when Nokogiri::XML::Element
      return result
    when Nokogiri::XML::NodeSet, Array
      return result if !result.empty?
    end
    sleep wait
    refresh # self.response = browser.current_response
  end
  default
end

Because the selectors are declaratively defined, the expression type has to be given (and defaults to 'css'), since the Nokogiri css and xpath methods are called indirectly.

css('#ShopProdImagesNew img.ShopProdThumbImg')

This design allows for the following, given a parse_item() function that would extract all fields from the defined item of the same name as the current inline handler:

class YourSpider < ApplicationSpider
  # start from start_urls
  request_start do |category_list, **opts1|
    request_all :product_list, urls: category_list.css(".css-selector1") do |product_list, **opts2|
      for link in product_list.css(".css-selector2")
        request_in :product, url: link do |product, **opts3|
          save_to "results.json", parse_item(), format: :pretty_json
        end
      end
    end
  end
end

You could even go further. If you move the result file definition to the item definition, then the inline callback handler could automatically and implicitly extract the item:

class YourSpider < ApplicationSpider
  ...
  # defined item with result file definition passed by hash
  item :product, file: "results/{:category}.json" do
    text :name, '#ProdPageTitle'
    text :category, css: '.category-name'
  end

  # item definition with result file config in body
  item :otherproduct do
    text :name, '#ProdPageTitle'
    text :category, css: '.category-name'
    save_to "results/{:category}.json", {
        format: :pretty_json,
        append: true,
        position: false
      }
  end

  # start from start_urls
  request_start do |category_list, **opts1|
    request_all :product_list, urls: category_list.css(".css-selector1") do |product_list, **opts2|
      for link in product_list.css(".css-selector2")
        # this call requests the page on url, knows that it contains a :product Entity, auto-extracts from the predefined Entity selectors, and auto-saves it to a result file as defined in the entity.
        request_item :product, url: link
      end
    end
  end
end

The logical result leaves us with only a DSL for defining the relationships between pages and how to get from one to the next. If we had a class-level Page description, we could have a singular parse() entrypoint that figures out the page type on its own.

class YourSpider < ApplicationSpider
  # class-level declaration of a page type
  page :product_list do
    identifier css: 'body.product-list-page'
    has_many :product, css: '#productlist a.product-link'
  end

  page :product do
    identifier do |response|
      !response.xpath('//div[@id="product-image"]').empty? and response.css('body.is-product').length > 0
    end
  end
end

In ApplicationSpider:

def parse(response, url:, data: {})
  # find the first page definition that matches this response
  page_definition = @page_types.find { |definition| definition.page_of_type(response) }
  if page_definition
    @entities[page_definition.name].parse(response)
  else
    puts "unrecognised page type at #{url}!"
  end
end

The parse() entrypoint would automatically find the right Page definition, so it knows how to parse the page and how to branch to deeper pages. All deeper pages are also parsed via the singular parse() callback. The advantage of this approach is that the navigational flow becomes very robust, since page types are explicitly identified by selectors in the Page definition. You would get a nice log of all unexpected page types (customised landing pages, error pages, etc.), and encountering them would not break the code or require error catching by the user.

The downside to the explicit approach is customizability when you need something specific to be done in order to parse a specific page (type). To account for this, the parse() entrypoint would have to check whether a user-defined callback exists that fits the page definition, similar to how it works now. So for a :product Page definition, it would look for a parse_product_page(response, url:, **opts) callback that allows the user to hook into the flow.

kimurai setup not passing all necessary arguments to ansible-playbook on Mac OS Catalina

Hi,

I'm getting this error while trying to use the kimurai setup command against an Ubuntu 18.04 LTS EC2 instance, running from a fresh brew install of ansible (2.9.7) on a MacBook Pro with macOS Catalina.

kimurai setup [email protected] --ask-sudo --ssh-key-path /Users/xxx/Development/ssh-keys/xxx.pem
usage: ansible-playbook [-h] [--version] [-v] [-k] [--private-key PRIVATE_KEY_FILE]
                        [-u REMOTE_USER] [-c CONNECTION] [-T TIMEOUT]
                        [--ssh-common-args SSH_COMMON_ARGS] [--sftp-extra-args SFTP_EXTRA_ARGS]
                        [--scp-extra-args SCP_EXTRA_ARGS] [--ssh-extra-args SSH_EXTRA_ARGS]
                        [--force-handlers] [--flush-cache] [-b] [--become-method BECOME_METHOD]
                        [--become-user BECOME_USER] [-K] [-t TAGS] [--skip-tags SKIP_TAGS] [-C]
                        [--syntax-check] [-D] [-i INVENTORY] [--list-hosts] [-l SUBSET]
                        [-e EXTRA_VARS] [--vault-id VAULT_IDS]
                        [--ask-vault-pass | --vault-password-file VAULT_PASSWORD_FILES] [-f FORKS]
                        [-M MODULE_PATH] [--list-tasks] [--list-tags] [--step]
                        [--start-at-task START_AT_TASK]
                        playbook [playbook ...]
ansible-playbook: error: argument --ssh-extra-args: expected one argument

I tried downgrading to ansible 2.8, but got this:

kimurai setup [email protected] --ask-sudo --ssh-key-path /Users/xxx/Development/ssh-keys/xxx.pem
BECOME password:

PLAY [all] ******************************************************************************************

TASK [Gathering Facts] ******************************************************************************
ERROR! Unexpected Exception, this is probably a bug: cannot pickle '_io.TextIOWrapper' object

It looks like a matter of the ansible version (I've never used ansible before, only Puppet and Chef). I'll take a look around, but this might need some updating, or the readme could state which ansible version to test against.

Note: there was no problem installing through a local ansible run directly on the Ubuntu machine, so nothing urgent whatsoever :)

Some minor warnings when using kimurai

Hey there,

Sorry to bother you.

A few harmless warnings are issued when kimurai is run under ruby -w.

The reason I report this is that I am quite pedantic, so I have -w on all the time.

This then pulls these warnings into my projects.

I know I can silence them, e.g. via $VERBOSE and probably the Warning module,
but I am taking the lazy approach and reporting them here. Feel free to disregard
this. :-)

/.gem/gems/kimurai-1.4.0/lib/kimurai/browser_builder.rb:12: warning: assigned but unused variable - e
/.gem/gems/kimurai-1.4.0/lib/kimurai/base_helper.rb:11: warning: assigned but unused variable - uri
/.gem/gems/kimurai-1.4.0/lib/kimurai/base_helper.rb:12: warning: assigned but unused variable - e
/.gem/gems/kimurai-1.4.0/lib/kimurai/base.rb:33: warning: instance variable @run_info not initialized

(For local but unused variables, they can either be removed or, if you want to keep them, given a leading
underscore such as _uri. For uninitialized instance variables, I typically bundle them all in a method
called reset() where they are initialized to nil; that silences the warning.)
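
An illustrative sketch of the kind of change being suggested (this is the general pattern, not the actual kimurai source):

begin
  require 'some_optional_dependency'
rescue LoadError => _e
  # the leading underscore marks the rescue variable as intentionally unused,
  # which silences the "assigned but unused variable" warning under ruby -w
end

class SomeClass
  def initialize
    reset
  end

  # initialize instance variables up front so ruby -w does not warn about
  # "instance variable @run_info not initialized" on first read
  def reset
    @run_info = nil
  end
end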

I use kimurai to query javascript-heavy websites that do not easily allow us to parse the result. For
that purpose it works very well. For example I use kimurai to query the remote world-time, from a
website that uses javascript. (God I hate javascript sooo much though ...)
