scrapy-spiders's Introduction

This repo contains examples of web crawlers built using the Scrapy Python framework. For more details about Scrapy..

Scrapy-Spiders (functional examples)

A collection of Python scripts I have created to crawl various websites, mostly for lead generation projects that match keywords and collect email addresses and post URLs.

Each script is organized within its own folder and then combined into a single run script within the 'dist' folder.

Contribute

If you create a new crawler, please add it to the repo and send me a pull request. I'd like to build this up as a collection beyond just these few I wrote myself.

Spiders

Functionality

  • Specify keywords
  • Specify a specific date, or leave it null to search all time
  • Specify a specific category (e.g. Gigs, Jobs, For Sale, etc.), or leave it null to search the full site
  • The bot crawls all of Craigslist (worldwide, all domains), searching for posts from the specified time period that match any of the specified keywords. For each match, the bot collects the reply email address and any email addresses written within the post body, and records them to a txt file in CSV format.
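
The filtering and recording steps above can be sketched in plain Python. This is a minimal illustration, not the code from this repo: the post dictionary keys (`date`, `category`, `body`, `reply_email`, `url`), the function names, and the two-column CSV layout (email, post URL) are all assumptions made for the example, and the email pattern is deliberately simple.

```python
import csv
import re

# Deliberately simple pattern (an assumption); real addresses can be more varied.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def post_matches(post, keywords, posted_on=None, category=None):
    """Return True if the post passes the date, category, and keyword filters.

    A filter left as None is skipped, mirroring "leave it null" above.
    """
    if posted_on is not None and post["date"] != posted_on:
        return False
    if category is not None and post["category"].lower() != category.lower():
        return False
    body = post["body"].lower()
    return any(keyword.lower() in body for keyword in keywords)


def extract_emails(post):
    """Return the reply address plus addresses found in the body, without duplicates."""
    found = []
    if post.get("reply_email"):
        found.append(post["reply_email"])
    found.extend(EMAIL_RE.findall(post["body"]))
    # dict.fromkeys keeps the first occurrence of each address, in order.
    return list(dict.fromkeys(found))


def write_matches(posts, out, keywords, posted_on=None, category=None):
    """Write one CSV row (email, post URL) per address found in matching posts."""
    writer = csv.writer(out)
    for post in posts:
        if post_matches(post, keywords, posted_on, category):
            for email in extract_emails(post):
                writer.writerow([email, post["url"]])
```

Here `out` can be any writable text file object, for example one opened with `open("leads.txt", "w", newline="")` (the file name is hypothetical). Passing `posted_on=None` and `category=None` corresponds to searching all time and the full site.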

Sites

  • subdomain.Craigslist.org
  • Mandy.com
  • ProductionHub.com
  • EntertainmentCareers.net
  • NewEnglandFilm.com
  • Reel-scout.com

About Scrapy

Scrapy is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

ref: http://www.scrapy.org

Setup Guide

// Remove previous installation of Python 2.7
sudo rm -R /System/Library/Frameworks/Python.framework/Versions/2.7

// Install Homebrew
ruby -e "$(curl -fsSL https://raw.github.com/Homebrew/homebrew/go/install)"

// Add to profile ~/.bashrc
export PATH=/usr/local/bin:/usr/local/sbin:$PATH

// Fresh install Python 2.7
brew install python

// Add to profile ~/.bashrc
export PATH=/usr/local/share/python:$PATH

// Install Scrapy library with pip (pip should have been installed by brew when you installed python)
pip install Scrapy

// Set up the project directory wherever you want (I just put it on my desktop); sudo is not needed inside your home folder
mkdir ~/Desktop/ProjectName
cd ~/Desktop/ProjectName

// Set up individual spider projects
scrapy startproject spiderOne
scrapy startproject spiderTwo
scrapy startproject spiderThree

// To run each spider individually
cd ~/Desktop/ProjectName
cd spiderOne
scrapy crawl spiderOne

etc..
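
Running each spider by hand can be replaced by a small run script. The sketch below is one possible way to do that and is not the script shipped in the 'dist' folder: the function names are hypothetical, and it assumes, as in the commands above, that each spider is named after its project folder and that the `scrapy` command is on your PATH.

```python
import subprocess
from pathlib import Path


def build_crawl_jobs(project_root, spider_names):
    """Return (working_directory, argv) pairs, one per spider project."""
    root = Path(project_root).expanduser()
    return [(root / name, ["scrapy", "crawl", name]) for name in spider_names]


def run_all(project_root, spider_names):
    """Run each spider in turn from inside its own project folder."""
    for cwd, argv in build_crawl_jobs(project_root, spider_names):
        # check=True raises CalledProcessError if a crawl exits with a non-zero status.
        subprocess.run(argv, cwd=cwd, check=True)
```

For the layout above, the call would be `run_all("~/Desktop/ProjectName", ["spiderOne", "spiderTwo", "spiderThree"])`. The spiders run one after another, so a failure in one stops the rest.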

!!WARNING!!

You should not scrape any website that you do not own unless you have consent from the webmaster of the site. Using these scripts can and will cause you to be blacklisted if they are used abusively and/or incorrectly. This repository is for reference purposes only and should not be run in a live environment.

scrapy-spiders's People

Contributors

dcondrey
