SYMFONY-SPIDER

A very easy-to-use multi-process web crawler, built on PHP's Symfony framework.

Requirements

  • php >= 5.6

Installation

git clone [email protected]:Jaggle/symfony-spider.git spider
cd spider 
composer install

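At the end of composer install, enter the database configuration when prompted, followed by the Redis DSN (for example: redis://pass@localhost). Provide the database name at this point, but there is no need to create the database yet; the command below creates it automatically.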
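For readers unsure how a DSN of that shape breaks down, here is a minimal sketch in Python (used here for illustration only; it is not part of this PHP project). The helper name `split_redis_dsn` is hypothetical. It assumes that the single token before `@` is the password, as the example DSN suggests, and that a missing port means the standard Redis port 6379; how the project's own Redis client interprets the DSN is not confirmed by this README.

```python
from urllib.parse import urlsplit

def split_redis_dsn(dsn):
    """Split a redis:// DSN into its parts (illustrative helper, not project code)."""
    parts = urlsplit(dsn)
    if parts.scheme != "redis":
        raise ValueError("expected a redis:// DSN, got %r" % dsn)
    # Assumption: with no ':' in the userinfo, the single token is the password.
    password = parts.password if parts.password is not None else parts.username
    return {
        "host": parts.hostname,
        "port": parts.port or 6379,  # assumed default: the standard Redis port
        "password": password,
    }

print(split_redis_dsn("redis://pass@localhost"))
```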

Create the database

php app/console doctrine:database:create

Create the table schema

php app/console doctrine:schema:update --force --dump-sql

Create a spider

php app/console spider:create

Add crawl rules

vim app/config/rules.json

About the rules:

Currently only four fields can be scraped. The example below scrapes Zhihu:

{
  "sf-spider" : {
    "linkRule": {
      "status": false,
      "rule": ""
    },
    "documentRule": {
      "title":  {
        "type": "text",
        "rule": "h1.QuestionHeader-title"
      },
      "meta": {
        "type": "href",
        "rule": ".UserLink-link"
      },
      "desc": {
        "type": "text",
        "rule": ".QuestionHeader-detail span.RichText.ztext"
      },
      "content": {
        "type": "html",
        "rule": "div.RichContent .RichContent-inner"
      }
    }
  }
}
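Reading the example above, the top-level key appears to be the spider name, and each entry under documentRule pairs a type with what looks like a CSS selector. A rules file of this shape could be sanity-checked before a run with a sketch like the one below. It is written in Python for illustration and is not part of the project. The function `check_rules` is hypothetical, the set of types is assumed to be only the three seen in the example, and treating all four fields as required is also an assumption the README does not confirm.

```python
import json

# Assumption: these are the only extraction types, since they are the only
# ones that appear in the example rules.
KNOWN_TYPES = {"text", "href", "html"}
# Assumption: the "four fields" are the four keys used in the example.
FIELDS = ("title", "meta", "desc", "content")

def check_rules(raw_json, spider_name):
    """Return a list of problems found in one spider's rules (empty if none)."""
    rules = json.loads(raw_json)
    if spider_name not in rules:
        return ["no rules for spider %r" % spider_name]
    document_rule = rules[spider_name].get("documentRule", {})
    problems = []
    for field in FIELDS:
        entry = document_rule.get(field)
        if entry is None:
            problems.append("missing field: " + field)
        elif entry.get("type") not in KNOWN_TYPES:
            problems.append("unknown type for " + field)
        elif not entry.get("rule"):
            problems.append("empty selector for " + field)
    return problems

sample = '''{"sf-spider": {"documentRule": {
  "title": {"type": "text", "rule": "h1.QuestionHeader-title"},
  "meta": {"type": "href", "rule": ".UserLink-link"},
  "desc": {"type": "text", "rule": ".QuestionHeader-detail span.RichText.ztext"},
  "content": {"type": "html", "rule": "div.RichContent .RichContent-inner"}
}}}'''
# An empty list means no problems were found.
print(check_rules(sample, "sf-spider"))
```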

Run the spider

SPIDER_NAME is the name of the spider you created; for example, mine is "sf-spider".

php app/console spider:run --spiderName=SPIDER_NAME --workerCount=4 

or

php app/console spider:run SPIDER_NAME --workerCount=4 

or, with log output enabled

php app/console spider:run SPIDER_NAME --workerCount=4 --debug

  • workerCount: number of processes to start; defaults to 1
  • spiderName: name of the spider; defaults to "default"

Execution flow

                    | -- job process <fetches page content>
master process -----| -- job process <fetches page content>
                    | -- job process <fetches page content>


task queue ----| -- document task: parses the page and stores the document in the database
               | -- job task: tracks job status and assigns jobs to job processes
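
The two diagrams can be read as one loop: job tasks hand URLs to job processes, and document tasks turn fetched pages into stored records. The sketch below walks through that loop in a single Python process, for illustration only; it is not the project's PHP implementation and starts no real processes. The function `simulate`, the round-robin assignment, the placeholder page content, and the idea that each fetched page produces one document task are all assumptions made for the example.

```python
from collections import deque

def simulate(urls, worker_count):
    """Single-process walk-through of the flow; returns (assignments, stored)."""
    queue = deque(("job", url) for url in urls)  # stands in for the task queue
    assignments = []  # (worker index, url) pairs handed out by the master
    stored = []       # stands in for the database
    next_worker = 0
    while queue:
        kind, payload = queue.popleft()
        if kind == "job":
            # Job task: the master assigns the URL to a job process
            # (round-robin here, which is an assumption).
            worker = next_worker
            next_worker = (next_worker + 1) % worker_count
            assignments.append((worker, payload))
            # Pretend the job process fetched the page, then queue a document task.
            page = "<html>%s</html>" % payload
            queue.append(("document", page))
        else:
            # Document task: parse the page and store the result.
            stored.append(payload)
    return assignments, stored

assignments, stored = simulate(["u1", "u2", "u3"], worker_count=2)
print(assignments)
print(stored)
```

In a real multi-process setup the queue would live in a shared store (the installation step asks for a Redis DSN, which suggests Redis plays that role), so that the master and the job processes can all reach it.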

Contributors

jjsty1e, rainbon
