spider's Introduction

A news crawler that downloads the news content of several sections under http://xw.qq.com/simple/s/index/index.htm. Sections: news, sports, finance, entertainment, real estate.

  1. Create the project

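# archetype:create was removed in maven-archetype-plugin 3.x; archetype:generate is its replacement
mvn archetype:generate -DgroupId=com.app.lgr -DartifactId=spider -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false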

  2. Create tables in MySQL

SET SQL_SAFE_UPDATES = 0;
CREATE TABLE `news_category` (
	`id` bigint NOT NULL AUTO_INCREMENT,
	`name` varchar(20) NOT NULL,
	`url` varchar(255) NOT NULL,
	`desc` varchar(255),
	PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8;
INSERT INTO `news_category` (`name`,`url`, `desc`)
VALUES
('新闻','http://xw.qq.com/simple/s/news/index.htm','新闻栏目')
,('财经','http://xw.qq.com/simple/s/finance/index.htm','财经栏目')
,('娱乐','http://xw.qq.com/simple/s/ent/index.htm','娱乐栏目')
,('体育','http://xw.qq.com/simple/s/sports/index.htm','体育栏目');
CREATE TABLE `news_item` (
	`id` bigint NOT NULL AUTO_INCREMENT,
	`title` varchar(100) NOT NULL,
	`content` text,
	`source` varchar(100),
	`create_time` datetime,
	`hits` bigint,
	`category_id` bigint NOT NULL,
	PRIMARY KEY (`id`),
	KEY `FK3728B9281A2` (`category_id`),
	CONSTRAINT `FK3728B9281A2` FOREIGN KEY (`category_id`) REFERENCES `news_category` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8;
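
To make the `news_item` schema above more concrete, here is a minimal sketch of how a crawled article could be written to that table with plain JDBC. The class `NewsItemRow` and its methods are hypothetical and are not taken from the spider source; only the table name, column names, and column lengths come from the SQL above.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Timestamp;
import java.time.LocalDateTime;

// Hypothetical helper (not from the spider source): one row of `news_item`.
final class NewsItemRow {
    // Length limits copied from the CREATE TABLE statement.
    static final int MAX_TITLE_LENGTH = 100;
    static final int MAX_SOURCE_LENGTH = 100;

    // `id` is AUTO_INCREMENT, so it is left out of the column list.
    static final String INSERT_SQL =
        "INSERT INTO `news_item` "
        + "(`title`, `content`, `source`, `create_time`, `hits`, `category_id`) "
        + "VALUES (?, ?, ?, ?, ?, ?)";

    final String title;
    final String content;
    final String source;
    final LocalDateTime createTime;
    final long hits;
    final long categoryId;

    NewsItemRow(String title, String content, String source,
                LocalDateTime createTime, long hits, long categoryId) {
        if (title == null || title.isEmpty()) {
            throw new IllegalArgumentException("title is NOT NULL in the schema");
        }
        this.title = truncate(title, MAX_TITLE_LENGTH);
        this.content = content;
        this.source = truncate(source, MAX_SOURCE_LENGTH);
        this.createTime = createTime;
        this.hits = hits;
        this.categoryId = categoryId;
    }

    // Cuts a value down to the column length; null is passed through unchanged.
    // Length is measured in UTF-16 code units, which equals the character count
    // for text in the Basic Multilingual Plane (including common Chinese text).
    static String truncate(String value, int maxLength) {
        if (value == null || value.length() <= maxLength) {
            return value;
        }
        return value.substring(0, maxLength);
    }

    // Inserts this row using a parameterized statement, so titles and bodies
    // containing quotes cannot break the SQL.
    void insertInto(Connection connection) throws SQLException {
        try (PreparedStatement ps = connection.prepareStatement(INSERT_SQL)) {
            ps.setString(1, title);
            ps.setString(2, content);
            ps.setString(3, source);
            ps.setTimestamp(4, createTime == null ? null : Timestamp.valueOf(createTime));
            ps.setLong(5, hits);
            ps.setLong(6, categoryId);
            ps.executeUpdate();
        }
    }
}
```

Because of the foreign key, `categoryId` has to be the `id` of an existing `news_category` row, otherwise MySQL rejects the insert.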

  3. Build and run

git clone  https://github.com/whxiyi100829/spider.git
cd spider
mvn clean assembly:assembly -DskipTests
cd target/spider-1.0-SNAPSHOT
# edit conf/config.properties: set the database url, user and password
vim conf/config.properties
# edit bin/startSpider.sh: set the APP_HOME path
vim bin/startSpider.sh
# run
sh bin/startSpider.sh
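
The steps above set the database url, user and password in conf/config.properties. As a rough sketch of how an application can read such a file, the following uses `java.util.Properties`. The key names `jdbc.url`, `jdbc.user` and `jdbc.password` and the class `DbConfig` are assumptions made for illustration; the actual keys used by the spider are whatever the shipped conf/config.properties contains.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

// Hypothetical holder for the three database settings named in the README.
final class DbConfig {
    final String url;
    final String user;
    final String password;

    DbConfig(String url, String user, String password) {
        this.url = url;
        this.user = user;
        this.password = password;
    }

    // Reads the settings from a file such as conf/config.properties.
    static DbConfig fromFile(Path path) {
        try (Reader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
            return fromReader(reader);
        } catch (IOException e) {
            throw new UncheckedIOException("Cannot read " + path, e);
        }
    }

    // Reads the settings from any Reader, which keeps the parsing testable.
    static DbConfig fromReader(Reader reader) {
        Properties props = new Properties();
        try {
            props.load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException("Cannot parse properties", e);
        }
        // The key names below are assumptions, not confirmed by the project.
        return new DbConfig(
            require(props, "jdbc.url"),
            require(props, "jdbc.user"),
            require(props, "jdbc.password"));
    }

    // Fails fast with a clear message when a setting is missing or blank.
    private static String require(Properties props, String key) {
        String value = props.getProperty(key);
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalStateException("Missing required property: " + key);
        }
        return value.trim();
    }
}
```

A caller would use `DbConfig.fromFile(Path.of("conf/config.properties"))` and pass the three fields to `DriverManager.getConnection(url, user, password)`. Failing on a missing key at startup is a design choice here: it surfaces a bad config before the crawler starts fetching pages.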

  4. Roadmap

2014-08-09: add image downloading and video filtering
