zhihu-spider's Introduction

Do not use this project for any commercial purpose; if you do, you bear the consequences.

ZhihuSpider

Zhihu crawler: scrapes all answers under a given Zhihu question (works for questions with fewer than roughly 800 answers).

Basic approach

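  • Iterate over question ids and save them to a file; filter the questions, then scrape the answers that are needed.

  • The project currently scrolls to the bottom of the page and then captures all answer elements in one pass. Because of Zhihu's current buffered (lazy) loading, once a question has too many answers (around 800), the earlier answers can no longer be captured.

    • Proposed fix: scrape while scrolling (although this makes it harder to locate elements in a way that avoids scraping duplicates).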
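
The scrape-while-scrolling idea can be sketched independently of the browser driver. This is a minimal sketch, not the project's implementation: `fetch_visible` and `scroll_once` are hypothetical callables standing in for whatever the Selenium code does, and it assumes every answer exposes a stable identifier (for example its answer URL) that can serve as a deduplication key.

```python
from typing import Callable, Dict, Iterable, Tuple


def collect_while_scrolling(
    fetch_visible: Callable[[], Iterable[Tuple[str, dict]]],
    scroll_once: Callable[[], None],
    max_idle_rounds: int = 3,
) -> Dict[str, dict]:
    """Collect answers incrementally, keyed by a stable id to avoid duplicates.

    fetch_visible: hypothetical callable returning (answer_id, record) pairs
        for the answers currently present in the page.
    scroll_once: hypothetical callable that scrolls the page by one step.
    max_idle_rounds: stop after this many consecutive rounds with no new answers.
    """
    seen: Dict[str, dict] = {}
    idle_rounds = 0
    while idle_rounds < max_idle_rounds:
        new_count = 0
        for answer_id, record in fetch_visible():
            if answer_id not in seen:  # skip answers captured in an earlier round
                seen[answer_id] = record
                new_count += 1
        # Reset the idle counter whenever a round produced something new.
        idle_rounds = 0 if new_count else idle_rounds + 1
        scroll_once()
    return seen
```

Keying on an identifier rather than on element position sidesteps the element-location problem noted above: the same answer may be seen in several rounds, but it is stored only once. The scroll step must be small enough that consecutive rounds overlap, otherwise answers can be skipped.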

Project structure

│  config.py # crawl links and storage path settings
│  README.md
│  requirements.txt
│  scanner.py # collects valid question URLs
│  filter_links.py # filters questions according to a set of rules
│  ZhihuSpider.py # main Zhihu crawler program
│
├─Driver
│      chromedriver.exe # Chrome driver
│      geckodriver.exe # gecko driver
│
└─Results
        result-2022-07-28-深度神经网络DNN是否模拟了人类大脑皮层结构.csv # sample scrape result

Installing dependencies

Python 3.7+

pip install -r requirements.txt

The bundled macOS driver targets Chrome 109; for other Chrome versions, download the matching driver yourself.

Usage

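  • Run scanner.py to collect valid question URLs (by question id), together with each question's title and answer count to make filtering easier, and write them to a file.
  • Run filter_links.py to filter the questions according to a set of rules and write the result to zhihu_valid_links.txt.
  • Download the driver for your browser and place it in the Driver folder, then run ZhihuSpider.py; results are written to zhihu_result.csv.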
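
To illustrate the first two steps, the sketch below builds candidate question URLs from a range of ids and filters scanned records by answer count. The record layout (URL, title, answer count), the function names, and the thresholds are assumptions made for this example; the actual rules live in scanner.py and filter_links.py and may differ. The 800-answer ceiling only mirrors the limitation described under "Basic approach".

```python
from typing import Iterable, Iterator, List, NamedTuple


class ScannedQuestion(NamedTuple):
    """Hypothetical record for one scanned question."""
    url: str
    title: str
    answer_count: int


def candidate_urls(start_id: int, stop_id: int) -> Iterator[str]:
    """Yield question URLs for ids in the half-open range [start_id, stop_id)."""
    for question_id in range(start_id, stop_id):
        yield f"https://www.zhihu.com/question/{question_id}"


def filter_questions(
    questions: Iterable[ScannedQuestion],
    min_answers: int = 1,
    max_answers: int = 800,
) -> List[str]:
    """Keep URLs of questions whose answer count lies in [min_answers, max_answers)."""
    return [
        q.url
        for q in questions
        if min_answers <= q.answer_count < max_answers
    ]
```

In practice a scanner would also have to check that each candidate URL actually resolves to a question, since many ids in a range will not exist.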

Scraped fields

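Field           Meaning
question_title  Question title
answer_url      Answer URL
author_name     Author's display name
fans_count      Follower count
created_time    Time the answer was posted
updated_time    Time of the most recent edit
comment_count   Number of comments
voteup_count    Number of upvotes
content         Answer text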
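
As a rough illustration of how rows with these fields could be written to CSV, here is a minimal sketch using the standard library's `csv.DictWriter`. The `write_answers` helper and the sample row are hypothetical; only the field names come from the list above, and the project's own output code may be organized differently.

```python
import csv
import io
from typing import Iterable, Mapping, TextIO

# Column order follows the field list above.
FIELDS = [
    "question_title", "answer_url", "author_name", "fans_count",
    "created_time", "updated_time", "comment_count", "voteup_count",
    "content",
]


def write_answers(rows: Iterable[Mapping[str, object]], out: TextIO) -> None:
    """Write a header plus one CSV line per answer; missing fields become empty."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    for row in rows:
        writer.writerow({name: row.get(name, "") for name in FIELDS})


# Usage with an in-memory buffer and a made-up row.
buffer = io.StringIO()
write_answers([{"question_title": "Example question", "voteup_count": 12}], buffer)
```

When writing to a real file, open it with `newline=""` as the `csv` module documentation recommends, so that line endings are not doubled on Windows.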

Notes

zhihu-spider's People

Contributors

meter3, dataaug, hughxuechen
