
zhihu-py3's Introduction

zhihu-py3: an unofficial Zhihu API library for Python 3


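Notice

Because Zhihu keeps changing its front end, I have to update this library every time, which has become tiresome.

So I started a new project, Zhihu-OAuth.

The new project uses some unconventional techniques and should be more stable and faster. *It also supports Python 2!* I have not tested its stability, but a speed comparison is available.

If you are about to start a new project, I strongly recommend taking a look at it.

If you have already written code against Zhihu-py3, I plan to write a short migration guide from Zhihu-py3 to Zhihu-OAuth soon, so keep an eye out for that.

After all, if there is a better option, why not try it?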

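Features

Zhihu has no public API, so, inspired by the zhihu-python project, I rewrote a Zhihu data-parsing module for Python 3.

In one sentence: you supply a Zhihu URL, it is used to construct an object of the corresponding class, and you read the data you need from that object.

A simple example: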

from zhihu import ZhihuClient

Cookies_File = 'cookies.json'

client = ZhihuClient(Cookies_File)

url = 'http://www.zhihu.com/question/24825703'
question = client.question(url)

print(question.title)
print(question.answer_num)
print(question.follower_num)
print(question.topics)

for answer in question.answers:
    print(answer.author.name, answer.upvote_num)

This code prints:

关系亲密的人之间要说「谢谢」吗?
627
4322
['心理学', '恋爱', '社会', '礼仪', '亲密关系']
龙晓航 50
小不点儿 198
芝士就是力量 89
欧阳忆希 425
...

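Other classes are also available: Author (user), Answer, Collection (favorites), Column, Post (article), and Topic. The Answer and Post classes provide a save method that stores an answer or article as HTML or Markdown. See the documentation or zhihu-test.py for details.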

Installation

This project depends on requests, BeautifulSoup4, and html2text.

The project is published on PyPI; install it with:

(sudo) pip(3) install (--upgrade) zhihu-py3

To enable lxml, use:

(sudo) pip(3) install (--upgrade) zhihu-py3[lxml]

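lxml parses HTML efficiently and tolerates malformed markup well. When Zhihu uses <br>, the built-in html.parser converts it to <br>...</br>, whereas lxml converts it to <br/>, which is more standard and cleaner, so the second command is recommended.

The module also works without lxml; in that case it falls back to html.parser automatically.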
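
The fallback described here can be sketched with the standard library alone. This is a minimal sketch, not the library's actual code: `pick_parser` is a hypothetical helper that returns the parser name one would pass to BeautifulSoup, choosing lxml only when it can be imported.

```python
import importlib.util


def pick_parser():
    """Return "lxml" when lxml is importable, otherwise "html.parser".

    Hypothetical helper for illustration; zhihu-py3 may detect lxml differently.
    """
    if importlib.util.find_spec("lxml") is not None:
        return "lxml"
    return "html.parser"


# The result is the parser name to hand to BeautifulSoup, for example:
#   BeautifulSoup(html, pick_parser())
parser_name = pick_parser()
print(parser_name)
```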

PS: if installing lxml fails, install libxml and libxslt and try again:

sudo apt-get install libxml2 libxml2-dev libxslt1.1 libxslt1-dev

Preparation

On first use, it is recommended to run the following code to generate a cookies file:

from zhihu import ZhihuClient

ZhihuClient().create_cookies('cookies.json')

Output:

====== zhihu login =====
email: <your-email>
password: <your-password>
please check captcha.gif for captcha
captcha: <captcha-code>
====== logging.... =====
login successfully
cookies file created.

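After a successful run, a cookies.json file is created in the current directory.

All examples below assume a successful login.

It is recommended to run zhihu-test.py as a test before real use.

Usage examples

To keep the Readme short, this section has moved to the documentation.

See the "Usage examples" section of the documentation.

Overview of login methods

To keep the Readme short, this section has moved to the documentation.

See the "Overview of login methods" section of the documentation.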

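Documentation

The documentation is finally done, though I still have not figured out Sphinx properly T^T, so this will have to do for now:

Master branch documentation

Dev branch documentation

Other

If you have a problem, please open an Issue. If there is no reply within a few hours, you can ask in the QQ group listed at the end.

Related projects:

  • zhihurss: a cross-platform Zhihu RSS (any user) client built on zhihu-py3.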

TODO List

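  • [x] Get a user's followees and followers
  • [x] Get the users who upvoted an answer
  • [x] Get a user's avatar URL
  • [x] Package as a standard Python module
  • [x] Refactor the code and add a ZhihuClient class so that a custom cookies file can be used
  • [x] Collection followers, question followers, and so on
  • [x] Add user actions to ZhihuClient (for example, upvoting an answer)
  • [ ] Unittest (difficult, because Zhihu may change)
  • [x] Get the number of columns a user follows and the columns themselves
  • [x] Get the number of topics a user follows and the topics themselves
  • [x] A Comment class should gradually go on the agenda too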

Contact

GitHub: @7sDream

Zhihu: @7sDream

Sina Weibo: @Dilover

Email: send me an email

Programming discussion group (QQ): 478786205

zhihu-py3's People

Contributors

ahonn, cssmlulu, glennq, gracker, laike9m, lishubing, littlezz, xen0n, zeroxfio


zhihu-py3's Issues

content in the Answer class no longer works

The question detail block now also uses the CSS class zm-editable-content.

Getting content in the Answer class:

    @property
    @check_soup('_content')
    def content(self):
        """Return the answer content as processed HTML.

        :return: answer content
        :rtype: str
        """
        content = self.soup.find('div', class_='zm-editable-content')
        content = answer_content_process(content)
        return content

so this code now returns the question details instead.

It can be changed to:

 content = self.soup.find('div', class_='zm-editable-content clearfix')

Some test cases failed

In test/zhihu-test.py, some tests failed.
My test environment is Windows 7, Python 3.5.2.

# Get the users following the question
for _, follower in zip(range(10), question.followers):
    print(follower.name)

# Get the time the question was asked
ctime = question.creation_time
print(ctime)
assert ctime == datetime.strptime('2014-08-12 17:58:07', "%Y-%m-%d %H:%M:%S")

# Get the last edit time
last_edit_time = question.last_edit_time
print(last_edit_time)
assert last_edit_time >= datetime.strptime('2015-04-01 00:39:21', "%Y-%m-%d %H:%M:%S")

# Get the title of the question this answer belongs to
print(answer.question.title)

# Get the user's followers
for _, follower in zip(range(10), author.followers):
    print(follower.name)

I only ran zhihu-test.py and found that the tests above do not seem to pass.
Part of the error output follows.

===== test failed =====
Cleaning...Done
Traceback (most recent call last):
  File "zhihu-test.py", line 869, in <module>
    raise e
  File "zhihu-test.py", line 863, in <module>
    'test()', setup='from __main__ import test', number=1)
  File "C:\Python3\lib\timeit.py", line 213, in timeit
    return Timer(stmt, setup, timer, globals).timeit(number)
  File "C:\Python3\lib\timeit.py", line 178, in timeit
    timing = self.inner(it, self.timer)
  File "<timeit-src>", line 6, in inner
  File "zhihu-test.py", line 813, in test
    test_question()
  File "zhihu-test.py", line 34, in test_question
    for _, follower in zip(range(10), question.followers):
  File "C:\Python3\lib\site-packages\zhihu_py3-0.3.17-py3.5.egg\zhihu\question.py", line 172, in followers
  File "C:\Python3\lib\site-packages\zhihu_py3-0.3.17-py3.5.egg\zhihu\common.py", line 225, in common_follower
  File "C:\Python3\lib\site-packages\requests-2.10.0-py3.5.egg\requests\models.py", line 812, in json
    return complexjson.loads(self.text, **kwargs)
  File "C:\Python3\lib\json\__init__.py", line 319, in loads
    return _default_decoder.decode(s)
  File "C:\Python3\lib\json\decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "C:\Python3\lib\json\decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
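
The final JSONDecodeError at `char 0` means the response body was empty or did not start with a JSON value, for instance an HTML page returned in place of the expected data. A minimal sketch of a defensive decode, using only the standard library; `parse_json_or_none` is a hypothetical helper and not part of zhihu-py3:

```python
import json


def parse_json_or_none(text):
    """Decode text as JSON, returning None when the body is not valid JSON."""
    try:
        return json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None


# A JSON body decodes normally; an HTML body yields None instead of raising.
print(parse_json_or_none('{"msg": [1, 2]}'))
print(parse_json_or_none("<html>login required</html>"))
```

A caller can then report the unexpected body or retry, rather than failing deep inside the decoder.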

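creation_time and last_edit_time of the Question class no longer seem to be available

While rerunning an old crawler today I ran into this problem: a JSONDecodeError is raised. The other methods of the Question and Answer classes seem fine. Since this project is no longer maintained, though, it looks like the only option for now is to move to zhihu-oauth TAT...

Ah, but zhihu-oauth has no creation_time for Question... a pity.

By the way, upvoters on the Answer class also fails, though zhihu-oauth has an equivalent for that.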

Character encoding error after installing from git on Ubuntu

root@iZ28dhz7tobZ:/zhihu-py3# cd test/
root@iZ28dhz7tobZ:/zhihu-py3/test# python3 zhihu-test.py
Test dir: /root/zhihu-py3/test/test
Test dir not exist.
Cookies file found.
Making test dir...Done

===== test start =====
===== test failed =====
Cleaning...Done
Traceback (most recent call last):
  File "zhihu-test.py", line 869, in <module>
    raise e
  File "zhihu-test.py", line 863, in <module>
    'test()', setup='from __main__ import test', number=1)
  File "/root/anaconda3/lib/python3.5/timeit.py", line 213, in timeit
    return Timer(stmt, setup, timer, globals).timeit(number)
  File "/root/anaconda3/lib/python3.5/timeit.py", line 178, in timeit
    timing = self.inner(it, self.timer)
  File "<timeit-src>", line 6, in inner
  File "zhihu-test.py", line 813, in test
    test_question()
  File "zhihu-test.py", line 18, in test_question
    print(question.title)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-15: ordinal not in range(128)

Is there a workaround outside the library?
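
One workaround commonly used for this kind of UnicodeEncodeError is to make standard output encode as UTF-8, either by setting the environment variable PYTHONIOENCODING=utf-8 before running the script or by rewrapping the output stream. A minimal sketch of the second option, assuming the cause is an ASCII locale on the server; `utf8_writer` is a hypothetical helper:

```python
import io


def utf8_writer(binary_stream):
    """Wrap a binary stream in a text layer that always encodes as UTF-8."""
    return io.TextIOWrapper(binary_stream, encoding="utf-8")


# In a script one would install it once at startup:
#   import sys
#   sys.stdout = utf8_writer(sys.stdout.buffer)

# Demonstration on an in-memory buffer, so the sketch runs anywhere:
buffer = io.BytesIO()
writer = utf8_writer(buffer)
writer.write("知乎")
writer.flush()
print(buffer.getvalue())
```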

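Would you consider extending the user-action class?

The user-action class only offers upvote, downvote, follow, unfollow, and the like. Would you consider adding features such as sending private messages, posting answers, replying to comments, and updating settings?

Or were these left out because they could easily be abused?

Error when the user id is too long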

user = client.from_url(v.profileUrl)
print (user.follower_count)

Traceback (most recent call last):
  File "zhihu.py", line 70, in
    process()
  File "zhihu.py", line 63, in process
    print (user.follower_count)
  File "Env/codelab/lib/python2.7/site-packages/zhihu_oauth/zhcls/normal.py", line 59, in wrapper
    self._get_data()
  File "Env/codelab/lib/python2.7/site-packages/zhihu_oauth/zhcls/base.py", line 75, in _get_data
    raise e
zhihu_oauth.exception.GetDataErrorException

example:
https://www.zhihu.com/people/meng-meng-meng-meng-meng-meng-meng-duo-nuo

I have not looked closely at the cause, but it seems that for a https://www.zhihu.com/people/ URL, the error is raised whenever the userid exceeds a certain length.

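Separate HTTP requests from soup processing, and improve the tests on that basis

The current tests do not seem to do their job: they mostly print the retrieved information rather than check its content, so they fail only when an Exception is raised. They also do not use any unit-testing framework.

Since development seems fairly active, I think the tests need improving so that after each code change they can be run to make sure no new bug was introduced. One problem is that many features fetch data directly from Zhihu web pages, and that data changes. So I think the sensible approach is to separate the request-sending part from the soup-processing, data-extracting part as much as possible, and keep the former simple enough to be obviously correct.

That way, copies of some pages can be saved locally and loaded directly into self.soup, and each function can be tested by comparing the extracted data with the expected values. A unit-testing framework such as unittest should also be used; it would make things much easier.

I can help write some of the code if needed.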
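
The proposed split can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's code: `fetch_html`, `parse_question_title`, and the saved page are all hypothetical, and the standard library's HTMLParser stands in for BeautifulSoup so the sketch has no dependencies.

```python
import unittest
from html.parser import HTMLParser


def fetch_html(session, url):
    """Network side: kept trivial so there is little to test (hypothetical)."""
    return session.get(url).text


class _TitleParser(HTMLParser):
    """Collect the text inside the <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def parse_question_title(html):
    """Parsing side: a pure function of the HTML text, testable offline."""
    parser = _TitleParser()
    parser.feed(html)
    return parser.title.strip()


# Stand-in for a page copy saved to disk; the content is made up.
SAVED_PAGE = "<html><head><title> Sample question </title></head><body></body></html>"


class ParseQuestionTitleTest(unittest.TestCase):
    def test_title_from_saved_page(self):
        self.assertEqual(parse_question_title(SAVED_PAGE), "Sample question")
```

Saved as a module, the test case can be run with `python -m unittest`; because the parser never touches the network, the expected value stays stable even when the live page changes.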

Request: a url for Author.activity

image

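As the image shows, every activity actually has a url. The new requirement I mentioned earlier, the url of an answer, was not thought through.

What I want is this kind of uniform url, present in every activity, to pass as the parameter to the webview in which I load a single feed.

Missing pagination options

For example, the "zhuanlan.zhihu.com/api/posts/:id/likers?limit=20&offset=20" endpoint already returns paged results, but the project does not seem to support this.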
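
Consuming a limit/offset endpoint like this one can be sketched as a generator. This is a minimal sketch under assumptions: `iter_paged` and `fake_fetch` are hypothetical names, and the stopping rule (an empty or short page ends the listing) is a guess about how the endpoint behaves, not documented behavior.

```python
def iter_paged(fetch_page, limit=20):
    """Yield items from a limit/offset endpoint until a page comes back short.

    fetch_page(limit=..., offset=...) must return a list of items.
    """
    offset = 0
    while True:
        items = fetch_page(limit=limit, offset=offset)
        if not items:
            return
        yield from items
        if len(items) < limit:
            return  # assumed: a short page means there is nothing further
        offset += limit


# In-memory stand-in for the HTTP call, so the sketch runs offline.
DATA = list(range(45))


def fake_fetch(limit, offset):
    return DATA[offset:offset + limit]


likers = list(iter_paged(fake_fetch, limit=20))
print(len(likers))
```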

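setuptools packaging

I am preparing a packaging patch (to make management with pip easier) and need this information:

  • Version number
  • License

Just reply here. I have already adjusted the project structure in my own branch and will write these in myself, which avoids a painful merge.

View detailed profile

Could you add a way to view a user's detailed profile, to get more information about occupation and education?

Suggestion: a status-check interface for questions and answers

Provide a status check that distinguishes states for questions and answers.
Answers: "normal", "edit suggested: politics", "edit suggested: unfriendly", "deleted"
Questions: "normal", "closed: xxx", "deleted"
Handling them directly as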

<p>
   回答建议修改:不宜公开讨论的政治内容
</p>

does not seem ideal.
No hurry, of course.
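
One possible shape for the interface suggested in this issue is an enum plus a classifier. This is only a sketch of the idea: `AnswerStatus` and `classify_answer` are hypothetical names, the marker string is taken from the example in the issue, and whether an answer was deleted is assumed to be known from elsewhere.

```python
from enum import Enum

# Marker text taken from the example in the issue; other notices are not covered.
SUGGEST_EDIT_PREFIX = "回答建议修改:"


class AnswerStatus(Enum):
    NORMAL = "normal"
    SUGGEST_EDIT = "suggest_edit"
    DELETED = "deleted"


def classify_answer(text, deleted=False):
    """Return (status, reason); reason is None unless an edit was suggested."""
    if deleted:
        return AnswerStatus.DELETED, None
    stripped = text.strip()
    if stripped.startswith(SUGGEST_EDIT_PREFIX):
        return AnswerStatus.SUGGEST_EDIT, stripped[len(SUGGEST_EDIT_PREFIX):]
    return AnswerStatus.NORMAL, None


status, reason = classify_answer("回答建议修改:不宜公开讨论的政治内容")
print(status, reason)
```

Returning the status separately from the content lets callers skip or flag such answers instead of treating the notice as the answer body.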

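The children method of the Topic class cannot return all child topics

Hello! The children method of the Topic class cannot return all child topics of a topic. Many topics have a lot of child topics, more than the page shows at once, so "load more" has to be clicked to reveal the rest. The current children method only returns the topics shown initially on the topic page. For example, the topic 「形而上」 has 44 child topics, but children returns only 10 of them.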
image

Suggestion: add retrieval of recent activity and followers

Your TODO:

image

help

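I need a new feature: a given user's latest activity. Please consider developing it. I will list my ideas; there should be further cooperation later.

PS: your library is nice to use, and the documentation seen through pydoc is really well written. Thumbs up.

Need a filter on Author.activities to select answering questions and publishing posts

Because if I iterate myself like this: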

for act in activitys:
    if act.type in [zhihu.ActType.ANSWER_QUESTION, zhihu.ActType.PUBLISH_POST]:

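every act gets constructed, yet answering questions and publishing posts account for only about 1% of all acts (roughly, on average).
So a filter could improve performance considerably.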
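
The requested filter amounts to checking the type on the raw record before the expensive object is built. A minimal sketch of that idea, with everything hypothetical: the record layout, the `build` callback, and every type name other than ANSWER_QUESTION and PUBLISH_POST are invented for illustration.

```python
WANTED = {"ANSWER_QUESTION", "PUBLISH_POST"}


def filtered_activities(raw_records, wanted_types, build):
    """Yield build(record) only for records whose type is wanted."""
    wanted = set(wanted_types)
    for record in raw_records:
        if record["type"] in wanted:
            yield build(record)  # the costly construction happens only here


built = []


def build(record):
    """Stand-in for constructing a full activity object; records each call."""
    built.append(record["id"])
    return record


raw = [
    {"id": 1, "type": "FOLLOW_TOPIC"},
    {"id": 2, "type": "ANSWER_QUESTION"},
    {"id": 3, "type": "UPVOTE_ANSWER"},
    {"id": 4, "type": "PUBLISH_POST"},
]
kept = list(filtered_activities(raw, WANTED, build))
print([r["id"] for r in kept], built)
```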

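Question about the questions generator of the Topic class

Hello! Thank you for writing this program. I am a beginner and have learned a lot from it, but I have a question about the questions generator of the Topic class. I do not understand why topic.questions.next() or next(topic.questions) always returns the same question, whereas for question in topic.questions does not. Why is that?
Also, after crawling a little over 200 items today I got an error; opening a browser showed that a captcha was required. Apart from pausing the program, how should this kind of anti-crawler mechanism be handled?

P.S. I hope zhihu-oauth can add a questions generator to its topic class >_<

Login failed, reason: None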
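
A likely explanation for the repeated question, assuming questions is a property that builds a new generator on every access: each next(topic.questions) starts a fresh generator and therefore returns its first item, while a for loop accesses the property once and keeps advancing the same generator. The `Topic` class below is a made-up stand-in that reproduces the effect:

```python
class Topic:
    """Stand-in class; the real Topic fetches questions from Zhihu."""

    @property
    def questions(self):
        # A new generator object is created on every attribute access.
        for number in range(1, 4):
            yield "question %d" % number


topic = Topic()

# Each access restarts the sequence, so the first item comes back every time.
repeated = [next(topic.questions) for _ in range(3)]

# Keeping one generator and advancing it gives successive items.
questions = topic.questions
advancing = [next(questions) for _ in range(3)]

print(repeated)
print(advancing)
```

Under that assumption, storing the generator in a variable first and calling next on that variable avoids the repetition.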

====== zhihu login =====
email:*
password:*
please check captcha.gif for captcha
captcha: d9sg
====== logging.... =====
login failed, reason: None

Test error: the latest beautifulsoup4, 4.4.0, is not supported

System: Mac OSX Yosemite 10.10.4

<zhihu.Answer object at 0x1030a5e48>
<generator object top_i_answers at 0x105553510>
<generator object answers at 0x105553510>
Traceback (most recent call last):
  File "zhihu-test.py", line 267, in <module>
    test_question()
  File "zhihu-test.py", line 48, in test_question
    for answer in question.answers:
  File "/Users/jiaqiluo/Documents/python_learning/zhihu-py3-master/zhihu.py", line 376, in answers
    _xsrf = self.soup.find('input', attrs={'name': '_xsrf'})['value']
TypeError: 'NoneType' object is not subscriptable

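I have only just started learning Python, so this may be a basic problem. I am actively looking for a solution myself, but I still hope the author (or someone else) can help.

Thanks

Errors keep occurring with topic.hot_questions and topic.questions

Hello

Below are my code and the error output.
Could you take a look at what went wrong?
Thank you.
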
Code:
topic_travel_in_Taiwan = client.topic('http://www.zhihu.com/topic/19755487')
questions_in_topic = topic_travel_in_Taiwan.questions
first_q = next(questions_in_topic)

The error is as follows:
Traceback (most recent call last):
  File "", line 1, in
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/zhihu/topic.py", line 293, in questions
    older_time_stamp = int(questions[-1].h2.span['data-timestamp'])
IndexError: list index out of range


Code:
hot_questions = topic_travel_in_Taiwan.hot_questions
next(hot_questions)

Two different errors have occurred:

Traceback (most recent call last):
  File "", line 1, in
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/zhihu/topic.py", line 386, in hot_questions
    params = {'start': 0, '_xsrf': self.xsrf}
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/zhihu/common.py", line 100, in wrapper
    value = func(self)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/zhihu/topic.py", line 45, in xsrf
    return self.soup.find('input', attrs={'name': '_xsrf'})['value']
TypeError: 'NoneType' object is not subscriptable

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/zhihu/topic.py", line 473, in _get_score
    _ = h2['class']
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/bs4/element.py", line 997, in __getitem__
    return self.attrs[key]
KeyError: 'class'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "", line 1, in
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/zhihu/topic.py", line 396, in hot_questions
    questions.sort(key=self._get_score, reverse=True)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/zhihu/topic.py", line 476, in _get_score
    return div.parent.parent['data-score']
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/bs4/element.py", line 997, in __getitem__
    return self.attrs[key]
KeyError: 'data-score'

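Login method no longer works

The captcha step seems to have stopped working.
Specifically, login(email,password,captcha) returns
<Response [405]>
The login method may need to be reworked.

An error was raised while iterating over activity qaq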

  File "/Users/yuwei/PycharmProjects/qt_project/zhihu_pyqt/src/model/feeds_list.py", line 97, in get_feeds
    for act in activities:
  File "/usr/local/lib/python3.4/site-packages/zhihu.py", line 758, in activities
    gotten_feed_num = res.json()['msg'][0]
  File "/usr/local/lib/python3.4/site-packages/requests/models.py", line 819, in json
    return json.loads(self.text, **kwargs)
  File "/usr/local/Cellar/python3/3.4.2_1/Frameworks/Python.framework/Versions/3.4/lib/python3.4/json/__init__.py", line 318, in loads
    return _default_decoder.decode(s)
  File "/usr/local/Cellar/python3/3.4.2_1/Frameworks/Python.framework/Versions/3.4/lib/python3.4/json/decoder.py", line 343, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/local/Cellar/python3/3.4.2_1/Frameworks/Python.framework/Versions/3.4/lib/python3.4/json/decoder.py", line 361, in raw_decode
    raise ValueError(errmsg("Expecting value", s, err.value)) from None
ValueError: Expecting value: line 1 column 1 (char 0)

The following problem occurs after installing on Python 3.2.3

from zhihu import ZhihuClient
Traceback (most recent call last):
  File "", line 1, in
  File "/usr/local/lib/python3.2/dist-packages/zhihu/__init__.py", line 10, in
    from .activity import Activity
  File "/usr/local/lib/python3.2/dist-packages/zhihu/activity.py", line 12, in
    from .topic import Topic
  File "/usr/local/lib/python3.2/dist-packages/zhihu/topic.py", line 81
    yield Topic(Zhihu_URL + topic_tag['href'],
SyntaxError: 'return' with argument inside generator

Experts, please tell me: how can I solve this?
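
The SyntaxError points at the Python version rather than at the installation: `return` with a value inside a generator is only accepted from Python 3.3 onward (PEP 380), so Python 3.2.3 cannot compile a module that uses it, and running on 3.3 or later should avoid the error. A minimal sketch of the construct, using a made-up generator; on 3.3+ the returned value travels on StopIteration:

```python
def first_two(items):
    """Yield the first two items, then return a summary value."""
    yield items[0]
    yield items[1]
    return "done"  # rejected with a SyntaxError before Python 3.3


gen = first_two(["a", "b", "c"])
values = [next(gen), next(gen)]
try:
    next(gen)
    result = None
except StopIteration as stop:
    result = stop.value  # the generator's return value

print(values, result)
```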
