yaozeyuan / zhihuhelp_archived

(No longer maintained) Quickly convert Zhihu content into epub e-books. Please go to https://github.com/YaoZeyuan/zhihuhelp_with_node

Python 76.05% CSS 0.60% HTML 23.34%

zhihuhelp_archived's Issues

Unusable after updating

Exception cause: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:590)

It is basically unusable; every attempt hits this error.

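An image format is overlooked when scraping answer content, causing the program to crash

For example, the question 十行以内，你写过哪些比较酷的 Mathematica 代码？ ("What cool Mathematica code have you written in ten lines or fewer?") uses a formula, which is written like this:

According to this code in dict2Html.py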

def imgFix(self, content):
        for imgTag in re.findall(r'<img.*?>', content):
            src = re.search(r'(?<=src=").*?(?=")', imgTag)
...

and

def getFileName(self, imgHref = ''):
        return imgHref.split('/')[-1]

the resulting fileName is equation?tex=%28n%2Cm%2Cl%29%3D%284%2C0%2C3%29, which cannot be created as an image file, for example:

The program's output is:
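
One possible way around file names such as the one above is to stop deriving the name from the last URL segment and hash the whole URL instead. The following is a minimal sketch, not the project's actual fix: safe_file_name, its default_ext parameter, and the example.com host are hypothetical, and the '.png' fallback is an assumption about what the formula images are.

```python
import hashlib
import os

try:
    from urllib.parse import urlsplit  # Python 3
except ImportError:
    from urlparse import urlsplit  # Python 2


def safe_file_name(img_href, default_ext='.png'):
    """Map an image URL to a short, fixed-length file name.

    The name is the MD5 hex digest of the full URL, so it never contains
    '?', '=' or '%', and its length does not grow with the query string.
    """
    # Take the extension from the URL path only, ignoring the query string.
    path = urlsplit(img_href).path
    ext = os.path.splitext(path)[1].lower()
    # Fall back to an assumed extension when the path has no usable one.
    if not ext or len(ext) > 5 or not ext[1:].isalnum():
        ext = default_ext
    digest = hashlib.md5(img_href.encode('utf-8')).hexdigest()
    return digest + ext


# Hypothetical host; the query string is the one quoted in the report.
name = safe_file_name(
    'http://example.com/equation?tex=%28n%2Cm%2Cl%29%3D%284%2C0%2C3%29')
```

Because the name depends only on the URL, the same image always maps to the same file, so a downloader can still skip images it already has.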

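Write code documentation

Write code documentation, mainly covering how to use the epub library and the zhihu_parser library, as well as the architecture behind ZhihuHelp (知乎助手).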

Website for generating dynamic badges

http://issuestats.com/github/YaoZeyuan/ZhihuHelp__Python

Dynamic badge images for ZhihuHelp can be generated here.

Display problem with code blocks in answers

self.content = content.replace('\r', '').replace('\n', '')

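Is this the line that strips out indentation and the like? If so, indented code inside an answer ends up looking ugly, for example:

Ideally it would look like this:

I am also looking for a fix. Are there any pitfalls I should watch out for?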
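
One way to approach this is to remove line breaks everywhere except inside pre blocks, so that code keeps its layout. The sketch below assumes answer code is wrapped in pre tags and that a regular expression is good enough for this markup; strip_newlines_outside_pre is a hypothetical helper, not part of the project.

```python
import re

# Capturing group, so re.split() keeps the <pre> blocks in its result.
PRE_BLOCK = re.compile(r'(<pre\b.*?</pre>)', re.DOTALL | re.IGNORECASE)


def strip_newlines_outside_pre(content):
    """Drop '\\r' and '\\n' from content, leaving <pre>...</pre> untouched."""
    parts = PRE_BLOCK.split(content)
    cleaned = []
    for index, part in enumerate(parts):
        if index % 2 == 1:
            # Odd indices are the captured <pre> blocks: keep them verbatim.
            cleaned.append(part)
        else:
            cleaned.append(part.replace('\r', '').replace('\n', ''))
    return ''.join(cleaned)


sample = 'a\nb<pre>x\n  y</pre>c\n'
result = strip_newlines_outside_pre(sample)
```

A pitfall to keep in mind: nested or unclosed pre tags are not handled by this pattern, so a real HTML parser may be the safer choice for messy input.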

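Generated epub is reported as corrupted when opened after downloading a column

After downloading a column, the generated epub is reported as corrupted when opened. I repeated the download several times with the same result. It can be opened on a phone, but shows little more than the column title and broken images.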

UnicodeEncodeError: 'charmap'

It throws an error on a freshly installed Python 2.7.8.
The operating system is Windows 7 64 Bit.

C:\Users\Administrator\Desktop\1.7.3.7>python --version
Python 2.7.8

C:\Users\Administrator\Desktop\1.7.3.7>python zhihuHelp.py
Traceback (most recent call last):
  File "zhihuHelp.py", line 11, in <module>
    helper.start()
  File "C:\Users\Administrator\Desktop\1.7.3.7\src\main.py", line 47, in start
    self.check_update()
  File "C:\Users\Administrator\Desktop\1.7.3.7\src\main.py", line 99, in check_update
    print   u"µúÇµƒÑµ¢┤µû░πÇéπÇéπÇé"
  File "C:\Python27\lib\encodings\cp437.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-6: character maps to <undefined>
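
The traceback shows a unicode string being encoded with cp437, a console code page that has no Chinese characters. A minimal sketch of one workaround follows, assuming that replacing unprintable characters with '?' is acceptable; console_safe is a hypothetical helper and not something the project provides.

```python
import sys


def console_safe(text, encoding=None):
    """Return text with characters the console cannot show replaced by '?'."""
    if encoding is None:
        # sys.stdout.encoding can be missing or None when output is piped.
        encoding = getattr(sys.stdout, 'encoding', None) or 'ascii'
    # 'replace' swaps each unencodable character for '?' instead of raising.
    return text.encode(encoding, 'replace').decode(encoding)


message = console_safe(u'\u68c0\u67e5\u66f4\u65b0...', 'cp437')
```

Printing console_safe(message) instead of the raw string keeps the program running on such consoles, at the cost of unreadable status text. Switching the console to a code page that supports Chinese is the alternative when the text itself matters.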

Crashes when parsing http://www.zhihu.com/collection/37406996?page=19

An option for reporting exceptions should be added. Tracking them down one by one like this is too tedious.

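Add the break-word property to a tags

While reading the answer collection of @湖玛 Humar, one line contained an a link so long that it stretched the page.
word-wrap: break-word needs to be added in the CSS to constrain it.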

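Use negative margins to achieve the Zhihu Weekly (知乎周刊) effect

A negative margin can be added to the div to break out of the body's border limits. An equal amount of padding must be added as well to keep the text from overflowing.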

Latest source code fails to run after download

python zhihuHelp.py
Traceback (most recent call last):
File "zhihuHelp.py", line 6, in <module>
import bs4
ImportError: No module named bs4

From the above, the bs4 module is missing. I am on OS X; how do I install it?
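
The bs4 module is distributed on PyPI as beautifulsoup4, so pip install beautifulsoup4 is the usual way to get it. As a rough sketch, a start-up check like the one below could report missing dependencies up front instead of failing with a traceback. It assumes Python 3 (importlib.util), and missing_modules is a hypothetical helper.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from names that cannot be found."""
    return [name for name in names
            if importlib.util.find_spec(name) is None]


# 'json' ships with Python; the second name is deliberately bogus.
missing = missing_modules(['json', 'surely_not_a_real_module_xyz'])
for name in missing:
    print('Missing module: %s' % name)
```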

New error

'NoneType' object has no attribute 'img'

Add the page-break-inside property to tables to avoid page breaks

As the title says.

Parser failure

Scraping view-source:https://www.zhihu.com/collection/19928423?page=113 triggers Process finished with exit code -1073741571 (0xC00000FD).
I suspect the recursion goes too deep and overflows the stack.
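
Exit code 0xC00000FD is the Windows stack-overflow status, which fits the recursion theory. If recursion depth is indeed the cause, one general remedy is to walk the tree with an explicit stack instead of recursive calls. The sketch below shows the technique on plain nested lists; iter_leaves is a hypothetical helper and does not reflect the project's parser code.

```python
def iter_leaves(tree):
    """Yield the non-list leaves of a nested list, depth-first, left to right.

    An explicit stack of iterators replaces recursive calls, so nesting
    depth is limited by available memory rather than by the call stack.
    """
    stack = [iter([tree])]
    while stack:
        try:
            node = next(stack[-1])
        except StopIteration:
            # The current level is exhausted: go back up one level.
            stack.pop()
            continue
        if isinstance(node, list):
            stack.append(iter(node))
        else:
            yield node


flat = list(iter_leaves([1, [2, [3, 4]], 5]))

# A structure nested far deeper than the default recursion limit.
deep = 'leaf'
for _ in range(50000):
    deep = [deep]
deep_leaves = list(iter_leaves(deep))
```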

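Fix Config.py

For example, iterate over the attributes with dict.keys() instead of dir, and standardize the method names. The current method names are still not formal enough.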
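
To illustrate the difference: dir() lists methods and inherited names along with the settings, whereas the instance dictionary holds only the attributes set on the object. The sketch below uses a made-up Config class; the attribute names and setting_names are hypothetical and not taken from the project's Config.py.

```python
class Config(object):
    """Hypothetical stand-in for the project's Config class."""

    def __init__(self):
        self.max_thread = 10
        self.picture_quality = 1

    def helper(self):
        return None


def setting_names(config):
    """Return only the instance attributes, skipping methods and dunders."""
    # vars(config) is the instance __dict__; dir(config) would also
    # include 'helper', '__init__', and everything inherited from object.
    return sorted(vars(config).keys())


names = setting_names(Config())
```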

Why is the epub downloaded in mode 2 (HD) so much smaller than the one from mode 1?

Clean up RawBook

What a name - -
When there is time, rework this class, even if that means making it fully imperative. As it stands, it is unpleasant to read.

Pages containing TeX still fail, error code 11001

For example, the column http://zhuanlan.zhihu.com/everytingisphysics always reports a network connection error, and I do not know why.

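Image width problem with div.content img

The width is currently set to 100%, which scales every image up to full-screen width and makes some smaller images display incorrectly. It should be changed to max-width:100% to avoid this.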

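Failed to match the user's question/answer/column/collection/public-edit counts
Error details:
need more than 0 values to unpack
Timed-out page: http://www.zhihu.com/people/qiao-er-53/answers?order_by=vote_num&page=49
Reading answer pages, 3/67 pages remaining
Reading answer pages, 3/67 pages remaining
Timed out opening the page
Timed-out page: http://www.zhihu.com/people/qiao-er-53/answers?order_by=vote_num&page=25
Answers saved to the database successfully
Failed to match the user's question/answer/column/collection/public-edit counts
Error details:
need more than 0 values to unpack
Failed to match the user's following/follower counts
Error details:
need more than 0 values to unpack
Failed to match the user's upvote/thanks/favorited/shared counts
Error details:
need more than 0 values to unpack
Reading answer pages, 2/67 pages remaining
Reading answer pages, 1/67 pages remaining
Answers saved to the database successfully
The specified question was not collected
Error message:
'NoneType' object has no attribute '__getitem__'

This is the link: http://www.zhihu.com/people/qiao-er-53/answers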
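
The message "need more than 0 values to unpack" is what Python 2 raises when an empty sequence is unpacked into several names, which would happen if a pattern found nothing on the page. A defensive sketch follows, assuming the counts are pulled out with a regular expression; match_counts and the sample markup are hypothetical.

```python
import re


def match_counts(html, pattern, expected):
    """Return the first `expected` integer matches, or None if too few."""
    found = re.findall(pattern, html)
    if len(found) < expected:
        # Page layout changed or the request failed: let the caller decide.
        return None
    return [int(value) for value in found[:expected]]


counts = match_counts('<b>3</b><b>5</b>', r'<b>(\d+)</b>', 2)
if counts is not None:
    asks, answers = counts
```

Checking the length before unpacking lets the caller log which page failed and carry on, rather than raising in the middle of a run.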

Exception

Exception in thread Thread-83:
Traceback (most recent call last):
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 810, in __bootstrap_inner
self.run()
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 763, in run
self.__target(*self.__args, **self.__kwargs)
File "/Users/oyc/Desktop/zhihuhelp1.7.1.5/codes/epubBuilder/imgDownloader.py", line 76, in worker
imgFile = open(self.targetDir + fileName, 'wb')
IOError: [Errno 63] File name too long: u'./\u77e5\u4e4e\u56fe\u7247\u6c60/equation?tex=%5Clim%7Bn+%5Crightarrow+%5Cinfty+%7D%7BS%7Bn%7D+%7D+%3D%5Clim%7Bn+%5Crightarrow+%5Cinfty+%7D%7B%5Cfrac%7Bb%7Bn%2B1%7D+-b%7Bn%7D+%7D%7Ba%7Bn%2B1%7D+-a%7Bn%7D+%7D+%7D+%3D%5Clim%7Bn+%5Crightarrow+%5Cinfty+%7D%5Cfrac%7B%5Cln%5Cfrac%7B%28%28n%2B1%29%21%29%5E%7Bn%2B2%7D+%7D%7B%28%5Cprod%7Bi%3D0%7D%5E%7Bn%2B1%7D%28i%21%29+%29%5E%7B2%7D+%7D+-%5Cln%5Cfrac%7B%28n%21%29%5E%7Bn%2B1%7D+%7D%7B%28%5Cprod%7Bi%3D0%7D%5E%7Bn%7D%28i%21%29+%29%5E%7B2%7D+%7D+%7D%7B%28n%2B1%29%5E%7B2%7D-n%5E%7B2%7D++%7D+'

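Large numbers of "failed to open page" errors

Presumably the difference between http and https is not being handled. Zhihu has switched the entire site to https.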
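
If that guess is right, one simple mitigation is to rewrite Zhihu URLs to https before requesting them. This is a minimal sketch under that assumption; force_https is a hypothetical helper, and limiting the rewrite to zhihu.com hosts is a design choice made here, not something the report specifies.

```python
try:
    from urllib.parse import urlsplit, urlunsplit  # Python 3
except ImportError:
    from urlparse import urlsplit, urlunsplit  # Python 2


def force_https(url):
    """Upgrade http:// URLs on zhihu.com and its subdomains to https://."""
    parts = urlsplit(url)
    host = parts.hostname or ''
    is_zhihu = host == 'zhihu.com' or host.endswith('.zhihu.com')
    if parts.scheme == 'http' and is_zhihu:
        # SplitResult is a namedtuple, so _replace returns a modified copy.
        parts = parts._replace(scheme='https')
    return urlunsplit(parts)


upgraded = force_https('http://www.zhihu.com/people/wheeler')
```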

Fix the bug where attributes such as asks and questions in author fail to be scraped

Scraping of attributes such as asks, questions, answers, and logs fails.

Problem appeared after updating to today's version (1.7.1.5); the previous version (1.7.1.4) worked

Building section 1 of e-book 1
Traceback (most recent call last):
File "zhihuHelp.py", line 8, in <module>
mainClass.helperStart()
File "/Users/oyc/Desktop/zhihuhelp1.7.1.5/codes/main.py", line 83, in helperStart
urlInfo = self.getUrlInfo(rawUrl)
File "/Users/oyc/Desktop/zhihuhelp1.7.1.5/codes/main.py", line 202, in getUrlInfo
urlInfo['worker'] = AuthorWorker(conn = self.conn, urlInfo = urlInfo)
File "/Users/oyc/Desktop/zhihuhelp1.7.1.5/codes/worker.py", line 30, in __init__
self.setCookie()
File "/Users/oyc/Desktop/zhihuhelp1.7.1.5/codes/worker.py", line 83, in setCookie
cookieStr = Var[0]
TypeError: 'NoneType' object has no attribute '__getitem__'

Style problem with div.question-info on the question page

div.question-info should float left, but the CSS has it floating right.

Parser (or Zhihu) failure

https://www.zhihu.com/topic/19571444/top-answers?page=17

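Zhihu has deleted the question 『发动**的深层原因是什么？』 ("What is the deeper reason for launching **?"), yet it still shows up under top answers, which causes the parse to fail.

To be fixed.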

Test

Testing issue submission

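Duplicate images in the generated epub

Hello, and thank you very much.
When I download the answers under the question http://www.zhihu.com/question/20502275 the epub file is generated normally, but every image in the answers appears twice. The answer authors' avatars are fine.
This happens in both versions 1.7.2 and 1.7.3. In 1.7.2, the duplication occurs in both HD image mode and normal image mode.
How can this be fixed?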

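Update the epub library and parser library code

Modify the epub and parser code so that they can be split out and installed directly in other applications.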

Multithreading exception, data stack error

http://www.zhihu.com/people/wheeler

http://www.zhihu.com/people/emily-lou

The problem occurs when scraping these two users, with the default account and settings, image mode 1.

URL parser exits with an error when downloading a private collection

Traceback (most recent call last):
File "D:/MyDocument/Documents/GitHub/ZhihuHelp__Python/zhihuhelp1.7.0/zhihuHelp.py", line 15, in <module>
mainClass.helperStart()
File "D:\MyDocument\Documents\GitHub\ZhihuHelp__Python\zhihuhelp1.7.0\codes\main.py", line 103, in helperStart
collectionWorker.start()
File "D:\MyDocument\Documents\GitHub\ZhihuHelp__Python\zhihuhelp1.7.0\codes\worker.py", line 282, in start
self.leader()
File "D:\MyDocument\Documents\GitHub\ZhihuHelp__Python\zhihuhelp1.7.0\codes\worker.py", line 309, in leader
self.catchFrontInfo()
File "D:\MyDocument\Documents\GitHub\ZhihuHelp__Python\zhihuhelp1.7.0\codes\worker.py", line 463, in catchFrontInfo
infoDict = parse.getInfoDict()
File "D:\MyDocument\Documents\GitHub\ZhihuHelp__Python\zhihuhelp1.7.0\codes\contentParse.py", line 376, in getInfoDict
1].a.get_text())
IndexError: list index out of range

This will be fixed in the next version.

E-book generated successfully but the content has errors

The scraped address is: http://zhuanlan.zhihu.com/qinchao

“This page contains the following errors:
error on line 67 at column 7: Opening and ending tag mismatch: img line 0 and div
Below is a rendering of the page up to the first error.”

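Excerpt from: ZhihuHelp1.7.0. “专栏_覃超帝国兴亡史 - 在希望的田野上(qinchao)_知乎回答集锦”. iBooks.

By my count only 37 answers were scraped, whereas visiting Zhihu normally in a browser shows 127 answers (as of now), which means most of the answers were missed.

Also, if the thread count is set to 1, it seems to enter an infinite loop and cannot scrape properly.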
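
The "Opening and ending tag mismatch: img ... and div" error quoted above suggests an img tag that was never closed, which HTML tolerates but the XHTML inside an epub does not. A rough sketch of one repair is to self-close every img tag before the page is written. self_close_img_tags is a hypothetical helper, and the pattern assumes no attribute value contains a '>' character.

```python
import re

# Matches <img ...> with or without an existing trailing slash.
IMG_TAG = re.compile(r'<img\b([^>]*?)\s*/?>', re.IGNORECASE)


def self_close_img_tags(html):
    """Rewrite every <img ...> tag as a self-closed <img ... /> tag."""
    return IMG_TAG.sub(r'<img\1 />', html)


fixed = self_close_img_tags('<div><img src="a.jpg"></div>')
```

The substitution leaves already self-closed tags in the same form, so running it twice is harmless.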