yaozeyuan / zhihuhelp_archived
(No longer maintained) Quickly converts Zhihu content into EPUB e-books. Please move to https://github.com/YaoZeyuan/zhihuhelp_with_node
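Cause of the exception: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:590)
It is basically unusable; every run hits this problem.
When an image file does not exist, the program raises an error and exits immediately.
I will refactor the epub file tonight.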
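One way to avoid aborting on a missing image is to check which files exist before packaging them. The sketch below is illustrative only: `split_existing_images` is a hypothetical helper, not a function from ZhihuHelp, and it is written for Python 3 rather than the Python 2.7 the project targets.

```python
import os


def split_existing_images(paths):
    """Separate image paths into those present on disk and those missing."""
    found, missing = [], []
    for path in paths:
        # os.path.isfile is False for missing files and for directories.
        if os.path.isfile(path):
            found.append(path)
        else:
            missing.append(path)
    return found, missing
```

The caller could then package only the `found` list and log the `missing` one instead of exiting.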
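For example, the question "In ten lines or fewer, what cool Mathematica code have you written?" (十行以内,你写过哪些比较酷的 Mathematica 代码?) uses a formula, which is written out like this
Based on this code in dict2Html.py:
    def imgFix(self, content):
        for imgTag in re.findall(r'<img.*?>', content):
            src = re.search(r'(?<=src=").*?(?=")', imgTag)
            ...
and
    def getFileName(self, imgHref = ''):
        return imgHref.split('/')[-1]
the resulting fileName is equation?tex=%28n%2Cm%2Cl%29%3D%284%2C0%2C3%29, so the image file cannot be created, for example:
The program output is: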
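A possible workaround is to derive the local file name from a hash of the full URL rather than from its last path segment, so that characters such as `?` and very long query strings never reach the file system. This is a minimal sketch under stated assumptions, not the project's fix: `safe_image_name` and the `.png` fallback extension are hypothetical, and the code is written for Python 3.

```python
import hashlib
import os
from urllib.parse import urlsplit


def safe_image_name(img_href, default_ext=".png"):
    """Build a short, file-system-safe name from an image URL."""
    # Take the extension from the URL path only, ignoring the query string.
    ext = os.path.splitext(urlsplit(img_href).path)[1].lower()
    if not (1 < len(ext) <= 5 and ext[1:].isalnum()):
        ext = default_ext  # assumption: fall back to a fixed extension
    # A hex digest is always 32 characters, whatever the URL length.
    digest = hashlib.md5(img_href.encode("utf-8")).hexdigest()
    return digest + ext
```

Because the name is a fixed-length digest, the same approach would also sidestep "File name too long" errors for long equation URLs.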
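Write code documentation, mainly covering how to use the epub library and the zhihu_parser library, plus the architectural ideas behind ZhihuHelp.
http://issuestats.com/github/YaoZeyuan/ZhihuHelp__Python
Dynamic images for ZhihuHelp can be generated here.
Add a table of contents to the epub; without one, reading on a KPW is very inconvenient.
After downloading a column, the generated epub is reported as corrupted when opened. I downloaded it several times with the same result. It does open on a phone, but it shows little more than the column title and broken images.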
It raised an error on a freshly installed Python 2.7.8.
The operating system is Windows 7 64 Bit.
C:\Users\Administrator\Desktop\1.7.3.7>python --version
Python 2.7.8
C:\Users\Administrator\Desktop\1.7.3.7>python zhihuHelp.py
Traceback (most recent call last):
  File "zhihuHelp.py", line 11, in <module>
    helper.start()
  File "C:\Users\Administrator\Desktop\1.7.3.7\src\main.py", line 47, in start
    self.check_update()
  File "C:\Users\Administrator\Desktop\1.7.3.7\src\main.py", line 99, in check_update
    print u"检查更新。。。"
  File "C:\Python27\lib\encodings\cp437.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-6: character maps to <undefined>
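The traceback shows the console encoding (cp437) cannot represent the Chinese status message. One hedged way to keep such a print from crashing is to replace unencodable characters before writing. The helper name `console_safe` is hypothetical, and the sketch targets Python 3, not the Python 2.7.8 used in the report.

```python
import sys


def console_safe(text, encoding=None):
    """Return text with characters the console cannot encode replaced by '?'."""
    # Fall back to ASCII when the stream does not report an encoding.
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    return text.encode(encoding, errors="replace").decode(encoding)


print(console_safe("检查更新。。。"))
```

The message becomes less readable on such a console, but the program keeps running.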
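An option for reporting exceptions should be added; hunting for errors one by one like this is too tedious.
While reading the answer collection of @湖玛 Humar, one line containing an overly long a link stretched the page.
The word-wrap: break-word property needs to be added to the CSS to constrain it.
Consider writing a separate script that analyses a user's follow records and generates a ReadList on its own.
The CSS needs to be modified.
A negative margin can be added to the div to break out of the body border. An equal amount of padding must be added as well to keep text from overflowing.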
python zhihuHelp.py
Traceback (most recent call last):
  File "zhihuHelp.py", line 6, in <module>
    import bs4
ImportError: No module named bs4
The output above shows that the bs4 module is missing. I am on OS X; how do I get it?
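The `bs4` module is provided by the `beautifulsoup4` package on PyPI, so it is normally installed with `pip install beautifulsoup4`. A start-up check along the following lines could report missing dependencies up front; `missing_modules` is a hypothetical helper, and the sketch assumes Python 3.4 or later for `importlib.util.find_spec`.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(["bs4"])
if missing:
    print("Missing modules: " + ", ".join(missing))
```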
'NoneType' object has no attribute 'img'
As the title says.
Refactor the epub generator to improve its robustness and code conventions.
Fetching view-source:https://www.zhihu.com/collection/19928423?page=113 triggers "Process finished with exit code -1073741571 (0xC00000FD)".
I suspect the recursion went too deep and overflowed the stack.
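If deep recursion is indeed the cause, one general remedy is to walk the nested structure with an explicit stack instead of recursive calls. The sketch below shows the idea on plain nested lists; it does not reproduce the project's parser, and `iter_leaves` is a hypothetical name.

```python
def iter_leaves(root):
    """Yield the non-list items of a nested list in document order, without recursion."""
    stack = [root]
    while stack:
        node = stack.pop()
        if isinstance(node, list):
            # Push children reversed so the first child is popped first.
            stack.extend(reversed(node))
        else:
            yield node
```

The depth the walk can handle is then bounded by available memory rather than by the interpreter's call stack.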
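For example, iterate over attributes with dict.keys() rather than dir, and standardise the method names; the current names are still not formal enough.
What a name this is - -
When there is time, rework this class, even if that means making it fully imperative. In its current state it is uncomfortable to read.
For example, this column, http://zhuanlan.zhihu.com/everytingisphysics , always reports a network connection error, and I do not know why.
The value is currently set to 100%, which scales images up to full screen and makes some smaller images display incorrectly. It should be changed to max-width:100% to avoid this.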
Failed to match the user's question count / answer count / column count / collection count / public edit count
Error details:
need more than 0 values to unpack
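"need more than 0 values to unpack" is what Python 2 raises when an empty sequence is unpacked into several names, which would happen if the pattern found nothing on the page. A defensive sketch, assuming the counts are extracted with a regular expression (the source does not show that code), returns `None` instead of unpacking blindly. `parse_counts` and the sample pattern are hypothetical.

```python
import re


def parse_counts(html, pattern):
    """Return the captured groups as ints, or None when the pattern does not match."""
    match = re.search(pattern, html)
    if match is None:
        # The page layout changed or the request failed: let the caller decide.
        return None
    return tuple(int(value) for value in match.groups())


counts = parse_counts("3 asks, 14 answers", r"(\d+) asks, (\d+) answers")
if counts is not None:
    asks, answers = counts
```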
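Timed-out page: http://www.zhihu.com/people/qiao-er-53/answers?order_by=vote_num&page=49
Reading answer pages; 3/67 pages still waiting to be read
Reading answer pages; 3/67 pages still waiting to be read
Timed out opening the page
Timed-out page: http://www.zhihu.com/people/qiao-er-53/answers?order_by=vote_num&page=25
Answers saved to the database successfully
Failed to match the user's question count / answer count / column count / collection count / public edit count
Error details:
need more than 0 values to unpack
Failed to match the user's following count / follower count
Error details:
need more than 0 values to unpack
Failed to match the user's upvote count / thanks count / times collected / times shared
Error details:
need more than 0 values to unpack
Reading answer pages; 2/67 pages still waiting to be read
Reading answer pages; 1/67 pages still waiting to be read
Answers saved to the database successfully
The specified question was not collected
Error message:
'NoneType' object has no attribute '__getitem__'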
Exception in thread Thread-83:
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/Users/oyc/Desktop/zhihuhelp1.7.1.5/codes/epubBuilder/imgDownloader.py", line 76, in worker
    imgFile = open(self.targetDir + fileName, 'wb')
IOError: [Errno 63] File name too long: u'./\u77e5\u4e4e\u56fe\u7247\u6c60/equation?tex=%5Clim%7Bn+%5Crightarrow+%5Cinfty+%7D%7BS%7Bn%7D+%7D+%3D%5Clim%7Bn+%5Crightarrow+%5Cinfty+%7D%7B%5Cfrac%7Bb%7Bn%2B1%7D+-b%7Bn%7D+%7D%7Ba%7Bn%2B1%7D+-a%7Bn%7D+%7D+%7D+%3D%5Clim%7Bn+%5Crightarrow+%5Cinfty+%7D%5Cfrac%7B%5Cln%5Cfrac%7B%28%28n%2B1%29%21%29%5E%7Bn%2B2%7D+%7D%7B%28%5Cprod%7Bi%3D0%7D%5E%7Bn%2B1%7D%28i%21%29+%29%5E%7B2%7D+%7D+-%5Cln%5Cfrac%7B%28n%21%29%5E%7Bn%2B1%7D+%7D%7B%28%5Cprod%7Bi%3D0%7D%5E%7Bn%7D%28i%21%29+%29%5E%7B2%7D+%7D+%7D%7B%28n%2B1%29%5E%7B2%7D-n%5E%7B2%7D++%7D+'
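Presumably the difference between http and https is not handled. Zhihu has switched the whole site to https.
Fetching attributes such as asks, questions, answers and logs fails.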
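If the http/https guess is right, normalising every input URL to https before fetching would be one way to handle it. This is a sketch written for Python 3, with the hypothetical helper name `force_https`; the source does not show how the project builds its request URLs.

```python
from urllib.parse import urlsplit


def force_https(url):
    """Rewrite an http:// URL to https://, leaving other schemes untouched."""
    parts = urlsplit(url)
    if parts.scheme == "http":
        parts = parts._replace(scheme="https")
    return parts.geturl()
```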
Building section 1 of e-book 1
Traceback (most recent call last):
  File "zhihuHelp.py", line 8, in <module>
    mainClass.helperStart()
  File "/Users/oyc/Desktop/zhihuhelp1.7.1.5/codes/main.py", line 83, in helperStart
    urlInfo = self.getUrlInfo(rawUrl)
  File "/Users/oyc/Desktop/zhihuhelp1.7.1.5/codes/main.py", line 202, in getUrlInfo
    urlInfo['worker'] = AuthorWorker(conn = self.conn, urlInfo = urlInfo)
  File "/Users/oyc/Desktop/zhihuhelp1.7.1.5/codes/worker.py", line 30, in __init__
    self.setCookie()
  File "/Users/oyc/Desktop/zhihuhelp1.7.1.5/codes/worker.py", line 83, in setCookie
    cookieStr = Var[0]
TypeError: 'NoneType' object has no attribute '__getitem__'
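This is the link to the question: http://www.zhihu.com/question/34500493
I tried many times, and the generated epub always contains only the question description without any answers, whereas questions with more answers are generated correctly.
This is the target link: http://www.zhihu.com/people/jun-si-43/answers
The target page lists more than 80 answers, but only 11 were fetched.
Repeated attempts give the same result: only 11.
https://www.zhihu.com/topic/19571444/top-answers?page=17
The question "What is the underlying reason for launching **?" has been deleted by Zhihu but still appears in the top answers, which causes parsing to fail.
To be fixed.
Issue submission test
Hello, and thank you very much.
http://www.zhihu.com/question/20502275 When I download the answers to this question, the epub is generated normally, but every image inside the answers appears twice. The authors' avatars are fine.
This happens in both version 1.7.2 and 1.7.3. In 1.7.2, the duplication occurs in both high-resolution and normal image modes.
How can this be fixed?
Modify the epub and parser code so that they can be split out and installed directly in other applications.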
Traceback (most recent call last):
  File "D:/MyDocument/Documents/GitHub/ZhihuHelp__Python/zhihuhelp1.7.0/zhihuHelp.py", line 15, in <module>
    mainClass.helperStart()
  File "D:\MyDocument\Documents\GitHub\ZhihuHelp__Python\zhihuhelp1.7.0\codes\main.py", line 103, in helperStart
    collectionWorker.start()
  File "D:\MyDocument\Documents\GitHub\ZhihuHelp__Python\zhihuhelp1.7.0\codes\worker.py", line 282, in start
    self.leader()
  File "D:\MyDocument\Documents\GitHub\ZhihuHelp__Python\zhihuhelp1.7.0\codes\worker.py", line 309, in leader
    self.catchFrontInfo()
  File "D:\MyDocument\Documents\GitHub\ZhihuHelp__Python\zhihuhelp1.7.0\codes\worker.py", line 463, in catchFrontInfo
    infoDict = parse.getInfoDict()
  File "D:\MyDocument\Documents\GitHub\ZhihuHelp__Python\zhihuhelp1.7.0\codes\contentParse.py", line 376, in getInfoDict
    1].a.get_text())
IndexError: list index out of range
This will be fixed in the next release.
The address being fetched is: http://zhuanlan.zhihu.com/qinchao
“This page contains the following errors:
error on line 67 at column 7: Opening and ending tag mismatch: img line 0 and div
Below is a rendering of the page up to the first error.”
Excerpt from: ZhihuHelp1.7.0. "专栏_覃超帝国兴亡史 - 在希望的田野上(qinchao)_知乎回答集锦". iBooks.
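The reader error above ("Opening and ending tag mismatch: img ... and div") is typical of an `<img>` tag that is not self-closed inside an XHTML document, which EPUB content must be. One possible repair, sketched here with a regular expression in the same spirit as the project's `imgFix`, is to rewrite every `<img ...>` as `<img ... />`. `close_img_tags` is a hypothetical helper, and whether this is the actual cause in this book is an assumption.

```python
import re


def close_img_tags(html):
    """Turn <img ...> into <img ... /> so the markup is well-formed XHTML."""

    def fix(match):
        tag = match.group(0)
        if tag.endswith("/>"):
            return tag  # already self-closed
        return tag[:-1].rstrip() + " />"

    return re.sub(r"<img\b[^>]*>", fix, html)
```

A regular expression is fragile against unusual markup such as a `>` inside an attribute value; a real fix might serialise the page through an HTML parser instead.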
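Add extended settings so that answers can be filtered by criteria such as upvote count, word count, or keeping only the 10 most upvoted answers under each question.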
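The requested filters could look roughly like the sketch below. Everything in it is an assumption made for illustration: the function name `filter_answers`, the dictionary keys `question_id`, `votes` and `content`, and the use of character count as a stand-in for word count are not taken from the project.

```python
from collections import defaultdict


def filter_answers(answers, min_votes=0, min_length=0, top_n=None):
    """Filter answers by upvotes and length, optionally keeping the top N per question."""
    kept = [
        answer for answer in answers
        if answer["votes"] >= min_votes and len(answer["content"]) >= min_length
    ]
    if top_n is None:
        return kept
    # Group the surviving answers by question, then keep the most upvoted ones.
    by_question = defaultdict(list)
    for answer in kept:
        by_question[answer["question_id"]].append(answer)
    result = []
    for group in by_question.values():
        group.sort(key=lambda answer: answer["votes"], reverse=True)
        result.extend(group[:top_n])
    return result
```

Calling `filter_answers(answers, top_n=10)` would correspond to the "top 10 answers per question" case in the request.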
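Duokan says the Zhihu Weekly styling was produced with Duokan's own proprietary book-authoring software and proprietary technology, and that no sample books are available.
It will have to be hacked by hand.
Hi, downloading the collection http://www.zhihu.com/collection/61913303 makes the program exit with an error; the error message is shown in the screenshot (Mac OSX EI, python 2.7.10):
In addition, the epub generated from the example URLs also contains error messages.
20s is still too long; make it 10s.
For example, add this URL to the readlist:
http://www.zhihu.com/question/31064773
A count of the fetched answers gives only 37, while visiting Zhihu normally in a browser shows 127 answers (as of now), so most of the answers are being missed.
Also, setting the thread count to 1 seems to put the program into an infinite loop, and nothing is fetched.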