
python_scripts's Introduction

python_scripts's People

Contributors

lzjun, lzjun567


python_scripts's Issues

(unicode error) 'utf-8'

File "crawler.py", line 35
"""
SyntaxError: (unicode error) 'utf-8' codec can't decode byte 0xc5 in position 5: invalid continuation byte
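
This SyntaxError means crawler.py itself is not saved as UTF-8 (byte 0xc5 suggests a GBK-encoded save on Windows). Re-saving the file as UTF-8 is the clean fix; declaring the file's real encoding also works. A minimal sketch, assuming the file really is GBK-encoded:

    # -*- coding: gbk -*-
    # First line of crawler.py: tells the interpreter how to decode the source bytes.
    # The cleaner fix is to re-save the file as UTF-8 and keep the default utf-8 encoding.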

The Sina Weibo API does not return JSON

Running heart.py, response.json()[0] raises:

raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded

The Sina Weibo API returns HTML rather than JSON, hence the error.
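
A minimal defensive sketch before calling .json(), assuming a requests Response named response; the URL below is a placeholder, not the real Weibo endpoint:

    import requests

    response = requests.get("https://example.com/api")  # placeholder URL
    try:
        data = response.json()[0]
    except ValueError:
        # The server answered with HTML (e.g. a login or error page) instead of JSON.
        print("Not JSON; first 200 characters of the body:")
        print(response.text[:200])
        data = None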

Images cannot be downloaded

When crawling the pages I want, every image on the page shows "failed to load".

list index out of range

Traceback (most recent call last):
File "crawler.py", line 163, in <module>
crawler.run()
File "crawler.py", line 90, in run
for index, url in enumerate(self.parse_menu(self.request(self.start_url))):
File "crawler.py", line 116, in parse_menu
menu_tag = soup.find_all(class_="uk-nav uk-nav-side")[1]

A couple of small issues

On Windows 10 you need to pip install beautifulsoup4; without the "4", pip installs the old BeautifulSoup 3 by default.

On both Windows and Ubuntu it reports character-encoding errors.

windows
ERROR:root:parse error
Traceback (most recent call last):
File "crawler.py", line 56, in parse_url_to_html
html = html.encode("utf-8")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 134: ordinal not in range(128)

ubuntu
ERROR:root:parse error
Traceback (most recent call last):
File "crawler.py", line 56, in parse_url_to_html
html = html.encode("utf-8")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 134: ordinal not in range(128)
ERROR:root:parse error
Traceback (most recent call last):
File "crawler.py", line 56, in parse_url_to_html
html = html.encode("utf-8")
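
This is the classic Python 2 pattern: html is a byte string at that point, so html.encode("utf-8") first implicitly decodes it as ASCII and fails on the first non-ASCII byte. A minimal sketch of the usual fix, assuming the page was fetched with requests (the URL is a placeholder):

    import requests

    response = requests.get("https://example.com/page")  # placeholder URL
    html = response.content.decode("utf-8")              # decode the raw bytes explicitly
    data = html.encode("utf-8")                          # encoding the decoded text is now safe
    # Or simply use response.text, which requests has already decoded.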

OSError: wkhtmltopdf exited with non-zero code -6. error:

Traceback (most recent call last):
File "crawler.py", line 165, in
crawler.run()
File "crawler.py", line 99, in run
pdfkit.from_file(htmls, self.name + ".pdf", options=options)
File "/usr/local/lib/python3.4/dist-packages/pdfkit/api.py", line 49, in from_file
return r.to_pdf(output_path)
File "/usr/local/lib/python3.4/dist-packages/pdfkit/pdfkit.py", line 159, in to_pdf
raise IOError("wkhtmltopdf exited with non-zero code {0}. error:\n{1}".format(exit_code, stderr))
OSError: wkhtmltopdf exited with non-zero code -6. error:
The switch --outline-depth, is not support using unpatched qt, and will be ignored.
QXcbConnection: Could not connect to display

list index out of range

python3 crawler.py
Traceback (most recent call last):
File "crawler.py", line 163, in
crawler.run()
File "crawler.py", line 90, in run
for index, url in enumerate(self.parse_menu(self.request(self.start_url))):
File "crawler.py", line 116, in parse_menu
menu_tag = soup.find_all(class_="uk-nav uk-nav-side")[1]
IndexError: list index out of range
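
The IndexError means soup.find_all(class_="uk-nav uk-nav-side") returned fewer than two elements, usually because the page failed to download or its markup changed. A minimal guard sketch with BeautifulSoup; the inline html string is a placeholder for the real response body:

    from bs4 import BeautifulSoup

    html = "<html><body></body></html>"  # placeholder for the downloaded page
    soup = BeautifulSoup(html, "html.parser")
    menus = soup.find_all(class_="uk-nav uk-nav-side")
    if len(menus) < 2:
        # Layout changed or an error page came back; fail with a clearer message.
        raise RuntimeError("expected 2 'uk-nav uk-nav-side' blocks, found %d" % len(menus))
    menu_tag = menus[1]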

Converting Liao Xuefeng's blog to PDF fails at runtime

Error message: OSError: wkhtmltopdf exited with non-zero code 1. error:
You need to specify at least one input file, and exactly one output file

What does pdfkit use to auto-generate the table of contents? After I modified the code, the generated PDF no longer has one.
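
The "at least one input file" message comes from wkhtmltopdf and usually means the list passed to pdfkit.from_file ended up empty. As for the table of contents: wkhtmltopdf builds the PDF outline from the <h1>-<h6> headings in the saved HTML, so stripping those tags removes the outline. A minimal sketch with placeholder file names:

    import pdfkit

    htmls = ["0.html", "1.html"]  # pages written earlier by the crawler
    if not htmls:
        raise SystemExit("no HTML files were generated; nothing to convert")
    # outline-depth controls how many heading levels end up in the PDF outline.
    pdfkit.from_file(htmls, "liaoxuefeng.pdf", options={"outline-depth": 10})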

Error while creating the PDF

Traceback (most recent call last):
File "crawler.py", line 165, in
crawler.run()
File "crawler.py", line 99, in run
pdfkit.from_file(htmls, self.name + ".pdf", options=options)
File "/usr/local/lib/python3.5/dist-packages/pdfkit/api.py", line 49, in from_file
return r.to_pdf(output_path)
File "/usr/local/lib/python3.5/dist-packages/pdfkit/pdfkit.py", line 156, in to_pdf
raise IOError('wkhtmltopdf reported an error:\n' + stderr)
OSError: wkhtmltopdf reported an error:
The switch --outline-depth, is not support using unpatched qt, and will be ignored.
Error: This version of wkhtmltopdf is build against an unpatched version of QT, and does not support more then one input document.
Exit with code 1, due to unknown error.
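
The distro-packaged wkhtmltopdf is built against unpatched Qt, which accepts only a single input document (and ignores --outline-depth). Installing the patched-Qt build from wkhtmltopdf.org is the clean fix; a rough workaround is to merge the saved pages into one HTML file first. A sketch of that workaround, with placeholder file names:

    import pdfkit

    htmls = ["0.html", "1.html", "2.html"]  # pages saved by the crawler (placeholders)
    with open("merged.html", "w", encoding="utf-8") as out:
        for name in htmls:
            with open(name, encoding="utf-8") as f:
                out.write(f.read())
    # A single input file works even with the unpatched-Qt build.
    pdfkit.from_file("merged.html", "liaoxuefeng.pdf")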

Many images in the second half fail to download; is the cache wkhtmltopdf allocates too small?

Many images in the second half fail to download; is the cache wkhtmltopdf allocates too small? The failed images are always from the second half of the pages, and the error output does not say anything useful:
Traceback (most recent call last):
File "crawler.py", line 165, in
crawler.run()
File "crawler.py", line 99, in run
pdfkit.from_file(htmls, self.name + ".pdf", options=options)
File "D:\Program Files\Python36\lib\site-packages\pdfkit\api.py", line 49, in from_file
return r.to_pdf(output_path)
File "D:\Program Files\Python36\lib\site-packages\pdfkit\pdfkit.py", line 156, in to_pdf
raise IOError('wkhtmltopdf reported an error:\n' + stderr)
OSError: wkhtmltopdf reported an error:
Loading pages (1/6)
Warning: Failed to load file:///static/img/404.png (ignore)
Counting pages (2/6)
Resolving links (4/6)
Loading headers and footers (5/6)
Printing pages (6/6)
Done
Exit with code 1 due to network error: ProtocolUnknownError
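
The warning shows a site-relative image path that ended up as a file:// URL instead of an absolute http URL, and the final ProtocolUnknownError points at a similarly malformed resource rather than a cache limit. Besides fixing the URL rewrite, wkhtmltopdf can be told not to abort when a resource fails; a sketch of that option via pdfkit, with placeholder file names:

    import pdfkit

    options = {
        # Keep converting even if an individual image or resource fails to load.
        "load-error-handling": "ignore",
        "load-media-error-handling": "ignore",
    }
    pdfkit.from_file(["0.html", "1.html"], "liaoxuefeng.pdf", options=options)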

A correction to the image regular-expression bug

        def func(m):
            if not m.group(3).startswith("http"):
                rtn = m.group(1) + get_domain(url) + "/" + m.group(2) + m.group(3)
                #rtn = m.group(1) + domain + m.group(2) + m.group(3)
                return rtn
            else:
                return m.group(1) + m.group(2) + m.group(3)
        
        html = re.compile(pattern).sub(func, html)

I found a problem in it, so I modified it as follows:

[screenshot: the modified code]

You can check the test result at https://regex101.com/
It is m.group(2) that actually matches the URL.
[screenshot: the regex101 match groups]
So it is not that m.group(3) is wrong as such; it just matches the part shown in the screenshot.

As for that regex substitution, I could not follow it at first; the official reference gives
re.sub(pattern, repl, string, count=0, flags=0)

where repl is either a string or a function.
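
A minimal, self-contained illustration of re.sub with a function as repl, which is what func above relies on; the pattern and attribute name here are hypothetical, not the crawler's actual regex:

    import re

    html = '<img data-src="/files/pic.png">'

    def func(m):
        # group(1) is the attribute prefix, group(2) the captured URL.
        src = m.group(2)
        if not src.startswith("http"):
            src = "https://www.liaoxuefeng.com" + src
        return m.group(1) + src

    # repl may be a plain string or, as here, a function called once per match.
    html = re.sub(r'(data-src=")([^"]+)', func, html)
    print(html)  # <img data-src="https://www.liaoxuefeng.com/files/pic.png">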

Traceback (most recent call last):
File "crawler.py", line 165, in <module>
crawler.run()
File "crawler.py", line 99, in run
pdfkit.from_file(htmls, self.name + ".pdf", options=options)
File "/home/kong/.virtualenvs/Py3/lib/python3.5/site-packages/pdfkit/api.py", line 49, in from_file
return r.to_pdf(output_path)
File "/home/kong/.virtualenvs/Py3/lib/python3.5/site-packages/pdfkit/pdfkit.py", line 159, in to_pdf
raise IOError("wkhtmltopdf exited with non-zero code {0}. error:\n{1}".format(exit_code, stderr))
OSError: wkhtmltopdf exited with non-zero code 1. error:
Loading pages (1/6)
[========> ] 14%
(wkhtmltopdf:13716): Gtk-WARNING **: cannot open display:

ImportError: No module named 'pdfkit'

root@raspberrypi:/home/pi/python/crawler_html2pdf/pdf# python3 crawler.py
Traceback (most recent call last):
File "crawler.py", line 14, in
import pdfkit
ImportError: No module named 'pdfkit'

Why does this happen?
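
pdfkit is simply not installed for the Python 3 interpreter being used (pip and pip3 can point at different environments). A quick check sketch using only the standard library:

    import importlib.util

    if importlib.util.find_spec("pdfkit") is None:
        raise SystemExit("pdfkit is not installed for this interpreter; "
                         "install it with: python3 -m pip install pdfkit")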

Error in runoob2pdf: OSError: No wkhtmltopdf executable found: "b''"

OSError: No wkhtmltopdf executable found: "b''"

Regarding that error:
What I added to the PATH environment variable is D:\Program Files\wkhtmltopdf\bin\
I thought the problem was Python mis-parsing the \b, so I changed it to D:\\Program Files\\wkhtmltopdf\\bin\\, but that still did not work.
I then followed this Stack Overflow answer:
http://stackoverflow.com/questions/27673870/cant-create-pdf-using-python-pdfkit-error-no-wkhtmltopdf-executable-found
and changed the code to

config = pdfkit.configuration(wkhtmltopdf=r"D:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe")
pdfkit.from_file(htmls, file_name, options=options, configuration=config)

and it runs normally now.

Is this the only way to handle it?
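
Passing the full wkhtmltopdf.exe path through pdfkit.configuration is one fix; the PATH route also works, but only if the changed PATH is actually visible to the Python process (a shell or IDE started before the edit will not see it). A quick check sketch using the standard library:

    import shutil

    # None means this Python process cannot find wkhtmltopdf on its PATH,
    # in which case the explicit configuration(wkhtmltopdf=...) approach is needed.
    print(shutil.which("wkhtmltopdf"))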

Error when running with Python 3

Traceback (most recent call last):
File "crawler.py", line 56, in parse_url_to_html
f.write(html)
TypeError: a bytes-like object is required, not 'str'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "crawler.py", line 119, in
main()
File "crawler.py", line 108, in main
htmls = [parse_url_to_html(url, str(index) + ".html") for index, url in enumerate(urls)]
File "crawler.py", line 108, in
htmls = [parse_url_to_html(url, str(index) + ".html") for index, url in enumerate(urls)]
File "crawler.py", line 60, in parse_url_to_html
print(e.message)
AttributeError: 'TypeError' object has no attribute 'message'
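
Two Python 3 issues are stacked here: writing a str to a file opened in binary mode, and using e.message, which Python 3 exceptions no longer have. A minimal sketch of both fixes; the html value is a placeholder for the decoded page text:

    html = "<html><body>placeholder page</body></html>"
    try:
        # Open in text mode with an explicit encoding so f.write() accepts a str.
        with open("0.html", "w", encoding="utf-8") as f:
            f.write(html)
    except OSError as e:
        # Python 3 exceptions have no .message attribute; str(e) gives the text.
        print(str(e))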

Converting to PDF on Windows 7 reports "wkhtmltopdf reported an error"

  File "crawler.py", line 163, in <module>
    crawler.run()
  File "crawler.py", line 97, in run
    pdfkit.from_file(htmls, self.name + ".pdf", options=options)
  File "C:\Anaconda3\envs\py3-dj\lib\site-packages\pdfkit\api.py", line 49, in from_file
    return r.to_pdf(output_path)
  File "C:\Anaconda3\envs\py3-dj\lib\site-packages\pdfkit\pdfkit.py", line 156, in to_pdf
    raise IOError('wkhtmltopdf reported an error:\n' + stderr)
OSError: wkhtmltopdf reported an error:
Loading pages (1/6)
Warning: Failed to load http://www.liaoxuefeng.comhttp//service.t.sina.com.cn/widget/qmd/1658384301/078cedea/2.png (ignore)
Counting pages (2/6)
Resolving links (4/6)
Loading headers and footers (5/6)
Printing pages (6/6)
Done
Exit with code 1 due to network error: ProtocolUnknownError
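
The failing URL http://www.liaoxuefeng.comhttp//service.t.sina.com.cn/... shows the site domain being prepended to a link that was already absolute, which is the likely source of the ProtocolUnknownError. A minimal sketch of URL rewriting with urljoin, which leaves absolute URLs untouched; the example paths are hypothetical:

    from urllib.parse import urljoin

    base = "http://www.liaoxuefeng.com"
    print(urljoin(base, "/files/pic.png"))
    # -> http://www.liaoxuefeng.com/files/pic.png
    print(urljoin(base, "http://service.t.sina.com.cn/widget/qmd/2.png"))
    # -> http://service.t.sina.com.cn/widget/qmd/2.png (already absolute, left unchanged)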
