zoujiu1 / zhihu_spider_selenium
Crawls the thoughts (想法), articles, and answers from a Zhihu user's profile page
License: MIT License
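Some answers appear in the author's own answer list (when logged in as that account), so their URLs are collected and written to answer.txt while the list is crawled.
When the program later visits such a URL, however, the question the answer belongs to has been deleted entirely (every answer under that question is invisible to other users).
At that point the program cannot obtain the content to write into the md file, and it crashes outright.
Is there a way to handle this, or at least to detect such answers, continue crawling the remaining answers, and collect the affected URLs into a new txt file for the user?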
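One way to get the behaviour requested above is to wrap each per-URL crawl in a try/except and record the failures. The following is a minimal sketch, not the project's code: `crawl_all`, `crawl_one`, and the file name `failed_urls.txt` are hypothetical names, and `crawl_one` stands in for whatever function processes a single answer URL.

```python
from pathlib import Path
from typing import Callable, Iterable, List


def crawl_all(urls: Iterable[str],
              crawl_one: Callable[[str], None],
              failed_path: str = "failed_urls.txt") -> List[str]:
    """Crawl every URL, skipping the ones that raise, and log the failures.

    crawl_one is a hypothetical stand-in for the per-answer crawl step.
    """
    failed = []
    for url in urls:
        try:
            crawl_one(url)
        except Exception as exc:
            # Record the URL and the reason, then move on to the next answer.
            failed.append(f"{url}\t{type(exc).__name__}: {exc}")
    if failed:
        # One failed URL per line, so the user can review them afterwards.
        Path(failed_path).write_text("\n".join(failed) + "\n", encoding="utf-8")
    return failed
```

Catching the broad `Exception` is deliberate here so that a single bad answer never stops the run; a real fix would probably narrow it to the exceptions actually observed.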
Otherwise the default username is used. It works after replacing every occurrence of the username.
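The last entry in the requirements file is missing an equals sign. The numpy version pinned there is also troublesome to install on Python 3.12; in my testing the latest version works as well.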
Thanks to the author for sharing.
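Answers are crawled in order, and when the crawler reaches certain questions the whole program crashes.
Example URL 1: "https://www.zhihu.com/question/614902680/answer/3152426894 金融行业用 AI 做量化交易和高频交易靠谱吗?未来会如何发展 ?" (Is using AI for quantitative and high-frequency trading in finance reliable? How will it develop?)
Example URL 2: "https://www.zhihu.com/question/622572713/answer/3221012170 如何看待某车企的内部规定,要求所有技术人员不能与供应商私自联系 ?" (What do you think of a car maker's internal rule that forbids all technical staff from contacting suppliers privately?)
What sets these questions apart from the others is how their titles end, "私自联系 ?" and "如何发展 ?": there is an extra space before the final question mark. I suspect this is what makes the save path unresolvable.
The command used was python crawler.py --answer --MarkDown.
The output below is from a run with the latest version of the code.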
Traceback (most recent call last):
File "D:\24Python\12_数据抓取\00知乎\240502\git-zhihu_spider_selenium-master\crawler.py", line 1142, in <module>
zhihu()
File "D:\24Python\12_数据抓取\00知乎\240502\git-zhihu_spider_selenium-master\crawler.py", line 1087, in zhihu
crawl_answer_detail(driver)
File "D:\24Python\12_数据抓取\00知乎\240502\git-zhihu_spider_selenium-master\crawler.py", line 954, in crawl_answer_detail
with open(os.path.join(dircrea, nam + ".md"), 'w', encoding='utf-8') as obj:
FileNotFoundError: [Errno 2] No such file or directory: 'D:\\24Python\\12_数据抓取\\00知乎\\240502\\git-zhihu_spider_selenium-master\\answer\\金融行业用 AI 做量化交易和高频交易靠谱吗未来会如何发展 \\金融行业用 AI 做量化交易和.md'
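If the trailing space is indeed the cause, one possible fix is to sanitize the question title before using it as a directory name. The directory in the traceback above ends in a space once the question marks are removed, and Windows does not reliably handle path components that end in a space or a period. The sketch below is an assumption about how this could be done, not the project's code; `safe_name` and its `max_len` parameter are hypothetical.

```python
import re

# Characters that Windows forbids in file and directory names.
_INVALID = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def safe_name(title: str, max_len: int = 100) -> str:
    """Turn a question title into a name that is safe as a path component."""
    name = _INVALID.sub("", title)
    # Also drop the full-width question mark common in Chinese titles.
    name = name.replace("\uff1f", "")
    # Truncate first, then strip, so truncation cannot leave a trailing space.
    name = name[:max_len].rstrip(" .")
    return name or "untitled"
```

For example, `safe_name("未来会如何发展 ?")` yields `"未来会如何发展"`, with no trailing space. The same function would need to be applied both when the directory is created and when the md file path is built, so that the two paths agree.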
WebDriverWait(driver, timeout=10).until(lambda d: d.find_element(By.CLASS_NAME, "AnswerItem-editButtonText"))
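If the answer has no edit button, the program cannot continue. Wrapping this call in a try statement works around the problem; whether that introduces secondary bugs is unknown.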
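The try-statement workaround mentioned above could be factored into a small helper so that optional elements do not abort the run. This is only a sketch: `call_or_default` is a hypothetical helper, and the Selenium usage is shown in comments because it needs a live driver.

```python
from typing import Any, Callable, Tuple, Type, Union


def call_or_default(fn: Callable[[], Any],
                    exceptions: Union[Type[BaseException], Tuple[Type[BaseException], ...]],
                    default: Any = None) -> Any:
    """Call fn(); return default if it raises one of the given exceptions."""
    try:
        return fn()
    except exceptions:
        return default


# Possible use with Selenium (assumes an existing `driver`):
#
# from selenium.common.exceptions import TimeoutException
# from selenium.webdriver.common.by import By
# from selenium.webdriver.support.ui import WebDriverWait
#
# edit_button = call_or_default(
#     lambda: WebDriverWait(driver, timeout=10).until(
#         lambda d: d.find_element(By.CLASS_NAME, "AnswerItem-editButtonText")),
#     TimeoutException,
# )
# if edit_button is None:
#     pass  # the answer has no edit button; continue without it
```

Only the listed exception types are swallowed, so unrelated errors still surface instead of being hidden.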
pages = driver.find_elements(By.CLASS_NAME, 'PaginationButton')[-2]
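If there is only one page, the program cannot continue. A try statement works around this as well; whether that introduces secondary bugs is unknown.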
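Instead of a bare try, the single-page case can be handled explicitly by falling back to one page when there are too few pagination buttons. The sketch below assumes, based on the `[-2]` index in the line above, that the second-to-last button carries the last page number; `total_pages` is a hypothetical helper that works on the button labels only.

```python
from typing import Sequence


def total_pages(button_texts: Sequence[str]) -> int:
    """Return the page count from pagination button labels, or 1 if absent."""
    # With fewer than two buttons, indexing [-2] would raise IndexError.
    if len(button_texts) < 2:
        return 1
    try:
        # Assumption: the second-to-last button shows the last page number.
        return int(button_texts[-2])
    except ValueError:
        return 1


# Possible use with Selenium (assumes an existing `driver`):
#
# buttons = driver.find_elements(By.CLASS_NAME, "PaginationButton")
# pages = total_pages([b.text for b in buttons])
```

`find_elements` returns an empty list when nothing matches, so a profile with a single page of answers ends up with `pages == 1` rather than an exception.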
Error output:
DevTools listening on ws://127.0.0.1:9922/devtools/browser/8b5cd6db-98dc-4859-a19b-586646e5eccd
[25540:10460:0430/152431.589:ERROR:fallback_task_provider.cc(127)] Every renderer should have at least one task provided by a primary task provider. If a "Renderer" fallback task is shown, it is a bug. If you have repro steps, please file a new bug and tag it as a dependency of crbug.com/739782.
[25540:10460:0430/152438.872:ERROR:fallback_task_provider.cc(127)] Every renderer should have at least one task provided by a primary task provider. If a "Renderer" fallback task is shown, it is a bug. If you have repro steps, please file a new bug and tag it as a dependency of crbug.com/739782.
Traceback (most recent call last):
File "D:\24Python\zhihu_spider_selenium-master\crawler.py", line 1117, in <module>
zhihu()
File "D:\24Python\zhihu_spider_selenium-master\crawler.py", line 1053, in zhihu
crawl_answers_links(driver, username)
File "D:\24Python\zhihu_spider_selenium-master\crawler.py", line 177, in crawl_answers_links
WebDriverWait(driver, timeout=10).until(lambda d: d.find_element(By.CLASS_NAME, "Pagination"))
File "S:\condaenv\getdata310new\lib\site-packages\selenium\webdriver\support\wait.py", line 95, in until
raise TimeoutException(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message:
Stacktrace:
GetHandleVerifier [0x00007FF6D98FD8E2+35890]
Microsoft::Applications::Events::FromJSON [0x00007FF6D98BACC2+1330002]
Microsoft::Applications::Events::ILogManager::operator= [0x00007FF6D96AE137+5095]
Microsoft::Applications::Events::GUID_t::GUID_t [0x00007FF6D96F4E7E+159950]
Microsoft::Applications::Events::GUID_t::GUID_t [0x00007FF6D96F4F66+160182]
Microsoft::Applications::Events::GUID_t::GUID_t [0x00007FF6D972FEF7+401735]
Microsoft::Applications::Events::GUID_t::GUID_t [0x00007FF6D971474F+289183]
Microsoft::Applications::Events::GUID_t::GUID_t [0x00007FF6D96EA6C7+117015]
Microsoft::Applications::Events::GUID_t::GUID_t [0x00007FF6D972DAF1+392513]
Microsoft::Applications::Events::GUID_t::GUID_t [0x00007FF6D9714373+288195]
Microsoft::Applications::Events::GUID_t::GUID_t [0x00007FF6D96E9BEE+114238]
Microsoft::Applications::Events::GUID_t::GUID_t [0x00007FF6D96E8DAC+110588]
Microsoft::Applications::Events::GUID_t::GUID_t [0x00007FF6D96E97A1+113137]
GetHandleVerifier [0x00007FF6D99939F4+650564]
Microsoft::Applications::Events::FromJSON [0x00007FF6D97899BC+79948]
Microsoft::Applications::Events::FromJSON [0x00007FF6D9862D4C+969692]
Microsoft::Applications::Events::FromJSON [0x00007FF6D985B485+938773]
GetHandleVerifier [0x00007FF6D99929B5+646405]
Microsoft::Applications::Events::FromJSON [0x00007FF6D98C2E81+1363217]
Microsoft::Applications::Events::FromJSON [0x00007FF6D98BE4F4+1344388]
Microsoft::Applications::Events::FromJSON [0x00007FF6D98BE62B+1344699]
Microsoft::Applications::Events::FromJSON [0x00007FF6D98B5B21+1309105]
BaseThreadInitThunk [0x00007FF970E5257D+29]
RtlUserThreadStart [0x00007FF971E6AA58+40]
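There is not much worth crawling on my own profile. I would like to crawl the content of a specific user who posts high-quality material. Is that possible?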
output of crawler.py:
https://paste.rs/ADal9.py3
OS: Win 10 22H2 build19045.3324
Edge Version: Version 116.0.1938.62 (Official build) (64-bit)
Selenium Version: 4.11.2