zhihu_spider_selenium's Issues

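Error on the “没有知识的荒原” ("wasteland without knowledge") page

There is one kind of answer that, when the crawler walks my own answer list and collects URLs into answer.txt, is still visible on the author's profile page (while logged in with my own account), so its URL is collected.

When the program later runs against that URL, however, the question the answer belongs to has been deleted entirely (all answers under that question are hidden from other users).

Because the content for the md file cannot be retrieved, the program crashes outright.

Is there a way to handle this, that is, to detect such answers, continue crawling the remaining ones, and collect the URLs of the affected answers in a new txt file to report back to the user?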
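One possible approach, sketched here under stated assumptions: wrap the per-answer step in a try/except, record the URL of every answer that fails, and write those URLs to a separate text file at the end. The names `crawl_all`, `fetch_answer`, and `failed_answers.txt` are hypothetical and are not part of crawler.py; `fetch_answer` stands in for whatever the crawler does with a single answer URL.

```python
from pathlib import Path


def crawl_all(urls, fetch_answer, failed_path="failed_answers.txt"):
    """Call fetch_answer(url) for every URL and keep going on failure.

    Returns the list of URLs that failed. When any exist, they are also
    written to failed_path, one per line.
    """
    failed = []
    for url in urls:
        try:
            fetch_answer(url)
        except Exception as exc:  # broad on purpose: one bad page must not stop the run
            print(f"skipped {url}: {exc!r}")
            failed.append(url)
    if failed:
        Path(failed_path).write_text("\n".join(failed) + "\n", encoding="utf-8")
    return failed
```

Catching `Exception` broadly trades precision for robustness; narrowing it to the specific error raised on deleted questions would avoid hiding unrelated bugs.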

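The username in crawler.py needs to be changed

Otherwise the default username is used. After replacing every occurrence of the username, the program works.

The last entry in Requirements is missing an equals sign. That numpy version is also troublesome to install on Python 3.12; in my testing the latest version works as well.

Thanks to the author for sharing.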

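Questions whose titles contain a space crash the download because of the file-naming rule

Answers are crawled in order, and when the crawler reaches certain questions the whole program crashes.
Example URL 1: “https://www.zhihu.com/question/614902680/answer/3152426894 金融行业用 AI 做量化交易和高频交易靠谱吗?未来会如何发展 ?”
Example URL 2: “https://www.zhihu.com/question/622572713/answer/3221012170 如何看待某车企的内部规定,要求所有技术人员不能与供应商私自联系 ?”
The main difference between these answers and others lies in “私自联系 ?” and “如何发展 ?”: there is an extra space before the final question mark. I suspect this is what prevents the save path from being resolved.

Command run: python crawler.py --answer --MarkDown
The output below is from a run with the latest version of the code.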

Traceback (most recent call last):
  File "D:\24Python\12_数据抓取\00知乎\240502\git-zhihu_spider_selenium-master\crawler.py", line 1142, in <module>
    zhihu()
  File "D:\24Python\12_数据抓取\00知乎\240502\git-zhihu_spider_selenium-master\crawler.py", line 1087, in zhihu
    crawl_answer_detail(driver)
  File "D:\24Python\12_数据抓取\00知乎\240502\git-zhihu_spider_selenium-master\crawler.py", line 954, in crawl_answer_detail
    with open(os.path.join(dircrea, nam + ".md"), 'w', encoding='utf-8') as obj:
FileNotFoundError: [Errno 2] No such file or directory: 'D:\\24Python\\12_数据抓取\\00知乎\\240502\\git-zhihu_spider_selenium-master\\answer\\金融行业用 AI 做量化交易和高频交易靠谱吗未来会如何发展 \\金融行业用 AI 做量化交易和.md'
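If the trailing space is indeed the cause, a likely explanation is that Windows drops trailing spaces and dots at the end of a path when the directory is created, so the directory on disk and the path later passed to `open` no longer match. Normalizing the title before building any path should avoid that. A minimal sketch follows; `safe_name` and its `max_len` parameter are hypothetical and are not taken from crawler.py.

```python
import re

# Characters Windows forbids in file and directory names.
_INVALID = re.compile(r'[\\/:*?"<>|]')


def safe_name(title, max_len=100):
    """Return a title that is safe to use as a file or directory name."""
    name = _INVALID.sub("", title)
    name = re.sub(r"\s+", " ", name)         # collapse runs of whitespace
    name = name[:max_len]                    # truncate before trimming the end
    name = name.strip().rstrip(". ")         # no edge spaces, no trailing dots
    return name or "untitled"                # never return an empty name
```

Truncation happens before the final trim so that a cut landing right after a space cannot reintroduce a trailing space.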

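Bug report

WebDriverWait(driver, timeout=10).until(lambda d: d.find_element(By.CLASS_NAME, "AnswerItem-editButtonText"))
If the answer has no edit button, the crawler cannot continue. Wrapping the call in a try statement mitigates this; whether that causes follow-on bugs is unknown.
pages = driver.find_elements(By.CLASS_NAME, 'PaginationButton')[-2]
If there is only one page, the crawler cannot continue. Wrapping the call in a try statement mitigates this; whether that causes follow-on bugs is unknown.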
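The two try-based mitigations could look roughly like the sketch below. Both helpers are hypothetical and free of Selenium imports so they can be read in isolation. `try_wait` returns `None` instead of raising when an expected error occurs, and `total_pages` falls back to a single page when fewer than two pagination buttons exist. It is an assumption that the second-to-last `PaginationButton` carries the last page number as its text.

```python
def try_wait(wait_call, expected_errors):
    """Run wait_call(); return its result, or None if an expected error is raised."""
    try:
        return wait_call()
    except expected_errors:
        return None


def total_pages(pagination_buttons):
    """Return the page count from a list of pagination button elements.

    Assumes the second-to-last button's .text is the last page number.
    With fewer than two buttons (no pagination rendered) there is one page.
    """
    if len(pagination_buttons) < 2:
        return 1
    try:
        return int(pagination_buttons[-2].text)
    except ValueError:
        return 1  # text was not a number; treat as a single page
```

In the crawler these might be called as `try_wait(lambda: WebDriverWait(driver, timeout=10).until(...), (TimeoutException,))` and `total_pages(driver.find_elements(By.CLASS_NAME, 'PaginationButton'))`, where `TimeoutException` comes from `selenium.common.exceptions`.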

Error when crawling answers; articles and thoughts (想法) can be crawled

Error output

DevTools listening on ws://127.0.0.1:9922/devtools/browser/8b5cd6db-98dc-4859-a19b-586646e5eccd
[25540:10460:0430/152431.589:ERROR:fallback_task_provider.cc(127)] Every renderer should have at least one task provided by a primary task provider. If a "Renderer" fallback task is shown, it is a bug. If you have repro steps, please file a new bug and tag it as a dependency of crbug.com/739782.
[25540:10460:0430/152438.872:ERROR:fallback_task_provider.cc(127)] Every renderer should have at least one task provided by a primary task provider. If a "Renderer" fallback task is shown, it is a bug. If you have repro steps, please file a new bug and tag it as a dependency of crbug.com/739782.
Traceback (most recent call last):
  File "D:\24Python\zhihu_spider_selenium-master\crawler.py", line 1117, in <module>
    zhihu()
  File "D:\24Python\zhihu_spider_selenium-master\crawler.py", line 1053, in zhihu
    crawl_answers_links(driver, username)
  File "D:\24Python\zhihu_spider_selenium-master\crawler.py", line 177, in crawl_answers_links
    WebDriverWait(driver, timeout=10).until(lambda d: d.find_element(By.CLASS_NAME, "Pagination"))
  File "S:\condaenv\getdata310new\lib\site-packages\selenium\webdriver\support\wait.py", line 95, in until
    raise TimeoutException(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message:
Stacktrace:
        GetHandleVerifier [0x00007FF6D98FD8E2+35890]
        Microsoft::Applications::Events::FromJSON [0x00007FF6D98BACC2+1330002]
        Microsoft::Applications::Events::ILogManager::operator= [0x00007FF6D96AE137+5095]
        Microsoft::Applications::Events::GUID_t::GUID_t [0x00007FF6D96F4E7E+159950]
        Microsoft::Applications::Events::GUID_t::GUID_t [0x00007FF6D96F4F66+160182]
        Microsoft::Applications::Events::GUID_t::GUID_t [0x00007FF6D972FEF7+401735]
        Microsoft::Applications::Events::GUID_t::GUID_t [0x00007FF6D971474F+289183]
        Microsoft::Applications::Events::GUID_t::GUID_t [0x00007FF6D96EA6C7+117015]
        Microsoft::Applications::Events::GUID_t::GUID_t [0x00007FF6D972DAF1+392513]
        Microsoft::Applications::Events::GUID_t::GUID_t [0x00007FF6D9714373+288195]
        Microsoft::Applications::Events::GUID_t::GUID_t [0x00007FF6D96E9BEE+114238]
        Microsoft::Applications::Events::GUID_t::GUID_t [0x00007FF6D96E8DAC+110588]
        Microsoft::Applications::Events::GUID_t::GUID_t [0x00007FF6D96E97A1+113137]
        GetHandleVerifier [0x00007FF6D99939F4+650564]
        Microsoft::Applications::Events::FromJSON [0x00007FF6D97899BC+79948]
        Microsoft::Applications::Events::FromJSON [0x00007FF6D9862D4C+969692]
        Microsoft::Applications::Events::FromJSON [0x00007FF6D985B485+938773]
        GetHandleVerifier [0x00007FF6D99929B5+646405]
        Microsoft::Applications::Events::FromJSON [0x00007FF6D98C2E81+1363217]
        Microsoft::Applications::Events::FromJSON [0x00007FF6D98BE4F4+1344388]
        Microsoft::Applications::Events::FromJSON [0x00007FF6D98BE62B+1344699]
        Microsoft::Applications::Events::FromJSON [0x00007FF6D98B5B21+1309105]
        BaseThreadInitThunk [0x00007FF970E5257D+29]
        RtlUserThreadStart [0x00007FF971E6AA58+40]
