mediacrawler's Issues

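Video crawling

Hello, and thank you for open-sourcing this code. I see that it currently crawls comments and notes. If I want to crawl videos from Xiaohongshu or Douyin, which part of the code handles that? Is it already implemented?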

playwright._impl._api_types.Error: Browser closed.

  File "e:\miniconda3\Lib\site-packages\playwright\async_api\_generated.py", line 14727, in launch_persistent_context
    await self._impl_obj.launch_persistent_context(
  File "e:\miniconda3\Lib\site-packages\playwright\_impl\_browser_type.py", line 155, in launch_persistent_context
    from_channel(await self._channel.send("launchPersistentContext", params)),
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "e:\miniconda3\Lib\site-packages\playwright\_impl\_connection.py", line 61, in send
    return await self._connection.wrap_api_call(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "e:\miniconda3\Lib\site-packages\playwright\_impl\_connection.py", line 461, in wrap_api_call
    return await cb()
           ^^^^^^^^^^
  File "e:\miniconda3\Lib\site-packages\playwright\_impl\_connection.py", line 96, in inner_send
    result = next(iter(done)).result()
             ^^^^^^^^^^^^^^^^^^^^^^^^^
playwright._impl._api_types.Error: Browser closed.
==================== Browser output: ====================
C:\Users\xiazhiqiang\AppData\Local\ms-playwright\chromium-1060\chrome-win\chrome.exe --disable-field-trial-config --disable-background-networking --enable-features=NetworkService,NetworkServiceInProcess --disable-background-timer-throttling --disable-backgrounding-occluded-windows --disable-back-forward-cache --disable-breakpad --disable-client-side-phishing-detection --disable-component-extensions-with-background-pages --disable-component-update --no-default-browser-check --disable-default-apps --disable-dev-shm-usage --disable-extensions --disable-features=ImprovedCookieControls,LazyFrameLoading,GlobalMediaControls,DestroyProfileOnBrowserClose,MediaRouter,DialMediaRouteProvider,AcceptCHFrame,AutoExpandDetailsElement,CertificateTransparencyComponentUpdater,AvoidUnnecessaryBeforeUnloadCheckSync,Translate --allow-pre-commit-input --disable-hang-monitor --disable-ipc-flooding-protection --disable-popup-blocking --disable-prompt-on-repost --disable-renderer-backgrounding --force-color-profile=srgb --metrics-recording-only --no-first-run --enable-automation --password-store=basic --use-mock-keychain --no-service-autorun --export-tagged-pdf --headless --hide-scrollbars --mute-audio --blink-settings=primaryHoverType=2,availableHoverTypes=2,primaryPointerType=4,availablePointerTypes=4 --no-sandbox --user-data-dir=C:\Users\xiazhiqiang\Desktop\MediaCrawler-main\browser_data\xhs_user_data_dir --remote-debugging-pipe about:blank
pid=6572
[pid=6572]
[pid=6572] starting temporary directories cleanup
=========================== logs ===========================
C:\Users\xiazhiqiang\AppData\Local\ms-playwright\chromium-1060\chrome-win\chrome.exe --disable-field-trial-config --disable-background-networking --enable-features=NetworkService,NetworkServiceInProcess --disable-background-timer-throttling --disable-backgrounding-occluded-windows --disable-back-forward-cache --disable-breakpad --disable-client-side-phishing-detection --disable-component-extensions-with-background-pages --disable-component-update --no-default-browser-check --disable-default-apps --disable-dev-shm-usage --disable-extensions --disable-features=ImprovedCookieControls,LazyFrameLoading,GlobalMediaControls,DestroyProfileOnBrowserClose,MediaRouter,DialMediaRouteProvider,AcceptCHFrame,AutoExpandDetailsElement,CertificateTransparencyComponentUpdater,AvoidUnnecessaryBeforeUnloadCheckSync,Translate --allow-pre-commit-input --disable-hang-monitor --disable-ipc-flooding-protection --disable-popup-blocking --disable-prompt-on-repost --disable-renderer-backgrounding --force-color-profile=srgb --metrics-recording-only --no-first-run --enable-automation --password-store=basic --use-mock-keychain --no-service-autorun --export-tagged-pdf --headless --hide-scrollbars --mute-audio --blink-settings=primaryHoverType=2,availableHoverTypes=2,primaryPointerType=4,availablePointerTypes=4 --no-sandbox --user-data-dir=C:\Users\xiazhiqiang\Desktop\MediaCrawler-main\browser_data\xhs_user_data_dir --remote-debugging-pipe about:blank
pid=6572
[pid=6572]
[pid=6572] starting temporary directories cleanup

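Crawling runs normally, but not a single comment is collected

The error message is: MediaCrawler ERROR aweme_id: 7266050530072481076 get comments failed, error: Expecting value: line 1 column 1 (char 0),
Even with Douyin's anti-crawling mechanism in place, I would expect at least one or two records, but there are none at all. Below is my modified code for saving to a local CSV: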
import json
from typing import Dict, List

from tortoise import fields
from tortoise.models import Model
import os
import config
from tools import utils
import pandas as pd

class DouyinBaseModel(Model):
    id = fields.IntField(pk=True, autoincrement=True, description="Auto-increment ID")
    user_id = fields.CharField(null=True, max_length=64, description="User ID")
    sec_uid = fields.CharField(null=True, max_length=128, description="User sec_uid")
    short_user_id = fields.CharField(null=True, max_length=64, description="User short ID")
    user_unique_id = fields.CharField(null=True, max_length=64, description="User unique ID")
    nickname = fields.CharField(null=True, max_length=64, description="User nickname")
    avatar = fields.CharField(null=True, max_length=255, description="User avatar URL")
    user_signature = fields.CharField(null=True, max_length=500, description="User signature")
    ip_location = fields.CharField(null=True, max_length=255, description="IP location at comment time")
    add_ts = fields.BigIntField(description="Record creation timestamp")
    last_modify_ts = fields.BigIntField(description="Record last-modified timestamp")

    class Meta:
        abstract = True


class DouyinAweme(DouyinBaseModel):
    aweme_id = fields.CharField(max_length=64, index=True, description="Video ID")
    aweme_type = fields.CharField(max_length=16, description="Video type")
    title = fields.CharField(null=True, max_length=500, description="Video title")
    desc = fields.TextField(null=True, description="Video description")
    create_time = fields.BigIntField(description="Video publish timestamp", index=True)
    liked_count = fields.CharField(null=True, max_length=16, description="Video like count")
    comment_count = fields.CharField(null=True, max_length=16, description="Video comment count")
    share_count = fields.CharField(null=True, max_length=16, description="Video share count")
    collected_count = fields.CharField(null=True, max_length=16, description="Video favorite count")

    class Meta:
        table = "douyin_aweme"
        table_description = "Douyin video"

    def __str__(self):
        return f"{self.aweme_id} - {self.title}"

def save_data_to_excel(data: Dict, sheet_name: str):
    # Raw string so the backslash in the Windows path is not read as an escape
    file_path = r'D:\douyin.xlsx'
    df_new = pd.DataFrame([data])
    if not os.path.exists(file_path):
        # First write: create the workbook with this row included
        df_new.to_excel(file_path, sheet_name=sheet_name, index=False, engine='openpyxl')
        return
    # Load every sheet so that rewriting the file keeps the other sheet
    sheets = pd.read_excel(file_path, sheet_name=None, engine='openpyxl')
    df_old = sheets.get(sheet_name, pd.DataFrame())
    # Use pd.concat instead of the append method
    sheets[sheet_name] = pd.concat([df_old, df_new], ignore_index=True)
    with pd.ExcelWriter(file_path, engine='openpyxl') as writer:
        for name, df in sheets.items():
            df.to_excel(writer, sheet_name=name, index=False)


async def save_aweme_to_excel(aweme_data: Dict):
    save_data_to_excel(aweme_data, "aweme")


async def save_comment_to_excel(comment_data: Dict):
    save_data_to_excel(comment_data, "comments")

# Note: the two definitions below come later in the module, so they override
# the wrappers of the same name defined above.
async def save_aweme_to_excel(aweme_data: Dict):
    file_path = r'D:\douyin.xlsx'
    if not os.path.exists(file_path):
        # Start from an empty frame instead of raising, so the first row can be written
        df = pd.DataFrame(columns=list(aweme_data.keys()))
    else:
        df = pd.read_excel(file_path, sheet_name='aweme', engine='openpyxl')
    # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
    df = pd.concat([df, pd.DataFrame([aweme_data])], ignore_index=True)
    # to_excel with a path rewrites the whole workbook, so any other sheet is dropped
    df.to_excel(file_path, sheet_name='aweme', index=False, engine='openpyxl')


async def save_comment_to_excel(comment_data: Dict):
    file_path = r'D:\douyin.xlsx'
    if not os.path.exists(file_path):
        # Start from an empty frame instead of raising, so the first row can be written
        df = pd.DataFrame(columns=list(comment_data.keys()))
    else:
        df = pd.read_excel(file_path, sheet_name='comments', engine='openpyxl')
    # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
    df = pd.concat([df, pd.DataFrame([comment_data])], ignore_index=True)
    # to_excel with a path rewrites the whole workbook, so any other sheet is dropped
    df.to_excel(file_path, sheet_name='comments', index=False, engine='openpyxl')

class DouyinAwemeComment(DouyinBaseModel):
    comment_id = fields.CharField(max_length=64, index=True, description="Comment ID")
    aweme_id = fields.CharField(max_length=64, index=True, description="Video ID")
    content = fields.TextField(null=True, description="Comment content")
    create_time = fields.BigIntField(description="Comment timestamp")
    sub_comment_count = fields.CharField(max_length=16, description="Comment reply count")

    class Meta:
        table = "douyin_aweme_comment"
        table_description = "Douyin video comment"

    def __str__(self):
        return f"{self.comment_id} - {self.content}"

async def update_douyin_aweme(aweme_item: Dict):
    aweme_id = aweme_item.get("aweme_id")
    user_info = aweme_item.get("author", {})
    interact_info = aweme_item.get("statistics", {})
    local_db_item = {
        "aweme_id": aweme_id,
        "aweme_type": aweme_item.get("aweme_type"),
        "title": aweme_item.get("desc", ""),
        "desc": aweme_item.get("desc", ""),
        "create_time": aweme_item.get("create_time"),
        "user_id": user_info.get("uid"),
        "sec_uid": user_info.get("sec_uid"),
        "short_user_id": user_info.get("short_id"),
        "user_unique_id": user_info.get("unique_id"),
        "user_signature": user_info.get("signature"),
        "nickname": user_info.get("nickname"),
        "avatar": user_info.get("avatar_thumb", {}).get("url_list", [""])[0],
        "liked_count": interact_info.get("digg_count"),
        "collected_count": interact_info.get("collect_count"),
        "comment_count": interact_info.get("comment_count"),
        "share_count": interact_info.get("share_count"),
        "ip_location": aweme_item.get("ip_label", ""),
        "last_modify_ts": utils.get_current_timestamp(),
    }
    print(f"douyin aweme id:{aweme_id}, title:{local_db_item.get('title')}")
    if config.IS_SAVED_DATABASED:
        if not await DouyinAweme.filter(aweme_id=aweme_id).exists():
            local_db_item["add_ts"] = utils.get_current_timestamp()
            await DouyinAweme.create(**local_db_item)
        else:
            await DouyinAweme.filter(aweme_id=aweme_id).update(**local_db_item)
    else:
        await save_aweme_to_excel(local_db_item)

async def batch_update_dy_aweme_comments(aweme_id: str, comments: List[Dict]):
    if not comments:
        return
    for comment_item in comments:
        await update_dy_aweme_comment(aweme_id, comment_item)


async def update_dy_aweme_comment(aweme_id: str, comment_item: Dict):
    comment_aweme_id = comment_item.get("aweme_id")
    if aweme_id != comment_aweme_id:
        print(f"comment_aweme_id: {comment_aweme_id} != aweme_id: {aweme_id}")
        return
    user_info = comment_item.get("user", {})
    comment_id = comment_item.get("cid")
    avatar_info = user_info.get("avatar_medium", {}) or user_info.get("avatar_300x300", {}) or user_info.get(
        "avatar_168x168", {}) or user_info.get("avatar_thumb", {}) or {}
    local_db_item = {
        "comment_id": comment_id,
        "create_time": comment_item.get("create_time"),
        "ip_location": comment_item.get("ip_label", ""),
        "aweme_id": aweme_id,
        "content": comment_item.get("text"),
        "content_extra": json.dumps(comment_item.get("text_extra", [])),
        "user_id": user_info.get("uid"),
        "sec_uid": user_info.get("sec_uid"),
        "short_user_id": user_info.get("short_id"),
        "user_unique_id": user_info.get("unique_id"),
        "user_signature": user_info.get("signature"),
        "nickname": user_info.get("nickname"),
        "avatar": avatar_info.get("url_list", [""])[0],
        "sub_comment_count": comment_item.get("reply_comment_total", 0),
        "last_modify_ts": utils.get_current_timestamp(),
    }
    print(f"douyin aweme comment: {comment_id}, content: {local_db_item.get('content')}")
    if config.IS_SAVED_DATABASED:
        if not await DouyinAwemeComment.filter(comment_id=comment_id).exists():
            local_db_item["add_ts"] = utils.get_current_timestamp()
            await DouyinAwemeComment.create(**local_db_item)
        else:
            await DouyinAwemeComment.filter(comment_id=comment_id).update(**local_db_item)
    else:
        await save_comment_to_excel(local_db_item)
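
For context on the error quoted in this report: `Expecting value: line 1 column 1 (char 0)` is the message Python's `json` module produces when it is asked to decode an empty string, or a body that does not start with a JSON value (an HTML page, for instance). A minimal sketch of guarding the decode step follows. `parse_json_body` is a hypothetical helper, not part of MediaCrawler, and whether the comment endpoint really returned an empty body in this case is an assumption.

```python
import json
from typing import Any, Optional


def parse_json_body(body: str) -> Optional[Any]:
    """Decode a response body, returning None when it is not valid JSON."""
    if not body.strip():
        # An empty body is what triggers "Expecting value: line 1 column 1 (char 0)"
        return None
    try:
        return json.loads(body)
    except json.JSONDecodeError as exc:
        # Report the decode failure instead of letting it abort the crawl
        print(f"response is not JSON: {exc}")
        return None


print(parse_json_body(""))
print(parse_json_body('{"comments": []}'))
```

Logging the raw body when the helper returns None would show whether the server sent nothing at all or sent a non-JSON page.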

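Xiaohongshu feature feedback

Could you add a feature that crawls all notes from a specific user's profile page only?
Also, it seems that the comments under a note are not crawled and saved in full.
That said, the project works really well. Of everything I have tried, it is the only one that actually works.

dy login fails, QR code does not pop up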

@NanmiCoder @tanpenggood

C:\Users\caps\.vitualenvs\crawler\Scripts\python.exe main.py --platform dy --lt qrcode 
2023-07-26  22:58:50 MediaCrawler ERROR login dialog box does not pop up automatically, error: Timeout 10000ms exceeded.
=========================== logs ===========================
waiting for locator("xpath=//div[@id='login-pannel']") to be visible
============================================================ 
2023-07-26  22:58:50 MediaCrawler INFO login dialog box does not pop up automatically, we will manually click the login button 
Traceback (most recent call last):
  File "C:\Users\caps\PycharmProjects\MediaCrawler\media_platform\douyin\login.py", line 90, in popup_login_dialog
    await self.context_page.wait_for_selector(dialog_selector, timeout=1000 * 10)
  File "C:\Users\caps\.vitualenvs\crawler\lib\site-packages\playwright\async_api\_generated.py", line 8266, in wait_for_selector
    await self._impl_obj.wait_for_selector(
  File "C:\Users\caps\.vitualenvs\crawler\lib\site-packages\playwright\_impl\_page.py", line 368, in wait_for_selector
    return await self._main_frame.wait_for_selector(**locals_to_params(locals()))
  File "C:\Users\caps\.vitualenvs\crawler\lib\site-packages\playwright\_impl\_frame.py", line 322, in wait_for_selector
    await self._channel.send("waitForSelector", locals_to_params(locals()))
  File "C:\Users\caps\.vitualenvs\crawler\lib\site-packages\playwright\_impl\_connection.py", line 61, in send
    return await self._connection.wrap_api_call(
  File "C:\Users\caps\.vitualenvs\crawler\lib\site-packages\playwright\_impl\_connection.py", line 461, in wrap_api_call
    return await cb()
  File "C:\Users\caps\.vitualenvs\crawler\lib\site-packages\playwright\_impl\_connection.py", line 96, in inner_send
    result = next(iter(done)).result()
playwright._impl._api_types.TimeoutError: Timeout 10000ms exceeded.
=========================== logs ===========================
waiting for locator("xpath=//div[@id='login-pannel']") to be visible
============================================================

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\caps\PycharmProjects\MediaCrawler\main.py", line 47, in <module>
    asyncio.run(main())
  File "C:\Users\caps\AppData\Local\Programs\Python\Python310\lib\asyncio\runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "C:\Users\caps\AppData\Local\Programs\Python\Python310\lib\asyncio\base_events.py", line 649, in run_until_complete
    return future.result()
  File "C:\Users\caps\PycharmProjects\MediaCrawler\main.py", line 42, in main
    await crawler.start()
  File "C:\Users\caps\PycharmProjects\MediaCrawler\media_platform\douyin\core.py", line 62, in start
    await login_obj.begin()
  File "C:\Users\caps\PycharmProjects\MediaCrawler\media_platform\douyin\login.py", line 45, in begin
    await self.popup_login_dialog()
  File "C:\Users\caps\PycharmProjects\MediaCrawler\media_platform\douyin\login.py", line 95, in popup_login_dialog
    await login_button_ele.click()
  File "C:\Users\caps\.vitualenvs\crawler\lib\site-packages\playwright\async_api\_generated.py", line 15419, in click
    await self._impl_obj.click(
  File "C:\Users\caps\.vitualenvs\crawler\lib\site-packages\playwright\_impl\_locator.py", line 160, in click
    return await self._frame.click(self._selector, strict=True, **params)
  File "C:\Users\caps\.vitualenvs\crawler\lib\site-packages\playwright\_impl\_frame.py", line 489, in click
    await self._channel.send("click", locals_to_params(locals()))
  File "C:\Users\caps\.vitualenvs\crawler\lib\site-packages\playwright\_impl\_connection.py", line 61, in send
    return await self._connection.wrap_api_call(
  File "C:\Users\caps\.vitualenvs\crawler\lib\site-packages\playwright\_impl\_connection.py", line 461, in wrap_api_call
    return await cb()
  File "C:\Users\caps\.vitualenvs\crawler\lib\site-packages\playwright\_impl\_connection.py", line 96, in inner_send
    result = next(iter(done)).result()
playwright._impl._api_types.TimeoutError: Timeout 30000ms exceeded.
=========================== logs ===========================
waiting for locator("xpath=//p[text() = '登录']")
  locator resolved to <p class="lqiPv8cB">登录</p>
attempting click action
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #1
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #2
  waiting 20ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #3
  waiting 100ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #4
  waiting 100ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #5
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #6
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #7
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #8
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #9
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #10
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #11
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #12
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #13
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #14
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #15
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #16
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #17
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #18
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #19
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #20
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #21
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #22
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #23
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #24
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #25
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #26
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #27
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #28
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #29
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #30
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #31
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #32
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #33
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #34
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #35
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #36
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #37
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #38
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #39
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #40
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #41
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #42
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #43
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #44
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #45
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #46
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #47
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #48
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #49
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #50
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #51
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #52
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #53
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #54
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #55
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #56
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #57
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #58
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #59
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #60
  waiting 500ms
============================================================

Process finished with exit code 1
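
The call log above shows the click being retried because another element covered the button on every attempt. A small sketch for pulling that information out of a long Playwright call log, using only the standard library, is given here. `intercepting_elements` is a hypothetical helper, and the sample lines are copied from the log above.

```python
import re
from collections import Counter
from typing import List

# Matches call-log lines such as:
#   <div id="captcha_container">…</div> intercepts pointer events
INTERCEPT_RE = re.compile(r"^\s*(<.+>) intercepts pointer events\s*$")


def intercepting_elements(log_lines: List[str]) -> Counter:
    """Count how often each element is reported as blocking the click."""
    counts: Counter = Counter()
    for line in log_lines:
        match = INTERCEPT_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample = [
    "attempting click action",
    '  <div id="captcha_container">…</div> intercepts pointer events',
    "retrying click action, attempt #1",
    '  <div id="captcha_container">…</div> intercepts pointer events',
]
print(intercepting_elements(sample).most_common(1))
```

In the log above every attempt is blocked by the same `captcha_container` element, which suggests that a captcha overlay was covering the page while the script was trying to click the login button.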

Help me: error on project startup

software version
OS mac
python 3.11
$ python main.py --platform xhs --lt qrcode
Traceback (most recent call last):
  File "/Users/xxx/tp-code/MediaCrawler/main.py", line 8, in <module>
    from media_platform.douyin import DouYinCrawler
  File "/Users/xxx/tp-code/MediaCrawler/media_platform/douyin/__init__.py", line 1, in <module>
    from .core import DouYinCrawler
  File "/Users/xxx/tp-code/MediaCrawler/media_platform/douyin/core.py", line 17, in <module>
    from .login import DouYinLogin
  File "/Users/xxx/tp-code/MediaCrawler/media_platform/douyin/login.py", line 6, in <module>
    import aioredis
  File "/Users/xxx/miniconda3/lib/python3.11/site-packages/aioredis/__init__.py", line 1, in <module>
    from aioredis.client import Redis, StrictRedis
  File "/Users/xxx/miniconda3/lib/python3.11/site-packages/aioredis/client.py", line 32, in <module>
    from aioredis.connection import (
  File "/Users/xxx/miniconda3/lib/python3.11/site-packages/aioredis/connection.py", line 33, in <module>
    from .exceptions import (
  File "/Users/xxx/miniconda3/lib/python3.11/site-packages/aioredis/exceptions.py", line 14, in <module>
    class TimeoutError(asyncio.TimeoutError, builtins.TimeoutError, RedisError):
TypeError: duplicate base class TimeoutError
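
The `TypeError` above is consistent with a change in Python 3.11, where `asyncio.TimeoutError` became an alias of the built-in `TimeoutError`, so a class that lists both as bases names the same base twice. The following is a minimal reproduction using only the standard library. `DemoTimeoutError` is a made-up stand-in for the class in the traceback, not the library's code.

```python
import asyncio
import builtins


def try_define_timeout_error():
    """Try to create a class with both TimeoutError types as bases.

    Returns None on success, or the error message if class creation fails.
    """
    try:
        # DemoTimeoutError is a hypothetical stand-in for the aioredis class.
        type("DemoTimeoutError", (asyncio.TimeoutError, builtins.TimeoutError), {})
    except TypeError as exc:
        return str(exc)
    return None


same_class = asyncio.TimeoutError is builtins.TimeoutError
error = try_define_timeout_error()
print(f"alias: {same_class}, class creation error: {error!r}")
```

A commonly suggested workaround, offered here as an assumption rather than the project's documented fix, is to switch to the `redis.asyncio` module that ships with redis-py 4.2 and later, or to run on Python 3.10.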

Data output stalls

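After running, I found that only about 200+ records are output and then output stops. Notes and comments together add up to 200+. Is this a limit imposed by Xiaohongshu, or is there some other cause?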

Line 142 of douyin/client.py causes an error in get_video_by_id

async def get_video_by_id(self, aweme_id: str):
    """
    DouYin Video Detail API
    :param aweme_id:
    :return:
    """
    params = {
        "aweme_id": aweme_id
    }
    headers = copy.copy(self.headers)
    headers["Cookie"] = "s_v_web_id=verify_leytkxgn_kvO5kOmO_SdMs_4t1o_B5ml_BUqtWM1mP6BF;"
    del headers["Origin"]
    return await self.get("/aweme/v1/web/aweme/detail/", params, headers)

Can cookies be used to avoid logging in repeatedly?

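Hello. When crawling Xiaohongshu, I logged in with the QR code the first time and obtained the cookies. I then tried to use those cookies to skip the login, but it failed. Is there a problem with how I did it, does a login-free context need more than just cookies, or is there some other cause?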

Can the crawled data be exported directly to CSV? I ran into quite a few problems when importing it into the database

Here are some of the problems I ran into:

asyncmy.errors.OperationalError: (1054, "Unknown column 'nickname' in 'field list'")

tortoise.exceptions.OperationalError: (1054, "Unknown column 'add_ts' in 'field list'")
………………some fields are missing (I added some of them in the SQL)

tortoise.exceptions.OperationalError: (1054, "Unknown column 'image_list' in 'field list'")
(For some Python list and dict values, I don't know what column type to use in SQL)

tortoise.exceptions.OperationalError: (1366, "Incorrect string value: '\xF0\x9F\x8C\xB0' for column 'nickname' at row 1") (I changed the SQL collation to "utf-8_general_ci")

asyncmy.errors.DataError: (1406, "Data too long for column 'avatar' at row 1")

tortoise.exceptions.OperationalError: (1406, "Data too long for column 'avatar' at row 1")
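
The errors above fall into three groups: columns missing from the table, list or dict values with no obvious column type, and a 4-byte emoji (`\xF0\x9F\x8C\xB0`) rejected by the column's character set (in MySQL, `utf8mb4` is the character set that stores 4-byte UTF-8 characters). For the CSV question, the following is a minimal sketch using only the standard library. It is not part of the project, and the sample record and its field names are hypothetical, chosen to mirror the columns named in the errors.

```python
import csv
import io
import json


def rows_to_csv(rows, fieldnames):
    """Serialize a list of dicts to CSV text.

    list and dict values are stored as JSON strings so each fits in one cell.
    """
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    for row in rows:
        flat = {}
        for key in fieldnames:
            value = row.get(key, "")
            if isinstance(value, (list, dict)):
                value = json.dumps(value, ensure_ascii=False)
            flat[key] = value
        writer.writerow(flat)
    return buffer.getvalue()


# Hypothetical sample record, not real crawler output.
sample = [{"nickname": "user\U0001F330", "image_list": ["a.jpg", "b.jpg"], "add_ts": 1690000000}]
csv_text = rows_to_csv(sample, ["nickname", "image_list", "add_ts"])
print(csv_text)

# The emoji from the 1366 error is a 4-byte UTF-8 sequence.
emoji_bytes = "\U0001F330".encode("utf-8")
print(emoji_bytes, len(emoji_bytes))
```

When writing to a file instead of a string, opening it with `newline=""` and an explicit UTF-8 encoding keeps the emoji intact.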

python main.py --platform dy --lt qrcode fails

I ran the command python main.py --platform dy --lt qrcode
I tried every value that --lt accepts, and the result is always this error:
Traceback (most recent call last):
File "C:\Users\nuc8\Downloads\MediaCrawler-main\MediaCrawler-main\main.py", line 51, in <module>
asyncio.get_event_loop().run_until_complete(main())
File "C:\Users\nuc8\AppData\Local\Programs\Python\Python311\Lib\asyncio\base_events.py", line 653, in run_until_complete
return future.result()
^^^^^^^^^^^^^^^
File "C:\Users\nuc8\Downloads\MediaCrawler-main\MediaCrawler-main\main.py", line 45, in main
await crawler.start()
File "C:\Users\nuc8\Downloads\MediaCrawler-main\MediaCrawler-main\media_platform\douyin\core.py", line 66, in start
await self.search()
File "C:\Users\nuc8\Downloads\MediaCrawler-main\MediaCrawler-main\media_platform\douyin\core.py", line 79, in search
posts_res = await self.dy_client.search_info_by_keyword(keyword=keyword,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\nuc8\Downloads\MediaCrawler-main\MediaCrawler-main\media_platform\douyin\client.py", line 129, in search_info_by_keyword
return await self.get("/aweme/v1/web/general/search/single/", params, headers=headers)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\nuc8\Downloads\MediaCrawler-main\MediaCrawler-main\media_platform\douyin\client.py", line 78, in get
await self.__process_req_params(params, headers)
File "C:\Users\nuc8\Downloads\MediaCrawler-main\MediaCrawler-main\media_platform\douyin\client.py", line 56, in __process_req_params
"webid": douyin_js_obj.call("get_web_id"),
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\nuc8\AppData\Local\Programs\Python\Python311\Lib\site-packages\execjs_abstract_runtime_context.py", line 37, in call
return self._call(name, *args)
^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\nuc8\AppData\Local\Programs\Python\Python311\Lib\site-packages\execjs_external_runtime.py", line 92, in _call
return self.eval("{identifier}.apply(this, {args})".format(identifier=identifier, args=args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\nuc8\AppData\Local\Programs\Python\Python311\Lib\site-packages\execjs_external_runtime.py", line 78, in eval
return self.exec_(code)
^^^^^^^^^^^^^^^^
File "C:\Users\nuc8\AppData\Local\Programs\Python\Python311\Lib\site-packages\execjs_abstract_runtime_context.py", line 18, in exec_
return self._exec_(source)
^^^^^^^^^^^^^^^^^^^
File "C:\Users\nuc8\AppData\Local\Programs\Python\Python311\Lib\site-packages\execjs_external_runtime.py", line 88, in exec
return self._extract_result(output)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\nuc8\AppData\Local\Programs\Python\Python311\Lib\site-packages\execjs_external_runtime.py", line 167, in _extract_result
raise ProgramError(value)
execjs._exceptions.ProgramError: SyntaxError: 缺少 ';'

Xiaohongshu: error after successfully scanning the QR code

Xiaohongshu reports an error after the QR code is scanned.

As follows:
2023-07-08 18:04:13 root INFO Begin login xiaohongshu by qrcode ...
2023-07-08 18:04:23 root INFO waiting for scan code login, remaining time is 20s
Traceback (most recent call last):
File "/Users/username/work/github_test/MediaCrawler/main.py", line 58, in <module>
asyncio.run(main())
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/asyncio/base_events.py", line 646, in run_until_complete
return future.result()
File "/Users/username/work/github_test/MediaCrawler/main.py", line 39, in main
await crawler.start()
File "/Users/username/work/github_test/MediaCrawler/media_platform/xhs/core.py", line 82, in start
await login_obj.begin()
File "/Users/username/work/github_test/MediaCrawler/media_platform/xhs/login.py", line 48, in begin
await self.login_by_qrcode()
File "/Users/username/work/github_test/MediaCrawler/media_platform/xhs/login.py", line 155, in login_by_qrcode
login_flag: bool = await self.check_login_state(no_logged_in_session)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/tenacity/_asyncio.py", line 88, in async_wrapped
return await fn(*args, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/tenacity/_asyncio.py", line 47, in __call__
do = self.iter(retry_state=retry_state)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/tenacity/__init__.py", line 326, in iter
raise retry_exc from fut.exception()
tenacity.RetryError: RetryError[<Future at 0x1271e2ef0 state=finished returned bool>]

After logging in to Douyin with a cookie, I get the error execjs._exceptions.RuntimeUnavailableError: Could not find an available JavaScript runtime.

The detailed error output is as follows:
(base) yyyy:~/Union/MediaCrawler$ python main.py --platform dy --lt cookie
/yyyy/MediaCrawler/main.py:51: DeprecationWarning: There is no current event loop
asyncio.get_event_loop().run_until_complete(main())
2023-08-11 15:08:42 MediaCrawler INFO Begin login douyin by cookie ...
2023-08-11 15:08:48 MediaCrawler INFO login finished then check login state ...
2023-08-11 15:08:48 MediaCrawler INFO Login successful then wait for 5 seconds redirect ...
2023-08-11 15:08:53 MediaCrawler INFO Begin search douyin keywords
2023-08-11 15:08:53 MediaCrawler INFO Current keyword: 健身
Traceback (most recent call last):
File "/yyyy/MediaCrawler/main.py", line 51, in <module>
asyncio.get_event_loop().run_until_complete(main())
File "/yyyy/anaconda3/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
return future.result()
File "/yyyy/MediaCrawler/main.py", line 45, in main
await crawler.start()
File "/yyyy/MediaCrawler/media_platform/douyin/core.py", line 66, in start
await self.search()
File "/yyyy/MediaCrawler/media_platform/douyin/core.py", line 79, in search
posts_res = await self.dy_client.search_info_by_keyword(keyword=keyword,
File "/yyyy/MediaCrawler/media_platform/douyin/client.py", line 129, in search_info_by_keyword
return await self.get("/aweme/v1/web/general/search/single/", params, headers=headers)
File "/yyyy/MediaCrawler/media_platform/douyin/client.py", line 78, in get
await self.__process_req_params(params, headers)
File "/yyyy/MediaCrawler/media_platform/douyin/client.py", line 38, in __process_req_params
douyin_js_obj = execjs.compile(open('libs/douyin.js').read())
File "/yyyy/anaconda3/lib/python3.10/site-packages/execjs/__init__.py", line 61, in compile
return get().compile(source, cwd)
File "/yyyy/anaconda3/lib/python3.10/site-packages/execjs/_runtimes.py", line 21, in get
return get_from_environment() or _find_available_runtime()
File "/yyyy/anaconda3/lib/python3.10/site-packages/execjs/_runtimes.py", line 49, in _find_available_runtime
raise exceptions.RuntimeUnavailableError("Could not find an available JavaScript runtime.")
execjs._exceptions.RuntimeUnavailableError: Could not find an available JavaScript runtime.
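
PyExecJS hands JavaScript to an external runtime, and `RuntimeUnavailableError` means it found none. Node.js is one of the runtimes it can use. The following sketch uses only the standard library to check whether a Node.js executable is on `PATH`. The candidate names `node` and `nodejs` are assumptions about how the binary is named on a given system.

```python
import shutil


def find_js_runtimes(candidates=("node", "nodejs")):
    """Return {name: path} for each candidate executable found on PATH."""
    found = {}
    for name in candidates:
        path = shutil.which(name)
        if path is not None:
            found[name] = path
    return found


runtimes = find_js_runtimes()
if runtimes:
    print("JavaScript runtime(s) on PATH:", runtimes)
else:
    print("No Node.js executable on PATH; installing Node.js may resolve the error.")
```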

Xiaohongshu QR code scan failed

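Starting today, I cannot log in to Xiaohongshu by scanning the QR code. The image pops up, I scan it, and it immediately reports a failure and asks me to log in again. After that, the QR code has expired.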

Cannot log in to Douyin

dy reports the same error no matter which login method is used; xhs works normally.

It seems to be a Playwright problem. On my side Playwright cannot open the Douyin home page. After changing index_url to www.douyin.com/discover it works normally.

Error: Full list of missing libraries:

File "C:\Users\xiazhiqiang\Desktop\MediaCrawler-main\media_platform\xhs\core.py", line 44, in start
self.browser_context = await self.launch_browser(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\xiazhiqiang\Desktop\MediaCrawler-main\media_platform\xhs\core.py", line 184, in launch_browser
browser_context = await chromium.launch_persistent_context(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "e:\miniconda3\Lib\site-packages\playwright\async_api_generated.py", line 14727, in launch_persistent_context
await self._impl_obj.launch_persistent_context(
File "e:\miniconda3\Lib\site-packages\playwright_impl_browser_type.py", line 155, in launch_persistent_context
from_channel(await self._channel.send("launchPersistentContext", params)),
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "e:\miniconda3\Lib\site-packages\playwright_impl_connection.py", line 61, in send
return await self._connection.wrap_api_call(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "e:\miniconda3\Lib\site-packages\playwright_impl_connection.py", line 461, in wrap_api_call
return await cb()
^^^^^^^^^^
File "e:\miniconda3\Lib\site-packages\playwright_impl_connection.py", line 96, in inner_send
result = next(iter(done)).result()
^^^^^^^^^^^^^^^^^^^^^^^^^
playwright._impl._api_types.Error: Host system is missing dependencies!

Full list of missing libraries:
chrome_elf.dll

About Redis

Could I ask whether Douyin comment crawling still cannot be stored in the database via Redis?
Could you also add a walkthrough for configuring Redis on Windows?
please😜

SQL insert fails

In xhs.model:
"title": note_item.get("title") or note_item.get("desc", "")
Some notes have only desc and no title, so the title string becomes too long for the column and the insert fails.
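
One way to avoid the overflow described above is to truncate the fallback value before inserting it. This is a minimal sketch, not the project's code. The helper name `pick_title` is made up, and `max_len=255` is an assumed column length that should be replaced with the real size of the `title` column.

```python
def pick_title(note_item, max_len=255):
    """Prefer the note's title; fall back to the description, then truncate.

    max_len is an assumed column length, not taken from the project's schema.
    """
    title = note_item.get("title") or note_item.get("desc", "")
    return title[:max_len]


print(len(pick_title({"desc": "x" * 1000})))
print(pick_title({"title": "short title", "desc": "long description"}))
```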

Error message

Error when crawling Douyin comments: MediaCrawler ERROR aweme_id: xxx get comments failed, error: Expecting value: line 1 column 1 (char 0)
How should this be resolved?
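
`Expecting value: line 1 column 1 (char 0)` is the message Python's `json` module raises when the text it is given does not start with a JSON value, for example an empty body or an HTML page. The report does not show what the server actually returned, so the following is only a diagnostic sketch. The helper name `parse_json_body` is made up, and it is meant to be applied to the raw response text before decoding.

```python
import json


def parse_json_body(body):
    """Parse a response body, returning (data, error_message)."""
    if not body.strip():
        return None, "empty response body"
    try:
        return json.loads(body), None
    except json.JSONDecodeError as exc:
        preview = body[:80]
        return None, f"not JSON ({exc.msg}); body starts with: {preview!r}"


print(parse_json_body(""))
print(parse_json_body("<html>blocked</html>"))
print(parse_json_body('{"comments": []}'))
```

Logging the preview alongside the `aweme_id` makes it easier to tell an empty reply from a verification page.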

QR code login: the program hangs and does not continue unless the image opened by the show method is closed

tools.utils.show_qrcode()

def show_qrcode(qr_code: str):
    """parse base64 encode qrcode image and show it"""
    qr_code = qr_code.split(",")[1]
    qr_code = base64.b64decode(qr_code)
    image = Image.open(BytesIO(qr_code))

    # Add a square border around the QR code and display it within the border to improve scanning accuracy.
    width, height = image.size
    new_image = Image.new('RGB', (width + 20, height + 20), color=(255, 255, 255))
    new_image.paste(image, (10, 10))
    draw = ImageDraw.Draw(new_image)
    draw.rectangle((0, 0, width + 19, height + 19), outline=(0, 0, 0), width=1)
    new_image.show()

login.login_by_qrcode
Would the experience be better if the QR code were opened asynchronously and the login state were checked in a loop?

login_flag: bool = await self.check_login_state(no_logged_in_session)
if not login_flag:
    # wait 2s
    # login_flag: bool = await self.check_login_state(no_logged_in_session)
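
The suggestion above (open the QR code without blocking, then poll the login state) could look roughly like the following. This is a sketch under stated assumptions, not the project's implementation: `wait_for_login`, `check_fn`, and `show_fn` are hypothetical names, and the timeout and interval values are placeholders.

```python
import asyncio


async def wait_for_login(check_fn, show_fn, timeout_s=20.0, interval_s=1.0):
    """Show the QR code without blocking, then poll until logged in or timed out.

    check_fn: hypothetical async callable returning True once logged in.
    show_fn:  hypothetical blocking callable that displays the QR image.
    """
    loop = asyncio.get_running_loop()
    # Run the blocking viewer in a worker thread so polling can start right away.
    loop.run_in_executor(None, show_fn)

    deadline = loop.time() + timeout_s
    while loop.time() < deadline:
        if await check_fn():
            return True
        await asyncio.sleep(interval_s)
    return False


async def _demo():
    state = {"calls": 0}

    async def fake_check():
        # Pretend the user finishes scanning on the third poll.
        state["calls"] += 1
        return state["calls"] >= 3

    return await wait_for_login(fake_check, lambda: None, timeout_s=2.0, interval_s=0.01)


logged_in = asyncio.run(_demo())
print(logged_in)
```

One caveat: a viewer that never returns keeps its worker thread alive, so the process may still wait for it on shutdown.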

Invalid port Bug

The following error is reported:

Begin search xiaohongshu keywords:  健身
Traceback (most recent call last):
  File "/home/MediaCrawler/env/lib/python3.8/site-packages/httpx/_urlparse.py", line 339, in normalize_port
    port_as_int = int(port)
ValueError: invalid literal for int() with base 10: ':1'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "main.py", line 35, in <module>
    asyncio.run(main())
  File "/usr/lib/python3.8/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
    return future.result()
  File "main.py", line 30, in main
    await crawler.start()
  File "/home/MediaCrawler/media_platform/xhs/core.py", line 70, in start
    note_res = await self.search_posts()
  File "/home/MediaCrawler/media_platform/xhs/core.py", line 134, in search_posts
    posts_res = await self.xhs_client.get_note_by_keyword(keyword=self.keywords)
  File "/home/MediaCrawler/media_platform/xhs/client.py", line 110, in get_note_by_keyword
    return await self.post(uri, data)
  File "/home/MediaCrawler/media_platform/xhs/client.py", line 77, in post
    return await self.request(method="POST", url=f"{self._host}{uri}",
  File "/home/MediaCrawler/media_platform/xhs/client.py", line 53, in request
    async with httpx.AsyncClient(proxies=self.proxies) as client:
  File "/home/MediaCrawler/env/lib/python3.8/site-packages/httpx/_client.py", line 1408, in __init__
    self._mounts: typing.Dict[URLPattern, typing.Optional[AsyncBaseTransport]] = {
  File "/home/MediaCrawler/env/lib/python3.8/site-packages/httpx/_client.py", line 1409, in <dictcomp>
    URLPattern(key): None
  File "/home/MediaCrawler/env/lib/python3.8/site-packages/httpx/_utils.py", line 397, in __init__
    url = URL(pattern)
  File "/home/MediaCrawler/env/lib/python3.8/site-packages/httpx/_urls.py", line 113, in __init__
    self._uri_reference = urlparse(url, **kwargs)
  File "/home/MediaCrawler/env/lib/python3.8/site-packages/httpx/_urlparse.py", line 246, in urlparse
    parsed_port: typing.Optional[int] = normalize_port(port, scheme)
  File "/home/MediaCrawler/env/lib/python3.8/site-packages/httpx/_urlparse.py", line 341, in normalize_port
    raise InvalidURL("Invalid port")
httpx.InvalidURL: Invalid port
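
The traceback fails while httpx builds URL patterns for its proxy mounts, and the rejected port text is `':1'`. One cause consistent with this is a `NO_PROXY` or `no_proxy` environment variable that contains a bare IPv6 address such as `::1`, whose trailing `:1` gets read as a port. This is an assumption, since the report does not show the environment. The following sketch, using only the standard library and a made-up helper name, lists such entries.

```python
import ipaddress


def bare_ipv6_entries(no_proxy_value):
    """Return NO_PROXY entries that are unbracketed IPv6 addresses, e.g. '::1'."""
    suspicious = []
    for entry in no_proxy_value.split(","):
        host = entry.strip()
        if not host:
            continue
        try:
            address = ipaddress.ip_address(host)
        except ValueError:
            continue  # hostnames, domains, and bracketed addresses land here
        if address.version == 6:
            suspicious.append(host)
    return suspicious


print(bare_ipv6_entries("localhost,127.0.0.1,::1"))
```

In practice the input would be `os.environ.get("NO_PROXY", "")`. If an entry shows up, removing it from the variable or upgrading httpx are two things to try.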

Cannot crawl video comments

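I saw the same problem in another issue from 3 weeks ago. The author replied that it was fixed, but the same error still occurs. I tested with several different keywords and got the same result. The error is as follows: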

2023-09-03 23:32:26 MediaCrawler ERROR aweme_id: 7241024491999022392 get comments failed, error: Expecting value: line 1 column 1 (char 0),
2023-09-03 23:32:26 MediaCrawler ERROR aweme_id: 7257898852668296485 get comments failed, error: Expecting value: line 1 column 1 (char 0),
2023-09-03 23:32:26 MediaCrawler ERROR aweme_id: 7246635327694179623 get comments failed, error: Expecting value: line 1 column 1 (char 0),
2023-09-03 23:32:26 MediaCrawler ERROR aweme_id: 7042503193409880590 get comments failed, error: Expecting value: line 1 column 1 (char 0),
2023-09-03 23:32:26 MediaCrawler ERROR aweme_id: 7256294498760674612 get comments failed, error: Expecting value: line 1 column 1 (char 0),
2023-09-03 23:32:26 MediaCrawler ERROR aweme_id: 7229990392005922081 get comments failed, error: Expecting value: line 1 column 1 (char 0),
2023-09-03 23:32:26 MediaCrawler INFO Douyin Crawler finished ...

QR code login

QR code login is unavailable; a slider verification is now required.
<Page url='https://www.xiaohongshu.com/website-login/captcha?redirectPath=>

Login has expired

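I can log in successfully with either QR code login or cookie login.
However, once crawling starts, before even one batch of data has been fetched,
I get logged out and the following message is shown: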
media_platform.xhs.exception.DataFetchError: 登录已

Why is that?

Error when running on a server

When running on a server, I get a "no permission to access" error. Where can this be configured?

2023-09-23  22:30:23 MediaCrawler INFO Begin create browser context ... 
2023-09-23  22:30:25 MediaCrawler INFO Begin create xiaohongshu API client ... 
2023-09-23  22:30:25 MediaCrawler INFO Begin to ping xhs... 
2023-09-23  22:30:25 httpx INFO HTTP Request: POST https://edith.xiaohongshu.com/api/sns/web/v1/search/notes "HTTP/1.1 200 OK" 
2023-09-23  22:30:25 MediaCrawler ERROR Ping xhs failed: 您当前登录的账号没有权限访问, and try to login again... 
2023-09-23  22:30:25 MediaCrawler INFO Begin login xiaohongshu ... 
2023-09-23  22:30:25 MediaCrawler INFO Begin login xiaohongshu by qrcode ... 
2023-09-23  22:30:25 MediaCrawler INFO waiting for scan code login, remaining time is 20s 
<PIL.Image.Image image mode=RGB size=175x175 at 0x7F693C062E00>
