daijro / hrequests
🚀 Web scraping for humans
Home Page: https://daijro.gitbook.io/hrequests/
License: Apache License 2.0
page = hrequests.BrowserSession()
I have tried to use browser automation as given in the example, but it shows this error:
AttributeError: module 'hrequests' has no attribute 'BrowserSession'
I also tried this example:
session = hrequests.Session(browser='chrome')
resp = session.get('https://quotes.toscrape.com/page/1/')
with resp.render(mock_human=True) as page:
    print(page.text)
and it also throws an error: AttributeError: module 'hrequests' has no attribute 'browser'
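Missing attributes like this usually mean the installed hrequests is older than the one the README examples target. Not part of the original report, but a quick sanity check is to compare the installed version against the docs; this sketch uses the stdlib importlib.metadata (the helper name installed_version is made up for illustration):

```python
from importlib.metadata import version, PackageNotFoundError

def installed_version(pkg: str):
    """Return the installed version string of a package, or None if absent."""
    try:
        return version(pkg)
    except PackageNotFoundError:
        return None

# e.g. compare this against the version the README examples were written for
print(installed_version("hrequests"))
```

If the printed version predates the feature being used, `pip install -U hrequests` (or `hrequests[all]` for the browser extras) is the first thing to try.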
I tried to install with pip install -U hrequests[all] but got this problem.
I have installed these individual components via the Visual Studio Installer:
C++ Cmake tools for Windows
Testing tools core features
C++ Address Sanitizer
command: pip install -U hrequests[all]
Error:
Using cached playwright_stealth-1.0.6-py3-none-any.whl (28 kB)
Building wheels for collected packages: greenlet
Building wheel for greenlet (setup.py) ... error
error: subprocess-exited-with-error
× python setup.py bdist_wheel did not run successfully.
│ exit code: 1
╰─> [120 lines of output]
running bdist_wheel
running build
running build_py
creating build
creating build\lib.win-amd64-cpython-312
creating build\lib.win-amd64-cpython-312\greenlet
copying src\greenlet\__init__.py -> build\lib.win-amd64-cpython-312\greenlet
creating build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\platform\__init__.py -> build\lib.win-amd64-cpython-312\greenlet\platform
creating build\lib.win-amd64-cpython-312\greenlet\tests
copying src\greenlet\tests\leakcheck.py -> build\lib.win-amd64-cpython-312\greenlet\tests
copying src\greenlet\tests\test_contextvars.py -> build\lib.win-amd64-cpython-312\greenlet\tests
copying src\greenlet\tests\test_cpp.py -> build\lib.win-amd64-cpython-312\greenlet\tests
copying src\greenlet\tests\test_extension_interface.py -> build\lib.win-amd64-cpython-312\greenlet\tests
copying src\greenlet\tests\test_gc.py -> build\lib.win-amd64-cpython-312\greenlet\tests
copying src\greenlet\tests\test_generator.py -> build\lib.win-amd64-cpython-312\greenlet\tests
copying src\greenlet\tests\test_generator_nested.py -> build\lib.win-amd64-cpython-312\greenlet\tests
copying src\greenlet\tests\test_greenlet.py -> build\lib.win-amd64-cpython-312\greenlet\tests
copying src\greenlet\tests\test_greenlet_trash.py -> build\lib.win-amd64-cpython-312\greenlet\tests
copying src\greenlet\tests\test_leaks.py -> build\lib.win-amd64-cpython-312\greenlet\tests
copying src\greenlet\tests\test_stack_saved.py -> build\lib.win-amd64-cpython-312\greenlet\tests
copying src\greenlet\tests\test_throw.py -> build\lib.win-amd64-cpython-312\greenlet\tests
copying src\greenlet\tests\test_tracing.py -> build\lib.win-amd64-cpython-312\greenlet\tests
copying src\greenlet\tests\test_version.py -> build\lib.win-amd64-cpython-312\greenlet\tests
copying src\greenlet\tests\test_weakref.py -> build\lib.win-amd64-cpython-312\greenlet\tests
copying src\greenlet\tests\__init__.py -> build\lib.win-amd64-cpython-312\greenlet\tests
running egg_info
writing src\greenlet.egg-info\PKG-INFO
writing dependency_links to src\greenlet.egg-info\dependency_links.txt
writing requirements to src\greenlet.egg-info\requires.txt
writing top-level names to src\greenlet.egg-info\top_level.txt
reading manifest file 'src\greenlet.egg-info\SOURCES.txt'
reading manifest template 'MANIFEST.in'
warning: no previously-included files found matching 'benchmarks*.json'
no previously-included directories found matching 'docs/_build'
warning: no files found matching '*.py' under directory 'appveyor'
warning: no previously-included files matching '*.pyc' found anywhere in distribution
warning: no previously-included files matching '*.pyd' found anywhere in distribution
warning: no previously-included files matching '*.so' found anywhere in distribution
warning: no previously-included files matching '.coverage' found anywhere in distribution
adding license file 'LICENSE'
adding license file 'LICENSE.PSF'
adding license file 'AUTHORS'
writing manifest file 'src\greenlet.egg-info\SOURCES.txt'
copying src\greenlet\greenlet.cpp -> build\lib.win-amd64-cpython-312\greenlet
copying src\greenlet\greenlet.h -> build\lib.win-amd64-cpython-312\greenlet
copying src\greenlet\greenlet_allocator.hpp -> build\lib.win-amd64-cpython-312\greenlet
copying src\greenlet\greenlet_compiler_compat.hpp -> build\lib.win-amd64-cpython-312\greenlet
copying src\greenlet\greenlet_cpython_compat.hpp -> build\lib.win-amd64-cpython-312\greenlet
copying src\greenlet\greenlet_exceptions.hpp -> build\lib.win-amd64-cpython-312\greenlet
copying src\greenlet\greenlet_greenlet.hpp -> build\lib.win-amd64-cpython-312\greenlet
copying src\greenlet\greenlet_internal.hpp -> build\lib.win-amd64-cpython-312\greenlet
copying src\greenlet\greenlet_refs.hpp -> build\lib.win-amd64-cpython-312\greenlet
copying src\greenlet\greenlet_slp_switch.hpp -> build\lib.win-amd64-cpython-312\greenlet
copying src\greenlet\greenlet_thread_state.hpp -> build\lib.win-amd64-cpython-312\greenlet
copying src\greenlet\greenlet_thread_state_dict_cleanup.hpp -> build\lib.win-amd64-cpython-312\greenlet
copying src\greenlet\greenlet_thread_support.hpp -> build\lib.win-amd64-cpython-312\greenlet
copying src\greenlet\slp_platformselect.h -> build\lib.win-amd64-cpython-312\greenlet
copying src\greenlet\platform\setup_switch_x64_masm.cmd -> build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\platform\switch_aarch64_gcc.h -> build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\platform\switch_alpha_unix.h -> build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\platform\switch_amd64_unix.h -> build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\platform\switch_arm32_gcc.h -> build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\platform\switch_arm32_ios.h -> build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\platform\switch_arm64_masm.asm -> build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\platform\switch_arm64_masm.obj -> build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\platform\switch_arm64_msvc.h -> build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\platform\switch_csky_gcc.h -> build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\platform\switch_m68k_gcc.h -> build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\platform\switch_mips_unix.h -> build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\platform\switch_ppc64_aix.h -> build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\platform\switch_ppc64_linux.h -> build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\platform\switch_ppc_aix.h -> build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\platform\switch_ppc_linux.h -> build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\platform\switch_ppc_macosx.h -> build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\platform\switch_ppc_unix.h -> build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\platform\switch_riscv_unix.h -> build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\platform\switch_s390_unix.h -> build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\platform\switch_sparc_sun_gcc.h -> build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\platform\switch_x32_unix.h -> build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\platform\switch_x64_masm.asm -> build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\platform\switch_x64_masm.obj -> build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\platform\switch_x64_msvc.h -> build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\platform\switch_x86_msvc.h -> build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\platform\switch_x86_unix.h -> build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\tests\_test_extension.c -> build\lib.win-amd64-cpython-312\greenlet\tests
copying src\greenlet\tests\_test_extension_cpp.cpp -> build\lib.win-amd64-cpython-312\greenlet\tests
running build_ext
building 'greenlet._greenlet' extension
creating build\temp.win-amd64-cpython-312
creating build\temp.win-amd64-cpython-312\Release
creating build\temp.win-amd64-cpython-312\Release\src
creating build\temp.win-amd64-cpython-312\Release\src\greenlet
"C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.39.33519\bin\HostX86\x64\cl.exe" /c /nologo /O2 /W3 /GL /DNDEBUG /MD -DWIN32=1 -IC:\Users\HOSEEN\AppData\Local\Programs\Python\Python312\include -IC:\Users\HOSEEN\AppData\Local\Programs\Python\Python312\Include "-IC:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.39.33519\include" "-IC:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Auxiliary\VS\include" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22621.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22621.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22621.0\shared" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22621.0\winrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22621.0\cppwinrt" /EHsc /Tpsrc/greenlet/greenlet.cpp /Fobuild\temp.win-amd64-cpython-312\Release\src/greenlet/greenlet.obj /EHsr /GT
greenlet.cpp
C:\Users\HOSEEN\AppData\Local\Temp\pip-install-xc72bsav\greenlet_678fbe0f7a954225994ef76c8021962b\src\greenlet\greenlet_greenlet.hpp(831): error C2039: 'use_tracing': is not a member of '_PyCFrame'
C:\Users\HOSEEN\AppData\Local\Programs\Python\Python312\include\cpython/pystate.h(67): note: see declaration of '_PyCFrame'
C:\Users\HOSEEN\AppData\Local\Temp\pip-install-xc72bsav\greenlet_678fbe0f7a954225994ef76c8021962b\src\greenlet\greenlet_greenlet.hpp(834): error C2039: 'recursion_limit': is not a member of '_ts'
C:\Users\HOSEEN\AppData\Local\Programs\Python\Python312\include\cpython/pystate.h(115): note: see declaration of '_ts'
C:\Users\HOSEEN\AppData\Local\Temp\pip-install-xc72bsav\greenlet_678fbe0f7a954225994ef76c8021962b\src\greenlet\greenlet_greenlet.hpp(834): error C2039: 'recursion_remaining': is not a member of '_ts'
C:\Users\HOSEEN\AppData\Local\Programs\Python\Python312\include\cpython/pystate.h(115): note: see declaration of '_ts'
C:\Users\HOSEEN\AppData\Local\Temp\pip-install-xc72bsav\greenlet_678fbe0f7a954225994ef76c8021962b\src\greenlet\greenlet_greenlet.hpp(848): error C2039: 'trash_delete_nesting': is not a member of '_ts'
C:\Users\HOSEEN\AppData\Local\Programs\Python\Python312\include\cpython/pystate.h(115): note: see declaration of '_ts'
C:\Users\HOSEEN\AppData\Local\Temp\pip-install-xc72bsav\greenlet_678fbe0f7a954225994ef76c8021962b\src\greenlet\greenlet_greenlet.hpp(867): error C2039: 'use_tracing': is not a member of '_PyCFrame'
C:\Users\HOSEEN\AppData\Local\Programs\Python\Python312\include\cpython/pystate.h(67): note: see declaration of '_PyCFrame'
C:\Users\HOSEEN\AppData\Local\Temp\pip-install-xc72bsav\greenlet_678fbe0f7a954225994ef76c8021962b\src\greenlet\greenlet_greenlet.hpp(870): error C2039: 'recursion_remaining': is not a member of '_ts'
C:\Users\HOSEEN\AppData\Local\Programs\Python\Python312\include\cpython/pystate.h(115): note: see declaration of '_ts'
C:\Users\HOSEEN\AppData\Local\Temp\pip-install-xc72bsav\greenlet_678fbe0f7a954225994ef76c8021962b\src\greenlet\greenlet_greenlet.hpp(870): error C2039: 'recursion_limit': is not a member of '_ts'
C:\Users\HOSEEN\AppData\Local\Programs\Python\Python312\include\cpython/pystate.h(115): note: see declaration of '_ts'
C:\Users\HOSEEN\AppData\Local\Temp\pip-install-xc72bsav\greenlet_678fbe0f7a954225994ef76c8021962b\src\greenlet\greenlet_greenlet.hpp(881): error C2039: 'trash_delete_nesting': is not a member of '_ts'
C:\Users\HOSEEN\AppData\Local\Programs\Python\Python312\include\cpython/pystate.h(115): note: see declaration of '_ts'
C:\Users\HOSEEN\AppData\Local\Temp\pip-install-xc72bsav\greenlet_678fbe0f7a954225994ef76c8021962b\src\greenlet\greenlet_greenlet.hpp(891): error C2039: 'use_tracing': is not a member of '_PyCFrame'
C:\Users\HOSEEN\AppData\Local\Programs\Python\Python312\include\cpython/pystate.h(67): note: see declaration of '_PyCFrame'
C:\Users\HOSEEN\AppData\Local\Temp\pip-install-xc72bsav\greenlet_678fbe0f7a954225994ef76c8021962b\src\greenlet\greenlet_greenlet.hpp(899): error C2039: 'recursion_limit': is not a member of '_ts'
C:\Users\HOSEEN\AppData\Local\Programs\Python\Python312\include\cpython/pystate.h(115): note: see declaration of '_ts'
C:\Users\HOSEEN\AppData\Local\Temp\pip-install-xc72bsav\greenlet_678fbe0f7a954225994ef76c8021962b\src\greenlet\greenlet_greenlet.hpp(899): error C2039: 'recursion_remaining': is not a member of '_ts'
C:\Users\HOSEEN\AppData\Local\Programs\Python\Python312\include\cpython/pystate.h(115): note: see declaration of '_ts'
src/greenlet/greenlet.cpp(3095): error C2039: 'trash_delete_nesting': is not a member of '_ts'
C:\Users\HOSEEN\AppData\Local\Programs\Python\Python312\include\cpython/pystate.h(115): note: see declaration of '_ts'
error: command 'C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.39.33519\bin\HostX86\x64\cl.exe' failed with exit code 2
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for greenlet
Running setup.py clean for greenlet
Failed to build greenlet
ERROR: Could not build wheels for greenlet, which is required to install pyproject.toml-based projects
python version: 3.12.2
OS: Windows
How can I fix this error?
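For context on the compile errors above: the `_PyCFrame.use_tracing` and thread-state fields the compiler flags were removed in CPython 3.12, and greenlet only gained 3.12 support in its 3.x line, so upgrading greenlet first (e.g. `pip install "greenlet>=3.0"`) typically avoids the source build entirely. A small sketch of the version logic involved (the helper name is made up for illustration):

```python
def needs_newer_greenlet(py_version: tuple, greenlet_major: int) -> bool:
    # CPython 3.12 removed _PyCFrame.use_tracing and several thread-state
    # fields (recursion_limit, trash_delete_nesting, ...) that greenlet 2.x
    # accesses directly; greenlet 3.x tracks the new internals, so only the
    # 3.12+/greenlet<3 combination fails to compile.
    return py_version >= (3, 12) and greenlet_major < 3

assert needs_newer_greenlet((3, 12, 2), 2)      # the reporter's combination
assert not needs_newer_greenlet((3, 11, 0), 2)  # greenlet 2.x builds on 3.11
assert not needs_newer_greenlet((3, 12, 2), 3)  # greenlet 3.x builds on 3.12
```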
I'm building a container using the hrequests library; however, when I try to use the get function, it doesn't work and stops. Thanks for your help!
My Python crashes when trying a simple request
import hrequests
resp = hrequests.get('https://www.google.com')
I am using a virtual env with Python 3.9.6 on a MacBook with an M2 chip. In the PyCharm interpreter or terminal, I get "Process finished with exit code 137 (interrupted by signal 9: SIGKILL)" (PyCharm), and an Apple crash report window appears with the message "python quit unexpectedly". This happens during import of the package.
Hello,
this webpage is well known for testing scraping bots against anti-bot detection:
https://arh.antoinevastel.com/bots/areyouheadless
I tried hrequests against it, and it's detected as headless when the browser is set to chrome :(
regards
Hi, nice work on this library. I'm trying to parse a bunch of pages with it. But I'm running into issues where fetching content that doesn't exist throws an attribute error. Here's an example:
resp = hrequests.get("some_url")
data = {}
try:
    data['url'] = resp.url
    data["canonical"] = resp.html.find("link[@rel='canonical']").url
    data["title"] = resp.html.find("title").text
    data["meta_description"] = resp.html.find("meta[name='description']").text
except AttributeError:
    pass
Because I'm calling .text and .url on these elements, if any elements don't exist in the HTML response, the code throws an AttributeError: 'NoneType' object has no attribute 'text', and the data object will only have content prior to the error, missing any other valid elements. So for example, if there is no <title> element but the other 3 elements do exist, the data dict will only contain the url and canonical values; it won't have the meta_description.
The attribute error makes sense, but when scraping content at scale, there are going to be errors, edge cases, and missing content. I don't see a way to handle this gracefully. I'm fine with an empty string if the value is missing, or a None value. Is there a better way to handle this? I can remove the .url and .text properties, but I'd still have to handle it downstream with a bunch of if/else statements, and I'd prefer to just parse out the content early in the pipeline.
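One pattern that addresses this without any library changes is a small wrapper that returns a default whenever find() comes back None. This is a sketch, not part of hrequests; safe_attr is a hypothetical helper name:

```python
def safe_attr(element, attr: str, default: str = ""):
    # Return element.<attr> when the element was found, else the default,
    # so a missing <title> or meta tag yields "" instead of AttributeError.
    return getattr(element, attr, default) if element is not None else default

class FoundTag:
    """Stand-in for an element returned by resp.html.find(), for offline demo."""
    text = "Example Title"

assert safe_attr(None, "text") == ""             # missing element -> default
assert safe_attr(FoundTag(), "text") == "Example Title"
```

Applied to the snippet above, each line becomes e.g. `data["title"] = safe_attr(resp.html.find("title"), "text")`, so one missing element no longer truncates the rest of the dict.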
Using hrequests, after creating an exe from my Python script, I get this error at exe startup:
FileNotFoundError: [Errno 2] No such file or directory: 'C:\Users\frige\AppData\Local\Temp\_MEI441402\hrequests\bin\CR_VERSIONS.json'.
The script itself works fine.
import hrequests
session = hrequests.Session('chrome', version=103)
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:120.0) Gecko/20100101 Firefox/120.0',
    'Accept': '*/*',
    'Accept-Language': 'it-IT,it;q=0.8,en-US;q=0.5,en;q=0.3',
    'Referer': 'https://sell.wethenew.com/login',
    'content-type': 'application/json',
    'Alt-Used': 'sell.wethenew.com',
    'Connection': 'keep-alive',
    'Sec-Fetch-Dest': 'empty',
    'Sec-Fetch-Mode': 'cors',
    'Sec-Fetch-Site': 'same-origin',
    'If-None-Match': 'W/17gujz3lxj828',
}
csrf = eval(session.get('https://sell.wethenew.com/api/auth/csrf', headers=headers, proxies=proxy).text)['csrfToken']
That is the command that auto-py-to-exe run:
pyinstaller --noconfirm --onefile --console --hidden-import "discord_webhook" "D:/Dev/Main.py"
Can anybody help me?
I tried to put hrequests in the hidden-import list while using auto-py-to-exe, but nothing changed.
Great package! Are you planning to integrate this functionality as middleware compatible with Scrapy?
I have observed that when I try to render the content, the proxy is not used, because proxy is None in the Response class. The proxy only works without rendering.
Hello,
Is it possible to interact with elements of an iframe? If so, can you give an example?
Thanks for your help
Is there any way to create a random fingerprint hash for each request?
I am testing this library with browser automation on some websites, and I have observed that for many of them, lazy content (images, JS scripts that might load the page) does not fully load. I was wondering what might cause this issue.
Proxy formats: auth (user:pass@ip:port) and ip:port
On Windows 10, Python 3.10.1:
import hrequests
page = hrequests.BrowserSession()
Results in the following exception:
Task exception was never retrieved
future: <Task finished name='Task-7' coro=<Connection.run() done, defined at <redacted>\hrequests\venv\lib\site-packages\playwright\_impl\_connection.py:264> exception=NotImplementedError()>
Traceback (most recent call last):
File "<redacted>\hrequests\venv\lib\site-packages\playwright\_impl\_connection.py", line 271, in run
await self._transport.connect()
File "<redacted>\hrequests\venv\lib\site-packages\playwright\_impl\_transport.py", line 127, in connect
raise exc
File "<redacted>\hrequests\venv\lib\site-packages\playwright\_impl\_transport.py", line 116, in connect
self._proc = await asyncio.create_subprocess_exec(
File "C:\Users\<redacted>\AppData\Local\Programs\Python\Python310\lib\asyncio\subprocess.py", line 218, in create_subprocess_exec
transport, protocol = await loop.subprocess_exec(
File "C:\Users\<redacted>\AppData\Local\Programs\Python\Python310\lib\asyncio\base_events.py", line 1652, in subprocess_exec
transport = await self._make_subprocess_transport(
File "C:\Users\<redacted>\AppData\Local\Programs\Python\Python310\lib\asyncio\base_events.py", line 493, in _make_subprocess_transport
raise NotImplementedError
NotImplementedError
Exception in thread Thread-5 (spawn_main):
Traceback (most recent call last):
File "C:\Users\<redacted>\AppData\Local\Programs\Python\Python310\lib\threading.py", line 1009, in _bootstrap_inner
self.run()
File "C:\Users\<redacted>\AppData\Local\Programs\Python\Python310\lib\threading.py", line 946, in run
self._target(*self._args, **self._kwargs)
File "<redacted>\hrequests\venv\lib\site-packages\hrequests\browser.py", line 115, in spawn_main
asyncio.new_event_loop().run_until_complete(self.main())
File "C:\Users\<redacted>\AppData\Local\Programs\Python\Python310\lib\asyncio\base_events.py", line 641, in run_until_complete
return future.result()
File "<redacted>\hrequests\venv\lib\site-packages\hrequests\browser.py", line 119, in main
self.client = await hrequests.PlaywrightMock(
File "<redacted>\hrequests\venv\lib\site-packages\async_class.py", line 173, in __await__
yield from self.create_task(
File "<redacted>\hrequests\venv\lib\site-packages\hrequests\playwright_mock\playwright_mock.py", line 19, in __ainit__
self.playwright = await async_playwright().start()
File "<redacted>\hrequests\venv\lib\site-packages\playwright\async_api\_context_manager.py", line 52, in start
return await self.__aenter__()
File "<redacted>\hrequests\venv\lib\site-packages\playwright\async_api\_context_manager.py", line 47, in __aenter__
playwright = AsyncPlaywright(next(iter(done)).result())
File "<redacted>\hrequests\venv\lib\site-packages\playwright\_impl\_connection.py", line 271, in run
await self._transport.connect()
File "<redacted>\hrequests\venv\lib\site-packages\playwright\_impl\_transport.py", line 127, in connect
raise exc
File "<redacted>\hrequests\venv\lib\site-packages\playwright\_impl\_transport.py", line 116, in connect
self._proc = await asyncio.create_subprocess_exec(
File "C:\Users\<redacted>\AppData\Local\Programs\Python\Python310\lib\asyncio\subprocess.py", line 218, in create_subprocess_exec
transport, protocol = await loop.subprocess_exec(
File "C:\Users\<redacted>\AppData\Local\Programs\Python\Python310\lib\asyncio\base_events.py", line 1652, in subprocess_exec
transport = await self._make_subprocess_transport(
File "C:\Users\<redacted>\AppData\Local\Programs\Python\Python310\lib\asyncio\base_events.py", line 493, in _make_subprocess_transport
raise NotImplementedError
NotImplementedError
Hello,
every time I try to use SOCKS5 proxies I get a connection error:
>>> get(
... "https://ipv4.webshare.io/",
... proxies={
... "http": "socks5h://XXXX-rotate:[email protected]:80/",
... "https": "socks5h://XXXX-rotate:[email protected]:80/"
... }
... ).text
Traceback (most recent call last):
File "/home/bader/code/python3/lib/python3.11/site-packages/hrequests/response.py", line 72, in execute_request
resp = self.session.execute_request(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/bader/code/python3/lib/python3.11/site-packages/hrequests/client.py", line 389, in execute_request
raise ClientException(response_object['body'])
hrequests.exceptions.ClientException: failed to build client out of request input: scheme socks5h is not supported
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/bader/code/python3/lib/python3.11/site-packages/hrequests/reqs.py", line 209, in request
req.send()
File "/home/bader/code/python3/lib/python3.11/site-packages/hrequests/reqs.py", line 126, in send
raise e
File "/home/bader/code/python3/lib/python3.11/site-packages/hrequests/reqs.py", line 123, in send
self.response = self.session.request(self.method, self.url, **merged_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/bader/code/python3/lib/python3.11/site-packages/hrequests/session.py", line 181, in request
proc.send()
File "/home/bader/code/python3/lib/python3.11/site-packages/hrequests/response.py", line 65, in send
self.response = self.execute_request()
^^^^^^^^^^^^^^^^^^^^^^
File "/home/bader/code/python3/lib/python3.11/site-packages/hrequests/response.py", line 80, in execute_request
raise ClientException('Connection error') from e
hrequests.exceptions.ClientException: Connection error
What are the chances you can expose an async interface similar to httpx's async client (https://www.python-httpx.org/async/)?
Quick question: how do I use the HTML parser once the browser has loaded? Or even just evaluate with the render function and pass the evaluated HTML to the parser.
I'm trying to get the value of a rendered JS object.
Thank you for putting out a cohesive package for automation; definitely here early before it blows up.
Hi there, when running a simple hrequests.get command on Ubuntu I get the following error:
<bound method ? of <class 'hrequests.session.chrome'>> is not a supported chrome version: (103, 104, 105, 106, 107, 108, 109, 110, 111, 112)
This happens when installing either hrequests[all] or hrequests.
Was wondering if anyone else has run into this, or could help me debug?
Thanks!
Hi, I'm unable to import hrequests using the latest beta version: 0.8.0-beta b1af435
I get the following exception:
AttributeError Traceback (most recent call last)
/Users/libre/Documents/GitHub/project_env/playground.ipynb Cell 43 line 1
----> 1 import hrequests
File /opt/homebrew/Caskroom/miniforge/base/envs/project_env/lib/python3.11/site-packages/hrequests/__init__.py:1
----> 1 from .response import Response, ProcessResponse
2 from .session import Session, TLSSession, chrome, firefox
3 from .reqs import *
File /opt/homebrew/Caskroom/miniforge/base/envs/project_env/lib/python3.11/site-packages/hrequests/response.py:11
8 from orjson import dumps, loads
10 import hrequests
---> 11 from hrequests.cffi import PORT
12 from hrequests.exceptions import ClientException
14 from .cookies import RequestsCookieJar
File /opt/homebrew/Caskroom/miniforge/base/envs/project_env/lib/python3.11/site-packages/hrequests/cffi.py:112
109 del libman
111 # extract the exposed destroySession function
--> 112 library.DestroySession.argtypes = [GoString]
113 library.DestroySession.restype = ctypes.c_void_p
116 def destroySession(session_id: str):
File /opt/homebrew/Caskroom/miniforge/base/envs/project_env/lib/python3.11/ctypes/__init__.py:389, in CDLL.__getattr__(self, name)
...
--> 394 func = self._FuncPtr((name_or_ordinal, self))
395 if not isinstance(name_or_ordinal, int):
396 func.__name__ = name_or_ordinal
AttributeError: dlsym(0x8546e980, DestroySession): symbol not found
Python 3.10.5 (tags/v3.10.5:f377153, Jun 6 2022, 16:14:13) [MSC v.1929 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import hrequests
Downloading tls-client library from bogdanfinn/tls-client...
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python310\lib\site-packages\hrequests\__init__.py", line 2, in <module>
from .session import Session, TLSSession, chrome, firefox
File "C:\Python310\lib\site-packages\hrequests\session.py", line 11, in <module>
from .cffi import freeMemory
File "C:\Python310\lib\site-packages\hrequests\cffi.py", line 90, in <module>
libman = LibraryManager()
File "C:\Python310\lib\site-packages\hrequests\cffi.py", line 35, in __init__
filename = self.check_library()
File "C:\Python310\lib\site-packages\hrequests\cffi.py", line 54, in check_library
self.download_library()
File "C:\Python310\lib\site-packages\hrequests\cffi.py", line 63, in download_library
if self.file_cont in asset['name'] and asset['name'].endswith(self.file_ext):
TypeError: string indices must be integers
This also happens regardless of whether you install hrequests or hrequests[all].
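A hedged guess at the failure mode: the traceback dies in download_library while iterating release assets fetched from the GitHub API, and a rate-limited GitHub API reply is an error dict like {"message": ...} with no asset list, so iterating the wrong object yields strings and asset['name'] raises exactly this TypeError. A minimal reproduction of that shape (the reply contents here are illustrative, not real API output):

```python
# Normal release object: iterating its assets yields dicts with a 'name' key
release_ok = {"assets": [{"name": "example-asset.dll"}]}
# Rate-limited error reply: a dict with no 'assets' key at all
release_limited = {"message": "API rate limit exceeded"}

def asset_names(release: dict) -> list:
    # Defensive .get avoids indexing strings when the reply is an error dict
    return [a["name"] for a in release.get("assets", [])]

assert asset_names(release_ok) == ["example-asset.dll"]
assert asset_names(release_limited) == []
```

If that guess is right, waiting out the rate limit (or authenticating the API request) makes the import succeed again.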
I was working through the README; excellent package. In the past I have used playwright to headlessly render a page and take a screenshot. I attempted to replicate this in your library but the screenshot captured appears to lack the javascript rendering.
session = hrequests.Session(browser='chrome')
resp = session.get('https://www.bentley.edu/undergraduate')
page = resp.render(mock_human=True)
page.awaitNavigation()
page.screenshot('test.png', full_page=True)
It likely could be user error, but I am not sure what the best path would be to ensure that the page renders before grabbing a screenshot.
How could we accomplish something like this using hrequests?
import requests

proxies = {
    'http': 'http://proxy.example.com:8080',
    'https': 'http://proxy.example.com:8081',
}
response = requests.get('http://httpbin.org/ip', proxies=proxies, auth=('USERNAME', 'PASSWORD'))
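For comparison, requests-style libraries also accept proxy credentials embedded directly in the proxy URL (user:pass@host:port), which sidesteps a separate auth argument. A small sketch of building such a URL (the helper name proxy_url is made up; percent-quoting guards special characters in credentials):

```python
from urllib.parse import quote

def proxy_url(host: str, port: int, user: str = None, password: str = None,
              scheme: str = "http") -> str:
    # Embed credentials in the URL itself: scheme://user:pass@host:port
    if user and password:
        return f"{scheme}://{quote(user, safe='')}:{quote(password, safe='')}@{host}:{port}"
    return f"{scheme}://{host}:{port}"

url = proxy_url("proxy.example.com", 8080, "USERNAME", "PASSWORD")
assert url == "http://USERNAME:PASSWORD@proxy.example.com:8080"
assert proxy_url("proxy.example.com", 8081) == "http://proxy.example.com:8081"
```

The resulting string can then be passed wherever a plain proxy URL is accepted, e.g. proxies={'http': url, 'https': url}.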
You currently have the ability for browsers to use extensions such as adblockers. However, there are a few extensions that I want to use that require active input (pressing a couple of buttons). Could there be a way to interact with these extensions?
Hi daijro,
First, I want to thank you for this fantastic tool. It's been incredibly useful for my projects!
I've encountered a potential memory leak when rendering responses in a proxied session. Here's a simple script that replicates the issue:
import hrequests

proxy = "http://proxy-url:port"
url = "https://www.google.com/"
n = 100  # number of iterations; also tested with n = 400

for iteration in range(n):
    with hrequests.Session() as session:
        session.proxy = proxy
        response = session.get(url, verify=True)
        rendered_response = response.render(mock_human=True)
        print(iteration, rendered_response.cookies)
        rendered_response.close()
I'm using Bloomberg's Memray to profile the script, and I've observed a significant memory increase over multiple iterations.
For n = 100, the memory profile looks like this (profile screenshot omitted).
For n = 400, the memory profile shows a considerable increase (profile screenshot omitted).
Is this behavior expected, or could it indicate a memory leak? Any insights would be greatly appreciated.
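Not an answer to whether this is a leak, but the README's context-manager form of render() guarantees close() runs even when an iteration raises, which rules out one source of accumulation in loops like the one above. The cleanup pattern, sketched offline with a hypothetical FakePage stand-in:

```python
from contextlib import closing

class FakePage:
    """Stand-in for a rendered page, so the pattern can run offline."""
    def __init__(self):
        self.closed = False
    def close(self):
        self.closed = True

page = FakePage()
with closing(page):   # analogous to: with response.render(mock_human=True) as page:
    pass              # interact with the page here
assert page.closed    # close() ran even though we never called it directly
```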
Open files can be checked by the command below.
lsof -U +c 15 | cut -f1 -d' ' | sort | uniq -c | sort -rn | head -6
Thank you!
OS: Ubuntu 22.04.4 LTS x86_64
Python version: 3.8.10
Python module versions:
hrequests: 0.8.2
rich: 13.7.1
aioprocessing: 2.0.1
click: 8.1.7
faust-cchardet: 2.1.19
selectolax: 0.3.21
orjson: 3.10.5
parse: 1.20.2
urllib3: 1.26.5
httpx: 0.27.0
gevent: 24.2.1
numpy: 1.21.5
geventhttpclient: 2.3.1
async-class: 0.5.0
pycryptodome: 3.20.0
playwright: 1.34.0
playwright-stealth: 1.0.6
greenlet: 2.0.2
pyee: 9.0.4
typing-extensions: 4.11.0
zope.event: 5.0
zope.interface: 6.3
brotli: 1.0.9
certifi: 2024.2.2
sniffio: 1.3.1
httpcore: 1.0.5
anyio: 4.4.0
idna: 3.3
h11: 0.14.0
pygments: 2.18.0
markdown-it-py: 3.0.0
mdurl: 0.1.2
exceptiongroup: 1.2.1
setuptools: 59.6.0
I did not find any information on this topic in the docs. I have some sites that show a native alert box before loading the full page. I need to click the accept button in that native dialog for the page to load. Is that possible?
Cheers
Hey there! Awesome library! I am running into some issues. I hope the community here can help me troubleshoot them. I am attempting to run hrequests in Lambda to interact with specific web pages when a function URL is called.
I am using the AWS SDK to deploy a Docker container similar to the following to ECR -> Lambda:
FROM mcr.microsoft.com/playwright/python:v1.34.0-jammy
# Include global arg in this stage of the build
ARG FUNCTION_DIR
RUN mkdir -p ${FUNCTION_DIR}
COPY app.py ${FUNCTION_DIR}
WORKDIR /app
COPY ./mytool/pyproject.toml ./mytool/poetry.lock /app/
COPY ./mytool/. /app
# Install dependencies using poetry
RUN pip install --no-cache-dir poetry awslambdaric aws-xray-sdk sh \
&& poetry config virtualenvs.create false \
&& poetry install --no-interaction --no-ansi
RUN python -m playwright install-deps
RUN python -m playwright install
WORKDIR ${FUNCTION_DIR}
ENTRYPOINT [ "/usr/bin/python", "-m", "awslambdaric" ]
CMD [ "app.handler" ]
An app.py file similar to the following is then called using said function URL via awslambdaric:
import json
import logging
import os
import subprocess

from aws_xray_sdk.core import xray_recorder

logger = logging.getLogger(__name__)

def handler(event, context):
    logger.debug(msg=f"Initial event: {event}")
    headers = event["headers"]
    header_validation = validate_headers(headers)  # validate_headers is defined elsewhere in the tool
    input = headers["x-input"]
    try:
        command = headers["x-command"].split()
        command.extend(input.split())
    except Exception as e:
        logger.error(msg=f"Error parsing command: {e}")
        return {
            "statusCode": 500,
            "body": f"Error parsing command: {e}",
        }
    parsed = []
    try:
        logger.debug(msg=f"Running command: {command}")
        # Set HOME=/tmp to avoid writing to the container filesystem
        # Set LD_LIBRARY_PATH to include /usr/lib64 to avoid issues with the AWS X-Ray daemon
        os.environ["HOME"] = "/tmp"
        os.environ["LD_LIBRARY_PATH"] = "/usr/lib64"
        results = subprocess.run(command, capture_output=True, text=True, env=os.environ.copy())
        logger.debug(msg=f"Results stdout: {results.stdout}")
        logger.debug(msg=f"Results stderr: {results.stderr}")
        logger.debug(msg=f"Command exited with code: {results.returncode}")
    except subprocess.TimeoutExpired as e:
        logger.error(msg=f"Command timed out: {e}")
        return {
            "statusCode": 408,  # HTTP status code for Request Timeout
            "body": json.dumps({
                "stdout": str(e.stdout),
                "stderr": str(e.stderr),
                "e": str(e),
                "error": "Command timed out"
            }),
        }
    except Exception as e:
        logger.error(msg=f"Error executing command: {e}")
        return {
            "statusCode": 500,
            "body": f"Error executing command: {e}",
        }
    try:
        for line in results.stdout.splitlines():
            parsed_json = json.loads(line)
            logger.debug(msg=f"Output: {parsed_json}")
            parsed.append(parsed_json)
    except Exception as e:
        logger.error(msg=f"Error parsing output: {e}")
        return {
            "statusCode": 500,
            "body": f"Error parsing output: {e}",
        }
    xray_recorder.end_segment()
    return {"statusCode": 200, "body": json.dumps(parsed)}
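The core pattern in the handler above, running a command and parsing each stdout line as JSON, can be exercised in isolation. This is a minimal sketch; the child command below is illustrative and not part of the actual tool:

```python
import json
import subprocess
import sys

# Child process that emits one JSON object per line, like the tool the handler wraps.
results = subprocess.run(
    [sys.executable, "-c", "import json; print(json.dumps({'ok': True}))"],
    capture_output=True,
    text=True,
    timeout=10,
)

# Parse each stdout line as a separate JSON document, as the handler does.
parsed = [json.loads(line) for line in results.stdout.splitlines()]
print(parsed)  # [{'ok': True}]
```

Note that the handler catches subprocess.TimeoutExpired but never passes timeout= to subprocess.run, so that branch can never fire; passing an explicit timeout, as above, makes the except clause meaningful.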
This app.py code is calling a separate tool I have created that utilizes hrequests for navigation and interaction with web pages. When calling the app.py file with the function URL, however, the following error is returned from hrequests specifically:
Exception in thread Thread-1 (spawn_main):
Traceback (most recent call last):
File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/usr/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.10/dist-packages/hrequests/browser.py", line 128, in spawn_main
asyncio.new_event_loop().run_until_complete(self.main())
File "/usr/lib/python3.10/asyncio/base_events.py", line 646, in run_until_complete
return future.result()
File "/usr/local/lib/python3.10/dist-packages/hrequests/browser.py", line 135, in main
self.context = await self.client.new_context(
File "/usr/local/lib/python3.10/dist-packages/hrequests/playwright_mock/playwright_mock.py", line 38, in new_context
_browser = await context.new_context(
File "/usr/local/lib/python3.10/dist-packages/hrequests/playwright_mock/context.py", line 6, in new_context
context = await inst.main_browser.new_context(
File "/usr/local/lib/python3.10/dist-packages/playwright/async_api/_generated.py", line 14154, in new_context
await self._impl_obj.new_context(
File "/usr/local/lib/python3.10/dist-packages/playwright/_impl/_browser.py", line 127, in new_context
channel = await self._channel.send("newContext", params)
File "/usr/local/lib/python3.10/dist-packages/playwright/_impl/_connection.py", line 61, in send
return await self._connection.wrap_api_call(
File "/usr/local/lib/python3.10/dist-packages/playwright/_impl/_connection.py", line 482, in wrap_api_call
return await cb()
File "/usr/local/lib/python3.10/dist-packages/playwright/_impl/_connection.py", line 97, in inner_send
result = next(iter(done)).result()
playwright._impl._api_types.Error: Target page, context or browser has been closed
Some notes on what has already been attempted:
I am not experienced with playwright and headless browser usage, so any help would be greatly appreciated. I understand this is not directly related to hrequests, but I hope the community here is familiar enough with the frameworks to assist. Thanks!
Hi, thanks for sharing your knowledge.
I get the following error when running the code
python : 3.10
os : windows 11
H:\Python310\ib\site-packages\hrequests\bin\hrequests-cgo-2.1-windows-4.0-amd64.dll is either not designed to run on
Windows or it contains an error. Try installing the program again using the original installation media or contact your
system administrator or the software vendor for support. Error status 0xc0000020.
Hi, I have tried downloading images, but request.content is not raw bytes.
I've tried mangling r.content with .encode('utf-8') and decoding with unicode-escape, but to no avail. The response is always some strange mix of encoded and escaped string and bytes.
The Request class implementation is different from requests.Request() and does not support requests.Request.raw, and r.content is not bytes.
Is there any workaround? Cheers
Sample code:
url = "https://upload.wikimedia.org/wikipedia/commons/5/59/Shrine_of_Rememberance_%2811884180023%29.jpg"
filePath = os.path.join("tmp", "image.jpg")
while True:
    r = hrequests.request("GET", url)
    if r.status_code != 403:
        data = r.content.encode('utf-8')
        with open(filePath, "wb") as f:
            f.write(data)
        break
Botright v0.3 was released today, so you can update the playwright_mock.
The mouse movement in particular is noticeably improved.
In the requests library, if the wrong encoding is detected, it is possible to fix this manually by overriding the request.encoding attribute.
In hrequests, this is unfortunately not possible, as doing so raises the following exception:
request.encoding = 'euc_kr'
AttributeError: property 'encoding' of 'Response' object has no setter
I was wondering if there's any other way to achieve this manual override, as some websites I'm working with are being encoded incorrectly.
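As a stopgap until the setter exists (this is an assumption on my part, not a documented hrequests feature), the body bytes can be decoded manually with the correct codec instead of relying on .text. Here body stands in for the raw response content:

```python
# Simulate an EUC-KR response body; in practice this would be the raw
# bytes from the response rather than a hand-built string.
body = "안녕하세요".encode("euc_kr")

# Decode with the known-correct codec instead of the library's guess.
text = body.decode("euc_kr")
print(text)  # 안녕하세요
```

This sidesteps the encoding property entirely, at the cost of having to know the correct codec for each site up front.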
Good Morning
OS: Ubuntu 20.04
hrequests version: 0.8.2
When using hrequests.firefox.BrowserSession(), I receive the following error:
Exception in thread Thread-2 (spawn_main):
Traceback (most recent call last):
  File "/.asdf/installs/python/3.10.5/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/.asdf/installs/python/3.10.5/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/.asdf/installs/python/3.10.5/lib/python3.10/site-packages/hrequests/browser.py", line 128, in spawn_main
    asyncio.new_event_loop().run_until_complete(self.main())
  File "/.asdf/installs/python/3.10.5/lib/python3.10/asyncio/base_events.py", line 646, in run_until_complete
    return future.result()
  File "/.asdf/installs/python/3.10.5/lib/python3.10/site-packages/hrequests/browser.py", line 135, in main
    self.context = await self.client.new_context(
  File "/.asdf/installs/python/3.10.5/lib/python3.10/site-packages/hrequests/playwright_mock/playwright_mock.py", line 35, in new_context
    _proxy = await ProxyManager(self, proxy)
  File "/.asdf/installs/python/3.10.5/lib/python3.10/site-packages/async_class.py", line 173, in __await__
    yield from self.create_task(
  File "/.asdf/installs/python/3.10.5/lib/python3.10/site-packages/hrequests/playwright_mock/proxy_manager.py", line 25, in __ainit__
    self.timeout = httpx.Timeout(20.0, read=None)
TypeError: Timeout.__init__() got an unexpected keyword argument 'read'
It appears to be related to the httpx dependency rather than hrequests' own code.
I am reporting it for further analysis.
Have a great day.
Thanks for the library.
I faced this error on Windows:
File "C:\Users\Chetan\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\hrequests\response.py", line 127, in Response
    elapsed: timedelta | None = None
TypeError: unsupported operand type(s) for |: 'type' and 'NoneType'
The fix was to set elapsed = None. Please add it to the library to help people.
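For context on why this fails: the `X | None` annotation syntax (PEP 604) is evaluated when the class body executes, and it requires Python 3.10+, which is why it breaks on Python 3.9. A 3.9-compatible sketch of the fix uses typing.Optional (Response here is a stand-in class, not the real hrequests one):

```python
from datetime import timedelta
from typing import Optional

class Response:
    # `elapsed: timedelta | None = None` raises TypeError at class-creation
    # time on Python 3.9, because PEP 604 unions on real types need 3.10+.
    # typing.Optional is equivalent and works on older versions:
    elapsed: Optional[timedelta] = None

print(Response.elapsed)  # None
```

Another option is adding `from __future__ import annotations` at the top of the module, which makes all annotations lazy strings and so keeps the `timedelta | None` spelling working on 3.9.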
How to reproduce issue:
session = hrequests.Session(browser='chrome', version=112)
resp = session.get("https://httpbin.org/headers")
print(resp.json())
It shows Chrome/114.0.5731.1 in the headers instead of the expected Chrome/112.X.X.X.
Installing version 0.8.0 or above doesn't work on M1 (or presumably M2) Macs, due to the missing binaries.
/lib/python3.10/site-packages/hrequests/cffi.py", line 98, in download_library
raise IOError('Could not find a matching binary for your system.')
OSError: Could not find a matching binary for your system.
Temporary workaround is to install:
pip install -U "hrequests[all]<0.8.0"
Are there plans to include these binaries? Or perhaps some instructions on how to compile them myself?
allow_redirects is documented as defaulting to True; however, it appears to actually default to False.
Even when I enable it explicitly, I see strange behavior where redirects are not properly followed.
I stumbled on a case indicating that cookies are not properly set on the session.
Example url: https://somo.app
If you open the url in a browser, you can see that the first request returns a 307 redirect to '/' and sets a cookie; the subsequent request to the same url includes the cookie and returns 200.
Trying to open this url with hrequests instead fails with:
hrequests.exceptions.ClientException: failed to do request: Get "/": stopped after 10 redirects
import hrequests
url = "https://somo.app"
session = hrequests.Session()
resp = session.get(url, allow_redirects=True)
If I instead make the requests one at a time:
import hrequests
url = "https://somo.app"
session = hrequests.Session()
resp = session.get(url, allow_redirects=False)
print(resp.status_code)
print(resp.headers.get("location"))
resp = session.get(url, allow_redirects=False)
print(resp.status_code)
It shows that the subsequent request returns 307 again. It should not: the cookie should have been set, so the second request should return 200.
Getting and setting the cookie manually produces the expected behavior:
import hrequests
url = "https://somo.app"
session = hrequests.Session()
resp = session.get(url, allow_redirects=False)
print(resp.status_code)
print(resp.headers.get("location"))
headers = {"Cookie": ";".join([f"{c.name}={c.value}" for c in resp.cookies])}
resp = session.get(url, allow_redirects=False, headers=headers)
print(resp.status_code)
I'm sorry to write here, but could you contact me about something on Discord? @max_andolini
Hey, I am using async_get with hrequests.imap_enum(reqs) to scrape 100 URLs at a time. The readme says this should be blazing fast, but I'm not sure what that means: it's currently taking 3-5 minutes, and that does not include a render step.
Is that approximately the amount of time it should take? I was thinking it would be a lot faster, since it's just requests and not rendering. Here's the code I'm currently using. The rows variable is 100 DB records, specifically a SQLAlchemy RowMapping object.
reqs = [hrequests.async_get(r.url) for r in rows]
responses = []
for index, resp in hrequests.imap_enum(reqs):
    if resp:
        with open(f'{directory}/{index}.pickle', 'wb') as file:
            pickle.dump(resp, file)
        responses.append({"url": resp.url, "resp": resp})
    else:
        print(f'No response for {index}')
This issue serves as a todo item for migrating from PyQuery to selectolax, a faster, more modern and capable HTML parsing library.
I think it should be possible to take screenshots of specific elements instead of the entire page, and to use them in memory without saving to disk.
Like this:
image_element = page.find('#captchaframe')
image = image_element.screenshot()
OS: Windows 11.0
Python version: 3.11.4
When installing the hrequests package using pip and executing some related code, specific errors are observed.
Notably, even when using only the get function provided by hrequests, the bs4 library is required, and even after installing hrequests[all] the same errors persist. However, when bs4 is installed manually, the code works without problems. This is an inconvenient problem to have to solve every time I use this library in another project.