daijro / hrequests
🚀 Web scraping for humans
Home Page: https://daijro.gitbook.io/hrequests/
License: Apache License 2.0
page = hrequests.BrowserSession()
I have tried to use browser automation as given in the example, but it shows this error:
AttributeError: module 'hrequests' has no attribute 'BrowserSession'
I also tried this example:
session = hrequests.Session(browser='chrome')
resp = session.get('https://quotes.toscrape.com/page/1/')
with resp.render(mock_human=True) as page:
    print(page.text)
and it also throws an error: AttributeError: module 'hrequests' has no attribute 'browser'
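Missing attributes like this usually mean the installed hrequests is older than the one the README examples target. Not part of the original report, but a quick sanity check is to compare the installed version against the docs; this sketch uses the stdlib importlib.metadata (the helper name installed_version is made up for illustration):

```python
from importlib.metadata import version, PackageNotFoundError

def installed_version(pkg: str):
    """Return the installed version string of a package, or None if absent."""
    try:
        return version(pkg)
    except PackageNotFoundError:
        return None

# e.g. compare this against the version the README examples were written for
print(installed_version("hrequests"))
```

If the printed version predates the feature being used, `pip install -U hrequests` (or `hrequests[all]` for the browser extras) is the first thing to try.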
I tried to install with pip install -U hrequests[all] but got this problem.
I have installed these individual components via the Visual Studio Installer:
C++ Cmake tools for Windows
Testing tools core features
C++ Address Sanitizer
command: pip install -U hrequests[all]
Error:
Using cached playwright_stealth-1.0.6-py3-none-any.whl (28 kB)
Building wheels for collected packages: greenlet
Building wheel for greenlet (setup.py) ... error
error: subprocess-exited-with-error
× python setup.py bdist_wheel did not run successfully.
│ exit code: 1
╰─> [120 lines of output]
running bdist_wheel
running build
running build_py
creating build
creating build\lib.win-amd64-cpython-312
creating build\lib.win-amd64-cpython-312\greenlet
copying src\greenlet\__init__.py -> build\lib.win-amd64-cpython-312\greenlet
creating build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\platform\__init__.py -> build\lib.win-amd64-cpython-312\greenlet\platform
creating build\lib.win-amd64-cpython-312\greenlet\tests
copying src\greenlet\tests\leakcheck.py -> build\lib.win-amd64-cpython-312\greenlet\tests
copying src\greenlet\tests\test_contextvars.py -> build\lib.win-amd64-cpython-312\greenlet\tests
copying src\greenlet\tests\test_cpp.py -> build\lib.win-amd64-cpython-312\greenlet\tests
copying src\greenlet\tests\test_extension_interface.py -> build\lib.win-amd64-cpython-312\greenlet\tests
copying src\greenlet\tests\test_gc.py -> build\lib.win-amd64-cpython-312\greenlet\tests
copying src\greenlet\tests\test_generator.py -> build\lib.win-amd64-cpython-312\greenlet\tests
copying src\greenlet\tests\test_generator_nested.py -> build\lib.win-amd64-cpython-312\greenlet\tests
copying src\greenlet\tests\test_greenlet.py -> build\lib.win-amd64-cpython-312\greenlet\tests
copying src\greenlet\tests\test_greenlet_trash.py -> build\lib.win-amd64-cpython-312\greenlet\tests
copying src\greenlet\tests\test_leaks.py -> build\lib.win-amd64-cpython-312\greenlet\tests
copying src\greenlet\tests\test_stack_saved.py -> build\lib.win-amd64-cpython-312\greenlet\tests
copying src\greenlet\tests\test_throw.py -> build\lib.win-amd64-cpython-312\greenlet\tests
copying src\greenlet\tests\test_tracing.py -> build\lib.win-amd64-cpython-312\greenlet\tests
copying src\greenlet\tests\test_version.py -> build\lib.win-amd64-cpython-312\greenlet\tests
copying src\greenlet\tests\test_weakref.py -> build\lib.win-amd64-cpython-312\greenlet\tests
copying src\greenlet\tests\__init__.py -> build\lib.win-amd64-cpython-312\greenlet\tests
running egg_info
writing src\greenlet.egg-info\PKG-INFO
writing dependency_links to src\greenlet.egg-info\dependency_links.txt
writing requirements to src\greenlet.egg-info\requires.txt
writing top-level names to src\greenlet.egg-info\top_level.txt
reading manifest file 'src\greenlet.egg-info\SOURCES.txt'
reading manifest template 'MANIFEST.in'
warning: no previously-included files found matching 'benchmarks*.json'
no previously-included directories found matching 'docs/_build'
warning: no files found matching '*.py' under directory 'appveyor'
warning: no previously-included files matching '*.pyc' found anywhere in distribution
warning: no previously-included files matching '*.pyd' found anywhere in distribution
warning: no previously-included files matching '*.so' found anywhere in distribution
warning: no previously-included files matching '.coverage' found anywhere in distribution
adding license file 'LICENSE'
adding license file 'LICENSE.PSF'
adding license file 'AUTHORS'
writing manifest file 'src\greenlet.egg-info\SOURCES.txt'
copying src\greenlet\greenlet.cpp -> build\lib.win-amd64-cpython-312\greenlet
copying src\greenlet\greenlet.h -> build\lib.win-amd64-cpython-312\greenlet
copying src\greenlet\greenlet_allocator.hpp -> build\lib.win-amd64-cpython-312\greenlet
copying src\greenlet\greenlet_compiler_compat.hpp -> build\lib.win-amd64-cpython-312\greenlet
copying src\greenlet\greenlet_cpython_compat.hpp -> build\lib.win-amd64-cpython-312\greenlet
copying src\greenlet\greenlet_exceptions.hpp -> build\lib.win-amd64-cpython-312\greenlet
copying src\greenlet\greenlet_greenlet.hpp -> build\lib.win-amd64-cpython-312\greenlet
copying src\greenlet\greenlet_internal.hpp -> build\lib.win-amd64-cpython-312\greenlet
copying src\greenlet\greenlet_refs.hpp -> build\lib.win-amd64-cpython-312\greenlet
copying src\greenlet\greenlet_slp_switch.hpp -> build\lib.win-amd64-cpython-312\greenlet
copying src\greenlet\greenlet_thread_state.hpp -> build\lib.win-amd64-cpython-312\greenlet
copying src\greenlet\greenlet_thread_state_dict_cleanup.hpp -> build\lib.win-amd64-cpython-312\greenlet
copying src\greenlet\greenlet_thread_support.hpp -> build\lib.win-amd64-cpython-312\greenlet
copying src\greenlet\slp_platformselect.h -> build\lib.win-amd64-cpython-312\greenlet
copying src\greenlet\platform\setup_switch_x64_masm.cmd -> build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\platform\switch_aarch64_gcc.h -> build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\platform\switch_alpha_unix.h -> build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\platform\switch_amd64_unix.h -> build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\platform\switch_arm32_gcc.h -> build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\platform\switch_arm32_ios.h -> build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\platform\switch_arm64_masm.asm -> build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\platform\switch_arm64_masm.obj -> build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\platform\switch_arm64_msvc.h -> build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\platform\switch_csky_gcc.h -> build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\platform\switch_m68k_gcc.h -> build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\platform\switch_mips_unix.h -> build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\platform\switch_ppc64_aix.h -> build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\platform\switch_ppc64_linux.h -> build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\platform\switch_ppc_aix.h -> build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\platform\switch_ppc_linux.h -> build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\platform\switch_ppc_macosx.h -> build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\platform\switch_ppc_unix.h -> build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\platform\switch_riscv_unix.h -> build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\platform\switch_s390_unix.h -> build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\platform\switch_sparc_sun_gcc.h -> build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\platform\switch_x32_unix.h -> build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\platform\switch_x64_masm.asm -> build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\platform\switch_x64_masm.obj -> build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\platform\switch_x64_msvc.h -> build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\platform\switch_x86_msvc.h -> build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\platform\switch_x86_unix.h -> build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\tests\_test_extension.c -> build\lib.win-amd64-cpython-312\greenlet\tests
copying src\greenlet\tests\_test_extension_cpp.cpp -> build\lib.win-amd64-cpython-312\greenlet\tests
running build_ext
building 'greenlet._greenlet' extension
creating build\temp.win-amd64-cpython-312
creating build\temp.win-amd64-cpython-312\Release
creating build\temp.win-amd64-cpython-312\Release\src
creating build\temp.win-amd64-cpython-312\Release\src\greenlet
"C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.39.33519\bin\HostX86\x64\cl.exe" /c /nologo /O2 /W3 /GL /DNDEBUG /MD -DWIN32=1 -IC:\Users\HOSEEN\AppData\Local\Programs\Python\Python312\include -IC:\Users\HOSEEN\AppData\Local\Programs\Python\Python312\Include "-IC:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.39.33519\include" "-IC:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Auxiliary\VS\include" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22621.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22621.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22621.0\shared" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22621.0\winrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22621.0\cppwinrt" /EHsc /Tpsrc/greenlet/greenlet.cpp /Fobuild\temp.win-amd64-cpython-312\Release\src/greenlet/greenlet.obj /EHsr /GT
greenlet.cpp
C:\Users\HOSEEN\AppData\Local\Temp\pip-install-xc72bsav\greenlet_678fbe0f7a954225994ef76c8021962b\src\greenlet\greenlet_greenlet.hpp(831): error C2039: 'use_tracing': is not a member of '_PyCFrame'
C:\Users\HOSEEN\AppData\Local\Programs\Python\Python312\include\cpython/pystate.h(67): note: see declaration of '_PyCFrame'
C:\Users\HOSEEN\AppData\Local\Temp\pip-install-xc72bsav\greenlet_678fbe0f7a954225994ef76c8021962b\src\greenlet\greenlet_greenlet.hpp(834): error C2039: 'recursion_limit': is not a member of '_ts'
C:\Users\HOSEEN\AppData\Local\Programs\Python\Python312\include\cpython/pystate.h(115): note: see declaration of '_ts'
C:\Users\HOSEEN\AppData\Local\Temp\pip-install-xc72bsav\greenlet_678fbe0f7a954225994ef76c8021962b\src\greenlet\greenlet_greenlet.hpp(834): error C2039: 'recursion_remaining': is not a member of '_ts'
C:\Users\HOSEEN\AppData\Local\Programs\Python\Python312\include\cpython/pystate.h(115): note: see declaration of '_ts'
C:\Users\HOSEEN\AppData\Local\Temp\pip-install-xc72bsav\greenlet_678fbe0f7a954225994ef76c8021962b\src\greenlet\greenlet_greenlet.hpp(848): error C2039: 'trash_delete_nesting': is not a member of '_ts'
C:\Users\HOSEEN\AppData\Local\Programs\Python\Python312\include\cpython/pystate.h(115): note: see declaration of '_ts'
C:\Users\HOSEEN\AppData\Local\Temp\pip-install-xc72bsav\greenlet_678fbe0f7a954225994ef76c8021962b\src\greenlet\greenlet_greenlet.hpp(867): error C2039: 'use_tracing': is not a member of '_PyCFrame'
C:\Users\HOSEEN\AppData\Local\Programs\Python\Python312\include\cpython/pystate.h(67): note: see declaration of '_PyCFrame'
C:\Users\HOSEEN\AppData\Local\Temp\pip-install-xc72bsav\greenlet_678fbe0f7a954225994ef76c8021962b\src\greenlet\greenlet_greenlet.hpp(870): error C2039: 'recursion_remaining': is not a member of '_ts'
C:\Users\HOSEEN\AppData\Local\Programs\Python\Python312\include\cpython/pystate.h(115): note: see declaration of '_ts'
C:\Users\HOSEEN\AppData\Local\Temp\pip-install-xc72bsav\greenlet_678fbe0f7a954225994ef76c8021962b\src\greenlet\greenlet_greenlet.hpp(870): error C2039: 'recursion_limit': is not a member of '_ts'
C:\Users\HOSEEN\AppData\Local\Programs\Python\Python312\include\cpython/pystate.h(115): note: see declaration of '_ts'
C:\Users\HOSEEN\AppData\Local\Temp\pip-install-xc72bsav\greenlet_678fbe0f7a954225994ef76c8021962b\src\greenlet\greenlet_greenlet.hpp(881): error C2039: 'trash_delete_nesting': is not a member of '_ts'
C:\Users\HOSEEN\AppData\Local\Programs\Python\Python312\include\cpython/pystate.h(115): note: see declaration of '_ts'
C:\Users\HOSEEN\AppData\Local\Temp\pip-install-xc72bsav\greenlet_678fbe0f7a954225994ef76c8021962b\src\greenlet\greenlet_greenlet.hpp(891): error C2039: 'use_tracing': is not a member of '_PyCFrame'
C:\Users\HOSEEN\AppData\Local\Programs\Python\Python312\include\cpython/pystate.h(67): note: see declaration of '_PyCFrame'
C:\Users\HOSEEN\AppData\Local\Temp\pip-install-xc72bsav\greenlet_678fbe0f7a954225994ef76c8021962b\src\greenlet\greenlet_greenlet.hpp(899): error C2039: 'recursion_limit': is not a member of '_ts'
C:\Users\HOSEEN\AppData\Local\Programs\Python\Python312\include\cpython/pystate.h(115): note: see declaration of '_ts'
C:\Users\HOSEEN\AppData\Local\Temp\pip-install-xc72bsav\greenlet_678fbe0f7a954225994ef76c8021962b\src\greenlet\greenlet_greenlet.hpp(899): error C2039: 'recursion_remaining': is not a member of '_ts'
C:\Users\HOSEEN\AppData\Local\Programs\Python\Python312\include\cpython/pystate.h(115): note: see declaration of '_ts'
src/greenlet/greenlet.cpp(3095): error C2039: 'trash_delete_nesting': is not a member of '_ts'
C:\Users\HOSEEN\AppData\Local\Programs\Python\Python312\include\cpython/pystate.h(115): note: see declaration of '_ts'
error: command 'C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.39.33519\bin\HostX86\x64\cl.exe' failed with exit code 2
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for greenlet
Running setup.py clean for greenlet
Failed to build greenlet
ERROR: Could not build wheels for greenlet, which is required to install pyproject.toml-based projects
python version: 3.12.2
OS: Windows
How can I fix this error?
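For context on the compile errors above: the `_PyCFrame.use_tracing` and thread-state fields the compiler flags were removed in CPython 3.12, and greenlet only gained 3.12 support in its 3.x line, so upgrading greenlet first (e.g. `pip install "greenlet>=3.0"`) typically avoids the source build entirely. A small sketch of the version logic involved (the helper name is made up for illustration):

```python
def needs_newer_greenlet(py_version: tuple, greenlet_major: int) -> bool:
    # CPython 3.12 removed _PyCFrame.use_tracing and several thread-state
    # fields (recursion_limit, trash_delete_nesting, ...) that greenlet 2.x
    # accesses directly; greenlet 3.x tracks the new internals, so only the
    # 3.12+/greenlet<3 combination fails to compile.
    return py_version >= (3, 12) and greenlet_major < 3

assert needs_newer_greenlet((3, 12, 2), 2)      # the reporter's combination
assert not needs_newer_greenlet((3, 11, 0), 2)  # greenlet 2.x builds on 3.11
assert not needs_newer_greenlet((3, 12, 2), 3)  # greenlet 3.x builds on 3.12
```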
I'm building a container using the hrequests library; however, when I try to use the get function, it doesn't work and stops. Thanks for your help!
My Python crashes when trying a simple request
import hrequests
resp = hrequests.get('https://www.google.com')
I am using a virtual env with Python 3.9.6 on a MacBook with an M2 chip. In the PyCharm interpreter or terminal, I get "Process finished with exit code 137 (interrupted by signal 9: SIGKILL)" (PyCharm), and an Apple crash report window appears with the message "python quit unexpectedly". This happens during import of the package.
Hello,
this webpage is well known for testing scraping bots against anti-bot detection:
https://arh.antoinevastel.com/bots/areyouheadless
I tried hrequests against it, and it's detected as headless when the browser is set to chrome :(
regards
Hi, nice work on this library. I'm trying to parse a bunch of pages with it. But I'm running into issues where fetching content that doesn't exist throws an attribute error. Here's an example:
resp = hrequests.get("some_url")
data = {}
try:
    data['url'] = resp.url
    data["canonical"] = resp.html.find("link[@rel='canonical']").url
    data["title"] = resp.html.find("title").text
    data["meta_description"] = resp.html.find("meta[name='description']").text
except AttributeError:
    pass
Because I'm calling .text and .url on these elements, if any elements don't exist in the HTML response, the code throws an AttributeError: 'NoneType' object has no attribute 'text', and the data object will only have content prior to the error, missing any other valid elements. So for example, if there is no <title> element but the other 3 elements do exist, the data dict will only contain the url and canonical values; it won't have the meta_description.
The attribute error makes sense, but when scraping content at scale, there are going to be errors, edge cases, and missing content. I don't see a way to handle this gracefully. I'm fine with an empty string if the value is missing, or a None value. Is there a better way to handle this? I can remove the .url and .text properties, but I'd still have to handle it downstream with a bunch of if/else statements, and I'd prefer to just parse out the content early in the pipeline.
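One pattern that addresses this without any library changes is a small wrapper that returns a default whenever find() comes back None. This is a sketch, not part of hrequests; safe_attr is a hypothetical helper name:

```python
def safe_attr(element, attr: str, default: str = ""):
    # Return element.<attr> when the element was found, else the default,
    # so a missing <title> or meta tag yields "" instead of AttributeError.
    return getattr(element, attr, default) if element is not None else default

class FoundTag:
    """Stand-in for an element returned by resp.html.find(), for offline demo."""
    text = "Example Title"

assert safe_attr(None, "text") == ""             # missing element -> default
assert safe_attr(FoundTag(), "text") == "Example Title"
```

Applied to the snippet above, each line becomes e.g. `data["title"] = safe_attr(resp.html.find("title"), "text")`, so one missing element no longer truncates the rest of the dict.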
Using hrequests, after creating an exe from my Python script, I get this error at exe startup:
FileNotFoundError: [Errno 2] No such file or directory: 'C:\Users\frige\AppData\Local\Temp\_MEI441402\hrequests\bin\CR_VERSIONS.json'.
The script itself works fine.
import hrequests
session = hrequests.Session('chrome', version=103)
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:120.0) Gecko/20100101 Firefox/120.0',
    'Accept': '*/*',
    'Accept-Language': 'it-IT,it;q=0.8,en-US;q=0.5,en;q=0.3',
    'Referer': 'https://sell.wethenew.com/login',
    'content-type': 'application/json',
    'Alt-Used': 'sell.wethenew.com',
    'Connection': 'keep-alive',
    'Sec-Fetch-Dest': 'empty',
    'Sec-Fetch-Mode': 'cors',
    'Sec-Fetch-Site': 'same-origin',
    'If-None-Match': 'W/17gujz3lxj828',
}
csrf = eval(session.get('https://sell.wethenew.com/api/auth/csrf', headers=headers, proxies=proxy).text)['csrfToken']
That is the command that auto-py-to-exe run:
pyinstaller --noconfirm --onefile --console --hidden-import "discord_webhook" "D:/Dev/Main.py"
Can anybody help me?
I tried to put hrequests in the hidden-import list while using auto-py-to-exe, but nothing changed.
Great package! Are you planning to integrate this functionality as middleware compatible with Scrapy?
I have observed that when I try to render the content, the proxy is not used, because proxy is None in the Response class. The proxy only works without rendering.
Hello,
Is it possible to interact with elements of an iframe? If so, can you give an example?
Thanks for your help
Is there any way to create a random fingerprint hash for each request?
I am testing this library with browser automation on some websites, and I have observed that for many of them, lazy content (images, JS scripts that might load the page) does not fully load. I was wondering what might cause this issue.
Proxy formats: auth (user:pass@ip:port) and ip:port
On Windows 10, Python 3.10.1:
import hrequests
page = hrequests.BrowserSession()
Results in the following exception:
Task exception was never retrieved
future: <Task finished name='Task-7' coro=<Connection.run() done, defined at <redacted>\hrequests\venv\lib\site-packages\playwright\_impl\_connection.py:264> exception=NotImplementedError()>
Traceback (most recent call last):
File "<redacted>\hrequests\venv\lib\site-packages\playwright\_impl\_connection.py", line 271, in run
await self._transport.connect()
File "<redacted>\hrequests\venv\lib\site-packages\playwright\_impl\_transport.py", line 127, in connect
raise exc
File "<redacted>\hrequests\venv\lib\site-packages\playwright\_impl\_transport.py", line 116, in connect
self._proc = await asyncio.create_subprocess_exec(
File "C:\Users\<redacted>\AppData\Local\Programs\Python\Python310\lib\asyncio\subprocess.py", line 218, in create_subprocess_exec
transport, protocol = await loop.subprocess_exec(
File "C:\Users\<redacted>\AppData\Local\Programs\Python\Python310\lib\asyncio\base_events.py", line 1652, in subprocess_exec
transport = await self._make_subprocess_transport(
File "C:\Users\<redacted>\AppData\Local\Programs\Python\Python310\lib\asyncio\base_events.py", line 493, in _make_subprocess_transport
raise NotImplementedError
NotImplementedError
Exception in thread Thread-5 (spawn_main):
Traceback (most recent call last):
File "C:\Users\<redacted>\AppData\Local\Programs\Python\Python310\lib\threading.py", line 1009, in _bootstrap_inner
self.run()
File "C:\Users\<redacted>\AppData\Local\Programs\Python\Python310\lib\threading.py", line 946, in run
self._target(*self._args, **self._kwargs)
File "<redacted>\hrequests\venv\lib\site-packages\hrequests\browser.py", line 115, in spawn_main
asyncio.new_event_loop().run_until_complete(self.main())
File "C:\Users\<redacted>\AppData\Local\Programs\Python\Python310\lib\asyncio\base_events.py", line 641, in run_until_complete
return future.result()
File "<redacted>\hrequests\venv\lib\site-packages\hrequests\browser.py", line 119, in main
self.client = await hrequests.PlaywrightMock(
File "<redacted>\hrequests\venv\lib\site-packages\async_class.py", line 173, in __await__
yield from self.create_task(
File "<redacted>\hrequests\venv\lib\site-packages\hrequests\playwright_mock\playwright_mock.py", line 19, in __ainit__
self.playwright = await async_playwright().start()
File "<redacted>\hrequests\venv\lib\site-packages\playwright\async_api\_context_manager.py", line 52, in start
return await self.__aenter__()
File "<redacted>\hrequests\venv\lib\site-packages\playwright\async_api\_context_manager.py", line 47, in __aenter__
playwright = AsyncPlaywright(next(iter(done)).result())
File "<redacted>\hrequests\venv\lib\site-packages\playwright\_impl\_connection.py", line 271, in run
await self._transport.connect()
File "<redacted>\hrequests\venv\lib\site-packages\playwright\_impl\_transport.py", line 127, in connect
raise exc
File "<redacted>\hrequests\venv\lib\site-packages\playwright\_impl\_transport.py", line 116, in connect
self._proc = await asyncio.create_subprocess_exec(
File "C:\Users\<redacted>\AppData\Local\Programs\Python\Python310\lib\asyncio\subprocess.py", line 218, in create_subprocess_exec
transport, protocol = await loop.subprocess_exec(
File "C:\Users\<redacted>\AppData\Local\Programs\Python\Python310\lib\asyncio\base_events.py", line 1652, in subprocess_exec
transport = await self._make_subprocess_transport(
File "C:\Users\<redacted>\AppData\Local\Programs\Python\Python310\lib\asyncio\base_events.py", line 493, in _make_subprocess_transport
raise NotImplementedError
NotImplementedError
Hello,
every time I try to use SOCKS5 proxies I get a connection error:
>>> get(
... "https://ipv4.webshare.io/",
... proxies={
... "http": "socks5h://XXXX-rotate:[email protected]:80/",
... "https": "socks5h://XXXX-rotate:[email protected]:80/"
... }
... ).text
Traceback (most recent call last):
File "/home/bader/code/python3/lib/python3.11/site-packages/hrequests/response.py", line 72, in execute_request
resp = self.session.execute_request(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/bader/code/python3/lib/python3.11/site-packages/hrequests/client.py", line 389, in execute_request
raise ClientException(response_object['body'])
hrequests.exceptions.ClientException: failed to build client out of request input: scheme socks5h is not supported
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/bader/code/python3/lib/python3.11/site-packages/hrequests/reqs.py", line 209, in request
req.send()
File "/home/bader/code/python3/lib/python3.11/site-packages/hrequests/reqs.py", line 126, in send
raise e
File "/home/bader/code/python3/lib/python3.11/site-packages/hrequests/reqs.py", line 123, in send
self.response = self.session.request(self.method, self.url, **merged_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/bader/code/python3/lib/python3.11/site-packages/hrequests/session.py", line 181, in request
proc.send()
File "/home/bader/code/python3/lib/python3.11/site-packages/hrequests/response.py", line 65, in send
self.response = self.execute_request()
^^^^^^^^^^^^^^^^^^^^^^
File "/home/bader/code/python3/lib/python3.11/site-packages/hrequests/response.py", line 80, in execute_request
raise ClientException('Connection error') from e
hrequests.exceptions.ClientException: Connection error
What are the chances you can expose an async interface similar to httpx's async client (https://www.python-httpx.org/async/)?
Quick question: how do I use the HTML parser once the browser has loaded? Or even just evaluate with the render function and pass the evaluated HTML to the parser.
I'm trying to get the value of a rendered JS object.
Thank you for putting out a cohesive package for automation; definitely here early before it blows up.
Hi there, when running a simple hrequests.get command on Ubuntu I get the following error:
<bound method ? of <class 'hrequests.session.chrome'>> is not a supported chrome version: (103, 104, 105, 106, 107, 108, 109, 110, 111, 112)
This happens when installing either hrequests[all] or hrequests.
Was wondering if anyone else has run into this, or could help me debug?
Thanks!
Hi, I'm unable to import hrequests using the latest beta version: 0.8.0-beta b1af435
I get the following exception:
AttributeError Traceback (most recent call last)
/Users/libre/Documents/GitHub/project_env/playground.ipynb Cell 43 line 1
----> 1 import hrequests
File /opt/homebrew/Caskroom/miniforge/base/envs/project_env/lib/python3.11/site-packages/hrequests/__init__.py:1
----> 1 from .response import Response, ProcessResponse
2 from .session import Session, TLSSession, chrome, firefox
3 from .reqs import *
File /opt/homebrew/Caskroom/miniforge/base/envs/project_env/lib/python3.11/site-packages/hrequests/response.py:11
8 from orjson import dumps, loads
10 import hrequests
---> 11 from hrequests.cffi import PORT
12 from hrequests.exceptions import ClientException
14 from .cookies import RequestsCookieJar
File /opt/homebrew/Caskroom/miniforge/base/envs/project_env/lib/python3.11/site-packages/hrequests/cffi.py:112
109 del libman
111 # extract the exposed destroySession function
--> 112 library.DestroySession.argtypes = [GoString]
113 library.DestroySession.restype = ctypes.c_void_p
116 def destroySession(session_id: str):
File /opt/homebrew/Caskroom/miniforge/base/envs/project_env/lib/python3.11/ctypes/__init__.py:389, in CDLL.__getattr__(self, name)
...
--> 394 func = self._FuncPtr((name_or_ordinal, self))
395 if not isinstance(name_or_ordinal, int):
396 func.__name__ = name_or_ordinal
AttributeError: dlsym(0x8546e980, DestroySession): symbol not found
Python 3.10.5 (tags/v3.10.5:f377153, Jun 6 2022, 16:14:13) [MSC v.1929 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import hrequests
Downloading tls-client library from bogdanfinn/tls-client...
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python310\lib\site-packages\hrequests\__init__.py", line 2, in <module>
from .session import Session, TLSSession, chrome, firefox
File "C:\Python310\lib\site-packages\hrequests\session.py", line 11, in <module>
from .cffi import freeMemory
File "C:\Python310\lib\site-packages\hrequests\cffi.py", line 90, in <module>
libman = LibraryManager()
File "C:\Python310\lib\site-packages\hrequests\cffi.py", line 35, in __init__
filename = self.check_library()
File "C:\Python310\lib\site-packages\hrequests\cffi.py", line 54, in check_library
self.download_library()
File "C:\Python310\lib\site-packages\hrequests\cffi.py", line 63, in download_library
if self.file_cont in asset['name'] and asset['name'].endswith(self.file_ext):
TypeError: string indices must be integers
This also happens regardless of whether you install hrequests or hrequests[all].
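A hedged guess at the failure mode: the traceback dies in download_library while iterating release assets fetched from the GitHub API, and a rate-limited GitHub API reply is an error dict like {"message": ...} with no asset list, so iterating the wrong object yields strings and asset['name'] raises exactly this TypeError. A minimal reproduction of that shape (the reply contents here are illustrative, not real API output):

```python
# Normal release object: iterating its assets yields dicts with a 'name' key
release_ok = {"assets": [{"name": "example-asset.dll"}]}
# Rate-limited error reply: a dict with no 'assets' key at all
release_limited = {"message": "API rate limit exceeded"}

def asset_names(release: dict) -> list:
    # Defensive .get avoids indexing strings when the reply is an error dict
    return [a["name"] for a in release.get("assets", [])]

assert asset_names(release_ok) == ["example-asset.dll"]
assert asset_names(release_limited) == []
```

If that guess is right, waiting out the rate limit (or authenticating the API request) makes the import succeed again.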
I was working through the README; excellent package. In the past I have used playwright to headlessly render a page and take a screenshot. I attempted to replicate this in your library but the screenshot captured appears to lack the javascript rendering.
session = hrequests.Session(browser='chrome')
resp = session.get('https://www.bentley.edu/undergraduate')
page = resp.render(mock_human=True)
page.awaitNavigation()
page.screenshot('test.png', full_page=True)
It likely could be user error, but I am not sure what the best path would be to ensure that the page renders before grabbing a screenshot.
How could we accomplish something like this using hrequests?
import requests

proxies = {
    'http': 'http://proxy.example.com:8080',
    'https': 'http://proxy.example.com:8081',
}
response = requests.get('http://httpbin.org/ip', proxies=proxies, auth=('USERNAME', 'PASSWORD'))
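For comparison, requests-style libraries also accept proxy credentials embedded directly in the proxy URL (user:pass@host:port), which sidesteps a separate auth argument. A small sketch of building such a URL (the helper name proxy_url is made up; percent-quoting guards special characters in credentials):

```python
from urllib.parse import quote

def proxy_url(host: str, port: int, user: str = None, password: str = None,
              scheme: str = "http") -> str:
    # Embed credentials in the URL itself: scheme://user:pass@host:port
    if user and password:
        return f"{scheme}://{quote(user, safe='')}:{quote(password, safe='')}@{host}:{port}"
    return f"{scheme}://{host}:{port}"

url = proxy_url("proxy.example.com", 8080, "USERNAME", "PASSWORD")
assert url == "http://USERNAME:PASSWORD@proxy.example.com:8080"
assert proxy_url("proxy.example.com", 8081) == "http://proxy.example.com:8081"
```

The resulting string can then be passed wherever a plain proxy URL is accepted, e.g. proxies={'http': url, 'https': url}.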
You currently have the ability for browsers to use extensions such as adblockers. However, there are a few extensions that I want to use that require active input (pressing a couple of buttons). Could there be a way to interact with these extensions?
Hi daijro,
First, I want to thank you for this fantastic tool. It's been incredibly useful for my projects!
I've encountered a potential memory leak when rendering responses in a proxied session. Here's a simple script that replicates the issue:
import hrequests

proxy = "http://proxy-url:port"
url = "https://www.google.com/"
n = 100  # number of iterations; also tested with n = 400

for iteration in range(n):
    with hrequests.Session() as session:
        session.proxy = proxy
        response = session.get(url, verify=True)
        rendered_response = response.render(mock_human=True)
        print(iteration, rendered_response.cookies)
        rendered_response.close()
I'm using Bloomberg's Memray to profile the script, and I've observed a significant memory increase over multiple iterations.
For n = 100, the memory profile looks like this (profile screenshot omitted).
For n = 400, the memory profile shows a considerable increase (profile screenshot omitted).
Is this behavior expected, or could it indicate a memory leak? Any insights would be greatly appreciated.
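Not an answer to whether this is a leak, but the README's context-manager form of render() guarantees close() runs even when an iteration raises, which rules out one source of accumulation in loops like the one above. The cleanup pattern, sketched offline with a hypothetical FakePage stand-in:

```python
from contextlib import closing

class FakePage:
    """Stand-in for a rendered page, so the pattern can run offline."""
    def __init__(self):
        self.closed = False
    def close(self):
        self.closed = True

page = FakePage()
with closing(page):   # analogous to: with response.render(mock_human=True) as page:
    pass              # interact with the page here
assert page.closed    # close() ran even though we never called it directly
```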
Open files can be checked by the command below.
lsof -U +c 15 | cut -f1 -d' ' | sort | uniq -c | sort -rn | head -6
Thank you!
OS: Ubuntu 22.04.4 LTS x86_64
Python version: 3.8.10
Python module versions:
hrequests: 0.8.2
rich: 13.7.1
aioprocessing: 2.0.1
click: 8.1.7
faust-cchardet: 2.1.19
selectolax: 0.3.21
orjson: 3.10.5
parse: 1.20.2
urllib3: 1.26.5
httpx: 0.27.0
gevent: 24.2.1
numpy: 1.21.5
geventhttpclient: 2.3.1
async-class: 0.5.0
pycryptodome: 3.20.0
playwright: 1.34.0
playwright-stealth: 1.0.6
greenlet: 2.0.2
pyee: 9.0.4
typing-extensions: 4.11.0
zope.event: 5.0
zope.interface: 6.3
brotli: 1.0.9
certifi: 2024.2.2
sniffio: 1.3.1
httpcore: 1.0.5
anyio: 4.4.0
idna: 3.3
h11: 0.14.0
pygments: 2.18.0
markdown-it-py: 3.0.0
mdurl: 0.1.2
exceptiongroup: 1.2.1
setuptools: 59.6.0
I did not find any information on this topic in the docs. I have some sites that show a native alert box before loading the full page. I need to click the accept button in that native dialog for the page to load. Is that possible?
Cheers
Hey there! Awesome library! I am running into some issues. I hope the community here can help me troubleshoot them. I am attempting to run hrequests in Lambda to interact with specific web pages when a function URL is called.
I am using the AWS SDK to deploy a Docker container similar to the following to ECR -> Lambda:
FROM mcr.microsoft.com/playwright/python:v1.34.0-jammy
# Include global arg in this stage of the build
ARG FUNCTION_DIR
RUN mkdir -p ${FUNCTION_DIR}
COPY app.py ${FUNCTION_DIR}
WORKDIR /app
COPY ./mytool/pyproject.toml ./mytool/poetry.lock /app/
COPY ./mytool/. /app
# Install dependencies using poetry
RUN pip install --no-cache-dir poetry awslambdaric aws-xray-sdk sh \
&& poetry config virtualenvs.create false \
&& poetry install --no-interaction --no-ansi
RUN python -m playwright install-deps
RUN python -m playwright install
WORKDIR ${FUNCTION_DIR}
ENTRYPOINT [ "/usr/bin/python", "-m", "awslambdaric" ]
CMD [ "app.handler" ]
An app.py file similar to the following is then called using said function URL via awslambdaric:
import json
import logging
import os
import subprocess

from aws_xray_sdk.core import xray_recorder

logger = logging.getLogger(__name__)

def handler(event, context):
    logger.debug(msg=f"Initial event: {event}")
    headers = event["headers"]
    header_validation = validate_headers(headers)  # validate_headers is defined elsewhere in the tool
    input = headers["x-input"]
    try:
        command = headers["x-command"].split()
        command.extend(input.split())
    except Exception as e:
        logger.error(msg=f"Error parsing command: {e}")
        return {
            "statusCode": 500,
            "body": f"Error parsing command: {e}",
        }
    parsed = []
    try:
        logger.debug(msg=f"Running command: {command}")
        # Set HOME=/tmp to avoid writing to the container filesystem
        # Set LD_LIBRARY_PATH to include /usr/lib64 to avoid issues with the AWS X-Ray daemon
        os.environ["HOME"] = "/tmp"
        os.environ["LD_LIBRARY_PATH"] = "/usr/lib64"
        results = subprocess.run(command, capture_output=True, text=True, env=os.environ.copy())
        logger.debug(msg=f"Results stdout: {results.stdout}")
        logger.debug(msg=f"Results stderr: {results.stderr}")
        logger.debug(msg=f"Command exited with code: {results.returncode}")
    except subprocess.TimeoutExpired as e:
        logger.error(msg=f"Command timed out: {e}")
        return {
            "statusCode": 408,  # HTTP status code for Request Timeout
            "body": json.dumps({
                "stdout": str(e.stdout),
                "stderr": str(e.stderr),
                "e": str(e),
                "error": "Command timed out"
            }),
        }
    except Exception as e:
        logger.error(msg=f"Error executing command: {e}")
        return {
            "statusCode": 500,
            "body": f"Error executing command: {e}",
        }
    try:
        for line in results.stdout.splitlines():
            parsed_json = json.loads(line)
            logger.debug(msg=f"Output: {parsed_json}")
            parsed.append(parsed_json)
    except Exception as e:
        logger.error(msg=f"Error parsing output: {e}")
        return {
            "statusCode": 500,
            "body": f"Error parsing output: {e}",
        }
    xray_recorder.end_segment()
    return {"statusCode": 200, "body": json.dumps(parsed)}
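The core pattern in the handler above, running a command and parsing each stdout line as JSON, can be exercised in isolation. This is a minimal sketch; the child command below is illustrative and not part of the actual tool:

```python
import json
import subprocess
import sys

# Child process that emits one JSON object per line, like the tool the handler wraps.
results = subprocess.run(
    [sys.executable, "-c", "import json; print(json.dumps({'ok': True}))"],
    capture_output=True,
    text=True,
    timeout=10,
)

# Parse each stdout line as a separate JSON document, as the handler does.
parsed = [json.loads(line) for line in results.stdout.splitlines()]
print(parsed)  # [{'ok': True}]
```

Note that the handler catches subprocess.TimeoutExpired but never passes timeout= to subprocess.run, so that branch can never fire; passing an explicit timeout, as above, makes the except clause meaningful.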
This app.py code is calling a separate tool I have created that utilizes hrequests for navigation and interaction with web pages. When calling the app.py file with the function URL, however, the following error is returned from hrequests specifically:
Exception in thread Thread-1 (spawn_main):
Traceback (most recent call last):
File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/usr/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.10/dist-packages/hrequests/browser.py", line 128, in spawn_main
asyncio.new_event_loop().run_until_complete(self.main())
File "/usr/lib/python3.10/asyncio/base_events.py", line 646, in run_until_complete
return future.result()
File "/usr/local/lib/python3.10/dist-packages/hrequests/browser.py", line 135, in main
self.context = await self.client.new_context(
File "/usr/local/lib/python3.10/dist-packages/hrequests/playwright_mock/playwright_mock.py", line 38, in new_context
_browser = await context.new_context(
File "/usr/local/lib/python3.10/dist-packages/hrequests/playwright_mock/context.py", line 6, in new_context
context = await inst.main_browser.new_context(
File "/usr/local/lib/python3.10/dist-packages/playwright/async_api/_generated.py", line 14154, in new_context
await self._impl_obj.new_context(
File "/usr/local/lib/python3.10/dist-packages/playwright/_impl/_browser.py", line 127, in new_context
channel = await self._channel.send("newContext", params)
File "/usr/local/lib/python3.10/dist-packages/playwright/_impl/_connection.py", line 61, in send
return await self._connection.wrap_api_call(
File "/usr/local/lib/python3.10/dist-packages/playwright/_impl/_connection.py", line 482, in wrap_api_call
return await cb()
File "/usr/local/lib/python3.10/dist-packages/playwright/_impl/_connection.py", line 97, in inner_send
result = next(iter(done)).result()
playwright._impl._api_types.Error: Target page, context or browser has been closed
Some notes on what has already been attempted:
I am not experienced with playwright and headless browser usage, so any help would be greatly appreciated. I understand this is not directly related to hrequests, but I hope the community here is familiar enough with the frameworks to assist. Thanks!
Hi, thanks for sharing your knowledge.
I get the following error when running the code
python : 3.10
os : windows 11
H:\Python310\ib\site-packages\hrequests\bin\hrequests-cgo-2.1-windows-4.0-amd64.dll is either not designed to run on
Windows or it contains an error. Try installing the program again using the original installation media or contact your
system administrator or the software vendor for support. Error status 0xc0000020.
Hi, I have tried downloading images, but request.content is not raw bytes.
I've tried mangling r.content with .encode('utf-8') and decoding with unicode-escape, but to no avail. The response is always some strange mix of encoded and escaped string and bytes.
The Request class implementation is different from requests.Request() and does not support requests.Request.raw, and r.content is not bytes.
Is there any workaround? Cheers
Sample code:
url = "https://upload.wikimedia.org/wikipedia/commons/5/59/Shrine_of_Rememberance_%2811884180023%29.jpg"
filePath = os.path.join("tmp", "image.jpg")
while True:
    r = hrequests.request("GET", url)
    if r.status_code != 403:
        data = r.content.encode('utf-8')
        with open(filePath, "wb") as f:
            f.write(data)
        break
Botright v0.3 was released today, so you can update the playwright_mock.
The mouse movement in particular is noticeably improved.
In the requests library, if the wrong encoding is detected, it is possible to fix this manually by overriding the request.encoding attribute.
In hrequests, this is unfortunately not possible, as doing so raises the following exception:
request.encoding = 'euc_kr'
AttributeError: property 'encoding' of 'Response' object has no setter
I was wondering if there's any other way to achieve this manual override, as some websites I'm working with are being encoded incorrectly.
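As a stopgap until the setter exists (this is an assumption on my part, not a documented hrequests feature), the body bytes can be decoded manually with the correct codec instead of relying on .text. Here body stands in for the raw response content:

```python
# Simulate an EUC-KR response body; in practice this would be the raw
# bytes from the response rather than a hand-built string.
body = "안녕하세요".encode("euc_kr")

# Decode with the known-correct codec instead of the library's guess.
text = body.decode("euc_kr")
print(text)  # 안녕하세요
```

This sidesteps the encoding property entirely, at the cost of having to know the correct codec for each site up front.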
Good Morning
OS: Ubuntu 20.04
hrequests version: 0.8.2
When using hrequests.firefox.BrowserSession(), I receive the following error:
Exception in thread Thread-2 (spawn_main):
Traceback (most recent call last):
  File "/.asdf/installs/python/3.10.5/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/.asdf/installs/python/3.10.5/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/.asdf/installs/python/3.10.5/lib/python3.10/site-packages/hrequests/browser.py", line 128, in spawn_main
    asyncio.new_event_loop().run_until_complete(self.main())
  File "/.asdf/installs/python/3.10.5/lib/python3.10/asyncio/base_events.py", line 646, in run_until_complete
    return future.result()
  File "/.asdf/installs/python/3.10.5/lib/python3.10/site-packages/hrequests/browser.py", line 135, in main
    self.context = await self.client.new_context(
  File "/.asdf/installs/python/3.10.5/lib/python3.10/site-packages/hrequests/playwright_mock/playwright_mock.py", line 35, in new_context
    _proxy = await ProxyManager(self, proxy)
  File "/.asdf/installs/python/3.10.5/lib/python3.10/site-packages/async_class.py", line 173, in __await__
    yield from self.create_task(
  File "/.asdf/installs/python/3.10.5/lib/python3.10/site-packages/hrequests/playwright_mock/proxy_manager.py", line 25, in __ainit__
    self.timeout = httpx.Timeout(20.0, read=None)
TypeError: Timeout.__init__() got an unexpected keyword argument 'read'
It appears to be related to the httpx dependency rather than hrequests' own code.
I am reporting it for further analysis.
Have a great day.
Thanks for the library.
I faced this error on Windows:
File "C:\Users\Chetan\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\hrequests\response.py", line 127, in Response
    elapsed: timedelta | None = None
TypeError: unsupported operand type(s) for |: 'type' and 'NoneType'
The fix was to set elapsed = None. Please add it to the library to help people.
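For context on why this fails: the `X | None` annotation syntax (PEP 604) is evaluated when the class body executes, and it requires Python 3.10+, which is why it breaks on Python 3.9. A 3.9-compatible sketch of the fix uses typing.Optional (Response here is a stand-in class, not the real hrequests one):

```python
from datetime import timedelta
from typing import Optional

class Response:
    # `elapsed: timedelta | None = None` raises TypeError at class-creation
    # time on Python 3.9, because PEP 604 unions on real types need 3.10+.
    # typing.Optional is equivalent and works on older versions:
    elapsed: Optional[timedelta] = None

print(Response.elapsed)  # None
```

Another option is adding `from __future__ import annotations` at the top of the module, which makes all annotations lazy strings and so keeps the `timedelta | None` spelling working on 3.9.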
How to reproduce issue:
session = hrequests.Session(browser='chrome', version=112)
resp = session.get("https://httpbin.org/headers")
print(resp.json())
It shows Chrome/114.0.5731.1 in the headers instead of the expected Chrome/112.X.X.X.
Installing version 0.8.0 or above doesn't work on M1 (or presumably M2) Macs, due to the missing binaries.
/lib/python3.10/site-packages/hrequests/cffi.py", line 98, in download_library
raise IOError('Could not find a matching binary for your system.')
OSError: Could not find a matching binary for your system.
Temporary workaround is to install:
pip install -U "hrequests[all]<0.8.0"
Are there plans to include these binaries? Or perhaps some instructions on how to compile them myself?
allow_redirects is documented as defaulting to True; however, it appears to actually default to False.
Even when I enable it explicitly, I see strange behavior where redirects are not properly followed.
I stumbled on a case indicating that cookies are not properly set on the session.
Example url: https://somo.app
If you open the url in a browser, you can see that the first request returns a 307 redirect to '/' and sets a cookie; the subsequent request to the same url includes the cookie and returns 200.
Trying to open this url with hrequests instead fails with:
hrequests.exceptions.ClientException: failed to do request: Get "/": stopped after 10 redirects
import hrequests
url = "https://somo.app"
session = hrequests.Session()
resp = session.get(url, allow_redirects=True)
If I instead make the requests one at a time:
import hrequests
url = "https://somo.app"
session = hrequests.Session()
resp = session.get(url, allow_redirects=False)
print(resp.status_code)
print(resp.headers.get("location"))
resp = session.get(url, allow_redirects=False)
print(resp.status_code)
It shows that the subsequent request returns 307 again. It should not: the cookie should have been set, so the second request should return 200.
Getting and setting the cookie manually produces the expected behavior:
import hrequests
url = "https://somo.app"
session = hrequests.Session()
resp = session.get(url, allow_redirects=False)
print(resp.status_code)
print(resp.headers.get("location"))
headers = {"Cookie": ";".join([f"{c.name}={c.value}" for c in resp.cookies])}
resp = session.get(url, allow_redirects=False, headers=headers)
print(resp.status_code)
I'm sorry to write here, but could you contact me about something on Discord? @max_andolini
Hey, I am using async_get with hrequests.imap_enum(reqs) to scrape 100 URLs at a time. The readme says this should be blazing fast, but I'm not sure what that means: it's currently taking 3-5 minutes, and that does not include a render step.
Is that approximately the amount of time it should take? I was thinking it would be a lot faster, since it's just requests and not rendering. Here's the code I'm currently using. The rows variable is 100 DB records, specifically a SQLAlchemy RowMapping object.
reqs = [hrequests.async_get(r.url) for r in rows]
responses = []
for index, resp in hrequests.imap_enum(reqs):
    if resp:
        with open(f'{directory}/{index}.pickle', 'wb') as file:
            pickle.dump(resp, file)
        responses.append({"url": resp.url, "resp": resp})
    else:
        print(f'No response for {index}')
This issue serves as a todo item for migrating from PyQuery to selectolax, a faster, more modern and capable HTML parsing library.
I think it should be possible to take screenshots of specific elements instead of the entire page, and to use them in memory without saving to disk.
Like this:
image_element = page.find('#captchaframe')
image = image_element.screenshot()
OS: Windows 11.0
Python version: 3.11.4
When installing the hrequests package using pip and executing some related code, specific errors are observed.
Notably, even when using only the get function provided by hrequests, the bs4 library is required, and even after installing hrequests[all] the same errors persist. However, when bs4 is installed manually, the code works without problems. This is an inconvenient problem to have to solve every time I use this library in another project.