cf-clearance-scraper's Introduction

CF-Clearance-Scraper

Playwright Version

A simple program for scraping Cloudflare clearance (cf_clearance) cookies from websites issuing Cloudflare challenges to visitors. This program works on all Cloudflare challenge types (JavaScript, managed, and interactive). If you would prefer using undetected-chromedriver, check out the undetected-chromedriver version.

Note: This program currently cannot solve Turnstile challenges due to an issue with Playwright. For more information, see microsoft/playwright#21780. As a temporary workaround, pass the -d flag and solve the challenge manually.

Clearance Cookie Usage

In order to bypass Cloudflare challenges with the clearance cookies, you must make sure of two things:

  • The user agent used to fetch the clearance cookie must match the user agent used in the requests that send the clearance cookie.

    Note: The default user agent used by the scraper is Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:123.0) Gecko/20100101 Firefox/123.0.

  • The IP address used to fetch the clearance cookie must match the IP address used to make the requests that send the clearance cookie.

[Flowchart: cf_clearance --> IP Address; cf_clearance --> User Agent]
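
The two requirements above can be sketched with Python's standard library. This is a minimal illustration, not part of the scraper: the helper name `build_cleared_request` is hypothetical, and the cookie value is a placeholder you would replace with the value the scraper returned.

```python
import urllib.request

# Default user agent of the scraper, as stated in the note above.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:123.0) "
    "Gecko/20100101 Firefox/123.0"
)


def build_cleared_request(
    url: str, cf_clearance: str, user_agent: str
) -> urllib.request.Request:
    """Build a request that reuses a clearance cookie (hypothetical helper)."""
    headers = {
        # Must be the same user agent that was used to fetch the cookie.
        "User-Agent": user_agent,
        "Cookie": f"cf_clearance={cf_clearance}",
    }
    return urllib.request.Request(url, headers=headers)


request = build_cleared_request(
    "https://nowsecure.nl", "PLACEHOLDER_COOKIE_VALUE", USER_AGENT
)
# Sending the request (urllib.request.urlopen(request)) must happen from the
# same IP address that obtained the cookie; no header can satisfy that part.
```

The user agent requirement is met by the headers, while the IP address requirement depends on where the request is sent from (for example, the same proxy that was passed to the scraper).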

Installation

$ pip install -r requirements.txt
$ python -m playwright install --with-deps firefox

Usage

Note: The user agent you use may affect your ability to solve the Cloudflare challenge.

usage: main.py [-h] [-f FILE] [-t TIMEOUT] [-p PROXY] [-ua USER_AGENT] [--disable-http2] [--disable-http3] [-d] [-v] URL

A simple program for scraping Cloudflare clearance (cf_clearance) cookies from websites issuing Cloudflare challenges to visitors

positional arguments:
  URL                   The URL to scrape the Cloudflare clearance cookie from

options:
  -h, --help            show this help message and exit
  -f FILE, --file FILE  The file to write the Cloudflare clearance cookie information to, in JSON format
  -t TIMEOUT, --timeout TIMEOUT
                        The timeout in seconds to use for browser actions and solving challenges
  -p PROXY, --proxy PROXY
                        The proxy server URL to use for the browser requests (SOCKS5 proxy authentication is not supported)
  -ua USER_AGENT, --user-agent USER_AGENT
                        The user agent to use for the browser requests
  --disable-http2       Disable the usage of HTTP/2 for the browser requests
  --disable-http3       Disable the usage of HTTP/3 for the browser requests
  -d, --debug           Run the browser in headed mode
  -v, --verbose         Increase the output verbosity
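
If you want to call the scraper from another Python script, the options above can be assembled into an argument list. The sketch below is only an illustration: the helper `build_command` is hypothetical, it assumes `main.py` is in the current working directory, and it uses only the flags listed in the help text.

```python
import sys
from typing import List, Optional


def build_command(
    url: str,
    file: Optional[str] = None,
    timeout: Optional[int] = None,
    proxy: Optional[str] = None,
    user_agent: Optional[str] = None,
    debug: bool = False,
    verbose: bool = False,
) -> List[str]:
    """Assemble an argument list for main.py (hypothetical helper)."""
    command = [sys.executable, "main.py"]
    if file is not None:
        command += ["-f", file]
    if timeout is not None:
        command += ["-t", str(timeout)]
    if proxy is not None:
        command += ["-p", proxy]
    if user_agent is not None:
        command += ["-ua", user_agent]
    if debug:
        command.append("-d")
    if verbose:
        command.append("-v")
    command.append(url)  # URL is the positional argument and goes last here.
    return command


command = build_command("https://nowsecure.nl", file="cookies.json", verbose=True)
# To actually run it: subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems with user agent strings, which contain spaces, parentheses, and semicolons.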

Example

$ python main.py -v -f cookies.json https://nowsecure.nl
[11:33:32] [INFO] Launching headless browser...
[11:33:34] [INFO] Going to https://nowsecure.nl...
[11:33:34] [INFO] Solving Cloudflare challenge [Managed]...
[11:33:38] [INFO] Cookie: cf_clearance=SNMwlsKbfROOWr3FU0jgPn0WY3.z1sn5_b3W6aSRwh8-1690648414-0-160.0.0
[11:33:38] [INFO] User agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:123.0) Gecko/20100101 Firefox/123.0
[11:33:38] [INFO] Writing Cloudflare clearance cookie information to cookies.json...
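
When the verbose output is captured by another script, the cookie and user agent lines can be extracted with a regular expression. The following is a rough sketch that assumes the log format stays exactly as in the example above, which is not guaranteed; `parse_verbose_output` is a hypothetical helper name.

```python
import re
from typing import Dict

# Matches lines such as "[11:33:38] [INFO] Cookie: cf_clearance=...".
# The format is inferred from the example output only.
LOG_LINE = re.compile(
    r"^\[\d{2}:\d{2}:\d{2}\] \[INFO\] (Cookie|User agent): (.+)$"
)


def parse_verbose_output(output: str) -> Dict[str, str]:
    """Collect the 'Cookie' and 'User agent' values from verbose log text."""
    found: Dict[str, str] = {}
    for line in output.splitlines():
        match = LOG_LINE.match(line.strip())
        if match:
            found[match.group(1)] = match.group(2)
    return found


sample = """\
[11:33:34] [INFO] Solving Cloudflare challenge [Managed]...
[11:33:38] [INFO] Cookie: cf_clearance=SNMwlsKbfROOWr3FU0jgPn0WY3.z1sn5_b3W6aSRwh8-1690648414-0-160.0.0
[11:33:38] [INFO] User agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:123.0) Gecko/20100101 Firefox/123.0
"""
parsed = parse_verbose_output(sample)
```

Reading the JSON file written with -f is likely the more robust route, but its schema is not shown here, so this sketch sticks to the documented log lines.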

cf-clearance-scraper's Issues

Why can't it run in a Docker container?

I am using the following Docker image:

mcr.microsoft.com/playwright/python

If you try it, you will get:

root@c274d227588f:/home# python main.py -u https://nowsecure.nl -f cookies.json -v
[13:19:00] [INFO] Launching headless browser...
[13:19:01] [INFO] Going to https://nowsecure.nl...
[13:19:02] [INFO] Solving cloudflare challenge [JavaScript]...
[13:19:30] [ERROR] Failed to retrieve cf_clearance cookie.

Playwright browser cannot get through the site even after solving

Hey, this was working well a couple of days ago. It would go to the site, detect the challenge, solve it, and then continue to a different page. At the moment it just auto-solves the challenge and repeats the process. I attached some images and a video for visual context.

After the solution has been done it will log this error:
image

Vid of the browser repeating the solve instead of getting through to the site:
https://github.com/Xewdy444/CF-Clearance-Scraper/assets/93611007/c61c100b-fd7f-4a56-8d1c-37623c471ea3

Failed to retrieve the cf_clearance cookie on roobet.com

Hello, for some reason it detects the challenge on Roobet but can't solve it. Any idea why?
C:\Users\Usuario\Downloads\CFScraperOB2>python main.py -u https://roobet.com/ -v
[17:18:10] Checking for cloudflare challenge...
[17:18:10] Cloudflare challenge detected. Fetching cf_clearance cookie...
[17:18:11] Launching headless browser...
[17:18:11] Going to https://roobet.com/...
[17:18:13] Failed to retrieve cf_clearance cookie.

help with user agent

I'm trying to get the cf_clearance cookie from a site, and with the default user agent it gets the cookie with no issues.

However, it only works with Firefox user agents. If I try to use a Chrome user agent, it doesn't work.
The user agent I need to use is:

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36

No issues passing the verification in the chrome browser itself so I don't know why this script only works with firefox user agents

All my scraping scripts are set to use a portable chrome browser and driver so it has to work with the above user agent

Any advice?

[ERROR]

(No Cloudflare challenge detected.) Hello Xewdy444, is there any way to resolve this?

failed on nhentai.net

  • OS: Ubuntu 22.04
  • Python: 3.10.12
python3 main.py -v -f cookies.json https://nhentai.net/
[03:30:22] [INFO] Launching headless browser...
[03:30:23] [INFO] Going to https://nhentai.net/...
[03:30:24] [INFO] Solving Cloudflare challenge [Managed]...
[03:30:55] [ERROR] Failed to retrieve a Cloudflare clearance cookie.

Failed to retrieve cf_clearance cookie.

I get this error message when trying to run main.py. I have Playwright 1.28 installed:

$ python3.9 main.py -u https://nowsecure.nl -f cookies.json -v
[23:27:32] [INFO] Launching headless browser...
[23:27:35] [INFO] Going to https://nowsecure.nl...
[23:27:35] [INFO] Solving cloudflare challenge [Managed]...
[23:28:18] [ERROR] Failed to retrieve cf_clearance cookie.

help needed

I have a problem adding this solver to my bot. Could you please help me with it? I am facing Turnstile/hCaptcha challenges on my target website.

No Cloudflare challenge detected

Getting the message "No Cloudflare challenge detected." when using the scraper.

Command: python3 main.py -v -f cookies.json https://www.ebgames.co.nz

The site does have Cloudflare protection and I get a cf_clearance cookie browsing normally so not sure why this is happening.

Any help appreciated.

error

[10:10:21] [INFO] Solving Cloudflare challenge [Managed]...
[10:17:36] [ERROR] Timeout 15000ms exceeded.
=========================== logs ===========================
checking visibility of locator("#challenge-spinner")

[10:17:36] [ERROR] Failed to retrieve the Cloudflare clearance cookie.

useragent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36

How do I fix this?

error

hey - this used to work well before, but is now failing consistently. Wondering if you are facing the same problem too?

Tried in headful mode too. See this failure:
image

Issues changing User Agent

Hello, most of the time when I change the user agent, I get a handshake timeout.
Here is an example using the default UA, and then using Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36:

C:\Users\Usuario\Downloads\CFScraperOB2>python main.py -u https://www.petflow.com/ -v
[15:20:52] Checking for cloudflare challenge...
[15:20:52] Cloudflare challenge detected. Fetching cf_clearance cookie...
[15:20:54] Launching headless browser...
[15:20:56] Going to https://www.petflow.com/...
[15:20:57] Solving cloudflare challenge [Managed]...
[15:20:58] Cookie: cf_clearance=ey4oPPQnPTDdZ6LHuC3tg_BiVHJV.BTJCsiUHt7TG78-1668882058-0-150

C:\Users\Usuario\Downloads\CFScraperOB2>python main.py -u https://www.petflow.com/ -v
[15:35:19] Checking for cloudflare challenge...
[15:35:34] _ssl.c:1108: The handshake operation timed out

[ERROR] Execution context was destroyed, most likely because of a navigation

But isn't it the normal behavior for CF checks? At least for me it reloads the page whether it lets me through or not even if I'm visiting CF-protected pages normally outside this tool. So it's expected that after passing the checks the page could be navigated away, because as far as I know, CF passes some data through the query string appended to the current URL.

chromedriver keeps clicking checkbox

I was testing with chromedriver on commit 57c1696. I fixed TURNSTILE_FRAME with the correct XPath to find the new iframe on https://nowsecure.nl. When running it with the -d option, I can see it keeps clicking the checkbox, but it cannot pass the validation. I tried different versions of Chrome with no luck. Is there any way to successfully get the cookie using chromedriver?

'playwright' module missing

from playwright._impl._api_types import Error as PlaywrightError
ModuleNotFoundError: No module named 'playwright'

Am I missing an update?

Doesn't click the checkbox

The driver does not detect the checkbox, so it doesn't click it and fails. It also doesn't detect that the challenge is an interactive one that has to be clicked. This happens with both the chromedriver version and the other one.

playwright import error

Was using this just fine before but suddenly got the following error.

Command: py main.py -v -f cookies.json https://flipd.gg

Error:
Traceback (most recent call last):
  File "C:\Users\oSana\Desktop\Development\Python\Projects\FlipdBumper\CF-Clearance-Scraper\main.py", line 12, in <module>
    from playwright.sync_api import Frame, sync_playwright
  File "C:\Users\oSana\AppData\Local\Programs\Python\Python310\lib\site-packages\playwright\sync_api\__init__.py", line 25, in <module>
    import playwright.sync_api._generated
  File "C:\Users\oSana\AppData\Local\Programs\Python\Python310\lib\site-packages\playwright\sync_api\_generated.py", line 25, in <module>
    from playwright._impl._accessibility import Accessibility as AccessibilityImpl
  File "C:\Users\oSana\AppData\Local\Programs\Python\Python310\lib\site-packages\playwright\_impl\_accessibility.py", line 17, in <module>
    from playwright._impl._connection import Channel
  File "C:\Users\oSana\AppData\Local\Programs\Python\Python310\lib\site-packages\playwright\_impl\_connection.py", line 35, in <module>
    from pyee import EventEmitter
  File "C:\Users\oSana\AppData\Local\Programs\Python\Python310\lib\site-packages\pyee\__init__.py", line 120, in <module>
    from pyee.trio import TrioEventEmitter as TrioEventEmitter  # noqa
  File "C:\Users\oSana\AppData\Local\Programs\Python\Python310\lib\site-packages\pyee\trio.py", line 7, in <module>
    import trio
  File "C:\Users\oSana\AppData\Local\Programs\Python\Python310\lib\site-packages\trio\__init__.py", line 19, in <module>
    from ._core import TASK_STATUS_IGNORED as TASK_STATUS_IGNORED  # isort: skip
  File "C:\Users\oSana\AppData\Local\Programs\Python\Python310\lib\site-packages\trio\_core\__init__.py", line 9, in <module>
    from ._entry_queue import TrioToken
  File "C:\Users\oSana\AppData\Local\Programs\Python\Python310\lib\site-packages\trio\_core\_entry_queue.py", line 129, in <module>
    @attr.s(eq=False, hash=False, slots=True)
TypeError: attrs() got an unexpected keyword argument 'eq'
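
A possible diagnostic, offered as an assumption rather than a confirmed fix: the `eq` keyword of `attr.s` was introduced in attrs 19.2.0, so this TypeError usually points to an older attrs installation in the same environment. The sketch below only reports the installed version; the helper names are hypothetical.

```python
import re
from importlib import metadata
from typing import Optional, Tuple


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def version_tuple(version: str) -> Tuple[int, ...]:
    """Turn '19.1.0' into (19, 1, 0), stopping at the first non-numeric part."""
    numbers = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        numbers.append(int(match.group()))
    return tuple(numbers)


attrs_version = installed_version("attrs")
if attrs_version is None:
    print("attrs is not installed in this environment")
elif version_tuple(attrs_version) < (19, 2, 0):
    # Assumption: upgrading attrs (pip install --upgrade attrs) resolves it.
    print(f"attrs {attrs_version} predates the 'eq' argument")
else:
    print(f"attrs {attrs_version} supports 'eq'; the cause is likely elsewhere")
```

Run it with the same interpreter that runs main.py, since a different environment may have a different attrs version installed.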

Feature request

Is your feature request related to a problem? Please describe.
Hello, I need a cf_clearance cookie for each proxy I use, but requests through the proxies are not challenged, so I can't get the cookies I need to keep my program working.

Describe the solution you'd like
Maybe a mode that, when active, makes every request receive a challenge.

Describe alternatives you've considered
I can do this using a network sniffer: once Cloudflare sees that it is capturing traffic, it always throws the challenge. This does not happen with proxies, since the sniffer is not active there. I was wondering if the program could disguise itself as a sniffer so that Cloudflare throws the challenge.

Additional context
.

import Error as PlaywrightError

I got this error log:
Traceback (most recent call last):
  File "main.py", line 11, in <module>
    from playwright._impl._api_types import Error as PlaywrightError

I'm using Python 3.8. Could you tell me which version of Python is recommended?

Error in js solving

C:\Users\Usuario\Downloads\CF-Clearance-Scraper>python main.py -u https://www.bang.com/ -f cookies.txt -v
[00:33:07] Checking for cloudflare challenge...
[00:33:08] Cloudflare challenge detected. Fetching cf_clearance cookie...
[00:33:08] Launching headless browser...
[00:33:09] Going to https://www.bang.com/...
[00:33:10] Solving cloudflare challenge [JavaScript]...
Traceback (most recent call last):
  File "main.py", line 208, in <module>
    main()
  File "main.py", line 185, in main
    cookies = get_cookies(args)
  File "main.py", line 100, in get_cookies
    solve_challenge(page)
  File "main.py", line 49, in solve_challenge
    verify_button = page.get_by_role("button", name=verify_button_pattern)
AttributeError: 'Page' object has no attribute 'get_by_role'

[Feature request] Show source data

Hello, sorry for making all these requests. I'd love to help you, and you can contact me if you need anything.
Would it be possible to add an option to show the whole source data from the page?

Not pass checkbox

In the most recent commit, the checkbox did not pass. I’ve investigated your code and found that it does not detect the turnstile iframe.

Please check this.
