Comments (7)
Impersonation is only supported by the native downloader. yt-dlp does not know of the user agent - that is handled internally by curl-impersonate.
Impersonation is not just the user agent, it makes the whole HTTP and TLS fingerprint look exactly like the browser, which is not possible in aria2c.
from yt-dlp.
Okay, this is a bit of a long description because I don't want to link to the actual website, but basically:
yt-dlp does not work on the webpage in question.
yt-dlp's native downloader (with --impersonate="") works on the webpage, but it's painfully slow.
yt-dlp (with --impersonate="") plus aria2c successfully parses the webpage, but when aria2c actually downloads the video file from the CDN, it fails: Forbidden.
So I initially thought maybe the website had implemented some sort of TLS fingerprint check on their CDN as well, in addition to the webpage frontend (in which case I'd be stuck using yt-dlp's slower native downloader). Or maybe merely passing the headers that were used on the frontend webpage to the CDN would work.
Thankfully it was the latter.
To confirm this, I visited the website manually using a real browser, grabbed the relevant cookies and headers for the video file (hosted on the CDN) from devtools, and converted them to an aria2c-compatible command (--header). The file then downloads successfully using aria2c. Basically, if you used, say, Chrome-110 to interact with the front-end, the website expects you to use Chrome-110 to interact with the backend CDN as well.
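For anyone repeating that manual step, here is a rough sketch of the conversion. It assumes the request headers were saved from devtools as plain "Name: value" lines; the function name is made up for illustration, and the only aria2c feature relied on is its repeatable --header option.

```shell
#!/bin/sh
# Hypothetical helper: read "Name: value" lines on stdin (e.g. copied from
# browser devtools) and print one aria2c --header option per line.
headers_to_aria2c_args() {
    # The "|| [ -n ... ]" part keeps a final line that lacks a trailing newline.
    while IFS= read -r line || [ -n "$line" ]; do
        case "$line" in
            # Skip blank lines and HTTP/2 pseudo-headers such as ":authority".
            ''|:*) continue ;;
        esac
        printf '%s\n' "--header=$line"
    done
}
```

In bash, something like mapfile -t args < <(headers_to_aria2c_args < headers.txt) followed by aria2c "${args[@]}" "$url" should then pass each header as its own argument (headers.txt and $url are placeholders here).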
The reason I created this issue is that if yt-dlp had access to (and showed with -v) which client/OS combo it picked when using --impersonate="", yt-dlp could pass on the corresponding headers to any external downloader. I'm aware TLS fingerprint spoofing isn't possible when using external downloaders, but if yt-dlp merely passes all the correct headers to the external downloader, the aria2c download works (and is much faster).
For example, if I manually select yt-dlp --impersonate=chrome-104:Windows-10, yt-dlp should pass on all the corresponding headers from curl_chrome104 to aria2c when downloading the actual file from whichever CDN the website connects me to.
Currently, afaik, there is no way to do this properly in yt-dlp: even when you manually specify all the chrome-104-specific headers in --downloader-args, yt-dlp still passes its own user-agent string (the one from yt-dlp --dump-user-agent) to aria2c. The result is TWO user-agents being passed to aria2c, and the download fails.
This command:
yt-dlp -v \
--impersonate="chrome-104:Windows-10" \
"https://www.youtube.com/watch?v=Oarf76MCrss" -f 313 \
--downloader aria2c \
--downloader-args aria2c:"--header 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36'"
will give you:
[debug] aria2c command line: aria2c -c --no-conf "--console-log-level=warn" "--summary-interval=0" "--download-result=hide" "--http-accept-gzip=true" "--file-allocation=none" -x16 -j16 -s16 --min-split-size 1M --header "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.70 Safari/537.36" --header "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8" --header "Accept-Language: en-us,en;q=0.5" --header "Sec-Fetch-Mode: navigate" --check-certificate=true" "--remote-time=true" "--show-console-readout=true" --header "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36" --out ".\\Tom Scott-20230724-Grizzly bear GoPro selfie raw unedited footage-Oarf76MCrss.webm.part" "--auto-file-renaming=false"
Two User-Agents!!
To check which user-agent is being read by the website, run the above aria2c command with a slight modification (-x1 -j1 -s1 --out test.txt https://myip.wtf/headers) and then cat test.txt.
You'll see that the user-agent string read by the website is always the first one sent (the one from yt-dlp --dump-user-agent), even though I've explicitly specified --downloader-args aria2c:"--header 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36'". And I don't know of a way to have my own user-agent supersede yt-dlp's aria2c user-agent.
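The duplicate header can also be spotted mechanically rather than by eye. This is only a sketch: it assumes the "[debug] aria2c command line:" text from yt-dlp -v has been copied into a string, and the function name is hypothetical. It uses grep -o, which GNU and BSD grep both support.

```shell
#!/bin/sh
# Hypothetical helper: count how many User-Agent headers appear in an
# aria2c command line (passed as a single string argument).
count_user_agent_headers() {
    # grep -o prints each match on its own line, wc -l counts those lines,
    # and tr strips the padding some wc implementations add.
    printf '%s\n' "$1" | grep -oi -e '--header "user-agent:' | wc -l | tr -d ' '
}
```

Anything above 1 means the server will be offered more than one user-agent, which is the situation described above.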
The website in question is thankfully not doing any TLS fingerprint verification when connecting to the CDN at the moment. Merely passing the proper cookies/headers works. But yt-dlp is currently not able to do that.
How does yt-dlp randomly pick a client/OS when I specify --impersonate=""? Is that random pick even done by yt-dlp? If yes, yt-dlp could internally keep a copy of all the headers for each of the client/OS combos and pass them on to the external downloader properly (sans TLS fingerprint spoofing, of course), depending on which --impersonate="" client was randomly picked.
Thanks!
PS: Sorry for repeating myself in some places.
from yt-dlp.
IMO the site just needs a dedicated extractor with built-in impersonation support, and it can return the user-agent header in the info dict.
from yt-dlp.
Currently afaik there is no way to do it properly in yt-dlp
--user-agent "..." should work instead of --downloader-args.
from yt-dlp.
Thanks @pukkandan. That works! (for now at least).
Although I can't use --impersonate="" because of the random client/OS combo: I don't know beforehand which --user-agent I should use.
I must pick one now (like --impersonate=chrome-107:windows-10) and manually specify its corresponding user-agent (--user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36").
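To keep the pinned target and its user-agent from drifting apart, the pairing can live in a small wrapper. This is a sketch under assumptions: the function names are made up, and the table only holds the two user-agent strings quoted in this thread, which curl-cffi may or may not send verbatim in any given version.

```shell
#!/bin/sh
# Hypothetical lookup table: impersonate target -> user-agent string.
# The strings are the ones quoted in this thread, not values read from curl-cffi.
user_agent_for_target() {
    # Normalise case so "chrome-104:Windows-10" and "chrome-104:windows-10" match.
    target=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    case "$target" in
        chrome-104:windows-10)
            printf '%s\n' 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36' ;;
        chrome-107:windows-10)
            printf '%s\n' 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36' ;;
        *)
            return 1 ;;   # unknown target: refuse to guess
    esac
}

# Hypothetical wrapper: $1 is the impersonate target, $2 is the page URL.
download_pinned() {
    ua=$(user_agent_for_target "$1") || {
        echo "no user-agent known for target: $1" >&2
        return 1
    }
    yt-dlp --impersonate="$1" --user-agent "$ua" --downloader aria2c "$2"
}
```

Calling download_pinned chrome-107:windows-10 "$url" then runs yt-dlp with a matching pair; any target missing from the table fails early instead of sending a mismatched user-agent.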
Basically, the fix I'm asking for is:
If --downloader some_external_downloader and --impersonate="" are specified, set --user-agent='USER_AGENT_OF_WHATEVER_CLIENT/OS_WAS_RANDOMLY_PICKED_FROM --impersonate=""'.
I also think it would work for all supported websites and wouldn't break anything, unless I'm mistaken.
Feel free to close this issue if you don't intend to fix it atm, but I have a feeling that as websites further harden their anti-bot measures, this issue is likely to recur.
Still, thanks!
from yt-dlp.
yt-dlp does not know of the user agent - that is handled internally by curl-impersonate.
^ this is the problem. The impersonate user-agents are only known by curl-cffi. It's not feasible to hardcode them in yt-dlp (e.g. the exact user-agent strings may even change between versions of curl-cffi).
What is feasible, as I said above, is to add a dedicated extractor for the site in question, where we could hardcode a single impersonate target and user-agent.
from yt-dlp.
Thanks @bashonly, but it's alright. I'll make do with manually specifying the user-agent.
Closing the issue.
from yt-dlp.