
Comments (7)

coletdjnz commented on June 13, 2024

Impersonation is only supported by the native downloader. yt-dlp does not know of the user agent - that is handled internally by curl-impersonate.

Impersonation is not just the user agent, it makes the whole HTTP and TLS fingerprint look exactly like the browser, which is not possible in aria2c.

from yt-dlp.

barkoder commented on June 13, 2024

Okay, this is a bit of a long description because I don't want to link to the actual website, but basically:

yt-dlp alone does not work on the webpage in question.

yt-dlp's native downloader (with --impersonate="") works on the webpage, but it's painfully slow.

yt-dlp (with --impersonate="") plus aria2c successfully parses the webpage, but when actually downloading the video file from the CDN with aria2c, it fails with 403 Forbidden.

So I initially thought maybe the website had implemented some sort of TLS fingerprint check on their CDN as well, in addition to the webpage frontend (in which case I'd be stuck using yt-dlp's slower native downloader). OR maybe merely passing the headers that were used on the frontend webpage to the CDN would work.

Thankfully it was the latter.

To confirm this, I visited the website manually using a real browser, grabbed the relevant cookies and headers of the video file (hosted on the CDN) from devtools, and converted them to aria2c-compatible --header flags. The file then downloads successfully with aria2c. Basically, if you used, say, Chrome-110 to interact with the front end, the website expects you to use Chrome-110 to interact with the backend CDN as well.
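The devtools-to-aria2c conversion described above can be sketched as follows; the header names and values are placeholders, not the ones from the actual site:

```python
# Sketch: turn headers copied from browser devtools into aria2c
# --header command-line flags. All header values are illustrative.
import shlex

def headers_to_aria2c_args(headers: dict[str, str]) -> str:
    """Render a headers dict as a string of aria2c --header flags."""
    return " ".join(
        f"--header {shlex.quote(f'{name}: {value}')}"
        for name, value in headers.items()
    )

browser_headers = {
    "User-Agent": "Mozilla/5.0 ... Chrome/110.0.0.0 Safari/537.36",
    "Cookie": "session=PLACEHOLDER",
    "Referer": "https://example.com/video-page",
}

print(headers_to_aria2c_args(browser_headers))
```

shlex.quote keeps values containing spaces or semicolons intact when the resulting string is pasted into a shell.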

The reason I created this issue is that if yt-dlp had access to (and showed with -v) which client/OS combo it picked when using --impersonate="", yt-dlp could pass the corresponding headers on to any external downloader. I'm aware TLS fingerprint spoofing isn't possible when using external downloaders, but even if yt-dlp could merely pass all the correct headers to the external downloader, the aria2c download works (and is much faster).

For example, if I manually select yt-dlp --impersonate=chrome-104:Windows-10, yt-dlp should pass on all the corresponding headers from curl_chrome104 to aria2c when downloading the actual file from whichever CDN the website connects me to.

Currently, afaik, there is no way to do this properly in yt-dlp: even when you manually specify all the chrome-104-specific headers in --downloader-args, yt-dlp insists on overriding this by passing one of its own user-agent strings (from yt-dlp --dump-user-agent) to aria2c, resulting in TWO User-Agent headers being passed to aria2c and the download failing.

This

yt-dlp -v \
  --impersonate="chrome-104:Windows-10" \
  https://www.youtube.com/watch?v=Oarf76MCrss -f 313 \
  --downloader aria2c \
  --downloader-args aria2c:"--header 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36'"

will give you

[debug] aria2c command line: aria2c -c --no-conf "--console-log-level=warn" "--summary-interval=0" "--download-result=hide" "--http-accept-gzip=true" "--file-allocation=none" -x16 -j16 -s16 --min-split-size 1M --header "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.70 Safari/537.36" --header "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8" --header "Accept-Language: en-us,en;q=0.5" --header "Sec-Fetch-Mode: navigate" "--check-certificate=true" "--remote-time=true" "--show-console-readout=true" --header "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36" --out ".\\Tom Scott-20230724-Grizzly bear GoPro selfie raw unedited footage-Oarf76MCrss.webm.part" "--auto-file-renaming=false"

Two User-Agents!!
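The duplication above boils down to yt-dlp prepending its own User-Agent --header flag before the one from --downloader-args. A fix would have to deduplicate by header name, keeping the caller-supplied value; a minimal sketch (header values are illustrative, and this is not yt-dlp's actual code path):

```python
# Sketch: deduplicate a list of "Name: value" header strings,
# keeping only the LAST occurrence of each header name, so a
# caller-supplied User-Agent would win over a prepended default.

def dedupe_headers(header_values: list[str]) -> list[str]:
    """Keep the last occurrence of each header name (case-insensitive)."""
    result: dict[str, str] = {}
    for hv in header_values:
        name, _, _value = hv.partition(":")
        result[name.strip().lower()] = hv
    return list(result.values())

headers = [
    "User-Agent: Mozilla/5.0 ... Chrome/90.0.4430.70 Safari/537.36",   # yt-dlp's default
    "Accept: text/html,*/*;q=0.8",
    "User-Agent: Mozilla/5.0 ... Chrome/104.0.0.0 Safari/537.36",      # from --downloader-args
]
print(dedupe_headers(headers))
```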

To check which user-agent is being read by the website, run the above aria2c command with a slight modification: -x1 -j1 -s1 --out test.txt https://myip.wtf/headers. Then cat test.txt.

You'll see that the user-agent string that gets read by the website is always the first one sent (the one from yt-dlp --dump-user-agent), even though I explicitly specified --downloader-args aria2c:"--header 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36'". And I don't know of a way to have my own user-agent supersede the one yt-dlp passes to aria2c.

The website in question is thankfully not doing any TLS fingerprint verification when connecting to the CDN at the moment. Merely passing the proper cookies/headers works. But yt-dlp is currently not able to do that.

How does yt-dlp randomly pick a client/OS when I specify --impersonate=""? Is that random pick even done by yt-dlp? If so, yt-dlp could internally keep a copy of the headers that curl-impersonate ships for each of the client/OS combos, and pass them on to the external downloader properly (sans TLS fingerprint spoofing, of course), depending on which --impersonate="" client was randomly picked.

Thanks!

PS: Sorry for repeating myself in some places.


bashonly commented on June 13, 2024

IMO the site just needs a dedicated extractor with built-in impersonation support and it can return the user-agent header in the info dict


pukkandan commented on June 13, 2024

Currently afaik there is no way to do it properly in yt-dlp

--user-agent "..." should work instead of --downloader-args


barkoder commented on June 13, 2024

Thanks @pukkandan. That works! (for now at least).

Although I can't use --impersonate="" itself, due to the random client/OS combo and me not knowing beforehand which --user-agent I should use.
For now I must pick one explicitly (like --impersonate=chrome-107:windows-10) and manually specify its corresponding user-agent (--user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36").

Basically the fix I'm asking for is

If --downloader some_external_downloader and --impersonate="" are both specified, set --user-agent to the user-agent of whichever client/OS combo was randomly picked for --impersonate="".

Also, I think this would work for all supported websites and wouldn't break anything, unless I'm mistaken.

Feel free to close this issue if you don't intend to fix it atm, but I have a feeling that as websites further harden their anti-bot measures, this issue will likely recur.

Still, thanks!


bashonly commented on June 13, 2024

yt-dlp does not know of the user agent - that is handled internally by curl-impersonate.

^ This is the problem. The impersonate user-agents are only known by curl-cffi. It's not feasible to hardcode them in yt-dlp (e.g. the exact user-agent strings may even change between versions of curl-cffi).

What is feasible, as I said above, is to add a dedicated extractor for the site in question, where we could hardcode a single impersonate target and user-agent.


barkoder commented on June 13, 2024

Thanks @bashonly, but it's alright. I'll make do with manually specifying the user-agent.

Closing the issue.

