
Comments (7)

coletdjnz commented on June 13, 2024

Impersonation is only supported by the native downloader. yt-dlp does not know of the user agent - that is handled internally by curl-impersonate.

Impersonation is not just the user agent, it makes the whole HTTP and TLS fingerprint look exactly like the browser, which is not possible in aria2c.

from yt-dlp.

barkoder commented on June 13, 2024

Okay, this is a bit of a long description because I don't want to link to the actual website, but basically:

yt-dlp alone does not work on the webpage in question.

yt-dlp's native downloader (with --impersonate="") works on the webpage, but it's painfully slow.

yt-dlp (with --impersonate="") plus aria2c successfully parses the webpage, but when actually downloading the video file from the CDN with aria2c, it fails with 403 Forbidden.

So I initially thought maybe the website had implemented some sort of TLS fingerprint check on their CDN as well, in addition to the webpage frontend (in which case I'd be stuck using yt-dlp's slower native downloader). OR maybe merely passing the headers that were used on the frontend webpage to the CDN would work.

Thankfully it was the latter.

To confirm this, I visited the website manually using a real browser, grabbed the relevant cookies and headers of the video file (hosted on the CDN) from devtools, and converted them to aria2c-compatible --header flags. The file then downloads successfully with aria2c. Basically, if you used, say, Chrome-110 to interact with the front end, the website expects you to use Chrome-110 to interact with the backend CDN as well.
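The devtools-to-aria2c conversion described above can be sketched as follows; the header names and values are placeholders, not the ones from the actual site:

```python
# Sketch: turn headers copied from browser devtools into aria2c
# --header command-line flags. All header values are illustrative.
import shlex

def headers_to_aria2c_args(headers: dict[str, str]) -> str:
    """Render a headers dict as a string of aria2c --header flags."""
    return " ".join(
        f"--header {shlex.quote(f'{name}: {value}')}"
        for name, value in headers.items()
    )

browser_headers = {
    "User-Agent": "Mozilla/5.0 ... Chrome/110.0.0.0 Safari/537.36",
    "Cookie": "session=PLACEHOLDER",
    "Referer": "https://example.com/video-page",
}

print(headers_to_aria2c_args(browser_headers))
```

shlex.quote keeps values containing spaces or semicolons intact when the resulting string is pasted into a shell.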

The reason I created this issue is that if yt-dlp had access to (and showed with -v) which client/OS combo it picked when using --impersonate="", yt-dlp could pass the corresponding headers on to any external downloader. I'm aware TLS fingerprint spoofing isn't possible when using external downloaders, but even if yt-dlp could merely pass all the correct headers to the external downloader, the aria2c download works (and is much faster).

For example, if I manually select yt-dlp --impersonate=chrome-104:Windows-10, yt-dlp should pass on all the corresponding headers from curl_chrome104 to aria2c when downloading the actual file from whichever CDN the website connects me to.

Currently, afaik, there is no way to do this properly in yt-dlp: even when you manually specify all the chrome-104-specific headers in --downloader-args, yt-dlp insists on overriding this by passing one of its own user-agent strings (from yt-dlp --dump-user-agent) to aria2c, resulting in TWO User-Agent headers being passed to aria2c and the download failing.

This

yt-dlp -v \
  --impersonate="chrome-104:Windows-10" \
  https://www.youtube.com/watch?v=Oarf76MCrss -f 313 \
  --downloader aria2c \
  --downloader-args aria2c:"--header 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36'"

will give you

[debug] aria2c command line: aria2c -c --no-conf "--console-log-level=warn" "--summary-interval=0" "--download-result=hide" "--http-accept-gzip=true" "--file-allocation=none" -x16 -j16 -s16 --min-split-size 1M --header "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.70 Safari/537.36" --header "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8" --header "Accept-Language: en-us,en;q=0.5" --header "Sec-Fetch-Mode: navigate" "--check-certificate=true" "--remote-time=true" "--show-console-readout=true" --header "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36" --out ".\\Tom Scott-20230724-Grizzly bear GoPro selfie raw unedited footage-Oarf76MCrss.webm.part" "--auto-file-renaming=false"

Two User-Agents!!
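The duplication above boils down to yt-dlp prepending its own User-Agent --header flag before the one from --downloader-args. A fix would have to deduplicate by header name, keeping the caller-supplied value; a minimal sketch (header values are illustrative, and this is not yt-dlp's actual code path):

```python
# Sketch: deduplicate a list of "Name: value" header strings,
# keeping only the LAST occurrence of each header name, so a
# caller-supplied User-Agent would win over a prepended default.

def dedupe_headers(header_values: list[str]) -> list[str]:
    """Keep the last occurrence of each header name (case-insensitive)."""
    result: dict[str, str] = {}
    for hv in header_values:
        name, _, _value = hv.partition(":")
        result[name.strip().lower()] = hv
    return list(result.values())

headers = [
    "User-Agent: Mozilla/5.0 ... Chrome/90.0.4430.70 Safari/537.36",   # yt-dlp's default
    "Accept: text/html,*/*;q=0.8",
    "User-Agent: Mozilla/5.0 ... Chrome/104.0.0.0 Safari/537.36",      # from --downloader-args
]
print(dedupe_headers(headers))
```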

To check which user-agent is being read by the website, run the above aria2c command with a slight modification: -x1 -j1 -s1 --out test.txt https://myip.wtf/headers. Then cat test.txt.

You'll see that the user-agent string that gets read by the website is always the first one sent (the one from yt-dlp --dump-user-agent), even though I explicitly specified --downloader-args aria2c:"--header 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36'". And I don't know of a way to have my own user-agent supersede the one yt-dlp passes to aria2c.

The website in question is thankfully not doing any TLS fingerprint verification when connecting to the CDN at the moment. Merely passing the proper cookies/headers works. But yt-dlp is currently not able to do that.

How does yt-dlp randomly pick a client/OS when I specify --impersonate=""? Is that random pick even done by yt-dlp? If so, yt-dlp could internally keep a copy of the headers that curl-impersonate ships for each of the client/OS combos, and pass them on to the external downloader properly (sans TLS fingerprint spoofing, of course), depending on which --impersonate="" client was randomly picked.

Thanks!

PS: Sorry for repeating myself in some places.


bashonly commented on June 13, 2024

IMO the site just needs a dedicated extractor with built-in impersonation support and it can return the user-agent header in the info dict


pukkandan commented on June 13, 2024

Currently afaik there is no way to do it properly in yt-dlp

--user-agent "..." should work instead of --downloader-args


barkoder commented on June 13, 2024

Thanks @pukkandan. That works! (for now at least).

Although I can't use --impersonate="" itself, due to the random client/OS combo and me not knowing beforehand which --user-agent I should use.
For now I must pick one explicitly (like --impersonate=chrome-107:windows-10) and manually specify its corresponding user-agent (--user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36").

Basically the fix I'm asking for is

If --downloader some_external_downloader and --impersonate="" are both specified, set --user-agent to the user-agent of whichever client/OS combo was randomly picked for --impersonate="".

Also, I think this would work for all supported websites and wouldn't break anything, unless I'm mistaken.

Feel free to close this issue if you don't intend to fix it atm, but I have a feeling that as websites further harden their anti-bot measures, this issue will likely recur.

Still, thanks!


bashonly commented on June 13, 2024

yt-dlp does not know of the user agent - that is handled internally by curl-impersonate.

^ This is the problem. The impersonate user-agents are only known by curl-cffi. It's not feasible to hardcode them in yt-dlp (e.g. the exact user-agent strings may even change between versions of curl-cffi).

What is feasible, as I said above, is to add a dedicated extractor for the site in question, where we could hardcode a single impersonate target and user-agent.


barkoder commented on June 13, 2024

Thanks @bashonly, but it's alright. I'll make do with manually specifying the user-agent.

Closing the issue.

