yt-fts's Introduction

yt-fts - Youtube Full Text Search

yt-fts is a command line program that uses yt-dlp to scrape all of a YouTube channel's subtitles and load them into an SQLite database that is searchable from the command line. It lets you query a channel for a specific keyword or phrase and generates time stamped YouTube URLs to the videos containing the keyword.

It also supports semantic search via the OpenAI embeddings API using chromadb.
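To illustrate what a time stamped URL looks like, here is a minimal sketch that turns a subtitle timestamp into a link that starts playback at that point. The function names and the `youtu.be/<id>?t=<seconds>` link shape are assumptions made for illustration; the exact URL format yt-fts prints may differ.

```python
def timestamp_to_seconds(timestamp: str) -> int:
    """Convert a VTT-style 'HH:MM:SS.mmm' timestamp to whole seconds."""
    hours, minutes, seconds = timestamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + int(float(seconds))


def timestamped_url(video_id: str, timestamp: str) -> str:
    """Build a YouTube link that starts playback at the given timestamp."""
    # Hypothetical helper: the short-link form with ?t= is an assumption.
    return f"https://youtu.be/{video_id}?t={timestamp_to_seconds(timestamp)}"
```

For example, `timestamped_url("abc123", "01:00:05.000")` points 3605 seconds into the hypothetical video `abc123`.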

Installation

pip install yt-fts

yt-dlp dependency:

This project requires yt-dlp installed globally. Platform specific installation instructions are available on the yt-dlp wiki.

# MacOS/Homebrew
brew install yt-dlp
# Windows/winget
winget install yt-dlp
# pip
python3 -m pip install -U yt-dlp

download

Download subtitles for a channel.

Takes a channel url or id as an argument. Specify the number of jobs to parallelize the download with the --number-of-jobs option.

yt-fts download --number-of-jobs 5 "https://www.youtube.com/@3blue1brown"
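As a rough illustration of what parallelizing a download across several jobs can mean, the sketch below fans a list of video ids out over a thread pool. This is not yt-fts's actual implementation; `run_jobs` and `fake_fetch` are hypothetical names, and `fake_fetch` only stands in for a real subtitle download.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


def run_jobs(video_ids: Iterable[str],
             fetch_subtitles: Callable[[str], str],
             number_of_jobs: int = 1) -> List[str]:
    """Call fetch_subtitles on every video id using a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=number_of_jobs) as pool:
        # map() yields results in the same order as the input ids.
        return list(pool.map(fetch_subtitles, video_ids))


def fake_fetch(video_id: str) -> str:
    # Hypothetical stand-in for downloading one video's subtitle file.
    return f"{video_id}.vtt"


files = run_jobs(["a", "b", "c"], fake_fetch, number_of_jobs=5)
```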

list

List saved channels.

The (ss) next to the channel name indicates that the channel has semantic search enabled.

yt-fts list
┏━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ ID ┃ Name                  ┃ Count ┃ Channel ID               ┃
┡━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ 1  │ ChessPage1 (ss)       │ 19    │ UCO2QPmnJFjdvJ6ch-pe27dQ │
│ 2  │ 3Blue1Brown           │ 127   │ UCYO_jab_esuFRV4b17AJtAw │
│ 3  │ george hotz archive   │ 410   │ UCwgKmJM4ZJQRJ-U5NjvR2dg │
│ 4  │ The Tim Dillon Show   │ 288   │ UC4woSp8ITBoYDmjkukhEhxg │
│ 5  │ Academy of Ideas (ss) │ 190   │ UCiRiQGCHGjDLT9FQXFW0I3A │
└────┴───────────────────────┴───────┴──────────────────────────┘

search (Full Text Search)

Full text search for a string in saved channels.

  • The search string does not have to be a word-for-word match.
  • Search strings are limited to 40 characters.

# search in all channels
yt-fts search "[search query]" 

# search in channel 
yt-fts search "[search query]" --channel "[channel name or id]" 

# search in specific video
yt-fts search "[search query]" --video "[video id]"

# limit results 
yt-fts search "[search query]" --limit "[number of results]" --channel "[channel name or id]"

# export results to csv
yt-fts search "[search query]" --export --channel "[channel name or id]" 

Advanced Search Syntax:

The search string supports SQLite's Enhanced Query Syntax, which includes features such as prefix queries that match parts of a word.

# AND search
yt-fts search "knife AND Malibu" --channel "The Tim Dillon Show" 

# OR SEARCH 
yt-fts search "knife OR Malibu" --channel "The Tim Dillon Show" 

# wild cards
yt-fts search "rea* kni* Mali*" --channel "The Tim Dillon Show" 
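The queries above can be tried directly against SQLite to see how the syntax behaves. The following is a minimal sketch using Python's built-in `sqlite3` module and an FTS5 virtual table; the table name, column names, and sample rows are made up for illustration and are not yt-fts's real schema. It assumes your SQLite build includes FTS5.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical table and column names; requires an SQLite build with FTS5.
con.execute("CREATE VIRTUAL TABLE subtitles USING fts5(video_id, text)")
con.executemany(
    "INSERT INTO subtitles VALUES (?, ?)",
    [
        ("vid1", "he pulled a real knife in Malibu"),
        ("vid2", "a knife is a kitchen tool"),
        ("vid3", "we drove out to Malibu"),
    ],
)


def search(query: str) -> list:
    """Return the ids of rows whose content matches the FTS5 query."""
    rows = con.execute(
        "SELECT video_id FROM subtitles WHERE subtitles MATCH ? ORDER BY video_id",
        (query,),
    )
    return [row[0] for row in rows]


and_hits = search("knife AND Malibu")      # both terms must appear
or_hits = search("knife OR Malibu")        # either term may appear
prefix_hits = search("rea* kni* Mali*")    # every prefix must match a word
```

With this sample data only `vid1` satisfies the AND and prefix queries, while the OR query matches all three rows.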

Semantic Search

You can enable semantic search for a channel by using the get-embeddings command. This requires an OpenAI API key set in the environment variable OPENAI_API_KEY, or you can pass the key with the --openai-api-key flag.

get-embeddings

Fetches OpenAI embeddings for the specified channel.

# make sure the OpenAI key is set
# export OPENAI_API_KEY="[yourOpenAIKey]"

yt-fts get-embeddings --channel "3Blue1Brown"

After the embeddings are saved you will see a (ss) next to the channel name when you list channels and you will be able to use the vsearch command for that channel.

vsearch (Semantic Search)

vsearch stands for "vector search". It requires that you first enable semantic search for a channel with get-embeddings. It has the same options as search, but the output is sorted by similarity to the search string and the default return limit is 10.

# search by channel name
yt-fts vsearch "[search query]" --channel "[channel name or id]"

# search in specific video
yt-fts vsearch "[search query]" --video "[video id]"

# limit results 
yt-fts vsearch "[search query]" --limit "[number of results]" --channel "[channel name or id]"

# export results to csv
yt-fts vsearch "[search query]" --export --channel "[channel name or id]" 

How To

Export search results: For both the search and vsearch commands, the --export flag saves the results to a csv file in the current directory.

yt-fts search "life in the big city" --export
yt-fts vsearch "existing in large metropolitan center" --export

Delete a channel: You can delete a channel with the delete command.

yt-fts delete --channel "3Blue1Brown"

Update a channel: The update command currently only works for full text search and will not update the semantic search embeddings.

yt-fts update --channel "3Blue1Brown"

Export all of a channel's transcripts: This command creates a directory in the current working directory named after the YouTube channel id of the specified channel.

# Export to vtt
yt-fts export --channel "[id/name]" --format "[vtt/txt]"

yt-fts's Issues

Prevent duplicate subtitle entries in db

The current way we parse vtt files inserts duplicate quote entries with timestamps off by a couple of seconds. This is because the vtt files we get from yt-dlp contain duplicate entries, except that one of them has a bunch of markup to segment the quote. See line 192. Removing these duplicates would probably speed things up.
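One possible way to drop such duplicates, sketched under assumptions: cues arrive as `(start_time, text)` pairs, the markup consists of angle-bracket tags, and a duplicate is a cue whose cleaned text equals the previous cue's. The function names are hypothetical and this is not the project's parser.

```python
import re
from typing import List, Tuple

TAG_RE = re.compile(r"<[^>]+>")


def strip_markup(text: str) -> str:
    """Remove inline tags such as <c> or <00:00:01.500> and tidy whitespace."""
    return " ".join(TAG_RE.sub("", text).split())


def dedupe_cues(cues: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep the first (start_time, text) pair of each run of identical text."""
    result = []
    previous = None
    for start, text in cues:
        clean = strip_markup(text)
        if not clean:
            continue  # skip cues that contained only markup
        if clean != previous:
            result.append((start, clean))
        previous = clean
    return result


cues = [
    ("00:00:01.000", "hello<00:00:01.500><c> world</c>"),
    ("00:00:03.000", "hello world"),
    ("00:00:05.000", "next line"),
]
unique = dedupe_cues(cues)
```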

[Feature request] Allow searching only some videos in channel

This is an alternative to #18 to achieve similar goals.

It would be nice to be able to supply a regex on video titles as well as searching for content.

Using Lex Fridman's channel as an example:

His podcast has 376 videos: https://www.youtube.com/playlist?list=PLrAXtmErZgOdP_8GztsuKi9nrraNbKKp4

However his "channel" has 689 videos: https://www.youtube.com/@lexfridman/videos

After downloading the channel content and querying through the episodes, a regex of /(Podcast)(?! Clips)/ will return all his podcast episodes but none of the other content.

This is obviously not as reliable as allowing a playlist URL but it might be a handy feature nonetheless and would seemingly only involve adjusting the search command with a new flag.
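The title filter described above can be sketched with Python's `re` module. The regex is the one from this issue; the titles are invented examples and `filter_titles` is a hypothetical helper, not an existing yt-fts flag.

```python
import re
from typing import List, Pattern

# "Podcast" not immediately followed by " Clips" (negative lookahead).
PODCAST_ONLY = re.compile(r"(Podcast)(?! Clips)")


def filter_titles(titles: List[str], pattern: Pattern[str]) -> List[str]:
    """Keep only the titles in which the pattern is found."""
    return [title for title in titles if pattern.search(title)]


# Made-up titles for illustration.
titles = [
    "Guest A: Physics | Lex Fridman Podcast #1",
    "Guest A on gravity | Lex Fridman Podcast Clips",
    "MIT lecture on deep learning",
]
episodes = filter_titles(titles, PODCAST_ONLY)
```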

Implement sqlite_utils full-text search

from pr #17

As suggested on HN, yt-fts is currently using the LIKE operator for searches.
The goal here is to leverage SQLite FTS5 full-text search using the sqlite_utils library.

HN suggestion:

It looks like you're running searches using LIKE: https://github.com/NotJoeMartinez/yt-fts/blob/050981c0519a96...

SQLite has a really powerful full-text search mechanism built in - FTS5. It can handle things like stemming and stop words and relevance ranking.

My sqlite-utils Python library includes helper methods for setting that up: https://sqlite-utils.datasette.io/en/stable/python-api.html#...

Search across channels

It would be nice if it were possible to search across all downloaded channels.
Maybe with an --all flag?

Missing LICENSE

Hi, what is the license of that code? The LICENCE file is missing.

[Feature request] Playlist support

Please add playlist support. Many video collections of interest are organized in playlists and not channels. I don't know if the identifier for playlists is in a different namespace. yt-dlp supports playlists.

No such file or directory: 'yt-dlp'

I tried to run the example python yt_fts.py download "https://www.youtube.com/@TimDillonShow/videos"
UC4woSp8ITBoYDmjkukhEhxg

and consistently end up with an error No such file or directory: 'yt-dlp'


Downloading channel
Saving vtt files to /var/folders/x7/0r36c9sn7yg7tvs5sdm471000000gn/T/tmpbrh06qzz
The Tim Dillon Show
Traceback (most recent call last):
  File "/Users/saif/WORKSPACE/yt-fts/yt_fts.py", line 273, in <module>
    cli()
  File "/Users/saif/opt/anaconda3/envs/yt/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/Users/saif/opt/anaconda3/envs/yt/lib/python3.10/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/Users/saif/opt/anaconda3/envs/yt/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/saif/opt/anaconda3/envs/yt/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/saif/opt/anaconda3/envs/yt/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/Users/saif/WORKSPACE/yt-fts/yt_fts.py", line 31, in download
    download_channel(channel_id)
  File "/Users/saif/WORKSPACE/yt-fts/yt_fts.py", line 84, in download_channel
    subprocess.run([
  File "/Users/saif/opt/anaconda3/envs/yt/lib/python3.10/subprocess.py", line 503, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/Users/saif/opt/anaconda3/envs/yt/lib/python3.10/subprocess.py", line 971, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/Users/saif/opt/anaconda3/envs/yt/lib/python3.10/subprocess.py", line 1863, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'yt-dlp'
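The traceback above is what `subprocess.run` raises when the program it is asked to start is not on PATH. A minimal sketch of a friendlier pre-flight check, using the standard library's `shutil.which`; the helper name and error message are hypothetical.

```python
import shutil


def require_executable(name: str) -> str:
    """Return the full path of an executable, or fail with a clear message."""
    path = shutil.which(name)
    if path is None:
        raise RuntimeError(
            f"'{name}' was not found on PATH; install it before running download"
        )
    return path
```

Calling `require_executable("yt-dlp")` before invoking `subprocess.run` would replace the long traceback with a single actionable message.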

Only fetch videos with CC

I think yt-dlp fetches all the videos in a channel, then fetches the stats of each video (checking to see if there are captions).

Large channels with only a single-digit number of captioned videos are slow to download (and hit API limits).

The (paid and official) YouTube API allows you to retrieve the video IDs with captions in a specific channel.

curl

curl \
  'https://youtube.googleapis.com/youtube/v3/search?channelId=[ChannelID]&part=id&type=video&videoCaption=closedCaption&key=[KEY]' \
  --header 'Accept: application/json' \
  --compressed

response

{
  "kind": "youtube#searchListResponse",
  "etag": "995jyKTI3Q_SpXkNvcBCDR77qP0",
  "nextPageToken": "CAUQAA",
  "regionCode": "",
  "pageInfo": {
    "totalResults": 141,
    "resultsPerPage": 5
  },
  "items": [
    {
      "kind": "youtube#searchResult",
      "etag": "",
      "id": {
        "kind": "youtube#video",
        "videoId": ""
      }
    },
    {
      "kind": "youtube#searchResult",
      "etag": "",
      "id": {
        "kind": "youtube#video",
        "videoId": ""
      }
    },
    {
      "kind": "youtube#searchResult",
      "etag": "",
      "id": {
        "kind": "youtube#video",
        "videoId": ""
      }
    },
    {
      "kind": "youtube#searchResult",
      "etag": "",
      "id": {
        "kind": "youtube#video",
        "videoId": ""
      }
    },
    {
      "kind": "youtube#searchResult",
      "etag": "",
      "id": {
        "kind": "youtube#video",
        "videoId": ""
      }
    }
  ]
}
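The request and response above could be wrapped roughly as follows. This sketch only builds the request URL and pulls video ids out of a decoded response; it performs no network call. The query parameters come from the curl example, `pageToken` is the Data API's paging parameter, and the function names and sample values are hypothetical.

```python
from typing import Dict, List, Optional
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://youtube.googleapis.com/youtube/v3/search"


def build_search_url(channel_id: str, api_key: str,
                     page_token: Optional[str] = None) -> str:
    """Build a search URL restricted to videos with closed captions."""
    params = {
        "channelId": channel_id,
        "part": "id",
        "type": "video",
        "videoCaption": "closedCaption",
        "key": api_key,
    }
    if page_token:
        # Pass the previous response's nextPageToken to fetch the next page.
        params["pageToken"] = page_token
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


def video_ids(response: Dict) -> List[str]:
    """Extract the video ids from a decoded searchListResponse."""
    return [item["id"]["videoId"] for item in response.get("items", [])]


# Hand-written sample shaped like the response above (ids are placeholders).
sample = {
    "nextPageToken": "CAUQAA",
    "items": [
        {"id": {"kind": "youtube#video", "videoId": "vid-1"}},
        {"id": {"kind": "youtube#video", "videoId": "vid-2"}},
    ],
}
ids = video_ids(sample)
next_url = build_search_url("CHANNEL", "KEY", sample["nextPageToken"])
```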

Support downloading a specific quote as audio or video.

Imagine you need a sound bite. Currently the workflow is as follows:

  1. You run download for all the subs.
  2. Then you search and find a quote that fits your needs.
  3. Now you need to manually download the file with yt-dl. yt-dl <link> or yt-dl -x <link>
  4. Next step is to cut the media file: ffmpeg -i <input file> -ss <ts> -t <duration> -acodec copy -vcodec copy <output. file>

A streamlined workflow could look like this:

  1. download channel subs
  2. search key words
  3. Get quote id from listing
  4. yt-fts quote-dl --audio <ID> to download sound or video bite. Maybe this needs a duration argument?
  5. You find a file named <video-ID>-<quote-ID><Sanitized Quote>.mp3 (or similar) in your working dir.

Done. yt-fts would download the file as specified (e.g. via --audio or --video) and cut it to bits.

Is this something that is in scope of this project? Do any users have this use case?
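As a sketch of the cutting step in the proposal above, the helpers below build the ffmpeg argument list from the manual workflow and a sanitized output file name. Nothing is executed; `quote-dl` does not exist, and the helper names, the dash-separated file name, and the 40-character cap are all assumptions.

```python
import re
from typing import List


def sanitize(quote: str, max_length: int = 40) -> str:
    """Reduce a quote to lowercase words joined by dashes, safe for file names."""
    words = re.findall(r"[a-z0-9]+", quote.lower())
    return "-".join(words)[:max_length]


def bite_filename(video_id: str, quote_id: int, quote: str,
                  extension: str = "mp3") -> str:
    """Hypothetical naming scheme: <video-ID>-<quote-ID>-<sanitized quote>.<ext>."""
    return f"{video_id}-{quote_id}-{sanitize(quote)}.{extension}"


def cut_command(input_file: str, start: str, duration: str,
                output_file: str) -> List[str]:
    """Argument list mirroring the ffmpeg invocation from the manual workflow."""
    return ["ffmpeg", "-i", input_file, "-ss", start, "-t", duration,
            "-acodec", "copy", "-vcodec", "copy", output_file]


name = bite_filename("abc", 7, "Life in the Big City!")
command = cut_command("abc.mp3", "00:01:30", "5", name)
```

The resulting list could be handed to `subprocess.run(command, check=True)` once the source file has been downloaded.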

Save database and config files to user .config folder

The script currently saves the database to the current working directory. Ideally it should live somewhere like ~/.local/share/yt-fts/subtitles.db. I don't know the best practices for writing software that "invites itself" into a user's config directories.

My general questions are:

  • Do I prompt the user for a config path or just make one without asking?
  • Where do I store these configs on different platforms?
  • Do packages installed through pypi have the system permissions to do this on their own?
  • How is pip uninstall yt-fts supposed to know where this is?

Support for Windows

Hi, this looks like a promising tool. A few points to hopefully help towards Windows support:

  1. The README should be updated with instructions to set up a venv using activate.bat.

  2. What Python version(s) are supported? What versions do we know work with yt-fts?

  3. In its current state, the download command fails to run on Windows. Here is the output from my terminal:

python yt_fts.py download "https://www.youtube.com/@TimDillonShow/videos"

UC4woSp8ITBoYDmjkukhEhxg
Downloading channel
Saving vtt files to C:\Users\FOO\AppData\Local\Temp\tmp6oqtgfyb
The Tim Dillon Show
Traceback (most recent call last):
  File "C:\Users\FOO\Documents\git\yt-fts\yt_fts.py", line 273, in <module>
    cli()
  File "C:\Users\FOO\Documents\git\yt-fts\.env\Lib\site-packages\click\core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\USER\Documents\git\yt-fts\.env\Lib\site-packages\click\core.py", line 1055, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "C:\Users\FOO\Documents\git\yt-fts\.env\Lib\site-packages\click\core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\FOO\Documents\git\yt-fts\.env\Lib\site-packages\click\core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\FOO\Documents\git\yt-fts\.env\Lib\site-packages\click\core.py", line 760, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\FOO\Documents\git\yt-fts\yt_fts.py", line 31, in download
    download_channel(channel_id)
  File "C:\Users\FOO\Documents\git\yt-fts\yt_fts.py", line 84, in download_channel
    subprocess.run([
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.11_3.11.1008.0_x64__qbz5n2kfra8p0\Lib\subprocess.py", line 548, in run
    with Popen(*popenargs, **kwargs) as process:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.11_3.11.1008.0_x64__qbz5n2kfra8p0\Lib\subprocess.py", line 1024, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.11_3.11.1008.0_x64__qbz5n2kfra8p0\Lib\subprocess.py", line 1509, in _execute_child
    hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [WinError 2] The system cannot find the file specified

README: How to use old subtitles db?

The program downloaded many subtitles. It took some time. But then I closed the terminal session, and now it does not appear to recognise the .db file in the current working path. Is there a way to specify the db file to use?

Cookies consent page

Hi,

First, thanks for this tool, really useful.

As reported on HN by users in Europe, a YouTube cookie consent page blocks retrieval of the channel_id first and, consequently, all other requests.

French version

English version

File ".../yt-fts/yt_fts.py", line 29, in download
    channel_id = get_channel_id(channel_url)
  File ".../yt-fts/yt_fts.py", line 176, in get_channel_id
    channel_id = re.search('channelId":"(.{24})"', html).group(1)
AttributeError: 'NoneType' object has no attribute 'group'

I already faced this issue and adding a cookie indicating that consent has been given to a requests session can "solve" this.

s = requests.session()
s.cookies.set("CONSENT", "YES+1")
[...]
res = s.get(url)

In order to respect the initial goal of this consent page, we can ask the user to give their consent through a CLI argument like so:

python yt_fts.py download "https://www.youtube.com/@ycombinator/videos" --cookies_consent=1

This is just a suggestion. It could also be a question prompted in the CLI during download, but that requires knowing that the user is in Europe (or it could apply to all users, which would be annoying if it is not really needed).

I tried to analyse "Reject all" selection behavior but the CONSENT cookie's content is still PENDING+{RANDOM NUMBER} (perhaps not random from Google's POV but I couldn't explain this value) so from my point of view only "Accept all" is "working".
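Independently of the consent question, the AttributeError in the traceback above arises because `re.search` returns None when the pattern is absent, as it is on a consent page. A minimal sketch of a guarded lookup, reusing the regex from the traceback; the function name and sample HTML strings are hypothetical.

```python
import re
from typing import Optional

# Same pattern as in the traceback: a 24-character channel id.
CHANNEL_ID_RE = re.compile(r'channelId":"(.{24})"')


def extract_channel_id(html: str) -> Optional[str]:
    """Return the channel id found in the page, or None if it is missing."""
    match = CHANNEL_ID_RE.search(html)
    return match.group(1) if match else None


found = extract_channel_id('{"channelId":"UC4woSp8ITBoYDmjkukhEhxg"}')
missing = extract_channel_id("<html>cookie consent page</html>")
```

A caller can then test for None and print a hint about the consent page instead of crashing.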

Do you have any thoughts about this?

Kind regards,

Fix default database config not being created

On macOS/Linux the default config path should be

db_path = f"{os.path.join(os.getenv('HOME'), '.config', 'yt-fts')}/subtitles.db"

On Windows

db_path = f"{os.path.join(os.getenv('APPDATA'), 'yt-fts')}/subtitles.db"

for some reason it's defaulting to the current directory
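A sketch of how the two default paths above could be computed in one place with `pathlib`, and the directory created before use. The function names are hypothetical, the platform and environment are passed in so the logic can be exercised without touching the real system, and this does not diagnose why the current code falls back to the working directory.

```python
from pathlib import Path
from typing import Mapping


def default_db_path(platform: str, env: Mapping[str, str]) -> Path:
    """Return the default subtitles.db location for the given platform."""
    if platform.startswith("win"):
        base = Path(env["APPDATA"]) / "yt-fts"
    else:
        base = Path(env["HOME"]) / ".config" / "yt-fts"
    return base / "subtitles.db"


def ensure_db_dir(db_path: Path) -> Path:
    """Create the parent directory if needed and return the path unchanged."""
    db_path.parent.mkdir(parents=True, exist_ok=True)
    return db_path


linux_path = default_db_path("linux", {"HOME": "/home/me"})
```

In real use the arguments would be `sys.platform` and `os.environ`.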

Alias for channel

Hi, first thanks for this useful package!

It would be great if it could support aliases.

For example:

python3 yt_fts.py alias [NAME] [channel_id]
python3 yt_fts.py search [ALIAS_NAME or ID] [search text]

Even better: when downloading, we could also specify the alias and have it created automatically.
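One way the alias idea could be stored, sketched with an in-memory SQLite table. The table name, column names, and helper functions are all hypothetical; yt-fts has no alias command as of this issue.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical schema: one row per alias, pointing at a channel id.
con.execute(
    "CREATE TABLE aliases (alias TEXT PRIMARY KEY, channel_id TEXT NOT NULL)"
)


def add_alias(alias: str, channel_id: str) -> None:
    """Create the alias, or repoint it if it already exists."""
    con.execute("INSERT OR REPLACE INTO aliases VALUES (?, ?)", (alias, channel_id))


def resolve(name_or_id: str) -> str:
    """Return the channel id behind an alias, or the input if it is no alias."""
    row = con.execute(
        "SELECT channel_id FROM aliases WHERE alias = ?", (name_or_id,)
    ).fetchone()
    return row[0] if row else name_or_id


add_alias("tim", "UC4woSp8ITBoYDmjkukhEhxg")
resolved = resolve("tim")
```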

Update database

Hi, can I update my database without downloading all of a YouTube channel's subtitles again?

Support Live Streamed Videos

It seems that the download command only downloads transcripts of uploaded videos.
It would be nice to also support videos that were live streamed.
