Code Monkey home page Code Monkey logo

youtubesubsearcher's Introduction

youtubesubsearcher

scripts to search in a collection of youtube subtitles. In what video did your favourite youtube person use a certain word? Now you can find out without going through their videos manually.

Step1 Build a collection of youtube subtitles

the search scripts expect a collection of srt clean subtitles. you can use autocollect_subs.py for this. python autocollect_subs.py {URL} the URL can be anything that youtube-dl can get subtitles from. e.g. youtube channel,youtube-playlist, or just a single video. the default option right now is to download all subtitles, consider changing them to you need eg '--write-auto-sub', '--sub-lang', 'en',

setup

Windows10 64bit, making it work for other platfrom should only require minimum efford though, Python 3.x installed.

All of these files in the same folder as the autocollect_subs.py script:

manually

external services

downsub.com gives you clean srt subtitles from youtube videos right away.

youtube-dl and ffmpeg

youtube-dl can take care of that for us. ./youtube-dl --write-auto-sub --sub-lang de --sub-format vtt --skip-download [channel or playlist URL] -o "%(upload_date)s-%(id)s.%(ext)s" this gives us a collection of vtt files

youtube-dl can convert subs using ffmpeg, but only when the video is downloaded. use ffmpeg directly to convert .vtt subtitles to .srt.

./ffmpeg -i '{vtt-FILE}.vtt' '{SRT-FILE}.srt'

you can use the 'convert_to_srt_local' scripts to convert serveral subtitles in an automated way. place the script in a folder with .vtt subs and ffmpeg.exe, then execute the script.

subtitle-overlap-fixer

When you download youtube subtitles with youtube-dl and convert them to .srt using ffmpeg, the subtitles have overlap and double lines. This go script from nimatrueway fixes that.

usage ./subtitle-overlap-fixer '{SRT-FILE}.srt'

Step2 search collection of youtube subtitles

listfind.py

uses find to search in srt lines. expects a folder srt_fixed as a subfolder (as produced byautocollect_subs.py )

'python listsearch.py' term1,term2 or just run `'python listsearch.py' and enter the search terms manually

listsearch.py

uses fuzzymatching for finding words have the module 'fuzzywuzzy' installed pip install fuzzywuzzy

use 'python listsearch.py' term1,term2

output a search results.csv link goes directly to video with time where the line with the term is said.

resline,id,duration,resurl

Line with term1,XXXXXXXXXXX,"00:06:49,199 --> 00:06:51,810",https://youtu.be/XXXXXXXXXXX?t=409

Other Tools

The Sofware AntConc is a gratis Concordance Tool, it might be usefull to for deeper analysis. you can use mae txt.py to convert the fixed srt subtitles to txt files for corpus analysis. More tools https://corpus-analysis.com/

youtubesubsearcher's People

Contributors

bindestriche avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.