Scripts for searching a collection of YouTube subtitles. In which video did your favourite YouTuber use a certain word? Now you can find out without going through their videos manually.
The search scripts expect a collection of clean .srt subtitles.
You can use autocollect_subs.py to collect them:
python autocollect_subs.py {URL}
The URL can be anything that youtube-dl can get subtitles from, e.g. a YouTube channel, a YouTube playlist, or just a single video.
By default, the script currently downloads all subtitles. Consider changing the youtube-dl options to what you need, e.g. '--write-auto-sub', '--sub-lang', 'en'.
Requires Windows 10 64-bit and Python 3.x installed. Making it work on other platforms should only require minimal effort, though.
All of these files need to be in the same folder as the autocollect_subs.py script:
downsub.com gives you clean .srt subtitles from YouTube videos right away.
youtube-dl can take care of downloading the subtitles for us:
./youtube-dl --write-auto-sub --sub-lang de --sub-format vtt --skip-download [channel or playlist URL] -o "%(upload_date)s-%(id)s.%(ext)s"
This gives us a collection of .vtt files.
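If you would rather start the download from Python, the same command line can be assembled as an argument list and passed to subprocess. This is only a sketch: the function names build_subtitle_command and download_subtitles are made up for illustration and are not part of this repository, and it assumes a youtube-dl executable is reachable under the given name.

```python
import subprocess


def build_subtitle_command(url, lang="de", executable="youtube-dl"):
    """Return the youtube-dl argument list for fetching auto-generated subtitles only.

    Hypothetical helper; the flags mirror the command line shown above.
    """
    return [
        executable,
        "--write-auto-sub",          # use YouTube's auto-generated subtitles
        "--sub-lang", lang,          # subtitle language, e.g. "de" or "en"
        "--sub-format", "vtt",       # ask for WebVTT files
        "--skip-download",           # do not download the video itself
        "-o", "%(upload_date)s-%(id)s.%(ext)s",
        url,
    ]


def download_subtitles(url, lang="de", executable="youtube-dl"):
    """Run youtube-dl; raises CalledProcessError if it exits with an error."""
    subprocess.run(build_subtitle_command(url, lang, executable), check=True)
```

Passing the arguments as a list avoids shell quoting problems with the "%(...)s" output template.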
youtube-dl can convert subtitles using ffmpeg, but only when the video is downloaded. Use ffmpeg directly to convert .vtt subtitles to .srt:
./ffmpeg -i '{vtt-FILE}.vtt' '{SRT-FILE}.srt'
You can use the 'convert_to_srt_local' scripts to convert several subtitles in an automated way. Place the script in a folder with the .vtt subtitles and ffmpeg.exe, then execute the script.
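The batch conversion boils down to running the ffmpeg command once per .vtt file. The following is a rough sketch of that idea, not the actual 'convert_to_srt_local' script; the names conversion_commands and convert_all are hypothetical, and it assumes ffmpeg can be called under the name given in the ffmpeg parameter.

```python
import subprocess
from pathlib import Path


def conversion_commands(folder, ffmpeg="ffmpeg"):
    """Return one ffmpeg command per .vtt file in folder, sorted by file name."""
    commands = []
    for vtt in sorted(Path(folder).glob("*.vtt")):
        srt = vtt.with_suffix(".srt")  # same name, .srt extension
        commands.append([ffmpeg, "-i", str(vtt), str(srt)])
    return commands


def convert_all(folder, ffmpeg="ffmpeg"):
    """Run every conversion; stops with CalledProcessError on the first failure."""
    for command in conversion_commands(folder, ffmpeg):
        subprocess.run(command, check=True)
```

On Windows, with ffmpeg.exe in the same folder, something like convert_all(".", ffmpeg="./ffmpeg.exe") would be the equivalent call.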
When you download YouTube subtitles with youtube-dl and convert them to .srt using ffmpeg, the subtitles have overlapping and duplicated lines. This Go script from nimatrueway fixes that.
Usage:
./subtitle-overlap-fixer '{SRT-FILE}.srt'
listsearch.py uses find to search in .srt lines. It expects a folder srt_fixed as a subfolder (as produced by autocollect_subs.py).
python listsearch.py term1,term2
or just run 'python listsearch.py' and enter the search terms manually.
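To illustrate the idea of a plain substring search over subtitle lines, here is a minimal sketch. It is not the code of listsearch.py: parse_terms, parse_srt and search_srt are hypothetical names, and the sketch works on the text of a single .srt file rather than on the srt_fixed folder.

```python
import re


def parse_terms(argument):
    """Split a 'term1,term2' argument into a list of non-empty terms."""
    return [term.strip() for term in argument.split(",") if term.strip()]


def parse_srt(text):
    """Yield (timing, caption) pairs from the contents of an .srt file."""
    text = text.replace("\r\n", "\n")
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        # A cue is an index line, a timing line, then the caption lines.
        timing_at = next((i for i, line in enumerate(lines) if "-->" in line), None)
        if timing_at is None:
            continue  # not a valid cue, skip it
        yield lines[timing_at], " ".join(lines[timing_at + 1:])


def search_srt(text, terms):
    """Return (term, timing, caption) for every cue that contains a term."""
    hits = []
    for timing, caption in parse_srt(text):
        for term in terms:
            # str.find returns -1 when the term is absent
            if caption.lower().find(term.lower()) != -1:
                hits.append((term, timing, caption))
    return hits
```

The search is case-insensitive here, which is a choice of this sketch and may differ from the actual script.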
Uses fuzzy matching to find words.
Requires the module 'fuzzywuzzy': pip install fuzzywuzzy
Usage:
python listsearch.py term1,term2
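Fuzzy matching means that a term also matches words that are merely similar, such as a different spelling. The sketch below shows the principle using difflib from the Python standard library as a stand-in for fuzzywuzzy, so the scores will not be identical to what fuzzywuzzy reports. The names best_word_score and fuzzy_hits and the threshold of 80 are assumptions made for this example.

```python
from difflib import SequenceMatcher


def best_word_score(term, caption):
    """Return the highest similarity (0 to 100) between term and any word in caption."""
    words = caption.lower().split()
    if not words:
        return 0
    return max(
        round(100 * SequenceMatcher(None, term.lower(), word).ratio())
        for word in words
    )


def fuzzy_hits(term, captions, threshold=80):
    """Return the captions containing a word that is similar enough to term."""
    return [c for c in captions if best_word_score(term, c) >= threshold]
```

For example, fuzzy_hits("favorite", ["my favourite video", "nothing here"]) keeps only the first caption, because "favourite" is close to "favorite" while no word in the second caption is.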
Outputs a search results.csv. Each link goes directly to the video at the time where the line containing the term is said.
resline,id,duration,resurl
Line with term1,XXXXXXXXXXX,"00:06:49,199 --> 00:06:51,810",https://youtu.be/XXXXXXXXXXX?t=409
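The t parameter in the link is the start time of the subtitle line in whole seconds: 00:06:49 is 6 * 60 + 49 = 409. A minimal sketch of that conversion follows; start_seconds and video_url are hypothetical helper names, not functions taken from the scripts.

```python
def start_seconds(duration):
    """Convert the start of an SRT timing line to whole seconds."""
    start = duration.split("-->")[0].strip()   # e.g. "00:06:49,199"
    clock = start.split(",")[0]                # drop the milliseconds
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def video_url(video_id, duration):
    """Build a youtu.be link that starts playback at the subtitle line."""
    return f"https://youtu.be/{video_id}?t={start_seconds(duration)}"


print(video_url("XXXXXXXXXXX", "00:06:49,199 --> 00:06:51,810"))
# prints https://youtu.be/XXXXXXXXXXX?t=409
```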
The software AntConc is a gratis concordance tool; it might be useful for deeper analysis. You can use mae txt.py to convert the fixed .srt subtitles to .txt files for corpus analysis. More tools: https://corpus-analysis.com/
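Converting an .srt file to plain text for corpus tools mostly means dropping the cue numbers and timing lines and keeping the spoken text. The following sketch shows one way to do that; it is not the conversion script from this repository, and srt_to_text is a hypothetical name.

```python
import re

# Matches timing lines such as "00:06:49,199 --> 00:06:51,810"
TIMING = re.compile(r"^\d{2}:\d{2}:\d{2}[,.]\d{3}\s*-->")


def srt_to_text(srt_text):
    """Return only the caption lines of an .srt file, one per line."""
    kept = []
    for line in srt_text.splitlines():
        line = line.strip()
        # Skip blank lines, cue numbers and timing lines.
        # Note: a caption consisting only of digits is dropped as well.
        if not line or line.isdigit() or TIMING.match(line):
            continue
        kept.append(line)
    return "\n".join(kept)
```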