Embed-VTT ✨

This repo uses OpenAI embeddings and the Pinecone vector database to generate and query embeddings from a VTT file.

This repo was built to add semantic search as an extra resource for understanding Andrej Karpathy's video Let's build GPT, but it is general enough to use with any transcript.

Shoutout to miguel's yt-whisper library for helping with the YouTube transcription. The data/ directory in this repo was generated using the small Whisper model.

Setup

Install

pip install -r requirements.txt

Environment

cp .env.sample .env

You'll need API keys from OpenAI and Pinecone:
OPENAI_KEY=***
PINECONE_KEY=***
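
How the script itself reads these values is not shown here, so the following is only a minimal sketch of one way to load and validate them with the standard library. The helper names `load_env` and `require_keys` are hypothetical; only the key names `OPENAI_KEY` and `PINECONE_KEY` come from the sample above.

```python
from pathlib import Path


def load_env(path=".env"):
    """Parse simple KEY=VALUE lines into a dict, skipping blanks and comments."""
    values = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def require_keys(env, names=("OPENAI_KEY", "PINECONE_KEY")):
    """Return the requested keys, raising early if any is missing or empty."""
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise KeyError(f"missing keys in environment file: {', '.join(missing)}")
    return {name: env[name] for name in names}
```

Calling `require_keys(load_env())` before any API call makes a missing or empty key fail immediately with a clear message rather than as an authentication error later on.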

Pinecone

Head over to Pinecone and create an index with dimension 1536, matching the size of the embedding vectors this repo generates.

Data

The data in this repo was generated from Let's build GPT using yt-whisper.

  • /data/karpathy.vtt - contains the raw VTT file
  • /data/karpathy_embeddings.csv - contains the dataframe with the embeddings. You can use this file to directly seed your Pinecone index.

Usage

Generate Embeddings from VTT file

This saves the embeddings to a CSV file named {file_name}_embeddings.csv.

python embed_vtt.py generate --vtt-file="data/karpathy.vtt"
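
Before any text can be embedded, the VTT file has to be split into timed segments. The sketch below shows one way to do that with the standard library. It is an illustration and not necessarily how `embed_vtt.py` parses the file: the `parse_vtt` function and the `start`/`end`/`text` field names are hypothetical.

```python
import re

# A cue timing line looks like "00:48:48.960 --> 00:48:55.920";
# the hours part is optional in WebVTT.
TIMING = re.compile(
    r"^((?:\d{2,}:)?\d{2}:\d{2}\.\d{3})\s+-->\s+((?:\d{2,}:)?\d{2}:\d{2}\.\d{3})"
)


def parse_vtt(text):
    """Return a list of {"start", "end", "text"} dicts, one per cue."""
    cues = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        match = TIMING.match(line)
        if match:
            current = {"start": match.group(1), "end": match.group(2), "text": ""}
            cues.append(current)
        elif not line:
            current = None  # a blank line ends the current cue
        elif current is not None:
            # Join multi-line cue text into a single string.
            current["text"] = (current["text"] + " " + line).strip()
    return cues
```

Each resulting segment carries its own timestamps, so a query result can point back to the exact moment in the video.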

Upload Embeddings from a CSV Embedding file

python embed_vtt.py upload --csv-embedding-file="data/karpathy_embeddings.csv"
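
Uploading a long transcript usually means sending vectors in batches and checking that each one matches the index dimension. The sketch below illustrates that preparation step only and makes no Pinecone calls. The `embedding` column name, the `(id, vector, metadata)` record layout, and the batch size of 100 are assumptions for illustration, not details taken from `embed_vtt.py`.

```python
from itertools import islice

EXPECTED_DIM = 1536  # index dimension from the Setup section


def to_records(rows, dim=EXPECTED_DIM):
    """Yield (id, vector, metadata) tuples, rejecting vectors of the wrong size."""
    for i, row in enumerate(rows):
        vector = row["embedding"]  # hypothetical column name
        if len(vector) != dim:
            raise ValueError(f"row {i}: expected {dim} values, got {len(vector)}")
        # Everything except the vector travels along as metadata.
        metadata = {k: v for k, v in row.items() if k != "embedding"}
        yield (str(i), vector, metadata)


def batched(records, size=100):
    """Yield lists of at most `size` records."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch
```

Validating the dimension locally surfaces a mismatched embedding model before any request is sent.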

Query Embeddings from text

python embed_vtt.py query --text="the usefulness of trill tensors"

Sample output (score, transcript text, timestamp range):

0.81: But let me talk through it. It uses softmax. So trill here is this matrix, lower triangular ones. 00:54:52.240-00:55:01.440
0.81: but torches this function called trill, which is short for a triangular, something like that. 00:48:48.960-00:48:55.920
0.80: which is a very thin wrapper around basically a tensor of shape vocab size by vocab size. 00:23:17.920-00:23:23.280
0.79: I'm creating this trill variable. Trill is not a parameter of the module. So in sort of pytorch 01:19:36.880-01:19:42.160
0.79: does that. And I'm going to start to use the PyTorch library, and specifically the Torch.tensor 00:12:54.320-00:12:59.200
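
To give a feel for how scores like these can arise, the sketch below ranks segments by cosine similarity against a query vector and prints them in the same layout. It is a local illustration only: in this repo the ranking is done by Pinecone, and whether the index uses the cosine metric is an assumption. The function names and the `embedding`/`text`/`start`/`end` fields are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_matches(query_vector, segments, k=5):
    """Return the k (score, segment) pairs with the highest similarity."""
    scored = [(cosine_similarity(query_vector, s["embedding"]), s) for s in segments]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def format_match(score, segment):
    """Render one result as 'score: text start-end'."""
    return f"{score:.2f}: {segment['text']} {segment['start']}-{segment['end']}"
```

A score close to 1 means the segment's embedding points in nearly the same direction as the query's, which is why near-paraphrases of the query rank first.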

License

This script is open-source and licensed under the MIT License.

Contributors

  • gmchad
