cinemattr-db

Backend and database of cinemattr.ca

How it works

  • A LangChain self-querying retriever runs over a Pinecone database of popular movies (more than 1000 IMDb ratings) released from 1950 to date.
  • OpenAI LLM translates user input to a vector database query.
  • Initial filtering is done on metadata columns (title, year, rating, actors, etc.) using comparison and logical operators such as >, <, =, AND, and OR.
  • Semantic search is done on the plot and summaries extracted for each movie.
  • The top 20 results (titles) are returned in the response.
  • The API is hosted on a rate-limited AWS Lambda function.
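
The two retrieval stages described above (metadata filtering, then semantic ranking) can be illustrated with a small standard-library sketch. This is not the project's code, which relies on LangChain and Pinecone; the `Movie` class, the `search` function, and the toy two-dimensional embeddings are all hypothetical stand-ins chosen for illustration.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Movie:
    title: str
    year: int
    rating: float
    embedding: List[float]  # hypothetical stand-in for a plot-summary embedding


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(
    movies: List[Movie],
    query_embedding: List[float],
    min_year: Optional[int] = None,
    min_rating: Optional[float] = None,
    k: int = 20,
) -> List[str]:
    # Stage 1: metadata filter, e.g. year >= min_year AND rating >= min_rating.
    candidates = [
        m for m in movies
        if (min_year is None or m.year >= min_year)
        and (min_rating is None or m.rating >= min_rating)
    ]
    # Stage 2: rank the surviving candidates by similarity to the query.
    ranked = sorted(
        candidates,
        key=lambda m: cosine(m.embedding, query_embedding),
        reverse=True,
    )
    # Return only the titles of the top k matches.
    return [m.title for m in ranked[:k]]
```

In the real service the filter and the query text are produced by the LLM from the user's input, and the vector search runs inside the vector database rather than in application code.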

How data is collected and loaded

  • Movie details and plot summaries are scraped from IMDb and Wikipedia. (db/airflow/dags/scrapers)
  • Airflow is used to orchestrate scraping jobs for every year. (db/airflow)
  • Data is loaded into a DuckDB instance. (db/duckdb)
  • dbt is used for data transformation and cleanup: cleaning text, creating the final tables, and merging data from both sources. (db/dbt)
  • Plot summaries are loaded into a Pinecone vector database (see api/load.ipynb).
  • Vector embeddings come from either:
    • HuggingFace all-mpnet-base-v2, Free - api/hf_embeddings
    • OpenAI text-embedding-ada-002, Paid - $0.0001/1000 tokens - api
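
To get a feel for what the paid option costs, the per-token price quoted above can be turned into a rough estimate. The sketch below only does the arithmetic; the function name and the token counts are hypothetical, and the real count depends on the tokenizer and on the length of each plot summary.

```python
from typing import Iterable

# Price quoted in this README for text-embedding-ada-002.
ADA_002_USD_PER_1K_TOKENS = 0.0001


def estimate_embedding_cost(
    token_counts: Iterable[int],
    usd_per_1k_tokens: float = ADA_002_USD_PER_1K_TOKENS,
) -> float:
    """Return the estimated cost in USD of embedding documents with the given token counts."""
    total_tokens = sum(token_counts)
    return total_tokens / 1000 * usd_per_1k_tokens
```

As an illustration with made-up numbers, 10,000 summaries of 500 tokens each come to 5,000,000 tokens, or about $0.50 at this rate.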

Building and Testing the Lambda API

Build

# Clean build, ignoring the layer cache
docker build -t cinemattr-api . --no-cache --platform=linux/arm64
# Incremental build, reusing cached layers
docker build -t cinemattr-api . --platform=linux/arm64

Run

docker run -p 9000:8080  --env-file .env.dev cinemattr-api

Test query

curl -XPOST "http://localhost:9000/2015-03-31/functions/function/invocations" -d '{
        "queryStringParameters":
            { "query" : "owen wilson wow"}
    }'
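
The test payload above mimics the event a Lambda function receives, with the search text under `queryStringParameters.query`. A minimal sketch of building that event and reading the query back out might look like this; `build_event` and `extract_query` are hypothetical helpers, not functions from this repository, and the defensive handling assumes the parameter block can be missing or null.

```python
import json


def build_event(query: str) -> dict:
    """Build an event with the same shape as the curl payload above."""
    return {"queryStringParameters": {"query": query}}


def extract_query(event: dict) -> str:
    """Read the search text from an event, returning '' if it is absent."""
    params = event.get("queryStringParameters") or {}
    return (params.get("query") or "").strip()


# JSON body equivalent to the one passed to curl with -d.
payload = json.dumps(build_event("owen wilson wow"))
```

Returning an empty string for a missing query lets a handler reject the request explicitly instead of failing on a missing key.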

Auth ECR

aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin $LAMBDA_ECR_REPO

Create Docker Repository

aws ecr create-repository --repository-name cinemattr-api --image-scanning-configuration scanOnPush=true --image-tag-mutability MUTABLE  --region us-east-1

Tag latest build

docker tag cinemattr-api $LAMBDA_ECR_REPO/cinemattr-api:latest

Push

docker push $LAMBDA_ECR_REPO/cinemattr-api:latest

Test Lambda Function URL

curl "$LAMBDA_API_URL?query=query"
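
Real queries usually contain spaces or other characters that need URL encoding before they are appended to the function URL. A small sketch using the standard library, where `build_query_url` is a hypothetical helper and the base URL is a placeholder for the value of `$LAMBDA_API_URL`:

```python
from urllib.parse import urlencode


def build_query_url(base_url: str, query: str) -> str:
    """Append a URL-encoded 'query' parameter to the base URL."""
    return f"{base_url}?{urlencode({'query': query})}"


# Placeholder base URL; substitute the real Lambda function URL.
url = build_query_url("https://example.invalid/", "owen wilson wow")
```

With curl alone, `curl -G --data-urlencode "query=owen wilson wow" "$LAMBDA_API_URL"` achieves the same encoding.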

Airflow

docker exec -it airflow-airflow-webserver-1 sh

Setting up data pipeline

Initialize environment

cd db
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Init Duckdb database

cd duckdb
python -m utils init

Init Airflow

cd airflow
docker compose up -d

Export final movies plot table

python -m utils db.duckdb export_movies

Contributors

carteakey
