
weblock's People

Contributors

dependabot[bot], jannis-baum, lucasliebe, tiny-fish-t

weblock's Issues

Logging consistency

  • create dedicated helper module for server/backend/validator.py, e.g. server/helpers/validator.py
  • add logging.py
    • create functions such as bold that wrap strings for bold printing
    • single source of truth for all strings that are printed anywhere
    • etc
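
A minimal sketch of what such a logging.py could look like. Only the name bold comes from the issue; the MESSAGES table, its keys, and the message helper are hypothetical and only illustrate the "single source of truth" idea.

```python
# Hypothetical sketch of the proposed logging helper module.
BOLD = "\033[1m"   # ANSI escape code: switch bold on
RESET = "\033[0m"  # ANSI escape code: reset all attributes


def bold(text: str) -> str:
    """Wrap text in ANSI escape codes so terminals render it bold."""
    return f"{BOLD}{text}{RESET}"


# Single source of truth for printed strings (example entries only).
MESSAGES = {
    "scrape_start": "Scraping {n} articles ...",
    "scrape_done": "Finished scraping.",
}


def message(key: str, **kwargs) -> str:
    """Look up a message template by key and fill in its placeholders."""
    return MESSAGES[key].format(**kwargs)


if __name__ == "__main__":
    print(bold(message("scrape_start", n=5)))
```

A top-level module named logging.py can shadow the standard library's logging module, so keeping it inside a package such as server/helpers/ may be the safer choice.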

Project website

  • example screenshots / video
  • take viewer through scenario, tell story
  • technical overview

Finish summarization, similarity & censorship

summarization

  • clean up articles before summarizing
  • prevent using same url twice
  • give option to scrape multiple queries
  • give option to set sources
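
One way to avoid using the same URL twice is to track seen URLs in a set. This is a sketch under assumptions: the function name unique_urls is hypothetical, and treating URLs that differ only by a trailing slash as equal is an illustrative choice, not the project's rule.

```python
def unique_urls(urls, seen=None):
    """Yield each URL once, skipping any whose key is already in `seen`."""
    seen = set() if seen is None else seen
    for url in urls:
        # Assumed normalization: ignore a trailing slash when comparing.
        key = url.rstrip("/")
        if key not in seen:
            seen.add(key)
            yield url
```

Passing the same seen set across several search queries would extend the check to the multi-query case.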

similarity

  • give option to set censoring topic and constrain used summaries to corresponding database entries

censorship

  • find new threshold value (everybody)
  • improve performance
    • evaluating similarity to all summarizations in database takes too long (limit number of summarizations / pick newest or most relevant ones)
    • shorten summarizations significantly
  • if enough time: consider improving censorship formula (everybody)
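
The "pick newest" idea for limiting the number of summarizations could be sketched as follows. The Summary record and its created_at field are assumptions, since the database schema is not shown here.

```python
import heapq
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Summary:
    # Hypothetical record; the real database entries may look different.
    text: str
    created_at: datetime


def newest_summaries(summaries, limit):
    """Return at most `limit` summaries, newest first."""
    return heapq.nlargest(limit, summaries, key=lambda s: s.created_at)
```

Similarity would then be evaluated only against the returned subset instead of every summarization in the database.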

Assert being in `venv` when starting scripts

  • create function that checks if environment variable VIRTUAL_ENV is set to path of our venv
  • call function in beginning of run_backend, scrape_positive and scrape_negative
  • throw error and exit if not in venv
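
The check described above might look like this. The function names and the error text are hypothetical; how the scripts determine the path of the project's venv is left to the caller.

```python
import os
import sys
from pathlib import Path


def in_project_venv(venv_path: Path) -> bool:
    """Return True if VIRTUAL_ENV points at the given venv directory."""
    active = os.environ.get("VIRTUAL_ENV")
    if not active:
        return False
    return Path(active).resolve() == venv_path.resolve()


def assert_venv(venv_path: Path) -> None:
    """Exit with an error message unless the project venv is active."""
    if not in_project_venv(venv_path):
        sys.exit(f"error: please activate the virtual environment at {venv_path}")
```

Each of run_backend, scrape_positive and scrape_negative would call assert_venv before doing anything else.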

Polish scraping

  • add argparse to scrape_negative and offer option to specify number of articles analogous to scrape_positive
    • scraping with narticles = 0 doesn't terminate in scrape_positive
    • provide default narticles > 0
  • print progress updates (positive & negative)
  • clean / filter data before insertion into positive database
    • the paragraph _ has recently appeared in the database; no words are left after normalization, which yields a topic vector with nan components (nan > 1 → will be the text matcher's favorite text to match)
    • "copyright", "published on date x", etc paragraphs are undesired
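
The argument handling and the paragraph filter could be sketched like this. The flag name --narticles follows the issue text; the default of 10, the helper names, and the filter patterns are placeholders rather than the project's actual values.

```python
import argparse
import re


def positive_int(value: str) -> int:
    """argparse type that rejects zero and negative article counts."""
    number = int(value)
    if number <= 0:
        raise argparse.ArgumentTypeError("narticles must be greater than 0")
    return number


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Scrape articles")
    # The default of 10 is an arbitrary placeholder.
    parser.add_argument("--narticles", type=positive_int, default=10)
    return parser


# Example patterns only; a real list would come from inspecting the data.
UNWANTED = re.compile(r"copyright|published on", re.IGNORECASE)


def keep_paragraph(paragraph: str) -> bool:
    """Keep paragraphs that contain a letter and no unwanted phrase."""
    has_word = re.search(r"[A-Za-z]", paragraph) is not None
    return has_word and not UNWANTED.search(paragraph)
```

Rejecting narticles <= 0 at parse time sidesteps the non-terminating case, and dropping paragraphs without letters keeps entries like _ out of the positive database.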

Get project ready for submission

  • installation script(s)
    • mock local database
    • use Python venvs
    • requirements.txt
    • download nltk stuff (and remove downloads from nlp.py)
  • make client-server interaction work on localhost (@lucasliebe)
    • give choice to user in extension for server to use
  • documentation (@jannis-baum)
    • add document to describe what would have to be done to make this scalable (e.g. database & RAM, server setup, program for easy modification (instead of venv), etc)
    • instructions for how to use (how to run install, scraping workflow and how to start client / server)
  • clean out real database
  • check performance on generic websites (everybody)
  • code consistency (@lucasliebe)
    • kebab vs camel case in non-.py-filenames
    • consider linter

Todo-List

  • use google news for summaries and clean database
  • using summaries for similarity
  • check / improve reliability on multiple sites
  • try a NoSQL DB and compare
  • prevent using one source twice
  • include OpenAI into sentence generation
  • maybe pre generate sentences?

Find solution for `censoring-requirements`

info

  • currently hard-coded in run-backend
  • used by NLProcessor as constraint; a phrase has to have a word in common with synonyms of censoring-requirements to be considered similar

todo

  • do we still need this?
  • if so,
    • add it to .env or
    • infer it from provided search queries or
    • different idea
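
If the value moves to .env, reading it could look roughly like this. The variable name CENSORING_REQUIREMENTS, the comma-separated format, and both function names are assumptions. The sketch also assumes the .env file has already been loaded into the process environment, and it leaves out the synonym expansion that NLProcessor performs.

```python
import os


def load_censoring_requirements(default=()):
    """Read a comma-separated word list from the environment."""
    raw = os.environ.get("CENSORING_REQUIREMENTS", "")
    words = [w.strip().lower() for w in raw.split(",") if w.strip()]
    return words or list(default)


def shares_word(phrase, allowed_words):
    """True if the phrase has at least one word in common with allowed_words."""
    return not set(phrase.lower().split()).isdisjoint(allowed_words)
```

Here allowed_words stands in for the censoring requirements together with their synonyms.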

Text generation

  • use OpenAI
  • pre-generation
    • scrape desired sentences from positive media
    • train BTM on scraping results
    • infer topics for scraping results
    • infer topics for client page
    • match given text with desired scraped content
  • #10

Clean up `NLProcessor`

  • reduce redundancy
  • add download for nltk components to install script to remove from nlp.py
