jannis-baum / weblock

An ad blocker that secretly censors content at the deployer's discretion. An HPI-based project aiming to raise awareness of possible adverse effects of AI.

License: MIT License

Python 80.46% JavaScript 10.90% Shell 7.73% HTML 0.91%

btm nlp research short-text-semantic-similarity text-manipulation text-matching topic-modeling

weblock's Issues

Logging consistency

create dedicated helper module for server/backend/validator.py, e.g. server/helpers/validator.py
add logging.py
- create functions such as bold that wraps strings for bold printing
- single source of truth for all strings that are printed anywhere
- etc
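
A minimal sketch of what such a helper could look like. The names `bold`, `MESSAGES` and `message`, as well as the message keys, are illustrative assumptions and not taken from the repository.

```python
# Hypothetical logging helper; all names and message texts are assumptions.
BOLD = "\033[1m"   # ANSI escape code that switches bold on
RESET = "\033[0m"  # ANSI escape code that resets all attributes


def bold(text: str) -> str:
    """Wrap text in ANSI escape codes so that terminals print it in bold."""
    return f"{BOLD}{text}{RESET}"


# Single source of truth for every string that gets printed.
MESSAGES = {
    "scrape_start": "Scraping {n} articles",
    "not_in_venv": "Please activate the virtual environment first",
}


def message(key: str, **kwargs) -> str:
    """Look up a message template by key and fill in its placeholders."""
    return MESSAGES[key].format(**kwargs)
```

Usage would be along the lines of `print(bold(message("scrape_start", n=5)))`. Depending on how the module is imported, a file named `logging.py` can shadow Python's standard `logging` module, so a different file name may be safer.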

Project website

example screenshots / video
take viewer through scenario, tell story
technical overview

Finish summarization, similarity & censorship

summarization

clean up articles before summarizing
prevent using same url twice
give option to scrape multiple queries
give option to set sources

similarity

give option to set censoring topic and constrain used summaries to corresponding database entries

censorship

find new threshold value (everybody)
improve performance
- evaluating similarity to all summarizations in database takes too long (limit number of summarizations / pick newest or most relevant ones)
- shorten summarizations significantly
if enough time: consider improving censorship formula (everybody)
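
One way to limit the number of summarizations could look like the sketch below. It assumes that each summarization is stored as a (timestamp, vector) pair and that similarity is measured with cosine similarity; both are assumptions for illustration, and the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def max_similarity_to_newest(
    query: Sequence[float],
    summaries: List[Tuple[float, Sequence[float]]],
    limit: int = 50,
) -> float:
    """Compare the query only against the `limit` newest summaries."""
    # Sort by timestamp, newest first, and keep at most `limit` entries.
    newest = sorted(summaries, key=lambda entry: entry[0], reverse=True)[:limit]
    return max((cosine_similarity(query, vec) for _, vec in newest), default=0.0)
```

With this shape, the cost per evaluated phrase is bounded by `limit` rather than by the size of the database.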

Assert being in `venv` when starting scripts

create function that checks if environment variable VIRTUAL_ENV is set to path of our venv
call function in beginning of run_backend, scrape_positive and scrape_negative
throw error and exit if not in venv
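
The check described above could be sketched as follows. The function names and the error text are assumptions; `VIRTUAL_ENV` is the variable that the venv activation scripts set.

```python
import os
import sys
from pathlib import Path


def in_project_venv(venv_path: str) -> bool:
    """Return True if VIRTUAL_ENV is set and points at venv_path."""
    active = os.environ.get("VIRTUAL_ENV")
    if not active:
        return False
    # Resolve both paths so that relative paths and symlinks compare equal.
    return Path(active).resolve() == Path(venv_path).resolve()


def assert_in_venv(venv_path: str) -> None:
    """Print an error and exit with status 1 when not in the project's venv."""
    if not in_project_venv(venv_path):
        sys.exit(f"error: activate the virtual environment at {venv_path} first")
```

Each script would call `assert_in_venv(...)` with the path of the project's venv before doing any other work.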

documentation needs to mention that python3-venv is required

the title says it all

Polish scraping

add argparse to scrape_negative and offer option to specify number of articles analogous to scrape_positive
- scraping with narticles = 0 doesn't terminate in scrape_positive
- provide default narticles > 0
print progress updates (positive & negative)
clean / filter data before insertion into positive database
- the paragraph \_ recently appeared in the database; no words are left after normalization, so its topic vector has nan components (nan > 1 → it becomes the text matcher's favorite text to match)
- "copyright", "published on date x", etc paragraphs are undesired
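
A rough sketch of two of the points above: an argparse option that rejects non-positive article counts, and a filter that drops paragraphs with no words or with boilerplate phrases. The flag name `-n/--narticles`, the default of 10, and the boilerplate pattern are assumptions, and the word check is only a stand-in for the project's actual normalization.

```python
import argparse
import re


def positive_int(value: str) -> int:
    """argparse type that only accepts integers greater than 0."""
    number = int(value)
    if number <= 0:
        raise argparse.ArgumentTypeError("narticles must be greater than 0")
    return number


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Scrape articles")
    # Assumed flag name and default value, chosen for illustration only.
    parser.add_argument("-n", "--narticles", type=positive_int, default=10,
                        help="number of articles to scrape (must be > 0)")
    return parser


# Assumed examples of undesired boilerplate phrases.
BOILERPLATE = re.compile(r"copyright|published on", re.IGNORECASE)


def is_usable_paragraph(paragraph: str) -> bool:
    """Reject paragraphs that contain no words or that look like boilerplate."""
    words = re.findall(r"[A-Za-z]+", paragraph)
    return bool(words) and BOILERPLATE.search(paragraph) is None
```

Filtering before insertion keeps paragraphs such as `\_` out of the database, so no empty document reaches topic inference.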

Get project ready for submission

Keep `gensim` data in virtual environment

install script

Todo-List

use google news for summaries and clean database
use summaries for similarity
check / make more reliable on multiple sites
try a NoSQL DB and compare
prevent using one source twice
include OpenAI in sentence generation
maybe pre-generate sentences?

Find example censorship subject and mention it in `README.md`

Scrape positive infers all paragraphs of database & throw errors

paragraphs that already existed before scraping are inferred again leading to duplicated vectors
print error if run without -t and no model exists
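
A possible shape for avoiding the duplicated vectors is to infer only paragraphs that do not have a vector yet. The sketch below uses a plain dictionary as a stand-in for the database and takes the inference function as a parameter; both are assumptions, not the project's actual interfaces.

```python
from typing import Callable, Dict, Iterable, List


def infer_missing(
    paragraphs: Iterable[str],
    vectors: Dict[str, List[float]],
    infer: Callable[[str], List[float]],
) -> int:
    """Infer topic vectors only for paragraphs that have none yet.

    Returns the number of newly inferred paragraphs.
    """
    added = 0
    for paragraph in paragraphs:
        if paragraph in vectors:
            # Already inferred in an earlier run; skip to avoid duplicates.
            continue
        vectors[paragraph] = infer(paragraph)
        added += 1
    return added
```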

Find solution for `censoring-requirements`

info

currently hard-coded in run-backend
used by NLProcessor as constraint; a phrase has to have a word in common with synonyms of censoring-requirements to be considered similar

todo

do we still need this?
if so,
- add it to .env or
- infer it from provided search queries or
- different idea
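
For reference, the constraint described under info can be sketched as a simple word-overlap check. The synonym table is passed in as a plain dictionary here; how NLProcessor actually obtains synonyms is not shown in this issue, so that part and all names below are assumptions.

```python
import re
from typing import Dict, Iterable, Set


def expand_with_synonyms(
    requirements: Iterable[str], synonyms: Dict[str, Set[str]]
) -> Set[str]:
    """Return the requirement words together with all of their synonyms."""
    expanded: Set[str] = set()
    for word in requirements:
        expanded.add(word.lower())
        expanded.update(s.lower() for s in synonyms.get(word, set()))
    return expanded


def satisfies_requirements(
    phrase: str, requirements: Iterable[str], synonyms: Dict[str, Set[str]]
) -> bool:
    """True if the phrase shares at least one word with the expanded set."""
    words = set(re.findall(r"\w+", phrase.lower()))
    return not words.isdisjoint(expand_with_synonyms(requirements, synonyms))
```

If the requirements moved to `.env` or were inferred from the search queries, only the `requirements` argument would change.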

Clear out temporary files after `scrape-positive`

Text generation

~~- [ ] use OpenAI~~

pre-generation
- scrape desired sentences from positive media
- train BTM on scraping results
- infer topics for scraping results
- infer topics for client page
- match given text with desired scraped content
#10

Clean up `NLProcessor`

reduce redundancy
add download for nltk components to install script to remove from nlp.py
