hypertag issues from ravn-tech

Add automatic file tag suggestions by file content using Machine Learning

When a new file is added, automatically infer tags from the tags of semantically similar existing files.

Depends on #24

Support relative file paths

This will make it possible to sync hypertag.db across different machines / devices, while still working with relative file paths.
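A minimal sketch of the idea, assuming paths are stored relative to a per-machine root directory. The helper names `to_relative` and `to_absolute` are hypothetical and not part of HyperTag.

```python
from pathlib import Path


def to_relative(path: str, root: str) -> str:
    """Return `path` relative to `root`, using forward slashes for portability.

    Raises ValueError if `path` is not located below `root`.
    """
    return Path(path).resolve().relative_to(Path(root).resolve()).as_posix()


def to_absolute(rel_path: str, root: str) -> Path:
    """Rebuild an absolute path on the current machine from a stored relative path."""
    return Path(root).resolve() / rel_path
```

Only the relative string would be written to the database; each machine supplies its own root when reading it back.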

Add FS hooks to HyperTagFS dir to detect file moving / deleting (map into DB)

Watchdog looks like what we need: https://pythonhosted.org/watchdog/quickstart.html

Add slick web client

This will make HyperTag accessible to a broader audience.

Add optional values to tags

hypertag tag new_year_resolution.txt with year=2021
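One way the `name=value` form could be parsed, as a sketch. The function `parse_tag` is hypothetical; plain tags without `=` are assumed to carry no value.

```python
from typing import Optional, Tuple


def parse_tag(token: str) -> Tuple[str, Optional[str]]:
    """Split a CLI tag token into (name, value); value is None for plain tags."""
    name, separator, value = token.partition("=")
    return (name, value if separator else None)
```

With this, `year=2021` yields the name `year` with the value `2021`, while a plain tag such as `todo` keeps working unchanged.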

Evaluate CLIP performance on text to text similarity

If CLIP performs as well as DistilBERT, there is no need for DistilBERT anymore.

Improve file and tag insertion performance

Use transactions (only commit once at the end)
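A small sketch of the single-commit approach with the standard `sqlite3` module. The `files` table layout here is an assumption made for illustration, not the actual HyperTag schema.

```python
import sqlite3


def add_files(conn: sqlite3.Connection, paths: list) -> None:
    """Insert all paths inside one transaction, committing only once."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.executemany(
            "INSERT INTO files (path) VALUES (?)",
            [(path,) for path in paths],
        )


# Hypothetical schema, used only to make the example runnable.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (file_id INTEGER PRIMARY KEY, path TEXT UNIQUE)")
add_files(conn, ["a.txt", "b.txt", "c.txt"])
```

Committing once per batch instead of once per row avoids a disk sync for every insert, which is usually where the time goes.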

Add image search to HyperTagFS

Create a dedicated directory called "Search Images". All directory names created in "Search Images" are interpreted as search queries for image files, and the directories are populated with the results accordingly.

Add CLI auto-completion for existing tags

Add metatagging support to normal tagging using parent/children/baby syntax

Speed up vectorization with batch processing

Add test cases

Test basic functions that are unlikely to change behavior:

add file
import directory
add tag
add metatag
query
index (check text cleaning works for challenging file examples for pdf, html, etc.)

Save text tokens per document / page

Add new tables:

text_tokens: file_id, page_id, token_id
tokens: token_id, name
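The two tables could look roughly like this in SQLite. Column types, constraints, and the helper `save_page_tokens` are assumptions; only the table and column names come from the list above.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS tokens (
    token_id INTEGER PRIMARY KEY,
    name     TEXT NOT NULL UNIQUE
);
CREATE TABLE IF NOT EXISTS text_tokens (
    file_id  INTEGER NOT NULL,
    page_id  INTEGER NOT NULL,
    token_id INTEGER NOT NULL REFERENCES tokens(token_id)
);
"""


def save_page_tokens(conn, file_id, page_id, words):
    """Store every word of one page, reusing token ids for known words."""
    with conn:  # one transaction per page
        for word in words:
            conn.execute("INSERT OR IGNORE INTO tokens (name) VALUES (?)", (word,))
            token_id = conn.execute(
                "SELECT token_id FROM tokens WHERE name = ?", (word,)
            ).fetchone()[0]
            conn.execute(
                "INSERT INTO text_tokens (file_id, page_id, token_id) VALUES (?, ?, ?)",
                (file_id, page_id, token_id),
            )


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
save_page_tokens(conn, file_id=1, page_id=0, words=["tag", "file", "tag"])
```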

Detect leaf files and place them directly (not in files subdir)

Identify file duplicates

Add hash and size columns to files table.
On add: compute hash and size -> Ignore duplicates.
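A sketch of the duplicate check. SHA-256 is an assumed choice of hash (the issue only says "hash"), and `file_fingerprint` and `add_unique` are hypothetical helper names.

```python
import hashlib
import os
import tempfile


def file_fingerprint(path: str, chunk_size: int = 1 << 20) -> tuple:
    """Return (sha256 hex digest, size in bytes), reading the file in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest(), os.path.getsize(path)


def add_unique(paths, seen=None):
    """Return the paths whose fingerprint was not seen before (first one wins)."""
    seen = set() if seen is None else seen
    unique = []
    for path in paths:
        fingerprint = file_fingerprint(path)
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(path)
    return unique


# Demo: b.txt has the same content as a.txt and is therefore ignored.
with tempfile.TemporaryDirectory() as tmp:
    contents = {"a.txt": b"hello", "b.txt": b"hello", "c.txt": b"world"}
    paths = []
    for name, data in contents.items():
        path = os.path.join(tmp, name)
        with open(path, "wb") as handle:
            handle.write(data)
        paths.append(path)
    kept = [os.path.basename(p) for p in add_unique(paths)]
```

In the database version, the `seen` set would be replaced by a lookup on the new hash and size columns.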

Improve text search by matching tokens

Text search currently happens only in vector space and thus ignores exact query token matches, even though those are a strong signal.

Depends on #32

Add automatic file tagging by file type

Auto tag file with extension (type), e.g. JPG, PNG, TXT, PDF, PY, JS
Auto tag file with group, e.g. Image (JPG, PNG), Document (TXT, PDF), Source (PY, JS)
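A sketch of both rules together. The `GROUPS` table only covers the extensions named above, and the function name `auto_tags` is hypothetical.

```python
import os

# Assumed group table; only the six extensions mentioned in the issue are covered.
GROUPS = {
    "Image": {"JPG", "PNG"},
    "Document": {"TXT", "PDF"},
    "Source": {"PY", "JS"},
}


def auto_tags(path: str) -> list:
    """Return the extension tag plus a group tag if the extension is known."""
    extension = os.path.splitext(path)[1].lstrip(".").upper()
    if not extension:
        return []  # no extension, nothing to infer
    tags = [extension]
    for group, extensions in GROUPS.items():
        if extension in extensions:
            tags.append(group)
    return tags
```

A file with an unknown extension still gets its extension tag, just no group tag.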

Make semantic search optional

Semantic search comes with fairly big dependencies that some users may not be able or willing to download.

Add CPU / GPU toggle option

Currently things stop working if no CUDA GPU is available. This is bad. Make CUDA optional (allow CPU only usage). Looks like CLIP does not work without CUDA...

Add text search to HyperTagFS

Create a dedicated directory called "Search Texts". All directory names created in "Search Texts" are interpreted as search queries for text documents, and the directories are populated with the results accordingly.

Add semantic video search

First basic version: Partition video into e.g. 16 uniformly spaced (by time) sections and take a screenshot. Embed each screenshot and use average as video embedding.

Advanced: Partition video with higher granularity and extract frames e.g. every 5 seconds or a fixed high number (100+). Compute an embedding for every extracted frame. Compute pairwise distances between consecutive frames in embedding space to infer semantically coherent video sections (similar frames). Embed each section as the average of its coherent frames (consecutive distances below a threshold). The list of average section embeddings should be a pretty good representation of the video and comes with section start & end metadata.
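The advanced variant could be sketched as follows, assuming frame embeddings are already computed and are non-zero vectors. Cosine distance and the threshold value are assumptions; the issue does not name a metric. All function names are hypothetical.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def mean_vector(vectors):
    """Element-wise average of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def segment_video(frames, timestamps, threshold=0.2):
    """Group consecutive frames into sections.

    A new section starts whenever the distance between two neighbouring
    frames exceeds `threshold`. Each section records the timestamps of its
    first and last frame and the average of its frame embeddings.
    """
    sections = []
    start = 0
    for i in range(1, len(frames) + 1):
        is_cut = i == len(frames) or cosine_distance(frames[i - 1], frames[i]) > threshold
        if is_cut:
            sections.append({
                "start": timestamps[start],
                "end": timestamps[i - 1],
                "embedding": mean_vector(frames[start:i]),
            })
            start = i
    return sections


# Toy example: two clearly different scenes, sampled every 5 seconds.
frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
sections = segment_video(frames, timestamps=[0, 5, 10, 15])
```

Searching a video then means comparing the query embedding against each section embedding and reporting the best section's start and end.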

Add option to merge two tags

$ hypertag merge A into B

Moves all file associations from A to B
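A sketch of the merge on a hypothetical schema (`tags` and `tags_files` with a unique constraint on the pair); the real HyperTag tables may differ. Whether tag A itself is deleted afterwards is left open here.

```python
import sqlite3


def merge_tags(conn, source, target):
    """Move every file association from `source` to `target`."""
    with conn:  # single transaction
        src = conn.execute("SELECT tag_id FROM tags WHERE name = ?", (source,)).fetchone()
        dst = conn.execute("SELECT tag_id FROM tags WHERE name = ?", (target,)).fetchone()
        if src is None or dst is None:
            raise ValueError("both tags must exist")
        if src == dst:
            return  # merging a tag into itself is a no-op
        # Re-point associations; OR IGNORE skips files already tagged with target.
        conn.execute(
            "UPDATE OR IGNORE tags_files SET tag_id = ? WHERE tag_id = ?",
            (dst[0], src[0]),
        )
        # Remove the skipped leftovers (files that carried both tags).
        conn.execute("DELETE FROM tags_files WHERE tag_id = ?", (src[0],))


# Hypothetical schema and data, used only to make the example runnable.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tags (tag_id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE tags_files (tag_id INTEGER, file_id INTEGER, UNIQUE (tag_id, file_id));
INSERT INTO tags (tag_id, name) VALUES (1, 'A'), (2, 'B');
INSERT INTO tags_files VALUES (1, 10), (1, 11), (2, 11);
""")
merge_tags(conn, "A", "B")
```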

Add remove file/s function

Semantic search for images

Allow searching for image files using both text and images as queries

Evaluate textract for audio files

https://textract.readthedocs.io/en/stable/#currently-supporting

Improve query UX using fuzzy word matching

Fuzzy String Matching: https://github.com/seatgeek/fuzzywuzzy

Related to #9
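To get a feel for the behavior without an extra dependency, here is a sketch that uses `difflib` from the standard library instead of fuzzywuzzy. The function name `fuzzy_tags` and the cutoff value are assumptions.

```python
import difflib


def fuzzy_tags(query, known_tags, cutoff=0.7):
    """Return up to three known tags that closely resemble the query word."""
    return difflib.get_close_matches(query, known_tags, n=3, cutoff=cutoff)
```

A misspelled query such as `sciense` would then still resolve to the tag `science`, while an unrelated word returns an empty list.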

Add set theory querying (union, intersection, etc.)

Improve query UX using synonym detection

Match semantically very similar words. For example, if files are tagged with "science" and "research" is queried, it should match. Definitely add a toggle to turn this feature off, as some users may find it confusing.

Related to #9

Add indices to improve SQLite query performance

HyperTagFS: Let user create directories with names as queries

Use Case: User creates a directory named: animal minus human -> directory should contain all files associated with animals minus human files.

Depends on #10 & #18
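A sketch of how a directory name could be evaluated as a query. Only `minus` comes from the use case above; the `and` / `or` keywords, the strict left-to-right evaluation, and the function name `evaluate_query` are assumptions.

```python
def evaluate_query(query, index):
    """Evaluate a query of the form TAG (OPERATOR TAG)* from left to right.

    `index` maps a tag name to the set of files carrying that tag.
    Unknown tags are treated as empty sets.
    """
    operators = {
        "and": set.intersection,
        "or": set.union,
        "minus": set.difference,
    }
    tokens = query.split()
    if len(tokens) % 2 == 0:
        raise ValueError("expected: TAG (OPERATOR TAG)*")
    result = set(index.get(tokens[0], set()))
    for operator, tag in zip(tokens[1::2], tokens[2::2]):
        if operator not in operators:
            raise ValueError(f"unknown operator: {operator}")
        result = operators[operator](result, index.get(tag, set()))
    return result


# Hypothetical tag index for illustration.
index = {
    "animal": {"cat.jpg", "zookeeper.jpg"},
    "human": {"zookeeper.jpg", "selfie.jpg"},
}
files = evaluate_query("animal minus human", index)
```

The resulting set is what the directory named `animal minus human` would be populated with.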

Update HyperTagFS dir lazily

Right now the whole HyperTagFS directory gets rebuilt on every tag-changing operation. Instead, only make partial updates.

Extract text from blobby text file formats (PDF, etc.)

Extend daemon process to load the semantic search model and serve as single oracle

Needs fast and reliable IPC to work

Visualize the HyperTag graph

Candidates:

https://graph-tool.skewed.de/static/doc/quickstart.html
- Pro: Performance (fast -> C++ wrapper)
- Con: Size (big), no pip install (cuz C++)
https://github.com/networkx/networkx
- Pro: well tested
https://github.com/igraph/python-igraph
- Pro: Performance (fast -> C wrapper)
https://github.com/root-11/graph-theory
- Pro: Size (tiny)
- Con: Performance (slow?)

Add migration (import) option for TMSU users

Add matrix community chat

Speed up semantic search using spatial index DS

Use a spatial index data structure (tree or hash based) -> https://github.com/nmslib/hnswlib/

Add option --auto to import

This will tell the daemon to automatically watch the imported directory for new files and renames.

Evaluate image to text search

Powered by CLIP

Semantic search for text documents

Vectorize all text documents and let the user search them.

Related to #24 and #9

Just eyeballing: the GloVe model (average_word_embeddings_glove.6B.300d) seems to perform better than DistilBERT (stsb-distilbert-base). Add some small benchmark tests with common and diverse papers and queries.

Models:
https://docs.google.com/spreadsheets/d/14QplCdTCDwEmTqrn1LH4yrbKvdogK4oQvYO1K1aPR5M/edit#gid=0

tesseract-ocr/tesseract#263 (comment)

Even better: Find a solid GPU accelerated OCR implementation:

https://github.com/jaidedai/easyocr (looks promising, but really awful CPU performance and model sizes too big for my lil GPU with 2GB VRAM)
https://github.com/Xilinx/pytorch-ocr/blob/master/README.md
https://github.com/Calamari-OCR/calamari
https://github.com/faustomorales/keras-ocr
