nitec427 / fastdup

This project forked from visual-layer/fastdup

License: Other

C++ 1.62% Python 98.25% Dockerfile 0.13%

fastdup

Easily Manage, Clean & Curate Visual Data at Scale

fastdup is a tool for gaining insights from a large image/video collection. It can find anomalies, duplicate and near-duplicate images/videos, and clusters of similar items, and it can learn the normal behavior and temporal interactions between images/videos. It can be used for smart subsampling to produce a higher-quality dataset, for outlier removal, and for novelty detection of new information to be sent for tagging.

fastdup is:

  • Unsupervised: fits any dataset
  • Scalable: handles 400M images on a single machine
  • Efficient: works on CPU only
  • Low Cost: can process 12M images on a $1 cloud machine budget

From the authors of GraphLab and Turi Create.

Large Image Datasets Today are a Mess Blog | Processing LAION400m Video


Fastdup solves the following problems:


Just 2 lines of code to get you started:

Quick installation

  • Python 3.7, 3.8, 3.9, 3.10
  • Supported OS: Ubuntu 20.04, Ubuntu 18.04, Debian 10, Mac OSX M1, Mac OSX Intel, Amazon Linux 2, CentOS 7, RedHat 4.8, Windows 10 Server.
# upgrade pip to its latest version
python3.XX -m pip install -U pip
# install fastdup
python3.XX -m pip install fastdup

where XX is your Python version. For Windows, CentOS 7.X, RedHat 4.8, Amazon Linux 2, and other older Linux distributions, see our Installation instructions.
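Before installing, it can help to confirm that the interpreter you plan to use falls within the supported range listed above. The following is a minimal sketch, not part of fastdup; the helper name `is_supported` is hypothetical and the version set is copied from the list above.

```python
import sys

# Python versions listed as supported in the installation notes above.
SUPPORTED = {(3, 7), (3, 8), (3, 9), (3, 10)}

def is_supported(version_info=sys.version_info):
    """Return True if the (major, minor) version is in the supported set."""
    return (version_info[0], version_info[1]) in SUPPORTED

if __name__ == "__main__":
    major, minor = sys.version_info[0], sys.version_info[1]
    status = "supported" if is_supported() else "not listed as supported"
    print(f"Python {major}.{minor} is {status}")
```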

Full documentation

Full documentation is here

Running the code

import fastdup
fastdup.run(input_dir="/path/to/your/folder", work_dir='out', nearest_neighbors_k=5, turi_param='ccthreshold=0.96')    # main running function
fastdup.create_duplicates_gallery('out/similarity.csv', save_path='.')     # create a visual gallery of found duplicates
fastdup.create_outliers_gallery('out/outliers.csv',   save_path='.')       # create a visual gallery of anomalies
fastdup.create_components_gallery('out', save_path='.')                    # create a visualization of connected components
fastdup.create_stats_gallery('out', save_path='.', metric='blur')          # create a visualization of image statistics (for example blur)
fastdup.create_similarity_gallery('out', save_path='.', get_label_func=lambda x: x.split('/')[-2])     # create a visualization of the top_k similar images, assuming the labels are the folder names
fastdup.create_aspect_ratio_gallery('out', save_path='.')                  # create an aspect ratio gallery
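The `get_label_func` lambda above takes the label from the name of the folder that directly contains each image. The sketch below shows what that lambda returns for a POSIX-style path, next to an equivalent `pathlib` version. The example path and the name `label_from_path` are hypothetical and used only for illustration.

```python
from pathlib import PurePosixPath

def label_from_path(path: str) -> str:
    """Return the name of the folder that directly contains the image."""
    return PurePosixPath(path).parent.name

# The lambda used in the create_similarity_gallery call above.
label_from_lambda = lambda x: x.split('/')[-2]

# Hypothetical path: the image sits inside a folder named after its class.
example = "/data/food101/pizza/123.jpg"
print(label_from_path(example), label_from_lambda(example))
```

Both forms assume forward-slash separators; paths that use backslashes would need a different split.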

Working on the Food-101 dataset: detecting identical pairs, similar pairs (search), and outliers (non-food images).
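To inspect the detected pairs outside the galleries, the `out/similarity.csv` file written by `run()` can be read directly. The sketch below relies on assumptions that are not confirmed by the text above: that the file has a header row with columns named `from`, `to`, and `distance`, and that a higher `distance` value means the two images are more similar. Check the header of your own file first. The function name `near_duplicates` is hypothetical.

```python
import csv

def near_duplicates(csv_path, min_similarity=0.96):
    """Return (from, to, score) tuples whose score is at least min_similarity.

    Assumes columns named 'from', 'to', 'distance' (an assumption, see above).
    """
    pairs = []
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            score = float(row["distance"])
            if score >= min_similarity:
                pairs.append((row["from"], row["to"], score))
    return pairs
```

A call such as `near_duplicates('out/similarity.csv')` would then list the pairs at or above the threshold, which here simply mirrors the `ccthreshold=0.96` value from the example call.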

Getting started examples

Detailed instructions

User community contributions

Stroke AIS Data Tire Data Butterfly Mimics Drugs and Vitamins Plastic Bottles Micro Organisms PCB Boards ZebraFish Whats the difference

Support and feature requests

Join our Slack channel

Disclaimer

Usage Tracking

We have added experimental crash report collection using sentry.io. It does not collect user data other than anonymized IP address data, and it logs only the fastdup library's own actions. We do NOT collect folder names, user names, image names, or image content; we collect only aggregate performance statistics such as the total number of images, average runtime per image, total free memory, total free disk space, and number of cores. Collecting fastdup crashes will help us improve stability.

The code for the data collection is found here. On macOS we use Google Crashpad.

It is always possible to opt out of the experimental crash report collection via either of the following two options:

  • Define an environment variable called SENTRY_OPT_OUT
  • or run() with turi_param='run_sentry=0'
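The two opt-out options above can be sketched as follows. The text only says the environment variable must be defined, so the value `"1"` is an assumption, and the wrapper name `run_without_crash_reports` is hypothetical. fastdup is imported inside the function so that the variable is already set when the library loads.

```python
import os

# Option 1: define SENTRY_OPT_OUT before fastdup is imported or run.
# The value "1" is an assumption; the text only requires the variable to exist.
os.environ["SENTRY_OPT_OUT"] = "1"

def run_without_crash_reports(input_dir, work_dir="out"):
    """Option 2: pass run_sentry=0 to run() through turi_param."""
    import fastdup  # imported here, after the opt-out variable is set
    return fastdup.run(input_dir=input_dir, work_dir=work_dir,
                       turi_param="run_sentry=0")
```

Either option alone should be enough according to the text above; the sketch shows both only for completeness.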

fastdup enterprise edition

Visual Layer

About us

Danny Bickson, Amir Alush
