Duplicate Image Finder (difPy)

Tired of going through all images in a folder and comparing them manually to check if they are duplicates?

✅ The Duplicate Image Finder (difPy) Python package automates this task for you!

pip install difPy

👉 difPy v2.4.x has some major updates and new features. Check out the release notes for a detailed listing.

Our motto? The more people use difPy, the more issues and missing features can be detected, and the better the algorithm gets over time. Contributions are always welcome - check our contributor guidelines for more information.

Read more on how the algorithm of difPy works in my Medium article Finding Duplicate Images with Python.

Check out the difPy package on PyPI.org

Description

DifPy searches for images in one or two folders, compares the images it finds, and checks whether they are duplicates. It then outputs the image files classified as duplicates, along with the filenames of the lower-resolution duplicates, so you know which ones are safe to delete. You can then either delete them manually or let difPy delete them for you.

DifPy does not compare images based on their hashes. It compares them based on their tensors, i.e. the image content. This allows difPy to search not only for duplicate images, but also for similar images.
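To make the idea of comparing image content more concrete, here is a minimal sketch. It is not difPy's actual implementation. It assumes both images are already decoded, resized to the same shape, and represented as nested lists of grayscale pixel values. The function names `mse` and `is_match` and the sample data are hypothetical. The default threshold of 200 is the `similarity_mse` value shown for the "normal" grade in the stats output.

```python
def mse(image_a, image_b):
    """Mean squared error between two equally sized images (lists of pixel rows)."""
    same_shape = len(image_a) == len(image_b) and all(
        len(row_a) == len(row_b) for row_a, row_b in zip(image_a, image_b)
    )
    if not same_shape:
        raise ValueError("images must have the same shape")
    squared_errors = [
        (pixel_a - pixel_b) ** 2
        for row_a, row_b in zip(image_a, image_b)
        for pixel_a, pixel_b in zip(row_a, row_b)
    ]
    return sum(squared_errors) / len(squared_errors)


def is_match(image_a, image_b, threshold=200):
    """Treat two images as duplicate/similar when their MSE is within the threshold."""
    return mse(image_a, image_b) <= threshold


# Hypothetical 2x2 sample images.
original = [[10, 20], [30, 40]]
slightly_changed = [[12, 18], [30, 41]]
different = [[200, 0], [0, 200]]

print(mse(original, slightly_changed))       # 2.25
print(is_match(original, slightly_changed))  # True
print(is_match(original, different))         # False
```

Because the comparison produces a distance rather than a yes/no hash match, loosening the threshold moves the search from exact duplicates toward similar images.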

Basic Usage

Use the following function to make difPy search for duplicates within one specific folder and its subfolders:

from difPy import dif
search = dif("C:/Path/to/Folder/")

To search for duplicates within two folders and their subfolders:

from difPy import dif
search = dif("C:/Path/to/Folder_A/", "C:/Path/to/Folder_B/")

Folder paths must be specified as a Python string.

📓 For a detailed usage guide, please view the official difPy Usage Documentation.

Output

DifPy gives two types of output that you may use depending on your use case:

A dictionary of duplicates/similar images that were found, where the keys are a unique id for each image file:

search.result

> Output:
{20220824212437767808 : {"filename" : "image1.jpg",
                         "location" : "C:/Path/to/Image/image1.jpg",
                         "duplicates" : ["C:/Path/to/Image/duplicate_image1.jpg",
                                         "C:/Path/to/Image/duplicate_image2.jpg"]},
...
}
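As a rough sketch of how such a dictionary could be processed, the snippet below walks over a hand-written sample instead of a real search. It assumes each entry nests "filename", "location" and "duplicates" under the image id. The names `sample_result`, `summarize` and `count_duplicates` are hypothetical and not part of difPy.

```python
# Hypothetical sample shaped like the search.result output described above.
sample_result = {
    20220824212437767808: {
        "filename": "image1.jpg",
        "location": "C:/Path/to/Image/image1.jpg",
        "duplicates": [
            "C:/Path/to/Image/duplicate_image1.jpg",
            "C:/Path/to/Image/duplicate_image2.jpg",
        ],
    },
}


def summarize(result):
    """Return (original location, list of duplicate paths) pairs, one per entry."""
    return [(entry["location"], list(entry["duplicates"])) for entry in result.values()]


def count_duplicates(result):
    """Total number of duplicate paths across all entries."""
    return sum(len(entry["duplicates"]) for entry in result.values())


for location, duplicates in summarize(sample_result):
    print(f"{location} has {len(duplicates)} duplicate(s)")
```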

A list of duplicates/similar images that have the lowest quality:

search.lower_quality

> Output:
["C:/Path/to/Image/duplicate_image1.jpg", 
 "C:/Path/to/Image/duplicate_image2.jpg", ...]
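If you would rather review the lower quality files before removing them, one option is to move them into a separate folder first. The following is a minimal sketch using only the standard library. The `quarantine` helper and the target folder are hypothetical, and the list passed in is assumed to have the same shape as `search.lower_quality`.

```python
import shutil
from pathlib import Path


def quarantine(paths, target_dir):
    """Move every existing file in `paths` into `target_dir` and return the new paths."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    moved = []
    for raw_path in paths:
        source = Path(raw_path)
        if not source.is_file():
            continue  # skip entries that no longer exist on disk
        destination = target / source.name
        shutil.move(str(source), str(destination))
        moved.append(destination)
    return moved
```

Note that this sketch does not handle two files with the same name coming from different folders. Renaming on collision would be a sensible extension.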

DifPy can also generate a dictionary with statistics on the completed process:

search.stats

> Output:
{"directory_1" : "C:/Path/to/Folder_A/",
 "directory_2" : "C:/Path/to/Folder_B/",
 "duration" : {"start_date": "2022-06-13",
               "start_time" : "14:44:19",
               "end_date" : "2022-06-13",
               "end_time" : "14:44:38",
               "seconds_elapsed" : 18.6113},
 "similarity_grade" : "normal",
 "similarity_mse" : 200,
 "total_images_searched" : 1032,
 "total_dupl_sim_found" : 1024}
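The statistics can be post-processed like any other dictionary. The sketch below uses a hand-written copy of the sample output above, not a real run. The helper names `elapsed_from_timestamps` and `images_per_second` are hypothetical.

```python
from datetime import datetime

# Hand-written copy of the sample stats output shown above.
sample_stats = {
    "directory_1": "C:/Path/to/Folder_A/",
    "directory_2": "C:/Path/to/Folder_B/",
    "duration": {
        "start_date": "2022-06-13",
        "start_time": "14:44:19",
        "end_date": "2022-06-13",
        "end_time": "14:44:38",
        "seconds_elapsed": 18.6113,
    },
    "similarity_grade": "normal",
    "similarity_mse": 200,
    "total_images_searched": 1032,
    "total_dupl_sim_found": 1024,
}


def elapsed_from_timestamps(stats):
    """Recompute the duration in whole seconds from the start/end date and time strings."""
    duration = stats["duration"]
    fmt = "%Y-%m-%d %H:%M:%S"
    start = datetime.strptime(f'{duration["start_date"]} {duration["start_time"]}', fmt)
    end = datetime.strptime(f'{duration["end_date"]} {duration["end_time"]}', fmt)
    return (end - start).total_seconds()


def images_per_second(stats):
    """Average number of images searched per second of processing."""
    return stats["total_images_searched"] / stats["duration"]["seconds_elapsed"]


print(elapsed_from_timestamps(sample_stats))  # 19.0
print(round(images_per_second(sample_stats), 1))
```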

CLI Usage

You can also use difPy from the command line with the following commands:

python dif.py -A "C:/Path/to/Folder_A/"

python dif.py -A "C:/Path/to/Folder_A/" -B "C:/Path/to/Folder_B/"

It supports the following arguments:

dif.py [-h] -A DIRECTORY_A [-B [DIRECTORY_B]] [-Z [OUTPUT_DIRECTORY]] 
       [-s [{low,normal,high,int}]] [-px [PX_SIZE]] [-p [{True,False}]] [-o [{True,False}]]
       [-d [{True,False}]] [-D [{True,False}]]

The output of difPy is then written to files and saved in the working directory by default, or to the folder specified in the -Z / -output_directory parameter. The "xxx" in the filename is a unique timestamp:

difPy_results_xxx.json
difPy_lower_quality_xxx.txt
difPy_stats_xxx.json
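Since the results are written as JSON, they can be loaded back into Python for further processing. The sketch below picks the most recently modified results file in a folder. It relies only on the `difPy_results_xxx.json` naming pattern described above, and the helper name `load_latest_results` is hypothetical.

```python
import json
from pathlib import Path


def load_latest_results(output_dir="."):
    """Load the most recently modified difPy_results_*.json file from `output_dir`."""
    candidates = sorted(
        Path(output_dir).glob("difPy_results_*.json"),
        key=lambda path: path.stat().st_mtime,
    )
    if not candidates:
        raise FileNotFoundError(f"no difPy_results_*.json file found in {output_dir}")
    with candidates[-1].open(encoding="utf-8") as handle:
        return json.load(handle)
```

Keep in mind that JSON object keys are always strings, so ids loaded from a file will be strings even if they were integers in memory.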

📓 For a detailed usage guide, please view the official difPy Usage Documentation.

Additional Parameters

DifPy has the following optional parameters:

dif(directory_A, directory_B, similarity="normal", px_size=50, 
    show_progress=True, show_output=False, delete=False, silent_del=False)

similarity (str, int)

Depending on your use case, the granularity for classifying the images can be adjusted. DifPy can, for example, search for exactly matching duplicate images, or for images that look similar but are not necessarily duplicates.

"normal" = (recommended, default) searches for duplicates with a certain tolerance

"high" = searches for duplicate images with extreme precision, e.g. when comparing images that contain a lot of detail, such as text

"low" = searches for similar images

To customize the classification threshold and define the MSE value manually, you can set similarity to any integer.
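One way to picture how a parameter that accepts either a grade or an integer might be resolved into a single MSE threshold is sketched below. This is an illustration, not difPy's code. Only the value 200 for "normal" comes from the documentation (see `similarity_mse` in the stats output). The numbers for "low" and "high" are placeholders chosen for this example, and `resolve_threshold` is a hypothetical helper.

```python
# Only "normal" -> 200 appears in the documentation; the "low" and "high"
# values are illustrative placeholders, not confirmed difPy settings.
ILLUSTRATIVE_THRESHOLDS = {"low": 1000, "normal": 200, "high": 0.1}


def resolve_threshold(similarity="normal"):
    """Turn a similarity grade or a custom integer into an MSE threshold."""
    if isinstance(similarity, bool):
        raise TypeError("similarity must be a grade string or an integer")
    if isinstance(similarity, int):
        if similarity < 0:
            raise ValueError("a custom MSE threshold cannot be negative")
        return similarity
    if isinstance(similarity, str) and similarity in ILLUSTRATIVE_THRESHOLDS:
        return ILLUSTRATIVE_THRESHOLDS[similarity]
    raise ValueError(f"unsupported similarity value: {similarity!r}")
```

A lower threshold tolerates less difference between two images, so it behaves more strictly. A higher one lets more loosely similar images through.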

px_size (int)

! It is recommended not to change the default value

Absolute size in pixels (width x height) that the images will be compressed to before being compared. The higher the px_size, the more computational resources and time are required.
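To get a feeling for why larger values are more expensive, here is a small back-of-the-envelope sketch. It assumes images are resized to px_size x px_size with three colour channels. The channel count is an assumption made for this example, and both helper names are hypothetical.

```python
def values_per_image(px_size, channels=3):
    """Number of values compared per image, assuming a square px_size x px_size resize."""
    return px_size * px_size * channels


def relative_cost(px_size, baseline=50):
    """Per-comparison cost relative to the default px_size, under the same assumption."""
    return (px_size / baseline) ** 2


print(values_per_image(50))   # 7500
print(relative_cost(100))     # 4.0
```

Under these assumptions, doubling px_size quadruples the work per image comparison.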

show_progress (bool)

By default, difPy sets this parameter to True, so that you can follow the progress of lengthy processing. Change this value to False to disable the progress bar.

True = (default) outputs a progress bar

False = no progress bar is shown

show_output (bool)

By default, difPy outputs only the filenames of the duplicate images it found. If you want the duplicate images themselves to be shown in the console output, change this value to True.

False = (default) outputs the filenames of the duplicate/similar images found

True = outputs a sample and the filename

delete (bool)

! Please use with care, as this cannot be undone

When set to True, the lower resolution duplicate images found by difPy are deleted from the folder. DifPy asks for user confirmation before deleting the images. To skip the user confirmation, set silent_del to True.

silent_del (bool)

! Please use with care, as this cannot be undone

When set to True, the user confirmation is skipped and the lower resolution duplicate images that were found by difPy are automatically deleted from the folder.
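The interplay between a confirmation prompt and a silent mode can be sketched as follows. This is a hypothetical helper written for illustration, not difPy's deletion code. The `ask` argument defaults to the built-in `input`, so the prompt can be replaced in scripts or tests.

```python
from pathlib import Path


def delete_lower_quality(paths, silent=False, ask=input):
    """Delete the given files, asking for confirmation first unless `silent` is True."""
    if not silent:
        answer = ask(f"Delete {len(paths)} file(s)? This cannot be undone. (y/n) ")
        if answer.strip().lower() != "y":
            return []  # user declined, nothing is deleted
    deleted = []
    for raw_path in paths:
        path = Path(raw_path)
        if path.is_file():
            path.unlink()
            deleted.append(path)
    return deleted
```

Returning the list of deleted paths makes it easy to log exactly what was removed.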

Similar Work

I. DifPy as Webapp

A Streamlit-based web app to find duplicate images in single/multiple directories - 🧬 based on difPy

Single Directory 📸✅ demo1

Two directories 📸✅ demo2

II. Mac Photos Tool to find Duplicates (photosdup)

Tool to scan a Mac Photos library for duplicates, thumbnails etc. - ✨ inspired by difPy


💭 Also want to be featured in the "Similar Work" section? Check our contributor guidelines to find out how!

Contributors

elisemercury, ppizarror, bemau, ethanmann, prateekralhan, valexandrin
