
Duplicate Image Finder (difPy)


Tired of going through all images in a folder and comparing them manually to check if they are duplicates?

✅ The Duplicate Image Finder (difPy) Python package automates this task for you!

pip install difPy

👉 🆕 difPy v4 is out! difPy v4 delivers up to 10x the performance of previous difPy versions. Check out the release notes for details.

👐 Our motto? We ❤️ Open Source! Contributions and new ideas for difPy are always welcome - check our Contributor Guidelines for more information.

Read more on how the algorithm of difPy works in my Medium article Finding Duplicate Images with Python.

Check out the difPy package on PyPI.org


Description

difPy searches for images in one or more folders, compares them, and checks whether they are duplicates. It then outputs the image files classified as duplicates, as well as the images with the lowest resolution in each match group, so you know which of the duplicates are safe to delete. You can then either delete them manually, or let difPy delete them for you.

difPy does not compare images based on their hashes. It compares them based on their tensors, i.e. the image content. This allows difPy to search not only for duplicate images, but also for similar images.
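The tensor comparison boils down to computing a mean squared error (MSE) between downscaled image arrays. The following is only an illustrative sketch of that idea, not difPy's actual implementation; the helper names and the nearest-neighbour downscaling are assumptions made for brevity:

```python
import numpy as np

def to_tensor(img, px_size=50):
    # Downscale to px_size x px_size with nearest-neighbour sampling
    # (difPy's real resizing strategy may differ)
    h, w = img.shape[:2]
    rows = np.arange(px_size) * h // px_size
    cols = np.arange(px_size) * w // px_size
    return img[rows][:, cols].astype("float64")

def mse(a, b):
    # Mean squared error between two equally-sized image tensors
    return float(np.mean((a - b) ** 2))

# Two synthetic "images" with identical content at different resolutions
base = np.random.default_rng(0).integers(0, 256, (100, 100, 3))
upscaled = base.repeat(2, axis=0).repeat(2, axis=1)

print(mse(to_tensor(base), to_tensor(upscaled)))  # 0.0 -> exact duplicate
```

Because the comparison operates on resized content rather than file bytes, a duplicate saved at a different resolution can still yield an MSE of 0.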

difPy leverages Python's multiprocessing capabilities and is therefore able to perform at high performance even on large datasets.

📓 For a detailed usage guide, please view the official difPy Usage Documentation.

Table of Contents

  1. Basic Usage
  2. Output
  3. Additional Parameters
  4. CLI Usage
  5. difPy Web App

Basic Usage

To make difPy search for duplicates within one folder:

import difPy
dif = difPy.build('C:/Path/to/Folder/')
search = difPy.search(dif)

To search for duplicates within multiple folders:

import difPy
dif = difPy.build(['C:/Path/to/Folder_A/', 'C:/Path/to/Folder_B/', 'C:/Path/to/Folder_C/', ... ])
search = difPy.search(dif)

Folder paths can be specified as standalone Python strings, or within a list. With difPy.build(), difPy first scans the images in the provided folders and builds a collection of images by generating image tensors. difPy.search() then starts the search for duplicate images.

📓 For a detailed usage guide, please view the official difPy Usage Documentation.

Output

difPy returns various types of output that you may use depending on your use case:

I. Search Result Dictionary

A JSON formatted collection of duplicate/similar images (i.e. match groups) that were found. Each match group has a primary image (the key of the dictionary) which holds the list of its matches, including their filenames and MSE (Mean Squared Error). The lower the MSE, the more similar the primary image and the matched images are; an MSE of 0 indicates that two images are exact duplicates.

search.result

> Output:
{'C:/Path/to/Image/image1.jpg' : [['C:/Path/to/Image/duplicate_image1a.jpg', 0.0], 
                                  ['C:/Path/to/Image/duplicate_image1b.jpg', 0.0]],
 'C:/Path/to/Image/image2.jpg' : [['C:/Path/to/Image/duplicate_image2a.jpg', 0.0]],
 ...
}
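The result dictionary is a plain Python dict, so it can be post-processed with ordinary iteration. A small sketch using the shape shown above (with shortened, hypothetical paths):

```python
# Shape of search.result as shown above, with paths shortened for the example
result = {
    'image1.jpg': [['duplicate_image1a.jpg', 0.0],
                   ['duplicate_image1b.jpg', 0.0]],
    'image2.jpg': [['duplicate_image2a.jpg', 0.0]],
}

# Flatten all matched files, keeping only exact duplicates (MSE == 0)
exact = [match for matches in result.values()
         for match, mse in matches if mse == 0]
print(exact)
```

The same pattern works for filtering by an MSE threshold when searching for similar rather than exact images.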

II. Lower Quality Files

A list of the duplicate/similar images that have the lowest quality within their match group:

search.lower_quality

> Output:
['C:/Path/to/Image/duplicate_image1.jpg', 
 'C:/Path/to/Image/duplicate_image2.jpg', ...]

Lower quality images can then be moved to a different location:

search.move_to(destination_path='C:/Path/to/Destination/')

Or deleted:

search.delete(silent_del=False)

III. Process Statistics

A JSON formatted collection with statistics on the completed difPy processes:

search.stats

> Output:
{'directory': ['C:/Path/to/Folder_A/', 'C:/Path/to/Folder_B/', ... ],
 'process': {'build': {'duration': {'start': '2024-02-18T19:52:39.479548',
                                    'end': '2024-02-18T19:52:41.630027',
                                    'seconds_elapsed': 2.1505},
                       'parameters': {'recursive': True,
                                      'in_folder': False,
                                      'limit_extensions': True,
                                      'px_size': 50,
                                      'processes': 5},
                        'total_files': {'count': 3232},
                        'invalid_files': {'count': 0, 
                                          'logs': {}}},
             'search': {'duration': {'start': '2024-02-18T19:52:41.630027',
                                     'end': '2024-02-18T19:52:46.770077',
                                     'seconds_elapsed': 5.14},
                        'parameters': {'similarity_mse': 0,
                                       'rotate': True,
                                       'lazy': True,
                                       'processes': 5,
                                       'chunksize': None},
                        'files_searched': 3232,
                        'matches_found': {'duplicates': 3030, 
                                          'similar': 0}}}}

Additional Parameters

difPy supports the following parameters:

difPy.build(*directory, recursive=True, in_folder=False, limit_extensions=True, px_size=50, 
            show_progress=True, processes=None)
difPy.search(difpy_obj, similarity='duplicates', rotate=True, lazy=True, show_progress=True, 
             processes=None, chunksize=None)

📓 For a detailed usage guide, please view the official difPy Usage Documentation.

CLI Usage

difPy can also be invoked through the CLI by using the following commands:

python dif.py # searches the current working directory

python dif.py -D 'C:/Path/to/Folder/'

python dif.py -D 'C:/Path/to/Folder_A/' 'C:/Path/to/Folder_B/' 'C:/Path/to/Folder_C/'

👉 Windows users can add difPy to their PATH system variables by pointing it to their difPy package installation folder containing the difPy.bat file. This adds difPy as a command in the CLI and will allow direct invocation of difPy from anywhere on the device.

difPy CLI supports the following arguments:

dif.py [-h] [-D DIRECTORY [DIRECTORY ...]] [-Z OUTPUT_DIRECTORY] 
       [-r {True,False}] [-i {True,False}] [-le {True,False}] 
       [-px PX_SIZE]  [-s SIMILARITY] [-ro {True,False}]
       [-la {True,False}] [-proc PROCESSES] [-ch CHUNKSIZE] 
       [-mv MOVE_TO] [-d {True,False}] [-sd {True,False}]
       [-p {True,False}]
Argument  Parameter
-D        directory
-Z        output_directory
-r        recursive
-i        in_folder
-le       limit_extensions
-px       px_size
-s        similarity
-ro       rotate
-la       lazy
-proc     processes
-ch       chunksize
-mv       move_to
-d        delete
-sd       silent_del
-p        show_progress

If no directory parameter is given in the CLI, difPy will run on the current working directory.

When running from the CLI, the output of difPy is written to files and saved in the working directory by default. To change the default output directory, specify the -Z / -output_directory parameter. The "xxx" in the output filenames is the current timestamp:

difPy_xxx_results.json
difPy_xxx_lower_quality.json
difPy_xxx_stats.json

📓 For a detailed usage guide, please view the official difPy Usage Documentation.

difPy Web App

difPy can also be accessed via a browser. With difPy Web, you can compare up to 200 images and download a deduplicated ZIP file - all powered by difPy. Read more.

📱 Try the new difPy Web App!


❤️ Open Source

duplicate-image-finder's People

Contributors

arieker, bemau, elisemercury, ethanmann, madfist, pankovea, ppizarror, prateekralhan, valexandrin


duplicate-image-finder's Issues

Make the package installable and usable via `pipx`

Thanks for this package! I'm currently using it to clean up my hard drive.

Packages like flake8, pytest and ruff can be run in the CLI just by typing their package name.

I noticed that this package does not do that, and you have to cd into the directory where you installed it, based on this issue: #49

Try running pipx install ruff: it automatically creates a virtualenv, adds ruff to the $PATH, and you can use it anywhere.


I think it would be great to have this package support it.

This is the output when I run pipx install difPy

pipx install difPy
Note: Dependent package 'numpy' contains 3 apps
  - f2py
  - f2py3
  - f2py3.11
Note: Dependent package 'fonttools' contains 4 apps
  - fonttools
  - pyftmerge
  - pyftsubset
  - ttx

Near duplicate Image detection

Hello, first of all thanks for creating this package. It is a really good package for detecting duplicate images.
I have tried it and found that it detects images with 100% similarity, but it was not able to detect images whose similarity is below 100%, even at 99.99% or less.
I have tried playing with the pixel and similarity values, but it still could not detect them. So, is there a way to detect images with a similarity score of less than 100% using the difPy package?

I have attached few images which it was not able to detect.
Note: the percentage values I refer to come from the matchTemplate method; the attached images have a similarity of 99%.

TOI_Delhi_12-07-2022_4_1
TOI_Delhi_12-07-2022_4_2
TOI_Delhi_12-07-2022_4_3
TOI_Delhi_12-07-2022_4_7
TOI_Delhi_12-07-2022_4_8

Multiprocessing

Hi, thanks for your work. Is it possible to add multiprocessing? I can see that only 1 core is fully utilised. I had a folder with 30k images and processing took too long

Same duplicate in different keys

We have found that when you use dif within a folder of folders, there may be some unexpected behaviour. In our case, we have a pair of duplicates in one folder and a third duplicate in another one. This makes it so that result will output:

(screenshot of the output attached)

So an element that was detected as duplicate is being used later as a key. We do not know if this is bug or a feature, but it may be inconsistent with the behavior of not repeating duplicates in later keys. Still, for our use we can just use a set() as a workaround to ignore "duplicates of duplicates".
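The set() workaround mentioned above can be sketched as follows (paths are hypothetical):

```python
# A result list may name the same file more than once when duplicates
# span several folders; a set() collapses the repeats before acting on them
lower_quality = ['a/dup.jpg', 'b/dup.jpg', 'a/dup.jpg']
unique = set(lower_quality)
print(sorted(unique))
```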

Nice work on the tool, it has helped us a lot with a nasty database. Thank you, have a nice day!


PNGs with transparency are mistakenly counted as duplicate and not rendered properly in GUI compare

Great tool!
I learned a lot reading the article you wrote about this as well.

I tested it on some of my files, but found that some PNGs that were just line-art (black line-art on a transparent background) were flagged as duplicates when they were completely different, even on high sensitivity. In fact, the listed MSE is 0.00.

They also did not render properly during the image comparison when running with -d False, with both image previews looking like black squares.
Note: this does not apply to line-art of a different color on a transparent background, only black.

I am not familiar with how the PNG file format encodes black vs transparent, but I believe that the issue stems from that.

(screenshot attached)

AttributeError: 'PosixPath' object has no attribute 'is_relative_to'

Hi,
on Google Colab I got this error:
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input> in <module>
----> 1 src_res = dif("/content/02_Draw_append", "/content/02_Draw")

1 frames
/usr/local/lib/python3.8/dist-packages/difPy/dif.py in __init__(self, directory_A, directory_B, recursive, similarity, px_size, show_progress, show_output, delete, silent_del)
     70     directory_A = dif._process_directory(directory_A)
     71     directory_B = dif._process_directory(directory_B)
---> 72     dif._path_validation([directory_A, directory_B])
     73     img_matrices_A, folderfiles_A = dif._create_imgs_matrix(directory_A, px_size, recursive, show_progress)
     74     img_matrices_B, folderfiles_B = dif._create_imgs_matrix(directory_B, px_size, recursive, show_progress)

/usr/local/lib/python3.8/dist-packages/difPy/dif.py in _path_validation(paths)
    143         raise ValueError('An attempt to compare the directory with itself.')
    144     path1, path2 = paths
--> 145     if path1.is_relative_to(path2) or path2.is_relative_to(path1):
    146         raise ValueError('One directory belongs to another.')
    147

AttributeError: 'PosixPath' object has no attribute 'is_relative_to'

the dif call is for two different folders:
src_res= dif("/content/02_Draw_append", "/content/02_Draw")

All works fine if I call it on a single folder, like:
src_res= dif("/content/02_Draw_append")

Here you can find the pip install info:
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Requirement already satisfied: difPy in /usr/local/lib/python3.8/dist-packages (2.4.5)
Requirement already satisfied: opencv-python in /usr/local/lib/python3.8/dist-packages (from difPy) (4.6.0.66)
Requirement already satisfied: matplotlib in /usr/local/lib/python3.8/dist-packages (from difPy) (3.2.2)
Requirement already satisfied: scikit-image in /usr/local/lib/python3.8/dist-packages (from difPy) (0.18.3)
Requirement already satisfied: numpy in /usr/local/lib/python3.8/dist-packages (from difPy) (1.21.6)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.8/dist-packages (from matplotlib->difPy) (1.4.4)
Requirement already satisfied: python-dateutil>=2.1 in /usr/local/lib/python3.8/dist-packages (from matplotlib->difPy) (2.8.2)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /usr/local/lib/python3.8/dist-packages (from matplotlib->difPy) (3.0.9)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.8/dist-packages (from matplotlib->difPy) (0.11.0)
Requirement already satisfied: pillow!=7.1.0,!=7.1.1,>=4.3.0 in /usr/local/lib/python3.8/dist-packages (from scikit-image->difPy) (7.1.2)
Requirement already satisfied: tifffile>=2019.7.26 in /usr/local/lib/python3.8/dist-packages (from scikit-image->difPy) (2022.10.10)
Requirement already satisfied: imageio>=2.3.0 in /usr/local/lib/python3.8/dist-packages (from scikit-image->difPy) (2.9.0)
Requirement already satisfied: networkx>=2.0 in /usr/local/lib/python3.8/dist-packages (from scikit-image->difPy) (2.8.8)
Requirement already satisfied: PyWavelets>=1.1.1 in /usr/local/lib/python3.8/dist-packages (from scikit-image->difPy) (1.4.1)
Requirement already satisfied: scipy>=1.0.1 in /usr/local/lib/python3.8/dist-packages (from scikit-image->difPy) (1.7.3)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.8/dist-packages (from python-dateutil>=2.1->matplotlib->difPy) (1.15.0)

Thanks.

Which files are deleted if comparing and deleting within two folders? (Folder A and Folder B)

I thought it was always the images in the folder given as the 2nd argument, but actually it is not deterministic. In rare cases a file in the first folder argument gets deleted.
It happened only once within 69 deletions.

See example log for dif("F:\outtakes\outtakes_pos\bboxes", "F:\1.0.0\experiment_pos\bbox", recursive=False, delete=True):

...
Deleted file: F:\experiment_pos\bbox\89738260-3d58-4045-ba7d-b534ddff2b82_2.png
Deleted file: F:\experiment_pos\bbox\7edc9a3e-a9f6-4229-bbfc-62cf97495f4e_2.png
Deleted file: F:\outtakes\outtakes_pos\bboxes\77c4d676-050d-43fe-ab9f-13740c77763f_2.png
Deleted file: F:\experiment_pos\bbox\f8e88fc0-f5c3-4a1e-a2e8-18a252bad860_2.png
Deleted file: F:\experiment_pos\bbox\7f9b605c-159a-4a8d-8999-b0223d7ab7d1_2.png
...

search in Sub directories

Hi Elise!

Thank you for existing!

My OneDrive duplicated my library about 4 years ago; that, plus countless backups from WhatsApp and Messenger, left a 550GB mess, yeah you get the point.

I'm really new to coding and git, so I figured I'll post code instead. It's not clean, but I'm pressed for time studying applied data science and working as a product manager.

I have a few more ideas, but the code below was necessary for me right now :)

The code finds photos in all subdirectories (a folder within a folder) in the given file paths.
The code I have added is marked with comments:
#added by Kristofer from
#added by Kristofer to

import skimage.color
import matplotlib.pyplot as plt
import numpy as np
import cv2
import os
import imghdr
import time
import collections
#added kristofer
from pathlib import Path

class dif:

def __init__(self, directory_A, directory_B = None, similarity="normal", px_size=50, sort_output=False, show_output=False, delete=False, silent_del=False):
    """
    directory_A (str)......folder path to search for duplicate/similar images
    directory_B (str)....second folder path to search for duplicate/similar images
    similarity (str)....."normal" = searches for duplicates, recommended setting, MSE < 200
                         "high" = searches for exact duplicates, extremely sensitive to details, MSE < 0.1
                         "low" = searches for similar images, MSE < 1000
    px_size (int)........recommended not to change default value
                         resize images to px_size height x width (in pixels) before being compared
                         the higher the pixel size, the more computational resources and time required 
    sort_output (bool)...False = adds the duplicate images to output dictionary in the order they were found
                         True = sorts the duplicate images in the output dictionary alphabetically 
    show_output (bool)...False = omits the output and doesn't show found images
                         True = shows duplicate/similar images found in output            
    delete (bool)........! please use with care, as this cannot be undone
                         lower resolution duplicate images that were found are automatically deleted
    silent_del (bool)....! please use with care, as this cannot be undone
                         True = skips asking for user confirmation when deleting lower resolution duplicate images
                         will only work if "delete" AND "silent_del" are both == True
    
    OUTPUT (set).........a dictionary with the filename of the duplicate images 
                         and a set of lower resolution images of all duplicates
    """
    start_time = time.time()

   
    if directory_B != None:
        # process both directories
        dif._process_directory(directory_A)
        dif._process_directory(directory_B)
    else:
        # process one directory
        dif._process_directory(directory_A)
        directory_B = directory_A

    all_directories_A = [directory_A]
    all_directories_B = [directory_B]

    #added by Kristofer from
    for path in Path(directory_A).iterdir():
        if path.is_dir():
            all_directories_A.append(path)

    for path in Path(directory_B).iterdir():
        if path.is_dir():
            all_directories_B.append(path)
    
    dif._validate_parameters(sort_output, show_output, similarity, px_size, delete, silent_del)

    for dif_A in all_directories_A:
        for dif_B in all_directories_B:

            directory_A = str(dif_A)
            directory_B = str(dif_B)
    #added by Kristofer to                    
                   
            if directory_B == directory_A:
                result, lower_quality = dif._search_one_dir(directory_A, 
                                                                similarity, px_size, sort_output, show_output, delete)
            else:
                result, lower_quality = dif._search_two_dirs(directory_A, directory_B, 
                                                                similarity, px_size, sort_output, show_output, delete)
                if len(lower_quality) != len(set(lower_quality)):
                    print("DifPy found that there are duplicates within directory A.")
                    
            if sort_output == True:
                result = collections.OrderedDict(sorted(result.items()))
            
            time_elapsed = np.round(time.time() - start_time, 4)
            
            self.result = result
            self.lower_quality = lower_quality
            self.time_elapsed = time_elapsed
            
            if len(result) == 1:
                images = "image"
            else:
                images = "images"
            print("Found", len(result), images, "with one or more duplicate/similar images in", time_elapsed, "seconds.")
            
            if len(result) != 0:
                if delete:
                    if not silent_del:
                        usr = input("Are you sure you want to delete all lower resolution duplicate images? \nThis cannot be undone. (y/n)")
                        if str(usr) == "y":
                            dif._delete_imgs(set(lower_quality))
                        else:
                            print("Image deletion canceled.")
                    else:
                        dif._delete_imgs(set(lower_quality))

                
        
def _search_one_dir(directory_A, similarity="normal", px_size=50, sort_output=False, show_output=False, delete=False):
    
    img_matrices_A, filenames_A = dif._create_imgs_matrix(directory_A, px_size)
    result = {}
    lower_quality = []   
    
    ref = dif._map_similarity(similarity)
    
    # find duplicates/similar images within one folder
    for count_A, imageMatrix_A in enumerate(img_matrices_A):
        for count_B, imageMatrix_B in enumerate(img_matrices_A):
            if count_B != 0 and count_B > count_A and count_A != len(img_matrices_A):      
                rotations = 0
                while rotations <= 3:
                    if rotations != 0:
                        imageMatrix_B = dif._rotate_img(imageMatrix_B)

                    err = dif._mse(imageMatrix_A, imageMatrix_B)
                    if err < ref:
                        if show_output:
                            dif._show_img_figs(imageMatrix_A, imageMatrix_B, err)
                            dif._show_file_info(str("..." + directory_A[-35:]) + "/" + filenames_A[count_A], 
                                               str("..." + directory_A[-35:]) + "/" + filenames_A[count_B])
                        if filenames_A[count_A] in result.keys():
                            result[filenames_A[count_A]]["duplicates"] = result[filenames_A[count_A]]["duplicates"] + [directory_A + "/" + filenames_A[count_B]]
                        else:
                            result[filenames_A[count_A]] = {"location" : directory_A + "/" + filenames_A[count_A],
                                                                "duplicates" : [directory_A + "/" + filenames_A[count_B]]
                                                               }
                        high, low = dif._check_img_quality(directory_A, directory_A, filenames_A[count_A], filenames_A[count_B])
                        lower_quality.append(low)                         
                        break
                    else:
                        rotations += 1    
    if sort_output == True:
        result = collections.OrderedDict(sorted(result.items()))
    return result, lower_quality            

def _search_two_dirs(directory_A, directory_B = None, similarity="normal", px_size=50, sort_output=False, show_output=False, delete=False):

    img_matrices_A, filenames_A = dif._create_imgs_matrix(directory_A, px_size)
    img_matrices_B, filenames_B = dif._create_imgs_matrix(directory_B, px_size)
    
    result = {}
    lower_quality = []   
    
    ref = dif._map_similarity(similarity)
        
    # find duplicates/similar images between two folders
    for count_A, imageMatrix_A in enumerate(img_matrices_A):
        for count_B, imageMatrix_B in enumerate(img_matrices_B):
            rotations = 0
            #print(count_A, count_B)
            while rotations <= 3:

                if rotations != 0:
                    imageMatrix_B = dif._rotate_img(imageMatrix_B)
                    
                err = dif._mse(imageMatrix_A, imageMatrix_B)
                #print(err)
                if err < ref:
                    if show_output:
                        dif._show_img_figs(imageMatrix_A, imageMatrix_B, err)
                        dif._show_file_info(str("..." + directory_A[-35:]) + "/" + filenames_A[count_A], 
                                           str("..." + directory_B[-35:]) + "/" + filenames_B[count_B])
                    
                    if filenames_A[count_A] in result.keys():
                        result[filenames_A[count_A]]["duplicates"] = result[filenames_A[count_A]]["duplicates"] + [directory_B + "/" + filenames_B[count_B]]
                    else:
                        result[filenames_A[count_A]] = {"location" : directory_A + "/" + filenames_A[count_A],
                                                            "duplicates" : [directory_B + "/" + filenames_B[count_B]]
                                                           }
                    high, low = dif._check_img_quality(directory_A, directory_B, filenames_A[count_A], filenames_B[count_B])
                    lower_quality.append(low)                         
                    break
                else:
                    rotations += 1    
            
    if sort_output == True:
        result = collections.OrderedDict(sorted(result.items()))
    return result, lower_quality

def _process_directory(directory):
    # check if directories are valid
    directory += os.sep
    if not os.path.isdir(directory):
        raise FileNotFoundError(f"Directory {directory} does not exist")
    return directory

def _validate_parameters(sort_output, show_output, similarity, px_size, delete, silent_del):
    # validate the parameters of the function
    if sort_output != True and sort_output != False:
        raise ValueError('Invalid value for "sort_output" parameter.')
    if show_output != True and show_output != False:
        raise ValueError('Invalid value for "show_output" parameter.')
    if similarity not in ["low", "normal", "high"]:
        raise ValueError('Invalid value for "similarity" parameter.')
    if px_size < 10 or px_size > 5000:
        raise ValueError('Invalid value for "px_size" parameter.')
    if delete != True and delete != False:
        raise ValueError('Invalid value for "delete" parameter.')   
    if silent_del != True and silent_del != False:
        raise ValueError('Invalid value for "silent_del" parameter.')   

def _create_imgs_matrix(directory, px_size):
    directory = dif._process_directory(directory)
    img_filenames = []
    # create list of all files in directory     
    folder_files = [filename for filename in os.listdir(directory)]

    # create images matrix   
    imgs_matrix = []
    for filename in folder_files:
        path = os.path.join(directory, filename)
        # check if the file is not a folder
        if not os.path.isdir(path):
            try:
                img = cv2.imdecode(np.fromfile(path, dtype=np.uint8), cv2.IMREAD_UNCHANGED)
                if type(img) == np.ndarray:
                    img = img[..., 0:3]
                    img = cv2.resize(img, dsize=(px_size, px_size), interpolation=cv2.INTER_CUBIC)
                    
                    if len(img.shape) == 2:
                        img = skimage.color.gray2rgb(img)
                    imgs_matrix.append(img)
                    img_filenames.append(filename)
            except Exception:  # a bare except would also swallow KeyboardInterrupt
                pass
    return imgs_matrix, img_filenames

def _map_similarity(similarity):
    if similarity == "low":
        ref = 1000
    # search for exact duplicate images, extremely sensitive, MSE < 0.1
    elif similarity == "high":
        ref = 0.1
    # normal, search for duplicates, recommended, MSE < 200
    else:
        ref = 200
    return ref

# Function that calculates the mean squared error (mse) between two image matrices
def _mse(imageA, imageB):
    err = np.sum((imageA.astype("float") - imageB.astype("float")) ** 2)
    err /= float(imageA.shape[0] * imageA.shape[1])
    return err

# Function that plots two compared image files and their mse
def _show_img_figs(imageA, imageB, err):
    fig = plt.figure()
    plt.suptitle("MSE: %.2f" % (err))
    # plot first image
    ax = fig.add_subplot(1, 2, 1)
    plt.imshow(imageA, cmap=plt.cm.gray)
    plt.axis("off")
    # plot second image
    ax = fig.add_subplot(1, 2, 2)
    plt.imshow(imageB, cmap=plt.cm.gray)
    plt.axis("off")
    # show the images
    plt.show()
    
# Function for printing filename info of plotted image files
def _show_file_info(imageA, imageB):
    print("""Duplicate files:\n{} and \n{}
    
    """.format(imageA, imageB))
    
# Function for rotating an image matrix by a 90 degree angle
def _rotate_img(image):
    image = np.rot90(image, k=1, axes=(0, 1))
    return image

# Function for checking the quality of compared images, appends the lower quality image to the list
def _check_img_quality(directoryA, directoryB, imageA, imageB):
    dirA = dif._process_directory(directoryA)
    dirB = dif._process_directory(directoryB)
    size_imgA = os.stat(dirA + imageA).st_size
    size_imgB = os.stat(dirB + imageB).st_size
    if size_imgA >= size_imgB:
        return directoryA + "/" + imageA, directoryB + "/" + imageB
    else:
        return directoryB + "/" + imageB, directoryA + "/" + imageA
    
# Function for deleting the lower quality images that were found after the search    
def _delete_imgs(lower_quality_set):
    deleted = 0
    for file in lower_quality_set:
        print("\nDeletion in progress...", end = "\r")
        try:
            os.remove(file)
            print("Deleted file:", file, end = "\r")
            deleted += 1
        except Exception:
            print("Could not delete file:", file, end = "\r")
    print("\n***\nDeleted", deleted, "images.")


Ctrl-C doesn't work

Hi! I launched my python script by accident and couldn't close it because Ctrl-C doesn't work while difPy is running :/

Search results' keys are just names, but sometimes in sub-folders

Hi there!
I have a folder like this:

folder/
| - IMG_202201.jpg
| - IMG_202202.jpg
| - subfolder/
|  | - IMG_202203.jpg

and I use it as the first arg

I noticed that difPy.dif() search results give me just the file name... without the subfolder noted anywhere 😐

this broke my script with FileNotFoundError: [Errno 2] No such file or directory

[CHANGE REQUEST] replacing 'output directory' with 'move_path'

Hello. First of all I would like to thank you for creating and maintaining this project. It has certainly helped me find a bunch of duplicate images in my enormous gallery.

I discovered this project 3/4 months ago. I needed a way for difPy.py to move my duplicate images to certain directories, but it was not possible. I edited the source code - which was really easy, having little to no Python experience prior to this.

As I recently wanted to make a pull request, I noticed that this repository had been updated, which meant that I had to update my version as well. Along with the updates, I noticed a new output_directory flag, which was only useful if using this program through the command line. I made my changes and would like to introduce my implementation.

Instead of the (now present) output_directory flag, I added move, silent_move and move_path as parameters to the __init__ function. Here are the details:

  • Their default values are (of course) False
  • move and silent_move would be further passed to the _validate_parameters() function
  • After processing directory_A and directory_B, if move was set to true, the move_path would be validated - checked if it was equal to directory_A and/or directory_B, and it would be further passed to the _process_directory() function
  • An appropriate prompt for the silent_move parameter
  • In the _validate_parameters() function, move and delete can not be both true, as well as move and silent_move accepting only boolean values
  • A _move_imgs() function, similar to _delete_imgs(), with appropriate behavior
  • -m, --move, -M, --silent_move, -mp and --move-path CLI flags

The currently implemented output_directory flag only works for the CLI, but not for Python scripts, as it is not passed over to the __init__ function. As a result, I have removed the output_directory flag and replaced it with my move implementation. This version takes both the command line and scripts into account.

I would be happy to submit a pull request with my changes, If this idea sounds good to you, so you can take a better look at how these changes would be implemented.

Looking forward to collaborating and contributing to this project as much as I can.
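As a rough sketch of what the proposed _move_imgs() could look like (the name and signature here are hypothetical, mirroring the existing _delete_imgs(), not the actual pull request):

```python
import os
import shutil

def move_imgs(lower_quality_files, move_path):
    """Hypothetical counterpart to _delete_imgs(): move duplicate
    files into move_path instead of deleting them."""
    os.makedirs(move_path, exist_ok=True)
    moved = 0
    for file in lower_quality_files:
        try:
            shutil.move(file, os.path.join(move_path, os.path.basename(file)))
            moved += 1
        except OSError:
            print("Could not move file:", file)
    print("***\nMoved", moved, "images.")
    return moved
```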

Enhancement: ignore files where file extension is not of a known image type

At present, difPy evaluates every file in the folders it's pointed to, ascertaining whether or not each one is an image. When processing folders that contain many files, most of which are not images, the process takes much longer than it needs to, as every file is assessed.

It would be very useful to be able to call dif with a parameter that tells it to only compare files where the file is a known image type, thus ignoring all files with extensions that are not associated with images.
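A sketch of the kind of pre-filter this would enable (the extension set below is illustrative, not an official list):

```python
import os

# Illustrative set of extensions commonly associated with images
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".bmp", ".tiff", ".webp"}

def filter_by_extension(filenames):
    """Keep only files whose extension looks like a known image type,
    so non-image files are skipped without being opened."""
    return [f for f in filenames
            if os.path.splitext(f)[1].lower() in IMAGE_EXTENSIONS]
```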

image size limit?

Hello, thanks for this, it's very useful.

I'm having a problem analysing a folder of 512x512px transparent .pngs in which only 576px are black.
The black pixels that form my paths should be different in every image, but difPy says they are all the same; see the output attached.
I tried setting similarity to low, but no luck. Are my images too small, or maybe the differences are too subtle?
Is there anything I'm missing here?

Thanks for your help

[Screenshot attached: 2022-01-02 at 23:14:25]

search.delete() always fails (even with matches); nested search.lower_quality dictionary

Hello!

I noticed that delete always fails even though stats and result show matching files.
This is because the lower_quality dictionary ends up nested somehow.

The workaround is to correct that nesting before calling delete.

dif = difPy.build(some_directory)
search = difPy.search(dif)

search.lower_quality = search.lower_quality["lower_quality"]  # the workaround

search.delete(silent_del=True)

I hope to find the time to contribute a fix, but maybe it's a trivial fix for some of you?
Anyway, thanks for this well written and awesome library!

Clustering Possible?

Is it possible to group similar images by extending the code in this library?
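difPy's results are pairwise matches, so grouping them into clusters is essentially a union-find exercise on top of the library. A hedged sketch (the input format below is a simplified list of match pairs, not difPy's actual result schema):

```python
def cluster_matches(pairs):
    """Group pairwise duplicate matches into clusters using
    union-find with path halving. `pairs` is an iterable of
    (file_a, file_b) matches."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)  # union the two sets

    clusters = {}
    for x in list(parent):
        clusters.setdefault(find(x), set()).add(x)
    return [sorted(c) for c in clusters.values()]
```

Transitive matches (A matches B, B matches C) end up in one cluster even if A and C were never directly compared.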

Incorrect results and a few further observations

@elisemercury , I've just pulled and tested your latest commit and have encountered what I assume are bugs:

running python /home/x/git/Duplicate-Image-Finder/difPy/dif.py --directory /mnt/sdc/2tag/ --output_directory /tmp --recursive True --limit_extensions True --show_progress True:

  • In one instance a 3000x3000 file was marked as lower quality than a much smaller image with otherwise the same properties

Edited extract from /tmp/difPy_20230927222221_lower_quality.json:

{"lower_quality": ["/pathtofile/xfolder.jpg"]}
  • if there are x identical (i.e. their md5sum is identical) files of lower quality in the same folder and one of superior quality, difpy only flags one of the lower quality files rather than all of them

  • As an observation: perusal of stats.json shows many instances of "ImageFilterWarning: invalid image extension.", signifying to me that these non-image files are still being assessed rather than skipped according to the --limit_extensions True switch shown above. So it looks like there's a further opportunity to enhance performance by ignoring non-image extensions.

Enhancement - Optional parameter set for source folder / comparison folder mode

My apologies if this is already possible, but I didn't find any method within the documentation at https://difpy.readthedocs.io/en/latest/usage.html.

Currently, the multiple-folder method in difPy searches for duplicates amongst all the folders listed. However, once you have de-duplicated the images within a folder, if you then compare an additional folder against the ones you have already de-duplicated, the de-duplicated folder's contents get compared against themselves again.

It would be nice if there was an optional parameter set where a source folder could be set that multiple comparison folders could check against. Basically, difPy would assume the image contents in the source folder are unique, and only need to be processed against duplicates within the provided comparison folders.

I believe this would help in larger image projects like mine. In my scenario, I've downloaded photos from my partner and my phone's and tablets multiple times throughout the years. As I started de-duplicating with difPy, I would move the de-duplicated files into a central project folder and then scan the next photo dump folder against the project folder. Since the project folder already contains unique images, I don't need difPy to check those images against each other again, but I'm not seeing any existing method to do that. I think this would be a noticeable performance improvement, especially as image sets get larger and larger.

Some images skipped

I was surprised to find that difPy skipped my entire directory of jpeg images. I tracked this down to the call to imghdr.what(). This looks for the string "jpeg" or "exif" near the start of the file. These are not present in my images, so it seems that imghdr.what() cannot be relied upon.

I updated my local copy of the code to do this instead. I expect cv2.imdecode is much better at determining whether a file is a valid image or not.

        img_path = os.path.join(directory, filename)
        if not os.path.isdir(img_path):
            try:
                img = cv2.imdecode(np.fromfile(img_path, dtype=np.uint8), cv2.IMREAD_UNCHANGED)
                [...]
            except Exception:
                pass

Again, a suggestion and not a request.

Any way to determine progress?

Hiya,

I figured I'd see if this package could replace the (paid) "Duplicate Photo Finder" app I'm currently using, which I'm using to weed duplicates from a dir with, typically, about 10,000 images in it. However, without any feedback on how much work has already been done/is still expected, it's very hard to tell if this is going to run over the course of a few minutes (which is fine) or over the course of a few hours (which would not be fine =D).

Is there any way to coax out the progress information?
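Later difPy versions expose a show_progress option (it appears among the CLI flags elsewhere in these issues). As a generic illustration of the idea, a pairwise comparison loop can report its own progress; all names below are made up for the sketch:

```python
from itertools import combinations

def compare_with_progress(files, compare, report_every=100):
    """Run `compare` over all file pairs, printing a crude progress
    readout every `report_every` comparisons."""
    total = len(files) * (len(files) - 1) // 2
    results = []
    for i, (a, b) in enumerate(combinations(files, 2), start=1):
        results.append(compare(a, b))
        if i % report_every == 0 or i == total:
            print(f"{i}/{total} comparisons ({100 * i / total:.1f}%)", end="\r")
    return results
```

The pair count grows quadratically (n·(n-1)/2), which is why a 10,000-image folder means roughly 50 million comparisons; knowing the total up front is what makes a time estimate possible.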

Sort files by date/name/etc

Hi! Thanks for your library, super useful.

I have a suggestion regarding the ordering of the files prior to checking. Having a way to control the order of the files to be checked would be a nice addition for me: I keep adding new files, and I'd like to know whether the recently added files are duplicates, regardless of their names.

For instance 4 different ordering mechanisms could be added:

dif.ORDER_BY_NAME_DESC # default
dif.ORDER_BY_NAME_ASC # ascendant names
dif.ORDER_BY_DATE_DESC
dif.ORDER_BY_DATE_ASC

By default, compare_images could take a param like order=dif.ORDER_BY_NAME_DESC, which the user can then override.
Do you like that behavior?

Greetings! :)
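A sketch of what such an ordering hook could do internally (the function and parameter names are illustrative, not an actual difPy API):

```python
import os

def order_files(paths, by="name", descending=True):
    """Sort candidate files by name or by modification date before
    comparison; `by` and `descending` stand in for the proposed
    ORDER_BY_* constants."""
    if by == "date":
        key = os.path.getmtime
    else:
        key = lambda p: os.path.basename(p).lower()
    return sorted(paths, key=key, reverse=descending)
```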

Minimum requirements not met

Running dif.py when imported packages are not found hard-halts the program. I am not sure if this is intentional, but running it without, say, matplotlib installed throws a ModuleNotFoundError and stops the program. I am not sure if there is a more graceful way to handle missing requirements. You may want to consider listing any non-standard modules on the documentation pages so that they can be installed before attempting usage. If these are already documented somewhere, I apologize for not finding them prior to posting this.

Slow to compare

I tried using this on a folder of 100,000 1280x720 images and it is exceptionally slow (about 12 seconds per image at the comparison stage). At this rate it's going to run for two weeks! Is there anything you can do to speed things up?

Multi-processing

I am currently working on making this project multithreaded, as I have many folders with tens of thousands of images (perhaps 100k+) and want a slightly faster option.

Opening this as a means of communication. If you have a discord account/email that would work better, as I will likely see that before a github issue comment.
My discord account is thecodingchicken#4835 if you would prefer to reach out there.

max depth

It would be nice to have an option to limit the depth of recursive folder scans.
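One way such a limit could work, as a hedged sketch on top of os.walk (the function name is made up; difPy offers no such option):

```python
import os

def walk_with_max_depth(root, max_depth):
    """Yield (dirpath, dirnames, filenames) like os.walk, but stop
    descending once max_depth levels below root are reached."""
    root = root.rstrip(os.sep)
    base_depth = root.count(os.sep)
    for dirpath, dirnames, filenames in os.walk(root):
        yield dirpath, dirnames, filenames
        if dirpath.count(os.sep) - base_depth >= max_depth:
            dirnames[:] = []  # prune deeper sub-folders in place
```

Clearing `dirnames` in place is the documented way to stop os.walk from descending further.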

Match Single Image with Read-Only Directory

Dear Developer,

I'm a noob but still love programming (I have just started), so excuse me if anything below is "obvious" or "incorrectly stated".

I get the gist that this will match all files in the given directory for similarity.

First point: is it possible to match an image (file path passed as a parameter) against a directory (folder path passed as a parameter)? That is, instead of matching all images against all images, we could match just one image against all images in a folder.

Second point: does the function write anything into the search folder (like tensor data)? I'm asking to understand whether this can work on a read-only directory. (I tried reading the code but could not figure it out.)

Third point: if we have to run it multiple times on a large folder, would it take a long time analyzing all files each time, or is it possible to pass a path to a file/folder where it can save the analysis to save time?

Example:

Input_file_path = "~/Downloads/image.jpg"  # any valid image file
Target_Folder_path = "~/A_Readonly_Folder_of_Images"  # a read-only folder with, say, 56,000 (big number?) files to search from
Working_File_or_Folder_path = "~/A_File_or_Folder_with_Read_Write_Access"  # a write-enabled file/folder to save analysis data to/from

E.g.:
If the passed file/folder does not exist, create it and save the analysis data there.
If the passed file/folder does exist, read it and use it instead of analyzing the target folder again.

# calling
dif.compare_image(Input_file_path, Target_folder_path, Working_Folder_path)

Please excuse me if am crossing any limits here. I just became curious about this wonderful concept but I know nothing about github and how it works.

Best Regards
Ashish

All my images are considered invalid

I'm trying to compare 133 .png pictures, but they are all considered invalid by the script.

I don't understand where this can come from, because the script was working with these same pictures before I downloaded the latest version (today).

Moreover, I can open them in Windows.

Thanks for your work anyway!

search = difPy.dif(r"D:\Fichiers conservés\PHOTOS\PNG\1080x1920")

print(search.stats)

{'directory': ['D:\\Fichiers conservés\\PHOTOS\\PNG\\1080x1920'], 'duration': {'start_date': '2023-02-26', 'start_time': '13:21:26', 'end_date': '2023-02-26', 'end_time': '13:21:34', 'seconds_elapsed': 7.862}, 'fast_search': True, 'recursive': True, 'match_mse': 200, 'files_searched': 0, 'matches_found': 0, 'invalid_files': 133}

from: can't read /var/mail/difPy

I'm trying to run a simple script based on the example:

from difPy import dif
search = dif("/Users/myname/Pictures/emoji/")

It doesn't work, and I get:

from: can't read /var/mail/difPy

On macOS 12.4, using Python 3.9.10, just installed the package today.

Path handling improvements

Looking at the code, I see that paths are often joined using "+" and that path separators are assumed to be '/' and inserted as literal text. I needed to put some workarounds in my calling code (on Windows 11) to handle this. It would be more portable and maintainable to use os.path.join() and other portable path-handling functions.

Just a suggestion, not a request.
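For illustration, a portable replacement for concatenation like `directory + "/" + filename` (a generic sketch, not difPy's code):

```python
import os

def join_image_path(directory, filename):
    """Join a directory and filename using the platform's separator
    and normalize the result, instead of hard-coding '/'."""
    return os.path.normpath(os.path.join(directory, filename))
```

On Windows this produces backslash-separated paths, and normpath also collapses doubled or mixed separators, which string concatenation leaves in place.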

v2.4 does not detect duplicates well

Hi @elisemercury, thank you for your work.

I am seeing that with version 2.4 something has changed so that the duplicate detection no longer works properly. I don't know what it could be.
The output format hasn't changed, has it? Is it necessary to adapt our code to the new version?

query about json

Hi,
I'm trying to use the JSON output from search as a variable, using data = json.loads(str(search.result)), but it fails with the error Expecting property name enclosed in double quotes: line 1 column 2 (char 1). I just wondered whether JSON can only be used with file output, and if so, why?

Cheers
Simon
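For what it's worth, that error suggests str() is producing a single-quoted Python repr rather than JSON. Assuming search.result is a plain dict (as it appears to be), serializing with json.dumps avoids the problem; the dict below is a stand-in, not difPy's actual result schema:

```python
import json

result = {"image_a.jpg": {"matches": ["image_b.jpg"]}}  # stand-in for search.result

# str(result) yields a single-quoted Python repr, which is not valid JSON:
try:
    json.loads(str(result))
except json.JSONDecodeError as err:
    print(err)  # Expecting property name enclosed in double quotes: ...

# The result is already a dict, so go through json.dumps instead:
data = json.loads(json.dumps(result))
assert data == result
```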

CL interface

Would be lovely to be able to invoke this directly from command line.

Launching dif.py with the parameters below causes it to terminate

$ python dif.py --directory /qnap/qnap2/zzl/ --output_directory /tmp --recursive True --limit_extensions True --show_progress True

Gave rise to:

Traceback (most recent call last):
  File "/home/x/git/Duplicate-Image-Finder/difPy/dif.py", line 633, in <module>
    dif = build(args.directory, recursive=args.recursive, in_folder=args.in_folder, limit_extensions=args.limit_extensions,px_size=args.px_size, show_progress=args.show_progress, logs=args.logs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/x/git/Duplicate-Image-Finder/difPy/dif.py", line 52, in __init__
    self._tensor_dictionary, self._filename_dictionary, self._id_to_group_dictionary, self._group_to_id_dictionary, self._invalid_files, self._stats = self._main()
                                                                                                                                                       ^^^^^^^^^^^^
  File "/home/x/git/Duplicate-Image-Finder/difPy/dif.py", line 62, in _main
    valid_files, skipped_files = self._get_files()
                                 ^^^^^^^^^^^^^^^^^
  File "/home/x/git/Duplicate-Image-Finder/difPy/dif.py", line 125, in _get_files
    valid_files, skip_files = self._validate_files(files)
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/x/git/Duplicate-Image-Finder/difPy/dif.py", line 135, in _validate_files
    valid_files, skip_files = self._filter_extensions(valid_files)
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/x/git/Duplicate-Image-Finder/difPy/dif.py", line 143, in _filter_extensions
    extensions = [file.split(".")[1].lower() for file in directory_files]
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/x/git/Duplicate-Image-Finder/difPy/dif.py", line 143, in <listcomp>
    extensions = [file.split(".")[1].lower() for file in directory_files]
                  ~~~~~~~~~~~~~~~^^^
IndexError: list index out of range
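The traceback points at file.split(".")[1], which raises IndexError for any file whose name contains no dot (and returns the wrong part for names with several). A sketch of a safer extension helper (illustrative, not the project's actual fix):

```python
import os

def file_extension(filename):
    """Return the lowercase extension without the dot, or '' when the
    file has none. os.path.splitext handles dotless and multi-dot
    names that filename.split('.')[1] trips over."""
    return os.path.splitext(filename)[1].lstrip(".").lower()
```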

Multi-threading

Hi!
I have a nice AMD CPU with 8 cores, and when I'm searching through 2 big folders it takes a lot of time because only one core is being used.

Dividing the work across multiple threads seems like an obvious task for this library - it would be awesome if you implemented it! (Or suggested how it could be done, so someone can open a pull request.)
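As a generic sketch of how the pairwise comparisons could be fanned out across worker processes (difPy v4 does use multiprocessing; the comparison function below is a toy stand-in for its tensor diff, and all names are made up):

```python
from itertools import combinations
from multiprocessing import Pool

def mse(pair):
    """Toy stand-in for a per-pair image comparison: mean squared
    error between two flat pixel sequences."""
    a, b = pair
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def compare_all(tensors, processes=2):
    """Fan all pairwise comparisons out across worker processes."""
    with Pool(processes) as pool:
        return pool.map(mse, combinations(tensors, 2))
```

Processes rather than threads matter here because CPython's GIL prevents CPU-bound threads from running in parallel.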

run the CLI, how?

Hello,

Call me stupid, but I'm trying to run the CLI version of this code. I can run it from a basic script:
from difPy import dif
search = dif("C:/Path/to/Folder/")

and this works.
but if I run it as python dif.py -A "C:/Path/to/Folder_A/"

I get a no such file or directory

And yes, not very familiar with python (yet)

Kind Regards,

Gerrit Kuilder

MemoryError

Apparently there is no memory limit built in, and it will eat as much memory as it can get from Windows.
"Preparing files" completes fine, but searching for the differences eats a lot of memory.
My Windows is set up to automatically manage the pagefile and will happily enlarge it until the drive it's located on is full.
When that happens, the following error appears:

Traceback (most recent call last):
  File "G:\_DOWNLOADS\Duplicate-Image-Finder-4.0.1\difPy\dif.py", line 642, in <module>
    se = search(dif, similarity=args.similarity)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "G:\_DOWNLOADS\Duplicate-Image-Finder-4.0.1\difPy\dif.py", line 242, in __init__
    self.result = self._main()
                  ^^^^^^^^^^^^
  File "G:\_DOWNLOADS\Duplicate-Image-Finder-4.0.1\difPy\dif.py", line 284, in _main
    items.append((id_a, id_b, self.__difpy_obj._tensor_dictionary[id_a], self.__difpy_obj._tensor_dictionary[id_b]))
MemoryError

The program will continue to run but stop without any further error after some minutes. There is no log file produced.

How to reproduce:
Run difPy on the command line (similarity s=90) on a directory with ~60,000 images, with 16GB RAM and limited hard drive space for the pagefile.
Let Windows manage the pagefile size.
Wait for "preparing files" to complete (it will take an hour or so).

I don't know what would happen if the pagefile has a fixed size. I assume the same error will appear.

System:
Windows 10
16GB Ram
difPy 4.0.1

Fail to detect pictures compressed to a lower resolution

I find that it fails to detect pictures compressed to a lower resolution when checked against the original pictures.

This is very easy to reproduce. Here is a high-resolution picture (file size 1.5 MB):

[attached image: 14541697714511_pic_hd]

I compress it to a low-resolution picture (file size 572 KB):

[attached image: 14541697714511_pic_hd, compressed]

But difPy fails to detect them as duplicates.

gracefully skip over deleted files

Sort of related to #16: running this on 10,000 images while another process auto-writes/moves data into and out of the same dir once an hour may cause difPy to try to load images that existed when it built its file list but are gone by the time it actually gets to them. Right now, that causes a hard crash:

Traceback (most recent call last):
  File "d:\temp\diftest.py", line 4, in <module>
    search = dif("./inbox/_reviewed")
  File "d:\temp\venv\lib\site-packages\difPy\dif.py", line 50, in __init__
    result, lower_quality = dif._search_one_dir(directory_A,
  File "d:\temp\venv\lib\site-packages\difPy\dif.py", line 113, in _search_one_dir
    high, low = dif._check_img_quality(directory_A, directory_A, filenames_A[count_A], filenames_A[count_B])
  File "d:\temp\venv\lib\site-packages\difPy\dif.py", line 261, in _check_img_quality
    size_imgA = os.stat(dirA + imageA).st_size
FileNotFoundError: [WinError 2] The system cannot find the file specified: './inbox/_reviewed\\rz3b9hl0yio81.jpg'

Instead, it should probably just print skipping rz3b9hl0yio81.jpg: could not find file (did it get moved/deleted?) and keep running.
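A sketch of the kind of guard that would allow this (illustrative names; the os.stat call mirrors the line in the traceback):

```python
import os

def file_size_or_none(path):
    """Return the file's size, or None if it vanished between listing
    and processing, so the caller can skip it instead of crashing."""
    try:
        return os.stat(path).st_size
    except FileNotFoundError:
        print("skipping", path, "- could not find file (did it get moved/deleted?)")
        return None
```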

ValueError.

Hi there,

I'm trying to run this code on a folder with more than 80k images:

Traceback (most recent call last):
  File ".\difpy.py", line 3, in <module>
    dif.compare_images("PATH TO FOLDER")
  File "C:\Users\user\.conda\envs\gan\lib\site-packages\difPy\dif.py", line 35, in compare_images
    imgs_matrix = dif.create_imgs_matrix(directory, px_size)
  File "C:\Users\user\.conda\envs\gan\lib\site-packages\difPy\dif.py", line 121, in create_imgs_matrix
    imgs_matrix = np.concatenate((imgs_matrix, img))
  File "<__array_function__ internals>", line 6, in concatenate
ValueError: all the input arrays must have same number of dimensions, but the array at index 0 has 3 dimension(s) and the array at index 1 has 2 dimension(s)

what am i doing wrong?

Thanks in advance
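The error says one decoded image is 2-D while the others are 3-D, which typically happens when a grayscale image lands among color images; np.concatenate then refuses to stack them. Assuming that is the cause here, one possible workaround is to expand grayscale arrays to three channels before stacking:

```python
import numpy as np

def ensure_three_channels(img):
    """Expand a 2-D (grayscale) array to 3-D by replicating the single
    channel, so it can be concatenated with RGB (3-D) images."""
    if img.ndim == 2:
        img = np.stack([img] * 3, axis=-1)
    return img
```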

feature request: chunking of source folder

Thank you for your library! Just giving a heads up that I edited one of your previous versions by adding an additional parameter that allows the source folder to be split into n chunks for processing. Scenario: I have image folders that contain over 50,000 images collected sequentially over time.

For me, an image file is most likely to be a duplicate of other image files added around a similar time. Comparing each image against the entire 50,000+ took an enormous amount of time, so I made it possible to split the folder into chunks of 5,000 (for example) and evaluate in sections. It also allowed me to restart from a given position if I had to stop evaluation for some reason. There's a little more that I added to make it more robust (for example, the n+1th chunk also includes some files from the previous chunk, so that there is some degree of overlap). Anyway, this worked out well for me, and if you are still adding to this library I found it very useful.

The route I took is not going to be as robust as going through EVERY image each time but in my personal tests, the performance was close enough and the time savings were significant! Cheers,
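The chunking-with-overlap idea can be sketched in a few lines (a generic illustration under the author's description, not their actual code):

```python
def chunk_with_overlap(files, chunk_size, overlap):
    """Split a file list into chunks of chunk_size, where every chunk
    after the first also re-includes `overlap` trailing files from the
    previous chunk, so duplicates near a boundary are still compared."""
    chunks = []
    for start in range(0, len(files), chunk_size):
        chunks.append(files[max(0, start - overlap):start + chunk_size])
    return chunks
```

Each chunk can then be handed to difPy separately, trading exhaustive cross-chunk comparison for much shorter runs and restartability.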

Silent deletion and subfolder traversal

I'm planning to use this in a semi-automated situation and it would be nice to have the option to suppress the user confirmation for silent deletion.

Will lift some code for now and adapt for my use case. Thanks for this project!

Also, it could be interesting to add subfolder traversal.
