
undouble's Introduction

undouble


The aim of undouble is to detect (near-)identical images. It works in a multi-step process: pre-processing the images (grayscaling, normalizing, and scaling), computing the image hash, and grouping the images. A threshold of 0 groups images with an identical image hash. The results can easily be explored with the plotting functionality, and images can be moved with the move functionality. When moving images, the image with the largest resolution in each group is copied, and all other images are moved to the undouble subdirectory. In case you want to cluster your images, I recommend reading the blog and using the clustimage library.

The following steps are taken in the undouble library:

  • Recursively read all images from the directory with the specified extensions.
  • Compute the image hash.
  • Group similar images.
  • Move the images if desired.

⭐️ Star this repo if you like it ⭐️

Blogs

On the documentation pages you can find detailed information about how undouble works, together with many examples.

Installation

It is advisable to create a new environment (e.g. with Conda).
conda create -n env_undouble python=3.8
conda activate env_undouble
Install undouble from PyPI
pip install undouble            # new install
pip install -U undouble         # update to latest version
Install directly from the GitHub source
pip install git+https://github.com/erdogant/undouble
Import the Undouble package
from undouble import Undouble

Examples:
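A minimal sketch of the typical workflow on a placeholder image directory; the calls follow the steps listed above, and the exact method names and signatures may differ per version:

import numpy as np
from undouble import Undouble

# Initialize with a hash method of choice.
model = Undouble(method='phash', hash_size=8)

# Recursively read all images from the (placeholder) directory.
model.import_data('path/to/images/')

# Compute the image hash per image.
model.compute_hash()

# Group images; threshold=0 groups images with an identical hash.
model.group(threshold=0)

# Plot the groups of (near-)identical images.
model.plot()

# Move the images; the highest-resolution image per group is kept.
model.move()

The move step asks for confirmation before physically moving any files: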

# -------------------------------------------------
# >You are at the point of physically moving files.
# -------------------------------------------------
# >[7] similar images are detected over [3] groups.
# >[4] images will be moved to the [undouble] subdirectory.
# >[3] images will be copied to the [undouble] subdirectory.

# >[C]ontinue moving all files.
# >[W]ait in each directory.
# >[Q]uit
# >Answer: w

The input can be one of the following three types, sketched below:

  • Path to a directory
  • List of file locations
  • Numpy array containing images
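
A minimal sketch of the three input forms; the paths and the array contents are placeholders, and the array shape is an assumption for illustration:

import numpy as np
from undouble import Undouble

model = Undouble()

# 1. Path to a directory (read recursively).
model.import_data('path/to/images/')

# 2. List of file locations.
model.import_data(['img_01.png', 'img_02.png'])

# 3. Numpy array containing images.
model.import_data(np.random.rand(10, 128, 128) * 255)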


Citation

Please cite undouble in your publications if it has been useful for your research (see citation).

Maintainers

Contribute

  • All kinds of contributions are welcome!
  • If you wish to buy me a coffee for this work, it is very much appreciated :)

License

See LICENSE for details.

Other interesting stuff

undouble's People

Contributors

erdogant


undouble's Issues

Request for a minimal reproducible example of tabulated results

Can a minimal reproducible example be provided where undouble is used with:

input: a folder path of images, plus the path of the image we care about
output: a table / pandas DataFrame with the names of all images in the folder, ranked from most similar to least similar
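
One possible approach, sketched independently with the imagehash and pandas libraries; the folder and query paths are placeholders:

import os
import pandas as pd
from PIL import Image
import imagehash

folder = 'path/to/images'    # placeholder: folder of images
query = 'path/to/query.jpg'  # placeholder: the image we care about

# Hash the query image once.
query_hash = imagehash.phash(Image.open(query))

# Hamming distance between the query hash and every image in the folder.
rows = []
for fname in os.listdir(folder):
    try:
        h = imagehash.phash(Image.open(os.path.join(folder, fname)))
    except OSError:
        continue  # skip files that are not readable images
    rows.append({'filename': fname, 'distance': query_hash - h})

# Rank from most similar (distance 0) to least similar.
df = pd.DataFrame(rows).sort_values('distance').reset_index(drop=True)
print(df)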

simple mistake found in undouble.py

def compute_hash(self, method=None, hash_size=None, return_dict=False):

    ....

    if method=='whash-haar':
        if (np.ceil(np.log2(hash_size)) != np.floor(np.log2(hash_size))):
            logger.error('hash_size should be power of 2 (8, 16, 32, 64, ..., etc)')
            return None

When hash_size is not given, its default value (None) is passed to np.log2(hash_size), which fails.
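
A possible guard is sketched below; the fallback value of 8 is an assumption, not the library's documented default:

if method=='whash-haar':
    # Fall back to a hash_size before the power-of-2 check; None would crash np.log2.
    if hash_size is None:
        hash_size = 8
    if (np.ceil(np.log2(hash_size)) != np.floor(np.log2(hash_size))):
        logger.error('hash_size should be power of 2 (8, 16, 32, 64, ..., etc)')
        return None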

Error with `Undouble.group` function and numpy array handling

Hi, I'm interested in matching some compressed photos (extracted from Excel files) to their original high-quality copy, and I came across your package. I've given it a try, setting up a new environment with Miniconda/Python3 and installing undouble via Pip. I followed the workflow outlined in your documentation by importing and computing the image hashes:

import os
from pathlib import Path
from undouble import Undouble

dir_in = "inputs"
dir_out = "outputs"

# Directory including all photos (including compressed ones)
dir_photos = os.path.join(dir_in, "Photos")

# Initialize with default settings
model = Undouble(method="ahash")

# Import the Excel photos
model.import_data(targetdir=dir_photos)

# Compute image hash
model.compute_hash()

# Hashes look blocky and simplistic; are they good enough for matching?
model.plot_hash(idx=10)

# Group images
model.group(threshold=0)

However, model.group returns an error:

Traceback (most recent call last):

  File ~\AppData\Local\miniconda3\envs\env_undouble\lib\site-packages\spyder_kernels\py3compat.py:356 in compat_exec
    exec(code, globals, locals)

  File c:\users\bcaradima\projects\1-imager.py:59
    model.group(threshold = 0)

  File ~\AppData\Local\miniconda3\envs\env_undouble\lib\site-packages\undouble\undouble.py:257 in group
    self.results['select_pathnames'] = np.array(pathnames)[idx].tolist()

ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (6,) + inhomogeneous part.

It appears to be related to numpy array manipulation; I'm using Python 3.8 and numpy 1.24.3. I've verified that numpy is up to date, but I'm wondering whether undouble needs a specific numpy version to work? Thank you for your help.

BTW, does it make sense to store the compressed images together with the rest of the photos, or would you suggest an alternative approach?

Thanks again.

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

I am facing this error on model.import_data(targetdir).

Each image is around 5-8 MB.

Does the image size cause this error?

model = Undouble(method='phash',hash_size=8)
model.import_data(targetdir)

[undouble] >INFO> Extracting images from: [./images/set1/]
[undouble] >INFO> [6032] files are collected recursively from path: [./images/set1/]
[undouble] >INFO> [6032] images are extracted.
[undouble] >INFO> Reading and checking images.
[undouble] >INFO> Reading and checking images.
3%|▎ | 203/6032 [00:26<07:00, 13.85it/s]Corrupt JPEG data: 36 extraneous bytes before marker 0xd9
4%|▍ | 239/6032 [00:31<18:40, 5.17it/s]Corrupt JPEG data: 117 extraneous bytes before marker 0xd9
5%|▌ | 313/6032 [00:41<14:34, 6.54it/s]Corrupt JPEG data: 90 extraneous bytes before marker 0xd9
9%|▊ | 521/6032 [01:03<08:07, 11.30it/s]Corrupt JPEG data: 1895 extraneous bytes before marker 0xd9
11%|█ | 661/6032 [01:17<11:03, 8.10it/s]Corrupt JPEG data: 50 extraneous bytes before marker 0xd9
13%|█▎ | 795/6032 [01:32<08:00, 10.90it/s]Corrupt JPEG data: 47 extraneous bytes before marker 0xd9
18%|█▊ | 1069/6032 [02:03<08:30, 9.72it/s]Corrupt JPEG data: 1076 extraneous bytes before marker 0xd9
19%|█▉ | 1164/6032 [02:11<05:45, 14.08it/s]Invalid SOS parameters for sequential JPEG
21%|██ | 1241/6032 [02:20<12:14, 6.52it/s]Corrupt JPEG data: 51 extraneous bytes before marker 0xd9
26%|██▌ | 1554/6032 [02:57<15:41, 4.76it/s]Corrupt JPEG data: 102 extraneous bytes before marker 0xd9
28%|██▊ | 1669/6032 [03:11<08:54, 8.16it/s]Corrupt JPEG data: 116 extraneous bytes before marker 0xd9
29%|██▉ | 1747/6032 [03:21<06:43, 10.62it/s]Corrupt JPEG data: 1487 extraneous bytes before marker 0xd9
32%|███▏ | 1930/6032 [03:42<05:48, 11.78it/s]Corrupt JPEG data: 43 extraneous bytes before marker 0xd9
35%|███▌ | 2130/6032 [04:04<05:06, 12.73it/s]Corrupt JPEG data: 89 extraneous bytes before marker 0xd9
36%|███▋ | 2196/6032 [04:12<05:38, 11.34it/s][undouble] >WARNING> Scaling not possible.
[undouble] >WARNING> Could not read: [./images/set1/123.jpg]
37%|███▋ | 2248/6032 [04:18<06:33, 9.62it/s][undouble] >WARNING> Scaling not possible.
[undouble] >WARNING> Could not read: [./images/set1/215.jpg]
44%|████▍ | 2666/6032 [05:15<05:57, 9.41it/s]Invalid SOS parameters for sequential JPEG
46%|████▋ | 2791/6032 [05:32<05:42, 9.46it/s][undouble] >WARNING> Scaling not possible.
[undouble] >WARNING> Could not read: [./images/set1/322.jpg]
59%|█████▉ | 3577/6032 [07:22<03:31, 11.61it/s]Invalid SOS parameters for sequential JPEG
61%|██████▏ | 3695/6032 [07:36<04:37, 8.43it/s]Corrupt JPEG data: 1640 extraneous bytes before marker 0xd9
65%|██████▍ | 3913/6032 [08:06<07:28, 4.73it/s]Corrupt JPEG data: 92 extraneous bytes before marker 0xd9
65%|██████▌ | 3945/6032 [08:10<04:11, 8.28it/s]Corrupt JPEG data: 1894 extraneous bytes before marker 0xd9
76%|███████▌ | 4578/6032 [09:31<03:15, 7.43it/s]Corrupt JPEG data: 38 extraneous bytes before marker 0xd9
79%|███████▉ | 4754/6032 [09:52<01:43, 12.40it/s]Invalid SOS parameters for sequential JPEG
86%|████████▌ | 5193/6032 [10:46<02:04, 6.76it/s]Corrupt JPEG data: 41 extraneous bytes before marker 0xd9
86%|████████▌ | 5196/6032 [10:47<01:33, 8.98it/s]Corrupt JPEG data: 41 extraneous bytes before marker 0xd9
86%|████████▋ | 5205/6032 [10:47<01:16, 10.86it/s]Corrupt JPEG data: 364 extraneous bytes before marker 0xd2
94%|█████████▍| 5690/6032 [11:43<00:41, 8.20it/s][undouble] >WARNING> Scaling not possible.
[undouble] >WARNING> Could not read: [./images/set1/445.jpg]
97%|█████████▋| 5855/6032 [12:03<00:13, 13.48it/s]Invalid SOS parameters for sequential JPEG
98%|█████████▊| 5891/6032 [12:09<00:13, 10.52it/s]Corrupt JPEG data: 98 extraneous bytes before marker 0xd9
100%|██████████| 6032/6032 [12:27<00:00, 8.07it/s]

Traceback (most recent call last):
  File "app.py", line 64, in <module>
    model.import_data(targetdir)
  File "/root/.local/lib/python3.6/site-packages/undouble/undouble.py", line 142, in import_data
    self.results = self.clustimage.import_data(self.params['targetdir'], black_list=black_list)
  File "/root/.local/lib/python3.6/site-packages/clustimage/clustimage.py", line 997, in import_data
    X = self.preprocessing(Xraw, grayscale=self.params['cv2_imread_colorscale'], dim=self.params['dim'], flatten=flatten)
  File "/root/.local/lib/python3.6/site-packages/clustimage/clustimage.py", line 833, in preprocessing
    if np.where(np.array(list(map(len, img)))<min_nr_pixels)[0]:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
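
The failing line tests a numpy array directly in a boolean context. A possible fix, sketched under the assumption that the intent is "any image smaller than the minimum pixel count":

# Sketch of a fix for the check in clustimage's preprocessing(): ask
# explicitly whether any image falls below the minimum number of pixels.
too_small = np.array(list(map(len, img))) < min_nr_pixels
if np.any(too_small):
    ...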

import_data from http image_url

model = Undouble(method='phash', hash_size=8)
targetdir = [
'https://www.gardendesign.com/pictures/images/675x529Max/site_3/helianthus-yellow-flower-pixabay_11863.jpg'
]
model.import_data(targetdir)
model.compute_hash()

Could you add a new feature for importing data directly from an HTTP image URL, without downloading it to disk?

Thank you.
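
In the meantime, a workaround sketch that decodes the image in memory and hands it over as a numpy array; that import_data accepts a stacked array is an assumption based on the input types listed above:

import urllib.request
import cv2
import numpy as np
from undouble import Undouble

url = 'https://www.gardendesign.com/pictures/images/675x529Max/site_3/helianthus-yellow-flower-pixabay_11863.jpg'

# Fetch the bytes and decode them in memory; nothing is written to disk.
with urllib.request.urlopen(url) as response:
    buf = np.frombuffer(response.read(), dtype=np.uint8)
img = cv2.imdecode(buf, cv2.IMREAD_COLOR)

# Pass the decoded image(s) as a numpy array.
model = Undouble(method='phash', hash_size=8)
model.import_data(np.array([img]))
model.compute_hash()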

whash-db4 and crop-resistant-hash not working

Hello,

I've seen in the docs on Read the Docs that in addition to ahash, phash, dhash and whash-haar, there are also whash-db4 and crop-resistant-hash.
I'm really interested in the latter, but it appears to not be working: I get a "cannot compute hash" error. Is it not implemented yet?

Error when importing from URL

Hello, I haven't been able to get this library working with a list of URLs as the image source.
Running the example code from the docs ("Import images from url location") fails with the following trace:

/usr/local/lib/python3.10/dist-packages/undouble/undouble.py in import_data(self, targetdir, black_list, return_results)
    165         # logger.info("Retrieving files from: [%s]" %(self.params['targetdir']))
    166         # Preprocessing the images the get them in the right scale and size etc
--> 167         self.results = self.clustimage.import_data(self.params['targetdir'], black_list=black_list)
    168         # Remove keys that are not used.
    169         if 'labels' in self.results: self.results.pop('labels')

/usr/local/lib/python3.10/dist-packages/clustimage/clustimage.py in import_data(self, Xraw, flatten, black_list)
    992             Xraw = url2disk(Xraw, self.params['tempdir'])
    993             # Do not store in the object if the find functionality is used
--> 994             X = self.preprocessing(Xraw['pathnames'], grayscale=self.params['cv2_imread_colorscale'], dim=self.params['dim'], flatten=flatten)
    995             # Add the url location
    996             if Xraw['url'] is not None:

/usr/local/lib/python3.10/dist-packages/clustimage/clustimage.py in preprocessing(self, pathnames, grayscale, dim, flatten)
    803         # Read and preprocess data
    804         imgs = list(map(lambda x: self.imread(x, colorscale=grayscale, dim=dim, flatten=flatten, return_succes=True), tqdm(pathnames, disable=disable_tqdm(), desc='[clustimage]')))
--> 805         img, imgOK = zip(*imgs)
    806         img = np.array(img)
    807 

ValueError: not enough values to unpack (expected 2, got 0)

Thanks

/tmp folder cleanup needed

Hello,

I have found some kind of bug in undouble lib.

After using the undouble library many times (on the order of millions of calls), tons of empty temporary folders are created in /tmp (on Linux) and never removed.
This leads to system failure because of a lack of free inodes.

It is caused by "tempfile.mkdtemp()" in the clustimage.py source code.

I think you have to remove the temp folder in the destructor of the class.

Thanks.
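
A possible fix, sketched with the standard library: register a cleanup hook when the temporary directory is created (whether this belongs in the constructor or a destructor is a design choice):

import atexit
import shutil
import tempfile

# Create the temp dir as before, but make sure it is removed when the
# interpreter exits, so repeated runs do not exhaust /tmp inodes.
tempdir = tempfile.mkdtemp()
atexit.register(shutil.rmtree, tempdir, ignore_errors=True)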
