
undouble's Introduction

undouble


The aim of undouble is to detect (near-)identical images. It works in a multi-step process: pre-processing the images (grayscaling, normalizing, and scaling), computing the image hash, and grouping the images. A threshold of 0 groups images with an identical image hash. The results can easily be explored with the plotting functionality, and images can be moved with the move functionality. When moving images, the image with the largest resolution in each group is copied, and all other images are moved to the undouble subdirectory. In case you want to cluster your images, I recommend reading the blog and using the clustimage library.

The following steps are taken in the undouble library:

  • Recursively read all images from the directory with the specified extensions.
  • Compute the image hash.
  • Group similar images.
  • Move the images if desired.

⭐️ Star this repo if you like it ⭐️

Blogs

On the documentation pages you can find detailed information about how undouble works, together with many examples.

Installation

It is advisable to create a new environment (e.g. with Conda).
conda create -n env_undouble python=3.8
conda activate env_undouble
Install undouble from PyPI
pip install undouble            # new install
pip install -U undouble         # update to latest version
Install directly from the GitHub source
pip install git+https://github.com/erdogant/undouble
Import the Undouble package
from undouble import Undouble

Examples:
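A minimal sketch of the typical workflow on a placeholder image directory; the calls follow the steps listed above, and the exact method names and signatures may differ per version:

import numpy as np
from undouble import Undouble

# Initialize with a hash method of choice.
model = Undouble(method='phash', hash_size=8)

# Recursively read all images from the (placeholder) directory.
model.import_data('path/to/images/')

# Compute the image hash per image.
model.compute_hash()

# Group images; threshold=0 groups images with an identical hash.
model.group(threshold=0)

# Plot the groups of (near-)identical images.
model.plot()

# Move the images; the highest-resolution image per group is kept.
model.move()

The move step asks for confirmation before physically moving any files: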

# -------------------------------------------------
# >You are at the point of physically moving files.
# -------------------------------------------------
# >[7] similar images are detected over [3] groups.
# >[4] images will be moved to the [undouble] subdirectory.
# >[3] images will be copied to the [undouble] subdirectory.

# >[C]ontinue moving all files.
# >[W]ait in each directory.
# >[Q]uit
# >Answer: w

The input can be one of the following three types, sketched below:

  • Path to a directory
  • List of file locations
  • Numpy array containing images
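
A minimal sketch of the three input forms; the paths and the array contents are placeholders, and the array shape is an assumption for illustration:

import numpy as np
from undouble import Undouble

model = Undouble()

# 1. Path to a directory (read recursively).
model.import_data('path/to/images/')

# 2. List of file locations.
model.import_data(['img_01.png', 'img_02.png'])

# 3. Numpy array containing images.
model.import_data(np.random.rand(10, 128, 128) * 255)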


Citation

Please cite undouble in your publications if it has been useful for your research (see citation).

Maintainers

Contribute

  • All kinds of contributions are welcome!
  • If you wish to buy me a coffee for this work, it is very much appreciated :)

License

See LICENSE for details.

Other interesting stuff

undouble's People

Contributors

erdogant


undouble's Issues

Request for a minimal reproducible example of tabulated results

Can a minimal reproducible example be provided where undouble is used with:

input: a folder path of images, plus the path of the image we care about
output: a table / pandas DataFrame with the names of all images in the folder, ranked from most similar to least similar
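
One possible approach, sketched independently with the imagehash and pandas libraries; the folder and query paths are placeholders:

import os
import pandas as pd
from PIL import Image
import imagehash

folder = 'path/to/images'    # placeholder: folder of images
query = 'path/to/query.jpg'  # placeholder: the image we care about

# Hash the query image once.
query_hash = imagehash.phash(Image.open(query))

# Hamming distance between the query hash and every image in the folder.
rows = []
for fname in os.listdir(folder):
    try:
        h = imagehash.phash(Image.open(os.path.join(folder, fname)))
    except OSError:
        continue  # skip files that are not readable images
    rows.append({'filename': fname, 'distance': query_hash - h})

# Rank from most similar (distance 0) to least similar.
df = pd.DataFrame(rows).sort_values('distance').reset_index(drop=True)
print(df)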

simple mistake found in undouble.py

def compute_hash(self, method=None, hash_size=None, return_dict=False):

    ....

    if method=='whash-haar':
        if (np.ceil(np.log2(hash_size)) != np.floor(np.log2(hash_size))):
            logger.error('hash_size should be power of 2 (8, 16, 32, 64, ..., etc)')
            return None

When hash_size is not given, its default value (None) is passed to np.log2(hash_size), which fails.
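
A possible guard is sketched below; the fallback value of 8 is an assumption, not the library's documented default:

if method=='whash-haar':
    # Fall back to a hash_size before the power-of-2 check; None would crash np.log2.
    if hash_size is None:
        hash_size = 8
    if (np.ceil(np.log2(hash_size)) != np.floor(np.log2(hash_size))):
        logger.error('hash_size should be power of 2 (8, 16, 32, 64, ..., etc)')
        return None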

Error with `Undouble.group` function and numpy array handling

Hi, I'm interested in matching some compressed photos (extracted from Excel files) to their original high-quality copy, and I came across your package. I've given it a try, setting up a new environment with Miniconda/Python3 and installing undouble via Pip. I followed the workflow outlined in your documentation by importing and computing the image hashes:

import os
from pathlib import Path
from undouble import Undouble

dir_in = "inputs"
dir_out = "outputs"

# Directory including all photos (including compressed ones)
dir_photos = os.path.join(dir_in, "Photos")

# Initialize with default settings
model = Undouble(method="ahash")

# Import the Excel photos
model.import_data(targetdir=dir_photos)

# Compute image hash
model.compute_hash()

# Hashes look blocky and simplistic; are they good enough for matching?
model.plot_hash(idx=10)

# Group images
model.group(threshold=0)

However, model.group returns an error:

Traceback (most recent call last):

  File ~\AppData\Local\miniconda3\envs\env_undouble\lib\site-packages\spyder_kernels\py3compat.py:356 in compat_exec
    exec(code, globals, locals)

  File c:\users\bcaradima\projects\1-imager.py:59
    model.group(threshold = 0)

  File ~\AppData\Local\miniconda3\envs\env_undouble\lib\site-packages\undouble\undouble.py:257 in group
    self.results['select_pathnames'] = np.array(pathnames)[idx].tolist()

ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (6,) + inhomogeneous part.

It appears to be related to numpy array manipulation; I'm using Python 3.8 and numpy 1.24.3. I've verified that numpy is up to date, but I'm wondering whether undouble needs a specific numpy version to work? Thank you for your help.

BTW, does it make sense to store the compressed images together with the rest of the photos, or would you suggest an alternative approach?

Thanks again.

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

I am facing this error on model.import_data(targetdir).

Each image is around 5-8 MB.

Does the image size cause this error?

model = Undouble(method='phash',hash_size=8)
model.import_data(targetdir)

[undouble] >INFO> Extracting images from: [./images/set1/]
[undouble] >INFO> [6032] files are collected recursively from path: [./images/set1/]
[undouble] >INFO> [6032] images are extracted.
[undouble] >INFO> Reading and checking images.
[undouble] >INFO> Reading and checking images.
3%|▎ | 203/6032 [00:26<07:00, 13.85it/s]Corrupt JPEG data: 36 extraneous bytes before marker 0xd9
4%|▍ | 239/6032 [00:31<18:40, 5.17it/s]Corrupt JPEG data: 117 extraneous bytes before marker 0xd9
5%|▌ | 313/6032 [00:41<14:34, 6.54it/s]Corrupt JPEG data: 90 extraneous bytes before marker 0xd9
9%|▊ | 521/6032 [01:03<08:07, 11.30it/s]Corrupt JPEG data: 1895 extraneous bytes before marker 0xd9
11%|█ | 661/6032 [01:17<11:03, 8.10it/s]Corrupt JPEG data: 50 extraneous bytes before marker 0xd9
13%|█▎ | 795/6032 [01:32<08:00, 10.90it/s]Corrupt JPEG data: 47 extraneous bytes before marker 0xd9
18%|█▊ | 1069/6032 [02:03<08:30, 9.72it/s]Corrupt JPEG data: 1076 extraneous bytes before marker 0xd9
19%|█▉ | 1164/6032 [02:11<05:45, 14.08it/s]Invalid SOS parameters for sequential JPEG
21%|██ | 1241/6032 [02:20<12:14, 6.52it/s]Corrupt JPEG data: 51 extraneous bytes before marker 0xd9
26%|██▌ | 1554/6032 [02:57<15:41, 4.76it/s]Corrupt JPEG data: 102 extraneous bytes before marker 0xd9
28%|██▊ | 1669/6032 [03:11<08:54, 8.16it/s]Corrupt JPEG data: 116 extraneous bytes before marker 0xd9
29%|██▉ | 1747/6032 [03:21<06:43, 10.62it/s]Corrupt JPEG data: 1487 extraneous bytes before marker 0xd9
32%|███▏ | 1930/6032 [03:42<05:48, 11.78it/s]Corrupt JPEG data: 43 extraneous bytes before marker 0xd9
35%|███▌ | 2130/6032 [04:04<05:06, 12.73it/s]Corrupt JPEG data: 89 extraneous bytes before marker 0xd9
36%|███▋ | 2196/6032 [04:12<05:38, 11.34it/s][undouble] >WARNING> Scaling not possible.
[undouble] >WARNING> Could not read: [./images/set1/123.jpg]
37%|███▋ | 2248/6032 [04:18<06:33, 9.62it/s][undouble] >WARNING> Scaling not possible.
[undouble] >WARNING> Could not read: [./images/set1/215.jpg]
44%|████▍ | 2666/6032 [05:15<05:57, 9.41it/s]Invalid SOS parameters for sequential JPEG
46%|████▋ | 2791/6032 [05:32<05:42, 9.46it/s][undouble] >WARNING> Scaling not possible.
[undouble] >WARNING> Could not read: [./images/set1/322.jpg]
59%|█████▉ | 3577/6032 [07:22<03:31, 11.61it/s]Invalid SOS parameters for sequential JPEG
61%|██████▏ | 3695/6032 [07:36<04:37, 8.43it/s]Corrupt JPEG data: 1640 extraneous bytes before marker 0xd9
65%|██████▍ | 3913/6032 [08:06<07:28, 4.73it/s]Corrupt JPEG data: 92 extraneous bytes before marker 0xd9
65%|██████▌ | 3945/6032 [08:10<04:11, 8.28it/s]Corrupt JPEG data: 1894 extraneous bytes before marker 0xd9
76%|███████▌ | 4578/6032 [09:31<03:15, 7.43it/s]Corrupt JPEG data: 38 extraneous bytes before marker 0xd9
79%|███████▉ | 4754/6032 [09:52<01:43, 12.40it/s]Invalid SOS parameters for sequential JPEG
86%|████████▌ | 5193/6032 [10:46<02:04, 6.76it/s]Corrupt JPEG data: 41 extraneous bytes before marker 0xd9
86%|████████▌ | 5196/6032 [10:47<01:33, 8.98it/s]Corrupt JPEG data: 41 extraneous bytes before marker 0xd9
86%|████████▋ | 5205/6032 [10:47<01:16, 10.86it/s]Corrupt JPEG data: 364 extraneous bytes before marker 0xd2
94%|█████████▍| 5690/6032 [11:43<00:41, 8.20it/s][undouble] >WARNING> Scaling not possible.
[undouble] >WARNING> Could not read: [./images/set1/445.jpg]
97%|█████████▋| 5855/6032 [12:03<00:13, 13.48it/s]Invalid SOS parameters for sequential JPEG
98%|█████████▊| 5891/6032 [12:09<00:13, 10.52it/s]Corrupt JPEG data: 98 extraneous bytes before marker 0xd9
100%|██████████| 6032/6032 [12:27<00:00, 8.07it/s]

Traceback (most recent call last):
  File "app.py", line 64, in <module>
    model.import_data(targetdir)
  File "/root/.local/lib/python3.6/site-packages/undouble/undouble.py", line 142, in import_data
    self.results = self.clustimage.import_data(self.params['targetdir'], black_list=black_list)
  File "/root/.local/lib/python3.6/site-packages/clustimage/clustimage.py", line 997, in import_data
    X = self.preprocessing(Xraw, grayscale=self.params['cv2_imread_colorscale'], dim=self.params['dim'], flatten=flatten)
  File "/root/.local/lib/python3.6/site-packages/clustimage/clustimage.py", line 833, in preprocessing
    if np.where(np.array(list(map(len, img)))<min_nr_pixels)[0]:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
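
The failing line tests a numpy array directly in a boolean context. A possible fix, sketched under the assumption that the intent is "any image smaller than the minimum pixel count":

# Sketch of a fix for the check in clustimage's preprocessing(): ask
# explicitly whether any image falls below the minimum number of pixels.
too_small = np.array(list(map(len, img))) < min_nr_pixels
if np.any(too_small):
    ...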

import_data from http image_url

model = Undouble(method='phash', hash_size=8)
targetdir = [
'https://www.gardendesign.com/pictures/images/675x529Max/site_3/helianthus-yellow-flower-pixabay_11863.jpg'
]
model.import_data(targetdir)
model.compute_hash()

Could you add a new feature for importing data directly from an HTTP image URL, without downloading it to disk?

Thank you.
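
In the meantime, a workaround sketch that decodes the image in memory and hands it over as a numpy array; that import_data accepts a stacked array is an assumption based on the input types listed above:

import urllib.request
import cv2
import numpy as np
from undouble import Undouble

url = 'https://www.gardendesign.com/pictures/images/675x529Max/site_3/helianthus-yellow-flower-pixabay_11863.jpg'

# Fetch the bytes and decode them in memory; nothing is written to disk.
with urllib.request.urlopen(url) as response:
    buf = np.frombuffer(response.read(), dtype=np.uint8)
img = cv2.imdecode(buf, cv2.IMREAD_COLOR)

# Pass the decoded image(s) as a numpy array.
model = Undouble(method='phash', hash_size=8)
model.import_data(np.array([img]))
model.compute_hash()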

whash-db4 and crop-resistant-hash not working

Hello,

I've seen in the docs on Read the Docs that in addition to ahash, phash, dhash and whash-haar, there are also whash-db4 and crop-resistant-hash.
I'm really interested in the latter, but it appears to not be working: I get a "cannot compute hash" error. Is it not implemented yet?

Error when importing from URL

Hello, I haven't been able to get this library working with a list of URLs as the image source.
Running the example code from the docs ("Import images from url location") fails with the following trace:

/usr/local/lib/python3.10/dist-packages/undouble/undouble.py in import_data(self, targetdir, black_list, return_results)
    165         # logger.info("Retrieving files from: [%s]" %(self.params['targetdir']))
    166         # Preprocessing the images the get them in the right scale and size etc
--> 167         self.results = self.clustimage.import_data(self.params['targetdir'], black_list=black_list)
    168         # Remove keys that are not used.
    169         if 'labels' in self.results: self.results.pop('labels')

/usr/local/lib/python3.10/dist-packages/clustimage/clustimage.py in import_data(self, Xraw, flatten, black_list)
    992             Xraw = url2disk(Xraw, self.params['tempdir'])
    993             # Do not store in the object if the find functionality is used
--> 994             X = self.preprocessing(Xraw['pathnames'], grayscale=self.params['cv2_imread_colorscale'], dim=self.params['dim'], flatten=flatten)
    995             # Add the url location
    996             if Xraw['url'] is not None:

/usr/local/lib/python3.10/dist-packages/clustimage/clustimage.py in preprocessing(self, pathnames, grayscale, dim, flatten)
    803         # Read and preprocess data
    804         imgs = list(map(lambda x: self.imread(x, colorscale=grayscale, dim=dim, flatten=flatten, return_succes=True), tqdm(pathnames, disable=disable_tqdm(), desc='[clustimage]')))
--> 805         img, imgOK = zip(*imgs)
    806         img = np.array(img)
    807 

ValueError: not enough values to unpack (expected 2, got 0)

Thanks

/tmp folder cleanup needed

Hello,

I have found some kind of bug in undouble lib.

After using the undouble library many times (on the order of millions of calls), tons of empty temporary folders are created in /tmp (on Linux) and never removed.
This leads to system failure because of a lack of free inodes.

It is caused by "tempfile.mkdtemp()" in the clustimage.py source code.

I think you have to remove the temp folder in the destructor of the class.

Thanks.
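
A possible fix, sketched with the standard library: register a cleanup hook when the temporary directory is created (whether this belongs in the constructor or a destructor is a design choice):

import atexit
import shutil
import tempfile

# Create the temp dir as before, but make sure it is removed when the
# interpreter exits, so repeated runs do not exhaust /tmp inodes.
tempdir = tempfile.mkdtemp()
atexit.register(shutil.rmtree, tempdir, ignore_errors=True)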
