A simple CLI tool for detecting and removing duplicate images from a dataset.
The tool uses image hashing, specifically difference hashing (dHash), to identify duplicates in a folder.
> git clone https://github.com/chinvib66/detect-duplicate-img.git
> cd detect-duplicate-img
> pipenv install
> pipenv shell
> python .\cli.py --dataset \path\to\img\dataset --remove 0
- --dataset: Absolute path to your dataset folder
- --remove: Set to 1 to permanently remove the duplicates; set to 0 to only detect them without removing anything
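As a rough illustration of how these two options could be parsed, here is a minimal sketch using Python's standard `argparse` module. The function name `build_parser`, the default of 0 for `--remove`, and the help strings are assumptions made for this example; the actual `cli.py` may be written differently.

```python
import argparse


def build_parser():
    """Return a parser for the two documented options (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        description="Detect and optionally remove duplicate images."
    )
    # Absolute path to the folder that holds the images.
    parser.add_argument("--dataset", required=True,
                        help="absolute path to the dataset folder")
    # 1 = delete duplicates, 0 = only report them.
    # The default of 0 is an assumption, chosen so nothing is deleted by accident.
    parser.add_argument("--remove", type=int, choices=[0, 1], default=0,
                        help="1 to permanently remove duplicates, 0 to only detect")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.dataset, args.remove)
```

Restricting `--remove` to 0 or 1 makes a mistyped value fail early instead of silently deleting files.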
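Steps:
- Convert the image to grayscale
- Resize it to 9x8, so that each row of 9 pixels yields 8 adjacent-pixel comparisons (8 rows x 8 comparisons = 64 bits)
- Compute the difference between horizontally adjacent pixels
- Build the hash by storing one bit per comparison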
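The steps above can be sketched in a few lines of Python. This is an illustrative sketch, not the code from this repository: the function names are hypothetical, the bit convention (a bit is 1 when the left pixel is brighter than the right one) is an assumption, and `dhash_from_file` assumes the Pillow library is installed for loading and resizing.

```python
def dhash_from_pixels(rows):
    """Build a difference hash from rows of grayscale values.

    `rows` is expected to hold 8 rows of 9 values each, giving 64 bits.
    """
    bits = 0
    for row in rows:
        # Compare each pixel with its right-hand neighbour.
        for left, right in zip(row, row[1:]):
            # Assumed convention: bit is 1 when the left pixel is brighter.
            bits = (bits << 1) | (1 if left > right else 0)
    return bits


def dhash_from_file(path):
    """Load an image, convert to grayscale, resize to 9x8, and hash it."""
    from PIL import Image  # imported lazily; requires Pillow

    with Image.open(path) as img:
        # resize takes (width, height); mode "L" gives one byte per pixel.
        data = img.convert("L").resize((9, 8)).tobytes()
    rows = [list(data[r * 9:(r + 1) * 9]) for r in range(8)]
    return dhash_from_pixels(rows)
```

Two images with the same hash value can then be treated as duplicate candidates, for example by grouping file paths in a dictionary keyed by hash.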
Tutorial referred: