erdogant / clustimage

clustimage is a python package for unsupervised clustering of images.

Home Page: https://erdogant.github.io/clustimage

License: Other

Python 0.60% Shell 0.01% Jupyter Notebook 99.40%

clustering image-analysis image-processing python3

clustimage's Introduction

Hi there! I am sharing my knowledge with the world through my blogs and open-source GitHub projects.

Your ❤️ is important to keep maintaining my packages. It is awesome that there are already millions of downloads, but to keep the libraries alive I often need to make all kinds of fixes. This can eat up my entire weekends and evenings. Yes, I do this for free and in my free time! There are various ways you can help. You can report bugs or issues, or even better, help out with fixing bugs or adding new features! If you don't have the time, or you are still learning, you can also take a Medium Membership using my referral link to keep reading all my hands-on blogs and learn more :-) If you don't need that, there is always an easy way with Coffee :-) Cheers!

A structured list of my repos

All Repos can be found in the Repositories section. If Sphinx pages are available, the link will directly go to the documentation pages.

Statistics	Machine learning	(Time)Series	Visualization	Utils	API
bnlearn	clusteval	findpeaks	d3graph	df2onehot	googletrends
hnet	classeval	temporalrank	d3heatmap	pypickle	slacki
distfit	hgboost	caerus	treeplot	ismember
pca	clustimage		kaplanmeier	irelease
thompson	undouble		flameplot	pypiplot
benfordslaw			worldmap	dicter
			colourmap
			imagesc
			scatterd
			d3blocks

Find here my Pypi download stats

Overview of open issues

clustimage's People


wx-b seanahmad volodos621 yurongzhu tdl77 agbleze gongjizhang

clustimage's Issues

Trying to retrieve original file path names in the results

Is there a way to be able to get the original pathnames of images used post fit_transform?

I am uploading images onto google colab, and reading them in by their filepaths as "/content/name_of_image", and then I wish to be able to recover this "/content/name_of_image" post running clustering.

I tried to extract pathnames per label using the following code, but seemed to be getting the filepaths for images created in a temporary directory as follows:

CODE
Iloc = cl.results['labels']==0
cl.results['pathnames'][Iloc]

OUTPUT
array(['/tmp/clustimage/8732cb41-c72d-4266-b164-ff453d68428a.png',
'/tmp/clustimage/440fecd8-8a9c-49a0-b100-ccfb66107425.png',
'/tmp/clustimage/3c9c38d8-4da9-4e4f-9130-d3836182b8c6.png',
'/tmp/clustimage/85cc4848-1faf-44ea-ae4c-9d9d88bd6323.png',
'/tmp/clustimage/6127e4fb-1c25-4ba9-8d68-56ef482e3db4.png',
'/tmp/clustimage/abcf85e0-af1a-48f1-8861-122122b64e32.png',
'/tmp/clustimage/275bbde0-394d-4ba4-b4d0-1c67da323c8b.png',
'/tmp/clustimage/30b62285-2628-45c0-86b2-fea305cb8db3.png',
'/tmp/clustimage/c47a6867-3c8f-480c-a7bd-b3e7ec4ba334.png',
'/tmp/clustimage/da5c17fc-de2a-4375-b03c-066a0904428a.png'], dtype='<U56')

I wish to get the output as the original filenames that were in the pathnames list.

Small question about recommended usage

Hello, first of all, thank you for your work; your libraries are amazing!

I didn't know how to contact you properly, and an issue is probably the wrong way to do it, so feel free to close it without answering if you feel like it.

Anyway, I just wanted to ask you a question. I need to group similar images. The images can be faces, screenshots, drawings, memes, etc., so very different kinds. Some of them are variations of each other, either small (light crop, lighting, ...) or big (bigger crop, text added, etc.), and I'm trying to find a way to group them. Until now I was using your other library (undouble), which was working fine, but sometimes the grouping functionality excluded images that were really close (when computing the ahashes manually, these images all had the same ahash, but they were not grouped by undouble.group, which is odd).

So I started trying to use clustimage and I'm a bit overwhelmed: there are so many functionalities, ways of computing features, distances, evaluating the clusters, etc.

I've read your Medium article on clustimage, which helps a bit, and I know you're saying one should choose the parameters according to the research question, but I'm no data scientist and I'm a bit lost. My plan right now would be to write a script that iterates over all the possible parameters of clustimage and computes a score based on the image grouping that I've made manually. But I think there must be a smarter way to proceed.

So in other words, my question is: do you recommend any particular set of methods and parameters to group variations of images that can be of very different types?

Thank you in advance and have a good day!

Find function errors

Hello,

The find function has stopped working. In fact, if I use "pca" I get this error:

If I use "hog" I get another type of error.

It was working before and I didn't change anything! Could you please help me resolve the problem?

How does one load images not from cl.import_example

Thank you for this cool library.

How does one load images from sources other than your cl.import_example(data='flowers')?

My images are in a MySQL database. If needed, I can put them into a directory.

Also, I'd appreciate recommended settings to group images like the following.

Trainability

Hello,

I just want to know whether, after running the model (with different parameters) on similar datasets, the model will learn from one run to another, like a classical NN where the weights are updated at each iteration.

If, for example, I run the code with different parameters but only save the last pkl file, am I only saving the weights of the last run, or am I saving the whole thing?

I hope I explained myself well.

Logging

Hi there,

I'm getting logging messages on the console, and they interfere with messages that I'm already printing to the console. Moreover, I can't see why a module should automatically log to the console if it wasn't asked to do so in the first place. I can see the following block in the clustimage.py file, just after the imports:

# Configure the logger
logger = logging.getLogger('')
[logger.removeHandler(handler) for handler in logger.handlers[:]]
console = logging.StreamHandler()
formatter = logging.Formatter('[clustimage] >%(levelname)s> %(message)s')
console.setFormatter(formatter)
logger.addHandler(console)
logger = logging.getLogger()

It seems that the logger is being configured without giving the user the opportunity to turn it off completely. Do I need to monkeypatch clustimage in order to get rid of it?

Also consider what would happen if every other Python module out there took a similar approach to logging: the console would get polluted with messages that we might not want to see at all.
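For context, a common convention described in the Python logging HOWTO is for a library to log to its own named logger with only a NullHandler attached, and to leave handler configuration to the application. The sketch below illustrates that convention using only the standard library. The logger name "clustimage" and the function do_work are assumptions made for illustration; the block quoted above configures the root logger, so this is not a description of how clustimage currently behaves.

```python
import io
import logging

# Library side (hypothetical): use a named logger and attach only a
# NullHandler, so nothing is printed unless the application opts in.
lib_logger = logging.getLogger("clustimage")
lib_logger.addHandler(logging.NullHandler())


def do_work():
    """Stand-in for a library call that emits a log record."""
    lib_logger.info("processing images")
    return "done"


# Application side: opt in by attaching a handler to the named logger.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("[%(name)s] >%(levelname)s> %(message)s"))
lib_logger.addHandler(handler)
lib_logger.setLevel(logging.INFO)
do_work()
shown = stream.getvalue()

# Application side: silence the library again by raising the logger's level
# above CRITICAL; later records are dropped before reaching any handler.
lib_logger.setLevel(logging.CRITICAL + 1)
do_work()
after_silencing = stream.getvalue()
```

With this arrangement, the application decides whether messages appear, and a single setLevel call on the named logger is enough to mute them.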

Thanks for your time,
Lucas.

Memory Error during import_data

Hi,

I am encountering a MemoryError during the import_data step. It is thrown after the images have been loaded. Is there any way I can figure out what the issue is?

Traceback (most recent call last):
  File "/home/stg/prod/combine_clustimage.py", line 56, in <module>
    results = cl.fit_transform(targetdir)
  File "/home/stg/.local/lib/python3.10/site-packages/clustimage/clustimage.py", line 352, in fit_transform
    _ = self.import_data(X, black_list=black_list)
  File "/home/stg/.local/lib/python3.10/site-packages/clustimage/clustimage.py", line 992, in import_data
    X = self.preprocessing(Xraw['pathnames'], grayscale=self.params['cv2_imread_colorscale'], dim=self.params['dim'], flatten=flatten)
  File "/home/stg/.local/lib/python3.10/site-packages/clustimage/clustimage.py", line 806, in preprocessing
    img, imgOK = zip(*imgs)
MemoryError

Label names mixed up in results and results_unique

Hi @erdogant,

First, thank you very much for the work you put into this project and for making it public! When experimenting with your code base, I came across an error when accessing the 'labels' of cl.results and cl.results_unique. It seems to me that the labels somehow get mixed up between the two. Here is an example:

'img_name_1' is assigned the label '2' in cl.results_unique['labels']. However, when I iterate over cl.results['labels'] and search for the file with the same name 'img_name_1', this image (sometimes) belongs to a different label, let's say '5'.

My goal is to extract random images and the most centered (unique) image per cluster label, which is why I would like to match both sets of labels. Do you maybe have a different idea of how I could do it?

Thank you very much!

Best,

Maximilian

Load the model

Hello, I have run and saved the model in a .pkl file and I have been trying to load it again.
It says that it is loaded, but I keep getting this error and I don't know how to fix it.

One more thing: if I succeed in loading the model, is it just for displaying the results? Can't I run it again with another dataset, hoping that it would recognize the shapes that it had already clustered before?

Thank you for your help!

Define number of clusters manually

Hello,
Is it possible to define the exact number of clusters up front?
I tried setting the min and max cluster to the same value, but this didn't solve the issue.

Found a bug in "imresize"

When a path is given to fit_transform() as a string, it is not able to read the images because of a try-except control in import_data.
This happens because, if the image already has the same shape as the variable dim, cv2.resize raises an error.
Changing the function imresize as follows makes it work:

import cv2


def imresize(img, dim=(128, 128)):
    """Resize image."""
    # Only resize when a target size is given and the shape differs from it
    if dim is not None and img.shape != dim:
        img = cv2.resize(img, dim, interpolation=cv2.INTER_AREA)
    return img

Extract features using deep learning

Hello! Great library. I'm currently using it to cluster unlabeled images. Clustering using pca or hog did not yield good results, so I extracted features for my images from a pre-trained CNN model that I had, then continued with hierarchical clustering using clustimage. However, I had to work around the code in the library to make it work, mainly because I had to initialize clustimage with pca or hog and then manually replace results['feat'] with the deep learning features. If you're interested in expanding the library to deep learning features, let me know; I'll update the code and open a pull request!

Regards,
Joey

Extract list of files in cluster

Is there a way to generate a list of files belonging to a particular cluster? For example, which images are a part of cluster 3?
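One possible approach, sketched under assumptions: another issue on this page indexes cl.results['labels'] and cl.results['pathnames'], so the example below groups path names by label using a plain dictionary with those two keys. The results dictionary here is a hand-made stand-in rather than real clustimage output, and group_paths_by_label is a hypothetical helper, not part of the library.

```python
from collections import defaultdict


def group_paths_by_label(labels, pathnames):
    """Map each cluster label to the list of path names assigned to it."""
    groups = defaultdict(list)
    for label, path in zip(labels, pathnames):
        groups[int(label)].append(str(path))
    return dict(groups)


# Stand-in for cl.results (assumed keys: 'labels' and 'pathnames').
results = {
    "labels": [0, 3, 1, 3, 0],
    "pathnames": ["a.png", "b.png", "c.png", "d.png", "e.png"],
}

clusters = group_paths_by_label(results["labels"], results["pathnames"])
print(clusters[3])  # prints ['b.png', 'd.png']
```

Because the helper only zips two sequences, it should work the same way whether the labels and path names are lists or NumPy arrays.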

Transfer learning

Hello, is there a way to use the transfer learning technique with this model?
I would like to use it repeatedly with a cumulative dataset

Clustimage.find() and new images

Can I check that I'm using Clustimage.find() correctly? I'm passing it the filename of a new image, not the index of one already contained in the cl object (the docs seem to show an image from X being passed to find()). Was hoping it would show a shorter distance between the new input and the images in the bottom left and bottom right.

Thanks for publishing this library!

zero-size array to reduction operation minimum which has no identity

Hi! Thank you for your work!
I really love using your library, but I'm stuck at the beginning. I don't want to bother you, but can you give me some advice?
I'm trying to cluster jpg images from a Google Drive folder. I've read the documentation and the blog, but I can't get it right.

Here is the snippet from Colab:

!pip install -U clustimage

from google.colab import drive
from clustimage import Clustimage

drive.mount('/content/drive', force_remount=True)

cl = Clustimage(method='pca')

X = cl.import_data('/content/drive/MyDrive/images/immagini/1')

Xfeat = cl.extract_feat(X)
...

It imports the data but throws on feature extraction:
ValueError: zero-size array to reduction operation minimum which has no identity

Best regards, Silvio

Error while running clustimage module when embedding='tsne'

An error occurs while running this module, clustimage. The error refers to the 'embedding' setting when "embedding='tsne'". If I run the code while "embedding='none'", it works fine. The concern is that embedding is very practical for visual purposes and should be used. Any ideas why this error occurs and how to resolve it?

setting

cl = Clustimage(method='pca',
                embedding='tsne',
                grayscale=False,
                dim=(128,128),
                params_pca={'n_components':0.95},
                store_to_disk=True,
                verbose=50)

error

File "/Users/name/opt/anaconda3/lib/python3.8/site-packages/sklearn/manifold/_t_sne.py", line 372, in _gradient_descent
    update = momentum * update - learning_rate * grad

UFuncTypeError: ufunc 'multiply' did not contain a loop with signature matching types (dtype('<U32'), dtype('<U32')) -> dtype('<U32')

embedding references
https://erdogant.github.io/clustimage/pages/html/clustimage.clustimage.html?highlight=embedding#clustimage.clustimage.Clustimage.embedding

Inconsistency in scatter function

Hello again, I think there is a small inconsistency in the scatter function that leads to inconsistent behavior. In my case there are two symptoms: 1. some scatter plots show up and then suddenly disappear, 2. no scatter plot shows up at all. Here is a minimal working example to reproduce them:

import sys
import os
import glob
import numpy as np
from clustimage import Clustimage
import matplotlib.pyplot as plt

cl = Clustimage(method='pca',dirpath=None,embedding='tsne',grayscale=False,dim=(128,128),params_pca={'n_components':0.95})
in_files = input("""Give the absolute path to a directory with your files: \n""")
some_files = glob.glob(in_files)
results = cl.fit_transform(some_files,
                           cluster='agglomerative',
                           evaluate='silhouette',
                           metric='euclidean',
                           linkage='ward',
                           min_clust=3,
                           max_clust=6,
                           cluster_space='high')

cl.clusteval.plot()
cl.clusteval.scatter(cl.results['xycoord'])
cl.pca.plot()
cl.plot_unique(img_mean=False)
cl.plot(cmap='binary')
cl.scatter(zoom=1, img_mean=False)
cl.scatter(zoom=None, dotsize=200, figsize=(25, 15), args_scatter={'fontsize':24, 'gradient':'#FFFFFF', 'cmap':'Set2', 'legend':True})

What I have found: your scatter function returns fig, ax, the tuple produced by plt.subplots. Plotting fig makes the plot show up and then disappear all of a sudden. As a result, some scatter plots cannot be saved.

Workaround:

cl.scatter(zoom=None, dotsize=200, figsize=(25, 15), args_scatter={'fontsize':24, 'gradient':'#FFFFFF', 'cmap':'Set2', 'legend':True})
plt.show()

With plt.show(), the scatter plots wait for you to close the gui.

Best wishes cp

can not find or create dirpath under linux

Hello, and thank you for clustimage! Running a little script on a linux box generates the following error:

$ python3.8 clustimg.py
[clustimage] >ERROR> [None] does not exists or can not be created.
Traceback (most recent call last):
  File "/home/cpsoz/.local/lib/python3.8/site-packages/clustimage/clustimage.py", line 2554, in _set_tempdir
    dirpath = os.path.join(tempfile.tempdir, 'clustimage')
  File "/usr/lib/python3.8/posixpath.py", line 76, in join
    a = os.fspath(a)
TypeError: expected str, bytes or os.PathLike object, not NoneType

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "clustimg.py", line 9, in <module>
    cl = Clustimage(method='pca',dirpath=None,embedding='tsne',grayscale=False,dim=(128,128),params_pca={'n_components':0.5})
  File "/home/cpsoz/.local/lib/python3.8/site-packages/clustimage/clustimage.py", line 208, in __init__
    self.params['dirpath'] = _set_tempdir(dirpath)
  File "/home/cpsoz/.local/lib/python3.8/site-packages/clustimage/clustimage.py", line 2566, in _set_tempdir
    raise Exception(logger.error('[%s] does not exists or can not be created.', dirpath))
Exception: None

Script clustimg.py has the following content:

import sys
import os
import glob
from clustimage import Clustimage
cl = Clustimage(method='pca',dirpath=None,embedding='tsne',grayscale=False,dim=(128,128),params_pca={'n_components':0.5})
in_files = input("""Give the absolute path to a directory with your files: \n""")
some_files = glob.glob(in_files)
print(some_files)
results = cl.fit_transform(some_files,
                           cluster='agglomerative',
                           evaluate='silhouette',
                           metric='euclidean',
                           linkage='ward',
                           min_clust=3,
                           max_clust=8,
                           cluster_space='high')

cl.clusteval.plot()
cl.clusteval.scatter(cl.results['xycoord'])

What I have tried: I can run clustimage if I change line 2554 in /home/cpsoz/.local/lib/python3.8/site-packages/clustimage/clustimage.py like this:

dirpath = os.path.join(tempfile.tempdir, 'clustimage') --> dirpath = os.path.join('/home/cpsoz', 'clustimage')

where /home/cpsoz is my user directory. 'tempfile' should default to the user directory if 'dirpath' is None, but it does not. As 'dirpath' is None per default, a workaround could be to prompt the user to enter his own path as 'dirpath'.

Note: replacing

cl = Clustimage(method='pca',dirpath=None,embedding='tsne',grayscale=False,dim=(128,128),params_pca={'n_components':0.5})

with

cl = Clustimage(method='pca',dirpath='/home/cpsoz',embedding='tsne',grayscale=False,dim=(128,128),params_pca={'n_components':0.5})

is producing the same error as mentioned.
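A note on the TypeError in the traceback: in the standard library, the module attribute tempfile.tempdir is None by default and is only filled in once it is set explicitly or once a tempfile function such as tempfile.gettempdir() has been called, whereas tempfile.gettempdir() itself always returns a string. The sketch below shows how a default working directory could be resolved on that basis. resolve_workdir is a hypothetical helper written for illustration; it is not the library's _set_tempdir.

```python
import os
import tempfile


def resolve_workdir(dirpath=None, name="clustimage"):
    """Return a usable working directory, creating it if necessary.

    Hypothetical helper: when no dirpath is given, fall back to a
    subdirectory of the system temp directory.
    """
    if dirpath is None:
        # gettempdir() returns a str, unlike the tempfile.tempdir attribute,
        # which may still be None at this point.
        dirpath = os.path.join(tempfile.gettempdir(), name)
    os.makedirs(dirpath, exist_ok=True)
    return dirpath


default_dir = resolve_workdir()
custom_dir = resolve_workdir(os.path.join(tempfile.mkdtemp(), "my_clusters"))
```

Passing an explicit directory skips the temp-directory lookup entirely; the helper only guarantees that the returned path exists.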
Best wishes cp

Wrong distance metric used with hash features?

As far as I know, the euclidean distance metric does not make sense for comparing hashes as features. One would use the hamming distance instead.

However, the examples show pHash in combination with euclidean. Is this intended or did I miss something?
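To make the comparison concrete, here is a small standard-library sketch of both metrics on toy 8-bit hashes. It assumes each hash is represented as a vector of 0/1 bits, which may or may not match how clustimage stores hash features. Under that assumption the squared Euclidean distance equals the Hamming distance, so both metrics rank pairs in the same order. That would not hold if hashes were compared as integers or hex strings.

```python
import math


def hamming(a, b):
    """Number of positions at which two equal-length bit vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy 8-bit hashes (made up for illustration).
h1 = [1, 0, 1, 1, 0, 0, 1, 0]
h2 = [1, 1, 1, 0, 0, 0, 1, 1]  # differs from h1 in 3 positions
h3 = [0, 1, 0, 0, 1, 1, 0, 1]  # bitwise complement of h1

for other in (h2, h3):
    print(hamming(h1, other), round(euclidean(h1, other) ** 2, 6))
# prints:
# 3 3.0
# 8 8.0
```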

No files found with import_data on Windows with .tif extension

When using the import_data function on Windows on a folder containing .tif files, it reports zero files. The problem does not occur on Colab or with other extensions.

Load model

Hello, is it not possible to load an already saved model and use it to cluster a new dataset?

libpng error not caught

When a file is corrupted, libpng generates an error and it is not caught.

See for example https://stackoverflow.com/questions/46683264/libpng-error-read-error-by-using-open-cv-imread or https://stackoverflow.com/questions/70598227/libpng-error-read-error-why-cant-this-image-be-processed-by-opencv

Following the advice found in the second link, I rewrote this part (the step "In case of rgb images: make gray images compatible with RGB" may be unnecessary):

# %% Read image
import cv2
import numpy as np
from PIL import Image


def _imread(filepath, colorscale=1):
    """Read image from filepath using colour-scaling.

    Parameters
    ----------
    filepath : str
        path to file.
    colorscale : int, default: 1 (gray)
        colour-scaling from opencv.
        * 0: cv2.IMREAD_GRAYSCALE
        * 1: cv2.IMREAD_COLOR
        * 8: cv2.COLOR_GRAY2RGB

    Returns
    -------
    img : numpy array
        raw rgb or gray image.

    """
    img = None
    # if os.path.isfile(filepath):
    # Read the image
    # img = cv2.imread(, colorscale)
    img = Image.open(filepath)

    # Convert Image to numpy array
    # It's not the most efficient way, but it works. *(link¹)
    img = np.asarray(img)

    # Remove alpha channel if existent
    if len(img.shape) == 3 and img.shape[2] == 4:
        img = img[:, :, :3]

    # Restore RGB colors
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

    # In case of rgb images: make gray images compatible with RGB
    if ((colorscale != 0) and (colorscale != 6)) and (len(img.shape) < 3):
        img = cv2.cvtColor(img, cv2.COLOR_GRAY2RGB)
    # else:
    #     logger.warning('File does not exists: %s', filepath)

    return cv2.cvtColor(img, colorscale)

mini-batch k-means

Hello,
I have been using your model for a while now, and I was thinking about how to use it to create a clustering model that learns with every iteration. One simple idea that occurred to me is to use "mini-batch k-means": instead of using the whole dataset, we use a small batch of data at a time.

Do you think you could update the k-means code to use mini-batch k-means instead?
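To illustrate the idea behind the request, here is a dependency-free sketch of a mini-batch k-means update: each batch is assigned to the nearest centers, and every center then moves toward its assigned points with a step size of 1/count, which makes it a running mean. The class name TinyMiniBatchKMeans, the starting centers, and the toy data are assumptions made for illustration and are unrelated to clustimage's own k-means code. For real workloads, scikit-learn ships sklearn.cluster.MiniBatchKMeans, which has a partial_fit method for the same purpose.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


class TinyMiniBatchKMeans:
    """Minimal mini-batch k-means with per-center learning rates."""

    def __init__(self, init_centers):
        self.centers = [list(c) for c in init_centers]
        self.counts = [0] * len(self.centers)

    def predict_one(self, point):
        """Index of the center closest to the given point."""
        return min(range(len(self.centers)),
                   key=lambda j: sq_dist(point, self.centers[j]))

    def partial_fit(self, batch):
        """Update the centers using one small batch of points."""
        # Assign the whole batch first, using the centers as they are now.
        assignments = [self.predict_one(p) for p in batch]
        for point, j in zip(batch, assignments):
            self.counts[j] += 1
            eta = 1.0 / self.counts[j]  # step size shrinks as a center sees more data
            self.centers[j] = [(1 - eta) * c + eta * x
                               for c, x in zip(self.centers[j], point)]
        return self


rng = random.Random(0)


def make_batch(n):
    """Toy data: two Gaussian blobs around (0, 0) and (10, 10)."""
    return [(rng.gauss(0, 1), rng.gauss(0, 1)) if i % 2 == 0
            else (rng.gauss(10, 1), rng.gauss(10, 1))
            for i in range(n)]


model = TinyMiniBatchKMeans(init_centers=[(1.0, 1.0), (9.0, 9.0)])
for _ in range(20):  # the model only ever sees 50 points at a time
    model.partial_fit(make_batch(50))
```

After the loop, model.centers should sit close to the two blob centers, and predict_one can assign new points without revisiting earlier batches.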

Question on CNN

Is it possible to use a pre-trained neural network (for example, one trained on ImageNet) to extract image features in clustimage?
And do you support using UMAP for dimensionality reduction?