Dataset - UCI skin segmentation (https://www.openml.org/d/1502)

ValueError: operands could not be broadcast together with shapes about cleanlab HOT 9 CLOSED

panwarnaveen9 commented on May 18, 2024

ValueError: operands could not be broadcast together with shapes

Comments (9)

panwarnaveen9 commented on May 18, 2024

@cgnorthcutt Please find the code to reproduce the error below

import pandas as pd
from cleanlab.latent_estimation import estimate_py_noise_matrices_and_cv_pred_proba
from sklearn.ensemble import RandomForestClassifier

csv_path = "uci_skin_segmentaion.csv"

# Loading dataset using pandas  
df = pd.read_csv(csv_path)

print("Original data shape", df.shape)

# Making numpy arrays of training data and labels
data_x = df.iloc[:,:-1].values
data_y = df.iloc[:,-1].values

print("Data shape", data_x.shape)
print("Label shape", data_y.shape)

# Invoking label noise detection code
est_py, est_nm, est_inv, confident_joint, psx = estimate_py_noise_matrices_and_cv_pred_proba (
            X=data_x,
            s=data_y,
            clf = RandomForestClassifier(n_estimators=1000, max_depth=25, random_state=0)
)

Sorry! I was out of town so I couldn't reply earlier.

Regarding the above error, I still don't know the reason. I double-checked the data before sending it to cleanlab. It has only 2 classes.

print(np.unique(data_y)) # gave me only two classes - [1, 2]

I don't know where cleanlab is picking up a third class. The dataset doesn't have one.

Even when I tried with LogisticRegression, I got the same error

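from sklearn.linear_model import LogisticRegression

# Same call as before, with only the classifier swapped
est_py, est_nm, est_inv, confident_joint, psx = estimate_py_noise_matrices_and_cv_pred_proba(
            X=data_x,
            s=data_y,
            clf = LogisticRegression(random_state=0, solver='lbfgs')
)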

cgnorthcutt commented on May 18, 2024

Can you do two things for me:

1. provide minimal code to reproduce this error
2. print out the predicted probabilities for a random ten examples
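
For the second request, a minimal sketch of sampling ten random rows from a probability array might look like the following. The `psx` array here is a made-up stand-in (an assumption for illustration); in practice it would be the out-of-sample predicted probabilities from cross-validation.

```python
import numpy as np

# Stand-in for the (n_examples, n_classes) array of out-of-sample
# predicted probabilities; real values would come from cross-validation.
rng = np.random.default_rng(0)
raw = rng.random((100, 2))
psx = raw / raw.sum(axis=1, keepdims=True)  # each row sums to 1

# Pick ten distinct row indices at random and print those rows.
rows = rng.choice(len(psx), size=10, replace=False)
sample = psx[rows]
print(np.round(sample, 3))
```

Sampling without replacement avoids printing the same example twice, and random rows are less likely than the first few rows to all come from one class when the file is sorted by label.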

cgnorthcutt commented on May 18, 2024

Thanks! Also, can you print out psx for a random ten rows (#2 in original comment)?

panwarnaveen9 commented on May 18, 2024

@cgnorthcutt Please find the output of psx below

[[1.    0.   ]
 [1.    0.   ]
 [1.    0.   ]
 [0.999 0.001]
 [0.999 0.001]
 [0.918 0.082]
 [0.999 0.001]
 [1.    0.   ]
 [1.    0.   ]
 [1.    0.   ]
 [1.    0.   ]
 [1.    0.   ]
 [1.    0.   ]
 [1.    0.   ]
 [1.    0.   ]]

Also, please find the code used to print psx below

from sklearn.model_selection import StratifiedKFold
import numpy as np
import copy

s = data_y
X = data_x
clf = RandomForestClassifier(n_estimators=1000, max_depth=25, random_state=0)

# Number of classes
K = len(np.unique(s))

# 'ps' is p(s=k); per-class counts are taken directly with numpy
ps = np.unique(s, return_counts=True)[1] / float(len(s))

cv_n_folds=5
# Create cross-validation object for out-of-sample predicted probabilities.
# CV folds preserve the fraction of noisy positive and
# noisy negative examples in each class.
kf = StratifiedKFold(n_splits=cv_n_folds, shuffle=True, random_state=None)

# Initialize psx array
psx = np.zeros((len(s), K))

# Split X and s into "cv_n_folds" stratified folds.
for k, (cv_train_idx, cv_holdout_idx) in enumerate(kf.split(X, s)):

    clf_copy = copy.deepcopy(clf)

    # Select the training and holdout cross-validated sets.
    X_train_cv, X_holdout_cv = X[cv_train_idx], X[cv_holdout_idx]
    s_train_cv, s_holdout_cv = s[cv_train_idx], s[cv_holdout_idx]

    # Fit the clf classifier to the training set and
    # predict on the holdout set and update psx.
    clf_copy.fit(X_train_cv, s_train_cv)
    psx_cv = clf_copy.predict_proba(X_holdout_cv) # P(s = k|x) # [:,1]
    psx[cv_holdout_idx] = psx_cv

# Printing psx 
print(psx[:15])

cgnorthcutt commented on May 18, 2024

Right, so as you can see, the classifier is basically fully confident about every single example. There is essentially no uncertainty here and therefore no label errors, at least in what you are showing. Try using logistic regression.

By the way, the number of classes in psx is 2, but something you're inputting in your first post has 3 unique classes, hence the mismatch.
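
As an illustration of how such a mismatch can arise, here is a small sketch with toy data. It assumes the class count is derived by index-based counting (`np.bincount`); that is an assumption about the mechanism, not a description of cleanlab's internals.

```python
import numpy as np

s = np.array([1, 2, 2, 1, 2])       # toy labels that start at 1
psx = np.array([[0.9, 0.1],
                [0.2, 0.8],
                [0.1, 0.9],
                [0.7, 0.3],
                [0.4, 0.6]])         # two probability columns

# np.bincount counts every integer from 0 up to max(s), so it reports
# three slots (0, 1, 2) even though only two distinct labels occur.
counts = np.bincount(s)
print(counts.shape, psx.shape[1])    # (3,) 2

try:
    psx * counts                     # (5, 2) with (3,) cannot broadcast
except ValueError as err:
    message = str(err)
    print("ValueError:", message)
```

Any code that treats a label value as an array index will see an implied class 0 when labels start at 1, which is enough to produce three "classes" against two probability columns.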

Neither of these issues is a cleanlab error, so I'll leave this open a day or two and then close.

cgnorthcutt commented on May 18, 2024

@panwarnaveen9 -- any final questions about this? I will close as neither of your issues is a cleanlab error.

cgnorthcutt commented on May 18, 2024

Hi @panwarnaveen9, the fix to your problem turned out to be very simple. Just add this:

data_y -= 1

The problem was your labels were not zero-indexed. cleanlab assumes labels start with label 0, but your labels were 1 and 2. I'll add a warning about this.
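
Subtracting 1 works here because the labels are exactly 1 and 2. For labels that are not contiguous or do not start at 1, a more general remapping can be sketched with `np.unique`. The toy `data_y` below is an assumption for illustration, not the real dataset.

```python
import numpy as np

data_y = np.array([1, 2, 2, 1, 2])   # toy labels starting at 1

# return_inverse gives, for each label, its position in the sorted
# list of unique values, which is a zero-indexed relabelling.
classes, data_y_zero = np.unique(data_y, return_inverse=True)

print(classes)       # [1 2]
print(data_y_zero)   # [0 1 1 0 1]
```

Keeping `classes` around lets you map predictions back to the original labels with `classes[data_y_zero]`.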

cgnorthcutt commented on May 18, 2024

P.S. Now if you re-clone cleanlab with the latest commit and run your code, you'll get this error:

TypeError: cleanlab requires zero-indexed labels (0,1,2,..,m-1), but in your case: np.unique(s) = [1 2]
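
A guard of this kind can be sketched as below. `check_zero_indexed` is a hypothetical helper written for illustration and is not cleanlab's actual implementation; only the idea of rejecting labels that are not 0..m-1 comes from the message above.

```python
import numpy as np

def check_zero_indexed(s):
    """Hypothetical helper: raise if labels are not exactly 0..m-1."""
    unique = np.unique(s)
    expected = np.arange(len(unique))
    if not np.array_equal(unique, expected):
        raise TypeError(
            "labels must be zero-indexed (0,1,2,..,m-1), "
            "but np.unique(s) = {}".format(unique)
        )

check_zero_indexed(np.array([0, 1, 1, 0]))      # passes silently

try:
    check_zero_indexed(np.array([1, 2, 2, 1]))
except TypeError as err:
    caught = str(err)
    print(caught)
```

Running a check like this before fitting turns the opaque broadcasting failure into a message that names the actual cause.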
