Comments (9)

panwarnaveen9 commented on May 18, 2024

@cgnorthcutt Please find the code to reproduce the error below:

import pandas as pd
from cleanlab.latent_estimation import estimate_py_noise_matrices_and_cv_pred_proba
from sklearn.ensemble import RandomForestClassifier

csv_path = "uci_skin_segmentaion.csv"

# Loading dataset using pandas  
df = pd.read_csv(csv_path)

print("Original data shape", df.shape)

# Build NumPy arrays of the training data and labels
data_x = df.iloc[:,:-1].values
data_y = df.iloc[:,-1].values

print("Data shape", data_x.shape)
print("Label shape", data_y.shape)

# Invoking label noise detection code
est_py, est_nm, est_inv, confident_joint, psx = estimate_py_noise_matrices_and_cv_pred_proba (
            X=data_x,
            s=data_y,
            clf = RandomForestClassifier(n_estimators=1000, max_depth=25, random_state=0)
)

Sorry, I was out of town and couldn't reply earlier.

Regarding the error above, I still don't know the cause. I double-checked the data before passing it to cleanlab, and it has only 2 classes.

print(np.unique(data_y)) # gave me only two classes - [1, 2]

I don't know where cleanlab is picking up a third class; the dataset doesn't have one.

I got the same error when I tried LogisticRegression:

est_py, est_nm, est_inv, confident_joint, psx = estimate_py_noise_matrices_and_cv_pred_proba (
            X=data_x,
            s=data_y,
            clf = LogisticRegression(random_state=0, solver='lbfgs')
)

cgnorthcutt commented on May 18, 2024

Can you do two things for me:

  1. provide minimal code to reproduce this error
  2. print out the predicted probabilities for a random ten examples
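
For the second item, a minimal sketch of one way to sample ten random rows. It assumes `psx` is the (n_examples, n_classes) array of out-of-sample predicted probabilities; the stand-in array and the fixed seed below are made up for illustration only.

```python
import numpy as np

# Stand-in for the real psx array (assumption: shape is n_examples x n_classes).
psx = np.array([[0.9, 0.1], [0.2, 0.8]] * 10)

rng = np.random.default_rng(0)  # fixed seed only so the sample is repeatable

# Pick ten distinct row indices, sampling without replacement.
sample_idx = rng.choice(len(psx), size=10, replace=False)
sample = psx[sample_idx]

print(sample_idx)
print(sample)
```

Printing the indices alongside the rows makes it easier to look up the matching labels afterwards.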

panwarnaveen9 commented on May 18, 2024

@cgnorthcutt Please find the code to reproduce the error below:

import pandas as pd
from cleanlab.latent_estimation import estimate_py_noise_matrices_and_cv_pred_proba
from sklearn.ensemble import RandomForestClassifier

csv_path = "uci_skin_segmentaion.csv"

# Loading dataset using pandas  
df = pd.read_csv(csv_path)

print("Original data shape", df.shape)

# Build NumPy arrays of the training data and labels
data_x = df.iloc[:,:-1].values
data_y = df.iloc[:,-1].values

print("Data shape", data_x.shape)
print("Label shape", data_y.shape)

# Invoking label noise detection code
est_py, est_nm, est_inv, confident_joint, psx = estimate_py_noise_matrices_and_cv_pred_proba (
            X=data_x,
            s=data_y,
            clf = RandomForestClassifier(n_estimators=1000, max_depth=25, random_state=0)
)

cgnorthcutt commented on May 18, 2024

Thanks! Can you also print out psx for ten random rows (#2 in my original comment)?

panwarnaveen9 commented on May 18, 2024

@cgnorthcutt Please find the output of psx below:

[[1.    0.   ]
 [1.    0.   ]
 [1.    0.   ]
 [0.999 0.001]
 [0.999 0.001]
 [0.918 0.082]
 [0.999 0.001]
 [1.    0.   ]
 [1.    0.   ]
 [1.    0.   ]
 [1.    0.   ]
 [1.    0.   ]
 [1.    0.   ]
 [1.    0.   ]
 [1.    0.   ]]

The code I used to print psx is below:

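from sklearn.model_selection import StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from cleanlab.util import value_counts  # used below; module path as in the cleanlab version of this thread
import numpy as np
import copy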

s = data_y
X = data_x
clf = RandomForestClassifier(n_estimators=1000, max_depth=25, random_state=0)

# Number of classes
K = len(np.unique(s))

# 'ps' is p(s=k)
ps = value_counts(s) / float(len(s))

cv_n_folds=5
# Create cross-validation object for out-of-sample predicted probabilities.
# CV folds preserve the fraction of noisy positive and
# noisy negative examples in each class.
kf = StratifiedKFold(n_splits=cv_n_folds, shuffle=True, random_state=None)

# Initialize psx array
psx = np.zeros((len(s), K))

# Split X and s into "cv_n_folds" stratified folds.
for k, (cv_train_idx, cv_holdout_idx) in enumerate(kf.split(X, s)):

    clf_copy = copy.deepcopy(clf)

    # Select the training and holdout cross-validated sets.
    X_train_cv, X_holdout_cv = X[cv_train_idx], X[cv_holdout_idx]
    s_train_cv, s_holdout_cv = s[cv_train_idx], s[cv_holdout_idx]

    # Fit the clf classifier to the training set and
    # predict on the holdout set and update psx.
    clf_copy.fit(X_train_cv, s_train_cv)
    psx_cv = clf_copy.predict_proba(X_holdout_cv) # P(s = k|x) # [:,1]
    psx[cv_holdout_idx] = psx_cv

# Printing psx 
print(psx[:15])

cgnorthcutt commented on May 18, 2024

Right, so as you can see, the classifier is almost completely confident about every single example. There is no uncertainty here at all and therefore no label errors, at least in what you are showing. Try using logistic regression.

By the way, the number of classes in psx is 2, but something you're inputting in your first post has 3 unique classes, hence the mismatch.
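
A quick way to spot this kind of mismatch, as a sketch only: `check_class_counts` is a hypothetical helper, not part of cleanlab, and it assumes label values are meant to index the columns of psx directly. The arrays are stand-ins.

```python
import numpy as np

def check_class_counts(s, psx):
    """Hypothetical helper: compare the label values in s with the columns of psx."""
    labels = np.unique(s)
    n_cols = psx.shape[1]
    # Assumption: labels index psx columns, so they should be exactly 0..n_cols-1.
    consistent = np.array_equal(labels, np.arange(n_cols))
    return {"labels": labels.tolist(), "n_cols": n_cols, "consistent": consistent}

# Stand-in data: two classes labeled 1 and 2, two probability columns.
s = np.array([1, 2, 2, 1])
psx = np.array([[1.0, 0.0], [0.1, 0.9], [0.0, 1.0], [0.9, 0.1]])

report = check_class_counts(s, psx)
print(report)
```

If `consistent` comes back false, the label values and the psx columns do not line up, even when the number of unique labels equals the number of columns.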

Neither of these issues is a cleanlab error, so I'll leave this open for a day or two and then close it.

cgnorthcutt commented on May 18, 2024

@panwarnaveen9 Any final questions about this? I will close it, as neither of your issues is a cleanlab error.

cgnorthcutt commented on May 18, 2024

Hi @panwarnaveen9. The fix to your problem turned out to be very simple. Just add this:

data_y -= 1

The problem was your labels were not zero-indexed. cleanlab assumes labels start with label 0, but your labels were 1 and 2. I'll add a warning about this.
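
If the labels are not simply offset by one, a more general remapping is possible. This is a minimal sketch using `np.unique` with `return_inverse=True`; the `data_y` array here is a stand-in, and `classes`, `zero_indexed`, and `restored` are illustrative names.

```python
import numpy as np

data_y = np.array([1, 2, 2, 1, 2])  # stand-in labels, not the real dataset

# classes holds the sorted original label values; zero_indexed holds, for each
# example, the position of its label within classes, so values run 0..K-1.
classes, zero_indexed = np.unique(data_y, return_inverse=True)
print(classes, zero_indexed)

# Indexing classes with the remapped labels recovers the original values.
restored = classes[zero_indexed]
print(restored)
```

Keeping `classes` around lets you translate any results back to the original label values afterwards.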

cgnorthcutt commented on May 18, 2024

P.S. If you re-clone cleanlab at the latest commit and run your code now, you'll get this error:

TypeError: cleanlab requires zero-indexed labels (0,1,2,..,m-1), but in your case: np.unique(s) = [1 2]
