Comments (9)
Sorry! I was out of town, so I couldn't reply earlier.
Regarding the above error, I still don't know the cause. I double-checked the data before sending it to cleanlab. It has only 2 classes.
print(np.unique(data_y)) # gave me only two classes - [1, 2]
I don't know where cleanlab is picking up a third class. The dataset doesn't have one.
Even when I tried LogisticRegression, I got the same error:
est_py, est_nm, est_inv, confident_joint, psx = estimate_py_noise_matrices_and_cv_pred_proba(
    X=data_x,
    s=data_y,
    clf=LogisticRegression(random_state=0, solver='lbfgs'),
)
Can you do two things for me:
- provide minimal code to reproduce this error
- print out the predicted probabilities for a random ten examples
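For the second request, a minimal sketch of drawing ten random rows, assuming psx is the (n_examples, n_classes) array of out-of-sample predicted probabilities. The helper name sample_rows, the fixed seed, and the stand-in demo_psx array are illustrative assumptions, not part of cleanlab.

```python
import numpy as np


def sample_rows(psx, n=10, seed=0):
    """Return (indices, rows) for n rows of psx drawn without replacement."""
    rng = np.random.default_rng(seed)
    n = min(n, len(psx))  # guard against arrays shorter than n
    idx = np.sort(rng.choice(len(psx), size=n, replace=False))
    return idx, psx[idx]


# Stand-in probabilities (hypothetical data); replace demo_psx with the real psx.
demo_psx = np.random.default_rng(1).dirichlet([1.0, 1.0], size=100)
idx, rows = sample_rows(demo_psx)
print(idx)   # which examples were drawn
print(rows)  # their predicted probabilities
```

Printing the sampled indices alongside the rows makes it possible to look up the matching labels in s for the same examples.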
@cgnorthcutt Please find attached the code to reproduce the error:
import pandas as pd
from cleanlab.latent_estimation import estimate_py_noise_matrices_and_cv_pred_proba
from sklearn.ensemble import RandomForestClassifier
csv_path = "uci_skin_segmentaion.csv"
# Load the dataset using pandas
df = pd.read_csv(csv_path)
print("Original data shape", df.shape)
# Build numpy arrays of the training data (all but the last column) and labels (last column)
data_x = df.iloc[:, :-1].values
data_y = df.iloc[:, -1].values
print("Data shape", data_x.shape)
print("Label shape", data_y.shape)
# Invoke the label noise detection code
est_py, est_nm, est_inv, confident_joint, psx = estimate_py_noise_matrices_and_cv_pred_proba(
    X=data_x,
    s=data_y,
    clf=RandomForestClassifier(n_estimators=1000, max_depth=25, random_state=0),
)
Thanks! Also, can you print out psx for ten random rows (#2 in my original comment)?
@cgnorthcutt Please find the output of psx:
[[1.    0.   ]
 [1.    0.   ]
 [1.    0.   ]
 [0.999 0.001]
 [0.999 0.001]
 [0.918 0.082]
 [0.999 0.001]
 [1.    0.   ]
 [1.    0.   ]
 [1.    0.   ]
 [1.    0.   ]
 [1.    0.   ]
 [1.    0.   ]
 [1.    0.   ]
 [1.    0.   ]]
Also, please find attached the code used to print psx:
from sklearn.model_selection import StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
# value_counts is a cleanlab helper; importing it from cleanlab.util is assumed here
from cleanlab.util import value_counts
import numpy as np
import copy
s = data_y
X = data_x
clf = RandomForestClassifier(n_estimators=1000, max_depth=25, random_state=0)
# Number of classes
K = len(np.unique(s))
# 'ps' is p(s=k)
ps = value_counts(s) / float(len(s))
cv_n_folds = 5
# Create cross-validation object for out-of-sample predicted probabilities.
# CV folds preserve the fraction of noisy positive and
# noisy negative examples in each class.
kf = StratifiedKFold(n_splits=cv_n_folds, shuffle=True, random_state=None)
# Initialize psx array
psx = np.zeros((len(s), K))
# Split X and s into "cv_n_folds" stratified folds.
for k, (cv_train_idx, cv_holdout_idx) in enumerate(kf.split(X, s)):
    clf_copy = copy.deepcopy(clf)
    # Select the training and holdout cross-validated sets.
    X_train_cv, X_holdout_cv = X[cv_train_idx], X[cv_holdout_idx]
    s_train_cv, s_holdout_cv = s[cv_train_idx], s[cv_holdout_idx]
    # Fit the clf classifier to the training set,
    # predict on the holdout set, and update psx.
    clf_copy.fit(X_train_cv, s_train_cv)
    psx_cv = clf_copy.predict_proba(X_holdout_cv)  # P(s = k|x)
    psx[cv_holdout_idx] = psx_cv
# Print the first 15 rows of psx
print(psx[:15])
Right, so as you can see, the classifier is basically completely confident about every single example. There is no real uncertainty here and therefore no label errors, at least in what you are showing. Try using logistic regression.
By the way, the number of classes in psx is 2, but something you're inputting in your first post has 3 unique classes, hence the mismatch.
Neither of these issues is a cleanlab error, so I'll leave this open for a day or two and then close it.
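One way to narrow down a class-count mismatch like this is to compare the label values against the number of columns in psx. The sketch below is a diagnostic under the assumption that each label is used as a column index into psx; check_labels_against_psx and the demo arrays are hypothetical and not part of cleanlab.

```python
import numpy as np


def check_labels_against_psx(s, psx):
    """Return (unique labels, number of psx columns, whether every label is a valid column index)."""
    labels = np.unique(s)
    n_cols = psx.shape[1]
    ok = bool(labels.min() >= 0 and labels.max() < n_cols)
    return labels, n_cols, ok


# Hypothetical data: two classes labelled 1 and 2, and a two-column psx.
s_demo = np.array([1, 2, 2, 1])
psx_demo = np.array([[0.9, 0.1], [0.2, 0.8], [0.1, 0.9], [0.7, 0.3]])

labels, n_cols, ok = check_labels_against_psx(s_demo, psx_demo)
print("labels:", labels, "psx columns:", n_cols, "consistent:", ok)
```

With these demo arrays the check fails because label 2 has no matching column in a two-column psx, even though only two distinct labels are present.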
@panwarnaveen9, any final questions about this? I will close this, as neither of your issues is a cleanlab error.
Hi @panwarnaveen9, the fix to your problem turned out to be very simple. Just add this:
data_y -= 1
The problem was that your labels were not zero-indexed. cleanlab assumes labels start at 0, but your labels were 1 and 2. I'll add a warning about this.
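Subtracting 1 works here because the labels are the consecutive integers 1 and 2. As a more general sketch for labels that have gaps or do not start at 1, np.unique with return_inverse=True maps any label set onto 0..K-1. The helper name to_zero_indexed and the demo array are illustrative assumptions, not part of cleanlab.

```python
import numpy as np


def to_zero_indexed(labels):
    """Map arbitrary 1-D labels onto 0..K-1; also return the original class values."""
    classes, zero_indexed = np.unique(np.asarray(labels), return_inverse=True)
    return zero_indexed, classes


# Hypothetical labels resembling the dataset in this thread (classes 1 and 2).
data_y_demo = np.array([1, 2, 2, 1, 1])
s, classes = to_zero_indexed(data_y_demo)
print(s)           # zero-indexed labels to pass as s
print(classes[s])  # original labels, recovered through the mapping
```

Keeping the classes array around lets any label issues found later be reported in terms of the original label values.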
P.S. Now, if you re-clone cleanlab at the latest commit and run your code, you'll get this error:
TypeError: cleanlab requires zero-indexed labels (0,1,2,..,m-1), but in your case: np.unique(s) = [1 2]