
HongxinXiang commented on August 28, 2024

Hi, Soo

Thank you for your interest in our paper.

In some cases, AUC and accuracy can conflict. Accuracy is computed at a single default cutoff (such as 0.5), while AUC is computed over all possible cutoffs, which makes it more robust.
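To make the conflict concrete, here is a small synthetic sketch (the numbers are invented, not from this thread): with 1% positives, a model whose scores never cross the default 0.5 cutoff predicts all-negative and still gets high accuracy, while its AUC reflects the actual ranking quality.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

rng = np.random.default_rng(0)
y_true = np.zeros(1000, dtype=int)
y_true[:10] = 1  # 1% positive class

# Scores that rank positives higher on average, but all stay below 0.5
y_score = rng.uniform(0.0, 0.3, size=1000)
y_score[:10] += 0.15

y_pred = (y_score > 0.5).astype(int)   # default cutoff: predicts all negative
print(accuracy_score(y_true, y_pred))  # 0.99 despite detecting no positives
print(roc_auc_score(y_true, y_score))  # AUC shows the ranking quality instead
```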

I am not sure whether your data has class imbalance. If so, you can try increasing the weight of the minority class.

You can also send me the predicted probabilities and the corresponding ground truth as pickle or NumPy files so that I can analyze them further.

from imagemol.

SoodabehGhaffari commented on August 28, 2024

Thank you for the prompt reply.
Yes, our training and test data are highly imbalanced (around 1% positive). Could you guide me on how to increase the weight of the minority class, and what value should I set?

Here are the ground truth and the predicted probabilities for the test data:

df_pro_imagemol_classification.csv
df_test.csv.txt

Also, do you have any idea why the predicted values of the regression model on the test data are all the same? Here are the predicted values for the regression model:

df_scores.csv

I really appreciate your help.

Best Regards
Soo


HongxinXiang commented on August 28, 2024

Hi, Soo

The following figure is the ROC curve I plotted from the provided df_pro_imagemol_classification.csv and df_test.csv.txt:

[figure: ROC curve of the provided predictions]
Given that your data is imbalanced, there are two approaches to consider:

  1. Find the best classification threshold from the ROC curve:

import numpy as np
from sklearn.metrics import roc_curve

def find_optimal_cutoff(tpr, fpr, threshold):
    # Youden's J statistic: pick the threshold that maximizes TPR - FPR
    y = tpr - fpr
    index = np.argmax(y)
    optimal_threshold = threshold[index]
    point = [fpr[index], tpr[index]]
    return optimal_threshold, point

fpr, tpr, threshold = roc_curve(y_true, y_pro)  # y_true, y_pro loaded from the CSVs
find_optimal_cutoff(tpr, fpr, threshold)
# my output is (0.001, [0.22790055248618785, 0.5833333333333334])

So I use 0.001 as the classification threshold:

a = (np.array(y_pro) > 0.001).astype(int)  # scores above the cutoff -> positive
(a == y_true).sum() / len(y_true)          # len(y_true) == 2208 here

This gives an accuracy of 0.7685688405797102.

  2. Add the pos_weight parameter to BCEWithLogitsLoss. You can check the docs; it is easy to get started. Since the minority class is only 1%, you might try setting the minority-class weight to 100 and the majority-class weight to 1.
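As a minimal sketch of that suggestion (the logits and labels below are invented; the 100:1 ratio comes from the comment above), pos_weight multiplies the loss contribution of positive examples only:

```python
import torch
import torch.nn as nn

# With ~1% positives, weight positive examples ~100x in the loss.
pos_weight = torch.tensor([100.0])
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

logits = torch.tensor([[0.2], [-1.5], [0.8]])  # raw model outputs (pre-sigmoid)
labels = torch.tensor([[1.0], [0.0], [0.0]])

loss = criterion(logits, labels)
print(loss.item())  # the positive example dominates the averaged loss
```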

Anyway, with extremely imbalanced samples, I recommend reporting AUC, since it is a more comprehensive metric and better suited to imbalanced data.

In addition, I'm not sure why the predicted values of the regression model on the test data are all the same. My guess is that your regression labels have a large range, causing the model to collapse during training. I suggest applying some normalization to the labels.


SoodabehGhaffari commented on August 28, 2024

Thank you for the detailed response. I really appreciate your help.

  1. Does it make sense to use such a small classification threshold as 0.001?

  2. I checked BCEWithLogitsLoss in ImageMol. Is it correct to change the code as follows:

    weights = None
    if args.task_type == "classification":
        if args.weighted_CE:
            labels_train_list = labels_train[labels_train != -1].flatten().tolist()
            count_labels_train = Counter(labels_train_list)
            imbalance_weight = {key: 1 - count_labels_train[key] / len(labels_train_list)
                                for key in count_labels_train.keys()}
            weights = np.array(sorted(imbalance_weight.items(), key=lambda x: x[0]),
                               dtype="float")[:, 1]

            num_positives = count_labels_train[1]  # assuming 1 is the positive class
            num_negatives = count_labels_train[0]  # assuming 0 is the negative class

            # up-weight the minority (positive) class: negatives / positives
            ratio_neg_pos = num_negatives / num_positives if num_positives != 0 else 1

            pos_weight = torch.tensor([ratio_neg_pos])

            criterion = nn.BCEWithLogitsLoss(reduction="none", pos_weight=pos_weight)
    
  3. Regarding the regression model: the training data for classification and regression are the same, except that the continuous labels were converted to binary for classification. Since the classification data is imbalanced, I am sure the labels have a gap; most labels are between zero and 10%. Do you have any suggestions for normalizing the labels?

Thank you
Best Regards
Soo


HongxinXiang commented on August 28, 2024

Sorry for the late reply.

  1. This is an open question and hard to answer definitively. In my experience, I would not focus on accuracy with imbalanced data, because it is not very meaningful; I would focus on AUC, since it does not depend on a threshold.
  2. It is correct.
  3. You can try StandardScaler from the scikit-learn library.
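A small sketch of that suggestion (the label values below are invented): fit the scaler on the training labels only, train on the scaled labels, then invert the transform on the model's predictions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical regression labels, mostly in the 0-10% range
y_train = np.array([0.01, 0.03, 0.05, 0.10, 0.02]).reshape(-1, 1)

scaler = StandardScaler()
y_train_scaled = scaler.fit_transform(y_train)  # zero mean, unit variance

# Train on y_train_scaled, then map predictions back to the original scale:
preds_scaled = y_train_scaled                    # placeholder for model outputs
preds = scaler.inverse_transform(preds_scaled)   # back to the original label scale
```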


SoodabehGhaffari commented on August 28, 2024

Hello,
I wanted to give you an update: I tried using positive weights for the minority class in the loss function as we discussed, but the issue persists; the predicted probabilities are still too small. Do you have any other suggestions, or do you think there is a way to fix this?

Thanks a lot
Best Regards
Soo

