
HongxinXiang commented on August 28, 2024

Hi, Soo

Thank you for your interest in our paper.

In some cases, AUC and accuracy can conflict. Accuracy is computed at a single default cutoff (such as 0.5), while AUC is computed over all possible cutoffs, which makes it more robust.
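To make the conflict concrete, here is a small synthetic sketch (the numbers are invented, not from this thread): with 1% positives, a model whose scores never cross the default 0.5 cutoff predicts all-negative and still gets high accuracy, while its AUC reflects the actual ranking quality.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

rng = np.random.default_rng(0)
y_true = np.zeros(1000, dtype=int)
y_true[:10] = 1  # 1% positive class

# Scores that rank positives higher on average, but all stay below 0.5
y_score = rng.uniform(0.0, 0.3, size=1000)
y_score[:10] += 0.15

y_pred = (y_score > 0.5).astype(int)   # default cutoff: predicts all negative
print(accuracy_score(y_true, y_pred))  # 0.99 despite detecting no positives
print(roc_auc_score(y_true, y_score))  # AUC shows the ranking quality instead
```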

I am not sure whether your data has class imbalance. If so, you can try increasing the weight of the minority class.

You can also send me the predicted probabilities and the corresponding ground truth as pickle or NumPy files so that I can analyze them further.

from imagemol.

SoodabehGhaffari commented on August 28, 2024

Thank you for the prompt reply.
Yes, our training and test data are highly imbalanced (around 1% positive). Could you guide me on how to increase the weight of the minority class, and what value should I set?

Here are the ground truth and the predicted probabilities for the test data:

df_pro_imagemol_classification.csv
df_test.csv.txt

Also, do you have any idea why the predicted values of the regression model on the test data are all the same? Here are the predicted values for the regression model:

df_scores.csv

I really appreciate your help.

Best Regards
Soo


HongxinXiang commented on August 28, 2024

Hi, Soo

The following figure is the ROC curve I plotted from the provided df_pro_imagemol_classification.csv and df_test.csv.txt:

[figure: ROC curve of the provided predictions]
Given that your data is imbalanced, there are two approaches to consider:

  1. Find the best classification threshold from the ROC curve:

import numpy as np
from sklearn.metrics import roc_curve

def find_optimal_cutoff(tpr, fpr, threshold):
    # Youden's J statistic: pick the threshold that maximizes TPR - FPR
    y = tpr - fpr
    index = np.argmax(y)
    optimal_threshold = threshold[index]
    point = [fpr[index], tpr[index]]
    return optimal_threshold, point

fpr, tpr, threshold = roc_curve(y_true, y_pro)  # y_true, y_pro loaded from the CSVs
find_optimal_cutoff(tpr, fpr, threshold)
# my output is (0.001, [0.22790055248618785, 0.5833333333333334])

So I use 0.001 as the classification threshold:

a = (np.array(y_pro) > 0.001).astype(int)  # scores above the cutoff -> positive
(a == y_true).sum() / len(y_true)          # len(y_true) == 2208 here

This gives an accuracy of 0.7685688405797102.

  2. Add the pos_weight parameter to BCEWithLogitsLoss. You can check the docs; it is easy to get started. Since the minority class is only 1%, you might try setting the minority-class weight to 100 and the majority-class weight to 1.
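As a minimal sketch of that suggestion (the logits and labels below are invented; the 100:1 ratio comes from the comment above), pos_weight multiplies the loss contribution of positive examples only:

```python
import torch
import torch.nn as nn

# With ~1% positives, weight positive examples ~100x in the loss.
pos_weight = torch.tensor([100.0])
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

logits = torch.tensor([[0.2], [-1.5], [0.8]])  # raw model outputs (pre-sigmoid)
labels = torch.tensor([[1.0], [0.0], [0.0]])

loss = criterion(logits, labels)
print(loss.item())  # the positive example dominates the averaged loss
```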

Anyway, with extremely imbalanced samples, I recommend reporting AUC, since it is a more comprehensive metric and better suited to imbalanced data.

In addition, I'm not sure why the predicted values of the regression model on the test data are all the same. My guess is that your regression labels have a large range, causing the model to collapse during training. I suggest applying some normalization to the labels.


SoodabehGhaffari commented on August 28, 2024

Thank you for the detailed response. I really appreciate your help.

  1. Does it make sense to use such a small classification threshold as 0.001?

  2. I checked BCEWithLogitsLoss in ImageMol. Is it correct to change the code as follows:

    weights = None
    if args.task_type == "classification":
        if args.weighted_CE:
            labels_train_list = labels_train[labels_train != -1].flatten().tolist()
            count_labels_train = Counter(labels_train_list)
            imbalance_weight = {key: 1 - count_labels_train[key] / len(labels_train_list)
                                for key in count_labels_train.keys()}
            weights = np.array(sorted(imbalance_weight.items(), key=lambda x: x[0]),
                               dtype="float")[:, 1]

            num_positives = count_labels_train[1]  # assuming 1 is the positive class
            num_negatives = count_labels_train[0]  # assuming 0 is the negative class

            # up-weight the minority (positive) class: negatives / positives
            ratio_neg_pos = num_negatives / num_positives if num_positives != 0 else 1

            pos_weight = torch.tensor([ratio_neg_pos])

            criterion = nn.BCEWithLogitsLoss(reduction="none", pos_weight=pos_weight)
    
  3. Regarding the regression model: the training data for classification and regression are the same, except that the continuous labels were converted to binary for classification. Since the classification data is imbalanced, I am sure the labels have a gap; most labels are between zero and 10%. Do you have any suggestions for normalizing the labels?

Thank you
Best Regards
Soo


HongxinXiang commented on August 28, 2024

Sorry for the late reply.

  1. This is an open question and hard to answer definitively. In my experience, I would not focus on accuracy with imbalanced data, because it is not very meaningful; I would focus on AUC, since it does not depend on a threshold.
  2. It is correct.
  3. You can try StandardScaler from the scikit-learn library.
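A small sketch of that suggestion (the label values below are invented): fit the scaler on the training labels only, train on the scaled labels, then invert the transform on the model's predictions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical regression labels, mostly in the 0-10% range
y_train = np.array([0.01, 0.03, 0.05, 0.10, 0.02]).reshape(-1, 1)

scaler = StandardScaler()
y_train_scaled = scaler.fit_transform(y_train)  # zero mean, unit variance

# Train on y_train_scaled, then map predictions back to the original scale:
preds_scaled = y_train_scaled                    # placeholder for model outputs
preds = scaler.inverse_transform(preds_scaled)   # back to the original label scale
```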


SoodabehGhaffari commented on August 28, 2024

Hello,
I wanted to give you an update: I tried using positive weights for the minority class in the loss function as we discussed, but the issue persists; the predicted probabilities are still too small. Do you have any other suggestions, or do you think there is a way to fix this?

Thanks a lot
Best Regards
Soo

