
smogn's Introduction

About Me

Interested in the intersection of analytics, statistics, and machine learning methods in urbanism, public health, and engineering. Previous military and civil service with a deep commitment to the public good. Inactive musician with previous work in various roles on tours across North America.


smogn's People

Contributors

arnoldysyeung, nickkunz, riledigital



smogn's Issues

Documentation on the relevance value matrix

I've been reading through the docs and code to understand the meaning of these arrays, but I still can't find the answer:

  • Do these arrays mean that we'll over-sample the value 35000 itself and under-sample the other three values?

  • Or will we over-sample a certain range around this value, for example [35000 - X, 35000 + X]?

  • Or will we over-sample the range 35000 to 125000 and under-sample the range 125000 to 250000?

Here are the arrays I'm talking about:

## specify phi relevance values
rg_mtrx = [
    [35000,  1, 0],  ## over-sample ("minority")
    [125000, 0, 0],  ## under-sample ("majority")
    [200000, 0, 0],  ## under-sample
    [250000, 0, 0],  ## under-sample
]
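For reference, a hedged reading of this format: each row appears to be [y value, relevance, derivative], and the relevance of any other y value is interpolated between the control points, so entire ranges become "minority" or "majority" rather than single values. A minimal sketch of that interpretation, using numpy.interp as a linear stand-in for the smooth interpolation the actual phi function uses:

import numpy as np

## control points: y values and their relevance (third entry omitted here)
ctrl_y   = [35000, 125000, 200000, 250000]
ctrl_phi = [1, 0, 0, 0]

## relevance of intermediate y values is interpolated, so e.g. 80000
## (halfway between 35000 and 125000) gets relevance ~0.5
y   = np.array([35000, 80000, 125000, 225000])
phi = np.interp(y, ctrl_y, ctrl_phi)
print(phi)              ## [1.  0.5 0.  0. ]

## values whose relevance exceeds rel_thres are treated as rare
## ("minority") and over-sampled; the rest are under-sampled
rel_thres = 0.5
print(phi > rel_thres)  ## [ True False False False]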

Handling categorical features

Hello, if my dataset has categorical features, how will SMOGN handle them?
Should I preprocess them in any way, e.g. with one-hot encoding?
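A hedged sketch of one answer, assuming smogn identifies nominal features by column dtype (non-numeric columns treated as nominal), in which case one-hot encoding should not be needed; the column names and synthetic data here are illustrative only:

import numpy as np
import pandas as pd
import smogn

rng = np.random.default_rng(0)
df = pd.DataFrame({
    'region': rng.choice(['north', 'south', 'east'], size = 200),
    'sqft':   rng.normal(1500, 300, size = 200),
    'price':  rng.lognormal(12, 0.5, size = 200)  ## skewed target
})

## ensure categorical columns carry a non-numeric dtype
df['region'] = df['region'].astype(object)

df_res = smogn.smoter(data = df, y = 'price')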

Hyperparameter optimization

Hello,

I was wondering about the best way to perform hyperparameter optimization for SMOGN. When I'm using SMOTE and performing hyperparameter optimization, I use the imblearn pipeline. Is there something similar for SMOGN?

Thank you!
Monika

SMOGN with `under_samp`=False fails to return original data

Hello devs,

I have a small amount of data and need to over-sample outliers without under-sampling my original data. However, running SMOGN with under_samp=False appears to return only the synthetic data. This can easily be worked around by concatenating the output with the original data, but it is extremely unintuitive, doesn't seem to be documented anywhere, and is not conducive to including SMOGN in a larger pipeline.

Is this behaviour intended?

Colab example here
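A minimal sketch of the concatenation workaround described above, with train and 'target' as hypothetical placeholders for your own dataframe and header name:

import pandas as pd
import smogn

## under_samp = False appears to return only the synthetic rows,
## so append them to the untouched original data
synth = smogn.smoter(
    data = train.reset_index(drop = True),  ## 'train': your original data
    y = 'target',
    under_samp = False
)
augmented = pd.concat([train, synth], ignore_index = True)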

Identify generated examples

Hello @nickkunz ,
Thanks for the great repo!
Just a quick question: after applying SMOGN to the dataframe, is it possible to easily identify which examples in the modified dataframe belonged to the original dataframe, and which were generated on the fly?
It would be important to know whether the trained model can correctly predict the ground truths (rather than the ground truths plus generated examples).
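One hedged way to approximate this, assuming original rows are carried into the output unchanged while synthetic rows are perturbed copies: flag output rows that exactly match an original row. df and 'target' are placeholders:

import smogn

df_res = smogn.smoter(data = df.reset_index(drop = True), y = 'target')

## rows that match an original row on every column are (in principle)
## retained originals; everything else was generated on the fly
flagged = df_res.merge(df.drop_duplicates(), how = 'left', indicator = True)
flagged['synthetic'] = flagged.pop('_merge').eq('left_only')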

It's taking more than 20 hours to sample the data

Hi Nick,

I am seeing a huge runtime for my input data, which is 28K × 59.
It's been running for more than a day, even though I have standardized the input data.
Any possible solution?

dist_matrix: 5%|4 | 276/5671 [50:48<16:50:38, 11.24s/it]

Reproducibility of smoter

How can I make the outcome of smogn.smoter reproducible?
I tried

np.random.seed = 1

but smoter with samp_method="extreme" still produces different results. Is it possible to make it reproducible, or is it non-deterministic by nature?

I also set smogn.smoter(..., seed=1, ...). This only works when the package is installed directly from GitHub; the PyPI release doesn't have this argument.
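One thing worth noting in the snippet above: np.random.seed is a function, so assigning to it (np.random.seed = 1) silently replaces the function and seeds nothing. A minimal sketch of the corrected call, with df and 'target' as placeholders:

import numpy as np
import smogn

np.random.seed(1)  ## call it; don't assign to it

df_res = smogn.smoter(
    data = df.reset_index(drop = True),
    y = 'target',
    samp_method = 'extreme'
    ## seed = 1,  ## per the report above, this argument exists only
    ##            ## when installed directly from GitHub
)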

Using SMOGN only reduces the number of observations

Hey all, would love to hear if anyone here can help me.

I have a small number of data rows (a 198 × 13 DataFrame).
My goal with smogn is to increase the number of observations with Gaussian noise to make my ML model more robust.
I still haven't figured out how to control all those parameters so that I increase my data quantity without going outside the ranges of my columns and without getting duplicate rows.

I would love to hear any recommendation / examples.

Thanks all,
Roi.

Hi

Thanks for this app!
Concerning the advanced example, why are rel_thres and rel_ctrl_pts_rg both provided?
If rel_method = "manual", why is rel_thres also used?

How can you calculate rel_thres?

Thank you.
Wishing you a great day.

How to specify resampling range?

Hello,

I'm trying to use SMOGN on my dataset. The default parameters are good to some extent, but I was wondering if I could specify the range I want to over-sample or under-sample.
For example, my y values are between 3 and 8, and there are only a few data points between 7 and 8. How can I over-sample only the data points between 7 and 8?

The advanced example mentions something like this:

## specify phi relevance values
rg_mtrx = [
    [35000,  1, 0],  ## over-sample ("minority")
    [125000, 0, 0],  ## under-sample ("majority")
    [200000, 0, 0],  ## under-sample
    [250000, 0, 0],  ## under-sample
]

But I couldn't make sense of these values. In [35000, 1, 0], what are the 1 and 0 for? What do they represent? It says somewhere that it's a 2d array (format: [x, y]), so which is x and which is y? And why are there three values if it's only x and y?

Thanks in advance for any help :)
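A hedged reading, for what it's worth: each row appears to be [y value, relevance, derivative], not [x, y]. Relevance (0 to 1) is interpolated between rows, and the third entry seems to be a slope term for that interpolation, left at 0 in the examples. Under that reading, a hypothetical matrix for a target in [3, 8] with rare values near 7-8 might look like:

## hypothetical control points for y in [3, 8]; values near 8 get
## relevance near 1 and are over-sampled, the rest stay "majority"
rg_mtrx = [
    [3, 0, 0],  ## under-sample ("majority")
    [7, 0, 0],  ## relevance stays low up to 7
    [8, 1, 0]   ## over-sample ("minority")
]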

Error message: redefine phi relevance function

Hi Nick,

Many thanks for making this package available!

With my data set and following the code example for the intermediate exercise, I bumped into this error message: redefine phi relevance function: all points are 0

Checking the source code, I noticed that there is a safeguard:

if all(i == 1 for i in y_phi):
    raise ValueError("redefine phi relevance function: all points are 0")

but I could not work out how this relates to my data. I am using Python 3.6.5 on a Windows machine with smogn 0.1.2.

I attached a copy of the script and input data.

Thanks for your help,

Ivan

Testing_SMOGN_package.zip

Some features are missing after resampling

Hi, thanks for the package. I'm running into a problem: I have a dataframe with 20 columns/features, but after resampling with smogn, I end up with only 15 columns.
My code looks like this:
rg_mtrx = [
    [40, 1, 0],  ## over-sample ("minority")
    [0,  0, 0],  ## under-sample ("majority")
]

train_set_smogn = smogn.smoter(
    data = train_set.reset_index(drop = True),
    y = 'count',
    # k = 10,
    k = 5,
    pert = 0.04,
    samp_method = 'extreme',
    rel_thres = 0.1,
    rel_method = "manual",
    rel_xtrm_type = "high",
    rel_coef = 0.01,
    rel_ctrl_pts_rg = rg_mtrx
)

x_smogn = train_set_smogn.drop('count', axis = 1)
y_smogn = train_set_smogn['count']

IndexError: positional indexers are out-of-bounds

I am running this with data as shown in the example, and I get IndexError: positional indexers are out-of-bounds.

data.shape
(1000, 15)

rg_mtrx = [
    [35000,  1, 0],  ## over-sample ("minority")
    [125000, 0, 0],  ## under-sample ("majority")
    [200000, 0, 0],  ## under-sample
    [250000, 0, 0],  ## under-sample
]

gal_data_smogn = smogn.smoter(
    
    ## main arguments
    data = X_sample,           ## pandas dataframe
    y = 'y_val',          ## string ('header name')
    k = 7,                    ## positive integer (k < n)
    pert = 0.04,              ## real number (0 < R < 1)
    samp_method = 'balance',  ## string ('balance' or 'extreme')
    drop_na_col = True,       ## boolean (True or False)
    drop_na_row = True,       ## boolean (True or False)
    replace = False,          ## boolean (True or False)

    ## phi relevance arguments
    rel_thres = 0.10,         ## real number (0 < R < 1)
    rel_method = 'manual',    ## string ('auto' or 'manual')
    # rel_xtrm_type = 'both', ## unused (rel_method = 'manual')
    # rel_coef = 1.50,        ## unused (rel_method = 'manual')
    rel_ctrl_pts_rg = rg_mtrx ## 2d array (format: [x, y])
)

When I run this I get the IndexError.

SMOGN is creating a new class for target

Hey!
Any idea why the algorithm is creating a new class (value) for my target? I'm analyzing the Room_Occupancy_Dataset from Kaggle. In this dataset the target only has four values for occupancy (0, 1, 2, or 3 people in the room), but the model is expected to be able to predict cases with more than 3 people in the room. SMOGN is not balancing the data correctly: the majority class (0) remains the same, the minority classes (1, 2, 3) are not over-sampled, and it creates an extra value (4). I don't know if this is a bug, but I hope you can help me fix it. This is my 2d array:

rg_mtrx = [
    [0, 0, 0],  ## under-sample ("majority")
    [1, 1, 0],  ## over-sample ("minority")
    [2, 1, 0],  ## over-sample ("minority")
    [3, 1, 0],  ## over-sample ("minority")
]

## conduct smogn
balanced_smogn = smogn.smoter(
    
    ## main arguments
    data = df,            ## pandas dataframe
    y = 'Room_Occupancy_Count', ## string ('header name')
    k = 5,                    ## positive integer (k < n)
    pert = 0.02,              ## real number (0 < R < 1)
    samp_method = 'extreme',  ## string ('balance' or 'extreme')
    drop_na_col = False,       ## boolean (True or False)
    drop_na_row = False,       ## boolean (True or False)
    replace = True,          ## boolean (True or False)

    ## phi relevance arguments
    rel_thres = 0.50,         ## real number (0 < R < 1)
    rel_method = 'manual',    ## string ('auto' or 'manual')
    # rel_xtrm_type = 'both', ## unused (rel_method = 'manual')
    # rel_coef = 1.50,        ## unused (rel_method = 'manual')
    rel_ctrl_pts_rg = rg_mtrx ## 2d array (format: [x, y])
)

Reducing verboseness

Is there an option to reduce the printed output? I'm running this in a notebook, and every time I synthesize, the cell prints these:

dist_matrix: 100%|##########| 14/14 [00:00<00:00, 400.82it/s]
synth_matrix: 100%|##########| 14/14 [00:00<00:00, 637.42it/s]
r_index: 100%|##########| 5/5 [00:00<00:00, 626.63it/s]

I'd just like to hide these really.
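A hedged workaround sketch, assuming the bars come from tqdm, which writes to stderr by default: redirect stderr around the call (df and 'target' are placeholders):

import io
from contextlib import redirect_stderr

import smogn

## suppress the progress bars by swallowing stderr for this call
with redirect_stderr(io.StringIO()):
    df_res = smogn.smoter(data = df.reset_index(drop = True), y = 'target')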

Defining sampling strategy

Is it possible to use the algorithm to apply up-sampling without any down-sampling?
For example, if I have a dataset with the following distribution of the target feature:
500 Negative Samples
200 Positive Samples
1000 ==0 Samples

Can I set the algorithm to only up-sample the positive values without affecting the number of negative and zero-valued samples? For example, the output would be:

500 Negative Samples
500 Positive Samples
1000 ==0 Samples

I know that in the imblearn.over_sampling.SMOTENC function it is possible to set the 'sampling_strategy' argument to a dictionary where the keys correspond to the targeted classes. The values correspond to the desired number of samples for each targeted class.
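smogn has no per-class sampling_strategy, since it targets regression, but a hedged approximation under manual relevance might mark only the positive range as relevant and disable under-sampling, then concatenate with the originals (given the under_samp=False behaviour reported above). Exact output counts are not directly controllable this way; df and 'target' are placeholders:

import smogn

rg_mtrx = [
    [-1, 0, 0],  ## negative values: leave alone
    [ 0, 0, 0],  ## zero values: leave alone
    [ 1, 1, 0]   ## positive values: over-sample
]

df_res = smogn.smoter(
    data = df.reset_index(drop = True),
    y = 'target',
    under_samp = False,  ## no down-sampling of majority rows
    rel_thres = 0.5,
    rel_method = 'manual',
    rel_ctrl_pts_rg = rg_mtrx
)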

smogn time complexity

I am running the smogn.smoter function on a dataset of size [77955, 4], and those 4 columns include the target variable Y. Y is a continuous r.v. that follows a skewed distribution where it covers the range [0, 1.8] and 55000 of the training instances lie in the range [0,0.07].

However, the function has been running for a long time without finishing, so I was wondering: what is the time complexity of the algorithm?

IndexError: indices are out-of-bounds

Hi Nick,

Great package!

I just ran into an IndexError when the DataFrame index values are not from a RangeIndex. I would imagine this to happen quite often if the user passes in training data from a shuffled train-test split.

Code to reproduce the error:

import pandas as pd
import smogn
housing = pd.read_csv('https://raw.githubusercontent.com/nickkunz/smogn/master/data/housing.csv')
smogn.smoter(housing[housing.index > 10], 'SalePrice')

smogn.smoter(housing[housing.index > 10].reset_index(), 'SalePrice') fixes it, but is not necessarily desirable because I would like (need) to preserve the original index.

Best,
Michael
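A hedged sketch of one compromise: keep the original index as an ordinary column so it survives the reset, at the cost of that column participating in the resampling as a numeric feature (and synthetic rows receiving synthesized, meaningless values in it):

import pandas as pd
import smogn

housing = pd.read_csv('https://raw.githubusercontent.com/nickkunz/smogn/master/data/housing.csv')
subset = housing[housing.index > 10]

work = subset.reset_index()          ## original index -> 'index' column
out = smogn.smoter(work, 'SalePrice')
out = out.set_index('index')         ## restore; synthetic rows carry
                                     ## synthesized 'index' values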

ValueError: redefine phi relevance function: all points are 0

I have an issue with this redefine error.
My data has a quirk: rows with the same x values but different y values.

rg_mtrx = [
    [35000,  1, 0],  ## over-sample ("minority")
    [125000, 0, 0],  ## under-sample ("majority")
    [200000, 0, 0],  ## under-sample
    [250000, 0, 0],  ## under-sample
]

oversample_2 = smogn.smoter(
    data = data_merge, 
    y = target,
    k = 7,                    ## positive integer (k < n)
    pert = 0.01,              ## real number (0 < R < 1)
    samp_method = 'balance',  ## string ('balance' or 'extreme')
    drop_na_col = True,       ## boolean (True or False)
    drop_na_row = True,       ## boolean (True or False)
    replace = False,        
    rel_thres = 0.10,         ## real number (0 < R < 1)
    rel_method = 'manual',    ## string ('auto' or 'manual')
    rel_ctrl_pts_rg = rg_mtrx ## 2d array (format: [x, y])
)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-71-92b5331e2604> in <module>
     23 #     rel_xtrm_type = 'both', ## unused (rel_method = 'manual')
     24 #     rel_coef = 0.001,        ## unused (rel_method = 'manual')
---> 25     rel_ctrl_pts_rg = rg_mtrx ## 2d array (format: [x, y])
     26 )

~/anaconda3/envs/pytorch/lib/python3.7/site-packages/smogn/smoter.py in smoter(data, y, k, pert, samp_method, under_samp, drop_na_col, drop_na_row, replace, rel_thres, rel_method, rel_xtrm_type, rel_coef, rel_ctrl_pts_rg)
    178 
    179     if all(i == 1 for i in y_phi):
--> 180         raise ValueError("redefine phi relevance function: all points are 0")
    181     ## ---------------------------------------------------------------------- ##
    182 

ValueError: redefine phi relevance function: all points are 0
import smogn
housing_smogn_2 = smogn.smoter(
    data = housing_smogn, 
    y = "SalePrice"
)
housing_smogn_3 = smogn.smoter(
    data = housing_smogn_2.reset_index(drop=True), 
    y = "SalePrice"
)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-67-a679777bf14e> in <module>
      2 housing_smogn_3 = smogn.smoter(
      3     data = housing_smogn_2.reset_index(drop=True),
----> 4     y = "SalePrice"
      5 )

~/anaconda3/envs/pytorch/lib/python3.7/site-packages/smogn/smoter.py in smoter(data, y, k, pert, samp_method, under_samp, drop_na_col, drop_na_row, replace, rel_thres, rel_method, rel_xtrm_type, rel_coef, rel_ctrl_pts_rg)
    175     ## phi relevance quality check
    176     if all(i == 0 for i in y_phi):
--> 177         raise ValueError("redefine phi relevance function: all points are 1")
    178 
    179     if all(i == 1 for i in y_phi):

ValueError: redefine phi relevance function: all points are 1


please fix this :)

Multi-thread support.

Hello,

Thank you for sharing the smogn implementation. Could you please tell me whether the algorithm can run on multiple threads? Currently it runs on a single thread, so multi-threading would help increase the speed.

Take input as numpy arrays

I've got X, y as numpy arrays. It's annoying to have to shove them into a dataframe, specifically renaming the target column with a string name, passing it to smoter, and then pulling arrays back out of the resulting dataframe. The input should be more flexible, in my opinion.
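A minimal wrapper sketch for array-based workflows; '__y__' is just an arbitrary header chosen here, and any smoter keyword arguments pass straight through:

import numpy as np
import pandas as pd
import smogn

def smoter_xy(X, y, **kwargs):
    ## pack the arrays into the dataframe smoter expects
    df = pd.DataFrame(np.asarray(X))
    df.columns = [str(c) for c in df.columns]  ## string headers
    df['__y__'] = np.asarray(y)
    out = smogn.smoter(data = df, y = '__y__', **kwargs)
    ## unpack back into arrays
    return out.drop(columns = '__y__').to_numpy(), out['__y__'].to_numpy()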

Binary label

Hello,
I'm trying to use the smogn method in a dataset where the labels are either 0 (majority) or 1 (minority).
So my rg_matrix would be:
rg_mtrx = [
    [1, 1, 0],  ## over-sample ("minority")
    [0, 0, 0],  ## under-sample ("majority")
]
However, I'm getting an index error:


How can I do the augmentation for a dataset like that?
Thanks!

redefine phi relevance function: all points are 0

/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/smogn/phi.py:81: RuntimeWarning: divide by zero encountered in double_scalars
delta.append((y_rel[i + 1] - y_rel[i]) / h[i])
redefine phi relevance function: all points are 0

hide progress bar

Hi, thanks for adding the progress bar; it's nice to see while balancing data once. But when we use it inside a nested k-fold loop, it makes a mess. Could you please add an option to show or hide the progress bar? :)

CUDA availability

Hi
Thanks for the amazing library. However, the number of data samples I have is quite large; as it stands, it's going to take days to run this computation. Is it possible to run this code on Google Colab's GPU? Is your code set up to run on CUDA?

UnboundLocalError: local variable 'a' referenced before assignment

Whenever I use smogn.smoter on my data set, this UnboundLocalError pops up:

~\Anaconda3\lib\site-packages\smogn\smoter.py in smoter(data, y, k, pert, samp_method, under_samp, drop_na_col, drop_na_row, replace, rel_thres, rel_method, rel_xtrm_type, rel_coef, rel_ctrl_pts_rg)
243 perc = s_perc[i],
244 pert = pert,
--> 245 k = k
246 )
247

~\Anaconda3\lib\site-packages\smogn\over_sampling.py in over_sampling(data, index, perc, pert, k)
273
274 if len(feat_list_nom) > 0:
--> 275 a = a + sum(data.iloc[
276 i, feat_list_nom] != synth_matrix[
277 i * x_synth + j, feat_list_nom])

UnboundLocalError: local variable 'a' referenced before assignment

Any ideas?

IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer

I am trying to run smogn on my dataframe, which has already had dropna applied. The parameter settings are as below:

ztg = smoter(data=ztg.reset_index(drop=True), y='FINAL_MARKS', 
             samp_method = 'extreme',drop_na_col=True,drop_na_row=True, replace=False,
            rel_xtrm_type= 'high', rel_coef = 2.25)

But I still get the error 'IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer'. Any solution to resolve this issue, please?

nominal / categorical features

How are independent nominal/categorical features identified and differentiated from numerical features? Is there a way to pre-specify them?

Resampled Data contains missing values

When resampling data, I randomly get the ValueError "oops! synthetic data contains missing values".
Is there anything I can do to prevent that from happening, e.g. change any parameters, or is this dependent on the input DataFrame?
Furthermore, how severe is this error, and would it be possible to simply remove the affected rows?
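Since the ValueError is raised inside smoter, the affected rows can't be dropped from its output. A hedged precaution is to verify the input is fully finite first, since NaN or inf in the input can plausibly propagate into the synthetic rows; df is a placeholder:

import numpy as np

## check the input before calling smogn.smoter
assert not df.isna().any().any(), 'input contains NaN'
num = df.select_dtypes(include = [np.number]).to_numpy()
assert np.isfinite(num).all(), 'input contains inf'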

Additional variables

I was thinking it would be useful to be able to specify variables in the data that are neither the target variable nor variables used to perform the resampling, and that would just be passed through. A practical use case is models that use offsets, or data containing IDs that might be useful for building cross-validation folds with matching unsampled data. Thoughts?

Resampling with label uniformity and user uniformity

Hi,

I have a regression problem, so the label is a single floating-point number within a well-defined range (e.g. [0, 1]). The label distribution is non-uniform: namely, there is markedly less data at the edges, but also in the very middle of the range. So far, a classical problem for SMOGN. However, I sample data from multiple users, and there is also a huge imbalance in amount of data among users. I would prefer that all users are well-represented in the training set in addition to balancing the label range distribution. Thus, I would prefer that the algorithm is aware of user labels, and tries to undersample users with a lot of data and preserve or oversample users with little data. Is this currently possible? Do you have suggestions?

Over-sampling

Hi,
I am working on a regression problem and I want to use SMOTER. The problem is that I don't understand how to over-sample my data significantly: my input dataframe size is [716, 3457], and the output is about the same size ([1068, 3457]).
I read the function and the examples, but couldn't understand how to do it.
Specifically, I'm using the DeepSMOTE method to create additional synthetic signals, so the over-sampling is done in the latent space after the encoder.
Thanks,
Sharon
