Comments (6)
I'm still not sure what causes it. It seems to happen intermittently when I switch samp_method from 'extreme' to 'balance', but it looks random (this is a dataset with ~300 columns and ~150 rows, no missing entries). I've run the exact same code twice and had it fail one time and succeed the other.
Here's a bit of an ugly workaround. Given how random it seems (I've never seen it happen more than twice in a row), you could wrap the smoter call in a while loop like this:
import smogn

n_tries = 0
done = False
while not done:
    try:
        # keep the result; retry only on the intermittent ValueError
        result = smogn.smoter(data=foo, y="bar")
        done = True
    except ValueError:
        if n_tries < 5:
            n_tries += 1
        else:
            raise
That said, keep in mind that this will still fail if there really is something wrong with the dataset you're feeding smogn. This only helps when things fail seemingly-randomly.
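The same retry idea can be wrapped in a small helper so the retry limit lives in one place. This is only a sketch: `retry_on_value_error` is a hypothetical name, not part of smogn, and it assumes the intermittent failure surfaces as a ValueError, as in the loop above.

```python
def retry_on_value_error(func, max_tries=5):
    """Call func() until it succeeds or max_tries attempts have failed."""
    for attempt in range(1, max_tries + 1):
        try:
            return func()
        except ValueError:
            if attempt == max_tries:
                # Out of attempts: re-raise the last error unchanged.
                raise


# Usage with smogn (foo and "bar" are the placeholders from the loop above):
# import smogn
# resampled = retry_on_value_error(lambda: smogn.smoter(data=foo, y="bar"))
```

Because the helper re-raises after the last attempt, a dataset that fails every time still produces the original traceback.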
from smogn.
Same here
from smogn.
Thank you for using SMOGN. Have you ensured that you have a valid pandas dataframe with no missing values?
from smogn.
Great idea implementing this, thanks.
Yes, absolutely, I explicitly dropna() all columns beforehand. A call to df.isna().sum().sum() returns 0 (zero). Aren't NaNs dropped by default anyway?
If I choose just the first 4000 rows from the df, as opposed to the full 15000, smogn runs fine.
I have yet to find what exactly in the df it doesn't like.
Update: setting samp_method to 'extreme' and rel_xtrm_type to 'high' makes the method run without errors on the same df. If I change rel_thres from the default to 0.3, though, it raises the same error. It definitely isn't a dataset problem.
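One thing worth ruling out alongside the isna() check: pandas does not count infinities as missing by default, so df.isna().sum().sum() can be 0 while the frame still holds inf values. Whether that matters for this particular error is an assumption, not something confirmed in this thread. A minimal sketch of the extra check, with `count_non_finite` as a hypothetical helper name and assuming ordinary numeric dtypes:

```python
import numpy as np
import pandas as pd


def count_non_finite(df):
    """Count NaN and +/-inf entries across the numeric columns of df."""
    numeric = df.select_dtypes(include=[np.number])
    values = numeric.to_numpy(dtype=float)
    return {
        "nan": int(np.isnan(values).sum()),
        "inf": int(np.isinf(values).sum()),
    }


# Small demonstration frame (made-up values, not data from this issue).
demo = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [np.inf, 2.0, -np.inf]})
print(demo.isna().sum().sum())   # isna() does not flag the infinities
print(count_non_finite(demo))
```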
from smogn.
This only occurs for me in a specific circumstance:
import smogn
import pandas as pd

# pt, X_train, y_train, labels and N are defined elsewhere (not shown)
df = pd.DataFrame(pt.transform_x(X_train), columns=labels)
df['logl'] = pt.transform_y(y_train)
oversampled_train = smogn.smoter(data=df.iloc[:N], y='logl', samp_method='extreme')
When N is 1000 I get that error; when N is 1001 it works.
The values of N that trigger it change depending on the choice of parameters.
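To map out which values of N fail for a given parameter set, a small scan can help. This is a sketch only: `find_failing_sizes` is a hypothetical helper, and it assumes the failure is raised as a ValueError as reported earlier in this thread. Since the error has also been described as intermittent, a single pass may not be conclusive.

```python
def find_failing_sizes(run, sizes):
    """Call run(n) for each n in sizes and collect the ValueErrors raised."""
    failures = {}
    for n in sizes:
        try:
            run(n)
        except ValueError as err:
            failures[n] = str(err)
    return failures


# Usage with smogn (df and 'logl' as in the snippet above):
# import smogn
# failures = find_failing_sizes(
#     lambda n: smogn.smoter(data=df.iloc[:n], y='logl', samp_method='extreme'),
#     range(995, 1006),
# )
# print(sorted(failures))
```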
from smogn.
I hit this error too, even though I had verified that the dataframe passed to smogn.smoter has no missing values, so I'm not sure why it appears. Is there a way to get past it? Also, with more than 2000 rows the distmatrix step is very slow. Is there any way to speed it up?
from smogn.