
featurewiz's Introduction

featurewiz

FeatureWiz, the ultimate feature selection library, is powered by the renowned Minimum Redundancy Maximum Relevance (MRMR) algorithm. Learn more about it below.


Table of Contents

Latest

featurewiz version 5.0 is out! It contains brand-new Deep Learning Auto Encoders to enrich your data for the toughest imbalanced and multi-class datasets. In addition, it has multiple brand-new classifiers built for imbalanced and multi-class problems, such as the IterativeDoubleClassifier and the BlaggingClassifier. If you are looking for the latest and greatest updates about our library, check out our updates page.

Citation

If you use featurewiz in your research project or paper, please use the following format for citations:

"Seshadri, Ram (2020). GitHub - AutoViML/featurewiz: Use advanced feature engineering strategies and select the best features from your data set fast with a single line of code. source code: https://github.com/AutoViML/featurewiz"

Current citations for featurewiz

Google Scholar citations for featurewiz

Highlights

featurewiz is the best feature selection library for boosting your machine learning performance with minimal effort and maximum relevance using the famous MRMR algorithm.

What Makes FeatureWiz Stand Out?

  • Automatically selects the most relevant features without requiring you to specify a number
  • Fast and user-friendly, perfect for data scientists at all levels
  • Provides a built-in categorical-to-numeric encoder
  • Well-documented, with plenty of examples
  • Actively maintained and regularly updated

Simple tips for success using featurewiz

  • First create additional features using the feature engineering module
  • Compare featurewiz against other feature selection methods for the best performance
  • Avoid overfitting by cross-validating your results, as shown here
  • Try adding auto-encoders to create additional features that may help boost performance

Feature Engineering

Create new features effortlessly with a single line of code. featurewiz enables you to generate hundreds of interaction, group-by, or target-encoded features, eliminating the need for expert-level skills.
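
For illustration, here is a minimal, hedged sketch of turning on the feature engineering flags with the old-style functional syntax (it assumes a pandas DataFrame named train with a target column named "target"; the flag values are arbitrary examples, described in detail in the API section below):

    # Hedged sketch: generate interaction, group-by and target-encoded features,
    # then let featurewiz select the best of them.
    import featurewiz as fwiz

    outputs = fwiz.featurewiz(
        dataname=train,                                       # assumed training DataFrame
        target="target",                                      # assumed target column name
        feature_engg=["interactions", "groupby", "target"],   # the three engineering options
        corr_limit=0.70,
        verbose=1,
    )
    features, trainm = outputs   # selected feature names and the transformed train DataFrame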

What is MRMR?

featurewiz provides one of the best automatic feature selection algorithms, MRMR, which Wikipedia describes as follows: "The MRMR feature selection algorithm has been found to be more powerful than the maximum relevance feature selection algorithm", i.e., than relevance-only approaches such as Boruta.
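
For reference, a common formulation of the MRMR criterion from the literature (the general idea, not necessarily featurewiz's exact scoring) picks, at each step, the feature f that maximizes its relevance to the target y minus its average redundancy with the already-selected set S, where I(·;·) denotes mutual information:

    \text{score}(f) \;=\; I(f;\,y) \;-\; \frac{1}{|S|} \sum_{s \in S} I(f;\,s)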

How does MRMR feature selection work?

After creating new features, featurewiz uses the MRMR algorithm to answer crucial questions: Which features are important? Are they redundant or multi-correlated? Does your model suffer from, or benefit from, these new features? To answer these questions, two more steps are needed:

  • SULOV Algorithm: the "Searching for Uncorrelated List of Variables" method ensures you are left with the most relevant, non-redundant features.
  • Recursive XGBoost: featurewiz leverages XGBoost to repeatedly identify the best features among the variables remaining after SULOV.

Advanced Feature Engineering Options

featurewiz extends beyond traditional feature selection by including powerful feature engineering capabilities such as:

  • Auto Encoders, including Denoising Auto Encoders (DAEs), Variational Auto Encoders (VAEs), CNNs (Convolutional Neural Networks) and GANs (Generative Adversarial Networks), for additional feature extraction, especially on imbalanced datasets.
  • A variety of category encoders like HashingEncoder, SumEncoder, PolynomialEncoder, BackwardDifferenceEncoder, OneHotEncoder, HelmertEncoder, OrdinalEncoder, and BaseNEncoder (see the sketch after this list).
  • The ability to add interaction features (e.g., x1*x2, x2*x3, x1**2), polynomial features (x**2, x**3), group-by features, and target encoding.
  • Examples and Updates
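
    As a hedged illustration of the category-encoder option (encoder names follow the API section further down; X_train and y_train are assumed to already exist):

    # Sketch: ask FeatureWiz to encode categoricals with specific encoders before
    # MRMR selection. This is an illustration, not a recommendation.
    from featurewiz import FeatureWiz

    fwiz = FeatureWiz(
        category_encoders=["onehot", "label"],   # no more than two encoders are recommended
        feature_engg="",                         # no extra engineered features in this sketch
        verbose=1,
    )
    X_train_selected, y_train_out = fwiz.fit_transform(X_train, y_train)
    print(fwiz.features)   # list of selected feature names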

    Workings

    featurewiz has two major modules to transform your Data Science workflow:

    1. Feature Engineering Module


  • Advanced Feature Creation: use Deep Learning-based Auto Encoders and GANs to extract features to add to your data. These powerful capabilities will help you solve your toughest problems.
  • Options for Enhancement: Use "interactions", "groupby", or "target" flags to enable advanced feature engineering techniques.
  • Kaggle-Ready: Designed to meet the high standards of feature engineering required in competitive data science, like Kaggle.
  • Efficient and User-Friendly: Generate and sift through thousands of features, selecting only the most impactful ones for your model.


    2. Feature Selection Module

  • MRMR Algorithm: Employs Minimum Redundancy Maximum Relevance (MRMR) for effective feature selection.
  • SULOV Method: Stands for 'Searching for Uncorrelated List of Variables', ensuring low redundancy and high relevance in feature selection.
  • Addressing Key Questions: Helps interpret new features, assess their importance, and evaluate the model's performance with these features.
  • Optimal Feature Subset: Uses Recursive XGBoost in combination with SULOV to identify the most critical features, reducing overfitting and improving model interpretability.
  • Comparing featurewiz to Boruta:

    featurewiz uses what is known as a Minimal Optimal algorithm, while Boruta uses an All-Relevant algorithm. To understand how featurewiz's MRMR approach differs from Boruta for comprehensive feature selection, see the chart below. It shows how the SULOV algorithm performs MRMR feature selection, which yields a smaller feature set than Boruta. Additionally, Boruta retains redundant (highly correlated) features, which can hamper model performance, while featurewiz does not.

    Learn More About MRMR

    Transform your feature engineering and selection process with featurewiz - the tool that brings expert-level capabilities to your fingertips!

    Working

    featurewiz performs feature selection in 2 steps. Each step is explained below. The working of the SULOV algorithm is as follows:

    1. Find all the pairs of highly correlated variables exceeding a correlation threshold (say absolute(0.7)).
    2. Then find their MIS (Mutual Information Score) with respect to the target variable. MIS is a non-parametric scoring method, so it is suitable for all kinds of variables and targets.
    3. Now take each pair of correlated variables, then knock off the one with the lower MIS score.
    4. What's left are the variables with the highest information scores and the least correlation with each other (see the sketch below).

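    A minimal, illustrative sketch of the SULOV idea (an approximation for clarity, not the library's actual implementation; it assumes a numeric, NaN-free DataFrame X and a classification target y):

    # Illustrative sketch of SULOV ("Searching for Uncorrelated List of Variables").
    import pandas as pd
    from sklearn.feature_selection import mutual_info_classif  # use mutual_info_regression for regression targets

    def sulov_sketch(X: pd.DataFrame, y: pd.Series, corr_limit: float = 0.7):
        # 1. Mutual Information Score (MIS) of every feature against the target
        mis = pd.Series(mutual_info_classif(X, y), index=X.columns)

        # 2. Find all pairs of features whose absolute correlation exceeds the threshold
        corr = X.corr().abs()
        removed = set()
        cols = list(X.columns)
        for i, a in enumerate(cols):
            for b in cols[i + 1:]:
                if corr.loc[a, b] > corr_limit:
                    # 3. Knock off the member of the pair with the lower MIS score
                    removed.add(a if mis[a] < mis[b] else b)

        # 4. What's left: high-information, mutually uncorrelated features
        return [c for c in cols if c not in removed]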

    The working of the Recursive XGBoost is as follows: Once SULOV has selected variables that have high mutual information scores with the least correlation among them, featurewiz uses XGBoost to repeatedly find the best features among the remaining variables after SULOV.

    1. Select all the variables in the data set and split the full data into train and valid sets.
    2. Find the top X features (say, 10) on the train set, using the valid set for early stopping (to prevent over-fitting).
    3. Then take the next set of variables and find the top X among them.
    4. Repeat this 5 times, then combine all selected features and de-duplicate them (see the sketch below).

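    A rough, hedged sketch of that recursive idea (again an approximation, not featurewiz's exact code; it assumes numeric X_train/X_valid DataFrames, a classification target, and xgboost >= 1.6):

    # Illustrative sketch of recursive XGBoost feature selection.
    import numpy as np
    import pandas as pd
    from xgboost import XGBClassifier

    def recursive_xgb_sketch(X_train, y_train, X_valid, y_valid, top_x=10, n_rounds=5):
        selected = []
        # Split the columns into successive chunks and mine each chunk for its top features
        for chunk in np.array_split(np.array(X_train.columns), n_rounds):
            cols = list(chunk)
            model = XGBClassifier(n_estimators=100, early_stopping_rounds=10, verbosity=0)
            model.fit(X_train[cols], y_train,
                      eval_set=[(X_valid[cols], y_valid)],   # valid set used for early stopping
                      verbose=False)
            importances = pd.Series(model.feature_importances_, index=cols)
            selected.extend(importances.nlargest(top_x).index.tolist())
        # Combine all selected features and de-duplicate them, preserving order
        return list(dict.fromkeys(selected))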

    Tips

    Here are some additional tips for ML engineers and data scientists when using featurewiz:

    1. How to cross-validate your results: When you use featurewiz, it automatically performs multiple rounds of feature selection using permutations on the number of columns. You can also perform feature selection over permutations of rows, as shown in the cross_validate example for featurewiz (see the sketch after this list).
    2. Use multiple feature selection tools: It is a good idea to use multiple feature selection tools and compare the results. This will help you to get a better understanding of which features are most important for your data.
    3. Don't forget to use Auto Encoders!: Autoencoders are like skilled artists who can draw a quick sketch of a complex picture. They learn to capture the essence of the data and then recreate it with as few strokes as possible. This process helps in understanding and compressing data efficiently.
    4. Don't overfit your model: It is important to avoid overfitting your model to the training data. Overfitting occurs when your model learns the noise in the training data, rather than the underlying signal. To avoid overfitting, you can use regularization techniques, such as lasso or elasticnet.
    5. Start with a small number of features: When you are first starting out, it is a good idea to start with a small number of features. This will help you to avoid overfitting your model. As you become more experienced, you can experiment with adding more features.
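
    A hedged sketch of tip 1, cross-validating the selection over row permutations by running the selector inside each fold and keeping only the features that are chosen consistently (X and y are assumed to already exist):

    # Sketch: stability of featurewiz's selections across CV folds.
    from collections import Counter
    from sklearn.model_selection import KFold
    from featurewiz import FeatureWiz

    counts = Counter()
    kf = KFold(n_splits=5, shuffle=True, random_state=42)
    for train_idx, _ in kf.split(X):
        fwiz = FeatureWiz(corr_limit=0.70, verbose=0)
        fwiz.fit_transform(X.iloc[train_idx], y.iloc[train_idx])
        counts.update(fwiz.features)

    # Keep features selected in at least 4 of the 5 folds
    stable_features = [f for f, c in counts.items() if c >= 4]
    print(stable_features)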

    Install

    Prerequisites:

    1. featurewiz is built using xgboost, dask, numpy, pandas and matplotlib. It should run on most Python 3 Anaconda installations. You won't have to import any special libraries other than "dask", "xgboost" and "networkx". Optionally, it uses LightGBM for fast modeling, which it installs automatically.
    2. The "networkx" library is used for charts and interpretability.
      If you don't have these libraries, featurewiz will install them for you automatically.

    In Kaggle notebooks, you need to install featurewiz like this (otherwise there will be errors):

    !pip install featurewiz
    !pip install Pillow==9.0.0
    !pip install xlrd --ignore-installed --no-deps
    !pip install "executing>0.10.0"
    

    To install from source:

    cd <featurewiz_Destination>
    git clone git@github.com:AutoViML/featurewiz.git
    # or download and unzip https://github.com/AutoViML/featurewiz/archive/master.zip
    conda create -n <your_env_name> python=3.7 anaconda
    conda activate <your_env_name> # ON WINDOWS: `source activate <your_env_name>`
    cd featurewiz
    pip install -r requirements.txt
    

    Good News: You can install featurewiz on Colab and Kaggle easily in 2 steps!

    As of June 2022, thanks to arturdaraujo, featurewiz is now available on conda-forge. You can try:

     conda install -c conda-forge featurewiz
    

    If the above conda install fails, you can try installing featurewiz this way:

    Install featurewiz using git+

    !pip install git+https://github.com/AutoViML/featurewiz.git
    

    Usage

    There are two ways to use featurewiz.

    1. The first way is the new way, where you use scikit-learn's `fit and transform` syntax. It also includes the `lazytransformer` library that I created to transform datetime, NLP and categorical variables into numeric variables automatically. We recommend that you use it as the main syntax for all your future needs.

      from featurewiz import FeatureWiz
      fwiz = FeatureWiz(feature_engg='', nrows=None, transform_target=True, scalers="std",
                        category_encoders="auto", add_missing=False, verbose=0, imbalanced=False,
                        ae_options={})
      X_train_selected, y_train = fwiz.fit_transform(X_train, y_train)
      X_test_selected = fwiz.transform(X_test)
      ### get list of selected features ###
      fwiz.features

    2. The second way is the old way, and this was the original syntax of featurewiz. It is still used by thousands of researchers in the field, so it will continue to be maintained. However, it may be discontinued at any time without notice. You can use it if you prefer it.

      import featurewiz as fwiz
      outputs = fwiz.featurewiz(dataname=train, target=target, corr_limit=0.70, verbose=2, sep=',',
                                header=0, test_data='', feature_engg='', category_encoders='',
                                dask_xgboost_flag=False, nrows=None, skip_sulov=False, skip_xgboost=False)

      outputs is a tuple: there will always be two objects in the output, but what they are can vary:

      • In the first case, they are features and trainm: features is a list of selected features and trainm is the transformed dataframe (if you sent in train data only).
      • In the second case, they are trainm and testm: two transformed dataframes with the selected features, when you send in both train and test data.

      In both cases, the features and dataframes are ready for you to do further modeling.

      featurewiz works on any multi-class or multi-label dataset, so you can have as many target labels as you want. You don't have to tell featurewiz whether it is a regression or classification problem; it will decide that automatically.

      API

      Input Arguments for NEW syntax

      Parameters
      ----------
      corr_limit : float, default=0.90
          The correlation limit to consider for feature selection. Features with correlations 
          above this limit may be excluded.
      
      verbose : int, default=0
          Level of verbosity in output messages.
      
      feature_engg : str or list, default=''
          Specifies the feature engineering methods to apply, such as 'interactions', 'groupby', 
          and 'target'. 
      
      auto_encoders : str or list, default=''
          Several new options have been added to `auto_encoders` (starting in version 0.5.0): `DAE`, `VAE`, `DAE_ADD`, `VAE_ADD`, `CNN`, `CNN_ADD` and `GAN`. These are deep learning auto encoders (using tensorflow and keras) that can extract the most important patterns in your data and either replace your features or add them as extra features to your data. Try them for your toughest ML problems! See the notebooks folder for examples.
      
      ae_options : dict, default={}
          You can provide a dictionary for tuning auto encoders above. Supported auto encoders include 'dae', 
          'vae', and 'gan'. You must use the `help` function to see how to send a dict to each auto encoder. You can also check out the Auto Encoder demo notebook: https://github.com/AutoViML/featurewiz/blob/main/examples/Featurewiz_with_AutoEncoder_Demo.ipynb
      
      category_encoders : str or list, default=''
          Encoders for handling categorical variables. Supported encoders include 'onehot', 
          'ordinal', 'hashing', 'count', 'catboost', 'target', 'glm', 'sum', 'woe', 'bdc', 
          'loo', 'base', 'james', 'helmert', 'label', 'auto', etc.
      
      add_missing : bool, default=False
          If True, adds indicators for missing values in the dataset.
      
      dask_xgboost_flag : bool, default=False
          If set to True, enables the use of Dask for parallel computing with XGBoost.
      
      nrows : int or None, default=None
          Limits the number of rows to process.
      
      skip_sulov : bool, default=False
          If True, skips the application of the SULOV ("Searching for Uncorrelated List of Variables") method in feature selection.
      
      skip_xgboost : bool, default=False
          If True, bypasses the recursive XGBoost feature selection.
      
      transform_target : bool, default=False
          When True, transforms the target variable(s) into numeric format if they are not 
          already.
      
      scalers : str or None, default=None
          Specifies the scaler to use for feature scaling. Available options include 
          'std', 'standard', 'minmax', 'max', 'robust', 'maxabs'.
      
      imbalanced : bool, default=False
          If True, uses the SMOTE technique to handle imbalanced datasets.
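
      As a brief, hedged illustration of combining a few of these arguments (the values are arbitrary examples, not recommendations; X_train, y_train and X_test are assumed to exist):

      # Sketch: new-syntax FeatureWiz with a VAE auto encoder (needs tensorflow/keras),
      # SMOTE for imbalance, standard scaling and automatic target transformation.
      from featurewiz import FeatureWiz

      fwiz = FeatureWiz(
          corr_limit=0.90,
          feature_engg="",
          auto_encoders="VAE",
          imbalanced=True,
          scalers="std",
          transform_target=True,
          verbose=1,
      )
      X_train_sel, y_train_out = fwiz.fit_transform(X_train, y_train)
      X_test_sel = fwiz.transform(X_test)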
      

      Input Arguments for old syntax

      • dataname: could be a datapath+filename or a dataframe. It will detect whether your input is a filename or a dataframe and load it automatically.
      • target: name of the target variable in the data set.
      • corr_limit: if you want to set your own threshold for removing variables as highly correlated, then give it here. The default is 0.90, which means variables with a Pearson correlation below -0.90 or above 0.90 will be candidates for removal.
      • verbose: This has 3 possible states:
        • 0 - limited output. Great for running this silently and getting fast results.
        • 1 - verbose. Great for knowing how results were and making changes to flags in input.
        • 2 - more charts such as SULOV and output. Great for finding out what happens under the hood for SULOV method.
      • test_data: This is only applicable to the old syntax if you want to transform both train and test data at the same time in the same way. test_data could be the name of a datapath+filename or a dataframe. featurewiz will detect whether your input is a filename or a dataframe and load it automatically. Default is empty string.
      • dask_xgboost_flag: default False. If you want to use dask with your data, then set this to True.
      • feature_engg: You can let featurewiz select its best encoders for your data set by setting this flag for adding feature engineering. There are three choices. You can choose one, two, or all three.
        • interactions: This will add interaction features to your data, such as x1*x2, x2*x3, x1**2, x2**2, etc.
        • groupby: This will generate group-by features by aggregating your numeric variables over all categorical variables.
        • target: This will encode and transform all your categorical features using certain target encoders.
          Default is the empty string (which means no additional features).
      • add_missing: default is False. This is a new flag: the add_missing flag will add a new column for missing values for all your variables in your dataset. This will help you catch missing values as an added signal.
      • category_encoders: default is "auto". Instead, you can choose your own category encoders from the list below. We recommend you do not use more than two of these. Featurewiz will automatically select only two if you have more than two in your list. You can set "auto" for our own choice or the empty string "" (which means no encoding of your categorical features)
        These descriptions are derived from the excellent category_encoders python library. Please check it out!
        • HashingEncoder: HashingEncoder is a multivariate hashing implementation with configurable dimensionality/precision. The advantage of this encoder is that it does not maintain a dictionary of observed categories. Consequently, the encoder does not grow in size and accepts new values during data scoring by design.
        • SumEncoder: SumEncoder is a Sum contrast coding for the encoding of categorical features.
        • PolynomialEncoder: PolynomialEncoder is a Polynomial contrast coding for the encoding of categorical features.
        • BackwardDifferenceEncoder: BackwardDifferenceEncoder is a Backward difference contrast coding for encoding categorical variables.
        • OneHotEncoder: OneHotEncoder is the traditional Onehot (or dummy) coding for categorical features. It produces one feature per category, each being a binary.
        • HelmertEncoder: HelmertEncoder uses the Helmert contrast coding for encoding categorical features.
        • OrdinalEncoder: OrdinalEncoder uses Ordinal encoding to designate a single column of integers to represent the categories in your data. Integers however start in the same order in which the categories are found in your dataset. If you want to change the order, just sort the column and send it in for encoding.
        • FrequencyEncoder: FrequencyEncoder is a count encoding technique for categorical features. For a given categorical feature, it replaces the names of the categories with the group counts of each category.
        • BaseNEncoder: BaseNEncoder encodes the categories into arrays of their base-N representation. A base of 1 is equivalent to one-hot encoding (not really base-1, but useful), and a base of 2 is equivalent to binary encoding. N=number of actual categories is equivalent to vanilla ordinal encoding.
        • TargetEncoder: TargetEncoder performs Target encoding for categorical features. It supports the following kinds of targets: binary and continuous. For multi-class targets, it uses a PolynomialWrapper.
        • CatBoostEncoder: CatBoostEncoder performs CatBoost coding for categorical features. It supports the following kinds of targets: binary and continuous. For polynomial target support, it uses a PolynomialWrapper. This is very similar to leave-one-out encoding, but calculates the values "on-the-fly". Consequently, the values naturally vary during the training phase and it is not necessary to add random noise.
        • WOEEncoder: WOEEncoder uses the Weight of Evidence technique for categorical features. It supports only one kind of target: binary. For polynomial target support, it uses a PolynomialWrapper. It cannot be used for Regression.
        • JamesSteinEncoder: JamesSteinEncoder uses the James-Stein estimator. It supports 2 kinds of targets: binary and continuous. For polynomial target support, it uses PolynomialWrapper. For feature value i, James-Stein estimator returns a weighted average of: The mean target value for the observed feature value i. The mean target value (regardless of the feature value).
      • nrows: default None. You can set the number of rows to read from your datafile if it is too large to fit into either dask or pandas. But you won't have to if you use dask.
      • skip_sulov: default False. You can set the flag to skip the SULOV method if you want.
      • skip_xgboost: default False. You can set the flag to skip the Recursive XGBoost method if you want.

      Output values for old syntax

      This applies only to the old syntax.

      • outputs: Output is always a tuple. We can call the two objects in that tuple out1 and out2 below.
        • out1 and out2: If you sent in just one dataframe or filename as input, you will get:
            1. features: a list of selected features, and
            2. trainm: a dataframe (if you sent in a file or dataname as input).
        • out1 and out2: If you sent in two files or dataframes (train and test), you will get:
            1. trainm: a modified train dataframe with engineered and selected features from dataname, and
            2. testm: a modified test dataframe with engineered and selected features from test_data (see the sketch below).
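
      A short, hedged sketch of unpacking the tuple in both cases (train and test are assumed DataFrames and target an assumed column name):

      # Sketch: the two output cases described above.
      import featurewiz as fwiz

      # Case 1: train only -> (list of selected feature names, transformed train DataFrame)
      features, trainm = fwiz.featurewiz(dataname=train, target=target,
                                         corr_limit=0.70, verbose=0)

      # Case 2: train and test -> (transformed train DataFrame, transformed test DataFrame)
      trainm, testm = fwiz.featurewiz(dataname=train, target=target, test_data=test,
                                      corr_limit=0.70, verbose=0)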

      Additional

      To learn more about how featurewiz works under the hood, watch this video

      Background

      featurewiz was designed for selecting high-performance variables with the fewest steps. In most cases, featurewiz builds models with 20%-99% fewer features than your original data set, with nearly the same or only slightly lower performance (this is based on my trials; your experience may vary).

      featurewiz is every Data Scientist's feature wizard that will:

      1. Automatically pre-process data: you can send in your entire dataframe "as is" and featurewiz will classify and label-encode categorical variables to help XGBoost processing. It automatically classifies variables as numeric, categorical, NLP, or date-time so it can use them correctly for modeling.
      2. Perform feature engineering automatically: creating "interaction" variables, adding "group-by" features, or "target-encoding" categorical variables is difficult, and sifting through those hundreds of new features is painstaking work usually left to "experts". Now, with featurewiz, you can use deep learning to extract features with the click of a mouse. This is very helpful when you have imbalanced classes or thousands of features to deal with. However, be careful with this option: you can very easily spend a lot of time tuning these neural networks.
      3. Perform feature reduction automatically. When you have small data sets and you know your domain well, it is easy to perhaps do EDA and identify which variables are important. But when you have a very large data set with hundreds if not thousands of variables, selecting the best features from your model can mean the difference between a bloated and highly complex model or a simple model with the fewest and most information-rich features. featurewiz uses XGBoost repeatedly to perform feature selection. You must try it on your large data sets and compare!
      4. Explain SULOV method graphically using networkx library so you can see which variables are highly correlated to which ones and which of those have high or low mutual information scores automatically. Just set verbose = 2 to see the graph.
      5. Build a fast XGBoost or LightGBM model using the features selected by featurewiz. There is a function called "simple_lightgbm_model" which you can use to build a fast model. It is a new module, so check it out.

      *** Special thanks to fellow open source Contributors ***:

      1. Alex Lekov (https://github.com/Alex-Lekov/AutoML_Alex/tree/master/automl_alex) for his DataBunch and encoders modules which are used by the tool (although with some modifications).
      2. Category Encoders library in Python : This is an amazing library. Make sure you read all about the encoders that featurewiz uses here: https://contrib.scikit-learn.org/category_encoders/index.html

      Maintainers

      Contributing

      See the contributing file!

      PRs accepted.

      License

      Apache License 2.0 ยฉ 2020 Ram Seshadri

      DISCLAIMER

      This project is not an official Google project. It is not supported by Google and Google specifically disclaims all warranties as to its quality, merchantability, or fitness for a particular purpose.

    featurewiz's People

    Contributors

    a-schot, aaadityag, ai-ahmed, asiaticum, autoviml, boneyag, chinmay7016, davisy, eromoe, gfggithubleet, guglielmocerri, himanshumahto, mishrasamiksha, thefznkhan, you-now-who


    featurewiz's Issues

    Output exceeds the size limit. Open the full output data in a text editor

    Hello, I encountered this error while testing featurewiz. I want to do some automatic feature engineering, so I chose the old way, but unfortunately I got "Output exceeds the size limit. Open the full output data in a text editor".

    Detail:

    • X shape: (128463, 1341), with mixed string, int, float and NaN values.
    • code:
    import featurewiz as FW
    outputs = FW.featurewiz(dataname=X.reset_index(drop=True), target=y.reset_index(drop=True), corr_limit=0.70, verbose=2, sep=',', 
              header=0, test_data='',feature_engg='', category_encoders='',
              dask_xgboost_flag=False, nrows=None)
    
    ---------------------------------------------------------------------------
    KeyError                                  Traceback (most recent call last)
    f:\Work\jupyter_pipeline\pj01\1.1.0 clean_data.ipynb Cell 126 in <cell line: 1>()
          1 if Config.add_feature:
          2     # # Add feature
          3     # from jinshu_model.build_models import HighDimensionFeatureAdder
       (...)
          8     # ce = HighDimensionFeatureAdder(max_gmm_component=4, onehot=False)
          9     # X = ce.fit_transform(X)
         10     import featurewiz as FW
    ---> 11     outputs = FW.featurewiz(dataname=X.reset_index(drop=True), target=y.reset_index(drop=True), corr_limit=0.70, verbose=2, sep=',', 
         12             header=0, test_data='',feature_engg='', category_encoders='',
         13             dask_xgboost_flag=False, nrows=None)
         14 else:
         15     ce = CategoricalEncoder()
    
    File c:\Users\ufo\anaconda3\lib\site-packages\featurewiz\featurewiz.py:793, in featurewiz(dataname, target, corr_limit, verbose, sep, header, test_data, feature_engg, category_encoders, dask_xgboost_flag, nrows, **kwargs)
        791     print('Classifying features using a random sample of %s rows from dataset...' %nrows_limit)
        792     ##### you can use nrows_limit to select a small sample from data set ########################
    --> 793     train_small = EDA_randomly_select_rows_from_dataframe(dataname, targets, nrows_limit, DS_LEN=dataname.shape[0])
        794     features_dict = classify_features(train_small, target)
        795 else:
    
    File c:\Users\ufo\anaconda3\lib\site-packages\featurewiz\featurewiz.py:2977, in EDA_randomly_select_rows_from_dataframe(train_dataframe, targets, nrows_limit, DS_LEN)
       2975     test_size = 0.9
    ...
    -> 5842     raise KeyError(f"None of [{key}] are in the [{axis_name}]")
       5844 not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())
       5845 raise KeyError(f"{not_found} not in index")
    
    KeyError: "None of [Int64Index([0, 0, 0, 0, 1, 1, 0, 0, 0, 1,\n            ...\n            0, 0, 0, 0, 0, 0, 1, 0, 0, 0],\n           dtype='int64', length=128463)] are in the [columns]"
    
    

    Why train model on smaller and smaller set of features recursively

    Hi,

    I have some doubts about the recursive xgboost model process. In the source code, it seems the models were trained on smaller and smaller sets of features that were selected by their column index in order.

       for i in range(0,train_p.shape[1],iter_limit):
            start_time2 = time.time()
            imp_feats = []
            if train_p.shape[1]-i < iter_limit:
                X_train = train_p.iloc[:,i:]
                cols_sel = X_train.columns.tolist()
            else:
                X_train = train_p[list(train_p.columns.values)[i:train_p.shape[1]]]
                cols_sel = X_train.columns.tolist()
    

    Is there a reason to select the subset of features by column order? And why train models on shrinking set of features repeatedly?

    Thank you

    Getting error -> len(important_cats),len(final_list))) TypeError: object of type 'NoneType' has no len()

    Hi,
    I am trying to use the SULOV method to reduce the number of features in my dataset. The data I provide to the function is in a dataframe, all float type; only the target variable is categorical (a problem of classifying healthy and sick subjects).
    I tried giving the data to the function both as a dataframe and as the path to the CSV. The result doesn't change.

    this is the function call:
    outputs = FW.featurewiz(path, "Healthy", corr_limit=0.70, sep=',', verbose=2, dask_xgboost_flag=False, nrows=None)

    The algorithm gets to calculate and reduce the features, but then crashes with this error. What should I do to resolve it?

    line 1386, in featurewiz
    len(important_cats),len(final_list)))
    TypeError: object of type 'NoneType' has no len()

    How to enable GPU support?

    Thanks again for this package! Though I do have an active GPU on the device, it doesn't seem to be detected. Is there some way of enabling GPU acceleration (and would it be useful)?

    No GPU active on this device
        Tuning XGBoost using CPU hyper-parameters. This will take time...
    

    Error installing from source

    I got this issue when trying to install from source on Colab

    Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
    Collecting git+https://github.com/AutoViML/featurewiz.git
      Cloning https://github.com/AutoViML/featurewiz.git to /tmp/pip-req-build-43tno_dg
      Running command git clone -q https://github.com/AutoViML/featurewiz.git /tmp/pip-req-build-43tno_dg
    Requirement already satisfied: ipython in /usr/local/lib/python3.7/dist-packages (from featurewiz==0.1.95) (7.9.0)
    Collecting jupyter
      Downloading jupyter-1.0.0-py2.py3-none-any.whl (2.7 kB)
    Collecting xgboost>=1.5.1
      Downloading xgboost-1.6.1-py3-none-manylinux2014_x86_64.whl (192.9 MB)
         |โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 192.9 MB 74 kB/s 
    Requirement already satisfied: pandas>=1.3.4 in /usr/local/lib/python3.7/dist-packages (from featurewiz==0.1.95) (1.3.5)
    Requirement already satisfied: matplotlib in /usr/local/lib/python3.7/dist-packages (from featurewiz==0.1.95) (3.2.2)
    Requirement already satisfied: seaborn in /usr/local/lib/python3.7/dist-packages (from featurewiz==0.1.95) (0.11.2)
    Collecting scikit-learn~=0.24
      Downloading scikit_learn-0.24.2-cp37-cp37m-manylinux2010_x86_64.whl (22.3 MB)
         |โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 22.3 MB 1.2 MB/s 
    ERROR: Could not find a version that satisfies the requirement networkx>=2.8.1 (from featurewiz) (from versions: 0.34, 0.35, 0.35.1, 0.36, 0.37, 0.99, 1.0rc1, 1.0, 1.0.1, 1.1, 1.2rc1, 1.2, 1.3rc1, 1.3, 1.4rc1, 1.4, 1.5rc1, 1.5, 1.6rc1, 1.6, 1.7rc1, 1.7, 1.8rc1, 1.8, 1.8.1, 1.9rc1, 1.9, 1.9.1, 1.10rc2, 1.10, 1.11rc1, 1.11rc2, 1.11, 2.0, 2.1, 2.2rc1, 2.2, 2.3rc3, 2.3rc4, 2.3, 2.4rc1, 2.4rc2, 2.4, 2.5rc1, 2.5, 2.5.1, 2.6rc1, 2.6rc2, 2.6, 2.6.1, 2.6.2, 2.6.3)
    ERROR: No matching distribution found for networkx>=2.8.1
    

    cannot replicate feature selection result

    Is there a way to keep the feature selection result the same every time I run? I tried to run the function and it gives me a different result each time.

    Original data - 365 variables
    1st run - select 87 variables
    2nd run - select 82 variables

    Getting error: '<' not supported between instances of 'int' and 'str'

    Hi, I am trying to run a dataset which has around 800,000 rows and 42 columns, but I am getting the error given below:

    TypeError Traceback (most recent call last)

    in ()
    43 Add_Poly=0, Stacking_Flag=False,
    44 Imbalanced_Flag=True,
    ---> 45 verbose=1)
    46
    47

    4 frames

    <array_function internals> in unique(*args, **kwargs)

    /usr/local/lib/python3.7/dist-packages/numpy/lib/arraysetops.py in unique1d(ar, return_index, return_inverse, return_counts)
    320 aux = ar[perm]
    321 else:
    --> 322 ar.sort()
    323 aux = ar
    324 mask = np.empty(aux.shape, dtype=np.bool_)

    TypeError: '<' not supported between instances of 'int' and 'str'

    Can anyone please help me figure out which column is causing the problem in my dataset? I am not getting any clue from the error. Please help me resolve it.

    AttributeError: 'int' object has no attribute 'split'

    If I try this:

    spectra.columns = spectra.columns.astype(str)
    features = FeatureWiz(corr_limit=0.70, feature_engg='', category_encoders='', dask_xgboost_flag=False,
    							nrows=None, verbose=2)
    X_train_selected = features.fit_transform(spectra, mask_list)
    selected_features = features.features 

    I get this error message:

    Imported DASK version = 0.1.00. nrows=None uses all rows. Set nrows=1000 to randomly sample fewer rows.
    output = featurewiz(dataname, target, corr_limit=0.70, verbose=2, sep=',', 
    		header=0, test_data='',feature_engg='', category_encoders='',
    		dask_xgboost_flag=False, nrows=None)
    Create new features via 'feature_engg' flag : ['interactions','groupby','target']
    ############################################################################################
    ############       F A S T   F E A T U R E  E N G G    A N D    S E L E C T I O N ! ########
    # Be judicious with featurewiz. Don't use it to create too many un-interpretable features! #
    ############################################################################################
    Skipping feature engineering since no feature_engg input...
    Skipping category encoding since no category encoders specified in input...
    Loading train data...
        Shape of your Data Set loaded: (26717, 788)
        Caution: We will try to reduce the memory usage of dataframe from 80.23 MB
            memory usage after optimization is: 40.16 MB
            decreased by 50.0%
         Loaded. Shape = (26717, 788)
    Traceback (most recent call last):
      File "/snap/pycharm-professional/271/plugins/python/helpers/pydev/pydevd.py", line 1483, in _exec
        pydev_imports.execfile(file, globals, locals)  # execute the script
      File "/snap/pycharm-professional/271/plugins/python/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
        exec(compile(contents+"\n", file, 'exec'), glob, loc)
      File "/home/saskra/PycharmProjects/bmc/bmc5.py", line 121, in <module>
        X_train_selected = features.fit_transform(spectra, mask_list)
      File "/home/saskra/anaconda3/envs/bmc/lib/python3.9/site-packages/sklearn/base.py", line 855, in fit_transform
        return self.fit(X, y, **fit_params).transform(X)
      File "/home/saskra/anaconda3/envs/bmc/lib/python3.9/site-packages/featurewiz/featurewiz.py", line 3553, in fit
        features, X_sel = featurewiz(df, target, self.corr_limit, self.verbose, self.sep, 
      File "/home/saskra/anaconda3/envs/bmc/lib/python3.9/site-packages/featurewiz/featurewiz.py", line 1029, in featurewiz
        dataname = remove_special_chars_in_names(dataname, target, verbose=1)
      File "/home/saskra/anaconda3/envs/bmc/lib/python3.9/site-packages/featurewiz/featurewiz.py", line 3586, in remove_special_chars_in_names
        sel_preds = ["_".join(x.split(" ")) for x in sel_preds]
      File "/home/saskra/anaconda3/envs/bmc/lib/python3.9/site-packages/featurewiz/featurewiz.py", line 3586, in <listcomp>
        sel_preds = ["_".join(x.split(" ")) for x in sel_preds]
    AttributeError: 'int' object has no attribute 'split'
    python-BaseException

    The first line in my code was already a futile attempt to fix the supposed problem because the original column names in the dataframe were floating point numbers. Can anyone help?

    Featurewiz for data with "No Target" Variable

    Hi Team,

    Thank you for creating this great library. I would like to know how I can modify the code to use it for my data, which doesn't have a target or predictor, i.e., unsupervised data where I want to reduce the dimensionality.

    Please let me know,

    Featurewiz key error during fit and transform

    I am working on a ML project to select important features.

    So, I am using the Featurewiz package from documentation here

    I tried the below from github for my data

    from featurewiz import FeatureWiz
    features = FeatureWiz(corr_limit=0.70, feature_engg='', category_encoders='', dask_xgboost_flag=False, nrows=None, verbose=2)
    X_train_selected = features.fit_transform(ord_train_t, y_train)
    X_test_selected = features.transform(ord_test_t) # error is encountered here
    features.features  ### provides the list of selected features ###
    

    Both ord_train_t and ord_test_t contain the same columns.

    But I get a key error message when I try to use transform function after fit.

    KeyError: "['Feat1', 'Feat2', 'Feat3', 'Feat5', 'Feat6', 'Feat7'] not in index"

    But these columns are present in my ord_test_t data.

    Is there anything wrong with the package or documentation?

    or am I using the fit and transform functions incorrectly?

    find the full error below

        C:\Users\abcd\AppData\Local\Temp/ipykernel_11076/432759899.py in <module>
              2 features = FeatureWiz(corr_limit=0.70, feature_engg='', category_encoders='', dask_xgboost_flag=False, nrows=None, verbose=2)
              3 X_train_selected = features.fit(ord_train_t, y_train)
        ----> 4 X_test_selected = features.transform(ord_test_t)
              5 features.features  ### provides the list of selected features ###
        
        ~\Anaconda3\lib\site-packages\featurewiz\featurewiz.py in transform(self, X)
           3562 
           3563     def transform(self, X):
        -> 3564         return X[self.features]
           3565 ###################################################################################################
           3566 import copy
        
        ~\Anaconda3\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
           3462             if is_iterator(key):
           3463                 key = list(key)
        -> 3464             indexer = self.loc._get_listlike_indexer(key, axis=1)[1]
           3465 
           3466         # take() does not accept boolean indexers
        
        ~\Anaconda3\lib\site-packages\pandas\core\indexing.py in _get_listlike_indexer(self, key, axis)
           1312             keyarr, indexer, new_indexer = ax._reindex_non_unique(keyarr)
           1313 
        -> 1314         self._validate_read_indexer(keyarr, indexer, axis)
           1315 
           1316         if needs_i8_conversion(ax.dtype) or isinstance(
        
        ~\Anaconda3\lib\site-packages\pandas\core\indexing.py in _validate_read_indexer(self, key, indexer, axis)
           1375 
           1376             not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())
        -> 1377             raise KeyError(f"{not_found} not in index")
           1378 
           1379 
        
            KeyError: "['Feat1', 'Feat2', 'Feat3', 'Feat5', 'Feat6', 'Feat7'] not in index"
    

    saving transformers?

    Is it possible to save all data transformers used during feature selection in order to apply them to a new dataset?
    If yes, what could be the process and how to reuse them?

    Thanks a lot.
    Your work is amazing, I have to say.

    Got KeyError on date/string columns [featurewiz 0.1.99]

    Hello, after updating to featurewiz 0.1.99, I got a different error.

    Code is

    from featurewiz import FeatureWiz
    features = FeatureWiz(corr_limit=0.70, feature_engg='', category_encoders='', dask_xgboost_flag=False, nrows=None, verbose=2)
    X= features.fit_transform(X, y)
    features.features  ### provides the list of selected features ###
    
    

    traceback:

    KeyError                                  Traceback (most recent call last)
    Input In [71], in <cell line: 1>()
          8 from featurewiz import FeatureWiz
          9 features = FeatureWiz(corr_limit=0.70, feature_engg='', category_encoders='', dask_xgboost_flag=False, nrows=None, verbose=2)
    ---> 10 X = features.fit_transform(X, y)
         11 cols = features.features  ### provides the list of selected features ###
         12 print(features.features)
    
    File ~\anaconda3\lib\site-packages\sklearn\base.py:870, in TransformerMixin.fit_transform(self, X, y, **fit_params)
        867     return self.fit(X, **fit_params).transform(X)
        868 else:
        869     # fit method of arity 2 (supervised transformation)
    --> 870     return self.fit(X, y, **fit_params).transform(X)
    
    File ~\anaconda3\lib\site-packages\featurewiz\featurewiz.py:2934, in FeatureWiz.fit(self, X, y)
       2931     return {}, {}
       2932 #### Send target variable as it is so that y_train is analyzed properly ###
       2933 # Select features using featurewiz
    -> 2934 features, X_sel = featurewiz(df, target, self.corr_limit, self.verbose, self.sep,
       2935         self.header, self.test_data, self.feature_engg, self.category_encoders,
       2936         self.dask_xgboost_flag, self.nrows)
       2937 # Convert the remaining column names back to integers and drop the
       2938 difftime = max(1, int(time.time()-start_time))
    
    File ~\anaconda3\lib\site-packages\featurewiz\featurewiz.py:1101, in featurewiz(dataname, target, corr_limit, verbose, sep, header, test_data, feature_engg, category_encoders, dask_xgboost_flag, nrows, **kwargs)
       1099     print('Since %s category encoding is done, dropping original categorical vars from predictors...' %feature_gen)
       1100     preds = left_subtract(preds, catvars)
    -> 1101 train_p = train[preds]
       1102 if train_p.shape[1] <= 10:
       1103     iter_limit = 2
    
    File ~\anaconda3\lib\site-packages\pandas\core\frame.py:3511, in DataFrame.__getitem__(self, key)
       3509     if is_iterator(key):
       3510         key = list(key)
    -> 3511     indexer = self.columns._get_indexer_strict(key, "columns")[1]
       3513 # take() does not accept boolean indexers
       3514 if getattr(indexer, "dtype", None) == bool:
    
    File ~\anaconda3\lib\site-packages\pandas\core\indexes\base.py:5782, in Index._get_indexer_strict(self, key, axis_name)
       5779 else:
       5780     keyarr, indexer, new_indexer = self._reindex_non_unique(keyarr)
    -> 5782 self._raise_if_missing(keyarr, indexer, axis_name)
       5784 keyarr = self.take(indexer)
       5785 if isinstance(key, Index):
       5786     # GH 42790 - Preserve name from an Index
    
    File ~\anaconda3\lib\site-packages\pandas\core\indexes\base.py:5845, in Index._raise_if_missing(self, key, indexer, axis_name)
       5842     raise KeyError(f"None of [{key}] are in the [{axis_name}]")
       5844 not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())
    -> 5845 raise KeyError(f"{not_found} not in index")
    
    KeyError: "['network_type__first', 'device_model__first', 'ad_account__first', 'os_version__first', 'carrier__first', 'reg_week_day', 'os__first', 'hour__first', 'ad_source__first', 'ad_serving_user_group__first', 'firstecpm__first', 'province__first', 'manufacturer__first'] not in index"
    

    The columns in the error are of date/string type.

    Not able to replicate results - seed not set for random

    This issue was raised previously and was said to have been addressed but I am still getting inconsistent results.

    I checked the source code. Seeds are provided for numpy's and others' random number generators but not for package random.

    Can this please be fixed ASAP? Thank you so much.

    TypeError: gen_cat_encodet_features() got an unexpected keyword argument 'fitted'

    Hi,
    I love featurewiz!! I got it to work using:

    #outputs = featurewiz(df99, target='FSXRNE', corr_limit=0.70, verbose=2,
    #header=0, test_data='',feature_engg='interactions')

    and it worked really well! However, when I use:

    outputs = featurewiz(df99, target='FSXRNE', corr_limit=0.70, verbose=2,header=0, category_encoders='OneHotEncoder')

    I get:

    TypeError: gen_cat_encodet_features() got an unexpected keyword argument 'fitted'

    Any ideas on what is going wrong? Thank you!

    Sincerely,

    tom

    UnboundLocalError: local variable 'date_cols' referenced before assignment

    UnboundLocalError                         Traceback (most recent call last)
    /tmp/ipykernel_271453/4235074023.py in <module>
    ----> 1 X_train_selected = features.fit_transform(X_train, train_df['ground_truth_corrected'])
    
    ~/anaconda3/envs/XXX/lib/python3.8/site-packages/sklearn/base.py in fit_transform(self, X, y, **fit_params)
        853         else:
        854             # fit method of arity 2 (supervised transformation)
    --> 855             return self.fit(X, y, **fit_params).transform(X)
        856 
        857 
    
    ~/anaconda3/envs/XXX/lib/python3.8/site-packages/featurewiz/featurewiz.py in fit(self, X, y)
       3613         #### Send target variable as it is so that y_train is analyzed properly ###
       3614         # Select features using featurewiz
    -> 3615         features, X_sel = featurewiz(df, target, self.corr_limit, self.verbose, self.sep, 
       3616                 self.header, self.test_data, self.feature_engg, self.category_encoders,
       3617                 self.dask_xgboost_flag, self.nrows)
    
    ~/anaconda3/envs/XXX/lib/python3.8/site-packages/featurewiz/featurewiz.py in featurewiz(dataname, target, corr_limit, verbose, sep, header, test_data, feature_engg, category_encoders, dask_xgboost_flag, nrows, **kwargs)
       1720             print('    Could not revert column names to original. Try replacing them manually.')
       1721         print(f'Returning list of {len(important_features)} important features and a dataframe.')
    -> 1722         if len(date_cols) > 0:
       1723             date_replacer = date_col_mappers.get  # For faster gets.
       1724             important_features1 = [date_replacer(n, n) for n in important_features2]
    
    UnboundLocalError: local variable 'date_cols' referenced before assignment
    

    I use featurewiz version 0.1.06.

    Also I have no date columns, only int and float.

    XGB crashes

    I have a matrix of shape (191, 758) and it runs into an error with featurewiz.

    !Current number of predictors = 569 
        Finding Important Features using Boosted Trees algorithm...
            using 569 variables...
    Finding top features using XGB is crashing. Continuing with all predictors...!
    

    Help with feature_engineering and feature selection

    As @AutoViML said, the soul of featurewiz is that it is built to solve two problems:

    1. Feature Engineering
    2. Feature Selection
      As per the instructions given in my last issue, @AutoViML guided me with code to build the features, using the snippet described below:

    trainm, testm = FW.featurewiz(dataname=train, target=target, corr_limit=0.70, verbose=2, sep=',', header=0, test_data=test, feature_engg='', category_encoders='',dask_xgboost_flag=False, nrows=None)

    This snippet seems to work, but it is not producing any feature-engineered features (new features from existing features) using the parameters given to "feature_engg"; it is just performing feature selection and returning two data frames, trainm and testm, with the existing features. Can @AutoViML help me with my doubts by giving solutions and straightforward snippets for feature selection and feature engineering (developing new features from existing ones)?

    I am thanking you in Advance!

    UnboundLocalError: local variable 'params' referenced before assignment

    /opt/conda/lib/python3.7/site-packages/featurewiz/featurewiz.py in featurewiz(dataname, target, corr_limit, verbose, sep, header, test_data, feature_engg, category_encoders, dask_xgboost_flag, nrows, **kwargs)
    1130 param['nthread'] = -1
    1131 param['tree_method'] = 'gpu_hist'
    -> 1132 params['eta'] = 0.01
    1133 params['subsample'] = 0.5
    1134 params['grow_policy'] = 'depthwise' # 'lossguide' #

    Is params getting set here instead of param? https://github.com/AutoViML/featurewiz/blob/main/featurewiz/featurewiz.py

    On transforming after fit_transform

    Hi, it's me again :)
    I tried your new feature of compatibility with scikit-learn, following the suggested code, where I found the line that transforms X_test:

    from featurewiz import FeatureWiz
    features = FeatureWiz(corr_limit=0.70, feature_engg='', category_encoders='', 
    dask_xgboost_flag=False, nrows=None, verbose=2)
    X_train_selected = features.fit_transform(X_train, y_train)
    ####################################  THIS LINE I'M TALKING ABOUT
    X_test_selected = features.transform(X_test)
    ############################################################
    features.features  ### provides the list of selected features ###

    But I found that what features.transform(X_test) does is "only" filtering X_test by the selected features since this is the code inside FeatureWiz class:

    def transform(self, X):
            return X[self.features]

    Specifically, what I'm trying to do is:

    1. to use features (a FeatureWiz object type) to get the completely transformed and filtered dataset according to the selected features
    2. save the features object
    3. to train a X model using the dataset found in step (1)
    4. to load the features object saved in step (2)
    5. in real life to receive an input and transform it using the loaded features object in step (4)
    6. to feed my model X with the transformed input in step (5)

    I just don't know how to complete step (5), since any input is just filtered and not transformed.
    I wonder whether I should instead use the transform function of the My_Groupby_Encoder class. If that is true, how could I do that?

    Thank you so much for your attention to my question
    I don't hesitate to say that your work is simply wonderful and useful

    Getting memory error while memory is free

    I run featurewiz on Google Colab and get an error about memory, but there seems to be a lot of free memory.

    - Free memory: 11918835712
    - Requested memory: 25696
    

    Full output:

    ############################################################################################
    ############       F A S T   F E A T U R E  E N G G    A N D    S E L E C T I O N ! ########
    # Be judicious with featurewiz. Don't use it to create too many un-interpretable features! #
    ############################################################################################
    Skipping feature engineering since no feature_engg input...
    Skipping category encoding since no category encoders specified in input...
    **INFO: featurewiz can now read feather formatted files. Loading train data...
        Shape of your Data Set loaded: (6424, 12784)
        Caution: We will try to reduce the memory usage of dataframe from 626.61 MB
            memory usage after optimization is: 121.60 MB
            decreased by 80.6%
        Loaded train data. Shape = (6424, 12784)
    loading the entire test dataframe - there is no nrows limit applicable #########
        Shape of your Data Set loaded: (1134, 12784)
        Loaded test data. Shape = (1134, 12784)
    #######################################################################################
    ######################## C L A S S I F Y I N G  V A R I A B L E S  ####################
    #######################################################################################
    Classifying variables in data set...
        12783 Predictors classified...
            4 variable(s) to be removed since ID or low-information variables
        	variables removed = ['_12544', '_12560', '_13520', '_13568']
    train data shape before dropping 4 columns = (6424, 12784)
    	train data shape after dropping columns = (6424, 12780)
        Converted pandas dataframe into a Dask dataframe ...
        Converted pandas dataframe into a Dask dataframe ...
    GPU active on this device
        Tuning XGBoost using GPU hyper-parameters. This will take time...
        After removing redundant variables from further processing, features left = 12779
    No interactions created for categorical vars since feature engg does not specify it
    #### Single_Label Multi_Classification problem ####
        Skipping SULOV method since data dimension 82 m > 50 m. Continuing ...
    Time taken for SULOV method = 0 seconds
        Adding 0 categorical variables to reduced numeric variables  of 12779
    Final list of selected vars after SULOV = 12779
    Readying dataset for Recursive XGBoost by converting all features to numeric...
    #######################################################################################
    #####    R E C U R S I V E   X G B O O S T : F E A T U R E   S E L E C T I O N  #######
    #######################################################################################
        using regular XGBoost
    Train and Test loaded into Dask dataframes successfully after feature_engg completed
    Current number of predictors = 12779 
        XGBoost version: 1.6.0
    Number of booster rounds = 100
            using 12779 variables...
    Regular XGBoost is crashing due to: [14:14:47] ../src/c_api/../data/../common/device_helpers.cuh:428: Memory allocation error on worker 0: [14:14:47] ../src/c_api/../data/../common/common.h:46: ../src/common/device_helpers.cuh: 447: cudaErrorNoKernelImageForDevice: no kernel image is available for execution on the device
    Stack trace:
      [bt] (0) /usr/local/lib/python3.7/dist-packages/xgboost/lib/libxgboost.so(+0x38f399) [0x7fab1fcf6399]
      [bt] (1) /usr/local/lib/python3.7/dist-packages/xgboost/lib/libxgboost.so(+0x393333) [0x7fab1fcfa333]
      [bt] (2) /usr/local/lib/python3.7/dist-packages/xgboost/lib/libxgboost.so(+0x3d340e) [0x7fab1fd3a40e]
      [bt] (3) /usr/local/lib/python3.7/dist-packages/xgboost/lib/libxgboost.so(+0x3e7374) [0x7fab1fd4e374]
      [bt] (4) /usr/local/lib/python3.7/dist-packages/xgboost/lib/libxgboost.so(+0x3e91f0) [0x7fab1fd501f0]
      [bt] (5) /usr/local/lib/python3.7/dist-packages/xgboost/lib/libxgboost.so(+0x582179) [0x7fab1fee9179]
      [bt] (6) /usr/local/lib/python3.7/dist-packages/xgboost/lib/libxgboost.so(+0x20fb08) [0x7fab1fb76b08]
      [bt] (7) /usr/local/lib/python3.7/dist-packages/xgboost/lib/libxgboost.so(XGBoosterUpdateOneIter+0x68) [0x7fab1fa10758]
      [bt] (8) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7fab59f73dae]
    
    
    - Free memory: 11918835712
    - Requested memory: 25696
    
    Stack trace:
      [bt] (0) /usr/local/lib/python3.7/dist-packages/xgboost/lib/libxgboost.so(+0x38f399) [0x7fab1fcf6399]
      [bt] (1) /usr/local/lib/python3.7/dist-packages/xgboost/lib/libxgboost.so(+0x3937ab) [0x7fab1fcfa7ab]
      [bt] (2) /usr/local/lib/python3.7/dist-packages/xgboost/lib/libxgboost.so(+0x3d3549) [0x7fab1fd3a549]
      [bt] (3) /usr/local/lib/python3.7/dist-packages/xgboost/lib/libxgboost.so(+0x3e7374) [0x7fab1fd4e374]
      [bt] (4) /usr/local/lib/python3.7/dist-packages/xgboost/lib/libxgboost.so(+0x3e91f0) [0x7fab1fd501f0]
      [bt] (5) /usr/local/lib/python3.7/dist-packages/xgboost/lib/libxgboost.so(+0x582179) [0x7fab1fee9179]
      [bt] (6) /usr/local/lib/python3.7/dist-packages/xgboost/lib/libxgboost.so(+0x20fb08) [0x7fab1fb76b08]
      [bt] (7) /usr/local/lib/python3.7/dist-packages/xgboost/lib/libxgboost.so(XGBoosterUpdateOneIter+0x68) [0x7fab1fa10758]
      [bt] (8) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7fab59f73dae]
    
    
    [14:14:47] ../src/c_api/../data/../common/device_helpers.cuh:428: Memory allocation error on worker 0: [14:14:47] ../src/c_api/../data/../common/common.h:46: ../src/common/device_helpers.cuh: 447: cudaErrorNoKernelImageForDevice: no kernel image is available for execution on the device
    Stack trace:
      [bt] (0) /usr/local/lib/python3.7/dist-packages/xgboost/lib/libxgboost.so(+0x38f399) [0x7fab1fcf6399]
      [bt] (1) /usr/local/lib/python3.7/dist-packages/xgboost/lib/libxgboost.so(+0x393333) [0x7fab1fcfa333]
      [bt] (2) /usr/local/lib/python3.7/dist-packages/xgboost/lib/libxgboost.so(+0x3d340e) [0x7fab1fd3a40e]
      [bt] (3) /usr/local/lib/python3.7/dist-packages/xgboost/lib/libxgboost.so(+0x3e7374) [0x7fab1fd4e374]
      [bt] (4) /usr/local/lib/python3.7/dist-packages/xgboost/lib/libxgboost.so(+0x3e91f0) [0x7fab1fd501f0]
      [bt] (5) /usr/local/lib/python3.7/dist-packages/xgboost/lib/libxgboost.so(+0x582179) [0x7fab1fee9179]
      [bt] (6) /usr/local/lib/python3.7/dist-packages/xgboost/lib/libxgboost.so(+0x20fb08) [0x7fab1fb76b08]
      [bt] (7) /usr/local/lib/python3.7/dist-packages/xgboost/lib/libxgboost.so(XGBoosterUpdateOneIter+0x68) [0x7fab1fa10758]
      [bt] (8) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7fab59f73dae]
    
    
    - Free memory: 11918835712
    - Requested memory: 25696
    
    Stack trace:
      [bt] (0) /usr/local/lib/python3.7/dist-packages/xgboost/lib/libxgboost.so(+0x38f399) [0x7fab1fcf6399]
      [bt] (1) /usr/local/lib/python3.7/dist-packages/xgboost/lib/libxgboost.so(+0x3937ab) [0x7fab1fcfa7ab]
      [bt] (2) /usr/local/lib/python3.7/dist-packages/xgboost/lib/libxgboost.so(+0x3d3549) [0x7fab1fd3a549]
      [bt] (3) /usr/local/lib/python3.7/dist-packages/xgboost/lib/libxgboost.so(+0x3e7374) [0x7fab1fd4e374]
      [bt] (4) /usr/local/lib/python3.7/dist-packages/xgboost/lib/libxgboost.so(+0x3e91f0) [0x7fab1fd501f0]
      [bt] (5) /usr/local/lib/python3.7/dist-packages/xgboost/lib/libxgboost.so(+0x582179) [0x7fab1fee9179]
      [bt] (6) /usr/local/lib/python3.7/dist-packages/xgboost/lib/libxgboost.so(+0x20fb08) [0x7fab1fb76b08]
      [bt] (7) /usr/local/lib/python3.7/dist-packages/xgboost/lib/libxgboost.so(XGBoosterUpdateOneIter+0x68) [0x7fab1fa10758]
      [bt] (8) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7fab59f73dae]
    

    Verbose

    Fantastic and beautiful package.

    Is it possible to run this package on a list of dataframes without having to close the graphs manually, even when I set verbose to 0?

    [FEATURE REQUEST] - Stop feature selection after the SULOV step

    Hello,

    I was wondering if there could be a way to stop the automated feature selection right after the SULOV step.

    I'm trying to get only the output of removing low-variance and correlated features. Is this possible? I don't want to run the recursive XGBoost feature selection part.

    Would it be acceptable to change the code in my conda environment and add a flag that returns the final_list variable after SULOV runs? (A possible workaround is sketched below.)
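
    An unofficial workaround sketch, in case editing the package is undesirable: the SULOV routine can be called directly, skipping the recursive XGBoost stage entirely. The function below is an internal featurewiz helper (its name and argument order are taken from the tracebacks quoted elsewhere on this page), so treat this as a sketch that may break between versions.

    # Rough sketch (unofficial): run only the SULOV step and skip recursive XGBoost.
    # FE_remove_variables_using_SULOV_method is an internal helper; the argument order
    # follows the tracebacks quoted on this page and may differ between versions.
    from featurewiz.featurewiz import FE_remove_variables_using_SULOV_method

    target = 'label'                                   # assumed target column name
    numvars = [c for c in train.select_dtypes('number').columns if c != target]
    final_list = FE_remove_variables_using_SULOV_method(
        train, numvars, 'Regression', target, 0.70, 1)  # modeltype string is assumed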

    [QUESTION] Un-transform encoded categorical values and change the problem type

    Hello, I'm testing featurewiz with a dataframe of numerical and categorical variables, and a target variable that ranges from 0 to 55, with most values between 0 and 6.

    My first question comes to the fact that when I run:

    outputs = FW.featurewiz(train_df, target='unique_offers_cut', feature_engg='', category_encoders='OneHotEncoder', dask_xgboost_flag=False, nrows=None, verbose=2)
    

    Everything runs fine, but the final output is like this:

    ['OneHotEncoder_property_type_1',
     'OneHotEncoder_property_type_6',
     'OneHotEncoder_itv_region_10',
     'OneHotEncoder_itv_region_5',
     'OneHotEncoder_itv_region_8',
     'OneHotEncoder_listing_pricetype_12',
     'OneHotEncoder_property_type_3',
     'first_listed_price',
     'OneHotEncoder_property_type_4',...
    

    Is there any chance of knowing what property_type_1 is, or at least of transforming it back to its original category name?

    On the other hand, is there any way to override the detected problem type? I want to treat this as a regression problem, but featurewiz assumes the target is multi-class classification (and the XGBoost part then fails).

    Thanks
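
    One possible workaround for recovering readable dummy-column names, sketched below: encode the categoricals yourself with the category_encoders package (the same package featurewiz's encoders come from) and pass use_cat_names=True so each one-hot column carries the original category value, then hand the already-numeric frame to featurewiz. The column names in cat_cols are inferred from the output above; this is not a featurewiz option.

    # Sketch using category_encoders directly (not a featurewiz flag): use_cat_names=True
    # keeps the original category value in each one-hot column name,
    # e.g. property_type_flat instead of property_type_1.
    import category_encoders as ce

    cat_cols = ['property_type', 'itv_region', 'listing_pricetype']   # inferred from the output above
    enc = ce.OneHotEncoder(cols=cat_cols, use_cat_names=True)
    X_encoded = enc.fit_transform(train_df.drop(columns=['unique_offers_cut']),
                                  train_df['unique_offers_cut'])
    X_encoded['unique_offers_cut'] = train_df['unique_offers_cut'].values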

    Dask XGBoost is crashing. Continuing...

    Hey, I am getting error as in the title, namely:

    Dask XGBoost is crashing. Continuing...

    I am trying to run featurewiz on data with 220 features and 63,190 rows (which is already a reduced size) and I get the above error. When I try to run it on 63190 x 10 (i.e. 10x the amount of data), I never get any results: it either gets stuck or takes so long that I gave up waiting. I will try running it overnight or over multiple days to see if it produces anything, but I doubt more data will work if less data does not.

    dask_xgboost_error: 'Series' object has no attribute 'compute'

    Encountering the below error (environment WSL:Ubuntu) when trying to run with dask_xgboost_flag enabled

    ~/.local/lib/python3.8/site-packages/featurewiz/featurewiz.py in featurewiz(dataname, target, corr_limit, verbose, sep, header, test_data, feature_engg, category_encoders, dask_xgboost_flag, nrows, **kwargs)
       1340             if dask_xgboost_flag:
       1341                 ### since y_train is dask df and data_tuple.X_train is a pandas df, you can't merge them.
    -> 1342                 y_test = y_test.compute()  ### remember you first have to convert them to a pandas df
       1343             data2 = data_tuple.X_test.join(y_test)
       1344             dataname = data1.append(data2)
    
    ~/.local/lib/python3.8/site-packages/pandas/core/generic.py in __getattr__(self, name)
       5485         ):
       5486             return self[name]
    -> 5487         return object.__getattribute__(self, name)
       5488 
       5489     def __setattr__(self, name: str, value) -> None:
    
    AttributeError: 'Series' object has no attribute 'compute'
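
    A defensive guard along these lines (a sketch, not the project's actual fix) would avoid the crash, since .compute() exists only on Dask collections:

    # Sketch: convert only when the object really is a Dask collection; leave pandas objects alone.
    import dask.dataframe as dd

    if isinstance(y_test, (dd.Series, dd.DataFrame)):
        y_test = y_test.compute()   # now a pandas Series/DataFrame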
    

    Sample weight support for regression problems

    Hello - I just saw this library written up on Medium and it looks very interesting. I wanted to ask about the possibility of adding sample weight support. XGBoost already supports this via the weight parameter in the .fit() call, so I'm not sure what would be needed beyond updating the API to let a user pass sample weights.

    Thanks!
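
    For reference, the underlying XGBoost scikit-learn wrapper already accepts per-row weights, so a pass-through would presumably just forward them. A minimal sketch of that call (X_train, y_train and weights are placeholder names; featurewiz itself does not currently expose this):

    # XGBoost's sklearn wrapper takes per-row weights via the sample_weight argument of fit().
    from xgboost import XGBRegressor

    model = XGBRegressor(n_estimators=100)
    model.fit(X_train, y_train, sample_weight=weights)   # weights: array-like, one value per row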

    Question on nlp columns

    Just a question:
    When is a column in a dataset considered an NLP column, and when a categorical one?

    I found this condition in your code:

    def classify_columns(df_preds, verbose=0):
    ...
    
    if train[col].map(lambda x: len(x) if type(x) == str else 0).mean() >= max_nlp_char_size \
            and len(train[col].value_counts()) <= int(0.9 * len(train)) \
            and col not in string_bool_vars:
        var_df.loc[var_df['index'] == col, 'nlp_strings'] = 1
    
    

    I wonder if it should instead be:

    >= int(0.9*len(train))

    Thanks,
    Cheers
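
    For reference, here is a toy illustration of the condition exactly as quoted, with max_nlp_char_size assumed to be 30: a column is flagged as NLP when its average string length is at least that threshold and its number of unique values is at most 90% of the rows.

    # Toy illustration of the quoted rule, as written (max_nlp_char_size = 30 is assumed).
    import pandas as pd

    train = pd.DataFrame({'review': ['this product arrived late but works fine overall'] * 5 +
                                    ['terrible quality, would not buy this again'] * 5})
    col, max_nlp_char_size = 'review', 30
    mean_len = train[col].map(lambda x: len(x) if type(x) == str else 0).mean()
    few_uniques = len(train[col].value_counts()) <= int(0.9 * len(train))
    print(mean_len >= max_nlp_char_size and few_uniques)   # True -> treated as an NLP column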

    cannot convert float NaN to integer

    I checked for nulls:
    df_select.isnull().values.any()
    False
    I also tried
    df_select = df_select.dropna()
    and then:
    target = 'target'
    features, train = featurewiz(X, target, corr_limit=0.7, verbose=2, sep=",", header=1, test_data="", feature_engg="", category_encoders="")

    ValueError Traceback (most recent call last)
    in
    2 target = 'target'
    3
    ----> 4 features, train = featurewiz(X, target, corr_limit=0.7, verbose=2, sep=",", header=0, test_data="", feature_engg="", category_encoders="")

    ~\Anaconda3\lib\site-packages\featurewiz\featurewiz.py in featurewiz(dataname, target, corr_limit, verbose, sep, header, test_data, feature_engg, category_encoders, **kwargs)
    1269 if len(numvars) > 1:
    1270 final_list = FE_remove_variables_using_SULOV_method(train,numvars,settings.modeltype,target,
    -> 1271 corr_limit,verbose)
    1272 else:
    1273 final_list = copy.deepcopy(numvars)

    ~\Anaconda3\lib\site-packages\featurewiz\featurewiz.py in FE_remove_variables_using_SULOV_method(df, numvars, modeltype, target, corr_limit, verbose)
    605 corr_values = correlation_dataframe.values
    606 col_index = correlation_dataframe.columns.tolist()
    --> 607 index_triupper = list(zip(np.triu_indices_from(corr_values,k=1)[0],np.triu_indices_from(
    608 corr_values,k=1)[1]))
    609 high_corr_index_list = [x for x in np.argwhere(abs(corr_values[np.triu_indices(len(corr_values), k = 1)])>=corr_limit)]

    <array_function internals> in triu_indices_from(*args, **kwargs)

    ~\Anaconda3\lib\site-packages\dask\array\core.py in array_function(self, func, types, args, kwargs)
    1530 if da_func is func:
    1531 return handle_nonmatching_names(func, args, kwargs)
    -> 1532 return da_func(*args, **kwargs)
    1533
    1534 @property

    ~\Anaconda3\lib\site-packages\dask\array\routines.py in triu_indices_from(arr, k)
    1741 if arr.ndim != 2:
    1742 raise ValueError("input array must be 2-d")
    -> 1743 return triu_indices(arr.shape[-2], k=k, m=arr.shape[-1], chunks=arr.chunks)

    ~\Anaconda3\lib\site-packages\dask\array\routines.py in triu_indices(n, k, m, chunks)
    1734 @derived_from(np)
    1735 def triu_indices(n, k=0, m=None, chunks="auto"):
    -> 1736 return nonzero(~tri(n, m, k=k - 1, dtype=bool, chunks=chunks))
    1737
    1738

    ~\Anaconda3\lib\site-packages\dask\array\creation.py in tri(N, M, k, dtype, chunks)
    687
    688 m = greater_equal.outer(
    --> 689 arange(N, chunks=chunks[0][0], dtype=_min_int(0, N)),
    690 arange(-k, M - k, chunks=chunks[1][0], dtype=_min_int(-k, M - k)),
    691 )

    ~\Anaconda3\lib\site-packages\dask\array\creation.py in arange(*args, **kwargs)
    377 chunks = kwargs.pop("chunks", "auto")
    378
    --> 379 num = int(max(np.ceil((stop - start) / step), 0))
    380
    381 dtype = kwargs.pop("dtype", None)

    ValueError: cannot convert float NaN to integer

    Prevent shuffling of data throughout featurewiz

    Is it possible to prevent data shuffling throughout the featurewiz process?

    My data has a temporal component (time series effectively) and shuffling doesn't make sense.

    Perhaps there is a parameter already I can use to disable shuffling?

    Thanks for a great library.

    method do transform only data, after fitted

    Hi!
    Thanks for writing this package, it looks very interesting. I saw the article on Medium.

    I am working with a time series dataset.

    I can run featurewiz on the existing time series data, but once a new observation arrives I don't want to retrain and generate new features. I want to reuse the same features that were identified as relevant and simply transform the raw data into those features.

    Maybe you could have methods like sklearn: fit(), fit_transform(), transform().

    You could write it as a class:
    features = FeatureWiz(corr_limit=0.70, verbose=2, feature_engg=["interactions", "groupby", "target"])
    output_train = features.fit_transform(train, target)
    output_test = features.transform(test)
    relevant_features = features.get_feat_list()

    Also, maybe you could have a method that returns feature importances, such as MI scores or permutation importance on the test dataset.

    The plot is very nice, but when featurewiz runs as a background process it should not pop up; there could be an option to switch it off, or it could be exposed as a method such as features.make_plot().
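
    For readers finding this later: newer featurewiz releases expose a scikit-learn-style estimator very close to this proposal (a snippet from version 0.1.87 appears further down this page). A sketch of that usage, with constructor arguments that may differ by version:

    # Sketch of the estimator-style API available in later releases (see the 0.1.87 snippet below);
    # exact constructor arguments may differ by version.
    from featurewiz import FeatureWiz

    wiz = FeatureWiz(corr_limit=0.70, feature_engg='', verbose=1)
    X_train_selected = wiz.fit_transform(X_train, y_train)   # learns the selected features
    X_test_selected = wiz.transform(X_test)                  # reuses them on new data
    selected = wiz.features                                  # list of selected feature names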

    Why does the returned test set have no target, and how do I use the model to predict?

    Hello,
    I have three questions:
    1: When I use train, test = FW.featurewiz(), I find that the returned test set does not contain the encoded target, but I need it to recalculate balanced_accuracy_score, so I added the target to the return value. Is this correct?
    image

    2: How do I use the fitted model to predict? I take outputs[-1] as the fitted model, but its predictions are all 1, which does not match outputs[0]. This is my code, am I using it wrong? By the way, can you provide some examples of using simple_LightGBM_model, simple_XGBoost_model, etc.?
    image

    3: When I use the fitted model to predict on raw data, how can I get the transformer for the raw data?

    Hope to get your reply, thanks!

    ValueError: Columns must be same length as key when `len(date_cols) > 0`

    This error is from pandas, thrown at:
    image

    This happened when len(date_cols) > 0.

    I found that important_features is not equal to old_important_features:

    In [4]: 'reg_date_hour' in important_features
    Out[4]: True
    
    In [5]: 'reg_date_hour' in old_important_features
    Out[5]: False
    
    In [6]: len(date_cols)
    Out[6]: 2
    
    In [14]: len(important_features) == len(old_important_features)
    Out[14]: False
    
    In [15]: len(important_features)
    Out[15]: 460
    
    In [16]: len(old_important_features)
    Out[16]: 471
    
    
    

    Requirements.txt version mistake?

    Should the requirements for some libraries be >= as opposed to ~=?

    featurewiz 0.1.991 requires Pillow~=9.0.0, but you have pillow 9.2.0 which is incompatible.
    featurewiz 0.1.991 requires scikit-learn~=0.24, but you have scikit-learn 1.1.2 which is incompatible.

    For scikit-learn, 0.24 is very old, and the pin prevents more recent versions of scikit-learn from being used.

    Am I misunderstanding?
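
    For context, the two specifiers behave quite differently under pip (PEP 440): ~= pins to a compatible release, while >= only sets a floor. A requirements-style illustration:

    # Pillow~=9.0.0       means >=9.0.0, <9.1.0  (compatible-release pin)
    # scikit-learn~=0.24  means >=0.24,  <1.0    (blocks scikit-learn 1.x)
    # Relaxing the pins to a floor would look like:
    Pillow>=9.0.0
    scikit-learn>=0.24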

    verbose=0 ?

    verbose=0 is not silent?
    How can I get featurewiz to work without outputting a SULOV seaborn plot?
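
    One generic workaround (not a featurewiz option) is to switch matplotlib to a non-interactive backend before featurewiz is imported, so the SULOV plot is rendered off-screen instead of being displayed. A sketch, with 'target' as a placeholder column name:

    # Sketch: select a non-interactive backend before importing featurewiz so that
    # seaborn/matplotlib plots are never shown on screen.
    import matplotlib
    matplotlib.use('Agg')          # must run before pyplot is first imported

    import featurewiz as FW
    outputs = FW.featurewiz(train, target='target', verbose=0)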

    Error when running with rows greater than 9999

    I am running into an error when the number of rows in the dataframe to be feature-reduced is greater than 9999. The stack trace is shown below:

    outputs = featurewiz(dataset.iloc[:10500,:], collist, corr_limit=0.93, verbose=1, dask_xgboost_flag=False)

    Skipping feature engineering since no feature_engg input...
    Skipping category encoding since no category encoders specified in input...
    Loading train data...
    Shape of your Data Set loaded: (10500, 2880)
    Loading test data...
    No file given. Continuing...
    Classifying features using 10000 rows...
    loading a random sample of 10000 rows into pandas for EDA


    ValueError Traceback (most recent call last)
    /tmp/ipykernel_17760/2016861536.py in
    ----> 1 outputs = featurewiz(dataset.iloc[:10500,:], collist, corr_limit=0.93, verbose=1, dask_xgboost_flag=False)

    ~/miniconda3/envs/CS280/lib/python3.7/site-packages/featurewiz/featurewiz.py in featurewiz(dataname, target, corr_limit, verbose, sep, header, test_data, feature_engg, category_encoders, dask_xgboost_flag, nrows, **kwargs)
    1082 targets = copy.deepcopy(target)
    1083 ##### you can use
    -> 1084 train_small = select_rows_from_dataframe(dataname, targets, nrows_limit, DS_LEN=dataname.shape[0])
    1085 features_dict = classify_features(train_small, target)
    1086 else:

    /miniconda3/envs/CS280/lib/python3.7/site-packages/featurewiz/featurewiz.py in select_rows_from_dataframe(train_dataframe, targets, nrows_limit, DS_LEN)
    3986 list_of_few_classes = train_dataframe[each_target].value_counts()[train_dataframe[each_target].value_counts()<=10].index.tolist()
    3987 train_small = train_dataframe.loc[
    (train_dataframe[each_target].isin(list_of_few_classes))]
    -> 3988 train_small, _ = train_test_split(train_dataframe, test_size=test_size, stratify=train_dataframe[targets])
    3989 else:
    3990 ### For Regression problems: load a small sample of data into a pandas dataframe ##

    ~/miniconda3/envs/CS280/lib/python3.7/site-packages/sklearn/model_selection/_split.py in train_test_split(test_size, train_size, random_state, shuffle, stratify, *arrays)
    2439 cv = CVClass(test_size=n_test, train_size=n_train, random_state=random_state)
    2440
    -> 2441 train, test = next(cv.split(X=arrays[0], y=stratify))
    2442
    2443 return list(

    ~/miniconda3/envs/CS280/lib/python3.7/site-packages/sklearn/model_selection/_split.py in split(self, X, y, groups)
    1598 """
    1599 X, y, groups = indexable(X, y, groups)
    -> 1600 for train, test in self._iter_indices(X, y, groups):
    1601 yield train, test
    1602

    ~/miniconda3/envs/CS280/lib/python3.7/site-packages/sklearn/model_selection/_split.py in _iter_indices(self, X, y, groups)
    1939 if np.min(class_counts) < 2:
    1940 raise ValueError(
    -> 1941 "The least populated class in y has only 1"
    1942 " member, which is too few. The minimum"
    1943 " number of groups for any class cannot"

    ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.

    I have tried replicating it over a dataframe with random numbers only and the same happened.
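
    The traceback points at the stratified sampling step: once the frame exceeds the 10,000-row sampling limit, featurewiz stratifies the sample on the target, and scikit-learn refuses to stratify any class that occurs only once. A minimal reproduction of the underlying error, independent of featurewiz:

    # Minimal reproduction: stratified splitting fails whenever some class in y has
    # fewer than 2 members.
    import pandas as pd
    from sklearn.model_selection import train_test_split

    y = pd.Series([0] * 10499 + [1])                 # one singleton class, as in the report
    X = pd.DataFrame({'a': range(len(y))})
    train_test_split(X, test_size=10000 / len(y), stratify=y)
    # ValueError: The least populated class in y has only 1 member, which is too few...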

    Memory run-out error even for a modest-sized dataset when using featurewiz

    I have a dataset with training data of around 440.06 MiB and test data of around 219.6 MiB. When I try to use featurewiz with this dataset, it shows an out-of-memory error on the GPU (on both Kaggle and Google Colab).

    1. Is there any way to solve this problem other than moving to cloud platforms? (One mitigation is sketched below the screenshots.)
    2. Is there any way to free memory internally while featurewiz is running?
    3. I have loaded the dataset into the GPU environment and passed those dataframes directly to featurewiz, and it still shows an error.
      image
      image
      image
      Dataset can be found at "https://www.kaggle.com/competitions/ventilator-pressure-prediction/data"
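
    One mitigation that sometimes helps (a generic pandas trick sketched below, not a featurewiz feature) is to downcast the numeric dtypes before calling featurewiz, so the copy handed to the GPU is considerably smaller:

    # Sketch: shrink numeric dtypes before calling featurewiz to lower peak memory use.
    import numpy as np
    import pandas as pd

    def downcast(df: pd.DataFrame) -> pd.DataFrame:
        for col in df.select_dtypes(include=['int64']).columns:
            df[col] = pd.to_numeric(df[col], downcast='integer')
        for col in df.select_dtypes(include=['float64']).columns:
            df[col] = df[col].astype(np.float32)
        return df

    train = downcast(train)
    test = downcast(test)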

    "left_subtract" not defined in SULOV

    Hello,

    When I execute
    outputs = featurewiz.featurewiz(features_train.join(y_train), "label", corr_limit=0.7, verbose=1)

    I receive the error message

    #######################################################################################
    #####  Searching for Uncorrelated List Of Variables (SULOV) in 446 features ############
    #######################################################################################
        there are no null values in dataset...
        SULOV Method crashing due to name 'left_subtract' is not defined
        SULOV method is erroring. Continuing ...
    Time taken for SULOV method = 2 seconds
        Adding 0 categorical variables to reduced numeric variables  of 446
    Final list of selected vars after SULOV = 446
    

    Also, when I import the left_subtract function, it still doesn't work.
    from featurewiz.featurewiz import left_subtract

    What is the issue here?

    UFuncTypeError: ufunc 'add' did not contain a loop with signature matching types (dtype('<U26'), dtype('int64')) -> None

    Thanks a lot for this package; it is very useful for me.

    I am trying to follow a tutorial on Hackernoon to select features from a dataset.

    When I execute the code below, I get the error shown underneath:

    from featurewiz import featurewiz
    features, train = featurewiz(ord_train_t,y_train, corr_limit=0.7, verbose=2)
    

    UFuncTypeError: ufunc 'add' did not contain a loop with signature
    matching types (dtype('<U26'), dtype('int64')) -> None

    However, I verified the dtypes for all my train data (ord_train_t) and target (y_train).

    They are all int64 and float64 (as shown below), so I don't understand why there is still an error. Even after converting float64 to int64, I get the same error. I also tried ord_train_t.isna().sum(); there are no NAs.

    image

    Find the full error below:

    ---------------------------------------------------------------------------
    UFuncTypeError                            Traceback (most recent call last)
    C:\Users\abcde1\AppData\Local\Temp/ipykernel_1888/1114387036.py in <module>
          1 from featurewiz import featurewiz
          2 
    ----> 3 features, train = featurewiz(ord_train_t,y_train, corr_limit=0.7, verbose=2)
    
    ~\Anaconda3\lib\site-packages\featurewiz\featurewiz.py in featurewiz(dataname, target, corr_limit, verbose, sep, header, test_data, feature_engg, category_encoders, dask_xgboost_flag, nrows, **kwargs)
       1027     ##################    L O A D    T E S T   D A T A      ######################
       1028     dataname = remove_duplicate_cols_in_dataset(dataname)
    -> 1029     dataname = remove_special_chars_in_names(dataname, target, verbose=1)
       1030     if dask_xgboost_flag:
       1031         train = remove_special_chars_in_names(train, target)
    
    ~\Anaconda3\lib\site-packages\featurewiz\featurewiz.py in remove_special_chars_in_names(df, target, verbose)
       3581     else:
       3582         sel_preds = [x for x in list(df) if x not in target]
    -> 3583         df = df[sel_preds+target]
       3584     orig_preds = copy.deepcopy(sel_preds)
       3585     #####   column names must not have any special characters #####
    
    ~\Anaconda3\lib\site-packages\pandas\core\ops\common.py in new_method(self, other)
         67         other = item_from_zerodim(other)
         68 
    ---> 69         return method(self, other)
         70 
         71     return new_method
    
    ~\Anaconda3\lib\site-packages\pandas\core\arraylike.py in __radd__(self, other)
         94     @unpack_zerodim_and_defer("__radd__")
         95     def __radd__(self, other):
    ---> 96         return self._arith_method(other, roperator.radd)
         97 
         98     @unpack_zerodim_and_defer("__sub__")
    
    ~\Anaconda3\lib\site-packages\pandas\core\series.py in _arith_method(self, other, op)
       5524 
       5525         with np.errstate(all="ignore"):
    -> 5526             result = ops.arithmetic_op(lvalues, rvalues, op)
       5527 
       5528         return self._construct_result(result, name=res_name)
    
    ~\Anaconda3\lib\site-packages\pandas\core\ops\array_ops.py in arithmetic_op(left, right, op)
        222         _bool_arith_check(op, left, right)
        223 
    --> 224         res_values = _na_arithmetic_op(left, right, op)
        225 
        226     return res_values
    
    ~\Anaconda3\lib\site-packages\pandas\core\ops\array_ops.py in _na_arithmetic_op(left, right, op, is_cmp)
        164 
        165     try:
    --> 166         result = func(left, right)
        167     except TypeError:
        168         if is_object_dtype(left) or is_object_dtype(right) and not is_cmp:
    
    ~\Anaconda3\lib\site-packages\pandas\core\computation\expressions.py in evaluate(op, a, b, use_numexpr)
        237         if use_numexpr:
        238             # error: "None" not callable
    --> 239             return _evaluate(op, op_str, a, b)  # type: ignore[misc]
        240     return _evaluate_standard(op, op_str, a, b)
        241 
    
    ~\Anaconda3\lib\site-packages\pandas\core\computation\expressions.py in _evaluate_standard(op, op_str, a, b)
         67     if _TEST_MODE:
         68         _store_test_result(False)
    ---> 69     return op(a, b)
         70 
         71 
    
    ~\Anaconda3\lib\site-packages\pandas\core\roperator.py in radd(left, right)
          7 
          8 def radd(left, right):
    ----> 9     return right + left
         10 
         11 
    
    UFuncTypeError: ufunc 'add' did not contain a loop with signature matching types (dtype('<U26'), dtype('int64')) -> None
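
    A likely cause, judging from the traceback (an educated guess, not a confirmed diagnosis): featurewiz expects the target as a column name inside the training frame, and here a separate y_train Series is being passed, so the internal sel_preds + target concatenation mixes column-name strings with the Series' integer values. A sketch of the call with the target joined into the frame first ('target' is an assumed column name):

    # Sketch of a likely fix: give featurewiz one frame that contains the target column,
    # and pass the target as a column *name* rather than a separate Series.
    from featurewiz import featurewiz

    train_full = ord_train_t.copy()
    train_full['target'] = y_train.values        # 'target' is an assumed column name
    features, train = featurewiz(train_full, 'target', corr_limit=0.7, verbose=2)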
    

    Columns must be same length as key

    Hi there,
    I am new to featurewiz and I got a weird error while using it. It happened after feature selection completed.
    I followed the instructions and I am getting the following error.
    I would be really grateful if anyone could help me.
    Thanks in advance.
    image

    feature_engg is not working properly

    I'm trying to use this, but it shows an error when I use the feature_engg parameter. I'm attaching some screenshots; please help me with this. When I use that parameter, it says that the new features added by featurewiz are not found in the original dataset.
    image
    image
    image

    Features Selected by SULOV depend on the version of featurewiz

    Dear all,
    I was using featurewiz version 0.0.38 in an old project (https://github.com/AutoViML/featurewiz/tree/6b870dae8dcf4f24873eb61bb48947ceb84e189c)
    The number of features selected and returned by FE_remove_variables_using_SULOV_method was 18

    I am using featurewiz version 0.1.87 in a new project
    The number of features selected and returned by FE_remove_variables_using_SULOV_method is 43

    The input dataset and the input parameters to featurewiz are the same for both projects. I stepped through the two versions of featurewiz and the results diverge starting from the computation of the correlation matrix at the beginning of FE_remove_variables_using_SULOV_method. Moreover, I noticed that in the new version the target label is modified by mlb = My_LabelEncoder() and dataname[each_target] = mlb.fit_transform(dataname[each_target]) before calling SULOV, which did not happen in the previous version.

    Could you clarify what main differences have been introduced in the new version? As you can imagine, such a significant difference in the outputs of the two versions is unpleasant.

    Imported featurewiz: advanced feature engg and selection library. Version=0.0.38
    output = featurewiz(dataname, target, corr_limit=0.70,
    verbose=2, sep=',', header=0, test_data='',
    feature_engg='', category_encoders='')
    Create new features via 'feature_engg' flag : ['interactions','groupby','target']

    Skipping feature engineering since no feature_engg input...
    Skipping category encoding since no category encoders specified in input...
    Shape of your Data Set loaded: (38, 3385)
    Filename is an empty string or file not able to be loaded
    ############## C L A S S I F Y I N G V A R I A B L E S ####################
    Classifying variables in data set...
    3384 Predictors classified...
    2022-07-26 00:03:54.391147: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
    To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
    142 variable(s) will be ignored since they are ID or low-information variables
    Shape of your Data Set loaded: (38, 3385)
    Number of processors on machine = 1
    No GPU active on this device
    Running XGBoost using CPU parameters
    ############## C L A S S I F Y I N G V A R I A B L E S ####################
    Classifying variables in data set...
    3384 Predictors classified...
    142 variable(s) will be ignored since they are ID or low-information variables
    Removing 142 columns from further processing since ID or low information variables
    columns removed: ['x427', 'x433', 'x439', 'x771', 'x777', 'x783', 'x825', 'x831', 'x837', 'x850', 'x856', 'x862', 'x1216', 'x1228', 'x1240', 'x1248', 'x1254', 'x1260', 'x1273', 'x1279', 'x1285', 'x1289', 'x1301', 'x1313', 'x1617', 'x1623', 'x1629', 'x1633', 'x1639', 'x1645', 'x1651', 'x1657', 'x1663', 'x1671', 'x1677', 'x1683', 'x1696', 'x1702', 'x1708', 'x1712', 'x1724', 'x1736', 'x2056', 'x2062', 'x2068', 'x2074', 'x2080', 'x2086', 'x2094', 'x2100', 'x2106', 'x2119', 'x2125', 'x2131', 'x2135', 'x2147', 'x2159', 'x2479', 'x2491', 'x2503', 'x2517', 'x2523', 'x2529', 'x2542', 'x2548', 'x2554', 'x2558', 'x2570', 'x2582', 'x2965', 'x2971', 'x2977', 'x2981', 'x2993', 'x3005', 'x3325', 'x3337', 'x3349', 'x3363', 'x3369', 'x3375', 'x490', 'x495', 'x497', 'x502', 'x503', 'x507', 'x844', 'x913', 'x918', 'x920', 'x924', 'x925', 'x926', 'x936', 'x1267', 'x1268', 'x1330', 'x1332', 'x1334', 'x1336', 'x1341', 'x1343', 'x1347', 'x1348', 'x1349', 'x1351', 'x1360', 'x1361', 'x1753', 'x1755', 'x1757', 'x1759', 'x1764', 'x1766', 'x1770', 'x1771', 'x1772', 'x1778', 'x2176', 'x2178', 'x2180', 'x2182', 'x2187', 'x2189', 'x2194', 'x2195', 'x2207', 'x2599', 'x2601', 'x2603', 'x2605', 'x2610', 'x2616', 'x2617', 'x2959', 'x3022', 'x3024', 'x3026', 'x3028', 'x3039', 'x3046']
    After removing redundant variables from further processing, features left = 3242

    Single_Label Binary_Classification Feature Selection Started

    Searching for highly correlated variables from 3242 variables using SULOV method

    SULOV : Searching for Uncorrelated List Of Variables (takes time...)
    Removing (3224) highly correlated variables:
    Following (18) vars selected: ['x2', 'x10', 'x26', 'x69', 'x87', 'x129', 'x187', 'x417', 'x496', 'x554', 'x608', 'x975', 'x1033', 'x1156', 'x1765', 'x2134', 'x2910', 'x3176']
    

    Imported version = 0.1.87.
    from featurewiz import FeatureWiz
    wiz = FeatureWiz(verbose=1)
    X_train_selected = wiz.fit_transform(X_train, y_train)
    X_test_selected = wiz.transform(X_test)
    wiz.features ### provides a list of selected features ###

    ############################################################################################
    ############ F A S T F E A T U R E E N G G A N D S E L E C T I O N ! ########

    Be judicious with featurewiz. Don't use it to create too many un-interpretable features!

    ############################################################################################
    Skipping feature engineering since no feature_engg input...
    Skipping category encoding since no category encoders specified in input...
    **INFO: featurewiz can now read feather formatted files. Loading train data...
    Shape of your Data Set loaded: (38, 3385)
    Loaded train data. Shape = (38, 3385)
    No test data filename given...
    #######################################################################################
    ######################## C L A S S I F Y I N G V A R I A B L E S ####################
    #######################################################################################
    Classifying variables in data set...
    3384 Predictors classified...
    142 variable(s) to be removed since ID or low-information variables
    more than 142 variables to be removed; too many to print...
    train data shape before dropping 81 columns = (38, 3385)
    train data shape after dropping columns = (38, 3304)
    Converted pandas dataframe into a Dask dataframe ...
    No GPU active on this device
    Tuning XGBoost using CPU hyper-parameters. This will take time...
    After removing redundant variables from further processing, features left = 3242
    No interactions created for categorical vars since feature engg does not specify it

    Single_Label Binary_Classification problem

    target labels need to be converted...
    

    Completed label encoding of target variable = target
    How model predictions need to be transformed for target:
    {0: 1}
    #######################################################################################

    Searching for Uncorrelated List Of Variables (SULOV) in 3242 features

    #######################################################################################
    there are no null values in dataset...
    Removing (3199) highly correlated variables:
    SULOV method is erroring. Continuing ...
    Time taken for SULOV method = 138 seconds
    Adding 0 categorical variables to reduced numeric variables of 3242
    Final list of selected vars after SULOV = 3242
    Readying dataset for Recursive XGBoost by converting all features to numeric...
