mljar / mljar-supervised

Python package for AutoML on Tabular Data with Feature Engineering, Hyper-Parameters Tuning, Explanations and Automatic Documentation

Home Page: https://mljar.com

License: MIT License

Python 100.00%

automl machine-learning automatic-machine-learning mljar data-science scikit-learn hyperparameter-optimization feature-engineering xgboost random-forest

mljar-supervised's Introduction

New way for visual programming!

We are working on a new way of visual programming. We developed a desktop application called MLJAR Studio. It is a notebook-based development environment with interactive code recipes and a managed Python environment, all running locally on your machine. We are waiting for your feedback.

MLJAR Automated Machine Learning for Humans

Documentation: https://supervised.mljar.com/

Source Code: https://github.com/mljar/mljar-supervised

Looking for commercial support: Please contact us by email for details

Automated Machine Learning

The mljar-supervised is an Automated Machine Learning Python package that works with tabular data. It is designed to save time for a data scientist. It abstracts the common way to preprocess the data, construct the machine learning models, and perform hyper-parameters tuning to find the best model 🏆. It is no black box, as you can see exactly how the ML pipeline is constructed (with a detailed Markdown report for each ML model).

The mljar-supervised will help you with:

explaining and understanding your data (Automatic Exploratory Data Analysis),
trying many different machine learning models (Algorithm Selection and Hyper-Parameters tuning),
creating Markdown reports from analysis with details about all models (Automatic-Documentation),
saving, re-running, and loading the analysis and ML models.

It has four built-in modes of work:

Explain mode, which is ideal for explaining and understanding the data, with many data explanations, like decision trees visualization, linear models coefficients display, permutation importance, and SHAP explanations of data,
Perform mode, for building ML pipelines to use in production,
Compete mode, which trains highly-tuned ML models with ensembling and stacking, intended for ML competitions,
Optuna mode, which searches for highly-tuned ML models and should be used when performance is most important and computation time is not limited (available from version 0.10.0).

Of course, you can further customize the details of each mode to meet the requirements.

What's good in it?

It uses many algorithms: Baseline, Linear, Random Forest, Extra Trees, LightGBM, Xgboost, CatBoost, Neural Networks, and Nearest Neighbors.
It can compute Ensemble based on a greedy algorithm from Caruana paper.
It can stack models to build a level 2 ensemble (available in Compete mode or after setting the stack_models parameter).
It can do features preprocessing, like missing values imputation and converting categoricals. What is more, it can also handle target values preprocessing.
It can do advanced features engineering, like Golden Features, Features Selection, Text and Time Transformations.
It can tune hyper-parameters with a not-so-random-search algorithm (random-search over a defined set of values) and hill climbing to fine-tune final models.
It can compute the Baseline for your data so that you will know if you need Machine Learning or not!
It has extensive explanations. The package trains simple Decision Trees with max_depth <= 5, so you can easily visualize them with the amazing dtreeviz to better understand your data.
The mljar-supervised uses simple linear regression and includes its coefficients in the summary report, so you can check which features are used the most in the linear model.
It cares about the explainability of models: for every algorithm, the feature importance is computed based on permutation. Additionally, for every algorithm, the SHAP explanations are computed: feature importance, dependence plots, and decision plots (explanations can be switched off with the explain_level parameter).
There is automatic documentation for every ML experiment run with AutoML. The mljar-supervised creates markdown reports from AutoML training full of ML details, metrics, and charts.
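The greedy ensembling mentioned in the list above can be illustrated with a small sketch. This is not the mljar-supervised implementation: the function name `greedy_ensemble`, the toy predictions, and the choice of MSE as the loss are assumptions made for illustration. The idea is forward selection with replacement: at each step, add the model whose inclusion most improves the averaged prediction, and stop when nothing helps.

```python
def greedy_ensemble(predictions, y_true, loss, max_steps=10):
    """Greedy forward selection with replacement over model predictions.

    predictions: dict mapping model name -> list of predicted values
    Returns (selection counts per model, loss of the final averaged ensemble).
    """
    n = len(y_true)
    selected = {}             # model name -> how many times it was picked
    running_sum = [0.0] * n   # sum of predictions of all picked models
    best_loss = float("inf")
    for step in range(1, max_steps + 1):
        candidate, candidate_loss = None, best_loss
        for name, preds in predictions.items():
            # Averaged ensemble if this model were added one more time
            avg = [(s + p) / step for s, p in zip(running_sum, preds)]
            current = loss(y_true, avg)
            if current < candidate_loss:
                candidate, candidate_loss = name, current
        if candidate is None:  # no model improves the ensemble, so stop
            break
        selected[candidate] = selected.get(candidate, 0) + 1
        running_sum = [s + p for s, p in zip(running_sum, predictions[candidate])]
        best_loss = candidate_loss
    return selected, best_loss


def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy data (hypothetical): two models with opposite bias and one poor model
y = [1.0, 2.0, 3.0]
preds = {
    "model_a": [1.5, 2.5, 3.5],
    "model_b": [0.5, 1.5, 2.5],
    "model_c": [3.0, 3.0, 3.0],
}
weights, score = greedy_ensemble(preds, y, mse)
print(weights, score)
```

The selection counts act as ensemble weights: a model picked twice contributes twice as much to the average as a model picked once.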

AutoML Web App with User Interface

We created a Web App with GUI, so you don't need to write any code 🐍. Just upload your data. Please check the Web App at github.com/mljar/automl-app. You can run this Web App locally on your computer, so your data is safe and secure 🐱

Automatic Documentation

The AutoML Report

The report from running AutoML will contain the table with information about each model score and the time needed to train the model. There is a link for each model, which you can click to see the model's details. The performance of all ML models is presented as scatter and box plots so you can visually inspect which algorithms perform the best 🏆.

The `Decision Tree` Report

The example for Decision Tree summary with trees visualization. For classification tasks, additional metrics are provided:

confusion matrix
threshold (optimized in the case of binary classification task)
F1 score
Accuracy
Precision, Recall, MCC
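As a rough illustration of what an optimized threshold means for binary classification, the sketch below scans candidate thresholds and keeps the one with the highest F1 score. The function names and the toy data are assumptions; the source does not say which metric or search procedure the report uses to optimize the threshold.

```python
def f1_at_threshold(y_true, y_prob, threshold):
    """F1 score when probabilities >= threshold are predicted as class 1."""
    tp = fp = fn = 0
    for t, p in zip(y_true, y_prob):
        pred = 1 if p >= threshold else 0
        if pred == 1 and t == 1:
            tp += 1
        elif pred == 1 and t == 0:
            fp += 1
        elif pred == 0 and t == 1:
            fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_f1_threshold(y_true, y_prob):
    """Try every distinct predicted probability as a threshold."""
    candidates = sorted(set(y_prob))
    return max(candidates, key=lambda th: f1_at_threshold(y_true, y_prob, th))


# Toy data (hypothetical)
y_true = [0, 0, 1, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8, 0.9]
threshold = best_f1_threshold(y_true, y_prob)
best_f1 = f1_at_threshold(y_true, y_prob, threshold)
print(threshold, best_f1)
```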

The `LightGBM` Report

The example for LightGBM summary:

Available Modes

In the docs you can find details about AutoML modes that are presented in the table.

Explain

automl = AutoML(mode="Explain")

It is aimed to be used when the user wants to explain and understand the data.

It uses a 75%/25% train/test split.
It uses: Baseline, Linear, Decision Tree, Random Forest, Xgboost, Neural Network algorithms, and ensemble.
It has full explanations: learning curves, importance plots, and SHAP plots.

Perform

automl = AutoML(mode="Perform")

It should be used when the user wants to train a model that will be used in real-life use cases.

It uses a 5-fold CV.
It uses: Linear, Random Forest, LightGBM, Xgboost, CatBoost, and Neural Network. It uses ensembling.
It has learning curves and importance plots in reports.

Compete

automl = AutoML(mode="Compete")

It should be used for machine learning competitions.

It adapts the validation strategy depending on dataset size and total_time_limit. It can be: a train/test split (80/20), 5-fold CV or 10-fold CV.
It is using: Linear, Decision Tree, Random Forest, Extra Trees, LightGBM, Xgboost, CatBoost, Neural Network, and Nearest Neighbors. It uses ensemble and stacking.
It has only learning curves in the reports.

Optuna

automl = AutoML(mode="Optuna", optuna_time_budget=3600)

It should be used when the performance is the most important and time is not limited.

It uses a 10-fold CV
It uses: Random Forest, Extra Trees, LightGBM, Xgboost, and CatBoost. Those algorithms are tuned by Optuna framework for optuna_time_budget seconds, each. Algorithms are tuned with original data, without advanced feature engineering.
It uses advanced feature engineering, stacking and ensembling. The hyperparameters found for original data are reused with those steps.
It produces learning curves in the reports.

How to save and load AutoML?

All models in the AutoML are saved and loaded automatically. No need to call save() or load().

Example:

Train AutoML

automl = AutoML(results_path="AutoML_classifier")
automl.fit(X, y)

You will have all models saved in the AutoML_classifier directory. Each model will have a separate directory with the README.md file with all details from the training.

Compute predictions

automl = AutoML(results_path="AutoML_classifier")
automl.predict(X)

The AutoML automatically loads models from the results_path directory. If you call fit() on an already trained AutoML, you will get a warning message that AutoML is already fitted.

Why do you automatically save all models?

All models are automatically saved to be able to restore the training after interruption. For example, you are training AutoML for 48 hours, and after 47 hours, there is some unexpected interruption. In MLJAR AutoML you just call the same training code after the interruption and AutoML reloads already trained models and finishes the training.
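The resume behavior described above can be sketched in a few lines. This is only an illustration of the general idea (skip work whose results already exist on disk); the file layout with `result.json` and the helper `train_all` are hypothetical and do not describe the actual mljar-supervised internals.

```python
import json
import tempfile
from pathlib import Path


def train_all(results_path, model_names, train_fn):
    """Train each model unless its result file already exists on disk."""
    results_path = Path(results_path)
    results_path.mkdir(parents=True, exist_ok=True)
    trained_now = []
    for name in model_names:
        result_file = results_path / name / "result.json"  # hypothetical layout
        if result_file.exists():
            continue  # finished in an earlier run, so it is reused
        result_file.parent.mkdir(exist_ok=True)
        result_file.write_text(json.dumps(train_fn(name)))
        trained_now.append(name)
    return trained_now


def fake_train(name):
    # Stand-in for real training; returns a dummy result
    return {"model": name, "score": 0.5}


with tempfile.TemporaryDirectory() as tmp:
    # First run is "interrupted" after two models
    first_run = train_all(tmp, ["1_Baseline", "2_DecisionTree"], fake_train)
    # Re-running the same code only trains what is still missing
    second_run = train_all(tmp, ["1_Baseline", "2_DecisionTree", "3_Linear"], fake_train)

print(first_run, second_run)
```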

Supported evaluation metrics (`eval_metric` argument in `AutoML()`)

for binary classification: logloss, auc, f1, average_precision, accuracy - default is logloss
for multiclass classification: logloss, f1, accuracy - default is logloss
for regression: rmse, mse, mae, r2, mape, spearman, pearson - default is rmse
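To make the default classification metric concrete, here is a minimal sketch of binary logloss next to accuracy, written from the standard textbook definitions rather than taken from the package. It shows why logloss can separate two models that have identical accuracy: it rewards confident correct probabilities.

```python
import math


def logloss(y_true, y_prob, eps=1e-15):
    """Binary log loss; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(y_true)


def accuracy(y_true, y_prob, threshold=0.5):
    """Share of samples where the thresholded probability matches the label."""
    hits = sum(1 for t, p in zip(y_true, y_prob) if (p >= threshold) == bool(t))
    return hits / len(y_true)


# Toy data (hypothetical): both models classify every sample correctly
y_true = [1, 0, 1, 0]
confident = [0.9, 0.1, 0.8, 0.2]
uncertain = [0.6, 0.4, 0.6, 0.4]

print(accuracy(y_true, confident), accuracy(y_true, uncertain))
print(logloss(y_true, confident), logloss(y_true, uncertain))
```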

If you don't find the eval_metric that you need, please add a new issue. We will add it.

Fairness Aware Training

Starting from version 1.0.0 AutoML can optimize the Machine Learning pipeline with sensitive features. There are the following fairness related arguments in the AutoML constructor:

fairness_metric - metric which will be used to decide if the model is fair,
fairness_threshold - threshold used in decision about model fairness,
privileged_groups - privileged groups used in fairness metrics computation,
underprivileged_groups - underprivileged groups used in fairness metrics computation.

The fit() method accepts sensitive_features. When sensitive features are passed to AutoML, the best model will be selected among fair models only. In the AutoML reports, additional information about fairness metrics will be added. The MLJAR AutoML supports two methods for bias mitigation:

Sample Weighting - assigns weights to samples to treat samples equally,
Smart Grid Search - similar to Sample Weighting, where different weights are checked to optimize fairness metric.
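The source does not give the exact weighting formula, so the following is only a sketch of one common reweighing scheme, not necessarily what MLJAR AutoML does. Each sample gets the weight P(group) * P(label) / P(group, label), which makes the label statistically independent of the sensitive group in the weighted data. The function name and the toy data are assumptions.

```python
from collections import Counter


def reweighing_weights(groups, labels):
    """Weight = expected frequency / observed frequency of each (group, label) pair."""
    n = len(labels)
    group_counts = Counter(groups)
    label_counts = Counter(labels)
    pair_counts = Counter(zip(groups, labels))
    return [
        group_counts[g] * label_counts[y] / (n * pair_counts[(g, y)])
        for g, y in zip(groups, labels)
    ]


# Toy data (hypothetical): positives are over-represented among "Male"
groups = ["Male"] * 4 + ["Female"] * 4
labels = [1, 1, 1, 0, 1, 0, 0, 0]
weights = reweighing_weights(groups, labels)


def weighted_positive_rate(group):
    """Weighted share of positive labels within one group."""
    num = sum(w for w, g, y in zip(weights, groups, labels) if g == group and y == 1)
    den = sum(w for w, g in zip(weights, groups) if g == group)
    return num / den


print(weights)
print(weighted_positive_rate("Male"), weighted_positive_rate("Female"))
```

After weighting, both groups have the same weighted positive rate, while rare (group, label) combinations receive larger weights.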

The fair ML building can be used with all algorithms, including Ensemble and Stacked Ensemble. We support three Machine Learning tasks:

binary classification,
multiclass classification,
regression.

Example code:

from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_openml
from supervised.automl import AutoML

data = fetch_openml(data_id=1590, as_frame=True)
X = data.data
y = (data.target == ">50K") * 1
sensitive_features = X[["sex"]]

X_train, X_test, y_train, y_test, S_train, S_test = train_test_split(
    X, y, sensitive_features, stratify=y, test_size=0.75, random_state=42
)

automl = AutoML(
    algorithms=[
        "Xgboost"
    ],
    train_ensemble=False,
    fairness_metric="demographic_parity_ratio",  
    fairness_threshold=0.8,
    privileged_groups = [{"sex": "Male"}],
    underprivileged_groups = [{"sex": "Female"}],
)

automl.fit(X_train, y_train, sensitive_features=S_train)

You can read more about fairness aware AutoML training in our article https://mljar.com/blog/fairness-machine-learning/
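For intuition about the `demographic_parity_ratio` metric and the `fairness_threshold=0.8` used in the example, here is a small sketch based on the usual definition: the lowest selection rate across groups divided by the highest. The helper below is an illustration written for this README, not the package's own code, and returning 1.0 when no group has any positive prediction is a convention chosen here.

```python
def demographic_parity_ratio(y_pred, sensitive):
    """Ratio of the smallest to the largest positive-prediction rate across groups."""
    rates = {}
    for group in set(sensitive):
        preds = [p for p, s in zip(y_pred, sensitive) if s == group]
        rates[group] = sum(preds) / len(preds)
    highest = max(rates.values())
    if highest == 0:
        return 1.0  # assumption: all groups have a zero rate, treated as equal
    return min(rates.values()) / highest


# Toy predictions (hypothetical)
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]
sensitive = ["Male"] * 4 + ["Female"] * 4

ratio = demographic_parity_ratio(y_pred, sensitive)
is_fair = ratio >= 0.8  # same threshold as in the example above
print(ratio, is_fair)
```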

Examples

👉 Binary Classification Example

There is a simple interface available with fit and predict methods.

import pandas as pd
from sklearn.model_selection import train_test_split
from supervised.automl import AutoML

df = pd.read_csv(
    "https://raw.githubusercontent.com/pplonski/datasets-for-start/master/adult/data.csv",
    skipinitialspace=True,
)
X_train, X_test, y_train, y_test = train_test_split(
    df[df.columns[:-1]], df["income"], test_size=0.25
)

automl = AutoML()
automl.fit(X_train, y_train)

predictions = automl.predict(X_test)

AutoML fit will print:

Create directory AutoML_1
AutoML task to be solved: binary_classification
AutoML will use algorithms: ['Baseline', 'Linear', 'Decision Tree', 'Random Forest', 'Xgboost', 'Neural Network']
AutoML will optimize for metric: logloss
1_Baseline final logloss 0.5519845471086654 time 0.08 seconds
2_DecisionTree final logloss 0.3655910192804364 time 10.28 seconds
3_Linear final logloss 0.38139916864708445 time 3.19 seconds
4_Default_RandomForest final logloss 0.2975204390214936 time 79.19 seconds
5_Default_Xgboost final logloss 0.2731086827200411 time 5.17 seconds
6_Default_NeuralNetwork final logloss 0.319812276905242 time 21.19 seconds
Ensemble final logloss 0.2731086821194617 time 1.43 seconds

the AutoML results in Markdown report
the Xgboost Markdown report, please take a look at amazing dependence plots produced by SHAP package 💖
the Decision Tree Markdown report, please take a look at beautiful tree visualization ✨
the Logistic Regression Markdown report, please take a look at coefficients table, and you can compare the SHAP plots between (Xgboost, Decision Tree and Logistic Regression) ☕

👉 Multi-Class Classification Example

The example code for classification of the optical recognition of handwritten digits dataset. Running this code in less than 30 minutes will result in test accuracy ~98%.

import pandas as pd 
# scikit-learn utilities
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
# mljar-supervised package
from supervised.automl import AutoML

# load the data
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    pd.DataFrame(digits.data), digits.target, stratify=digits.target, test_size=0.25,
    random_state=123
)

# train models with AutoML
automl = AutoML(mode="Perform")
automl.fit(X_train, y_train)

# compute the accuracy on test data
predictions = automl.predict_all(X_test)
print(predictions.head())
print("Test accuracy:", accuracy_score(y_test, predictions["label"].astype(int)))

👉 Regression Example

Regression example on California Housing house prices data.

import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from supervised.automl import AutoML # mljar-supervised

# Load the data
housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    pd.DataFrame(housing.data, columns=housing.feature_names),
    housing.target,
    test_size=0.25,
    random_state=123,
)

# train models with AutoML
automl = AutoML(mode="Explain")
automl.fit(X_train, y_train)

# compute the MSE on test data
predictions = automl.predict(X_test)
print("Test MSE:", mean_squared_error(y_test, predictions))

👉 More Examples

Income classification - it is a binary classification task on census data
Iris classification - it is a multiclass classification on Iris flowers data
House price regression - it is a regression task on Boston houses data

FAQ

What method is used for hyperparameters optimization?

- For modes `Explain`, `Perform`, and `Compete`, a random search method combined with hill climbing is used. In this approach, all checked models are saved and used for building the Ensemble.
- For mode `Optuna`, the Optuna framework is used with the TPE sampler for tuning. Models checked during the Optuna hyperparameter search are not saved; only the best model (the final model from tuning) is saved. You can check the details of the checked hyperparameters by inspecting the study files in the `optuna` directory in your AutoML `results_path`.
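The combination of random search over a defined set of values and hill climbing can be sketched as follows. The search space, the toy loss, and all function names are hypothetical; the real tuner works on trained models and is more involved. The sketch only shows the two phases: sample a few parameter sets from a fixed grid, then refine the best one by moving to neighboring grid values while the score improves.

```python
import random


def random_search(space, n_trials, score, rng):
    """Sample parameter sets from a fixed grid of allowed values."""
    trials = []
    for _ in range(n_trials):
        params = {name: rng.choice(values) for name, values in space.items()}
        trials.append((score(params), params))
    return trials


def neighbors(space, params):
    """Yield parameter sets that differ by one grid step in a single parameter."""
    for name, values in space.items():
        idx = values.index(params[name])
        for j in (idx - 1, idx + 1):
            if 0 <= j < len(values):
                yield {**params, name: values[j]}


def hill_climb(space, start, score):
    """Move to the best neighbor until no neighbor lowers the score."""
    best, best_score = dict(start), score(start)
    while True:
        scored = [(score(c), c) for c in neighbors(space, best)]
        cand_score, cand = min(
            scored, key=lambda item: item[0], default=(best_score, best)
        )
        if cand_score >= best_score:
            return best, best_score
        best, best_score = cand, cand_score


# Hypothetical search space and loss (lower is better)
space = {"max_depth": [2, 4, 6, 8], "n_estimators": [100, 200, 300, 400]}


def toy_loss(params):
    return abs(params["max_depth"] - 6) + abs(params["n_estimators"] - 300) / 100


rng = random.Random(42)
trials = random_search(space, n_trials=3, score=toy_loss, rng=rng)
start_score, start = min(trials, key=lambda item: item[0])
best, best_score = hill_climb(space, start, toy_loss)
print(start, "->", best, best_score)
```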

How to save and load AutoML?

Saving and loading of AutoML models is automatic. All models created during AutoML training are saved in the directory set in results_path (an argument of the AutoML() constructor). If results_path is not set, a directory is created using the naming convention AutoML_{number}, where the number runs from 1 to 1000 (the first free directory name is used).

Example save and load:

automl = AutoML(results_path='AutoML_1')
automl.fit(X, y)

All models from AutoML are saved in the AutoML_1 directory.

To load models:

automl = AutoML(results_path='AutoML_1')
automl.predict(X)
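The AutoML_{number} naming convention for the default directory can be pictured with a short sketch. The helper `next_results_path` is hypothetical and only mirrors the rule stated above (first free number from 1 to 1000); it is not the package's code.

```python
import os
import tempfile


def next_results_path(base_dir="."):
    """Return the first AutoML_{number} path (1..1000) that does not exist yet."""
    for number in range(1, 1001):
        candidate = os.path.join(base_dir, f"AutoML_{number}")
        if not os.path.exists(candidate):
            return candidate
    raise RuntimeError("No free AutoML_{number} directory name in range 1..1000")


with tempfile.TemporaryDirectory() as tmp:
    # Simulate two earlier runs
    os.mkdir(os.path.join(tmp, "AutoML_1"))
    os.mkdir(os.path.join(tmp, "AutoML_2"))
    chosen = os.path.basename(next_results_path(tmp))

print(chosen)
```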

How to set ML task (select between classification or regression)?

The MLJAR AutoML can work with:

binary classification
multi-class classification
regression

The ML task is detected automatically based on the target values. If you want to force AutoML to use a specific ML task, set the ml_task parameter. It can be set to 'binary_classification', 'multiclass_classification', or 'regression'.

Example:

automl = AutoML(ml_task='regression')
automl.fit(X, y)

In the above example the regression model will be fitted.

How to reuse Optuna hyperparameters?

You can reuse Optuna hyperparameters that were found in other AutoML training. You need to pass them in optuna_init_params argument. All hyperparameters found during Optuna tuning are saved in the optuna/optuna.json file (inside results_path directory).

Example:

import json

# Load the hyperparameters found in a previous AutoML training
with open('previous_AutoML_training/optuna/optuna.json') as f:
    optuna_init = json.load(f)

automl = AutoML(
    mode='Optuna',
    optuna_init_params=optuna_init
)
automl.fit(X, y)

When reusing Optuna hyperparameters the Optuna tuning is simply skipped. The model will be trained with hyperparameters set in optuna_init_params. Right now there is no option to continue Optuna tuning with seed parameters.

How to know the order of classes for binary or multiclass problem when using predict_proba?

To get predicted probabilities with information about the class label, please use the predict_all() method. It returns a pandas DataFrame with class names in the columns. The order of predicted columns is the same in the predict_proba() and predict_all() methods. The predict_all() method will additionally have a column with the predicted class label.

Documentation

For details please check mljar-supervised docs.

Installation

From PyPi repository:

pip install mljar-supervised

To install this package with conda run:

conda install -c conda-forge mljar-supervised

From source code:

git clone https://github.com/mljar/mljar-supervised.git
cd mljar-supervised
python setup.py install

Installation for development

git clone https://github.com/mljar/mljar-supervised.git
virtualenv venv --python=python3.6
source venv/bin/activate
pip install -r requirements.txt
pip install -r requirements_dev.txt

Running in the docker:

FROM python:3.7-slim-buster
RUN apt-get update && apt-get -y update
RUN apt-get install -y build-essential python3-pip python3-dev
RUN pip3 -q install pip --upgrade
RUN pip3 install mljar-supervised jupyter
CMD ["jupyter", "notebook", "--port=8888", "--no-browser", "--ip=0.0.0.0", "--allow-root"]

Install from GitHub with pip:

pip install -q -U git+https://github.com/mljar/mljar-supervised.git@master

Demo

In the below demo GIF you will see:

MLJAR AutoML trained in Jupyter Notebook on the Titanic dataset
overview of created files
a showcase of selected plots created during AutoML training
algorithm comparison report along with their plots
example of README file and CSV file with results

Contributing

To get started take a look at our Contribution Guide for information about our process and where you can fit in!

Contributors

Cite

Would you like to cite MLJAR? Great! :)

You can cite MLJAR as follows:

@misc{mljar,
  author    = {Aleksandra P\l{}o\'{n}ska and Piotr P\l{}o\'{n}ski},
  year      = {2021},
  publisher = {MLJAR},
  address   = {\L{}apy, Poland},
  title     = {MLJAR: State-of-the-art Automated Machine Learning Framework for Tabular Data.  Version 0.10.3},
  url       = {https://github.com/mljar/mljar-supervised}
}

We would love to hear how you have used MLJAR AutoML in your project. Please feel free to let us know at

License

The mljar-supervised is provided with MIT license.

Commercial support

Looking for commercial support? Do you need new feature implementation? Please contact us by email for details.

MLJAR

The mljar-supervised is an open-source project created by MLJAR. We care about ease of use in Machine Learning. The mljar.com provides a beautiful and simple user interface for building machine learning models.


mljar-supervised's Issues

Add preprocessing for data with missing values in test set

There can be a situation where new data has missing values in columns that have all values in train dataset. In such case please:

warn the user about this
fill missing values with median
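A minimal sketch of the behavior requested in this issue, assuming rows are plain dictionaries and `None` marks a missing value. The function names are made up for illustration: medians are learned on the training data only, and any value missing at prediction time triggers a warning and is filled with the training median.

```python
import statistics
import warnings


def fit_medians(train_rows, columns):
    """Learn per-column medians from training data, ignoring missing values."""
    return {
        c: statistics.median(r[c] for r in train_rows if r[c] is not None)
        for c in columns
    }


def fill_missing(rows, medians):
    """Fill missing values with training medians and warn about each fill."""
    filled = []
    for row in rows:
        new_row = dict(row)
        for col, med in medians.items():
            if new_row.get(col) is None:
                warnings.warn(f"Missing value in column '{col}' filled with train median {med}")
                new_row[col] = med
        filled.append(new_row)
    return filled


# Toy data (hypothetical): the train set has no missing values, the test row does
train = [
    {"age": 20, "income": 10},
    {"age": 30, "income": 20},
    {"age": 40, "income": 60},
]
medians = fit_medians(train, ["age", "income"])

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    test_filled = fill_missing([{"age": None, "income": 50}], medians)

print(test_filled, len(caught))
```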

Seed (random_state) for reproducibility [enhancement]

'random_state', as in sklearn, should be exposed to allow reproducibility.

Add option to not shuffle rows

When dealing with time series data it is important to not shuffle rows when performing cross-validation. Would be good to have that as an option.
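To show what an unshuffled split looks like, here is a small sketch of k-fold splitting over contiguous blocks of row indices. The generator `ordered_kfold` is a hypothetical helper, not part of the package.

```python
def ordered_kfold(n_rows, k):
    """Yield (train_indices, validation_indices) using contiguous, unshuffled folds."""
    # Spread the remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_rows))
        yield train, validation
        start += size


folds = list(ordered_kfold(n_rows=7, k=3))
for train_idx, val_idx in folds:
    print(train_idx, val_idx)
```

Note that keeping row order is not the same as a strictly time-aware split: here the training part of early folds still contains rows that come after the validation block. An expanding-window scheme would avoid that.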

Saving mljar automl model for future use

Hi,
traditionally I have been using the pickle package to save models in a pkl file and re-use them continuously on live data. I see the mljar model has to_json and from_json methods. Could you please create a small PoC or example with documentation showing how we could re-use it for daily / live data? Thanks. :)

Add decision trees as learning algorithm

Decision trees are very good for data understanding despite poor accuracy.

set interface for learners fit(X,y)

Add callback to control number of iterations

Add a simple callback to control the number of iterations.

Predicting probability for all models

Hello,
great AutoML tool!
But I have encountered a problem.

When the final model is RF, predict function returns classes (values are equal to 0 or 1).
But for other models (CatBoost, Xgboost, LightGBM, NN) , predict function returns probabilities.
Is it possible to get probabilities when the best model is Random Forest?

Use tree_method='hist' for Xgboost

Just a suggestion:

Consider using histogram method for Xgboost instead of default 'auto'. See this description:
dmlc/xgboost#1950

In practice it is 5-10x faster and leads to less overfitting.

Also helpful to limit thread usage to the number of physical cores of the system instead of default max virtual cores.

Set metric to be optimized

Set the metric to be optimized. Right now it is set to logloss

When trying to import AutoML it aborts

>>> import pandas as pd
>>> from supervised.automl import AutoML
/root/python/MLwebsite/lib/python3.6/site-packages/sklearn/externals/joblib/__init__.py:15: DeprecationWarning: sklearn.externals.joblib is deprecated in 0.21 and will be removed in 0.23. Please import this functionality directly from joblib, which can be installed with: pip install joblib. If this warning is raised when loading pickled models, you may need to re-serialize those models with scikit-learn 0.21+.
  warnings.warn(msg, category=DeprecationWarning)
Aborted

this takes about 30 seconds to complete and ends in me returning to bash and not python. I am just attempting to run the quick example in the readme and I have python 3.6.3 in a brand new venv with nothing but this installed via pip.

Imbalanced classes in multi-class

To reproduce get iris data set
Add one new label in target column
Run the analysis, it should break because of one extra class during cross validation

Refactor preprocessing

Refactor PreprocessingStep code. Right now there are run and transform methods; please provide a fit method instead of run.

Compute additional metrics for regression

Need to add code for computing additional metrics for regression models.

Additional metrics:

MAE
MSE
RMSE
R^2
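For reference, the four metrics listed in this issue can be computed from their standard definitions as in the sketch below. It is a plain-Python illustration with made-up toy values, not the code that was added to the package.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE and R^2 from their textbook definitions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)
    mse = ss_res / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    # R^2 is undefined for a constant target (ss_tot == 0)
    r2 = 1 - ss_res / ss_tot if ss_tot else float("nan")
    return {"MAE": mae, "MSE": mse, "RMSE": math.sqrt(mse), "R2": r2}


metrics = regression_metrics([1, 2, 3, 4], [1, 2, 3, 6])
print(metrics)
```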

Add baseline algorithms

set path where to save models

Path should be set in config file. Right now the path is hard coded to '/tmp'.

add test for models

Add test for models

Each model should have test suite with check of:

fit
predict
save and load

Import error when installed in fresh venv

I had trouble importing the library in my current working environment, so I've created fresh venv for the test. Results were the same

Terminal output:

Successfully installed Keras-2.2.4 absl-py-0.9.0 astor-0.8.1 catboost-0.13.1 enum34-1.1.6 gast-0.3.2 grpcio-1.26.0 h5py-2.10.0 joblib-0.14.1 keras-applications-1.0.8 keras-preprocessing-1.1.0 lightgbm-2.2.3 markdown-3.1.1 mljar-supervised-0.1.7 mock-3.0.5 numpy-1.18.1 pandas-0.25.3 protobuf-3.11.2 python-dateutil-2.8.1 pytz-2019.3 pyyaml-5.3 scikit-learn-0.22.1 scipy-1.4.1 six-1.13.0 tensorboard-1.13.1 tensorflow-1.13.1 tensorflow-estimator-1.13.0 termcolor-1.1.0 tqdm-4.31.1 werkzeug-0.16.0 wheel-0.33.6 xgboost-0.80

python
Python 3.7.4 (tags/v3.7.4:e09359112e, Jul  8 2019, 20:34:20) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from supervised.automl import AutoML
...\automl\env\lib\site-packages\tqdm\_tqdm.py:605: FutureWarning: The Panel class is removed from pandas. Accessing it from the top-level namespace will also be removed in the next version
  from pandas import Panel
Traceback (most recent call last):
  File "..\automl\env\lib\site-packages\tqdm\_tqdm.py", line 613, in pandas
    from pandas.core.groupby.groupby import DataFrameGroupBy, \
ImportError: cannot import name 'DataFrameGroupBy' from 'pandas.core.groupby.groupby' (...\automl\env\lib\site-packages\pandas\core\groupby\groupby.py)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "...\automl\env\lib\site-packages\supervised\__init__.py", line 3, in <module>
    from supervised.automl import AutoML
  File "...\automl\env\lib\site-packages\supervised\automl.py", line 10, in <module>
    tqdm.pandas()
  File "...\automl\env\lib\site-packages\tqdm\_tqdm.py", line 616, in pandas
    from pandas.core.groupby import DataFrameGroupBy, \
ImportError: cannot import name 'PanelGroupBy' from 'pandas.core.groupby' (C:\PKZ\Synced_dirs\Devel\Python\automl\env\lib\site-packages\pandas\core\groupby\__init__.py)
...

As per tqdm issue log, this seems to be fixed in newer version of this library, however pip installation of mljar-supervised installs the pre-fix release.

Maybe add one line deploy as REST API?

Add Linear and Logistic Regression support

Please check different scalers for it. https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html#sphx-glr-auto-examples-preprocessing-plot-all-scaling-py - robustscaler looks interesting

Add tests for reproducibility

Tests for reproducibility of model training and predictions are needed.

Problem with results_path behavior

Hi Piotr. When setting model_path in AutoML definition, if path already exists this generates an error. System is looking for a .json file which does not exist - unless the model has been fit I assume. Better behavior might be that if path exists but no .json exists then proceed as if we had just created the directory?

Thanks,

create table with model details

Create a table with model details, like different metrics.

Problem with learner_time_limit

Hi Piotr:

Thanks for the changes made in the new version. I'm now getting an error when setting learner_time_limit. Did you take this option out? I see in your code that you are setting it automatically as a function of total_time_limit. Can this be the behavior only if learner_time_limit is None?

Filter columns when doing predictions

When doing predictions please use only columns that were used for model building.
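The requested behavior can be sketched as a small guard applied before prediction. It assumes the list of training columns was stored at fit time; the helper name and the dictionary-based row format are illustrative only.

```python
def select_training_columns(row, training_columns):
    """Keep only the columns seen during training, in training order."""
    missing = [c for c in training_columns if c not in row]
    if missing:
        raise ValueError(f"Columns missing from prediction data: {missing}")
    return {c: row[c] for c in training_columns}


# Hypothetical example: the new data has an extra column and a different order
training_columns = ["age", "income"]
new_row = {"income": 50, "age": 30, "row_id": 7}
filtered = select_training_columns(new_row, training_columns)
print(filtered)
```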

Select number of cross-validation folds

Would be great to be able to set the number of cross-val folds. On the MLJAR.com platform I always use 15. Ideally we can set any number as long as number of folds <= number of rows. thanks

feature importance [enhancement]

It'd be nice to have 'feature importance' exposed in the same way as in sklearn.

progress bar for training

Add progress bar for training

Add support for new data types

Right now there is support for numerical and categorical data types. There is a need to support more data types:

text (#128)
dates (#122)
IP
geo locations

too much memory consumption by xgboost

when running several xgboost algorithms in row, with dataset > 100 MB, the RAM memory consumption is growing very fast - looks like a bug

Compute more metrics for classifier

It will be nice to compute:

F1 score
AUC
Precision and Recall
Matthews correlation coefficient

Compute the threshold that maximizes the F1 score.
Provide confusion matrix.
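Several of the metrics requested here follow directly from the four confusion-matrix counts. The sketch below computes the confusion matrix, precision, recall, and the Matthews correlation coefficient for hard 0/1 predictions using the standard formulas; the function name and the toy data are assumptions, and AUC is left out because it needs ranked probabilities rather than hard predictions.

```python
import math


def binary_report(y_true, y_pred):
    """Confusion matrix, precision, recall and MCC for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "confusion_matrix": [[tn, fp], [fn, tp]],  # rows: true class, columns: predicted
        "precision": precision,
        "recall": recall,
        "mcc": mcc,
    }


report = binary_report([1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 1])
print(report)
```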

Add LightGBM support

The lightgbm algorithm is already available in the code. Make sure that it works with:

binary classification
multiclass classification
regression

remove constant or empty columns

in preprocessing we should remove:

empty full rows
empty full columns
columns with constant value
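A minimal sketch of the three cleanup steps, assuming rows are dictionaries and `None` marks an empty cell. The helper name is hypothetical. One design choice to be aware of: a column with a single distinct value plus some missing cells is treated as constant here, which may or may not be what the final preprocessing should do.

```python
def drop_uninformative(rows):
    """Drop fully empty rows, fully empty columns, and constant columns."""
    # 1. Remove rows where every value is missing
    rows = [r for r in rows if any(v is not None for v in r.values())]
    columns = list(rows[0].keys()) if rows else []
    # 2. and 3. Keep only columns with more than one distinct non-missing value
    keep = []
    for c in columns:
        distinct = {r[c] for r in rows if r[c] is not None}
        if len(distinct) > 1:
            keep.append(c)
    return [{c: r[c] for c in keep} for r in rows]


# Toy data (hypothetical): "b" is constant, "c" is empty, the last row is empty
data = [
    {"a": 1, "b": 5, "c": None, "d": "x"},
    {"a": 2, "b": 5, "c": None, "d": "y"},
    {"a": None, "b": None, "c": None, "d": None},
]
cleaned = drop_uninformative(data)
print(cleaned)
```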

Refactor AutoML.predict

How best to get diversity in final ensemble?

Hi. I'm running an experiment with standard XGboost parameters. In the end I am surprised to see that only 2 models are retained for the ensemble and usually one is significantly more weighted than the other (11 to 1). What parameters can I change in order to get more models in the ensemble (assuming just XGboost)? So far I have tried 10 initial and 15 initial models as well as 5 hill climbing steps and retain 5 models for improvements but with little difference. Thanks

warning when importing MLJ

from supervised.automl import AutoML generates this message:

/home/ubuntu/anaconda3/envs/mlj/lib/python3.6/site-packages/scikit_learn-0.21.3-py3.6-linux-x86_64.egg/sklearn/externals/joblib/__init__.py:15: DeprecationWarning: sklearn.externals.joblib is deprecated in 0.21 and will be removed in 0.23. Please import this functionality directly from joblib, which can be installed with: pip install joblib. If this warning is raised when loading pickled models, you may need to re-serialize those models with scikit-learn 0.21+.
warnings.warn(msg, category=DeprecationWarning)

Thanks

Problem with cross validation

Hi Piotr:

Cross-validation no longer works (with or without shuffle). For instance:

automl._validation = {"validation_type": "kfold", "k_folds": 15, "shuffle": False, "stratify": True}
used to work but now generates this error:

AutoML task to be solved: binary_classification
AutoML will use algorithms: ['Xgboost']
AutoML will optimize for metric: logloss
AutoML will try to check about 28 models
Setting a random_state has no effect since shuffle is False. This will raise an error in 0.24. You should leave random_state to its default (None), or set shuffle=True.

ValueError Traceback (most recent call last)
in

~/anaconda3/envs/mlj_shap_2/lib/python3.6/site-packages/supervised/automl.py in fit(self, X_train, y_train, X_validation, y_validation)
520
521 for params in generated_params:
--> 522 self.train_model(params)
523 # hill climbing
524 for params in tuner.get_hill_climbing_params(self._models):

~/anaconda3/envs/mlj_shap_2/lib/python3.6/site-packages/supervised/automl.py in train_model(self, params)
262 raise AutoMLException(f"Cannot create directory {model_path}")
263
--> 264 mf.train() # {"train": {"X": X, "y": y}})
265
266 mf.save(model_path)

~/anaconda3/envs/mlj_shap_2/lib/python3.6/site-packages/supervised/model_framework.py in train(self)
107 np.random.seed(self.learner_params["seed"])
108
--> 109 self.validation = ValidationStep(self.validation_params)
110
111 for k_fold in range(self.validation.get_n_splits()):

~/anaconda3/envs/mlj_shap_2/lib/python3.6/site-packages/supervised/validation/validation_step.py in __init__(self, params)
21
22 if self.validation_type == "kfold":
---> 23 self.validator = KFoldValidator(params)
24 else:
25 raise Exception("Other validation types are not implemented yet!")

~/anaconda3/envs/mlj_shap_2/lib/python3.6/site-packages/supervised/validation/validator_kfold.py in __init__(self, params)
57
58 for fold_cnt, (train_index, validation_index) in enumerate(
---> 59 self.skf.split(X, y)
60 ):
61

~/anaconda3/envs/mlj_shap_2/lib/python3.6/site-packages/sklearn/model_selection/_split.py in split(self, X, y, groups)
728 to an integer.
729 """
--> 730 y = check_array(y, ensure_2d=False, dtype=None)
731 return super().split(X, y, groups)
732

~/anaconda3/envs/mlj_shap_2/lib/python3.6/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
576 if force_all_finite:
577 _assert_all_finite(array,
--> 578 allow_nan=force_all_finite == 'allow-nan')
579
580 if ensure_min_samples > 0:

~/anaconda3/envs/mlj_shap_2/lib/python3.6/site-packages/sklearn/utils/validation.py in _assert_all_finite(X, allow_nan, msg_dtype)
58 msg_err.format
59 (type_err,
---> 60 msg_dtype if msg_dtype is not None else X.dtype)
61 )
62 # for object dtype data, we only check for NaNs (GH-13254)

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

automl._validation = {"validation_type": "kfold", "k_folds": 15, "shuffle": True, "stratify": True} also generates an error:

AutoML task to be solved: binary_classification
AutoML will use algorithms: ['Xgboost']
AutoML will optimize for metric: logloss
AutoML will try to check about 28 models

ValueError Traceback (most recent call last)
in

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

thanks
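The traceback above ends in `check_array(y, ensure_2d=False, dtype=None)`, called from the splitter's `split`, so the value being rejected is in the target `y` rather than in the features. A minimal sketch for locating such values before calling `fit` follows. It uses only the standard library; the helper name `non_finite_positions` and the sample target are made up for illustration and are not part of mljar-supervised.

```python
import math

def non_finite_positions(y):
    """Return indices of target values that are missing (None), NaN, or infinite."""
    bad = []
    for i, value in enumerate(y):
        if value is None:
            bad.append(i)
            continue
        try:
            if not math.isfinite(float(value)):
                bad.append(i)
        except (TypeError, ValueError):
            # Non-numeric labels (for example strings) are skipped here;
            # this check only looks for NaN / infinity in numeric targets.
            continue
    return bad

# Hypothetical target column with two problematic entries.
y_train = [0, 1, float("nan"), 1, None, 0]
bad_rows = non_finite_positions(y_train)
print(bad_rows)  # rows to inspect or drop before training
```

If the list is non-empty, dropping or imputing those rows in both `X` and `y` before training is one way to get past the `ValueError`, regardless of the `shuffle` setting.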

set path for catboost snapshot

Refactor AutoML additional_metrics method

provide labels for true classes

When working with imbalanced datasets, a class may be underrepresented to the point where y_true and y_pred nearly always contain a different number of classes (for example, one class is missing from the predicted values). Because of this, mljar oftentimes cannot be used for imbalanced datasets.

I have attached the error below:

MLJAR AutoML:   0%|          | 0/80 [00:00<?, ?model/s]
Traceback (most recent call last):
...
  File "/home/shoe/.virtualenvs/2ravens/lib/python3.6/site-packages/supervised/automl.py", line 256, in fit
    self.not_so_random_step(X, y)
  File "/home/shoe/.virtualenvs/2ravens/lib/python3.6/site-packages/supervised/automl.py", line 207, in not_so_random_step
    m = self.train_model(params, X, y)
  File "/home/shoe/.virtualenvs/2ravens/lib/python3.6/site-packages/supervised/automl.py", line 164, in train_model
    il.train({"train": {"X": X, "y": y}})
  File "/home/shoe/.virtualenvs/2ravens/lib/python3.6/site-packages/supervised/iterative_learner_framework.py", line 75, in train
    self.predictions(learner, train_data, validation_data),
  File "/home/shoe/.virtualenvs/2ravens/lib/python3.6/site-packages/supervised/callbacks/callback_list.py", line 23, in on_iteration_end
    cb.on_iteration_end(logs, predictions)
  File "/home/shoe/.virtualenvs/2ravens/lib/python3.6/site-packages/supervised/callbacks/early_stopping.py", line 59, in on_iteration_end
    predictions.get("y_train_true"), predictions.get("y_train_predicted")
  File "/home/shoe/.virtualenvs/2ravens/lib/python3.6/site-packages/supervised/metric.py", line 58, in __call__
    return self.metric(y_true, y_predicted)
  File "/home/shoe/.virtualenvs/2ravens/lib/python3.6/site-packages/supervised/metric.py", line 24, in logloss
    ll = log_loss(y_true, y_predicted)
  File "/home/shoe/.virtualenvs/2ravens/lib/python3.6/site-packages/sklearn/metrics/classification.py", line 1809, in log_loss
    lb.classes_))
ValueError: y_true and y_pred contain different number of classes 3, 2. Please provide the true labels explicitly through the labels argument. Classes found in y_true: [0 1 2]
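The last frame shows `log_loss` seeing 3 classes in `y_true` but only 2 probability columns in the predictions. One possible workaround is to pad the predicted probabilities so there is one column per known label before scoring. The sketch below does this in plain Python; it is not taken from mljar's code, and the helper names (`align_probabilities`, `log_loss_with_labels`) and the sample numbers are invented for illustration. In the opposite case, where a fold's `y_true` lacks a class, scikit-learn's `log_loss` accepts a `labels` argument, as the error message suggests.

```python
import math

def align_probabilities(proba_rows, model_classes, all_labels):
    """Expand each probability row to one column per label in all_labels.

    Classes the model never saw during training get probability 0.0.
    """
    column_of = {cls: i for i, cls in enumerate(model_classes)}
    return [
        [row[column_of[label]] if label in column_of else 0.0 for label in all_labels]
        for row in proba_rows
    ]

def log_loss_with_labels(y_true, proba_rows, all_labels, eps=1e-15):
    """Mean negative log-likelihood, with probabilities clipped to [eps, 1 - eps]."""
    index_of = {label: i for i, label in enumerate(all_labels)}
    total = 0.0
    for true_label, row in zip(y_true, proba_rows):
        p = min(max(row[index_of[true_label]], eps), 1 - eps)
        total -= math.log(p)
    return total / len(y_true)

# Hypothetical fold: the model was trained on classes 0 and 1 only,
# but the evaluated rows also contain class 2.
all_labels = [0, 1, 2]
y_true = [0, 1, 2]
proba = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]

aligned = align_probabilities(proba, model_classes=[0, 1], all_labels=all_labels)
loss = log_loss_with_labels(y_true, aligned, all_labels)
print(aligned[2], loss)
```

The score is heavily penalised for the row whose true class received zero probability, which is the expected behaviour of log loss; the point of the sketch is only that the metric can be computed instead of raising.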

Does it work with regression problems?

Does it work with regression problems? When I try to use it, it treats the labels as classes.
Could you show me a regression example in Python?

Add tests for ensemble save and load

Add tests for ensemble save and load. This can be done:

by using an existing learner,
or by writing a simple mock of the learner framework.

Add validation with split

Add 2 new options for validation.

Validation with a split.
Validation with a separate dataset. (#101)
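As a rough illustration of the first option, the sketch below splits row indices into a training part and a validation part. It is only an assumption about how such a split could work, not the implementation that was added to the package; the function name and the `train_ratio`, `shuffle`, and `seed` parameters are hypothetical.

```python
import random

def train_validation_split(n_rows, train_ratio=0.75, shuffle=True, seed=42):
    """Return (train_indices, validation_indices) for a single hold-out split."""
    if not 0.0 < train_ratio < 1.0:
        raise ValueError("train_ratio must be between 0 and 1")
    indices = list(range(n_rows))
    if shuffle:
        # A seeded generator keeps the split reproducible between runs.
        random.Random(seed).shuffle(indices)
    cut = round(n_rows * train_ratio)
    return indices[:cut], indices[cut:]

train_idx, valid_idx = train_validation_split(8, train_ratio=0.75)
print(len(train_idx), len(valid_idx))
```

The second option, validation with a separate dataset, needs no splitting at all: the caller supplies the validation rows directly.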

Docs for AutoML api

Add option to treat any column as categorical

Right now, only columns that the system detects as categorical are converted to numbers. It would be nice to have an option to manually select which columns should be treated as categorical.
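A minimal sketch of what such an option might do: the caller names the columns, and each distinct value in those columns is mapped to an integer code, even if the values look numeric. The function `encode_categorical`, its `categorical_columns` argument, and the sample rows are hypothetical and only illustrate the request.

```python
def encode_categorical(rows, categorical_columns):
    """Replace values in the chosen columns with integer codes.

    rows is a list of dicts; returns (encoded_rows, mappings) where
    mappings[column] maps each original value to its integer code.
    """
    mappings = {col: {} for col in categorical_columns}
    encoded = []
    for row in rows:
        new_row = dict(row)  # leave the caller's data untouched
        for col in categorical_columns:
            mapping = mappings[col]
            value = row[col]
            if value not in mapping:
                mapping[value] = len(mapping)  # next unused code
            new_row[col] = mapping[value]
        encoded.append(new_row)
    return encoded, mappings

# "zip" holds numbers, so automatic detection could miss it.
rows = [
    {"zip": 90210, "age": 31},
    {"zip": 10001, "age": 45},
    {"zip": 90210, "age": 27},
]
encoded, mappings = encode_categorical(rows, categorical_columns=["zip"])
print(encoded)
```

Keeping the mappings around matters in practice, because the same codes have to be applied again at prediction time.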

Add multiclass classification

Add support for multiclass classification. The machine learning task should be automatically detected or set manually in AutoML constructor.
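One way the automatic detection could work is a heuristic on the target column, sketched below. This is an assumption for illustration, not the rule the package uses: the function name `guess_ml_task`, the `max_classes` threshold, and the task names other than `binary_classification` (which appears in the AutoML log output) are hypothetical.

```python
def guess_ml_task(y, max_classes=20):
    """Guess the task type from the distinct values of the target."""
    unique = set(y)
    if len(unique) < 2:
        raise ValueError("Target needs at least two distinct values")
    if len(unique) == 2:
        return "binary_classification"
    if any(isinstance(v, str) for v in unique):
        # Text labels can only be classes.
        return "multiclass_classification"
    whole_numbers = all(float(v).is_integer() for v in unique)
    if whole_numbers and len(unique) <= max_classes:
        return "multiclass_classification"
    # Many distinct values or fractional values: treat as a continuous target.
    return "regression"

print(guess_ml_task([0, 1, 0, 1]))
print(guess_ml_task([0, 1, 2, 1]))
print(guess_ml_task([0.5, 1.7, 2.3]))
```

Any heuristic like this can misfire, for example on an integer-valued regression target with few distinct values, which is why a manual override in the constructor is still useful.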

Latest version error with 'start_random_models'

Just got this error during:

automl = AutoML(results_path="mlj_v2_res_1", total_time_limit=360,algorithms=model_types,train_ensemble=True,start_random_models=20,hill_climbing_steps=2,top_models_to_improve=2)

TypeError: __init__() got an unexpected keyword argument 'start_random_models'
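When a release renames or drops constructor arguments, it can help to compare the keyword arguments you pass against the installed constructor's signature. The sketch below does this with `inspect.signature`. The class `AutoMLStandIn` is a made-up stand-in so the example runs on its own; its parameter list is an assumption and does not describe the real `AutoML` constructor.

```python
import inspect

class AutoMLStandIn:
    """Hypothetical stand-in class; not the real supervised AutoML."""

    def __init__(self, results_path=None, total_time_limit=3600,
                 algorithms=None, train_ensemble=True):
        self.results_path = results_path
        self.total_time_limit = total_time_limit
        self.algorithms = algorithms
        self.train_ensemble = train_ensemble

def split_supported_kwargs(cls, kwargs):
    """Split kwargs into those named in cls.__init__ and those that are not."""
    accepted = inspect.signature(cls.__init__).parameters
    supported = {k: v for k, v in kwargs.items() if k in accepted}
    rejected = sorted(k for k in kwargs if k not in accepted)
    return supported, rejected

wanted = {
    "results_path": "mlj_v2_res_1",
    "total_time_limit": 360,
    "start_random_models": 20,
}
supported, rejected = split_supported_kwargs(AutoMLStandIn, wanted)
print(rejected)  # arguments the installed version does not accept
model = AutoMLStandIn(**supported)
```

Running the same check against the real class in your environment shows which arguments the installed version accepts. A constructor that takes `**kwargs` would need a different check, since this one only matches explicitly named parameters.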

plot learning curves for algorithms

Save learning curves for algorithms

final output can be confusing

This is an example output after all models are done running but there are no column headers to understand what this is.

0 10000000000000.0 0.6530156150511344
1 0.6530156150511344 0.6530156150511344
2 0.6530156150511344 0.6530156151127985
3 0.6530156150511344 0.6529712545924589
4 0.6529712545924589 0.652852026991749
5 0.652852026991749 0.6528075224163983
6 0.6528075224163983 0.6527930299503968

learning_curves chart does not display properly

The charts for each model are blank.
See attached. Thanks

