
bayeswitnesses / m2cgen


Transform ML models into native code (Java, C, Python, Go, JavaScript, Visual Basic, C#, R, PowerShell, PHP, Dart, Haskell, Ruby, F#, Rust) with zero dependencies

License: MIT License

Python 50.79% Java 3.80% C 3.20% Dockerfile 0.22% Makefile 0.10% Go 3.50% JavaScript 1.37% C# 3.69% Visual Basic .NET 3.69% VBA 0.14% PowerShell 4.45% Shell 0.03% R 3.56% PHP 3.21% Dart 4.19% Haskell 3.35% Ruby 3.03% F# 3.14% FreeBasic 0.40% Rust 4.15%
machine-learning scikit-learn statistical-learning xgboost lightgbm java python c javascript go csharp php r dartlang statsmodels lightning haskell ruby rust

m2cgen's Introduction

m2cgen


m2cgen (Model 2 Code Generator) is a lightweight library that provides an easy way to transpile trained statistical models into native code (Python, C, Java, Go, JavaScript, Visual Basic, C#, PowerShell, R, PHP, Dart, Haskell, Ruby, F#, Rust, Elixir).

Installation

Supported Python version is >= 3.7.

pip install m2cgen

Development

Make sure the following command runs successfully before submitting a PR:

make pre-pr

Alternatively you can run the Docker version of the same command:

make docker-build docker-pre-pr

Supported Languages

  • C
  • C#
  • Dart
  • F#
  • Go
  • Haskell
  • Java
  • JavaScript
  • PHP
  • PowerShell
  • Python
  • R
  • Ruby
  • Rust
  • Visual Basic (VBA-compatible)
  • Elixir

Supported Models

Linear
  • Classification
    • scikit-learn
      • LogisticRegression
      • LogisticRegressionCV
      • PassiveAggressiveClassifier
      • Perceptron
      • RidgeClassifier
      • RidgeClassifierCV
      • SGDClassifier
    • lightning
      • AdaGradClassifier
      • CDClassifier
      • FistaClassifier
      • SAGAClassifier
      • SAGClassifier
      • SDCAClassifier
      • SGDClassifier
  • Regression
    • scikit-learn
      • ARDRegression
      • BayesianRidge
      • ElasticNet
      • ElasticNetCV
      • GammaRegressor
      • HuberRegressor
      • Lars
      • LarsCV
      • Lasso
      • LassoCV
      • LassoLars
      • LassoLarsCV
      • LassoLarsIC
      • LinearRegression
      • OrthogonalMatchingPursuit
      • OrthogonalMatchingPursuitCV
      • PassiveAggressiveRegressor
      • PoissonRegressor
      • RANSACRegressor (only supported regression estimators can be used as a base estimator)
      • Ridge
      • RidgeCV
      • SGDRegressor
      • TheilSenRegressor
      • TweedieRegressor
    • StatsModels
      • Generalized Least Squares (GLS)
      • Generalized Least Squares with AR Errors (GLSAR)
      • Generalized Linear Models (GLM)
      • Ordinary Least Squares (OLS)
      • [Gaussian] Process Regression Using Maximum Likelihood-based Estimation (ProcessMLE)
      • Quantile Regression (QuantReg)
      • Weighted Least Squares (WLS)
    • lightning
      • AdaGradRegressor
      • CDRegressor
      • FistaRegressor
      • SAGARegressor
      • SAGRegressor
      • SDCARegressor
      • SGDRegressor
SVM
  • Classification
    • scikit-learn
      • LinearSVC
      • NuSVC
      • OneClassSVM
      • SVC
    • lightning
      • KernelSVC
      • LinearSVC
  • Regression
    • scikit-learn
      • LinearSVR
      • NuSVR
      • SVR
    • lightning
      • LinearSVR
Tree
  • Classification
    • DecisionTreeClassifier
    • ExtraTreeClassifier
  • Regression
    • DecisionTreeRegressor
    • ExtraTreeRegressor
Random Forest
  • Classification
    • ExtraTreesClassifier
    • LGBMClassifier (rf booster only)
    • RandomForestClassifier
    • XGBRFClassifier
  • Regression
    • ExtraTreesRegressor
    • LGBMRegressor (rf booster only)
    • RandomForestRegressor
    • XGBRFRegressor
Boosting
  • Classification
    • LGBMClassifier (gbdt/dart/goss booster only)
    • XGBClassifier (gbtree (including boosted forests)/gblinear booster only)
  • Regression
    • LGBMRegressor (gbdt/dart/goss booster only)
    • XGBRegressor (gbtree (including boosted forests)/gblinear booster only)

    You can find versions of packages with which compatibility is guaranteed by CI tests here. Other versions may also work, but they are untested.

    Classification Output

    Linear / Linear SVM / Kernel SVM

    Binary

    Scalar value; signed distance of the sample to the hyperplane for the second class.

    Multiclass

    Vector value; signed distance of the sample to the hyperplane per each class.

    Comment

    The output is consistent with the output of LinearClassifierMixin.decision_function.
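Because the generated code returns raw decision values rather than labels, the caller has to map them to a class. The sketch below shows one way to do that, following the usual decision_function convention (a positive scalar selects the second class, the largest vector entry wins). The helper name `predict_label` and the example class names are hypothetical and not part of m2cgen.

```python
def predict_label(scores, classes):
    """Map decision_function-style output to a class label.

    `scores` is a scalar for binary models and a list for multiclass models.
    `classes` lists the labels in the order the model was trained with.
    """
    if isinstance(scores, (int, float)):
        # Binary case: the scalar is the signed distance for the second class.
        return classes[1] if scores > 0 else classes[0]
    # Multiclass case: pick the class with the largest signed distance.
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return classes[best_index]


binary_label = predict_label(-0.7, ["negative", "positive"])
multiclass_label = predict_label([-1.2, 0.4, 0.1], ["cat", "dog", "bird"])
```

For tree-based models, which return class probabilities, only the second branch (taking the index of the largest entry) applies.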

    SVM

    Outlier detection

    Scalar value; signed distance of the sample to the separating hyperplane: positive for an inlier and negative for an outlier.

    Binary

    Scalar value; signed distance of the sample to the hyperplane for the second class.

    Multiclass

    Vector value; one-vs-one score for each class, shape (n_samples, n_classes * (n_classes-1) / 2).

    Comment

    The output is consistent with the output of BaseSVC.decision_function when the decision_function_shape is set to ovo.

    Tree / Random Forest / Boosting

    Binary

    Vector value; class probabilities.

    Multiclass

    Vector value; class probabilities.

    Comment

    The output is consistent with the output of the predict_proba method of DecisionTreeClassifier / ExtraTreeClassifier / ExtraTreesClassifier / RandomForestClassifier / XGBRFClassifier / XGBClassifier / LGBMClassifier.

    Usage

    Here's a simple example of how a linear model trained in a Python environment can be represented in Java code:

    from sklearn.datasets import load_diabetes
    from sklearn import linear_model
    import m2cgen as m2c
    
    X, y = load_diabetes(return_X_y=True)
    
    estimator = linear_model.LinearRegression()
    estimator.fit(X, y)
    
    code = m2c.export_to_java(estimator)

    Generated Java code:

    public class Model {
        public static double score(double[] input) {
            return ((((((((((152.1334841628965) + ((input[0]) * (-10.012197817470472))) + ((input[1]) * (-239.81908936565458))) + ((input[2]) * (519.8397867901342))) + ((input[3]) * (324.39042768937657))) + ((input[4]) * (-792.1841616283054))) + ((input[5]) * (476.74583782366153))) + ((input[6]) * (101.04457032134408))) + ((input[7]) * (177.06417623225025))) + ((input[8]) * (751.2793210873945))) + ((input[9]) * (67.62538639104406));
        }
    }

    You can find more examples of generated code for different models/languages here.

    CLI

    m2cgen can be used as a CLI tool to generate code using serialized model objects (pickle protocol):

    $ m2cgen <pickle_file> --language <language> [--indent <indent>] [--function_name <function_name>]
             [--class_name <class_name>] [--module_name <module_name>] [--package_name <package_name>]
             [--namespace <namespace>] [--recursion-limit <recursion_limit>]
    

    Don't forget that for unpickling serialized model objects their classes must be defined in the top level of an importable module in the unpickling environment.

    Piping is also supported:

    $ cat <pickle_file> | m2cgen --language <language>
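A minimal sketch of producing the pickle file that the CLI reads. The helper name `dump_model` and the stand-in dictionary are assumptions made for illustration; in practice the object would be a fitted estimator from one of the supported libraries.

```python
import pickle
import tempfile
from pathlib import Path


def dump_model(model, path):
    """Serialize a trained model with pickle so the CLI can read it."""
    with open(path, "wb") as f:
        pickle.dump(model, f)


# Stand-in object; replace with a fitted estimator in real use.
stand_in_model = {"coef": [0.5, -2.0], "intercept": 1.5}

with tempfile.TemporaryDirectory() as tmp:
    pickle_path = Path(tmp) / "model.pickle"
    dump_model(stand_in_model, pickle_path)
    # Reading the file back confirms that it round-trips.
    with open(pickle_path, "rb") as f:
        restored = pickle.load(f)
```

The resulting file is what would be passed as `<pickle_file>` in the commands above.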
    

    FAQ

    Q: Generation fails with RecursionError: maximum recursion depth exceeded error.

    A: If this error occurs while generating code using an ensemble model, try to reduce the number of trained estimators within that model. Alternatively you can increase the maximum recursion depth with sys.setrecursionlimit(<new_depth>).
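If raising the limit globally feels too broad, it can be raised only around the export call. The following is a sketch under that assumption: `recursion_limit` is a hypothetical helper, and `nested_depth` merely stands in for a deeply recursive generation step.

```python
import sys
from contextlib import contextmanager


@contextmanager
def recursion_limit(new_depth):
    """Temporarily raise the interpreter's recursion limit, then restore it."""
    old_depth = sys.getrecursionlimit()
    sys.setrecursionlimit(max(old_depth, new_depth))
    try:
        yield
    finally:
        sys.setrecursionlimit(old_depth)


def nested_depth(n):
    """Stand-in for a deeply recursive code generation step."""
    return 0 if n == 0 else 1 + nested_depth(n - 1)


limit_before = sys.getrecursionlimit()
with recursion_limit(limit_before + 5000):
    # This depth would exceed the original limit without the context manager.
    depth_reached = nested_depth(limit_before + 1000)
limit_after = sys.getrecursionlimit()
```

In real use the body of the `with` block would be the export call, for example `code = m2c.export_to_java(model)`.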

    Q: Generation fails with ImportError: No module named <module_name_here> error while transpiling a model from a serialized model object.

    A: This error indicates that the pickle protocol cannot deserialize the model object. To unpickle serialized model objects, their classes must be defined at the top level of an importable module in the unpickling environment. Installing the package that provides the model's class definition should solve the problem.

    Q: Code generated by m2cgen produces different results for some inputs compared to the original Python model from which the code was obtained.

    A: Some models force input data to be of a particular type during the prediction phase in their native Python libraries. Currently, m2cgen works only with the float64 (double) data type. You can try to cast your input data to another type manually and check the results again. Also, some small differences can occur due to the specific implementation of floating-point arithmetic in the target language.
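When a model library works with 32-bit floats internally, the manual cast mentioned above can be tried with the standard library alone. This is only an illustrative sketch: the helper name `to_float32` is hypothetical, and the feature values are made up.

```python
import struct


def to_float32(value):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", value))[0]


features = [0.1, 0.5, 1.0 / 3.0]
features_f32 = [to_float32(x) for x in features]

# Size of the rounding introduced by the cast, per feature.
rounding_errors = [abs(a - b) for a, b in zip(features, features_f32)]
```

Values such as 0.5 that are exactly representable in float32 are unchanged, while values such as 0.1 shift slightly. That is the kind of small discrepancy that can show up between the original model and the generated code.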

    m2cgen's People

    Contributors

    akhvorov, amfonelic, arshamg, aulust, bcampbell-prosper, dependabot-preview[bot], dependabot[bot], izeigerman, krinart, lucasavila00, matbur, mattconflitti, mrshu, strikerrus


    m2cgen's Issues

    Flaky e2e test for the XGBoost model with the 'gblinear' booster

    Context from @StrikerRUS:

    Now Go is failing (refer to #200 (comment)):

    =================================== FAILURES ===================================
    _ test_e2e[xgboost_XGBClassifier - go_lang - train_model_classification_binary2] _
    estimator = XGBClassifier(base_score=0.6, booster='gblinear', colsample_bylevel=None,
                  colsample_bynode=None, colsamp...ambda=0, scale_pos_weight=1, subsample=None,
                  tree_method=None, validate_parameters=False, verbosity=None)
    executor_cls = <class 'tests.e2e.executors.go.GoExecutor'>
    
    ...
    
    expected=[0.04761511 0.9523849 ], actual=[0.047615, 0.952385]
    expected=[0.06296992 0.9370301 ], actual=[0.06297, 0.93703]
    expected=[0.12447995 0.87552005], actual=[0.124479, 0.875521]
    expected=[0.0757848 0.9242152], actual=[0.075784, 0.924216]
    expected=[0.8092151  0.19078489], actual=[0.809212, 0.190788]
    

    BTW, in attempts to check my guess from #200 (comment), I found that coefs in gblinear are also float32:
    https://github.com/dmlc/xgboost/blob/67d267f9da3b15a6e5a8393afae9be921a4e224b/src/gbm/gblinear_model.h#L110

    https://github.com/dmlc/xgboost/blob/67d267f9da3b15a6e5a8393afae9be921a4e224b/src/gbm/gblinear_model.h#L120

    https://github.com/dmlc/xgboost/blob/67d267f9da3b15a6e5a8393afae9be921a4e224b/src/gbm/gblinear_model.h#L82

    https://github.com/dmlc/xgboost/blob/67d267f9da3b15a6e5a8393afae9be921a4e224b/src/gbm/gblinear_model.h#L91

    and from #188 (comment) we know that bst_float is actually float
    https://github.com/dmlc/xgboost/blob/8d06878bf9b778db68ae98f68d99a3557c7ea885/include/xgboost/base.h#L110-L111

    Created dmlc/xgboost#5634.

    Support for LightGBM Booster and XGBoost Booster

    We're training our LightGBM model outside of Python (in Spark), so we need to load it from a model file before passing it to m2c. I don't believe LightGBM can load a model file directly into LGBMRegressor, though; it must be loaded into lgb.Booster.

    It would be nice if m2cgen supported lgb.Booster.

    Example

    import lightgbm as lgb
    import m2cgen as m2c
    
    model = lgb.Booster(model_file='model.txt')
    
    # this fails
    # m2c.export_to_java(model)
    
    # This works but is awkward 
    from lightgbm.sklearn import LGBMRegressor
    r = LGBMRegressor()
    r._Booster = model
    
    code = m2c.export_to_java(r)

    Remove numpy from default PythonInterpreter and potentially introduce PythonNumpyInterpreter

    Right now only Python uses a third-party library (specifically numpy) for linear algebra. This is inconsistent with:

    • our mission "with zero dependencies"
    • other languages.

    The first step was to drop numpy from cases without vectors, implemented in PR #111.

    As the second step I want to suggest dropping numpy altogether from PythonInterpreter and potentially implementing PythonNumpyInterpreter to use in cases where it would be beneficial.

    As for the user API I see 2 options:

    1. Adding a new method export_to_python_with_numpy
    2. Adding a parameter with_numpy to the existing export_to_python method, which would be False by default.

    I personally think the first option is better, as users would have a higher chance of noticing an extra method than an extra parameter with a default value.
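To make the proposal concrete, here is a sketch of the kind of numpy-free vector helpers that generated Python code could rely on. The function names are hypothetical and chosen for illustration; this is not necessarily what the interpreter emits.

```python
def add_vectors(v1, v2):
    """Element-wise sum of two equally sized lists."""
    return [a + b for a, b in zip(v1, v2)]


def mul_vector_number(v, num):
    """Multiply every element of a list by a scalar."""
    return [a * num for a in v]


# Example: a weighted sum of two vectors using plain lists only.
combined = add_vectors(mul_vector_number([1.0, 2.0], 0.5), [0.25, 0.25])
```

Plain lists keep the generated module dependency-free, at the cost of speed on large vectors. That trade-off is the motivation for a separate numpy-based interpreter.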

    Planned support for sklearn pipelines?

    Is there any conceivable way to convert a pipeline that includes other steps like feature extractions, etc?

    I know this would be quite the undertaking if not currently supported. Just really love the idea of converting to no dependencies to move ML functionality to the edge. Great work!

    Thanks!

    PCA

    Could you support PCA transformation (as it's just a matrix multiplication when the algorithm is fitted)?
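For reference, the transformation mentioned here amounts to centering the sample and multiplying by the component matrix. The sketch below uses plain Python lists and made-up parameters. In scikit-learn the fitted values would correspond to the `mean_` and `components_` attributes of a PCA object, assuming whitening is disabled. The helper name `pca_transform` is hypothetical.

```python
def pca_transform(x, mean, components):
    """Project one sample onto the principal components.

    Equivalent to (x - mean) @ components.T for a single sample.
    """
    centered = [xi - mi for xi, mi in zip(x, mean)]
    return [sum(c * v for c, v in zip(row, centered)) for row in components]


# Hypothetical fitted parameters for a 3-feature, 2-component PCA.
mean = [1.0, 2.0, 3.0]
components = [
    [1.0, 0.0, 0.0],
    [0.0, 0.6, 0.8],
]

projected = pca_transform([2.0, 7.0, 3.0], mean, components)
```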

    import m2cgen as m2c error

      File "", line 1, in <module>
      File "/anaconda2/lib/python2.7/site-packages/m2cgen/__init__.py", line 1, in <module>
        from .exporters import export_to_java, export_to_python, export_to_c
      File "/anaconda2/lib/python2.7/site-packages/m2cgen/exporters.py", line 1, in <module>
        from m2cgen import assemblers
      File "/anaconda2/lib/python2.7/site-packages/m2cgen/assemblers/__init__.py", line 1, in <module>
        from .linear import LinearModelAssembler
      File "/anaconda2/lib/python2.7/site-packages/m2cgen/assemblers/linear.py", line 2, in <module>
        from m2cgen.assemblers import utils
      File "/anaconda2/lib/python2.7/site-packages/m2cgen/assemblers/utils.py", line 36
        def apply_op_to_expressions(op, *exprs, to_reuse=False):
        ^
    SyntaxError: invalid syntax

    how to give input to the generated code

    Error: Main method not found in class Extratreesregressor, please define the main method as:
    public static void main(String[] args)

    or a JavaFX application class must extend javafx.application.Application
    How do I run it without a main file, and what is the input to the generated code? Kindly help, I'm a newbie.

    NotImplementedError: Model int is not supported OpenNMT-py

    Traceback (most recent call last):
      File "/Library/Frameworks/Python.framework/Versions/3.6/bin/m2cgen", line 10, in <module>
        sys.exit(main())
      File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/m2cgen/cli.py", line 85, in main
        print(generate_code(args))
      File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/m2cgen/cli.py", line 80, in generate_code
        return exporter(model, **kwargs)
      File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/m2cgen/exporters.py", line 47, in export_to_python
        return _export(model, interpreter)
      File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/m2cgen/exporters.py", line 70, in _export
        assembler_cls = assemblers.get_assembler_cls(model)
      File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/m2cgen/assemblers/__init__.py", line 76, in get_assembler_cls
        "Model {} is not supported".format(model_name))
    NotImplementedError: Model int is not supported

    How to run generated java code without main method?

    Sorry, I'm a newbie in Java, but I need to run this code for an ML program.

    How do I run this code?

    public class Model {

        public static double score(double[] input) {
            return (((((((((((((36.45948838508965) + ((input[0]) * (-0.10801135783679647))) + ((input[1]) * (0.04642045836688297))) + ((input[2]) * (0.020558626367073608))) + ((input[3]) * (2.6867338193449406))) + ((input[4]) * (-17.76661122830004))) + ((input[5]) * (3.8098652068092163))) + ((input[6]) * (0.0006922246403454562))) + ((input[7]) * (-1.475566845600257))) + ((input[8]) * (0.30604947898516943))) + ((input[9]) * (-0.012334593916574394))) + ((input[10]) * (-0.9527472317072884))) + ((input[11]) * (0.009311683273794044))) + ((input[12]) * (-0.5247583778554867));
        }

    }

    Without a main method, I'm a bit confused. Thanks!
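The generated class has no main method because it only exposes a static `score` function. The caller is expected to invoke it from their own program (in Java, something like `Model.score(new double[]{...})` inside a `main` method the user writes), passing one sample's feature values in the same column order used for training. The Python sketch below mirrors that shape with made-up coefficients to show what the input looks like. All numbers here are hypothetical and unrelated to the model above.

```python
def score(input):
    """Hand-written analogue of a generated linear model.

    `input` holds one sample's features in training column order; the name
    mirrors the parameter used in the generated code.
    """
    # Hypothetical intercept and coefficients for a 3-feature model.
    intercept = 1.5
    coefficients = [0.5, -2.0, 0.25]
    return intercept + sum(w * x for w, x in zip(coefficients, input))


# One sample with exactly as many values as the model has features.
sample = [2.0, 1.0, 4.0]
prediction = score(sample)
```

For the 13-feature model in this question, the array passed to `score` would need 13 values.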

    In Boosting Assembler wrapping each estimator into a subroutine causes a performance degradation

    I've recalled the real motivation behind not wrapping every individual estimator into its own subroutine: generating many nested function calls leads to performance degradation in Java. The observed difference reaches 4x for larger models (e.g. XGBoost with 1000 estimators). The basic test I created (sorry about Scala):

    @ import com.github.m2cgen.ModelOld
    import com.github.m2cgen.ModelOld
    
    @ import com.github.m2cgen.ModelNew
    import com.github.m2cgen.ModelNew
    
    @ def nextRandomData(): Array[Double] = (0 until 4).map(_ => Random.nextDouble).toArray
    defined function nextRandomData
    
    @ def testScore: Unit = {
        val start = System.currentTimeMillis()
        (0 until 100000).foreach(_ => <ModelNew|ModelOld>.score(nextRandomData))
        println("Runtime: " + (System.currentTimeMillis() - start).toString)
      }

    Results for ModelOld:

    @ testScore
    Runtime: 2973
    

    For ModelNew:

    @ testScore
    Runtime: 10747
    

    The test model was trained on the sklearn.datasets.load_iris() dataset. The classifier was created as follows:

    model = XGBClassifier(n_estimators=1000)
    

    In the attached archive I included the following:

    1. ModelNew.java - java code generated with the most recent master.
    2. ModelOld.java - java code generated with the release 0.5.0 version.
    3. Models.jar - the jar containing both compiled sources.
    4. xgboost_model2 - the trained estimator in Pickle format.

    CC: @StrikerRUS FYI

    sigmoid and softmax as language-specific functions

    Maybe it would be better to require supported languages to implement sigmoid and softmax functions rather than defining them as expressions? That would improve the readability of the generated code and speed it up through more efficient native implementations.

    Also, we can fall back to the current expressions when the functions are missing.

    I'm raising this issue because I think we currently have an inconsistency: we require languages to implement a Tanh function, but define sigmoid as an expression, even though both can be written via the Exp function.
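To illustrate what a language-specific implementation could look like, here is a sketch of numerically stable sigmoid and softmax functions in Python, plus the identity that ties tanh to sigmoid. These are illustrative helpers written for this discussion, not code taken from m2cgen.

```python
import math


def sigmoid(x):
    """Logistic function 1 / (1 + exp(-x)), written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # x is negative here, so exp(x) cannot overflow
    return z / (1.0 + z)


def softmax(xs):
    """Softmax with the maximum subtracted first for numerical stability."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def tanh_via_sigmoid(x):
    """tanh(x) = 2 * sigmoid(2x) - 1, so both reduce to the Exp function."""
    return 2.0 * sigmoid(2.0 * x) - 1.0


probabilities = softmax([1.0, 2.0, 3.0])
```

Unlike a literal expression built from Exp, a native function can branch on the sign of its argument, which is what keeps large inputs from overflowing.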

    def sigmoid_expr(expr, to_reuse=False):
        neg_expr = ast.BinNumExpr(ast.NumVal(0), expr, ast.BinNumOpType.SUB)
        exp_expr = ast.ExpExpr(neg_expr)
        return ast.BinNumExpr(
            ast.NumVal(1),
            ast.BinNumExpr(ast.NumVal(1), exp_expr, ast.BinNumOpType.ADD),
            ast.BinNumOpType.DIV,
            to_reuse=to_reuse)


    def softmax_exprs(exprs):
        exp_exprs = [ast.ExpExpr(e, to_reuse=True) for e in exprs]
        exp_sum_expr = apply_op_to_expressions(ast.BinNumOpType.ADD, *exp_exprs,
                                               to_reuse=True)
        return [
            ast.BinNumExpr(e, exp_sum_expr, ast.BinNumOpType.DIV)
            for e in exp_exprs
        ]

    Function Tanh(ByVal number As Double) As Double
        If number > 44.0 Then ' exp(2*x) <= 2^127
            Tanh = 1.0
            Exit Function
        End If
        If number < -44.0 Then
            Tanh = -1.0
            Exit Function
        End If
        Tanh = (Math.Exp(2 * number) - 1) / (Math.Exp(2 * number) + 1)
    End Function

    C# support

    Is there a roadmap for converting models to C# code? I work at a Microsoft shop and this would be great to use instead of ML.NET since this is so much more lightweight. Thanks!

    Reduce RAM and ROM footprint

    I'm using m2cgen to convert some classifiers to C. It works great and the results are consistent, thanks for the library!

    1. The compiled binaries are too large to fit on my embedded device. I checked, and they are around double the size of the binaries created with e.g. sklearn_porter. However, m2cgen is the only library that can convert my Python classifiers to C without introducing errors into the classification.
    2. Even if I reduce the size of the classifier, the RAM of the device is exceeded (think of something in the kB range).

    Do you have any idea how the footprint of the C code could be reduced?

    Code generated for XGBoost models returns invalid scores when tree_method is set to "hist"

    I have trained XGBoost models in Python and am using the CLI to convert the serialized models to pure Python. However, when I use the pure Python code, the results differ from the predictions made with the model directly.

    Python 3.7
    xgboost 0.90

    My model has a large number of parameters (somewhat over 500).
    Here are predicted class probabilities from the original model:
    image

    Here are the same predicted probabilities using the generated python code via m2cgen:
    image

    We can see that the results are similar but not the same. The result is a significant number of cases that are moved into different classes between the two sets of predictions.

    I have also tested this with binary classification models and have the same issues.

    convert lightgbm gbdt bug

    The export function fails with a "Model Booster is not supported" error when using a LightGBM model trained with 'gbdt'. Below is a code snippet.

    import os
    import h5py
    
    import lightgbm as lgb
    import numpy as np
    import m2cgen as m2c
    
    with h5py.File('./sample.hdf5') as f:
        X, y = f['X'][()], f['y'][()]
    
    dtrain=lgb.Dataset(X[:1000,:],label=y[:1000])
    
    param = {
        'boosting_type': 'gbdt',
        'objective': 'binary',
        'num_leaves':2**4,
        'max_depth': 4, 
        'learning_rate': 0.1,
        'verbose': 0}
    n_estimators = 5
    
    bst = lgb.train(param, dtrain, n_estimators)
    code = m2c.export_to_c(bst)
    

    Additionally, here is the error output:

    ---------------------------------------------------------------------------
    NotImplementedError                       Traceback (most recent call last)
    <ipython-input-5-226faacde9f0> in <module>
    ----> 1 code = m2c.export_to_java(bst)
          2 print(code)
    
    C:\ProgramData\Anaconda3\envs\light_gbm\lib\site-packages\m2cgen\exporters.py in export_to_java(model, package_name, class_name, indent)
         26         class_name=class_name,
         27         indent=indent)
    ---> 28     return _export(model, interpreter)
         29 
         30 
    
    C:\ProgramData\Anaconda3\envs\light_gbm\lib\site-packages\m2cgen\exporters.py in _export(model, interpreter)
         87 
         88 def _export(model, interpreter):
    ---> 89     assembler_cls = assemblers.get_assembler_cls(model)
         90     model_ast = assembler_cls(model).assemble()
         91     return interpreter.interpret(model_ast)
    
    C:\ProgramData\Anaconda3\envs\light_gbm\lib\site-packages\m2cgen\assemblers\__init__.py in get_assembler_cls(model)
         74     if not assembler_cls:
         75         raise NotImplementedError(
    ---> 76             "Model {} is not supported".format(model_name))
         77 
         78     return assembler_cls
    
    NotImplementedError: Model Booster is not supported
    

    add option to save generated code into file

    I'm sorry if I missed this functionality, but the CLI version certainly doesn't have it (I saw the related code only in generate_code_examples.py). I guess it would be very useful for eliminating the copy-paste step, especially for large models.

    Of course, piping is a solution, but not for development in a Jupyter Notebook, for example.
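Until such an option exists, the string returned by an exporter can be written to disk by hand, which also works inside a notebook. A minimal sketch follows. The helper name `save_generated_code` is hypothetical, and the one-line Java string stands in for real generated code.

```python
import tempfile
from pathlib import Path


def save_generated_code(code, path):
    """Write generated source code to `path`, creating parent folders."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(code, encoding="utf-8")
    return target


# Stand-in for the string returned by e.g. m2c.export_to_java(estimator).
generated = "public class Model {}\n"

with tempfile.TemporaryDirectory() as tmp:
    saved = save_generated_code(generated, Path(tmp) / "out" / "Model.java")
    round_trip = saved.read_text(encoding="utf-8")
```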

    RecursionError: maximum recursion depth exceeded

    The problem here is the number of columns. My df shape is [1428 rows x 3100 columns]
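The traceback in this report ends in a helper that recurses once per expression. With roughly 3100 feature columns, that needs a call depth above CPython's default recursion limit of 1000. The sketch below contrasts a recursive left fold with an iterative one on plain numbers. It is an illustration of the failure mode, not m2cgen's implementation, and the function names are hypothetical.

```python
from functools import reduce


def fold_recursive(op, exprs):
    """Left fold that recurses once per element; depth grows with len(exprs)."""
    def _inner(current, rest):
        if not rest:
            return current
        return _inner(op(current, rest[0]), rest[1:])

    return _inner(op(exprs[0], exprs[1]), exprs[2:])


def fold_iterative(op, exprs):
    """Same left fold expressed as a loop, so the call depth stays constant."""
    return reduce(op, exprs)


def add(a, b):
    return a + b


features = list(range(3100))
total = fold_iterative(add, features)

# Under the default limit the recursive variant is expected to fail here.
try:
    fold_recursive(add, features)
    recursive_ok = True
except RecursionError:
    recursive_ok = False
```

Both variants build the same left-nested result on short inputs; only the iterative one is independent of the number of features.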

    CODE:

    #%%
    
    import pandas as pd
    
    df = pd.read_csv("doc_vector.csv")
    
    #%%
    
    df = df.drop(columns=["document", "keywords"])
    
    #%%
    
    X = df.drop(columns=["category"]).values
    y = df.filter(["category"]).values.ravel()
    
    #%%
    
    from sklearn.model_selection import train_test_split
    
    X_train, X_test, y_train, y_test = train_test_split(
        X,
        y,
        test_size=0.33,
        random_state=42
    )
    
    #%%
    
    from sklearn.svm import SVC
    from sklearn.model_selection import GridSearchCV
    
    parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
                         'C': [1, 10, 100, 1000]},
                        {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]
    svc = SVC(gamma='auto')
    clf = GridSearchCV(svc, parameters)
    clf.fit(X_train, y_train)
    sorted(clf.cv_results_.keys())
    """
    OUTPUT:
    ['mean_fit_time',
     'mean_score_time',
     'mean_test_score',
     'param_C',
     'param_gamma',
     'param_kernel',
     'params',
     'rank_test_score',
     'split0_test_score',
     'split1_test_score',
     'split2_test_score',
     'split3_test_score',
     'split4_test_score',
     'std_fit_time',
     'std_score_time',
     'std_test_score']
    """
    
    #%%
    
    clf.cv_results_
    """
    OUTPUT:
    
    {'mean_fit_time': array([1.81106772, 2.2082305 , 1.76409402, 1.627356  , 1.75857201,
            1.56271949, 1.75669813, 1.56454921, 1.49936023, 1.51295609,
            1.53478389, 1.54050641]),
     'std_fit_time': array([0.02475751, 0.01825799, 0.03386408, 0.01870898, 0.03761543,
            0.01331777, 0.03751674, 0.01426082, 0.01782674, 0.03306172,
            0.04617239, 0.03143336]),
     'mean_score_time': array([0.33980088, 0.41657252, 0.33987026, 0.32597055, 0.3384655 ,
            0.32319117, 0.33914285, 0.32390838, 0.31457753, 0.31692972,
            0.32040634, 0.32276688]),
     'std_score_time': array([0.00426356, 0.00054301, 0.00069592, 0.00271754, 0.00489743,
            0.00426302, 0.00481799, 0.00548289, 0.00591126, 0.00769205,
            0.00599371, 0.01014503]),
     'param_C': masked_array(data=[1, 1, 10, 10, 100, 100, 1000, 1000, 1, 10, 100, 1000],
                  mask=[False, False, False, False, False, False, False, False,
                        False, False, False, False],
            fill_value='?',
                 dtype=object),
     'param_gamma': masked_array(data=[0.001, 0.0001, 0.001, 0.0001, 0.001, 0.0001, 0.001,
                        0.0001, --, --, --, --],
                  mask=[False, False, False, False, False, False, False, False,
                         True,  True,  True,  True],
            fill_value='?',
                 dtype=object),
     'param_kernel': masked_array(data=['rbf', 'rbf', 'rbf', 'rbf', 'rbf', 'rbf', 'rbf', 'rbf',
                        'linear', 'linear', 'linear', 'linear'],
                  mask=[False, False, False, False, False, False, False, False,
                        False, False, False, False],
            fill_value='?',
                 dtype=object),
     'params': [{'C': 1, 'gamma': 0.001, 'kernel': 'rbf'},
      {'C': 1, 'gamma': 0.0001, 'kernel': 'rbf'},
      {'C': 10, 'gamma': 0.001, 'kernel': 'rbf'},
      {'C': 10, 'gamma': 0.0001, 'kernel': 'rbf'},
      {'C': 100, 'gamma': 0.001, 'kernel': 'rbf'},
      {'C': 100, 'gamma': 0.0001, 'kernel': 'rbf'},
      {'C': 1000, 'gamma': 0.001, 'kernel': 'rbf'},
      {'C': 1000, 'gamma': 0.0001, 'kernel': 'rbf'},
      {'C': 1, 'kernel': 'linear'},
      {'C': 10, 'kernel': 'linear'},
      {'C': 100, 'kernel': 'linear'},
      {'C': 1000, 'kernel': 'linear'}],
     'split0_test_score': array([0.78645833, 0.63020833, 0.85416667, 0.80208333, 0.84375   ,
            0.84895833, 0.84375   , 0.83854167, 0.83854167, 0.83854167,
            0.83854167, 0.83854167]),
     'split1_test_score': array([0.79057592, 0.64397906, 0.85863874, 0.80104712, 0.85340314,
            0.85340314, 0.85863874, 0.85340314, 0.85863874, 0.85863874,
            0.85863874, 0.85863874]),
     'split2_test_score': array([0.79057592, 0.64921466, 0.85863874, 0.81151832, 0.85863874,
            0.86387435, 0.85863874, 0.85863874, 0.85863874, 0.85863874,
            0.85863874, 0.85863874]),
     'split3_test_score': array([0.80628272, 0.64397906, 0.86387435, 0.80628272, 0.86387435,
            0.86910995, 0.85863874, 0.86387435, 0.86387435, 0.86387435,
            0.86387435, 0.86387435]),
     'split4_test_score': array([0.77486911, 0.60732984, 0.84816754, 0.78534031, 0.84293194,
            0.84816754, 0.84293194, 0.84816754, 0.84293194, 0.84293194,
            0.84293194, 0.84293194]),
     'mean_test_score': array([0.7897524 , 0.63494219, 0.85669721, 0.80125436, 0.85251963,
            0.85670266, 0.85251963, 0.85252509, 0.85252509, 0.85252509,
            0.85252509, 0.85252509]),
     'std_test_score': array([0.01006947, 0.01517817, 0.00525755, 0.00877064, 0.00819737,
            0.00835564, 0.00749881, 0.00873473, 0.00991084, 0.00991084,
            0.00991084, 0.00991084]),
     'rank_test_score': array([11, 12,  2, 10,  8,  1,  8,  3,  3,  3,  3,  3], dtype=int32)}
    """
    
    #%%
    
    clf.best_params_
    """
    OUTPUT:
    {'C': 100, 'gamma': 0.0001, 'kernel': 'rbf'}
    """
    
    #%%
    
    clf.best_estimator_
    """
    OUTPUT:
    
    SVC(C=100, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
        decision_function_shape='ovr', degree=3, gamma=0.0001, kernel='rbf',
        max_iter=-1, probability=False, random_state=None, shrinking=True,
        tol=0.001, verbose=False)
    """
    
    #%%
    
    import m2cgen as m2c
    
    code = m2c.export_to_c(clf.best_estimator_)
    
    print(code)

    ERROR:

    
    ---------------------------------------------------------------------------
    
    RecursionError                            Traceback (most recent call last)
    
    <ipython-input-24-43271f44552b> in <module>
          1 import m2cgen as m2c
          2 
    ----> 3 code = m2c.export_to_c(clf.best_estimator_)
          4 
          5 print(code)
    
    ~/PycharmProjects/jupyter_test/venv/lib/python3.6/site-packages/m2cgen/exporters.py in export_to_c(model, indent)
         64     """
         65     interpreter = interpreters.CInterpreter(indent=indent)
    ---> 66     return _export(model, interpreter)
         67 
         68 
    
    ~/PycharmProjects/jupyter_test/venv/lib/python3.6/site-packages/m2cgen/exporters.py in _export(model, interpreter)
        197 def _export(model, interpreter):
        198     assembler_cls = assemblers.get_assembler_cls(model)
    --> 199     model_ast = assembler_cls(model).assemble()
        200     return interpreter.interpret(model_ast)
    
    ~/PycharmProjects/jupyter_test/venv/lib/python3.6/site-packages/m2cgen/assemblers/svm.py in assemble(self)
         37     def assemble(self):
         38         if self._output_size > 1:
    ---> 39             return self._assemble_multi_class_output()
         40         else:
         41             return self._assemble_single_output()
    
    ~/PycharmProjects/jupyter_test/venv/lib/python3.6/site-packages/m2cgen/assemblers/svm.py in _assemble_multi_class_output(self)
         66         n_support_len = len(n_support)
         67 
    ---> 68         kernel_exprs = self._apply_kernel(support_vectors, to_reuse=True)
         69 
         70         support_ranges = []
    
    ~/PycharmProjects/jupyter_test/venv/lib/python3.6/site-packages/m2cgen/assemblers/svm.py in _apply_kernel(self, support_vectors, to_reuse)
        100         kernel_exprs = []
        101         for v in support_vectors:
    --> 102             kernel = self._kernel_fun(v)
        103             kernel_exprs.append(ast.SubroutineExpr(kernel, to_reuse=to_reuse))
        104         return kernel_exprs
    
    ~/PycharmProjects/jupyter_test/venv/lib/python3.6/site-packages/m2cgen/assemblers/svm.py in _rbf_kernel(self, support_vector)
        113         ]
        114         kernel = utils.apply_op_to_expressions(ast.BinNumOpType.ADD,
    --> 115                                                *elem_wise)
        116         kernel = utils.mul(self._neg_gamma_expr, kernel)
        117         return ast.ExpExpr(kernel)
    
    ~/PycharmProjects/jupyter_test/venv/lib/python3.6/site-packages/m2cgen/assemblers/utils.py in apply_op_to_expressions(op, to_reuse, *exprs)
         55             apply_bin_op(current_expr, rest_exprs[0], op), *rest_exprs[1:])
         56 
    ---> 57     result = _inner(apply_bin_op(exprs[0], exprs[1], op), *exprs[2:])
         58     result.to_reuse = to_reuse
         59     return result
    
    ~/PycharmProjects/jupyter_test/venv/lib/python3.6/site-packages/m2cgen/assemblers/utils.py in _inner(current_expr, *rest_exprs)
         53 
         54         return _inner(
    ---> 55             apply_bin_op(current_expr, rest_exprs[0], op), *rest_exprs[1:])
         56 
         57     result = _inner(apply_bin_op(exprs[0], exprs[1], op), *exprs[2:])
    
    ... last 1 frames repeated, from the frame below ...
    
    ~/PycharmProjects/jupyter_test/venv/lib/python3.6/site-packages/m2cgen/assemblers/utils.py in _inner(current_expr, *rest_exprs)
         53 
         54         return _inner(
    ---> 55             apply_bin_op(current_expr, rest_exprs[0], op), *rest_exprs[1:])
         56 
         57     result = _inner(apply_bin_op(exprs[0], exprs[1], op), *exprs[2:])
    
    RecursionError: maximum recursion depth exceeded
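
    The traceback above shows `_inner` calling itself once per operand, so a support vector with many features exhausts Python's default recursion limit. A minimal sketch of a loop-based fold follows. It assumes a stand-in `apply_bin_op` that builds nested tuples rather than m2cgen's real AST nodes, and the name `apply_op_to_expressions_iter` is hypothetical.

```python
from functools import reduce


def apply_bin_op(left, right, op):
    # Stand-in for m2cgen's AST node construction (an assumption):
    # a nested tuple instead of a real binary-expression node.
    return (op, left, right)


def apply_op_to_expressions_iter(op, *exprs):
    """Left-fold `op` over `exprs` with a loop instead of recursion."""
    if len(exprs) < 2:
        raise ValueError("at least two expressions are required")
    return reduce(lambda acc, expr: apply_bin_op(acc, expr, op), exprs)


# 5000 operands would exceed the default recursion limit (1000) in a
# fold that makes one recursive call per operand; the loop does not recurse.
tree = apply_op_to_expressions_iter("ADD", *range(5000))
```

    The resulting expression is still nested as deeply as before, so any later step that walks it recursively may need the same treatment.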
    

    Support for Categorical Variables in LightGBM

    LightGBM supports categorical variables using an integer encoding. https://lightgbm.readthedocs.io/en/latest/Advanced-Topics.html#categorical-feature-support

    The way this is represented in the tree is using the equals operator and a pipe delimited string for the categorical set.

    Example

    # Normal split
    "threshold": 4.285125194551891,
    "decision_type": "<=",
    
    
    # Categorical split
    "threshold": "4||6||7||8||9||22||28||32||63||64",
    "decision_type": "=="
    

    It's important to optimize the set membership check for performance reasons. At a minimum I think we'll need to

    • use a data structure like a set (or binary search an array)
    • hoist the set to the top level so it's only initialized once

    I tried a naive solution (inlined ORs: feat == a || feat == b ...) in addition to hoisted sets in Java. The performance difference was ~6x on my model, which has a large number of categorical operations (~30 members in each set).
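
    The hoisted-set idea can be sketched in Python as follows. The helper names are hypothetical, and the sketch only covers parsing the pipe-delimited threshold and the membership test, not which child a match selects.

```python
def parse_categorical_threshold(threshold):
    """Turn a string such as '4||6||7' into a frozenset of integer codes."""
    return frozenset(int(code) for code in threshold.split("||"))


# Hoisted to module level so the set is built once, not on every call.
# The threshold string is the one from the model dump quoted above.
NODE_CATEGORIES = parse_categorical_threshold("4||6||7||8||9||22||28||32||63||64")


def matches_split(feature_value):
    """Return True when the feature's integer code is in the split's set."""
    return int(feature_value) in NODE_CATEGORIES
```

    Set membership is a hash lookup, so its cost does not grow with the number of categories the way a chain of inlined comparisons does.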

    If my understanding of m2cgen is right, I think this should be a new operator in interpreters/interpreter.py

    I have a PoC here master...Zocdoc:cs_categorical if you'd like to look at the code. It's just a hack, and I would like your guidance on making these changes in a better way.

    If it's helpful I can open a PR and we can discuss the code there (also, GitHub now allows maintainer commits to PR branches, which is nice).

    Add support for XGBoost Random Forest for multiclass task

    Initial support for regression and binary classification tasks was added in #157. Unfortunately, the multiclass case is not so trivial: supporting it requires deeper knowledge of XGBoost internals (dumped model representation and prediction logic).

    Add support for Gradient Boosting from scikit-learn

    Is there any way to transpile sklearn GBM models? I know it's not supported in the library. But maybe some way to convert it into one of the supported models, and then transpile. Any leads would be appreciated. Thanks!

    gcc crashes compiling output

    Similar to #88, but in my case the problem isn't the size of the binary but that gcc sometimes runs out of memory building the C output.

    Could the output be broken up over multiple files to reduce compiler memory use?

    LightGBM's original predictions differ from the transpiled JS predictions

    I have an LGBMRegressor model (LightGBM v2.3.1) and I transpiled it to JavaScript code.

    My dataset has about 60 numeric features (no categorical features). Features can contain missing values (they are np.nan in Python code and null in JavaScript).

    This is my LightGBM model:

    lgb.LGBMRegressor(
        objective='mse',
        max_depth=5,
        first_metric_only=True,
        boosting_type='gbdt',
        importance_type='gain',
        feature_fraction=np.sqrt(len(features))/len(features),
        subsample=0.4,
        seed=RANDOM_STATE,
        eta=0.02,
        nthread=0,
        reg_alpha=0.1,
        reg_lambda=0.1,
        num_leaves=31,
        n_estimators=50
    ) 
    

    Unfortunately I can't share the dataset.

    The problem is that LightGBM's original predictions (from Python) differ from the transpiled JS predictions. They are similar, but different. If I plot a distribution of the differences between the original predictions and the JS predictions, I get a sort of normal distribution with mean=0, but in the tail of the distribution there are some pretty large differences.

    Code generated from XGBoost model includes "None"

    When transpiling XGBRegressor and XGBClassifier models such as the following basic example:

    from xgboost import XGBRegressor
    from sklearn import datasets
    import m2cgen as m2c
    
    iris_data = datasets.load_iris(return_X_y=True)
    
    mod = XGBRegressor(booster="gblinear", max_depth=2)
    X, y = iris_data
    mod.fit(X[:120], y[:120])
    
    code = m2c.export_to_c(mod)
    
    print(code)

    the resulting C code includes a Pythonesque None:

    double score(double * input) {
        return (None) + (((((-0.391196) + ((input[0]) * (-0.0196191))) + ((input[1]) * (-0.11313))) + ((input[2]) * (0.137024))) + ((input[3]) * (0.645197)));
    }

    Probably I am missing some basic step?

    drop numpy dependency from Python code for cases without vectors

    According to this line, it seems that numpy is used as a default math library for runtime even when we do not operate with vectors.

    if self.with_vectors or self.with_math_module:
        self._cg.add_dependency("numpy", alias="np")

    Let me describe two advantages of dropping numpy where it's possible.

    The first one is the extra dependency. Even though numpy is a sort of "classic" dependency and installing it should cause no problems, it requires additional effort on the user's side. Also, some companies have very strict security policies that prohibit using pip (conda, brew, and other package managers). So, I guess, raw Python may be the preferable solution for them where it's possible.

    The second one is speed. numpy is about efficient vector math; in other cases it only adds redundant computational cost. Consider the following example. Take this generated Python code from the repo, change the return type from np.array to a simple list, and replace the following things in the script:

    • numpy -> math
    • np.exp -> math.exp
    • np.power -> math.pow

    Here is what we get after removing numpy:

    import math
    def score_raw(input):
        var0 = (0) - (0.25)
        var1 = math.exp((var0) * ((((math.pow((5.4) - (input[0]), 2)) + (math.pow((3.0) - (input[1]), 2))) + (math.pow((4.5) - (input[2]), 2))) + (math.pow((1.5) - (input[3]), 2))))
        var2 = math.exp((var0) * ((((math.pow((6.2) - (input[0]), 2)) + (math.pow((2.2) - (input[1]), 2))) + (math.pow((4.5) - (input[2]), 2))) + (math.pow((1.5) - (input[3]), 2))))
        var3 = math.exp((var0) * ((((math.pow((5.0) - (input[0]), 2)) + (math.pow((2.3) - (input[1]), 2))) + (math.pow((3.3) - (input[2]), 2))) + (math.pow((1.0) - (input[3]), 2))))
        var4 = math.exp((var0) * ((((math.pow((5.9) - (input[0]), 2)) + (math.pow((3.2) - (input[1]), 2))) + (math.pow((4.8) - (input[2]), 2))) + (math.pow((1.8) - (input[3]), 2))))
        var5 = math.exp((var0) * ((((math.pow((5.0) - (input[0]), 2)) + (math.pow((2.0) - (input[1]), 2))) + (math.pow((3.5) - (input[2]), 2))) + (math.pow((1.0) - (input[3]), 2))))
        var6 = math.exp((var0) * ((((math.pow((6.7) - (input[0]), 2)) + (math.pow((3.0) - (input[1]), 2))) + (math.pow((5.0) - (input[2]), 2))) + (math.pow((1.7) - (input[3]), 2))))
        var7 = math.exp((var0) * ((((math.pow((7.0) - (input[0]), 2)) + (math.pow((3.2) - (input[1]), 2))) + (math.pow((4.7) - (input[2]), 2))) + (math.pow((1.4) - (input[3]), 2))))
        var8 = math.exp((var0) * ((((math.pow((4.9) - (input[0]), 2)) + (math.pow((2.4) - (input[1]), 2))) + (math.pow((3.3) - (input[2]), 2))) + (math.pow((1.0) - (input[3]), 2))))
        var9 = math.exp((var0) * ((((math.pow((6.3) - (input[0]), 2)) + (math.pow((2.5) - (input[1]), 2))) + (math.pow((4.9) - (input[2]), 2))) + (math.pow((1.5) - (input[3]), 2))))
        var10 = math.exp((var0) * ((((math.pow((6.0) - (input[0]), 2)) + (math.pow((2.7) - (input[1]), 2))) + (math.pow((5.1) - (input[2]), 2))) + (math.pow((1.6) - (input[3]), 2))))
        var11 = math.exp((var0) * ((((math.pow((5.7) - (input[0]), 2)) + (math.pow((2.6) - (input[1]), 2))) + (math.pow((3.5) - (input[2]), 2))) + (math.pow((1.0) - (input[3]), 2))))
        var12 = math.exp((var0) * ((((math.pow((5.1) - (input[0]), 2)) + (math.pow((3.8) - (input[1]), 2))) + (math.pow((1.9) - (input[2]), 2))) + (math.pow((0.4) - (input[3]), 2))))
        var13 = math.exp((var0) * ((((math.pow((4.4) - (input[0]), 2)) + (math.pow((2.9) - (input[1]), 2))) + (math.pow((1.4) - (input[2]), 2))) + (math.pow((0.2) - (input[3]), 2))))
        var14 = math.exp((var0) * ((((math.pow((5.7) - (input[0]), 2)) + (math.pow((4.4) - (input[1]), 2))) + (math.pow((1.5) - (input[2]), 2))) + (math.pow((0.4) - (input[3]), 2))))
        var15 = math.exp((var0) * ((((math.pow((5.8) - (input[0]), 2)) + (math.pow((4.0) - (input[1]), 2))) + (math.pow((1.2) - (input[2]), 2))) + (math.pow((0.2) - (input[3]), 2))))
        var16 = math.exp((var0) * ((((math.pow((5.1) - (input[0]), 2)) + (math.pow((3.3) - (input[1]), 2))) + (math.pow((1.7) - (input[2]), 2))) + (math.pow((0.5) - (input[3]), 2))))
        var17 = math.exp((var0) * ((((math.pow((5.7) - (input[0]), 2)) + (math.pow((3.8) - (input[1]), 2))) + (math.pow((1.7) - (input[2]), 2))) + (math.pow((0.3) - (input[3]), 2))))
        var18 = math.exp((var0) * ((((math.pow((4.3) - (input[0]), 2)) + (math.pow((3.0) - (input[1]), 2))) + (math.pow((1.1) - (input[2]), 2))) + (math.pow((0.1) - (input[3]), 2))))
        var19 = math.exp((var0) * ((((math.pow((4.5) - (input[0]), 2)) + (math.pow((2.3) - (input[1]), 2))) + (math.pow((1.3) - (input[2]), 2))) + (math.pow((0.3) - (input[3]), 2))))
        var20 = math.exp((var0) * ((((math.pow((6.3) - (input[0]), 2)) + (math.pow((2.7) - (input[1]), 2))) + (math.pow((4.9) - (input[2]), 2))) + (math.pow((1.8) - (input[3]), 2))))
        var21 = math.exp((var0) * ((((math.pow((6.0) - (input[0]), 2)) + (math.pow((3.0) - (input[1]), 2))) + (math.pow((4.8) - (input[2]), 2))) + (math.pow((1.8) - (input[3]), 2))))
        var22 = math.exp((var0) * ((((math.pow((6.3) - (input[0]), 2)) + (math.pow((2.8) - (input[1]), 2))) + (math.pow((5.1) - (input[2]), 2))) + (math.pow((1.5) - (input[3]), 2))))
        var23 = math.exp((var0) * ((((math.pow((5.8) - (input[0]), 2)) + (math.pow((2.8) - (input[1]), 2))) + (math.pow((5.1) - (input[2]), 2))) + (math.pow((2.4) - (input[3]), 2))))
        var24 = math.exp((var0) * ((((math.pow((6.1) - (input[0]), 2)) + (math.pow((3.0) - (input[1]), 2))) + (math.pow((4.9) - (input[2]), 2))) + (math.pow((1.8) - (input[3]), 2))))
        var25 = math.exp((var0) * ((((math.pow((7.7) - (input[0]), 2)) + (math.pow((2.6) - (input[1]), 2))) + (math.pow((6.9) - (input[2]), 2))) + (math.pow((2.3) - (input[3]), 2))))
        var26 = math.exp((var0) * ((((math.pow((6.9) - (input[0]), 2)) + (math.pow((3.1) - (input[1]), 2))) + (math.pow((5.1) - (input[2]), 2))) + (math.pow((2.3) - (input[3]), 2))))
        var27 = math.exp((var0) * ((((math.pow((6.3) - (input[0]), 2)) + (math.pow((3.3) - (input[1]), 2))) + (math.pow((6.0) - (input[2]), 2))) + (math.pow((2.5) - (input[3]), 2))))
        var28 = math.exp((var0) * ((((math.pow((4.9) - (input[0]), 2)) + (math.pow((2.5) - (input[1]), 2))) + (math.pow((4.5) - (input[2]), 2))) + (math.pow((1.7) - (input[3]), 2))))
        var29 = math.exp((var0) * ((((math.pow((6.0) - (input[0]), 2)) + (math.pow((2.2) - (input[1]), 2))) + (math.pow((5.0) - (input[2]), 2))) + (math.pow((1.5) - (input[3]), 2))))
        var30 = math.exp((var0) * ((((math.pow((7.9) - (input[0]), 2)) + (math.pow((3.8) - (input[1]), 2))) + (math.pow((6.4) - (input[2]), 2))) + (math.pow((2.0) - (input[3]), 2))))
        var31 = math.exp((var0) * ((((math.pow((7.2) - (input[0]), 2)) + (math.pow((3.0) - (input[1]), 2))) + (math.pow((5.8) - (input[2]), 2))) + (math.pow((1.6) - (input[3]), 2))))
        var32 = math.exp((var0) * ((((math.pow((7.7) - (input[0]), 2)) + (math.pow((3.8) - (input[1]), 2))) + (math.pow((6.7) - (input[2]), 2))) + (math.pow((2.2) - (input[3]), 2))))
        return [(((((((((((((((((((-0.08359187780790468) + ((var1) * (-0.0))) + ((var2) * (-0.0))) + ((var3) * (-0.4393498355605194))) + ((var4) * (-0.009465620856664334))) + ((var5) * (-0.16223369966927))) + ((var6) * (-0.26861888775075243))) + ((var7) * (-0.4393498355605194))) + ((var8) * (-0.4393498355605194))) + ((var9) * (-0.0))) + ((var10) * (-0.0))) + ((var11) * (-0.19673905328606292))) + ((var12) * (0.3340655283922188))) + ((var13) * (0.3435087305152051))) + ((var14) * (0.4393498355605194))) + ((var15) * (0.0))) + ((var16) * (0.28614124535416424))) + ((var17) * (0.11269159286168087))) + ((var18) * (0.0))) + ((var19) * (0.4393498355605194)), (((((((((((((((((((((-0.18563912331454907) + ((var20) * (-0.0))) + ((var21) * (-0.06014273244194299))) + ((var22) * (-0.0))) + ((var23) * (-0.031132453078851926))) + ((var24) * (-0.0))) + ((var25) * (-0.3893079321588921))) + ((var26) * (-0.06738007627290196))) + ((var27) * (-0.1225075748937126))) + ((var28) * (-0.3893079321588921))) + ((var29) * (-0.29402231709614085))) + ((var30) * (-0.3893079321588921))) + ((var31) * (-0.0))) + ((var32) * (-0.028242141062729226))) + ((var12) * (0.16634667752431267))) + ((var13) * (0.047772685163074764))) + ((var14) * (0.3893079321588921))) + ((var15) * (0.3893079321588921))) + ((var16) * (0.0))) + ((var17) * (0.0))) + ((var18) * (0.3893079321588921))) + ((var19) * (0.3893079321588921)), ((((((((((((((((((((((((0.5566649875797668) + ((var20) * (-25.563066587228416))) + ((var21) * (-38.35628154976547))) + ((var22) * (-38.35628154976547))) + ((var23) * (-0.0))) + ((var24) * (-38.35628154976547))) + ((var25) * (-0.0))) + ((var26) * (-0.0))) + ((var27) * (-0.0))) + ((var28) * (-6.2260303727828745))) + ((var29) * (-18.42781911624364))) + ((var30) * (-0.14775026537286423))) + ((var31) * (-7.169755983020096))) + ((var32) * (-0.0))) + ((var1) * (12.612328267927264))) + ((var2) * (6.565812506955159))) + ((var3) * (0.0))) + ((var4) * (38.35628154976547))) + ((var5) * (0.0))) + ((var6) * 
(38.35628154976547))) + ((var7) * (0.0))) + ((var8) * (0.0))) + ((var9) * (38.35628154976547))) + ((var10) * (38.35628154976547))) + ((var11) * (0.0))]
    

    And here are some timings:

    %%timeit -n 10000
    score([1, 2, 3, 4])
    
    310 µs ± 658 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    
    %%timeit -n 10000
    score_raw([1, 2, 3, 4])
    
    39.4 µs ± 136 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    

    The results seem to be identical:

    np.testing.assert_allclose(score([1, 2, 3, 4]), score_raw([1, 2, 3, 4]))
    

    Please share your thoughts about this refactoring.

    m2cgen output for xgboost with binary:logistic objective returns raw (not transformed) scores

    Our xgboost models use the binary:logistic objective function; however, the m2cgen-converted versions of the models return raw scores instead of the transformed scores.

    This is fine as long as the user knows this is happening! I didn't, so it took a while to figure out what was going on. I'm wondering if perhaps a useful warning could be raised for users to alert them of this issue? A warning could include a note that they can transform these scores back to the expected probabilities [0, 1] by prob = logistic.cdf(score - base_score) where base_score is an attribute of the xgboost model.

    In our case, I'd like to minimize unnecessary processing on the device, so I am actually happy with the current m2cgen output and will instead inverse transform our threshold when evaluating the model output from the transpiled model...but it did take me a bit before I figured out what was going on, which is why I'm suggesting that a user friendly message might be raised when an unsupported objective function is encountered.
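
    Both directions described above (transforming a raw score into a probability, or inverse-transforming the threshold once) can be sketched with the standard library. The function names are hypothetical, and the `offset` parameter only mirrors the `base_score` correction suggested in this issue; whether it is needed for a given model is an assumption to verify.

```python
import math


def sigmoid(x):
    """Logistic CDF, written to avoid overflow for large negative x."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def logit(p):
    """Inverse of sigmoid: maps a probability in (0, 1) to a raw score."""
    return math.log(p / (1.0 - p))


def raw_to_probability(score, offset=0.0):
    # `offset` follows the issue's `score - base_score` suggestion (unverified).
    return sigmoid(score - offset)


def probability_threshold_to_raw(threshold, offset=0.0):
    # Compare raw scores against this value instead of transforming each one.
    return logit(threshold) + offset
```

    Converting the threshold once keeps the exponential out of the per-prediction path, which matches the goal of minimizing processing on the device.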

    Thanks for creating & sharing this great tool!

    Converted version outputs index of class instead of class

    If I train a classifier with non-consecutive numbers for classes, the resulting converted code (C in my case) will not output the classes but the index of the class. In my case I simply don't have an example for class 1 in all cases, so the classifier will not know this class exists. This creates discrepancies between Python and C.

    import numpy as np
    import m2cgen
    from sklearn.ensemble import RandomForestClassifier
    # linear mapping: x->x
    # NB: my goal is not regression, this is just an example
    x_train = np.repeat([0,1,2,3,4,5], 100).reshape([-1,1])
    y_train = np.repeat([0,1,2,3,4,5], 100)
    
    # however, class 1 is missing in training!
    x_train = x_train[y_train!=1]
    y_train = y_train[y_train!=1]
    
    clf = RandomForestClassifier().fit(x_train, y_train)
    
    # convert it
    code = m2cgen.export_to_c(clf)
    
    result = clf.predict(np.atleast_2d([0,1,2,3,4,5]).T)
    # result =[0,0,2,3,4,5]
    

    Calling it in C will give different results

    // Pseudocode for C: index of the highest score for each of the six inputs
    double result[6] = score([0,1,2,3,4,5])
    
    // result = [0,0,1,2,3,4]
    

    Do you think there is any feasible way to keep original class label?

    (see also nok/sklearn-porter#37 having the same problem)
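
    One possible workaround is to keep the label list next to the generated code and map the arg-max position back to a label. The sketch below is in Python for brevity and assumes the generated function yields one score per class in the order of the model's `classes_` attribute; `CLASSES` and `predict_label` are hypothetical names.

```python
# Labels as the trained model would report them (clf.classes_ in
# scikit-learn); hard-coded here because class 1 is absent from training.
CLASSES = [0, 2, 3, 4, 5]


def predict_label(scores, classes=CLASSES):
    """Map the arg-max position of the per-class scores to its label."""
    if len(scores) != len(classes):
        raise ValueError("expected one score per class")
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return classes[best_index]
```

    The same lookup table could be emitted as a constant array in the target language.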

    Travis 50min limit... again

    Today I saw our jobs at master hit 50min Travis limit per job 3 times. Guess, it's time to either review #243 or reorganize jobs at Travis. Refer to #125 for the past experience and to #114 for some further ideas.

    cc @izeigerman

    add support for declarative languages

    It seems that, especially for functional languages, it is very common to write a Score() function and apply it to some data.

    At present, assemblers assume that the target language is imperative.

    Large LightGBM causes javac error "Code too Large"

    When generating code for a large number of trees, the generated code exceeds the 64KB limit in Java.

    From Stack Overflow:

    A single method in a Java class may be at most 64KB of bytecode.

    One solution is to add subfunctions https://github.com/BayesWitnesses/m2cgen/blob/master/m2cgen/assemblers/boosting.py#L43-L48 instead of having the body of every tree inside subroutine0. The amount of code that will fit inside each function is dependent on its depth + width so we might require some heuristic or tunable parameter. In my case, I ended up with 10 trees per subfunction

    I'm not sure if there are similar limits in other languages
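
    The grouping idea can be sketched as below. This is not m2cgen's assembler code: `emit_java_methods` is a hypothetical helper that takes one Java expression per tree as a string, and the default of 10 trees per method is simply the value reported above.

```python
def chunk(items, size):
    """Split `items` into consecutive groups of at most `size` elements."""
    if size < 1:
        raise ValueError("size must be positive")
    return [items[i:i + size] for i in range(0, len(items), size)]


def emit_java_methods(tree_bodies, trees_per_method=10):
    """Emit one Java method per group of trees, plus `score` summing them."""
    groups = chunk(tree_bodies, trees_per_method)
    lines = []
    for n, group in enumerate(groups):
        body = " + ".join(f"({expr})" for expr in group)
        lines.append(
            f"static double subroutine{n}(double[] input) {{ return {body}; }}"
        )
    # Fall back to 0.0 so an empty model still yields valid Java.
    calls = " + ".join(f"subroutine{n}(input)" for n in range(len(groups))) or "0.0"
    lines.append(f"public static double score(double[] input) {{ return {calls}; }}")
    return "\n".join(lines)
```

    A fixed group size is the simplest heuristic; a size derived from tree depth and width would track the bytecode limit more closely.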

    C-code generated from XGBoost model includes "None"

    When running the following code:

    from xgboost import XGBRegressor
    from sklearn import datasets
    import m2cgen as m2c
    
    iris_data = datasets.load_iris(return_X_y=True)
    
    mod = XGBRegressor()
    X, y = iris_data
    mod.fit(X, y)
    
    code = m2c.export_to_c(mod)
    
    print(code)

    the printed C code includes a Pythonesque None:

    ...
     return (((((((((((((((((((((((((((((((((((((((((((((((((((((((
    (((((((((((((((((((((((((((((((((((((((((((((None) + (var0)) +
     (var1)) + (var2)) + (var3)) + (var4)) + (var5)) + (var6)) +
     (var7)) + (var8)) + (var9)) + (var10)) + (var11)) + (var12)) +  ...
    ... 

    Dart language support

    For those building Flutter apps that would like to be able to utilize static models trained in scikit on-device, this tool would be a perfect fit. And if the Flutter dev team decides to add a hot code push feature to the framework, models from m2cgen could be updated on the fly.

    Will there be an R interface

    Hello, do you plan to provide an interface for the R language? I think that R is comparable to Python in some respects. Can you give R an interface?

    Prepare for release 0.1.0

    • Classification support for ensemble models.
    • Classification for Python.
    • setup.py and release procedure.
    • Enable more sklearn models (like remaining linear models).
    • README + docs + examples.
    • Revisit the library API (exporters).
    • Implement CLI.
    • Deal with Python limitations on nested function calls (#57)

    Optional:

    • C language support
    • XGBoost/LightGBM

    What might cause an invalid load key on conversion?

    When calling m2cgen tpot_classify.pkl --language go on a perfectly fine Pickle file, I receive the following error:

    Traceback (most recent call last):
      File "/usr/local/bin/m2cgen", line 10, in <module>
        sys.exit(main())
      File "/usr/local/lib/python3.7/site-packages/m2cgen/cli.py", line 86, in main
        print(generate_code(args))
      File "/usr/local/lib/python3.7/site-packages/m2cgen/cli.py", line 71, in generate_code
        model = pickle.load(f)
    _pickle.UnpicklingError: invalid load key, '\x00'.
    

    I'm curious if you might know what would cause this NULL byte to appear?
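
    One thing worth checking is whether the file really is a plain pickle stream. This is a guess at a diagnostic, not a confirmed cause. Streams written with pickle protocol 2 or later begin with the byte 0x80 followed by the protocol number, so a leading NUL byte suggests the file was produced or altered by something else. `describe_pickle_header` is a hypothetical helper.

```python
import pickle


def describe_pickle_header(data):
    """Report whether `data` starts like a protocol-2+ pickle stream."""
    if len(data) >= 2 and data[0] == 0x80:
        return f"pickle protocol {data[1]}"
    return f"not a protocol-2+ pickle (first byte: {data[:1]!r})"


good = pickle.dumps({"model": "stand-in"})
bad = b"\x00" + good  # simulate a file with a leading NUL byte

print(describe_pickle_header(good))
print(describe_pickle_header(bad))

try:
    pickle.loads(bad)
except pickle.UnpicklingError as exc:
    print("load failed:", exc)
```

    Running the helper on the first bytes of tpot_classify.pkl (read in binary mode) would show which case applies.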

    Migrate to f-strings

    m2cgen's codebase makes heavy use of string formatting and concatenation. f-strings (brief guide) are known to be the fastest way to format strings (1, 2), and they improve code readability a lot.

    (Image omitted.) Source: https://cito.github.io/blog/f-strings/.

    The only problem is that they are supported starting from Python 3.6, while m2cgen currently supports Python 3.5.

    "Programming Language :: Python :: 3.5",

    My suggestion is the following. Mark the next 0.8.0 release as the last one that supports Python 3.5 and drop the support in the 0.9.0 release. That will be at approximately the same time as Python 3.5 reaches its EOL (2020-09-13).
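
    As an illustration of the readability argument, the same string is built three ways below. The variable names and the emitted text are made up for the example and are not taken from m2cgen's codebase.

```python
name, depth = "subroutine0", 25

# Three equivalent ways to build the same string.
with_percent = "double %s(double * input); /* depth %d */" % (name, depth)
with_format = "double {}(double * input); /* depth {} */".format(name, depth)
with_fstring = f"double {name}(double * input); /* depth {depth} */"
```

    In the f-string version each value sits where it appears in the output, so there is no argument list to keep in sync with the placeholders.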

    Better document the usage of OOP API

    I can't find any documentation of how to use m2cgen via the OOP API. If I'm not mistaken, we only have examples of how to use it via the functional API. However, the OOP API gives more options to customize code generation. I mean, the only way to change e.g. bin_depth_threshold is to change attributes of the interpreter class. I believe this is very important because in most cases we set default values more or less arbitrarily, just to pass particular tests. For instance, refer to

    # R doesn't allow to have more than 50 nested if, [, [[, {, ( calls.
    # It raises contextstack overflow error not only for explicitly nested
    # calls, but also if met above mentioned number of parentheses
    # in one expression. Given that there is no way to control
    # the number of parentheses in one expression for now,
    # the following variable set to 50 / 2 value is expected to prevent
    # contextstack overflow error occurrence.
    # This value is just a heuristic and is subject to change in the future
    # based on the users' feedback.
    bin_depth_threshold = 25

    Also, we may not know about various other limitations of supported languages. For example, today I learned from a great recent blog post that C# has a limit on the number of local variables.

    There is a decent Python library, m2cgen, which can export to C, C#, Dart, Go, Haskell, Java, JavaScript, PHP, PowerShell, Python, R, Ruby, Visual Basic. The output is a ready-made module that can be compiled with your favorite compiler (i.e. without using any DLLs!). m2cgen has some limitations on complexity (for example, C# can hit the limit of 64 thousand local variables; you can try to work around the limit by creating several small procedures instead of one large one).
    https://imageman72.livejournal.com/47186.html

    One can easily overcome similar limitations with the help of our mixins by inheriting from them in a custom class, without the need to modify the package source code. This can't be done via the functional API.
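
    The pattern of tuning a limit by overriding a class attribute in a subclass can be sketched without m2cgen itself. Both classes below are stand-ins written for this illustration, not the library's interpreters or mixins; only the attribute name `bin_depth_threshold` comes from the source quoted above.

```python
class SketchInterpreter:
    """Stand-in interpreter (NOT m2cgen's code): emits a sum of terms."""

    # Maximum nesting depth allowed in a single emitted expression.
    bin_depth_threshold = 25

    def interpret_sum(self, terms):
        if not terms:
            raise ValueError("at least one term is required")
        lines, acc, depth, n_vars = [], terms[0], 0, 0
        for term in terms[1:]:
            if depth >= self.bin_depth_threshold:
                # Too deep: park the partial sum in a temporary variable.
                var = f"var{n_vars}"
                lines.append(f"{var} = {acc}")
                acc, depth, n_vars = var, 0, n_vars + 1
            acc = f"({acc}) + ({term})"
            depth += 1
        lines.append(f"return {acc}")
        return lines


class ShallowInterpreter(SketchInterpreter):
    # Override the class attribute instead of editing the package source.
    bin_depth_threshold = 2
```

    With five terms, `ShallowInterpreter` emits one temporary variable before the return statement, while the base class emits a single nested expression.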

    In scikit-learn SVC convert One-vs-one decisions to One-vs-rest

    Refer to

    # One-vs-one decisions.
    decisions = []
    for i in range(n_support_len):
        for j in range(i + 1, n_support_len):
            kernel_weight_mul_ops = [
                utils.mul(kernel_exprs[k], ast.NumVal(coef[i][k]))
                for k in range(*support_ranges[j])
            ]
            kernel_weight_mul_ops.extend([
                utils.mul(kernel_exprs[k], ast.NumVal(coef[j - 1][k]))
                for k in range(*support_ranges[i])
            ])
            decision = utils.apply_op_to_expressions(
                ast.BinNumOpType.ADD,
                ast.NumVal(intercept[len(decisions)]),
                *kernel_weight_mul_ops
            )
            decisions.append(decision)
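
    One way to turn the pairwise decisions into one score per class is vote counting with a bounded confidence term as a tie-breaker. The sketch below is modeled on that general approach and operates on plain numbers rather than AST expressions. The function name and the sign convention (a non-negative decision for the pair (i, j) favors class i) are assumptions, so the details may differ from scikit-learn's own conversion.

```python
def ovo_to_ovr(decisions, n_classes):
    """Combine pairwise (i, j) decision values, i < j, into per-class scores."""
    expected = n_classes * (n_classes - 1) // 2
    if len(decisions) != expected:
        raise ValueError(f"expected {expected} pairwise decisions")
    votes = [0.0] * n_classes
    confidence = [0.0] * n_classes
    k = 0
    for i in range(n_classes):
        for j in range(i + 1, n_classes):
            d = decisions[k]
            confidence[i] += d
            confidence[j] -= d
            if d < 0:
                votes[j] += 1  # negative decision favors class j
            else:
                votes[i] += 1  # non-negative decision favors class i
            k += 1
    # Squash confidences into (-1/3, 1/3) so they only break ties in votes.
    return [v + c / (3.0 * (abs(c) + 1.0)) for v, c in zip(votes, confidence)]
```

    The predicted class is then the position of the largest combined score.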

    Code generated for XGBoost models returns wrong scores when the feature input includes zero, which XGBoost treats as "missing"

    I'm trying to use m2cgen to generate JS code for an XGBoost model, but I found that if the feature input includes zero, the result calculated by the generated JS differs a lot from the result predicted by the model. For example, if the feature input is [0.4444,0.55555,0.3545,0.22333], the result calculated by the generated JS equals the model's prediction, but if the feature input is [0.4444,0,0,0.22333], the two results are very different: one may be 0.22 and the other 0.04. After validating with a demo, we found that m2cgen does not handle the "missing" condition: when XGBoost takes the "missing" branch, m2cgen processes it as "yes".
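
    A split that honors a separate "missing" direction could look like the sketch below. The node layout is a simplified stand-in whose field names follow the yes/no/missing wording used in this report, and `missing_marker` is a hypothetical parameter for setups where a value such as zero denotes a missing entry.

```python
import math


def is_missing(value, missing_marker=None):
    """Treat None, NaN, and an optional marker value as missing."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return True
    return missing_marker is not None and value == missing_marker


def next_node(node, features, missing_marker=None):
    """Pick the child of one split node, honoring its `missing` direction."""
    value = features[node["feature"]]
    if is_missing(value, missing_marker):
        return node["missing"]
    return node["yes"] if value < node["threshold"] else node["no"]


# Stand-in node: feature 1 is compared with 0.3; missing values go to child 2.
node = {"feature": 1, "threshold": 0.3, "yes": 1, "no": 2, "missing": 2}
```

    With this node, a zero feature follows the "yes" branch by default and the "missing" branch only when zero is declared as the marker, which is the kind of discrepancy described above.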
