
mleap-demo's People

Contributors

ancasarb · hollinwilkins · seme0021


mleap-demo's Issues

Airbnb price regression, dataset unzip error

Following the tutorial at https://github.com/combust/mleap-demo/blob/master/notebooks/airbnb-price-regression.ipynb, I downloaded the dataset from https://s3-us-west-2.amazonaws.com/mleap-demo/datasources/airbnb.avro.zip. However, the file can't be unzipped; it fails with the following error:

  End-of-central-directory signature not found.  Either this file is not
  a zipfile, or it constitutes one disk of a multi-part archive.  In the
  latter case the central directory and zipfile comment will be found on
  the last disk(s) of this archive.
note:  airbnb.avro.zip may be a plain executable, not an archive
unzip:  cannot find zipfile directory in one of airbnb.avro.zip or
        airbnb.avro.zip.zip, and cannot find airbnb.avro.zip.ZIP, period.
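
For what it's worth, that unzip output usually means the downloaded file isn't a zip archive at all, e.g. it is already the raw Avro file, a truncated download, or an HTML error page. A minimal sketch for checking the magic bytes before unzipping (the local filename is an assumption):

# Minimal sketch (assumed local path): zip archives start with b'PK',
# Avro object container files start with b'Obj\x01'.
with open('airbnb.avro.zip', 'rb') as f:
    magic = f.read(4)

if magic.startswith(b'PK'):
    print('looks like a real zip archive')
elif magic.startswith(b'Obj\x01'):
    print('already a raw Avro file; rename it to airbnb.avro instead of unzipping')
else:
    print('unrecognized header (truncated or error-page download?):', magic)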

AttributeError: 'OneHotEncoder' object has no attribute 'n_values_'


Doesn't work on 0.8.1

It seems the PySpark and Scala notebooks don't work due to changes in the underlying API. The notebooks need to be updated to work with recent versions.

Doesn't work on mleap 0.6.0?

I'm going through this example, and it doesn't seem to work using the current master branch on 0.6.0: there is no mleap.pyspark in master. I also tried the branch feature/scikit-v2, which does have mleap.pyspark, but when I get to the bottom it just says 'Pipeline' object has no attribute 'serializeToBundle'. Any ideas on what is going on here?
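
In case it helps, MLeap's Python package exposes serializeToBundle by monkey-patching Spark ML models when its support module is imported, so the method is missing unless that import has run. A hedged sketch of the usage pattern, assuming the mleap.pyspark layout of that era and a Spark ML Pipeline `pipeline` and DataFrame `df` from the notebook:

# Hedged sketch: importing spark_support patches serializeToBundle onto
# Spark ML transformers and pipeline models.
import mleap.pyspark
from mleap.pyspark.spark_support import SimpleSparkSerializer

fitted = pipeline.fit(df)
fitted.serializeToBundle('jar:file:/tmp/model.zip', fitted.transform(df))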

Multiple artifacts of the module net.sourceforge.f2j#arpack_combined_all;0.1 are retrieved to the same file! Update the retrieve pattern to fix this error

I am trying to launch spark-shell with MLeap as a package, using the following command:

spark-shell --packages ml.combust.mleap:mleap-runtime_2.11:0.7.0

Here is the error that I get:
Exception in thread "main" java.lang.RuntimeException: problem during retrieve of org.apache.spark#spark-submit-parent: java.lang.RuntimeException: Multiple artifacts of the module net.sourceforge.f2j#arpack_combined_all;0.1 are retrieved to the same file! Update the retrieve pattern to fix this error.
at org.apache.ivy.core.retrieve.RetrieveEngine.retrieve(RetrieveEngine.java:249)
at org.apache.ivy.core.retrieve.RetrieveEngine.retrieve(RetrieveEngine.java:83)
at org.apache.ivy.Ivy.retrieve(Ivy.java:551)
at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1086)
at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:296)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:160)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.RuntimeException: Multiple artifacts of the module net.sourceforge.f2j#arpack_combined_all;0.1 are retrieved to the same file! Update the retrieve pattern to fix this error.
at org.apache.ivy.core.retrieve.RetrieveEngine.determineArtifactsToCopy(RetrieveEngine.java:417)
at org.apache.ivy.core.retrieve.RetrieveEngine.retrieve(RetrieveEngine.java:118)

Please help me resolve this. I am using Spark 2.1.
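
A commonly reported workaround for this Ivy "multiple artifacts" failure is to clear the local Ivy cache so spark-shell re-resolves the packages from scratch. A hedged sketch, assuming the default Ivy location under the home directory:

# Hedged sketch: remove cached Ivy artifacts so the next
# spark-shell --packages run re-resolves them cleanly.
# Assumes the default ~/.ivy2 layout.
import shutil
from pathlib import Path

for sub in ('cache', 'jars'):
    target = Path.home() / '.ivy2' / sub
    if target.exists():
        shutil.rmtree(target)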

Serializing mlp classifier (sk learn) with mleap serialize_to_bundle

I tried serializing a pipeline with MLeap, but it's giving the error
"'ColumnTransformer' object has no attribute 'op'".

Below are segments from the pipeline:
# Imports assumed by the snippet (the original showed only segments)
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
import mleap.sklearn.pipeline  # adds MLeap methods (mlinit, serialize_to_bundle) to Pipeline

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)])

mlp = MLPClassifier(hidden_layer_sizes=(8, 6, 1), max_iter=300,
                    activation='tanh', solver='adam', random_state=123)

pipe = Pipeline([('preprocessor', preprocessor), ('mlp', mlp)])
pipe.mlinit()

model = pipe.fit(X_train, y_train)

model.serialize_to_bundle("jar:file:/C://Users/logReg.zip")
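
One observation: elsewhere in these issues (see the market.rf example below) serialize_to_bundle is called with a directory path and a model name rather than a single jar:file: URI. A hedged variant of the final call, with 'logReg' as an assumed model name:

# Hedged sketch: (path, model_name) form as used in the market.rf example
# below; 'logReg' is an assumed model name.
model.serialize_to_bundle('/tmp', 'logReg', init=True)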

AttributeError: 'Pipeline' object has no attribute 'name'

I tried serializing the pipeline below.

If I remove the init argument from serialize_to_bundle, the error becomes:
"AttributeError: 'OutletTypeEncoder' object has no attribute 'op'"

# importing required libraries
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
import category_encoders as ce
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
# from sklearn.preprocessing import StandardScaler, MinMaxScaler, Imputer, Binarizer, PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator
import mleap.sklearn.pipeline
import mleap.sklearn.feature_union
import mleap.sklearn.base
import mleap.sklearn.logistic
import mleap.sklearn.preprocessing

# read the training data set
data = pd.read_csv('market.csv')

# top rows of the data
# print(data.head(5))

# separate the independent and target variables
train_x = data.drop(columns=['Item_Outlet_Sales'])
train_y = data['Item_Outlet_Sales']

# define the class OutletTypeEncoder.
# This is our custom transformer, which creates 3 new binary columns.
# A custom transformer must have fit and transform methods.
class OutletTypeEncoder(BaseEstimator):

    def __init__(self):
        pass

    def fit(self, documents, y=None):
        return self

    def transform(self, x_dataset):
        x_dataset['outlet_grocery_store'] = (x_dataset['Outlet_Type'] == 'Grocery Store') * 1
        x_dataset['outlet_supermarket_3'] = (x_dataset['Outlet_Type'] == 'Supermarket Type3') * 1
        x_dataset['outlet_identifier_OUT027'] = (x_dataset['Outlet_Identifier'] == 'OUT027') * 1
        return x_dataset

# pre-processing step:
# - drop the listed columns
# - impute the missing values in column Item_Weight by the mean
# - scale the data in the column Item_MRP
pre_process = ColumnTransformer(
    remainder='passthrough',
    transformers=[
        ('drop_columns', 'drop', ['Item_Identifier',
                                  'Outlet_Identifier',
                                  'Item_Fat_Content',
                                  'Item_Type',
                                  'Outlet_Identifier',
                                  'Outlet_Size',
                                  'Outlet_Location_Type',
                                  'Outlet_Type']),
        ('impute_item_weight', SimpleImputer(strategy='mean'), ['Item_Weight']),
        ('scale_data', StandardScaler(), ['Item_MRP'])])

# define the pipeline:
#   Step 1: get the outlet binary columns
#   Step 2: pre-processing
#   Step 3: train a Random Forest model
model_pipeline = Pipeline(steps=[
    ('get_outlet_binary_columns', OutletTypeEncoder()),
    ('pre_processing', pre_process),
    ('random_forest', RandomForestRegressor(max_depth=10, random_state=2))])

# fit the pipeline with the training data
model_pipeline.fit(train_x, train_y)

# read the test data
test_data = pd.read_csv('test.csv')

# predict the target variable on the test data
# model_pipeline.predict(test_data)

# serialize the random forest model
model_pipeline.serialize_to_bundle('/tmp', 'market.rf', init=True)
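
The two AttributeErrors above ('name' on the Pipeline, 'op' on OutletTypeEncoder) both point at missing MLeap metadata: MLeap's serializer expects each step to expose the op/name attributes that its wrapped transformers gain from mlinit(), which a plain custom estimator never acquires. A quick diagnostic sketch over the fitted pipeline:

# Diagnostic sketch: report which steps lack the MLeap metadata that
# serialize_to_bundle expects; custom estimators like OutletTypeEncoder will.
for step_name, step in model_pipeline.steps:
    print(step_name, 'op:', hasattr(step, 'op'), 'name:', hasattr(step, 'name'))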

-

Hello there,

I have an issue running just a basic script:

from river import linear_model
from river import metrics
from river import evaluate
from river import preprocessing
import pandas as pd

data = pd.read_csv("C:/Users/Monster/Desktop/LveR.csv")

# Import label encoder
from sklearn import preprocessing
from sklearn.model_selection import train_test_split

# label_encoder object knows how to understand word labels.
label_encoder = preprocessing.LabelEncoder()

# Encode labels in column 'Class'.
data['Class'] = label_encoder.fit_transform(data['Class'])
data['Class'].unique()

X = data.iloc[:, :-1]
y = data.iloc[:, -1]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=11, test_size=0.25, shuffle=True)

model = (
    preprocessing.StandardScaler() |
    linear_model.LogisticRegression())

metric = metrics.ROCAUC()
evaluate.progressive_val_score(X_test, y_test, model, metric)

My aim is to perform a logistic regression, and I encoded my categorical variable as 0/1.

It says "Pipeline object has no attribute 'works_with'", which has me confused. Any kind of help will be appreciated. Thanks!
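
Two things stand out in the script. First, "from sklearn import preprocessing" shadows the earlier "from river import preprocessing", so the StandardScaler in the pipeline is scikit-learn's rather than river's, which would trip river's compatibility check. Second, progressive_val_score expects a stream of (x, y) pairs, not two arrays. A hedged sketch of the adjusted tail of the script, assuming river's stream.iter_pandas helper and the X_test/y_test split above:

# Hedged sketch: alias sklearn's preprocessing so it no longer shadows
# river's, and feed progressive_val_score a stream of (x, y) pairs.
from river import linear_model, metrics, evaluate, preprocessing, stream
from sklearn import preprocessing as sk_preprocessing  # use this alias for LabelEncoder above

model = (
    preprocessing.StandardScaler() |  # river's scaler, not sklearn's
    linear_model.LogisticRegression())

metric = metrics.ROCAUC()
evaluate.progressive_val_score(
    dataset=stream.iter_pandas(X_test, y_test),
    model=model,
    metric=metric)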

Error
