
recipes-classification-template's Introduction

MLflow: A Machine Learning Lifecycle Platform

MLflow is a platform to streamline machine learning development, including tracking experiments, packaging code into reproducible runs, and sharing and deploying models. MLflow offers a set of lightweight APIs that can be used with any existing machine learning application or library (TensorFlow, PyTorch, XGBoost, etc.), wherever you currently run ML code (e.g. in notebooks, standalone applications, or the cloud). MLflow's current components are:

  • MLflow Tracking: An API to log parameters, code, and results in machine learning experiments and compare them using an interactive UI.
  • MLflow Projects: A code packaging format for reproducible runs using Conda and Docker, so you can share your ML code with others.
  • MLflow Models: A model packaging format and tools that let you easily deploy the same model (from any ML library) to batch and real-time scoring on platforms such as Docker, Apache Spark, Azure ML and AWS SageMaker.
  • MLflow Model Registry: A centralized model store, set of APIs, and UI, to collaboratively manage the full lifecycle of MLflow Models.


Packages

  • PyPI: mlflow, mlflow-skinny
  • conda-forge: mlflow, mlflow-skinny
  • CRAN: mlflow
  • Maven Central: mlflow-client, mlflow-parent, mlflow-scoring, mlflow-spark


Installing

Install MLflow from PyPI via pip install mlflow

MLflow requires conda to be on the PATH for the projects feature.

Nightly snapshots of MLflow master are also available.

Install a lower-dependency subset of MLflow from PyPI via pip install mlflow-skinny. Extra dependencies can be added per desired scenario. For example, pip install mlflow-skinny pandas numpy allows for mlflow.pyfunc.log_model support.
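
For illustration, a minimal sketch of the mlflow.pyfunc.log_model scenario that needs pandas and numpy on top of mlflow-skinny (the AddN model is adapted from the MLflow docs; names are illustrative):

import mlflow.pyfunc

class AddN(mlflow.pyfunc.PythonModel):
    def __init__(self, n):
        self.n = n

    def predict(self, context, model_input):
        # model_input arrives as a pandas DataFrame
        return model_input.apply(lambda column: column + self.n)

with mlflow.start_run():
    mlflow.pyfunc.log_model("add_n_model", python_model=AddN(n=5))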

Documentation

Official documentation for MLflow can be found at https://mlflow.org/docs/latest/index.html.

Roadmap

The current MLflow Roadmap is available at https://github.com/mlflow/mlflow/milestone/3. We are seeking contributions to all of our roadmap items with the help wanted label. Please see the Contributing section for more information.

Community

For help or questions about MLflow usage (e.g. "how do I do X?") see the docs or Stack Overflow.

To report a bug, file a documentation issue, or submit a feature request, please open a GitHub issue.

For release announcements and other discussions, please subscribe to our mailing list ([email protected]) or join us on Slack.

Running a Sample App With the Tracking API

The programs in examples use the MLflow Tracking API. For instance, run:

python examples/quickstart/mlflow_tracking.py

This program uses the MLflow Tracking API, which logs tracking data in ./mlruns. The logged data can then be viewed with the Tracking UI.
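
For orientation, a minimal tracking script of the same shape looks like this (the parameter and metric names are illustrative, not the exact contents of the quickstart script):

import mlflow

with mlflow.start_run():
    mlflow.log_param("alpha", 0.5)        # record a hyperparameter
    mlflow.log_metric("accuracy", 0.91)   # record a result
    # by default, the tracking data is written to ./mlruns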

Launching the Tracking UI

The MLflow Tracking UI will show runs logged in ./mlruns at http://localhost:5000. Start it with:

mlflow ui

Note: Running mlflow ui from within a clone of MLflow is not recommended, as doing so will run the dev UI from source. We recommend running the UI from a different working directory, specifying a backend store via the --backend-store-uri option. Alternatively, see the instructions for running the dev UI in the contributor guide.

Running a Project from a URI

The mlflow run command lets you run a project packaged with an MLproject file from a local path or a Git URI:

mlflow run examples/sklearn_elasticnet_wine -P alpha=0.4

mlflow run https://github.com/mlflow/mlflow-example.git -P alpha=0.4

See examples/sklearn_elasticnet_wine for a sample project with an MLproject file.

Saving and Serving Models

To illustrate managing models, the mlflow.sklearn package can log scikit-learn models as MLflow artifacts and then load them again for serving. There is an example training application in examples/sklearn_logistic_regression/train.py that you can run as follows:

$ python examples/sklearn_logistic_regression/train.py
Score: 0.666
Model saved in run <run-id>

$ mlflow models serve --model-uri runs:/<run-id>/model

$ curl -d '{"dataframe_split": {"columns":[0],"index":[0,1],"data":[[1],[-1]]}}' -H 'Content-Type: application/json'  localhost:5000/invocations

Note: If using MLflow skinny (pip install mlflow-skinny) for model serving, additional required dependencies (namely, flask) will need to be installed for the MLflow server to function.
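
For reference, a condensed sketch of the kind of logging train.py performs (the toy dataset here is made up; only the log_model call and run-ID printout mirror the example):

import mlflow
import mlflow.sklearn
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1.0], [-1.0], [2.0], [-2.0]])
y = np.array([1, 0, 1, 0])

with mlflow.start_run() as run:
    model = LogisticRegression().fit(X, y)
    # log the model under the run's artifacts so `mlflow models serve`
    # can load it via runs:/<run-id>/model
    mlflow.sklearn.log_model(model, "model")
    print(f"Model saved in run {run.info.run_id}")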

Official MLflow Docker Image

The official MLflow Docker image is available on GitHub Container Registry at https://ghcr.io/mlflow/mlflow.

export CR_PAT=YOUR_TOKEN
echo $CR_PAT | docker login ghcr.io -u USERNAME --password-stdin
# Pull the latest version
docker pull ghcr.io/mlflow/mlflow
# Pull 2.2.1
docker pull ghcr.io/mlflow/mlflow:v2.2.1

Contributing

We happily welcome contributions to MLflow. We are also seeking contributions to items on the MLflow Roadmap. Please see our contribution guide to learn more about contributing to MLflow.

Core Members

MLflow is currently maintained by a group of core members, with significant contributions from hundreds of exceptionally talented community members.

recipes-classification-template's People

Contributors

apurva-koti, bbarnes52, dbczumar, harupy, jinzhang21, prithvikannan, sunishsheth2009


recipes-classification-template's Issues

Custom transform is not working

Hello, I have tried to implement a simple transform using OneHotEncoder, but it is not working.

I tested it both ways:

from sklearn.preprocessing import OneHotEncoder

def transformer_fn():
    return OneHotEncoder()

and

from sklearn.preprocessing import OneHotEncoder

def transformer_fn():
    return OneHotEncoder

Error

2023/04/11 15:02:04 INFO mlflow.recipes.utils.execution: ingest, split: No changes. Skipping.
Run MLFlow Recipe step: transform
2023/04/11 15:02:05 INFO mlflow.recipes.step: Running step transform...
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/leonardo-moraes/Git/mlflow-recipes-titanic/.venv/lib/python3.10/site-packages/mlflow/recipes/step.py", line 139, in run
    self.step_card = self._run(output_directory=output_directory)
  File "/home/leonardo-moraes/Git/mlflow-recipes-titanic/.venv/lib/python3.10/site-packages/mlflow/recipes/steps/transform.py", line 148, in _run
    train_transformed = transform_dataset(train_df)
  File "/home/leonardo-moraes/Git/mlflow-recipes-titanic/.venv/lib/python3.10/site-packages/mlflow/recipes/steps/transform.py", line 144, in transform_dataset
    transformed_features = pd.DataFrame(transformed_features, columns=columns)
  File "/home/leonardo-moraes/Git/mlflow-recipes-titanic/.venv/lib/python3.10/site-packages/pandas/core/frame.py", line 797, in __init__
    mgr = ndarray_to_mgr(
  File "/home/leonardo-moraes/Git/mlflow-recipes-titanic/.venv/lib/python3.10/site-packages/pandas/core/internals/construction.py", line 337, in ndarray_to_mgr
    _check_values_indices_shape_match(values, index, columns)
  File "/home/leonardo-moraes/Git/mlflow-recipes-titanic/.venv/lib/python3.10/site-packages/pandas/core/internals/construction.py", line 408, in _check_values_indices_shape_match
    raise ValueError(f"Shape of passed values is {passed}, indices imply {implied}")
ValueError: Shape of passed values is (712, 1), indices imply (712, 329)
make: *** [Makefile:31: steps/transform/outputs/transformer.pkl] Error 1
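
A possible workaround sketch, not a confirmed fix: the shape mismatch suggests the transform step builds its output DataFrame from the encoder's sparse matrix, so forcing dense output may help (note that scikit-learn 1.2 renamed the sparse parameter to sparse_output):

from sklearn.preprocessing import OneHotEncoder

def transformer_fn():
    # dense output so the downstream pd.DataFrame construction sees a plain array;
    # handle_unknown="ignore" avoids errors on unseen categories at inference time
    return OneHotEncoder(handle_unknown="ignore", sparse_output=False)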

Missing temporary file in train step

Hi guys, I'm trying out MLflow Recipes for the first time in an Azure Databricks environment. Until yesterday, everything went fine from ingestion to prediction. Today, however, I'm running into an error saying that MLflow can't find what looks to me like a temporary file while training my model. I really don't do anything fancy in any of these steps and want to use an LGBMClassifier for training. The MLflow version is 2.7.0, but it doesn't seem to work on any other version I tried either.

experiment_name = "experiment_name"

if not mlflow.get_experiment_by_name(experiment_name):
    mlflow.create_experiment(name=experiment_name)
else:
    mlflow.set_experiment(experiment_name)
experiment = mlflow.get_experiment_by_name(experiment_name)

r = Recipe(profile="databricks")
r.clean()
r.inspect()
r.run("ingest")
r.run("split")
r.run("transform")
r.run("train")

Here's what my estimator function looks like in train.py. estimator_params are defined in recipe.yaml.

def estimator_fn(estimator_params: Dict[str, Any] = None):
    from lightgbm import LGBMClassifier

    if estimator_params is None:
        estimator_params = {}
        
    return LGBMClassifier(**estimator_params)

As I said, the same code worked fine for me yesterday, but today I'm running into this error:

Run MLFlow Recipe step: train
2023/09/13 11:09:36 INFO mlflow.recipes.step: Running step train...
2023/09/13 11:09:38 INFO mlflow.recipes.steps.train: Class imbalance of 0.50 is better than 0.3, no need to rebalance
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-f235e133-8940-41eb-b389-d9cf570c187a/lib/python3.10/site-packages/mlflow/recipes/step.py", line 132, in run
    self.step_card = self._run(output_directory=output_directory)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-f235e133-8940-41eb-b389-d9cf570c187a/lib/python3.10/site-packages/mlflow/recipes/steps/train.py", line 373, in _run
    logged_estimator = self._log_estimator_to_mlflow(fitted_estimator, X_train)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-f235e133-8940-41eb-b389-d9cf570c187a/lib/python3.10/site-packages/mlflow/recipes/steps/train.py", line 1270, in _log_estimator_to_mlflow
    return mlflow.sklearn.log_model(
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-f235e133-8940-41eb-b389-d9cf570c187a/lib/python3.10/site-packages/mlflow/sklearn/__init__.py", line 408, in log_model
    return Model.log(
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-f235e133-8940-41eb-b389-d9cf570c187a/lib/python3.10/site-packages/mlflow/models/model.py", line 568, in log
    with TempDir() as tmp:
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-f235e133-8940-41eb-b389-d9cf570c187a/lib/python3.10/site-packages/mlflow/utils/file_utils.py", line 383, in __enter__
    self._path = os.path.abspath(create_tmp_dir())
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-f235e133-8940-41eb-b389-d9cf570c187a/lib/python3.10/site-packages/mlflow/utils/file_utils.py", line 830, in create_tmp_dir
    return tempfile.mkdtemp(dir=repl_local_tmp_dir)
  File "/usr/lib/python3.10/tempfile.py", line 507, in mkdtemp
    _os.mkdir(file, 0o700)
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/repl_tmp_data/ReplId-68395-9c373-e0490-3/tmpuyeyu8co'
make: *** [Makefile:40: steps/train/outputs/model] Error 1

I really don't know what to do, since the stack trace seems to suggest some MLflow-internal error. Any help would be appreciated.

Dictionary for parameters in yaml file

Hi,

I'm trying to use a dictionary to pass a set of parameters to the ingestion.py file via the YAML file.

I use the code below:

steps:
  # Specifies the dataset to use for model development
  ingest: 
      using: "custom"
      location: "../dataset.csv"
      loader_method: read_csv_as_dataframe 
      ingest_params:
          - label: 'fraud_bool'
          - train_sample: 400
          - validation_sample: 200
          - test_sample: 200

In the ingestion.py file, I modified the function to:

def read_csv_as_dataframe(location: str, ingest_params: Dict[str, Any]) -> DataFrame:
    ...

When I print the dictionary, the output gives me "custom" and not the dict.

The objective is to use the YAML file to add flexibility to the pipeline.

How can I do that?
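
One hedged workaround sketch, in case extra kwargs are not forwarded to custom loaders: read the parameters inside the loader itself. The (location, file_format) signature follows the custom-ingest loader examples in the MLflow docs, and params.yaml is a hypothetical side file, not an MLflow feature:

from typing import Any, Dict

import pandas as pd
import yaml

def read_csv_as_dataframe(location: str, file_format: str) -> pd.DataFrame:
    # load the extra parameters ourselves instead of expecting MLflow to pass them
    with open("params.yaml") as f:
        ingest_params: Dict[str, Any] = yaml.safe_load(f)
    df = pd.read_csv(location)
    return df.head(ingest_params.get("train_sample", len(df)))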

The .yaml file cannot be executed in Databricks

The .yaml file cannot be executed in Databricks. Once the profile (databricks.yaml) is configured, Databricks does not allow it to run: "Execution is not supported for the current file type. To execute code, use a file extension like .py, .sql, .r, or .scala."

Hyperparameter tuning bug

I am using the hyperopt hyperparameter tuning functionality for a classification model. In the logs I see:

WARNING mlflow.recipes.steps.train: Failed to build tuning results table due to unexpected failure: 'NoneType' object has no attribute 'keys'

Upon examining the source code, it looks like the warning stems from line 599 of the train step implementation. best_estimator_params is initialized as None on line 368, but it is not updated by any of the ensuing code before the .keys() call on line 599, causing the error.

I am using mlflow==2.4.1 and Python 3.9.5.

Multi-class classification

Hi,
Is there a way to perform multi-class classification?
I got the error "must have a cardinality of 2, found '34'" when I omit the positive_class (I saw that if it's not None, I can't even see the data, but there is no way to set it to None...).

In order to customize this, I would need the ability to change either
./mlflow/recipes/steps/ingest/__init__.py
or
./mlflow/recipes/classification/v1/recipe.py
which, to my understanding, are not steps that can be customized. Any suggestion or reference for multi-class classification?
Thanks,
Amit

Split step does not work and docs are incorrect

I am using the split_ratios method for the split step, with a custom post-split function. The documentation for the split step states "The post-split method should be written in steps/split.py and should accept three parameters: train_df, validation_df, and test_df."

This makes sense; however, the signature in the very next section shows the function taking only one parameter, and indicates that the same function is called on all three datasets:

def create_dataset_filter(dataset: DataFrame) -> Series(bool):

Which would be fine, except that this does not happen. The create_dataset_filter function is only called on the training dataset, leaving the validation and test datasets completely uncleaned. I also cannot pass three parameters to the function as the docs recommend, or else I get create_dataset_filter() missing 2 required positional arguments: 'validation_df' and 'test_df'.

Overall I'm having an incredibly difficult time getting basically anything in this library to work as expected...
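
For contrast, a minimal filter matching the one-parameter signature that is actually called, returning a boolean mask of rows to keep (the dropna-style logic is purely illustrative):

import pandas as pd

def create_dataset_filter(dataset: pd.DataFrame) -> pd.Series:
    # True for rows with no missing values; rows marked False are filtered out
    return dataset.notna().all(axis=1)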

There is redundancy in the logging

There is redundancy in the logging; it basically stores the code three times:

  • one in the transformer
  • another for the training; and
  • another in the git source commit


positive_class in recipes.yaml not found during transform step when running in Databricks

positive_class is defined and set in recipe.yaml, but is not found when running the classification template or the classification example (https://github.com/mlflow/recipes-examples/tree/main/classification) in Databricks.

Databricks Runtime: 11.3 ML LTS
mlflow==2.1.1

Full Stacktrace:

---------------------------------------------------------------------------
MlflowException                           Traceback (most recent call last)
File <command-4167555846036620>:1
----> 1 r.run("transform")

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-25249b5f-a44c-4b1b-a8b5-94ddb9e63a4d/lib/python3.9/site-packages/mlflow/recipes/classification/v1/recipe.py:267, in ClassificationRecipe.run(self, step)
    195 def run(self, step: str = None) -> None:
    196     """
    197     Runs the full recipe or a particular recipe step, producing outputs and displaying a
    198     summary of results upon completion. Step outputs are cached from previous executions, and
   (...)
    265         classification_recipe.run()
    266     """
--> 267     return super().run(step=step)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-25249b5f-a44c-4b1b-a8b5-94ddb9e63a4d/lib/python3.9/site-packages/mlflow/recipes/recipe.py:104, in _BaseRecipe.run(self, step)
    102 if last_executed_step_state.status != StepStatus.SUCCEEDED:
    103     if step is not None:
--> 104         raise MlflowException(
    105             f"Failed to run step '{step}' of recipe '{self.name}'."
    106             f" An error was encountered while running step '{last_executed_step.name}':"
    107             f" {last_executed_step_state.stack_trace}",
    108             error_code=BAD_REQUEST,
    109         )
    110     else:
    111         raise MlflowException(
    112             f"Failed to run recipe '{self.name}'."
    113             f" An error was encountered while running step '{last_executed_step.name}':"
    114             f" {last_executed_step_state.stack_trace}",
    115             error_code=BAD_REQUEST,
    116         )

MlflowException: Failed to run step 'transform' of recipe 'recipes-classification-template'. An error was encountered while running step 'transform': Traceback (most recent call last):
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-25249b5f-a44c-4b1b-a8b5-94ddb9e63a4d/lib/python3.9/site-packages/mlflow/recipes/step.py", line 139, in run
    self.step_card = self._run(output_directory=output_directory)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-25249b5f-a44c-4b1b-a8b5-94ddb9e63a4d/lib/python3.9/site-packages/mlflow/recipes/steps/transform.py", line 105, in _run
    validate_classification_config(self.task, self.positive_class, train_df, self.target_col)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-25249b5f-a44c-4b1b-a8b5-94ddb9e63a4d/lib/python3.9/site-packages/mlflow/recipes/utils/step.py", line 182, in validate_classification_config
    raise MlflowException(
mlflow.exceptions.MlflowException: `positive_class` must be specified for classification/v1 recipes.

recipe.yaml

# `recipe.yaml` is the main configuration file for an MLflow Recipe.
# Required recipe parameters should be defined in this file with either concrete values or
# variables such as {{ INGEST_DATA_LOCATION }}.
#
# Variables must be dereferenced in a profile YAML file, located under `profiles/`.
# See `profiles/local.yaml` for example usage. One may switch among profiles quickly by
# providing a profile name such as `local` in the Recipe object constructor:
# `r = Recipe(profile="local")`
#
# NOTE: All "FIXME::REQUIRED" fields in recipe.yaml and profiles/*.yaml must be set correctly
#       to adapt this template to a specific classification problem. To find all required fields,
#       under the root directory of this recipe, type on a unix-like command line:
#       $> grep "# FIXME::REQUIRED:" recipe.yaml profiles/*.yaml
#
# NOTE: YAML does not support tabs for indentation. Please use spaces and ensure that all YAML
#       files are properly formatted.

recipe: "classification/v1"
# FIXME::REQUIRED: Specifies the target column name for model training and evaluation.
target_col: "bool__did_ctp"
# FIXME::REQUIRED: Specifies the value of `target_col` that is considered the positive class.
positive_class: 1
# FIXME::REQUIRED: Sets the primary metric to use to evaluate model performance. This primary
#                  metric is used to select best performing models in MLflow UI as well as in
#                  train and evaluation step.
#                  Built-in primary metrics are: recall_score, precision_score, f1_score, accuracy_score.
primary_metric: "f1_score"
steps:
  # Specifies the dataset to use for model development
  ingest: {{INGEST_CONFIG}}
  split:
    #
    # FIXME::OPTIONAL: Adjust the train/validation/test split ratios below.
    #
    split_ratios: [0.75, 0.125, 0.125]
    #
    #  FIXME::OPTIONAL: Specifies the method to use to "post-process" the split datasets. Note that
    #                   arbitrary transformations should go into the transform step.
    post_split_filter_method: create_dataset_filter
  transform:
    using: "custom"
    #
    #  FIXME::OPTIONAL: Specifies the method that defines an sklearn-compatible transformer, which
    #                   applies input feature transformation during model training and inference.
    transformer_method: transformer_fn
  train:
    #
    # FIXME::REQUIRED: Specifies the method to use for training. Options are "automl/flaml" for
    #                  AutoML training or "custom" for user-defined estimators.
    using: "automl"
  evaluate:
    #
    # FIXME::OPTIONAL: Sets performance thresholds that a trained model must meet in order to be
    #                  eligible for registration to the MLflow Model Registry.
    #
    # validation_criteria:
    #   - metric: f1_score
    #     threshold: 0.9
  register:
    # Indicates whether or not a model that fails to meet performance thresholds should still
    # be registered to the MLflow Model Registry
    allow_non_validated_model: false
  # FIXME::OPTIONAL: Specify the dataset to use for batch scoring. All params serve the same function
  #                  as in `data`
  # ingest_scoring: {{INGEST_SCORING_CONFIG}}
  # predict:
  #   output: {{PREDICT_OUTPUT_CONFIG}}
  #   model_uri: "models/model.pkl"
  #   result_type: "double"
  #   save_mode: "default"
# custom_metrics:
#   FIXME::OPTIONAL: Defines custom performance metrics to compute during model development.
#     - name: ""
#       function: get_custom_metrics
#       greater_is_better: False
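
As a reference point, driving this recipe from Python takes only a few lines (assuming a matching profile such as profiles/local.yaml exists):

from mlflow.recipes import Recipe

r = Recipe(profile="local")
r.run()  # runs ingest, split, transform, train, evaluate, and register in order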
