seldonio / mlserver

An inference server for your machine learning models, including support for multiple frameworks, multi-model serving and more

Home Page: https://mlserver.readthedocs.io/en/latest/

License: Apache License 2.0

Makefile 0.26% Python 97.92% Shell 0.70% Dockerfile 0.46% Jinja 0.11% JavaScript 0.55%
machine-learning scikit-learn xgboost lightgbm mlflow seldon-core kfserving

mlserver's Introduction

MLServer

An open source inference server for your machine learning models.


Overview

MLServer aims to provide an easy way to start serving your machine learning models through a REST and gRPC interface, fully compliant with KFServing's V2 Dataplane spec. Watch a quick video introducing the project here.

  • Multi-model serving, letting users run multiple models within the same process.
  • Ability to run inference in parallel for vertical scaling across multiple models through a pool of inference workers.
  • Support for adaptive batching, to group inference requests together on the fly.
  • Scalability with deployment in Kubernetes native frameworks, including Seldon Core and KServe (formerly known as KFServing), where MLServer is the core Python inference server used to serve machine learning models.
  • Support for the standard V2 Inference Protocol on both the gRPC and REST flavours, which has been standardised and adopted by various model serving frameworks.

You can read more about the goals of this project on the initial design document.

Usage

You can install the mlserver package by running:

pip install mlserver

Note that to use any of the optional inference runtimes, you'll need to install the relevant package. For example, to serve a scikit-learn model, you would need to install the mlserver-sklearn package:

pip install mlserver-sklearn

For further information on how to use MLServer, you can check any of the available examples.
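
As a quick, illustrative sketch (the model name and artifact path below are placeholders), a scikit-learn model can then be exposed by dropping a small model-settings.json file next to the serialised model:

{
  "name": "iris-sklearn",
  "implementation": "mlserver_sklearn.SKLearnModel",
  "parameters": {
    "uri": "./model.joblib"
  }
}

Running mlserver start . from that folder should then serve the model over both REST and gRPC; see the examples for complete, working walkthroughs.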

Inference Runtimes

Inference runtimes allow you to define how your model should be used within MLServer. You can think of them as the backend glue between MLServer and your machine learning framework of choice. You can read more about inference runtimes in their documentation page.

Out of the box, MLServer comes with a set of pre-packaged runtimes which let you interact with a subset of common frameworks. This allows you to start serving models saved in these frameworks straight away. However, it's also possible to write custom runtimes.
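
For instance, a custom runtime is roughly a subclass of MLModel that overrides load() and predict(). The snippet below is a minimal sketch (the output name, datatype and the trivial "model" are placeholders; check the inference runtimes documentation for the exact interface):

from mlserver import MLModel
from mlserver.types import InferenceRequest, InferenceResponse, ResponseOutput


class MyCustomRuntime(MLModel):
    async def load(self) -> bool:
        # Load the model artifact here (e.g. from the model settings' URI).
        self._model = lambda xs: [float(x) * 2 for x in xs]  # placeholder model
        return True

    async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        # Run inference on the first input tensor and wrap the result
        # in a V2-compliant response.
        data = payload.inputs[0].data
        result = self._model(data)
        return InferenceResponse(
            model_name=self.name,
            outputs=[
                ResponseOutput(
                    name="output-0",
                    shape=[len(result)],
                    datatype="FP32",
                    data=result,
                )
            ],
        )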

Out of the box, MLServer provides support for:

Framework       Documentation
Scikit-Learn    MLServer SKLearn
XGBoost         MLServer XGBoost
Spark MLlib     MLServer MLlib
LightGBM        MLServer LightGBM
CatBoost        MLServer CatBoost
Tempo           github.com/SeldonIO/tempo
MLflow          MLServer MLflow
Alibi-Detect    MLServer Alibi Detect
Alibi-Explain   MLServer Alibi Explain
HuggingFace     MLServer HuggingFace

MLServer is licensed under the Apache License, Version 2.0. However, please note that software used in conjunction with, or alongside, MLServer may be licensed under different terms. For example, Alibi Detect and Alibi Explain are both licensed under the Business Source License 1.1. For more information about the legal terms of products that are used in conjunction with or alongside MLServer, please refer to their respective documentation.

Supported Python Versions

🔴 Unsupported

🟠 Deprecated: To be removed in a future version

🟢 Supported

🔵 Untested

Python Version Status
3.7 🔴
3.8 🔴
3.9 🟢
3.10 🟢
3.11 🔵
3.12 🔵

Examples

To see MLServer in action, check out our full list of examples, which showcases how you can leverage MLServer to start serving your machine learning models.

Developer Guide

Versioning

Both the main mlserver package and the inference runtimes packages try to follow the same versioning schema. To bump the version across all of them, you can use the ./hack/update-version.sh script.

In between releases, we generally keep the version set to a development placeholder for the upcoming version.

For example:

./hack/update-version.sh 0.2.0.dev1

Testing

To run all of the tests for MLServer and the runtimes, use:

make test

To run the tests for a single file, use something like:

tox -e py3 -- tests/batch_processing/test_rest.py

mlserver's People

Contributors

adriangonz, agrski, ascillitoe, axsaucedo, dependabot[bot], dtpryce, edshee, fogdong, github-actions[bot], iamahern, jesse-c, joerunde, johnpaulett, krishanbhasin-gc, m4nouel, mert-kirpici, nanbo-liu, njhill, pablobgar, pauledwardbrennan, pepesi, rafalskolasinski, rio, saeid93, sakoush, salehbigdeli, seldondev, ukclivecox, vanducng, vtaskow


mlserver's Issues

[gRPC] Model outputs get ignored

Hi,

Thank you for this nice project. I am using the following Go code to create a gRPC request for accessing a sklearn model. In my understanding, the model should use the predict_proba() method and return its result as the output "predict_proba"; however, I only get the output "predict" back.

	return &pb.ModelInferRequest{
		ModelName: c.modelName,
		Outputs: []*pb.ModelInferRequest_InferRequestedOutputTensor{
			{
				Name: "predict_proba",
			},
		},
		Inputs: []*pb.ModelInferRequest_InferInputTensor{
			{
				Name:     c.modelName,
				Datatype: "FP32",
				Shape:    []int64{1, int64(len(profile))},
				Contents: &pb.InferTensorContents{
					Fp32Contents: profile,
				},
			},
		},
	}

If I add an output {Name: "test"}, the code in https://github.com/SeldonIO/MLServer/blob/0936e26354a6df77c7692faaa9c9467f5f674573/runtimes/sklearn/mlserver_sklearn/sklearn.py should raise an InferenceError exception at line 58, but this does not happen; I assume that payload.outputs is None. I have also tested adding another input, which leads to the exception at line 46.

Am I setting the Outputs accordingly? How can I access sklearn's predict_proba method?
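
For comparison, the equivalent request in the REST flavour of the V2 protocol would look roughly like the sketch below (the input name, shape and data values are placeholders):

{
  "inputs": [
    {
      "name": "input-0",
      "datatype": "FP32",
      "shape": [1, 4],
      "data": [0.1, 0.2, 0.3, 0.4]
    }
  ],
  "outputs": [
    { "name": "predict_proba" }
  ]
}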

Add support for tracing

Trace each inference step within MLServer. These traces can be pushed to Jaeger or similar OpenTracing backends.

Add support to provide feedback

Seldon Core currently supports providing a “reward signal” as feedback for a model’s predictions. This is received by the model as a request sent to a /feedback endpoint. Since this modifies the server protocol, it would be good to consider adding this as a server extension for MLServer.

Add lockfile to MLServer

To better support use cases where mlserver is used as a library, we shouldn't restrict the dependencies versions too much. Instead, we should look into adding some sort of lockfile (e.g. through Poetry or Conda) that locks the versions in the Docker image (so that the "app-level" environment is kept consistent).

Package up environment as part of MLServer CLI

MLServer now has built-in support to unpack and activate a conda-pack tarball. This feature could be leveraged to run the custom environment defined in the conda.yaml file usually present in MLflow model artifacts. Since MLServer expects a tarball, this issue should explore best practices for going from a conda.yaml file to a conda-pack tarball, reducing the potential friction for the user.

One potential solution is to include a utility on the mlserver-mlflow package which bridges this gap for the users.

V2 Inference Format deviation

Describe the bug

When I apply an XGB model using the Seldon Core XGBoost server, the V2 inference endpoint returns a non-standard response. The outputs.data tensor should be a flattened array, but Seldon currently responds with a nested NumPy tensor.

To reproduce

According to the spec, inference responses should return the shape of the data, while the data itself should be a flattened array.

My data object should be [0.005064773838967085, 0.007540544960647821, 0.9873946905136108, 0.9904587268829346, 0.005667048040777445, 0.00387418526224792]

With Seldon right now, the data comes back as a nested, multidimensional array.

{'model_name': 'classifier', 'model_version': 'v1', 'id': '0', 'parameters': None, 'outputs': [{'name': 'predict', 'shape': [2, 3], 'datatype': 'FP32', 'parameters': None, 'data': [[0.005064773838967085, 0.007540544960647821, 0.9873946905136108], [0.9904587268829346, 0.005667048040777445, 0.00387418526224792]]}]}
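
For reference, a spec-compliant response for the same result would keep the [2, 3] shape but flatten the data, roughly like this (sketch based on the values above):

{
  "model_name": "classifier",
  "model_version": "v1",
  "id": "0",
  "outputs": [
    {
      "name": "predict",
      "shape": [2, 3],
      "datatype": "FP32",
      "data": [0.005064773838967085, 0.007540544960647821, 0.9873946905136108, 0.9904587268829346, 0.005667048040777445, 0.00387418526224792]
    }
  ]
}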

Environment

Model Details

  • Images of your model: [Output of: kubectl get seldondeployment -n <yourmodelnamespace> <seldondepname> -o yaml | grep image: where <yourmodelnamespace>]
  • Logs of your model: [You can get the logs of your model by running kubectl logs -n <yourmodelnamespace> <seldonpodname> <container>]

YAML Spec through Seldon Core

apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: xgboost
  labels: 
    mlctl_name: seldon-xgb-iris
    mlctl_type: model_deployment
spec:
  name: iris
  protocol: kfserving # Activate the V2 protocol
  predictors:
  - graph:
      children: []
      implementation: XGBOOST_SERVER
      modelUri: s3://iris
      envSecretRefName: seldon-init-container-secret
      name: classifier
    name: default
    replicas: 1

Publish in PyPi

The mlserver package is currently not published on PyPI. This leads to workarounds (e.g. pip install git+github.com/SeldonIO/mlserver) which are not ideal.

Explore content parsing / casting

Seldon Core currently allows parsing / casting the user’s payload into a friendlier format. For example, if the user sends a payload such as {"jsonData": {"foo": "bar"}}, their def predict(payload) method will receive a JSON object. Likewise, if they send a payload such as {"data": {"ndarray": [0, 1, 2, 3]}}, their predict method will receive a Numpy array.

Currently, the parsed types supported by SC are:

  • Strings, sent as strData.
  • JSON (also valid through gRPC), sent as jsonData.
  • Binary data, encoded as base64, sent as binData.
  • Numpy arrays, sent as data.ndarray or data.tensor.
  • TFServing arrays, sent as tftensor.

However, the V2 Data Plane only allows sending data as either BYTES, BOOL or numeric formats (e.g. FP32). It could be useful to extend this so that the user can provide more information about the payload, which would then allow MLServer to cast it to other types.

Proposal

The idea is that the server reads the raw content from the data field (as per
the V2 data plane), but then looks at an extra key under parameters.content_type which
dictates how that content should be parsed. This could mean reading it as a Numpy array, a JSON dictionary or any other type.

This extension could be implemented as a middleware (going in and out).

Example for a string

The datatype can be raw BYTES, with parameters.content_type set to string.

{
  "datatype": "BYTES",
  "data": "this is my query",
  "parameters": {
    "content_type": "string"
  }
}

A server with this extension could then read the data field and decode it as UTF8 to treat it as a string.

Example for a Numpy array

The datatype can be set to FP32, and the user can specify that the payload should be treated as a Numpy array by setting parameters.content_type to ndarray.

{
  "datatype": "FP32",
  "data": [0, 12, 2, 3, 4, 5],
  "parameters": {
    "content_type": "ndarray"
  }
}

A server with this extension could then read the data field and parse that as a Numpy array.

Example for JSON

The datatype can be raw BYTES, with parameters.content_type set to json.

{
  "datatype": "BYTES",
  "data": "{ 'foo': 'bar' }",
  "parameters": {
    "content_type": "json"
  }
}

A server with this extension could then read the data field and parse that as
JSON.

Example for an image

The datatype and data fields would respect the V2 API, but the user can
specify an image type through parameters.content_type.

{
  "datatype": "FP32",
  "data": [0, 1, 2, 3, 3, 4, 5, 6, 6, 67, 7, 7],
  "parameters": {
    "content_type": "image"
  }
}

The server could then read the data and transform it into a PIL.Image instance.
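
Putting the examples together, the middleware could boil down to a small dispatch on parameters.content_type. The helper below is a rough, hypothetical sketch (not MLServer's actual implementation; the image case assumes the tensor's shape describes H x W x C pixel values):

import json

import numpy as np
from PIL import Image


def decode_input(tensor: dict):
    """Decode a single V2 input tensor according to its content_type hint."""
    content_type = (tensor.get("parameters") or {}).get("content_type")
    data = tensor["data"]

    if content_type == "string":
        # Raw BYTES decoded as UTF-8 text.
        return data if isinstance(data, str) else bytes(data).decode("utf-8")
    if content_type == "json":
        # Raw BYTES parsed as a JSON document.
        return json.loads(data)
    if content_type == "ndarray":
        # Numeric data reshaped into a NumPy array.
        return np.array(data).reshape(tensor.get("shape", [len(data)]))
    if content_type == "image":
        # Numeric data reinterpreted as pixel values.
        pixels = np.array(data, dtype=np.uint8).reshape(tensor["shape"])
        return Image.fromarray(pixels)

    # No content_type hint: return the raw data untouched.
    return data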

Support Alibi Detect as Prepackaged Server

The Alibi Detect server could become a pre-packaged server in itself, as it has a similar workflow to normal models. This means it could reuse much of the workflow already implemented in Seldon Core's alibi-detect-server to build a base that can import outlier detectors, drift detectors and adversarial detectors from Alibi Detect.

Add support for poll mode

Watch for changes on the model repository to refresh models automatically whenever there are new versions of the model artifacts.

Support MLflow current protocol

As a follow-up to #167, it would be interesting to explore adding a custom endpoint to the mlserver-mlflow runtime which supports MLflow's existing API. This would help reduce friction on user adoption of MLServer, as well as serve as a temporary stopgap for users while they adopt the V2 protocol.

Explore buildpacks

It seems that buildpacks offer an easy way to go from code to image. This could be leveraged by MLServer to ease the process of building custom inference runtimes.

Expose MLflow's model signature types as "annotated" model metadata

As a follow-up to #163 and #164, it would be interesting to explore whether it's possible to expose the type information lost when converting from MLflow's model signature to the V2 Dataplane, through annotations in the V2 model metadata.

For example, in the case of MLflow's string type, one could annotate the V2 Dataplane metadata for that particular input saying that even though it should be encoded as BYTES at the low-level, it's meant to be compatible with MLflow's string type.

Multi-language Runtimes

Create multi-language wrappers that can run Java, C++ and R models. For this, we can build on the existing research in Seldon Core, which leverages tools like JNI and PyBind to bridge from Python to other runtimes.

Infer model's name from folder's name

Currently, particularly in the case of MMS, MLServer requires models to specify their name in a model-settings.json file. This forces all models to ship that file alongside their model artifacts.

It would be good to, instead, infer the model's name from the folder name (if not present in env). This would reduce friction on adopting MLServer, the V2 protocol and MMS.
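
A minimal sketch of that fallback could look like the hypothetical helper below (names are illustrative only):

import os
from typing import Optional


def resolve_model_name(model_folder: str, settings_name: Optional[str] = None) -> str:
    # Prefer the name from model-settings.json (or the environment);
    # otherwise, fall back to the folder's name.
    return settings_name or os.path.basename(os.path.normpath(model_folder))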

Add support for Alibi Explain

It would be good to add a custom extension that exposes an "explain" endpoint, similar to what's currently exposed in SC and KFServing. This would involve loading an explainer as a model.

Allow to load settings and model-settings from CLI flags

Currently, mlserver relies on having settings.json and model-settings.json files present, or falls back to environment variables. It would be good to also allow users to specify these settings directly through CLI flags.

For that, we should look for an integration between Pydantic (which we use to define the settings parameters) and some CLI library. We are currently using click for our CLI, but there doesn't seem to be an off-the-shelf integration between the two projects.
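
As a rough illustration of the idea (not MLServer's actual CLI; assumes Pydantic v1-style BaseSettings and placeholder setting names), CLI flags could simply be collected by click and passed as overrides when instantiating the settings:

import click
from pydantic import BaseSettings


class Settings(BaseSettings):
    http_port: int = 8080
    grpc_port: int = 8081


@click.command()
@click.option("--http-port", type=int, default=None)
@click.option("--grpc-port", type=int, default=None)
def start(http_port, grpc_port):
    # Only the flags that were actually provided override file / env settings.
    provided = {"http_port": http_port, "grpc_port": grpc_port}
    overrides = {key: value for key, value in provided.items() if value is not None}
    settings = Settings(**overrides)
    click.echo(f"Starting MLServer with {settings}")


if __name__ == "__main__":
    start()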

Add support for dataframe inputs to MLflow models

As a follow-up to #160, some MLflow models require their inputs to be in the form of a Pandas Dataframe. In order to support these models, it would be good if the mlserver-mlflow runtime could convert V2 payloads to Pandas Dataframes.

Note that this should be straightforward for V2 payloads with multiple "input heads", where each input head can be treated as a column. That is, a payload such as:

{
  "inputs": [
    {
        "name": "a",
        "data": [1, 4],
        "shape": [2],
        "datatype": "INT32"
    },
      {
        "name": "b",
        "data": [2, 5],
        "shape": [2],
        "datatype": "INT32"
    },
      {
        "name": "c",
        "data": [3, 6],
        "shape": [2],
        "datatype": "INT32"
    }
  ]
}

could be encoded as the following Pandas DataFrame:

{
    "columns": ["a", "b", "c"],
    "data": [[1, 2, 3], [4, 5, 6]]
}
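
As a rough sketch (illustrative only), that conversion could treat each V2 input head as a DataFrame column:

import pandas as pd


def inputs_to_dataframe(payload: dict) -> pd.DataFrame:
    # Each input head becomes a column; rows follow the order of the data.
    return pd.DataFrame({inp["name"]: inp["data"] for inp in payload["inputs"]})


# For the payload above, this yields a DataFrame with columns a, b, c
# and rows [1, 2, 3] and [4, 5, 6].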

Add support for custom gRPC endpoints

On top of #167, it would be great to extend the support for custom endpoints to gRPC calls as well. However, it's not clear at the moment whether this is easily achievable.

Explore "tags" extension to model metadata

The model metadata, as defined by the V2 Dataplane, currently has a hardcoded set of data types to specify for each input (e.g. BYTES, INT32, etc.). While these types are great for encoding lower-level representations, they can fall short in cases where we need to supply information about a "richer", higher-level data type.

For example, we can think of an image input. We can currently encode image objects as BYTES inputs. However, the model metadata doesn't provide any information about how this encoding should look. Should it be RGB, BGR or 8-bit greyscale? What image size does the model expect?

Since this information can be quite arbitrary (and probably shouldn't be explicitly defined in the protocol), it could be interesting to explore extending the model metadata schema to support a simple (1-level) string-to-string dictionary which lets the user encode information such as how an input should be encoded.

Content conversion / casting

To prove some of the value of this extension, this issue should explore how MLServer can leverage this extra information. During the early stages of the project, there was a discussion of how we could implement a "payload conversion" pipeline. This pipeline could use some of the extra metadata to convert "raw inputs" (e.g. BYTES) into fully decoded objects (e.g. a pillow.Image instance).

Expose MLflow's model signature as metadata

MLflow models can define a model signature. The information contained in this model signature has a few parallels with the V2 Dataplane's model metadata. Therefore, it would be good to explore how we could convert from the model signature (which should be present in an MLflow model artifact) to the model metadata on the fly, so that the same information can still be exposed.

Note that there is currently a mismatch between MLflow's native types and the V2 Dataplane accepted data types (e.g. string vs BYTES). Therefore, there may be some loss of information when converting from the former to the latter format. We can explore further down the line how this loss of information can be minimised.

Add MLflow runtime with basic payload support

Add a new mlserver-mlflow runtime which allows a user to point to an MLflow Model artifact (or folder) to load a model. As an initial step, the mlserver-mlflow runtime should take care of converting the V2 Dataplane payload to a "dict of tensors", which is one of the formats expected by MLflow models.

To translate this, we could just turn the V2 input into an "index", where the keys would be the inputs[].name fields. That is, an input such as:

{
  "inputs": [
    {
        "name": "a",
        "data": [1, 4],
        "shape": [2],
        "datatype": "INT32"
    },
      {
        "name": "b",
        "data": [2, 5],
        "shape": [2],
        "datatype": "INT32"
    },
      {
        "name": "c",
        "data": [3, 6],
        "shape": [2],
        "datatype": "INT32"
    }
  ]
}

could be turned into the following MLflow-compatible dictionary of tensors:

{
    "inputs": {
          "a": [1, 4],
          "b": [2, 5],
          "c": [3, 6]
      }
}
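
A rough sketch of that conversion (illustrative only, ignoring datatypes) could look like:

import numpy as np


def inputs_to_tensor_dict(payload: dict) -> dict:
    # Index each tensor by its input name, reshaping according to the V2 shape field.
    return {
        inp["name"]: np.array(inp["data"]).reshape(inp["shape"])
        for inp in payload["inputs"]
    }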

Tensors vs Dataframes

While some MLflow models require their inputs to be encoded as dataframes, others will still need a dictionary of tensors (see #160). To account for this, the scope of this issue includes looking at ways to infer which type of input a model requires, as well as providing a way for the user to choose which input type to use.

The latter could be done through the V2 Protocol's inputs[].parameters field, by setting a "magic key" (e.g. mlflow_encoding: dataframe). This key can then be read by the mlserver-mlflow runtime to choose one encoding or the other.

runtime packages have no source

I'm not a python expert but I think there is an issue with python packages for mlserver runtimes.
If I do the following:
pip install mlserver mlserver-xgboost mlserver-sklearn
It appears to install all 3 packages; however, only mlserver has a source directory.

First to show where package is installed:

(xgboost-env) williao@williao-G3-3579:~/dev/python/xgboost/sklearn-demo$ pip show mlserver-xgboost
Name: mlserver-xgboost
Version: 0.2.0
Summary: XGBoost runtime for MLServer
Home-page: https://github.com/SeldonIO/MLServer.git
Author: Seldon Technologies Ltd.
Author-email: [email protected]
License: Apache 2.0
Location: /home/williao/dev/python/xgboost/xgboost-env/lib/python3.7/site-packages
Requires: mlserver, xgboost
Required-by: 

Then if I list the mlserver package dirs

ls /home/williao/dev/python/xgboost/xgboost-env/lib/python3.7/site-packages/mlserver*
/home/williao/dev/python/xgboost/xgboost-env/lib/python3.7/site-packages/mlserver:
cli  errors.py  grpc  handlers  __init__.py  model.py  __pycache__  registry.py  repository.py  rest  server.py  settings.py  types  utils.py  version.py

/home/williao/dev/python/xgboost/xgboost-env/lib/python3.7/site-packages/mlserver-0.2.0.dist-info:
entry_points.txt  INSTALLER  LICENSE  METADATA  RECORD  top_level.txt  WHEEL

/home/williao/dev/python/xgboost/xgboost-env/lib/python3.7/site-packages/mlserver_sklearn-0.2.0.dist-info:
INSTALLER  LICENSE  METADATA  RECORD  top_level.txt  WHEEL

/home/williao/dev/python/xgboost/xgboost-env/lib/python3.7/site-packages/mlserver_xgboost-0.2.0.dist-info:
INSTALLER  LICENSE  METADATA  RECORD  top_level.txt  WHEEL

You can see that the mlserver package has a source dir (mlserver) and a dist-info dir (mlserver-0.2.0.dist-info), but mlserver-xgboost and mlserver-sklearn only have dist-info dirs.

If I run mlserver start, I get an error that the package/module is missing:

mlserver start .
implementation
  ensure this value contains valid import path or valid callable: No module named 'mlserver_sklearn' (type=type_error.pyobject; error_message=No module named 'mlserver_sklearn')

The workaround for me is to run pip install -r requirements.txt on a file containing:

git+https://github.com/seldonio/mlserver#egg=mlserver
git+https://github.com/seldonio/mlserver#egg=mlserver-xgboost&subdirectory=runtimes/xgboost
git+https://github.com/seldonio/mlserver#egg=mlserver-sklearn&subdirectory=runtimes/sklearn

After this new install there is a source directory:

(xgboost-env) williao@williao-G3-3579:~/dev/python/xgboost$ ls /home/williao/dev/python/xgboost/xgboost-env/lib/python3.7/site-packages/mlse*
/home/williao/dev/python/xgboost/xgboost-env/lib/python3.7/site-packages/mlserver:
cli  errors.py  grpc  handlers  __init__.py  model.py  __pycache__  registry.py  repository.py  rest  server.py  settings.py  types  utils.py  version.py

/home/williao/dev/python/xgboost/xgboost-env/lib/python3.7/site-packages/mlserver-0.2.1.dev0-py3.7.egg-info:
dependency_links.txt  entry_points.txt  installed-files.txt  PKG-INFO  requires.txt  SOURCES.txt  top_level.txt

/home/williao/dev/python/xgboost/xgboost-env/lib/python3.7/site-packages/mlserver_sklearn:
__init__.py  __pycache__  sklearn.py  version.py

/home/williao/dev/python/xgboost/xgboost-env/lib/python3.7/site-packages/mlserver_sklearn-0.2.1.dev0-py3.7.egg-info:
dependency_links.txt  installed-files.txt  PKG-INFO  requires.txt  SOURCES.txt  top_level.txt

/home/williao/dev/python/xgboost/xgboost-env/lib/python3.7/site-packages/mlserver_xgboost:
__init__.py  __pycache__  version.py  xgboost.py

/home/williao/dev/python/xgboost/xgboost-env/lib/python3.7/site-packages/mlserver_xgboost-0.2.1.dev0-py3.7.egg-info:
dependency_links.txt  installed-files.txt  PKG-INFO  requires.txt  SOURCES.txt  top_level.txt

and mlserver start . works

Empty fields (except `name`) in the returned metadata when using mlserver-sklearn in KFServing

Hi experts,

I was trying to follow the steps listed in https://github.com/kubeflow/kfserving/tree/master/docs/samples/v1beta1/sklearn/v2 to set up an InferenceService supporting the V2 protocol. The infer interface works fine. However, when I tried to retrieve the model metadata via requests like:

curl -v -H "Host: ${SERVICE_HOSTNAME}" http://${INGRESS_HOST}:${INGRESS_PORT}/v2/models/sklearn-irisv2

All I got is

{"name":"sklearn-irisv2","versions":[],"platform":"","inputs":[],"outputs":[]}

Is this expected, or am I missing anything? I want to get the input/output tensor metadata.

Let me know if I should post it in KFServing repo instead. Thanks for the help!

Support custom endpoints and payloads

The V2 Dataplane allows for the concept of “extensions”: endpoints outside of the V2 Protocol specification, which take any arbitrary payload. However, this would usually require outlining the full gRPC spec of the new endpoint. An alternative could be to provide a new endpoint which only works over HTTP. While this may be sub-optimal, it would still offer a way for users to write their own custom protocols.

The capability to provide new endpoints would be introduced at the runtime level. That is, an inference runtime would be able to register a “custom endpoint”.

class CustomRuntime(MLModel):
    # ...

    @mlserver.custom_endpoint("/my-custom-endpoint")
    def invocations(self, payload: dict) -> dict:
        # Parse custom protocol payload and call model
        pass 

This would register an endpoint as /models/<model-name>/versions/<version>/<custom-path>. When a request is sent to this endpoint, MLServer would check whether the inference runtime used for model <model-name> supports the custom endpoint with path <custom-path>. If it doesn’t, it would then return a 404. Otherwise, it would route the request to a method registered as the handler for the custom endpoint in the inference runtime.

Add support for metrics

Add a custom extension which exposes a metrics API that can be scraped by Prometheus. We can base this on Triton's statistics extension to ensure API-wise parity.

We could also use this chance to explore OpenTelemetry and whether we could expose vendor-agnostic metrics.

Handle multiple models with custom endpoints

Following up from #167, there are a few things to take into account before adding support for custom endpoints across multiple models.

At the moment we just load the route as-is, without namespacing it with the model name. This means that loading multiple models could lead to clashes. For example, let's think of an inference runtime which registers a custom endpoint with path /foo. After loading 10 model instances using that runtime, it wouldn't be clear any more which one should be used to serve the custom endpoint with path /foo.

Some solutions that could be explored are:

  • Disable custom endpoints when MMS is enabled. This would require adding the option to disable MMS.
  • Namespace the custom path, e.g. registering /v2/models/<model-name>/versions/<model-version>/foo instead of just /foo. This could be easy to tackle, but it's not clear whether the custom endpoint would be as useful after adding the extra prefix (i.e. mainly as it would make it incompatible with legacy upstream services).

Parallel Inference

With the support for MMS, there is a question of how we can support parallel inference at the same time across multiple models. The main blocker for this is Python's GIL, which blocks true parallelism within the same process.

The scope of this issue is to explore different alternatives (e.g. using multiprocessing) to support parallel inference.
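
As a very rough illustration of the multiprocessing route (not MLServer's actual design), CPU-bound predictions could be dispatched to a pool of worker processes from the async server, sidestepping the GIL:

import asyncio
from concurrent.futures import ProcessPoolExecutor


def _predict(batch):
    # Stand-in for a CPU-bound model call; runs in a separate process.
    return sum(batch)


async def parallel_predict(batches):
    loop = asyncio.get_running_loop()
    with ProcessPoolExecutor(max_workers=4) as pool:
        tasks = [loop.run_in_executor(pool, _predict, batch) for batch in batches]
        return await asyncio.gather(*tasks)


if __name__ == "__main__":
    print(asyncio.run(parallel_predict([[1, 2, 3], [4, 5, 6]])))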

Alibi Runtime

Create an "inference" runtime that lets you run Alibi Detectors and Explainers.

Getting started and usage documentation

There are currently no docs for MLServer. We should work on adding an initial set of documentation that allows users to get kickstarted on using MLServer. Initially, we could focus on adding just a "getting started" and a "usage" section to the README.md page.

Add request-level content type

Following #163, it would be good to extend the content_type concept to entire requests. That is, encoding / decoding a set of inputs that go together, like a DataFrame or a dictionary of tensors.

Convert input types based on MLflow's model signature

MLflow models provide a typing mechanism through their model signatures. These types are used to validate the input payloads, so it is important that the mlserver-mlflow runtime takes them into account when converting from the V2 Protocol to an MLflow-valid input (see #160 and #161).

There is currently a mismatch between the types supported by the V2 Dataplane and MLflow native types. For example,

  • MLflow allows you to set an input column as a string, whereas BYTES is the closest type in the V2 Dataplane.
  • MLflow allows you to set an input column as binary, which is expected to be a base64 string. This is different to the BYTES field in the V2 Dataplane (which doesn't explicitly specify how this field needs to be transmitted).

This conversion will also need to take into account how to convert between the V2 "lower-level" types and MLflow's native types.
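
An illustrative (and deliberately non-exhaustive) mapping between MLflow signature types and V2 Dataplane datatypes could look like the sketch below; the string and binary rows are the lossy cases discussed above:

MLFLOW_TO_V2_DATATYPE = {
    "boolean": "BOOL",
    "integer": "INT32",
    "long": "INT64",
    "float": "FP32",
    "double": "FP64",
    "string": "BYTES",  # lossy: V2 BYTES carries no "this is text" hint
    "binary": "BYTES",  # lossy: base64 vs raw bytes is not captured by V2
}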
