meta-soul / metaspore

A unified end-to-end machine intelligence platform

License: Apache License 2.0

CMake 1.84% Shell 0.11% C++ 31.55% C 0.03% Python 39.83% Thrift 0.07% Jupyter Notebook 0.60% Java 25.56% Smarty 0.15% Dockerfile 0.24% Lua 0.02%

abtesting ai deeplearning machinelearning serving training

metaspore's Issues

[demo] Text-to-Image Multimodal Retrieval Demo

A text-to-image semantic retrieval demo based on the Unsplash Lite 2.5K image dataset. It lets users search for images using natural-language text.

The demo includes the following parts:

online retrieval pipeline service
offline model export, data fetch and index

[serving] consul watch&load deployment on k8s

[demo][bug] Negative sampling category problem

All negative samples have the same category (genre) as the corresponding positive samples. The category should instead correspond to the item id (movie_id) of each sample:
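The intended behaviour can be sketched as follows. This is a minimal illustration, not the demo's actual sampler: the function `sample_negatives`, the `movie_genre` lookup table, and the record fields are all hypothetical names chosen for the example.

```python
import random


def sample_negatives(positive, all_movie_ids, movie_genre, seen, num_neg, rng):
    """Draw negatives for one positive sample (hypothetical helper)."""
    candidates = [m for m in all_movie_ids if m not in seen]
    negatives = []
    for movie_id in rng.sample(candidates, min(num_neg, len(candidates))):
        negatives.append({
            "user_id": positive["user_id"],
            "movie_id": movie_id,
            # Look the genre up by the sampled movie_id. Copying
            # positive["genre"] here would reproduce the reported bug.
            "genre": movie_genre[movie_id],
            "label": 0,
        })
    return negatives


# Toy data, made up for illustration only.
movie_genre = {1: "Comedy", 2: "Drama", 3: "Action", 4: "Horror"}
positive = {"user_id": 7, "movie_id": 1, "genre": "Comedy", "label": 1}
negatives = sample_negatives(
    positive, list(movie_genre), movie_genre,
    seen={1}, num_neg=2, rng=random.Random(0),
)
```

Each negative keeps the user of the positive sample but carries the genre of its own movie_id.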

[deployment] support spark image build

Spark requires extra settings for the entrypoint and the user inside the container.

[demo] rename the `loan overdue` project to `loan default`

Rename the loan overdue project to loan default and fix some runtime errors.

[serving][bug] gcc 11 build would crash for double free

In a lambda submitted to co_spawn, an object captured by copy is released twice when built with gcc 11.
Possibly related to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100611 .
Capturing by reference works as a workaround.

[training] support multiple model outputs

[demo] DCN V2

Add DCN V2 to CTR Demo and give benchmark on MovieLens and Criteo datasets.

DCN V2
criteo_d5:
Train AUC: 0.7487 Test AUC: 0.7290
1m uid mid:
Train AUC: 0.8901 Test AUC: 0.8611
25m uid mid:
Train AUC: 0.8888 Test AUC: 0.8323

[algos][demo] Implement DSSM model and demonstrate the effect for recall stage on MovieLens datasets.

Implement the DSSM model and demonstrate its effect in the recall stage on the MovieLens datasets. See the DSSM paper.
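As a rough illustration of how a two-tower model is used at recall time, the sketch below assumes the user tower and item tower have already produced embedding vectors and ranks items by cosine similarity. The function names and the toy vectors are hypothetical and are not taken from the MetaSpore implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recall_top_k(user_vec, item_vecs, k):
    """Return the ids of the k items most similar to the user embedding."""
    scored = sorted(
        item_vecs.items(),
        key=lambda kv: cosine(user_vec, kv[1]),
        reverse=True,
    )
    return [item_id for item_id, _ in scored[:k]]


# Toy embeddings standing in for the outputs of the two towers.
item_vecs = {"m1": [1.0, 0.0], "m2": [0.7, 0.7], "m3": [0.0, 1.0]}
top = recall_top_k([0.9, 0.1], item_vecs, k=2)
```

In a production setting the brute-force sort would normally be replaced by an approximate nearest-neighbour index.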

[demo] DCN

Add DCN to CTR Demo and give benchmark on MovieLens and Criteo datasets.

25m uid mid:
Train AUC: 0.8972 Test AUC: 0.8430
1m uid mid:
Train AUC: 0.9021 Test AUC: 0.8746
criteo_d5:
Train AUC: 0.7413 Test AUC: 0.7304

[serving] support preprocessing rpc request with Python preprocessor

[demo] MaximalMarginalRelevanceDiversifier demo

Add diversification algorithms. We implemented a diversification model, MaximalMarginalRelevanceDiversifier, based on the method described in the paper "The Use of MMR, Diversity-Based Reranking for Reordering Documents and Producing Summaries". Compared with SimpleDiversifier, MaximalMarginalRelevanceDiversifier can take information from multiple dimensions into account.
After completing the unit tests of MaximalMarginalRelevanceDiversifier, we integrated it into the pipeline. In addition, we updated the configuration of the diversify method in the Consul file.
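The greedy selection rule behind MMR can be sketched as below. This is a generic illustration of the technique, not the code of MaximalMarginalRelevanceDiversifier: the function `mmr_rerank`, the trade-off parameter `lambda_`, and the genre-based similarity are assumptions made for the example.

```python
def mmr_rerank(items, relevance, similarity, lambda_=0.7, k=None):
    """Greedy MMR: trade relevance against similarity to already chosen items."""
    k = len(items) if k is None else k
    selected = []
    remaining = list(items)
    while remaining and len(selected) < k:
        def mmr_score(item):
            # Redundancy is the highest similarity to anything already selected.
            redundancy = max((similarity(item, s) for s in selected), default=0.0)
            return lambda_ * relevance[item] - (1 - lambda_) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: two comedies with high relevance and one drama with lower relevance.
genres = {"a": "Comedy", "b": "Comedy", "c": "Drama"}
relevance = {"a": 0.9, "b": 0.85, "c": 0.6}


def same_genre(x, y):
    return 1.0 if genres[x] == genres[y] else 0.0


ranked = mmr_rerank(["a", "b", "c"], relevance, same_genre, lambda_=0.5)
```

With `lambda_=0.5` the drama is promoted ahead of the second comedy, because the second comedy is penalised for duplicating the genre of the first pick.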

[serving] unify metaspore-serving-bin init and load directory structure

Currently, metaspore-serving-bin init and load use different directory structures: the latter contains the model version in the path, which makes it impossible for metaspore-serving-bin to init from a previously loaded model directory.

[code] remove out-dated files

Some files are already outdated, e.g. old compile/build scripts.

[demo] Implement scorecard demo based on loan default probability

Implement scorecard demo based on loan default probability.

[movielens demo] add python.zip when we submit `fg_movielens.py` PySpark job

For the MovieLens demo, we had better create python.zip before we submit the fg_movielens.py PySpark job.

import subprocess

from pyspark.sql import SparkSession

def init_spark():
    # Added line: bundle the helper modules into python.zip so that
    # spark.submit.pyFiles can ship them to the executors.
    subprocess.run(['zip', '-r', './python.zip', 'fg_neg_sampler.py', 'fg_sparse_features_extractor.py', 'fg_gbm_features_extractor.py'], cwd='./')
    spark = (SparkSession.builder
        .appName('MovieLens Demo')
        .config("spark.executor.memory","10G")
        .config("spark.submit.pyFiles", "python.zip")
        .config("spark.executor.instances","4")
        .config("spark.network.timeout","500")
        .getOrCreate())
    ...

Moreover, please rename the function generate_spare_features to generate_sparse_features in fg_sparse_features_extractor.py.

[training][bug] statically linking libgcc_s and libstdc++ could cause dead loop in torch script jit

[serving][bug] build break under gcc

Warnings about lambdas with implicit this pointer capture;
multiple definition of `absl::lts_20211102::Status::kMovedFromString`

[serving] Support loading model from dir via rpc call

[demo] MMR Diversifier based on LinkedList

Update the implementation of MaximalMarginalRelevanceDiversifier from the original List-based implementation to a LinkedList-based implementation.

[serving][bug] Arrow plan execution is not threadsafe

Arrow plan build and execution in feature compute are not thread-safe and may lead to crashes under concurrent requests.

[demo] Update HuggingFace model export

Add a general export specification and NLP/CV examples for HuggingFace pre-trained model inference.

[demo] DeepFM

Add DeepFM to CTR Demo and give benchmark on MovieLens and Criteo datasets.

25m uid mid:
Train AUC: 0.8908 Test AUC: 0.8359
1m uid mid:
Train AUC: 0.8891 Test AUC: 0.8658
criteo_d5:
Train AUC: 0.7531 Test AUC: 0.7271

[serving] k8s deployment with helm chart

Kubernetes deployment with helm chart:

serving-chart
movielens-chart

[serving] Python preprocessor uses stdout to pass input_names and output_names back

Currently, the Python preprocessor uses stdout to pass input_names and output_names back to metaspore-serving-bin, which will make exceptions thrown in the Python preprocessor invisible in the log of metaspore-serving-bin. Another mechanism is needed.

[demo] Update Multimodal Retrieval Demo

Update the multimodal retrieval demo docs and add a reference to the guide for the online part.

[training] implement AdamW updater

[movielens demo] fix the fg.yaml error

Please fix the fg.yaml error, screen shot:

[serving][feature] Supports loading different kinds of models

Serving currently loads only tabular models. We need to support loading ONNX models with tensors as inputs.

[algos][demo] Implement semantic retrieval models

Implement dense/semantic retrieval dual-encoder models based on offline and online negative sampling strategies.

[algos][demo] Implement Multi-Task ESMM model

Implement ESMM model to estimate CVR:

Implement ESMM model
Add Alibaba CPP dataset to test the model performance
Test AUC: 0.6296, Test CTR AUC: 0.5731, Test CVR AUC: 0.6429
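For orientation, the core idea of ESMM is that the conversion head is trained over the whole impression space through the product pCTCVR = pCTR * pCVR. The sketch below shows that relation and the two-term loss for a single impression. The function names are hypothetical, and the two probabilities are assumed to come from the CTR and CVR towers, which are not shown.

```python
import math


def bce(p, y, eps=1e-7):
    """Binary cross-entropy for one prediction, clipped for numerical safety."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def esmm_outputs(p_ctr, p_cvr):
    """Combine the two tower outputs; CTCVR is the product of CTR and CVR."""
    return {"ctr": p_ctr, "cvr": p_cvr, "ctcvr": p_ctr * p_cvr}


def esmm_loss(p_ctr, p_cvr, click, conversion):
    """CTR loss plus CTCVR loss, both defined on every impression."""
    out = esmm_outputs(p_ctr, p_cvr)
    return bce(out["ctr"], click) + bce(out["ctcvr"], click * conversion)


out = esmm_outputs(0.1, 0.2)
loss = esmm_loss(0.1, 0.2, click=1, conversion=0)
```

The CVR head never receives a direct loss term; it is learned only through the CTCVR product, which is what lets it use non-clicked impressions.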

[training] Support kubeflow recurring run and model export to consul

Support wrapping recurring run schedule time parameters and training components' parameters
Help fill experiment specific parameters for Estimator
Decouple experiment, scheduling and publish from model runners

[demo] PNN

Add PNN to CTR Demo and give benchmark on MovieLens and Criteo datasets.

iPNN
criteo_d5:
Train AUC: 0.7544 Test AUC: 0.7292
1m uid mid:
Train AUC: 0.8914 Test AUC: 0.8649
25m uid mid:
Train AUC: 0.8916 Test AUC: 0.8362

oPNN
criteo_d5:
Train AUC: 0.7533 Test AUC: 0.7287
1m uid mid:
Train AUC: 0.8896 Test AUC: 0.8633
25m uid mid:
Train AUC: 0.8905 Test AUC: 0.8353

[demo] multi-model retrieval online demo

Build one online pipeline for multi-model retrieval scenarios.

[demo] Wide&Deep

Add Wide&Deep to CTR Demo and give benchmark on MovieLens and Criteo datasets.

25m uid mid:
Train AUC: 0.8898 Test AUC: 0.8343
1m uid mid:
Train AUC: 0.8937 Test AUC: 0.8682
criteo_d5:
Train AUC: 0.7394 Test AUC: 0.7294

[train][serving] Add docker build and release files

[demo] Creating demo notebooks for MetaSpore usage in Alpha IDE.

Create demo notebooks for MetaSpore usage in Alpha IDE. Users can run it on MovieLens-1M dataset.

[deployment] support creating jupyter/code-server images

[demo] Unify data processing for demo projects

Unify the data processing for MovieLens-1M, MovieLens-25M, Criteo-5D and other datasets, including feature generation, match dataset generation, ranking dataset generation, negative sampling, etc.

[demo] QA Multimodal Retrieval Demo

QA is a text-to-text semantic retrieval demo based on the 1M Baike-Question-Answer database.

The demo includes the following parts:

online system: an end-to-end online retrieval service.
offline system: model training and export, data fetch and index.

[training] Support kubeflow pipeline build

Refactor the code organization with separate algo, runner, component and pipeline definitions.
Automatically export kubeflow components for the built-in algo runners, and provide a Python decorator for custom use.
Load components by name to construct a kubeflow pipeline and upload it automatically.

[demo] Add Kolmogorov–Smirnov Test metric

Add Kolmogorov–Smirnov Test metric:

A KS-Test utility to calculate the metric.
A notebook showing how to get the KS-Test statistic and how to plot the KS-Test curve.
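For a binary classifier, the KS statistic is commonly computed as the largest gap between the cumulative true-positive rate and the cumulative false-positive rate as the score threshold is lowered. The following is a minimal sketch under that definition. The function name `ks_statistic` is hypothetical and does not refer to the utility added in this issue.

```python
def ks_statistic(labels, scores):
    """Maximum |TPR - FPR| over all score thresholds (labels are 0 or 1)."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")

    pairs = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    tp = fp = 0
    best = 0.0
    i = 0
    while i < len(pairs):
        # Consume all samples that share the same score before measuring the gap.
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        best = max(best, abs(tp / n_pos - fp / n_neg))
        i = j
    return best


# Toy scores, made up for illustration only.
ks = ks_statistic([1, 1, 0, 1, 0, 0], [0.9, 0.8, 0.7, 0.6, 0.4, 0.2])
```

Recording the pair (TPR, FPR) at each threshold instead of only the maximum gap gives the two curves usually drawn in a KS plot.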

[demo] Use ms.nn.Normalization instead of torch.nn.BatchNorm1d

Use ms.nn.Normalization instead of torch.nn.BatchNorm1d:
In a parameter server architecture, we should use ms.nn.Normalization, which already handles the global exponentially weighted moving average.

[demo] AutoInt

Add AutoInt to CTR Demo and give benchmark on MovieLens and Criteo datasets.

AutoInt
criteo_d5:
Train AUC: 0.7558 Test AUC: 0.7361
1m uid mid:
Train AUC: 0.9028 Test AUC: 0.8741
25m uid mid:
Train AUC: 0.8968 Test AUC: 0.8421

[algos][demo] Implement Multi-Task MMoE model

Implement Multi-Task MMoE model:

MMoE net.
MMoE training pipeline (demo).
Preprocessing of Census dataset.

[deployment] support watching consul and notifying loading model

[serving] Check model load type according to directory structure

metaspore-serving-bin tries each model load type sequentially, which will leave extra error logs. A better way would be checking model load type according to directory structure in advance.

[algos][demo] Implement models for loan overdue rate estimation.

We will use a dataset to train a LightGBM model for loan overdue rate estimation.

[demo] xDeepFM

Add xDeepFM to CTR Demo and give benchmark on MovieLens and Criteo datasets.

xDeepFM
criteo_d5:
Train AUC: 0.7541 Test AUC: 0.7300
1m uid mid:
Train AUC: 0.8892 Test AUC: 0.8641
25m uid mid:
Train AUC: 0.8911 Test AUC: 0.8367

[serving] IPC framework between cpp and python

Goal

To provide a framework for calling Python methods in user custom scripts. Python code is executed in a separate process rather than embedded in the cpp process. The CPython interpreters run on a per-thread basis to avoid GIL contention.

Design

Control plane via gRPC and unix domain socket between cpp and python;
Data plane via gRPC for small data and shared memory for large data;
Model packaged with customized python venv and user scripts;
For each cpp compute thread, run a CPython interpreter process with user entry script;
Provide an async iterator style interface for python.
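To make the last design point more concrete, here is one way an async-iterator style interface could look from the Python side. Everything in it is an assumption for illustration: the names `request_stream` and `serve`, the `None` sentinel, and the in-process `asyncio.Queue` that stands in for the gRPC control plane are not part of the actual design.

```python
import asyncio


async def request_stream(queue):
    """Yield requests until a None sentinel arrives (stand-in for the control plane)."""
    while True:
        request = await queue.get()
        if request is None:
            return
        yield request


async def serve(queue, handler):
    """Run the user handler on every request delivered by the stream."""
    results = []
    async for request in request_stream(queue):
        results.append(handler(request))
    return results


async def main():
    queue = asyncio.Queue()
    # In the real system these would come from the cpp process.
    for payload in ({"x": 1}, {"x": 2}, None):
        queue.put_nowait(payload)
    return await serve(queue, lambda req: {"y": req["x"] * 10})


results = asyncio.run(main())
```

The user script only sees an `async for` loop over requests, so the transport (gRPC or shared memory) can change without touching user code.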