
Introduction

SMRB is a benchmark for service mashup recommendation, built to address the lack of standardization in the field.

This project is built on the open-source project Lightning-Hydra-Template.

Motivation

Currently, deep-learning-based approaches to service mashup recommendation share common problems, including non-unified datasets, pre-trained models, evaluation protocols, and experiment environments. These issues make it difficult to evaluate the performance of models accurately and to reproduce them. SMRB provides a standard environment to enhance the comparability between models and the credibility of results.

Project Structure

The directory structure of the project looks like this:

│   .env.example                   <- Template of the file for storing private environment variables
│   .gitignore                     <- List of files/folders ignored by git
│   .pre-commit-config.yaml        <- Configuration of pre-commit hooks for code formatting
│   README.md
│   requirements.txt               <- File for installing python dependencies
│   setup.cfg                      <- Configuration of linters and pytest
│   test.py                        <- Run testing
│   train.py                       <- Run training
│
├───configs                        <- Hydra configuration files
│   │   test.yaml                     <- Main config for testing
│   │   train.yaml                    <- Main config for training
│   │
│   ├───callbacks                  <- Lightning callbacks
│   │       wandb.yaml                <- Wandb and metrics callbacks
│   │
│   ├───datamodule                 <- Datamodule configs
│   │       partial_text_bert.yaml    <- Partial text-based dataset embedded by BERT configs
│   │       partial_text_glove.yaml   <- Partial text-based dataset embedded by GloVe configs
│   │       partial_word_bert.yaml    <- Partial word-based dataset embedded by BERT configs
│   │       partial_word_glove.yaml   <- Partial word-based dataset embedded by GloVe configs
│   │       total_text_bert.yaml      <- Total text-based dataset embedded by BERT configs
│   │       total_text_glove.yaml     <- Total text-based dataset embedded by GloVe configs
│   │       total_word_bert.yaml      <- Total word-based dataset embedded by BERT configs
│   │       total_word_glove.yaml     <- Total word-based dataset embedded by GloVe configs
│   │
│   ├───experiment                 <- Experiment configs
│   │
│   ├───hparams_search             <- Hyperparameter search configs
│   │
│   ├───logger                     <- Logger configs
│   │
│   ├───log_dir                    <- Logging directory configs
│   │
│   ├───model                      <- Model configs
│   │
│   └───trainer                    <- Trainer configs
│
├───data                        <- Project data
│
├───logs                        <- Logs generated by Hydra and PyTorch Lightning loggers
│
├───src                         <- Source code
│   │   testing_pipeline.py
│   │   training_pipeline.py
│   │
│   ├───callbacks
│   │       wandb_callbacks.py
│   │
│   ├───datamodules             <- Lightning datamodules
│   │
│   ├───models                  <- Lightning models
│   │
│   ├───utils                   <- Utility scripts
│   │
│   └───vendor                  <- Third party code that cannot be installed using PIP/Conda
│
└───tests                       <- Tests of any kind
    │
    ├───helpers                    <- A couple of testing utilities
    │
    ├───shell                      <- Shell/command based tests
    │
    └───unit                       <- Unit tests

Installation

You can set up the environment with Anaconda or Docker, and then download the dataset.

Install with Anaconda

# clone project
git clone https://github.com/ssnowyu/SMRB
cd SMRB

# create conda environment
conda create -n myenv python=3.8
conda activate myenv

# install requirements
pip install -r requirements.txt

Install with Docker

You will need to install the NVIDIA Container Toolkit to enable GPU support.

# clone project
git clone https://github.com/ssnowyu/SMRB
cd SMRB

# build the container
docker build -t <project_name> .

# mount the project to the container
docker run -v $(pwd):/workspace/project --gpus all -it --rm <project_name>

Download the dataset

Download the dataset from here, unzip it, and move it to the data directory of the project.

Quickstart

  1. Implement your DL-based model as a LightningModule class. For details, refer to Model Implementation. Here the MLP model (pre-configured in this project) is used as an example.

  2. Write a configuration file called simple_model for your model (a sketch of the constructor this config maps to appears after this list).

     _target_: src.models.mlp.MLP
    
     data_dir: ${data_dir}/api_mashup
     api_embed_path: embeddings/partial/text_bert_api_embeddings.npy
     mashup_embed_channels: 300
     mlp_output_channels: 300
     lr: 0.001
     weight_decay: 0.00001
    
  3. Write a configuration file called mlp for your experiment.

     # @package _global_
    
     defaults:
         - override /trainer: default.yaml              # use default settings for trainer
         - override /model: mlp.yaml                    # use "mlp" as model
         - override /datamodule: partial_text_bert.yaml # use partial text-based embeddings encoded by BERT
         - override /callbacks: wandb.yaml              # use wandb as the callbacks
         - override /logger: wandb.yaml                 # use wandb as the log framework
    
     seed: 12345
    
     logger:
         wandb:
             name: 'MLP-partial-BERT'
             tags: ['partial', 'BERT']
    
     # Override model parameters
     model:
         api_embed_path: embeddings/partial/text_bert_api_embeddings.npy
         mashup_embed_channels: 768
         mlp_output_channels: 300
         lr: 0.001
    
  4. Since the project uses wandb as the log framework by default, you will need to have a wandb account and bind the account to the project by executing the following command.

    wandb login
    

    This command needs to be executed only once during the entire development process.

    If you do not want to use wandb, you can also choose another log framework. Please refer to LightningModule for how to change it.

  5. Run the project.

    python train.py experiment=mlp/partial_bert
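For reference, Hydra instantiates the class named in _target_ and passes the remaining keys of the model config as keyword arguments to its constructor. The sketch below shows what a constructor matching the step-2 config could look like; the exact signature of src.models.mlp.MLP may differ, so treat this only as an illustration of the config-to-constructor mapping.

    from pytorch_lightning import LightningModule


    class MLP(LightningModule):
        # Illustrative sketch: the parameter names mirror the keys of the
        # step-2 config, which Hydra passes in as keyword arguments.
        def __init__(
            self,
            data_dir: str,
            api_embed_path: str,
            mashup_embed_channels: int,
            mlp_output_channels: int,
            lr: float,
            weight_decay: float,
        ):
            super().__init__()
            self.save_hyperparameters()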
    

Guide

Choose dataset

We provide 8 datasets, as shown in the following table:

name                 form         pre-trained model   amount
partial_text_bert    text-based   BERT                partial
partial_text_glove   text-based   GloVe               partial
partial_word_bert    word-based   BERT                partial
partial_word_glove   word-based   GloVe               partial
total_text_bert      text-based   BERT                total
total_text_glove     text-based   GloVe               total
total_word_bert      word-based   BERT                total
total_word_glove     word-based   GloVe               total
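Within an experiment config, the dataset is selected by overriding the datamodule config group with the corresponding file name from configs/datamodule; the choice below is only an illustration.

    defaults:
        - override /datamodule: total_word_glove.yaml  # total, word-based, GloVe embeddings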

Two forms

  • $(72 \times d)$. The original form produced by the word embedding model, in which each word corresponds to a vector of size $(1 \times d)$. This form is suitable for word-based representation.
  • $(1 \times d)$. The $(72 \times d)$ representation is pooled by averaging over the word axis into a single $(1 \times d)$ vector that represents the whole text. This form is suitable for text-based representation.
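As a concrete illustration of the pooling step, the text-based form can be derived from the word-based form by averaging over the word axis. The snippet below is a minimal sketch using NumPy; the embedding size and the random placeholder array are assumptions for illustration only.

    import numpy as np

    d = 768                               # illustrative embedding size (e.g. BERT hidden size)
    word_based = np.random.rand(72, d)    # word-based form: one (1 x d) vector per word

    # Average-pool over the word axis to obtain the text-based form,
    # a single (1 x d) vector representing the whole description.
    text_based = word_based.mean(axis=0, keepdims=True)

    print(word_based.shape)   # (72, 768)
    print(text_based.shape)   # (1, 768)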

Two amounts

  • total: 21495 APIs, including some unused APIs.
  • partial: 932 APIs, all of which have been used at least once.

Two pre-trained models

  • BERT: Bidirectional Encoder Representations from Transformers.
  • GloVe: A global log-bilinear regression model for unsupervised learning of word representations.

Model implementation


You need to implement your DL-based model as a LightningModule class. What you need to do is:

  1. Return the loss in training_step to guide model training.
  2. Return the predicted results preds and the ground-truth results targets in test_step. The callback mechanism will then capture them and calculate the performance on each metric.
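A minimal skeleton of these two hooks is sketched below. It assumes a binary-relevance setup in which each batch provides mashup features and multi-hot API targets, and it assumes the callbacks read preds and targets from the dictionary returned by test_step; the layer sizes and loss function are placeholders, not the implementation of any pre-configured model.

    import torch
    from pytorch_lightning import LightningModule


    class MyModel(LightningModule):
        # Sketch only: shapes, layers, and the loss function are illustrative.
        def __init__(self, in_channels: int = 768, num_apis: int = 932, lr: float = 1e-3):
            super().__init__()
            self.net = torch.nn.Sequential(
                torch.nn.Linear(in_channels, 300),
                torch.nn.ReLU(),
                torch.nn.Linear(300, num_apis),
            )
            self.criterion = torch.nn.BCEWithLogitsLoss()
            self.lr = lr

        def training_step(self, batch, batch_idx):
            x, targets = batch
            logits = self.net(x)
            loss = self.criterion(logits, targets.float())
            return loss  # the returned loss drives optimization

        def test_step(self, batch, batch_idx):
            x, targets = batch
            preds = self.net(x)
            # the metric callbacks are assumed to pick these up from the step output
            return {"preds": preds, "targets": targets}

        def configure_optimizers(self):
            return torch.optim.Adam(self.parameters(), lr=self.lr)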

Experiment configuration

Experiments are organized by combining existing elements. Take simple-model as an example.

Pre-configured models

  1. MTFM: This method uses GloVe as the pre-trained model and extracts semantic features from the mashup description with a CNN. The obtained features are then processed by a feature interaction component to extract interaction features with the Web API embedding vectors. Finally, the semantic and interaction features are used to predict the candidate Web APIs for the mashup description.
  2. coACN: This method classifies the Web APIs using category information to obtain service domains and embeds the service domain into the mashup embedding vector through a domain-level attention unit. It then constructs a service combination graph from the invocation relationships between mashups and Web APIs and extracts collaborative relationships using LightGCN. Finally, the probability of invocation between a mashup and an API is predicted by an MLP.
  3. MISR: This method uses a CNN to extract semantic features from the description, node2vec to obtain implicit neighbor interaction features from the mashup-service invocation matrix, and an MLP to obtain direct neighbor interaction features from the mashup-service invocation vector. Finally, the three features are fused, and the result is predicted using an MLP. It is worth noting that the three feature extraction modules can be trained separately and then combined and fine-tuned on the real data to obtain the final complete model. In this case, we remove the Neighbor Interaction Component of MISR to avoid potential data leakage caused by node2vec encoding being based on all invocations between mashups and APIs.
  4. FISR: The model consists of three layers. The Feature Extraction Layer obtains the feature representations of the target mashup, the selected APIs, and the next API to be recommended based on the text descriptions of mashups and APIs. The Interaction Layer learns the relevance weight between each selected and candidate service using an attention mechanism. The Output Layer calculates the probability of recommending the candidate service to the target mashup in the next round.
  5. T2L2: This method consists of only three linear layers. The first two linear layers align the representation space of mashups and Web APIs by bridging the semantic gap between mashups and Web APIs. The third linear layer calculates the matching scores of mashups and Web APIs. The second of these linear layers is part of the propagation component, which is used to incorporate mashup information into the representation vectors of the Web APIs.
  6. T2L2-W/O-Propagation: This method is based on a modification of T2L2, which requires the training data to be ordered chronologically. Since our training data is disordered, the propagation component of T2L2 may propagate inaccurate information into the APIs' representation vector, affecting the model's outcomes. Therefore, we remove the propagation component from T2L2 and obtain T2L2-W/O-Propagation.
  7. MLP: A multilayer perceptron (MLP) is a class of feedforward artificial neural network (ANN) that consists of fully connected layers and activation functions. It is a very simple and commonly used neural network structure. We include it as a baseline for comparison with the other models. It processes the features of the mashup and the Web API with an MLP of two linear layers to predict the matching scores.
  8. Freq: This method always recommends the top $N$ frequently invoked Web APIs.
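For intuition, the Freq baseline (item 8 above) can be written in a few lines. The sketch below assumes the training invocations are available as (mashup, api) pairs; the function and data names are illustrative.

    from collections import Counter

    def freq_recommend(train_invocations, n=5):
        # Recommend the same top-n most frequently invoked APIs to every mashup.
        counts = Counter(api for _, api in train_invocations)
        return [api for api, _ in counts.most_common(n)]

    # illustrative usage
    pairs = [("m1", "maps"), ("m2", "maps"), ("m2", "twitter"), ("m3", "maps")]
    print(freq_recommend(pairs, n=2))  # ['maps', 'twitter']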
