
Introduction

SMRB is a benchmark for service mashup recommendation, built to address the lack of standardization in the field.

This project is built on the open-source project Lightning-Hydra-Template.

Motivation

Currently, deep-learning-based approaches to service mashup recommendation share common problems, including non-unified datasets, pre-trained models, evaluation protocols, and experiment environments. These issues make it difficult to evaluate the performance of models accurately and to reproduce them. SMRB provides a standard environment to enhance the comparability between models and the credibility of results.

Project Structure

The directory structure of the project looks like this:

│   .env.example                   <- Template of the file for storing private environment variables
│   .gitignore                     <- List of files/folders ignored by git
│   .pre-commit-config.yaml        <- Configuration of pre-commit hooks for code formatting
│   README.md
│   requirements.txt               <- File for installing python dependencies
│   setup.cfg                      <- Configuration of linters and pytest
│   test.py                        <- Run testing
│   train.py                       <- Run training
│
├───configs                        <- Hydra configuration files
│   │   test.yaml                     <- Main config for testing
│   │   train.yaml                    <- Main config for training
│   │
│   ├───callbacks                  <- Lightning callbacks
│   │       wandb.yaml                <- Wandb and metrics callbacks
│   │
│   ├───datamodule                 <- Datamodule configs
│   │       partial_text_bert.yaml    <- Partial text-based dataset embedded by BERT configs
│   │       partial_text_glove.yaml   <- Partial text-based dataset embedded by GloVe configs
│   │       partial_word_bert.yaml    <- Partial word-based dataset embedded by BERT configs
│   │       partial_word_glove.yaml   <- Partial word-based dataset embedded by GloVe configs
│   │       total_text_bert.yaml      <- Total text-based dataset embedded by BERT configs
│   │       total_text_glove.yaml     <- Total text-based dataset embedded by GloVe configs
│   │       total_word_bert.yaml      <- Total word-based dataset embedded by BERT configs
│   │       total_word_glove.yaml     <- Total word-based dataset embedded by GloVe configs
│   │
│   ├───experiment                 <- Experiment configs
│   │
│   ├───hparams_search             <- Hyperparameter search configs
│   │
│   ├───logger                     <- Logger configs
│   │
│   ├───log_dir                    <- Logging directory configs
│   │
│   ├───model                      <- Model configs
│   │
│   └───trainer                    <- Trainer configs
│
├───data                        <- Project data
│
├───logs                        <- Logs generated by Hydra and PyTorch Lightning loggers
│
├───src                         <- Source code
│   │   testing_pipeline.py
│   │   training_pipeline.py
│   │
│   ├───callbacks
│   │       wandb_callbacks.py
│   │
│   ├───datamodules             <- Lightning datamodules
│   │
│   ├───models                  <- Lightning models
│   │
│   ├───utils                   <- Utility scripts
│   │
│   └───vendor                  <- Third party code that cannot be installed using PIP/Conda
│
└───tests                       <- Tests of any kind
    │
    ├───helpers                    <- A couple of testing utilities
    │
    ├───shell                      <- Shell/command based tests
    │
    └───unit                       <- Unit tests

Installation

You can set up the environment with Anaconda or Docker, and then download the dataset.

Install with Anaconda

# clone project
git clone https://github.com/ssnowyu/SMRB
cd SMRB

# create conda environment
conda create -n myenv python=3.8
conda activate myenv

# install requirements
pip install -r requirements.txt

Install with Docker

You will need to install the NVIDIA Container Toolkit to enable GPU support.

# clone project
git clone https://github.com/ssnowyu/SMRB
cd SMRB

# build the container
docker build -t <project_name> .

# mount the project to the container
docker run -v $(pwd):/workspace/project --gpus all -it --rm <project_name>

Download the dataset

Download the dataset from here, unzip it, and move it to the data directory of the project.

Quickstart

  1. Implement your DL-based model as a LightningModule class. For details, refer to Model Implementation. Here the MLP model (pre-configured in this project) is used as an example.

  2. Write a configuration file called simple_model for your model (a sketch of the constructor this config maps to appears after this list).

     _target_: src.models.mlp.MLP
    
     data_dir: ${data_dir}/api_mashup
     api_embed_path: embeddings/partial/text_bert_api_embeddings.npy
     mashup_embed_channels: 300
     mlp_output_channels: 300
     lr: 0.001
     weight_decay: 0.00001
    
  3. Write a configuration file called mlp for your experiment.

     # @package _global_
    
     defaults:
         - override /trainer: default.yaml              # use default settings for trainer
         - override /model: mlp.yaml                    # use "mlp" as model
         - override /datamodule: partial_text_bert.yaml # use partial text-based embeddings encoded by BERT
         - override /callbacks: wandb.yaml              # use wandb as the callbacks
         - override /logger: wandb.yaml                 # use wandb as the log framework
    
     seed: 12345
    
     logger:
         wandb:
             name: 'MLP-partial-BERT'
             tags: ['partial', 'BERT']
    
     # Override model parameters
     model:
         api_embed_path: embeddings/partial/text_bert_api_embeddings.npy
         mashup_embed_channels: 768
         mlp_output_channels: 300
         lr: 0.001
    
  4. Since the project uses wandb as the log framework by default, you will need to have a wandb account and bind the account to the project by executing the following command.

    wandb login
    

    This command needs to be executed only once during the entire development process.

    If you do not want to use wandb, you can also choose another log framework. Please refer to LightningModule for how to change it.

  5. Run the project.

    python train.py experiment=mlp/partial_bert
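For reference, Hydra instantiates the class named in _target_ and passes the remaining keys of the model config as keyword arguments to its constructor. The sketch below shows what a constructor matching the step-2 config could look like; the exact signature of src.models.mlp.MLP may differ, so treat this only as an illustration of the config-to-constructor mapping.

    from pytorch_lightning import LightningModule


    class MLP(LightningModule):
        # Illustrative sketch: the parameter names mirror the keys of the
        # step-2 config, which Hydra passes in as keyword arguments.
        def __init__(
            self,
            data_dir: str,
            api_embed_path: str,
            mashup_embed_channels: int,
            mlp_output_channels: int,
            lr: float,
            weight_decay: float,
        ):
            super().__init__()
            self.save_hyperparameters()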
    

Guide

Choose dataset

We provide 8 datasets, as shown in the following table:

name                 form         pre-trained model   amount
partial_text_bert    text-based   BERT                partial
partial_text_glove   text-based   GloVe               partial
partial_word_bert    word-based   BERT                partial
partial_word_glove   word-based   GloVe               partial
total_text_bert      text-based   BERT                total
total_text_glove     text-based   GloVe               total
total_word_bert      word-based   BERT                total
total_word_glove     word-based   GloVe               total
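Within an experiment config, the dataset is selected by overriding the datamodule config group with the corresponding file name from configs/datamodule; the choice below is only an illustration.

    defaults:
        - override /datamodule: total_word_glove.yaml  # total, word-based, GloVe embeddings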

Two forms

  • $(72 \times d)$. The original form produced by the word embedding model, in which each word corresponds to a vector of size $(1 \times d)$. This form is suitable for word-based representation.
  • $(1 \times d)$. The $(72 \times d)$ representation is pooled by averaging over the word axis into a single $(1 \times d)$ vector that represents the whole text. This form is suitable for text-based representation.
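As a concrete illustration of the pooling step, the text-based form can be derived from the word-based form by averaging over the word axis. The snippet below is a minimal sketch using NumPy; the embedding size and the random placeholder array are assumptions for illustration only.

    import numpy as np

    d = 768                               # illustrative embedding size (e.g. BERT hidden size)
    word_based = np.random.rand(72, d)    # word-based form: one (1 x d) vector per word

    # Average-pool over the word axis to obtain the text-based form,
    # a single (1 x d) vector representing the whole description.
    text_based = word_based.mean(axis=0, keepdims=True)

    print(word_based.shape)   # (72, 768)
    print(text_based.shape)   # (1, 768)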

Two amounts

  • total: 21495 APIs, including some unused APIs.
  • partial: 932 APIs, all of which have been used at least once.

Two pre-trained models

  • BERT: Bidirectional Encoder Representations from Transformers.
  • GloVe: A global log-bilinear regression model for unsupervised learning of word representations.

Model implementation


You need to implement your DL-based model as a LightningModule class. What you need to do is:

  1. Return the loss in training_step to guide model training.
  2. Return the predicted results preds and the ground-truth results targets in test_step. The callback mechanism will then capture them and calculate the performance on each metric.
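A minimal skeleton of these two hooks is sketched below. It assumes a binary-relevance setup in which each batch provides mashup features and multi-hot API targets, and it assumes the callbacks read preds and targets from the dictionary returned by test_step; the layer sizes and loss function are placeholders, not the implementation of any pre-configured model.

    import torch
    from pytorch_lightning import LightningModule


    class MyModel(LightningModule):
        # Sketch only: shapes, layers, and the loss function are illustrative.
        def __init__(self, in_channels: int = 768, num_apis: int = 932, lr: float = 1e-3):
            super().__init__()
            self.net = torch.nn.Sequential(
                torch.nn.Linear(in_channels, 300),
                torch.nn.ReLU(),
                torch.nn.Linear(300, num_apis),
            )
            self.criterion = torch.nn.BCEWithLogitsLoss()
            self.lr = lr

        def training_step(self, batch, batch_idx):
            x, targets = batch
            logits = self.net(x)
            loss = self.criterion(logits, targets.float())
            return loss  # the returned loss drives optimization

        def test_step(self, batch, batch_idx):
            x, targets = batch
            preds = self.net(x)
            # the metric callbacks are assumed to pick these up from the step output
            return {"preds": preds, "targets": targets}

        def configure_optimizers(self):
            return torch.optim.Adam(self.parameters(), lr=self.lr)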

Experiment configuration

Experiments are organized by combining existing elements. Take simple-model as an example.

Pre-configured models

  1. MTFM: This method uses GloVe as the pre-trained model and extracts semantic features from the mashup description with a CNN. The obtained features are then processed by a feature interaction component to extract interaction features with the Web API embedding vectors. Finally, the semantic and interaction features are used to predict the candidate Web APIs for the mashup description.
  2. coACN: This method classifies the Web APIs using category information to obtain service domains and embeds the service domain into the mashup embedding vector through a domain-level attention unit. It then constructs a service combination graph from the invocation relationships between mashups and Web APIs and extracts collaborative relationships using LightGCN. Finally, the probability of invocation between a mashup and an API is predicted by an MLP.
  3. MISR: This method uses a CNN to extract semantic features from the description, node2vec to obtain implicit neighbor interaction features from the mashup-service invocation matrix, and an MLP to obtain direct neighbor interaction features from the mashup-service invocation vector. Finally, the three features are fused, and the result is predicted using an MLP. It is worth noting that the three feature extraction modules can be trained separately and then combined and fine-tuned on the real data to obtain the final complete model. In this case, we remove the Neighbor Interaction Component of MISR to avoid potential data leakage caused by node2vec encoding being based on all invocations between mashups and APIs.
  4. FISR: The model consists of three layers. The Feature Extraction Layer obtains the feature representations of the target mashup, the selected APIs, and the next API to be recommended based on the text descriptions of mashups and APIs. The Interaction Layer learns the relevance weight between each selected and candidate service using an attention mechanism. The Output Layer calculates the probability of recommending the candidate service to the target mashup in the next round.
  5. T2L2: This method consists of only three linear layers. The first two linear layers align the representation space of mashups and Web APIs by bridging the semantic gap between mashups and Web APIs. The third linear layer calculates the matching scores of mashups and Web APIs. The second of these linear layers is part of the propagation component, which is used to incorporate mashup information into the representation vectors of the Web APIs.
  6. T2L2-W/O-Propagation: This method is based on a modification of T2L2, which requires the training data to be ordered chronologically. Since our training data is disordered, the propagation component of T2L2 may propagate inaccurate information into the APIs' representation vector, affecting the model's outcomes. Therefore, we remove the propagation component from T2L2 and obtain T2L2-W/O-Propagation.
  7. MLP: A multilayer perceptron (MLP) is a class of feedforward artificial neural network (ANN) that consists of fully connected layers and activation functions. It is a very simple and commonly used neural network structure. We include it as a baseline for comparison with the other models. It processes the features of the mashup and the Web API with an MLP of two linear layers to predict the matching scores.
  8. Freq: This method always recommends the top $N$ frequently invoked Web APIs.
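For intuition, the Freq baseline (item 8 above) can be written in a few lines. The sketch below assumes the training invocations are available as (mashup, api) pairs; the function and data names are illustrative.

    from collections import Counter

    def freq_recommend(train_invocations, n=5):
        # Recommend the same top-n most frequently invoked APIs to every mashup.
        counts = Counter(api for _, api in train_invocations)
        return [api for api, _ in counts.most_common(n)]

    # illustrative usage
    pairs = [("m1", "maps"), ("m2", "maps"), ("m2", "twitter"), ("m3", "maps")]
    print(freq_recommend(pairs, n=2))  # ['maps', 'twitter']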
