
ConGen

Implementation of ConGen: Unsupervised Control and Generalization Distillation For Sentence Representation (Findings of EMNLP 2022).

Citation

@inproceedings{limkonchotiwat-etal-2022-congen,
    title = "{ConGen}: Unsupervised Control and Generalization Distillation For Sentence Representation",
    author = "Limkonchotiwat, Peerat  and
      Ponwitayarat, Wuttikorn  and
      Lowphansirikul, Lalita and
      Udomcharoenchaikit, Can  and
      Chuangsuwanich, Ekapol  and
      Nutanong, Sarana",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2022",
    year = "2022",
    publisher = "Association for Computational Linguistics",
}

Announcement (2023)

  • We have released a new version of ConGen: SCT (published at TACL 2023).
  • SCT outperforms ConGen in distillation settings.
  • SCT is also effective for training a small model to learn sentence embeddings without a teacher model!

Installation

git clone https://github.com/KornWtp/ConGen.git
cd ConGen
pip install -e .

Our models (Small to Large)

Usage

Training data

We use the training data from the BSL paper: a monolingual version and a multilingual version.

Development data

We use the STS-B development set from Sentence-Transformers.

Parameters

The full model parameters:

| Models | Teacher Temp | Student Temp | Queue Size | Learning Rate |
|---|---|---|---|---|
| BERT-Tiny | 0.05 | 0.05 | 16384 | 5e-4 |
| BERT-Mini | 0.05 | 0.07 | 16384 | 3e-4 |
| Tiny-BERT-L4 | 0.05 | 0.05 | 65536 | 1e-4 |
| MiniLM-L3 | 0.05 | 0.07 | 16384 | 5e-4 |
| MiniLM-L6 | 0.05 | 0.07 | 65536 | 3e-4 |
| BERT-Small | 0.05 | 0.07 | 65536 | 3e-4 |
| MiniLM-L12 | 0.05 | 0.07 | 16384 | 5e-5 |
| Tiny-BERT-L6 | 0.05 | 0.07 | 65536 | 5e-5 |
| BERT-base | 0.05 | 0.07 | 65536 | 5e-5 |
| RoBERTa-base | 0.1 | 0.1 | 1024 | 5e-5 |
| Multilingual-DistilBERT | 0.05 | 0.07 | 65536 | 3e-4 |
| Multilingual-MiniLM-L12 | 0.05 | 0.07 | 65536 | 3e-4 |

Train your own model

Please set the model parameters before training:

>> bash train_congen.sh

To tune the model hyperparameters, we search over the following grids:

learning_rate_all=(3e-4 5e-4 1e-4 3e-5 5e-5 1e-5)
queue_sizes=(262144 131072 65536 16384 1024)
teacher_temps=(0.01 0.03 0.05 0.07 0.09 0.1)
student_temps=(0.01 0.03 0.05 0.07 0.09 0.1)
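The grids above expand into one training run per combination. A minimal Python sketch of how such a sweep can be driven (the `--learning_rate`, `--queue_size`, `--teacher_temp`, and `--student_temp` flag names are illustrative assumptions; check `train_congen.sh` for the arguments it actually accepts):

```python
from itertools import product

# Hyperparameter grids, same values as the shell arrays above.
learning_rates = ["3e-4", "5e-4", "1e-4", "3e-5", "5e-5", "1e-5"]
queue_sizes = [262144, 131072, 65536, 16384, 1024]
teacher_temps = [0.01, 0.03, 0.05, 0.07, 0.09, 0.1]
student_temps = [0.01, 0.03, 0.05, 0.07, 0.09, 0.1]

def grid_commands():
    """Yield one train_congen.sh invocation per hyperparameter combination.

    Flag names below are hypothetical; adapt them to the actual
    arguments expected by train_congen.sh.
    """
    for lr, q, tt, st in product(learning_rates, queue_sizes,
                                 teacher_temps, student_temps):
        yield (f"bash train_congen.sh --learning_rate {lr} --queue_size {q} "
               f"--teacher_temp {tt} --student_temp {st}")

commands = list(grid_commands())
print(len(commands))  # 6 * 5 * 6 * 6 = 1080 runs
```

Note that a full sweep of this size is expensive; in practice one would fix the best-known values from the parameter table above and vary one grid at a time.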

Evaluation

Our evaluation code for sentence embeddings is based on a modified version of SentEval and SimCSE.

Before evaluation, please download the evaluation datasets by running

cd SentEval/data/downstream/
bash download_dataset.sh

Evaluation - Notebook

Please see https://github.com/KornWtp/ConGen/tree/main/notebook

Evaluation - Python

Then, back in the root directory, you can evaluate any Sentence-Transformers model using the SimCSE evaluation code. For example:

python evaluation.py \
    --model_name_or_path "your-model-path" \
    --task_set sts \
    --mode test

Main results - STS

In our paper, we report scores averaged over three models, as follows:

Semantic Textual Similarity (STS) average scores. Teacher (SimCSE-Unsup-RoBERTa-large): 78.90.

| Methods | BERT-Tiny | BERT-Mini | Tiny-BERT-L4 | MiniLM-L3 | MiniLM-L6 | BERT-Small | MiniLM-L12 | Tiny-BERT-L6 | BERT-Base | RoBERTa-Base |
|---|---|---|---|---|---|---|---|---|---|---|
| #Param (M) | 4 | 11 | 14 | 17 | 22 | 29 | 33 | 67 | 109 | 125 |
| Finetuning-based | | | | | | | | | | |
| Sup-SimCSE | 72.35 | 76.52 | 78.19 | 76.49 | 78.86 | 78.59 | 80.48 | 81.23 | 81.57 | 82.52 |
| Unsup-SimCSE | 64.47 | 65.94 | 67.91 | 55.10 | 59.15 | 69.13 | 67.90 | 73.67 | 76.25 | 77.10 |
| Distillation-based | | | | | | | | | | |
| L2 | 73.32 | 76.07 | 77.03 | 76.66 | 77.51 | 77.30 | 78.79 | 78.95 | 78.97 | 79.00 |
| Making | 70.76 | 74.42 | 76.39 | 75.34 | 74.74 | 76.92 | 76.91 | 78.67 | 78.07 | 79.06 |
| SKD | 68.83 | 72.02 | 73.05 | 72.66 | 73.59 | 75.06 | 74.58 | 77.62 | 78.05 | 77.44 |
| CKD | 76.19 | 76.59 | 77.48 | 77.14 | 77.90 | 76.97 | 77.92 | 78.29 | 78.54 | 78.34 |
| Our proposed method | | | | | | | | | | |
| ConGen | 76.85 | 78.09 | 78.54 | 78.22 | 79.10 | 78.91 | 79.68 | 79.73 | 80.06 | 79.78 |

Full results

| Models | STS-12 | STS-13 | STS-14 | STS-15 | STS-16 | STS-B | SICK-R | Avg. |
|---|---|---|---|---|---|---|---|---|
| BERT-Tiny | 72.18 | 81.12 | 75.45 | 83.22 | 77.89 | 79.03 | 69.05 | 76.85 |
| BERT-Mini | 74.17 | 82.69 | 76.58 | 84.30 | 78.23 | 80.84 | 69.82 | 78.09 |
| Tiny-BERT-L4 | 74.30 | 83.07 | 77.37 | 84.70 | 79.06 | 80.99 | 70.26 | 78.54 |
| MiniLM-L3 | 74.00 | 82.93 | 76.58 | 84.35 | 78.57 | 81.00 | 70.09 | 78.22 |
| MiniLM-L6 | 75.06 | 83.86 | 77.29 | 85.01 | 79.67 | 81.92 | 70.89 | 79.10 |
| BERT-Small | 74.50 | 83.58 | 77.29 | 84.83 | 79.72 | 81.93 | 70.55 | 78.91 |
| MiniLM-L12 | 75.25 | 84.61 | 78.27 | 85.51 | 80.52 | 82.32 | 71.32 | 79.68 |
| Tiny-BERT-L6 | 75.53 | 84.76 | 78.33 | 85.72 | 80.42 | 82.25 | 71.12 | 79.73 |
| BERT-base | 75.58 | 85.13 | 78.54 | 85.75 | 81.12 | 82.81 | 71.47 | 80.06 |
| RoBERTa-base | 75.32 | 84.56 | 77.26 | 85.33 | 81.34 | 82.67 | 72.00 | 79.78 |

We also provide Thai sentence embedding models trained with ConGen!

Hyper-Parameters

| Parameters | Models | Teacher Temp | Student Temp | Queue Size | Learning Rate |
|---|---|---|---|---|---|
| <30M | ConGen-WangchanBERT-Tiny | 0.01 | 0.01 | 65536 | 3e-4 |
| <30M | ConGen-WangchanBERT-Small | 0.05 | 0.09 | 65536 | 5e-4 |
| >100M | ConGen-simcse-model-roberta-base-thai | 0.05 | 0.03 | 65536 | 3e-4 |
| >100M | ConGen-paraphrase-multilingual-mpnet-base-v2 | 0.05 | 0.05 | 262144 | 1e-4 |

Thai semantic textual similarity benchmark

| Parameters | Models | Spearman's Correlation (×100) |
|---|---|---|
| <30M | ConGen-WangchanBERT-Tiny | 66.43 |
| <30M | ConGen-WangchanBERT-Small | 70.65 |
| >100M | ConGen-simcse-model-roberta-base-thai | 66.21 |
| >100M | ConGen-paraphrase-multilingual-mpnet-base-v2 | 76.56 |

Thai transfer benchmark

Wisesight

| Parameters | Models | Acc (×100) | F1 (×100, weighted) |
|---|---|---|---|
| <30M | ConGen-WangchanBERT-Tiny | 61.55 | 62.19 |
| <30M | ConGen-WangchanBERT-Small | 64.77 | 65.30 |
| >100M | ConGen-simcse-model-roberta-base-thai | 65.07 | 65.28 |
| >100M | ConGen-paraphrase-multilingual-mpnet-base-v2 | 67.84 | 68.31 |

Wongnai

| Parameters | Models | Acc (×100) | F1 (×100, weighted) |
|---|---|---|---|
| <30M | ConGen-WangchanBERT-Tiny | 42.67 | 44.78 |
| <30M | ConGen-WangchanBERT-Small | 43.38 | 45.99 |
| >100M | ConGen-simcse-model-roberta-base-thai | 41.32 | 41.57 |
| >100M | ConGen-paraphrase-multilingual-mpnet-base-v2 | 47.22 | 48.63 |

Generated Review

| Parameters | Models | Acc (×100) | F1 (×100, weighted) |
|---|---|---|---|
| <30M | ConGen-WangchanBERT-Tiny | 54.26 | 52.69 |
| <30M | ConGen-WangchanBERT-Small | 58.22 | 57.03 |
| >100M | ConGen-simcse-model-roberta-base-thai | 49.81 | 47.94 |
| >100M | ConGen-paraphrase-multilingual-mpnet-base-v2 | 58.00 | 56.80 |
