
ConGen

Implementation of ConGen: Unsupervised Control and Generalization Distillation For Sentence Representation (Findings of EMNLP 2022).

Citation

@inproceedings{limkonchotiwat-etal-2022-congen,
    title = "{ConGen}: Unsupervised Control and Generalization Distillation For Sentence Representation",
    author = "Limkonchotiwat, Peerat  and
      Ponwitayarat, Wuttikorn  and
      Lowphansirikul, Lalita and
      Udomcharoenchaikit, Can  and
      Chuangsuwanich, Ekapol  and
      Nutanong, Sarana",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2022",
    year = "2022",
    publisher = "Association for Computational Linguistics",
}

Announcement (2023)

  • We have released a new version of ConGen: SCT (published at TACL 2023).
  • SCT outperforms ConGen in distillation settings.
  • SCT is also effective for training a small model to learn sentence embeddings without a teacher model!

Installation

git clone https://github.com/KornWtp/ConGen.git
cd ConGen
pip install -e .

Our models (Small to Large)

Usage

Training data

We use the training data from the BSL paper: a monolingual version and a multilingual version.

Development data

We use the STS-B development set from Sentence-Transformers.

Parameters

The full model parameters:

| Models | Teacher Temp | Student Temp | Queue Size | Learning Rate |
|---|---|---|---|---|
| BERT-Tiny | 0.05 | 0.05 | 16384 | 5e-4 |
| BERT-Mini | 0.05 | 0.07 | 16384 | 3e-4 |
| Tiny-BERT-L4 | 0.05 | 0.05 | 65536 | 1e-4 |
| MiniLM-L3 | 0.05 | 0.07 | 16384 | 5e-4 |
| MiniLM-L6 | 0.05 | 0.07 | 65536 | 3e-4 |
| BERT-Small | 0.05 | 0.07 | 65536 | 3e-4 |
| MiniLM-L12 | 0.05 | 0.07 | 16384 | 5e-5 |
| Tiny-BERT-L6 | 0.05 | 0.07 | 65536 | 5e-5 |
| BERT-base | 0.05 | 0.07 | 65536 | 5e-5 |
| RoBERTa-base | 0.1 | 0.1 | 1024 | 5e-5 |
| Multilingual-DistilBERT | 0.05 | 0.07 | 65536 | 3e-4 |
| Multilingual-MiniLM-L12 | 0.05 | 0.07 | 65536 | 3e-4 |

Train your own model

Please set the model parameters before training:

>> bash train_congen.sh

To tune the model hyperparameters, we search over the following grids:

learning_rate_all=(3e-4 5e-4 1e-4 3e-5 5e-5 1e-5)
queue_sizes=(262144 131072 65536 16384 1024)
teacher_temps=(0.01 0.03 0.05 0.07 0.09 0.1)
student_temps=(0.01 0.03 0.05 0.07 0.09 0.1)
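The grids above expand into one training run per combination. A minimal Python sketch of how such a sweep can be driven (the `--learning_rate`, `--queue_size`, `--teacher_temp`, and `--student_temp` flag names are illustrative assumptions; check `train_congen.sh` for the arguments it actually accepts):

```python
from itertools import product

# Hyperparameter grids, same values as the shell arrays above.
learning_rates = ["3e-4", "5e-4", "1e-4", "3e-5", "5e-5", "1e-5"]
queue_sizes = [262144, 131072, 65536, 16384, 1024]
teacher_temps = [0.01, 0.03, 0.05, 0.07, 0.09, 0.1]
student_temps = [0.01, 0.03, 0.05, 0.07, 0.09, 0.1]

def grid_commands():
    """Yield one train_congen.sh invocation per hyperparameter combination.

    Flag names below are hypothetical; adapt them to the actual
    arguments expected by train_congen.sh.
    """
    for lr, q, tt, st in product(learning_rates, queue_sizes,
                                 teacher_temps, student_temps):
        yield (f"bash train_congen.sh --learning_rate {lr} --queue_size {q} "
               f"--teacher_temp {tt} --student_temp {st}")

commands = list(grid_commands())
print(len(commands))  # 6 * 5 * 6 * 6 = 1080 runs
```

Note that a full sweep of this size is expensive; in practice one would fix the best-known values from the parameter table above and vary one grid at a time.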

Evaluation

Our evaluation code for sentence embeddings is based on a modified version of SentEval and SimCSE.

Before evaluation, please download the evaluation datasets by running

cd SentEval/data/downstream/
bash download_dataset.sh

Evaluation - Notebook

Please see https://github.com/KornWtp/ConGen/tree/main/notebook

Evaluation - Python

Then, back in the root directory, you can evaluate any Sentence-Transformers model using the SimCSE evaluation code. For example:

python evaluation.py \
    --model_name_or_path "your-model-path" \
    --task_set sts \
    --mode test

Main results - STS

In our paper, we report scores averaged over three models, as follows:

Semantic Textual Similarity (STS) average scores. Teacher (SimCSE-Unsup-RoBERTa-large): 78.90.

| Methods | BERT-Tiny | BERT-Mini | Tiny-BERT-L4 | MiniLM-L3 | MiniLM-L6 | BERT-Small | MiniLM-L12 | Tiny-BERT-L6 | BERT-Base | RoBERTa-Base |
|---|---|---|---|---|---|---|---|---|---|---|
| #Param (M) | 4 | 11 | 14 | 17 | 22 | 29 | 33 | 67 | 109 | 125 |
| Finetuning-based | | | | | | | | | | |
| Sup-SimCSE | 72.35 | 76.52 | 78.19 | 76.49 | 78.86 | 78.59 | 80.48 | 81.23 | 81.57 | 82.52 |
| Unsup-SimCSE | 64.47 | 65.94 | 67.91 | 55.10 | 59.15 | 69.13 | 67.90 | 73.67 | 76.25 | 77.10 |
| Distillation-based | | | | | | | | | | |
| L2 | 73.32 | 76.07 | 77.03 | 76.66 | 77.51 | 77.30 | 78.79 | 78.95 | 78.97 | 79.00 |
| Making | 70.76 | 74.42 | 76.39 | 75.34 | 74.74 | 76.92 | 76.91 | 78.67 | 78.07 | 79.06 |
| SKD | 68.83 | 72.02 | 73.05 | 72.66 | 73.59 | 75.06 | 74.58 | 77.62 | 78.05 | 77.44 |
| CKD | 76.19 | 76.59 | 77.48 | 77.14 | 77.90 | 76.97 | 77.92 | 78.29 | 78.54 | 78.34 |
| Our proposed method | | | | | | | | | | |
| ConGen | 76.85 | 78.09 | 78.54 | 78.22 | 79.10 | 78.91 | 79.68 | 79.73 | 80.06 | 79.78 |

Full results

| Models | STS-12 | STS-13 | STS-14 | STS-15 | STS-16 | STS-B | SICK-R | Avg. |
|---|---|---|---|---|---|---|---|---|
| BERT-Tiny | 72.18 | 81.12 | 75.45 | 83.22 | 77.89 | 79.03 | 69.05 | 76.85 |
| BERT-Mini | 74.17 | 82.69 | 76.58 | 84.30 | 78.23 | 80.84 | 69.82 | 78.09 |
| Tiny-BERT-L4 | 74.30 | 83.07 | 77.37 | 84.70 | 79.06 | 80.99 | 70.26 | 78.54 |
| MiniLM-L3 | 74.00 | 82.93 | 76.58 | 84.35 | 78.57 | 81.00 | 70.09 | 78.22 |
| MiniLM-L6 | 75.06 | 83.86 | 77.29 | 85.01 | 79.67 | 81.92 | 70.89 | 79.10 |
| BERT-Small | 74.50 | 83.58 | 77.29 | 84.83 | 79.72 | 81.93 | 70.55 | 78.91 |
| MiniLM-L12 | 75.25 | 84.61 | 78.27 | 85.51 | 80.52 | 82.32 | 71.32 | 79.68 |
| Tiny-BERT-L6 | 75.53 | 84.76 | 78.33 | 85.72 | 80.42 | 82.25 | 71.12 | 79.73 |
| BERT-base | 75.58 | 85.13 | 78.54 | 85.75 | 81.12 | 82.81 | 71.47 | 80.06 |
| RoBERTa-base | 75.32 | 84.56 | 77.26 | 85.33 | 81.34 | 82.67 | 72.00 | 79.78 |

We also provide Thai sentence embedding models trained with ConGen!

Hyper-Parameters

| Parameters | Models | Teacher Temp | Student Temp | Queue Size | Learning Rate |
|---|---|---|---|---|---|
| <30M | ConGen-WangchanBERT-Tiny | 0.01 | 0.01 | 65536 | 3e-4 |
| <30M | ConGen-WangchanBERT-Small | 0.05 | 0.09 | 65536 | 5e-4 |
| >100M | ConGen-simcse-model-roberta-base-thai | 0.05 | 0.03 | 65536 | 3e-4 |
| >100M | ConGen-paraphrase-multilingual-mpnet-base-v2 | 0.05 | 0.05 | 262144 | 1e-4 |

Thai semantic textual similarity benchmark

| Parameters | Models | Spearman's Correlation (×100) |
|---|---|---|
| <30M | ConGen-WangchanBERT-Tiny | 66.43 |
| <30M | ConGen-WangchanBERT-Small | 70.65 |
| >100M | ConGen-simcse-model-roberta-base-thai | 66.21 |
| >100M | ConGen-paraphrase-multilingual-mpnet-base-v2 | 76.56 |

Thai transfer benchmark

Wisesight

| Parameters | Models | Acc (×100) | F1 (×100, weighted) |
|---|---|---|---|
| <30M | ConGen-WangchanBERT-Tiny | 61.55 | 62.19 |
| <30M | ConGen-WangchanBERT-Small | 64.77 | 65.30 |
| >100M | ConGen-simcse-model-roberta-base-thai | 65.07 | 65.28 |
| >100M | ConGen-paraphrase-multilingual-mpnet-base-v2 | 67.84 | 68.31 |

Wongnai

| Parameters | Models | Acc (×100) | F1 (×100, weighted) |
|---|---|---|---|
| <30M | ConGen-WangchanBERT-Tiny | 42.67 | 44.78 |
| <30M | ConGen-WangchanBERT-Small | 43.38 | 45.99 |
| >100M | ConGen-simcse-model-roberta-base-thai | 41.32 | 41.57 |
| >100M | ConGen-paraphrase-multilingual-mpnet-base-v2 | 47.22 | 48.63 |

Generated Review

| Parameters | Models | Acc (×100) | F1 (×100, weighted) |
|---|---|---|---|
| <30M | ConGen-WangchanBERT-Tiny | 54.26 | 52.69 |
| <30M | ConGen-WangchanBERT-Small | 58.22 | 57.03 |
| >100M | ConGen-simcse-model-roberta-base-thai | 49.81 | 47.94 |
| >100M | ConGen-paraphrase-multilingual-mpnet-base-v2 | 58.00 | 56.80 |
