Currently, master is a pre-release of v2.0 and is not available via PyPI.

Quick Installation

COMET requires python 3.8 or above!

Simple installation from PyPI

pip install --upgrade pip  # ensures that pip is current 
pip install unbabel-comet

To develop locally, run the following commands:

git clone https://github.com/Unbabel/COMET
cd COMET
pip install poetry
poetry install

For development, you can run the CLI tools directly, e.g.,

PYTHONPATH=. ./comet/cli/score.py

Scoring MT outputs:

CLI Usage:

Test examples:

echo -e "Dem Feuer konnte Einhalt geboten werden\nSchulen und Kindergärten wurden eröffnet." >> src.de
echo -e "The fire could be stopped\nSchools and kindergartens were open" >> hyp1.en
echo -e "The fire could have been stopped\nSchools and pre-school were open" >> hyp2.en
echo -e "They were able to control the fire.\nSchools and kindergartens opened" >> ref.en

Basic scoring command:

comet-score -s src.de -t hyp1.en -r ref.en

You can set the number of GPUs using --gpus (0 to test on CPU).

Scoring multiple systems:

comet-score -s src.de -t hyp1.en hyp2.en -r ref.en

WMT test sets via SacreBLEU:

comet-score -d wmt22:en-de -t PATH/TO/TRANSLATIONS

If you are only interested in a system-level score, use the following command:

comet-score -s src.de -t hyp1.en -r ref.en --quiet --only_system

Reference-free evaluation:

comet-score -s src.de -t hyp1.en --model Unbabel/wmt20-comet-qe-da

Note: We are currently working on licensing and releasing Unbabel/wmt22-cometkiwi-da; in the meantime, that model is not available.

Comparing multiple systems:

When comparing multiple MT systems, we encourage you to run the comet-compare command to get statistical significance with a paired t-test and bootstrap resampling (Koehn, 2004).

comet-compare -s src.de -t hyp1.en hyp2.en hyp3.en -r ref.en
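To make the bootstrap resampling idea concrete, here is a minimal sketch of paired bootstrap resampling over segment-level scores. It is not the comet-compare implementation: the function name `paired_bootstrap`, its parameters, and the toy scores are all assumptions made for illustration.

```python
import random


def paired_bootstrap(scores_a, scores_b, num_samples=1000, seed=0):
    """Return the fraction of resamples in which system A's mean beats system B's.

    scores_a and scores_b are segment-level scores for the same segments,
    in the same order (hypothetical inputs; any segment-level metric works).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same segments")
    if not scores_a:
        raise ValueError("need at least one segment")
    rng = random.Random(seed)
    n = len(scores_a)
    wins_a = 0
    for _ in range(num_samples):
        # Draw segment indices with replacement and reuse them for both
        # systems, which is what makes the comparison "paired".
        idx = [rng.randrange(n) for _ in range(n)]
        mean_a = sum(scores_a[i] for i in idx) / n
        mean_b = sum(scores_b[i] for i in idx) / n
        if mean_a > mean_b:
            wins_a += 1
    return wins_a / num_samples


# Toy scores, invented for illustration only.
system_a = [0.82, 0.79, 0.91, 0.75, 0.88]
system_b = [0.80, 0.70, 0.85, 0.74, 0.83]
win_rate = paired_bootstrap(system_a, system_b)
print(f"System A wins in {win_rate:.1%} of resamples")
```

A win rate close to 1 (or close to 0) suggests that the difference between the two systems is unlikely to be an artifact of which segments happen to be in the test set.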

Minimum Bayes Risk Decoding:

The MBR command allows you to rank translations and select the best one according to COMET metrics. For more details you can read our paper on Quality-Aware Decoding for Neural Machine Translation.

comet-mbr -s [SOURCE].txt -t [MT_SAMPLES].txt --num_sample [X] -o [OUTPUT_FILE].txt

If you are working with a very large candidate list, you can use the --rerank_top_k flag to prune it to the top-k most promising candidates according to a reference-free metric.

Example for a candidate list of 1000 samples:

comet-mbr -s [SOURCE].txt -t [MT_SAMPLES].txt -o [OUTPUT_FILE].txt --num_sample 1000 --rerank_top_k 100 --gpus 4 --qe_model Unbabel/wmt20-comet-qe-da
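The selection step behind MBR decoding can be sketched in a few lines. The example below is not the comet-mbr implementation: it swaps in a toy token-overlap F1 as the utility function instead of a COMET model, and the names `token_f1` and `mbr_select` are hypothetical. Each candidate is scored against every other candidate, which act as pseudo-references, and the candidate with the highest average utility is returned.

```python
def token_f1(hypothesis, reference):
    """Toy utility: F1 over unique whitespace tokens (a stand-in for COMET)."""
    hyp_tokens = set(hypothesis.lower().split())
    ref_tokens = set(reference.lower().split())
    overlap = len(hyp_tokens & ref_tokens)
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def mbr_select(candidates, utility=token_f1):
    """Return the candidate with the highest average utility against the others."""
    if not candidates:
        raise ValueError("need at least one candidate")
    if len(candidates) == 1:
        return candidates[0]

    def expected_utility(i):
        # Every other candidate acts as a pseudo-reference for candidate i.
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(utility(candidates[i], o) for o in others) / len(others)

    best = max(range(len(candidates)), key=expected_utility)
    return candidates[best]


# Toy candidate list, invented for illustration only.
samples = [
    "The fire could be stopped",
    "The fire could have been stopped",
    "The fire was able to stop",
    "Schools were open",
]
print(mbr_select(samples))
```

Because every candidate is compared with every other one, the cost grows quadratically with the size of the candidate list, which is why pruning the list first helps for large sample sizes.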

COMET Models:

To evaluate your translations, we suggest using one of two models:

Default model: Unbabel/wmt22-comet-da - This model uses a reference-based regression approach and is built on top of XLM-R. It has been trained on direct assessments from WMT17 to WMT20 and provides scores ranging from 0 to 1, where 1 represents a perfect translation.
Upcoming model: Unbabel/wmt22-cometkiwi-da - This reference-free model uses a regression approach and is built on top of InfoXLM. It has been trained on direct assessments from WMT17 to WMT20, as well as direct assessments from the MLQE-PE corpus. Like the default model, it also provides scores ranging from 0 to 1.

For versions prior to 2.0, you can still use Unbabel/wmt20-comet-da, which is the primary metric, and Unbabel/wmt20-comet-qe-da for the respective reference-free version. You can find a list of all other models developed in previous versions on our MODELS page. For more information, please refer to the model licenses.

Languages Covered:

All of the above-mentioned models are built on top of XLM-R, which covers the following languages:

Afrikaans, Albanian, Amharic, Arabic, Armenian, Assamese, Azerbaijani, Basque, Belarusian, Bengali, Bengali Romanized, Bosnian, Breton, Bulgarian, Burmese, Catalan, Chinese (Simplified), Chinese (Traditional), Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Filipino, Finnish, French, Galician, Georgian, German, Greek, Gujarati, Hausa, Hebrew, Hindi, Hindi Romanized, Hungarian, Icelandic, Indonesian, Irish, Italian, Japanese, Javanese, Kannada, Kazakh, Khmer, Korean, Kurdish (Kurmanji), Kyrgyz, Lao, Latin, Latvian, Lithuanian, Macedonian, Malagasy, Malay, Malayalam, Marathi, Mongolian, Nepali, Norwegian, Oriya, Oromo, Pashto, Persian, Polish, Portuguese, Punjabi, Romanian, Russian, Sanskrit, Scottish Gaelic, Serbian, Sindhi, Sinhala, Slovak, Slovenian, Somali, Spanish, Sundanese, Swahili, Swedish, Tamil, Tamil Romanized, Telugu, Telugu Romanized, Thai, Turkish, Ukrainian, Urdu, Urdu Romanized, Uyghur, Uzbek, Vietnamese, Welsh, Western Frisian, Xhosa, Yiddish.

Thus, results for language pairs containing uncovered languages are unreliable!
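A simple guard can flag uncovered language pairs before scoring. The sketch below is only an illustration: `COVERED_LANGUAGES` is a short excerpt of the list above, and the helper name `uncovered_languages` is hypothetical rather than part of COMET.

```python
# Excerpt only: extend this set with the full list above before relying on it.
COVERED_LANGUAGES = {
    "English",
    "German",
    "Portuguese",
    "Chinese (Simplified)",
    "Swahili",
}


def uncovered_languages(source_lang, target_lang, covered=COVERED_LANGUAGES):
    """Return the languages of the pair that are missing from `covered`."""
    return [lang for lang in (source_lang, target_lang) if lang not in covered]


# "Klingon" is a made-up example of a language that is not covered.
missing = uncovered_languages("German", "Klingon")
if missing:
    print(f"Warning: scores may be unreliable, uncovered language(s): {missing}")
```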

Scoring within Python:

from comet import download_model, load_from_checkpoint

model_path = download_model("Unbabel/wmt22-comet-da")
model = load_from_checkpoint(model_path)
data = [
    {
        "src": "Dem Feuer konnte Einhalt geboten werden",
        "mt": "The fire could be stopped",
        "ref": "They were able to control the fire."
    },
    {
        "src": "Schulen und Kindergärten wurden eröffnet.",
        "mt": "Schools and kindergartens were open",
        "ref": "Schools and kindergartens opened"
    }
]
model_output = model.predict(data, batch_size=8, gpus=1)
print(model_output)
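If your segments live in the same one-segment-per-line files used by the CLI examples, a small helper can build the list of dictionaries shown above. This is a sketch under stated assumptions: `build_comet_inputs` is a hypothetical helper, not part of the comet package, and omitting the "ref" key for reference-free models is assumed from the fact that the reference-free CLI example takes no -r argument.

```python
import tempfile
from pathlib import Path


def build_comet_inputs(src_path, mt_path, ref_path=None):
    """Read parallel one-segment-per-line files into a list of dicts.

    Each dict has "src" and "mt" keys, plus "ref" when ref_path is given.
    """
    def read_lines(path):
        return Path(path).read_text(encoding="utf-8").splitlines()

    columns = {"src": read_lines(src_path), "mt": read_lines(mt_path)}
    if ref_path is not None:
        columns["ref"] = read_lines(ref_path)

    # Misaligned files would silently pair the wrong segments, so fail early.
    lengths = {name: len(lines) for name, lines in columns.items()}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"files have different numbers of lines: {lengths}")

    keys = list(columns)
    return [dict(zip(keys, row)) for row in zip(*columns.values())]


# Demo with temporary files so the example does not touch your working directory.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "src.de").write_text("Dem Feuer konnte Einhalt geboten werden\n", encoding="utf-8")
    (tmp / "hyp1.en").write_text("The fire could be stopped\n", encoding="utf-8")
    (tmp / "ref.en").write_text("They were able to control the fire.\n", encoding="utf-8")
    data = build_comet_inputs(tmp / "src.de", tmp / "hyp1.en", tmp / "ref.en")
    qe_data = build_comet_inputs(tmp / "src.de", tmp / "hyp1.en")

print(data)
```

The resulting `data` list has the same shape as the hand-written one above and can be passed to model.predict in the same way.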

Train your own Metric:

Instead of using pretrained models, you can train your own model with the following command:

comet-train --cfg configs/models/{your_model_config}.yaml

You can then use your own metric to score:

comet-score -s src.de -t hyp1.en -r ref.en --model PATH/TO/CHECKPOINT

You can also upload your model to the Hugging Face Hub, using Unbabel/wmt22-comet-da as an example. You can then use your model directly from the Hub.

unittest:

To run the toolkit tests, use the following commands:

coverage run --source=comet -m unittest discover
coverage report -m # Expected coverage 78%

Note: Testing on CPU takes a long time.

Publications

If you use COMET please cite our work and don't forget to say which model you used!
