dirkiedai / sk-mt

This is the official code for our paper "Simple and Scalable Nearest Neighbor Machine Translation" (ICLR 2023).

Shell 2.08% Perl 4.33% Python 93.59%

k-nearest-neighbors machine-translation

sk-mt's Issues

Need help

Missing dev data in bucket 100/200/500/1000 for online learning

Hi, would it be possible to provide this missing part of the data?

Fail to Reproduce Results of Multi-Domain Dataset

Thanks for your nice work! I am trying to reproduce the results on the multi-domain datasets.
However, the results I get are quite different from those reported in the paper.
When I followed your guide and reproduced the experiment starting from text pre-retrieval, the final result using my own retrieval-processed test_tm could not match the result I can reproduce with the test_tm you provided. The hyperparameters are consistent with your paper. My results are below. Except for the koran domain, the results in the other domains fall seriously short of those in the paper:

koran:   19.52    paper: 18.9
it:      41.91    paper: 43.9
medical: 51.24    paper: 55.2
law:     55.50    paper: 61.6
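
To make the size of each gap explicit, the numbers above can be subtracted directly. This is only a small sketch; `score_gap` is a hypothetical helper written for this thread, not something shipped with sk-mt.

```shell
#!/usr/bin/env bash
# score_gap is a hypothetical helper, not part of sk-mt.
# It prints reproduced-minus-paper BLEU with an explicit sign.
score_gap() {
    # $1 = reproduced BLEU, $2 = BLEU reported in the paper
    awk -v mine="$1" -v paper="$2" 'BEGIN { printf "%+.2f\n", mine - paper }'
}

# Domain, reproduced score, paper score, exactly as quoted in this issue.
while read -r domain mine paper; do
    printf '%-8s %s\n' "$domain" "$(score_gap "$mine" "$paper")"
done <<'EOF'
koran 19.52 18.9
it 41.91 43.9
medical 51.24 55.2
law 55.50 61.6
EOF
```

The law domain shows the largest shortfall (6.10 BLEU), while koran is slightly above the paper (0.62 BLEU).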

Below is my script for reproducing the law domain; please check whether there is any problem. I use pytorch=1.12.0, python=3.8, numpy=1.23.0, elasticsearch=7.0.0, faiss-gpu=1.7.3.
The BPE processing and fairseq binarization of my data are also consistent with the log information of the data you provided.
1. Retrieval

PROJECT_PATH=/home/npc/sk-mt-fairseq
domain=law
type=test
DATA_PATH=/home/npc/sk-mt-fairseq/process-data/bpe-data/$domain

for split in train dev test
do
    paste -d '\t' $DATA_PATH/$split.bpe.de $DATA_PATH/$split.bpe.en > $DATA_PATH/$split.txt
done

python $PROJECT_PATH/bm25_retrieval.py \
    --build_index --search_index \
    --index_file $DATA_PATH/train.txt \
    --search_file $DATA_PATH/$type.txt \
    --output_file $DATA_PATH/$domain.$type \
    --index_name $domain --topk 64 \
    --task domain_adaptation
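
Before indexing, it may be worth checking that every line of the pasted file really has a source and a target side. If the two BPE files differ in length or contain empty lines, `paste` still succeeds but produces lines with an empty field. The sketch below counts such lines; `check_parallel_tsv` is a hypothetical helper, and the assumption that bm25_retrieval.py expects exactly two tab-separated fields per line comes only from the `paste -d '\t'` call above.

```shell
#!/usr/bin/env bash
# check_parallel_tsv is a hypothetical helper, not part of sk-mt.
# It prints the number of lines that do not consist of exactly two
# non-empty tab-separated fields (source<TAB>target).
check_parallel_tsv() {
    awk -F '\t' 'NF != 2 || $1 == "" || $2 == "" { bad++ } END { print bad + 0 }' "$1"
}

# Tiny self-contained demo on a temporary file:
# one good line, one line with an empty source, one line with no tab.
demo=$(mktemp)
printf 'ein Haus\ta house\n\tmissing source\nnur Quelle\n' > "$demo"
check_parallel_tsv "$demo"
rm -f "$demo"
```

On real data this would be called as `check_parallel_tsv $DATA_PATH/train.txt`; any count other than 0 points at misaligned or empty lines.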

2. Process

PROJECT_PATH="/home/npc/sk-mt-fairseq"
domain=law
# type is assigned before tmp_dir, which expands $type
type=test
DATA_PATH="/home/npc/sk-mt-fairseq/process-data/bpe-data/$domain"
tmp_dir="/home/npc/sk-mt-fairseq/process-data/tmp_dir/$domain/$type"
max_t=64

python $PROJECT_PATH/data_clean.py \
        --input $DATA_PATH/$domain.$type \
        --output $tmp_dir --subset $type \
        --max-t $max_t \
        --task translation


DEST_PATH="/home/npc/sk-mt-fairseq/process-data/data-bin/$domain/test_tm"
DICT_PATH="/home/npc/base-model"
for i in $(seq 1 $max_t)
do
    if [ $type == 'dev' ]
    then
        fairseq-preprocess --validpref $tmp_dir/${type}${i} -s de -t en --destdir $DEST_PATH/$i --srcdict $DICT_PATH/dict.de.txt --tgtdict $DICT_PATH/dict.en.txt --workers 20
    else
        fairseq-preprocess --testpref $tmp_dir/${type}${i} -s de -t en --destdir $DEST_PATH/$i --srcdict $DICT_PATH/dict.de.txt --tgtdict $DICT_PATH/dict.en.txt --workers 20
    fi
done
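
After the loop, a quick check that each of the `$max_t` directories was actually binarized can rule out a silent failure in one of the 64 runs. The sketch below is a hypothetical helper, not part of sk-mt. It assumes the usual fairseq-preprocess output names (`test.de-en.de.bin`, `test.de-en.de.idx`, and the same for `en`); with `--validpref` the split name written by fairseq is `valid` rather than `dev`.

```shell
#!/usr/bin/env bash
# missing_tm_dirs is a hypothetical helper, not part of sk-mt.
# Usage: missing_tm_dirs DEST_PATH MAX_T [SPLIT] [SRC] [TGT]
# Prints every index i in 1..MAX_T whose directory lacks one of the
# .bin/.idx files that fairseq-preprocess normally writes.
missing_tm_dirs() {
    local dest=$1 max_t=$2 split=${3:-test} src=${4:-de} tgt=${5:-en}
    local i lang ext
    for i in $(seq 1 "$max_t"); do
        for lang in "$src" "$tgt"; do
            for ext in bin idx; do
                if [ ! -f "$dest/$i/$split.$src-$tgt.$lang.$ext" ]; then
                    echo "$i"
                    continue 3   # report each index once, move to the next i
                fi
            done
        done
    done
}
```

Calling `missing_tm_dirs "$DEST_PATH" "$max_t"` prints nothing when all directories are complete.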

3. Inference with SK-MT

MODEL_PATH=/home/npc/base-model/wmt19.de-en.ffn8192.pt
domain=law
OUTPUT_PATH=/home/npc/sk-mt-fairseq/output
DATA_PATH=/home/npc/sk-mt-fairseq/process-data/data-bin/$domain
#DATA_PATH=/home/npc/sk-mt-fairseq/binarized_data/$domain
mkdir -p "$OUTPUT_PATH"

CUDA_VISIBLE_DEVICES=0 python3 experimental_generate.py $DATA_PATH \
    --gen-subset test \
    --path $MODEL_PATH --arch transformer_wmt19_de_en_with_datastore \
    --task translation_tm \
    --beam 4 --lenpen 0.6 --max-len-a 1.2 --max-len-b 10 --source-lang de --target-lang en \
    --scoring sacrebleu \
    --batch-size 16 \
    --tm-counts 16 \
    --fp16 \
    --tokenizer moses --remove-bpe \
    --model-overrides "{'load_knn_datastore': False, 'use_knn_datastore': True, 'dstore_fp16': True, 'k': 1, 'probe': 32,
    'knn_sim_func': 'do_not_recomp_l2', 'use_gpu_to_search': True, 'move_dstore_to_mem': True, 'no_load_keys': True,
    'knn_temperature_type': 'fix', 'knn_temperature_value': 100, 'knn_lambda_temperature_value': 100,
     }" \
    | tee "$OUTPUT_PATH"/generate_$domain.txt

Could you please help me find out what the problem is? Looking forward to your reply.

Wrong binarized data for fairseq

Thanks for your great code!
However, I found something wrong with the binarized data you provided for fairseq.
According to preprocess.log, the binarized data appears to be a preprocessed wikitext-103 dataset for the LM task.

[None] Dictionary: 267743 types
[None] /apdcephfs/share_916081/dirkiedai/datasets/wikitext-103/wiki.test.tokens: 4358 sents, 245569 tokens, 0.0% replaced by <unk>
Wrote preprocessed data to /apdcephfs/share_916081/dirkiedai/data-bin/wikitext-103

I tried to do the preprocessing myself but could not work out the required text data format from the code.
Could you please re-upload the correct dataset or release the script for fairseq preprocessing?
Looking forward to your reply.
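
For anyone else checking a downloaded data-bin directory, the corpus files and destination recorded in preprocess.log can be pulled out with a short grep. This is only a sketch: `preprocess_log_summary` is a hypothetical helper, and the pattern assumes the log line shapes quoted above.

```shell
#!/usr/bin/env bash
# preprocess_log_summary is a hypothetical helper, not part of sk-mt or fairseq.
# It prints the corpus files (with sentence counts) and the destination
# directory mentioned in a preprocess.log, assuming lines shaped like
#   [None] /path/to/file: 4358 sents, ...
#   Wrote preprocessed data to /path/to/data-bin
preprocess_log_summary() {
    grep -oE '(/[^ :]+: [0-9]+ sents|Wrote preprocessed data to [^ ]+)' "$1"
}
```

Running `preprocess_log_summary preprocess.log` on the log quoted in this issue would list the wikitext-103 test file and the wikitext-103 data-bin path, which is how the mismatch shows up.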

Process does not generate test{1-64}.de or test{1-64}.en file

Thanks for your great code!
Hi, when I run the data processing step (python $PROJECT_PATH/data_clean.py with --output $tmp_dir), it does not generate the "test.de" or "test.en" files that the fairseq preprocessing step requires.
My data processing script is as follows:
PROJECT_PATH=/home/npc/sk-mt-fairseq
# domain and type are assigned first so the paths below expand correctly
domain=koran
type=test
max_t=64
DATA_PATH=/home/npc/sk-mt-fairseq/data_bpe/$domain
tmp_dir=/home/npc/sk-mt-fairseq/data_bpe/$domain/tm-dir

python $PROJECT_PATH/data_clean.py \
    --input $DATA_PATH/$domain.$type \
    --output $tmp_dir --subset $type \
    --max-t $max_t

DEST_PATH=/home/npc/sk-mt-fairseq/data_bpe/data-bin/koran/test_tm
DICT_PATH=/home/npc/wmt19-model
for i in $(seq 1 $max_t)
do
    if [ $type == 'dev' ]
    then
        fairseq-preprocess --validpref $tmp_dir/${type}${i} -s de -t en --destdir $DEST_PATH/$i --srcdict $DICT_PATH/dict.de.txt --tgtdict $DICT_PATH/dict.en.txt --workers 20
    else
        fairseq-preprocess --testpref $tmp_dir/${type}${i} -s de -t en --destdir $DEST_PATH/$i --srcdict $DICT_PATH/dict.de.txt --tgtdict $DICT_PATH/dict.en.txt --workers 20
    fi
done

Running python $PROJECT_PATH/data_clean.py produces no error, but there is no output. I'm sure my input file 'koran.test' exists and has no problems.
How should I solve this? Looking forward to your reply.
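
One thing that may be worth ruling out when a script runs without errors but writes nothing where expected: a path built from a variable that is only assigned further down (or in a different shell session). The shell expands the unset variable to an empty string without complaint, so the path silently points somewhere else. The sketch below shows the effect with made-up placeholder paths; both function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: how assignment order changes a derived path.
# /data and both function names are placeholders, not taken from sk-mt.
build_path_wrong_order() {
    (
        unset domain
        data_path="/data/${domain:-}"   # domain is still unset here
        domain=koran
        echo "$data_path/$domain.test"
    )
}

build_path_right_order() {
    (
        domain=koran
        data_path="/data/$domain"       # domain is already set
        echo "$data_path/$domain.test"
    )
}

build_path_wrong_order   # prints /data//koran.test
build_path_right_order   # prints /data/koran/koran.test
```

Adding `set -u` at the top of a script turns a plain `$domain` expansion of an unset variable into an immediate error, which makes this kind of mistake visible.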
