dirkiedai / sk-mt
This is the official code for our paper "Simple and Scalable Nearest Neighbor Machine Translation" (ICLR 2023).
Hi, is it convenient to complete this part of the data?
Thanks for your nice work! I am trying to reproduce the results on the multi-domain datasets.
However, the results I get are quite different from those reported in the paper.
When I follow your guide and reproduce the experiment starting from text pre-retrieval, the final result obtained with my own retrieval-processed test_tm does not match the result I can reproduce with the test_tm you provided. The hyperparameters are consistent with your paper. My results are below. Except for the koran domain, they fall well short of the paper's numbers:
koran: 19.52 paper: 18.9
it: 41.91 paper: 43.9
medical: 51.24 paper: 55.2
law: 55.50 paper: 61.6
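To make the size of each gap explicit, the arithmetic can be sketched as below. The numbers are copied from the list above; the function name `bleu_gaps` is only illustrative.

```python
# BLEU scores copied from the report above: (reproduced, paper).
scores = {
    "koran": (19.52, 18.9),
    "it": (41.91, 43.9),
    "medical": (51.24, 55.2),
    "law": (55.50, 61.6),
}


def bleu_gaps(scores):
    """Return reproduced-minus-paper BLEU per domain, rounded to 2 decimals."""
    return {d: round(mine - paper, 2) for d, (mine, paper) in scores.items()}


gaps = bleu_gaps(scores)
for domain, gap in gaps.items():
    # A negative value means the reproduction is below the paper.
    print(f"{domain:8s} {gap:+.2f}")
```

The gap grows from it to medical to law, while koran is slightly above the reported number.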
The following is my script for reproducing the law domain; please check whether there is any problem. I use pytorch=1.12.0, python=3.8, numpy=1.23.0, elasticsearch=7.0.0, faiss-gpu=1.7.3.
The BPE processing and fairseq binarization of my data are also consistent with the log information of the data you provided.
1. Retrieval
PROJECT_PATH=/home/npc/sk-mt-fairseq
domain=law
type=test
DATA_PATH=/home/npc/sk-mt-fairseq/process-data/bpe-data/$domain
for split in train dev test
do
paste -d '\t' $DATA_PATH/$split.bpe.de $DATA_PATH/$split.bpe.en > $DATA_PATH/$split.txt
done
python $PROJECT_PATH/bm25_retrieval.py \
--build_index --search_index \
--index_file $DATA_PATH/train.txt \
--search_file $DATA_PATH/$type.txt \
--output_file $DATA_PATH/$domain.$type \
--index_name $domain --topk 64 \
--task domain_adaptation
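Since the retrieval input is built with `paste -d '\t'`, one thing that may be worth ruling out is a malformed line in the resulting `.txt` files. The following is a minimal sketch, not part of the repository: `check_parallel_tsv` is a hypothetical helper, and it assumes every line should hold exactly one non-empty source field and one non-empty target field. Whether `bm25_retrieval.py` tolerates other layouts is not stated in this thread.

```python
def check_parallel_tsv(lines):
    """Return the 1-based numbers of lines that are not 'source<TAB>target'."""
    bad = []
    for n, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("\t")
        # Exactly two fields are expected, and neither may be blank.
        if len(fields) != 2 or not all(f.strip() for f in fields):
            bad.append(n)
    return bad


# Hypothetical sample in the layout produced by `paste -d '\t' src tgt`.
sample = ["Guten Morgen\tGood morning\n", "nur Quelle\t\n", "a\tb\tc\n"]
bad_lines = check_parallel_tsv(sample)
print(bad_lines)  # lines 2 and 3 of the sample are malformed
```

For a real file, the same function can be given an open file handle, for example `check_parallel_tsv(open(path, encoding="utf-8"))`.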
2. Process
PROJECT_PATH="/home/npc/sk-mt-fairseq"
domain=law
# type is assigned before tmp_dir because tmp_dir references $type
type=test
DATA_PATH="/home/npc/sk-mt-fairseq/process-data/bpe-data/$domain"
tmp_dir="/home/npc/sk-mt-fairseq/process-data/tmp_dir/$domain/$type"
max_t=64
python $PROJECT_PATH/data_clean.py \
--input $DATA_PATH/$domain.$type \
--output $tmp_dir --subset $type \
--max-t $max_t \
--task translation
DEST_PATH="/home/npc/sk-mt-fairseq/process-data/data-bin/$domain/test_tm"
DICT_PATH="/home/npc/base-model"
for i in $(seq 1 $max_t)
do
if [ "$type" = 'dev' ]
then
fairseq-preprocess --validpref $tmp_dir/${type}${i} -s de -t en --destdir $DEST_PATH/$i --srcdict $DICT_PATH/dict.de.txt --tgtdict $DICT_PATH/dict.en.txt --workers 20
else
fairseq-preprocess --testpref $tmp_dir/${type}${i} -s de -t en --destdir $DEST_PATH/$i --srcdict $DICT_PATH/dict.de.txt --tgtdict $DICT_PATH/dict.en.txt --workers 20
fi
done
3. Inference with SK-MT
MODEL_PATH=/home/npc/base-model/wmt19.de-en.ffn8192.pt
domain=law
OUTPUT_PATH=/home/npc/sk-mt-fairseq/output
DATA_PATH=/home/npc/sk-mt-fairseq/process-data/data-bin/$domain
#DATA_PATH=/home/npc/sk-mt-fairseq/binarized_data/$domain
mkdir -p "$OUTPUT_PATH"
CUDA_VISIBLE_DEVICES=0 python3 experimental_generate.py $DATA_PATH \
--gen-subset test \
--path $MODEL_PATH --arch transformer_wmt19_de_en_with_datastore \
--task translation_tm \
--beam 4 --lenpen 0.6 --max-len-a 1.2 --max-len-b 10 --source-lang de --target-lang en \
--scoring sacrebleu \
--batch-size 16 \
--tm-counts 16 \
--fp16 \
--tokenizer moses --remove-bpe \
--model-overrides "{'load_knn_datastore': False, 'use_knn_datastore': True, 'dstore_fp16': True, 'k': 1, 'probe': 32,
'knn_sim_func': 'do_not_recomp_l2', 'use_gpu_to_search': True, 'move_dstore_to_mem': True, 'no_load_keys': True,
'knn_temperature_type': 'fix', 'knn_temperature_value': 100, 'knn_lambda_temperature_value': 100,
}" \
| tee "$OUTPUT_PATH"/generate_$domain.txt
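The `--model-overrides` value above is a Python dict literal spread over three lines with a trailing comma. Assuming fairseq evaluates this string as a Python literal (this may differ between versions), a quick sketch to confirm that the text parses and to list the values it yields:

```python
import ast

# The override string exactly as passed in the command above.
overrides_text = """{'load_knn_datastore': False, 'use_knn_datastore': True, 'dstore_fp16': True, 'k': 1, 'probe': 32,
'knn_sim_func': 'do_not_recomp_l2', 'use_gpu_to_search': True, 'move_dstore_to_mem': True, 'no_load_keys': True,
'knn_temperature_type': 'fix', 'knn_temperature_value': 100, 'knn_lambda_temperature_value': 100,
}"""

# literal_eval accepts newlines and a trailing comma inside the braces.
overrides = ast.literal_eval(overrides_text)
for key, value in overrides.items():
    print(f"{key} = {value!r}")
```

This only checks the syntax of the string; it says nothing about whether the model actually consumes each key.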
Could you please help me find out what is the problem? Looking forward to your reply.
Thanks for your great code!
However, I found something wrong with the binarized data you provided for fairseq.
According to preprocess.log, the binarized data appears to be a preprocessed wikitext-103 dataset for a language modeling task:
[None] Dictionary: 267743 types
[None] /apdcephfs/share_916081/dirkiedai/datasets/wikitext-103/wiki.test.tokens: 4358 sents, 245569 tokens, 0.0% replaced by <unk>
Wrote preprocessed data to /apdcephfs/share_916081/dirkiedai/data-bin/wikitext-103
I tried to do the preprocessing myself but could not work out the required text data format from the code.
Could you please re-upload the correct dataset or release the script for fairseq preprocessing?
Looking forward to your reply.
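For anyone who wants to check which source files a given preprocess.log refers to, a minimal sketch follows. The regular expression is written against the log line quoted above and is an assumption about the general log format; `binarized_sources` is a hypothetical helper name.

```python
import re

# Matches entries such as
# "[None] /path/to/file: 4358 sents, 245569 tokens, 0.0% replaced by <unk>"
LINE_RE = re.compile(
    r"\[(?P<lang>[^\]]*)\]\s+(?P<path>\S+):\s+"
    r"(?P<sents>\d+) sents,\s+(?P<tokens>\d+) tokens"
)


def binarized_sources(log_text):
    """Return (lang, path, sentences, tokens) for each binarized file in the log."""
    return [
        (m["lang"], m["path"], int(m["sents"]), int(m["tokens"]))
        for m in LINE_RE.finditer(log_text)
    ]


# The log content quoted in this issue.
log_text = (
    "[None] Dictionary: 267743 types "
    "[None] /apdcephfs/share_916081/dirkiedai/datasets/wikitext-103/wiki.test.tokens: "
    "4358 sents, 245569 tokens, 0.0% replaced by <unk> "
    "Wrote preprocessed data to /apdcephfs/share_916081/dirkiedai/data-bin/wikitext-103"
)
found = binarized_sources(log_text)
for lang, path, sents, tokens in found:
    print(lang, path, sents, tokens)
```

A path containing `wikitext-103`, as here, indicates the log does not belong to the de-en translation data.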
Hi, when I run the data-processing code (python $PROJECT_PATH/data_clean.py with --output $tmp_dir), it does not generate the "test.de" or "test.en" files that the fairseq preprocessing step requires.
My data processing script is as follows:
`PROJECT_PATH=/home/npc/sk-mt-fairseq
# domain and type are assigned first because the paths below reference them
domain=koran
type=test
max_t=64
DATA_PATH=/home/npc/sk-mt-fairseq/data_bpe/$domain
tmp_dir=/home/npc/sk-mt-fairseq/data_bpe/$domain/tm-dir
python $PROJECT_PATH/data_clean.py \
--input $DATA_PATH/$domain.$type \
--output $tmp_dir --subset $type \
--max-t $max_t
DEST_PATH=/home/npc/sk-mt-fairseq/data_bpe/data-bin/koran/test_tm
DICT_PATH=/home/npc/wmt19-model
for i in $(seq 1 $max_t)
do
if [ "$type" = 'dev' ]
then
fairseq-preprocess --validpref
else
fairseq-preprocess --testpref
fi
done`
Running python $PROJECT_PATH/data_clean.py gives no error, but there is no output. I'm sure my input file 'koran.test' exists and has no problems.
How should I solve this? Looking forward to your reply.
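One way to narrow this down is to check which text files the later fairseq-preprocess loop would look for and whether they exist. The sketch below is not part of the repository. It assumes the per-index naming `<subset><i>.de` / `<subset><i>.en` used by the `--testpref $tmp_dir/${type}${i}` call in the other script in this thread, and `missing_preprocess_inputs` is a hypothetical helper. If everything is reported missing, it is also worth printing the resolved input and output paths, since a path variable that was empty when it was expanded would send the output elsewhere.

```python
import tempfile
from pathlib import Path


def missing_preprocess_inputs(tmp_dir, subset, max_t, langs=("de", "en")):
    """List expected '<subset><i>.<lang>' files that are absent or empty."""
    tmp_dir = Path(tmp_dir)
    missing = []
    for i in range(1, max_t + 1):
        for lang in langs:
            path = tmp_dir / f"{subset}{i}.{lang}"
            if not path.is_file() or path.stat().st_size == 0:
                missing.append(path)
    return missing


# Demonstration on a throwaway directory that only holds the files for i=1.
with tempfile.TemporaryDirectory() as d:
    (Path(d) / "test1.de").write_text("ein Satz\n", encoding="utf-8")
    (Path(d) / "test1.en").write_text("a sentence\n", encoding="utf-8")
    missing_names = [p.name for p in missing_preprocess_inputs(d, "test", max_t=2)]

print(missing_names)
```

On a real run, the call would be `missing_preprocess_inputs(tmp_dir, "test", 64)` with `tmp_dir` set to the directory passed to `--output`.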