
sk-mt's People

Contributors

dirkiedai, zrustc

Forkers

zth9730 vhientran

sk-mt's Issues

Fail to Reproduce Results of Multi-Domain Dataset

Thanks for your nice work! I am trying to reproduce the results on the multi-domain datasets.
However, the results I get are quite different from those reported in the paper.
When I follow your guide and reproduce the experiment starting from text pre-retrieval, the final results with my own retrieval-processed test_tm cannot reach the results I can reproduce with the test_tm you provided. The hyperparameters are consistent with your paper. My results are listed below. Except for the koran domain, they fall well short of the results in the paper:

domain     mine     paper
koran      19.52    18.9
it         41.91    43.9
medical    51.24    55.2
law        55.50    61.6
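To make the size of each gap explicit, the differences can be computed directly from the four pairs above. This is only a small arithmetic sketch; the variable names `results` and `gaps` are illustrative and not part of the repository.

```shell
#!/usr/bin/env bash
# Reproduced score and paper score per domain, copied from the list above.
results='koran 19.52 18.9
it 41.91 43.9
medical 51.24 55.2
law 55.50 61.6'

# Print "domain signed-difference" (reproduced minus paper), two decimals.
gaps=$(printf '%s\n' "$results" | awk '{ printf "%s %+.2f\n", $1, $2 - $3 }')
echo "$gaps"
```

The shortfall grows from about 2 points on it to about 6 points on law, while koran is slightly above the paper's number.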

The following is my script for reproducing the law domain; please check whether there is any problem. I use pytorch=1.12.0, python=3.8, numpy=1.23.0, elasticsearch=7.0.0, faiss-gpu=1.7.3.
The BPE processing of my data and the fairseq binarization are also consistent with the log information of the data you provided.
1. Retrieval

PROJECT_PATH=/home/npc/sk-mt-fairseq
domain=law
type=test
DATA_PATH=/home/npc/sk-mt-fairseq/process-data/bpe-data/$domain

for split in train dev test
do
    paste -d '\t' $DATA_PATH/$split.bpe.de $DATA_PATH/$split.bpe.en > $DATA_PATH/$split.txt
done

python $PROJECT_PATH/bm25_retrieval.py \
    --build_index --search_index \
    --index_file $DATA_PATH/train.txt \
    --search_file $DATA_PATH/$type.txt \
    --output_file $DATA_PATH/$domain.$type \
    --index_name $domain --topk 64 \
    --task domain_adaptation
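One sanity check that may be worth running before indexing: `paste` treats a shorter input as an endless source of empty lines, so a line-count mismatch between the `.de` and `.en` files would silently produce rows with one empty side. The sketch below counts such rows in a pasted file. The helper name `count_bad_rows` and the demo files are hypothetical and only illustrate the check.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count rows that do not consist of exactly two
# non-empty tab-separated fields.
count_bad_rows() {
    awk -F '\t' 'NF != 2 || $1 == "" || $2 == "" { bad++ } END { print bad + 0 }' "$1"
}

# Self-contained demo: the source side has one line more than the target side.
demo_dir=$(mktemp -d)
printf 'ein Haus\nzwei Haeuser\ndrei\n' > "$demo_dir/demo.de"
printf 'a house\ntwo houses\n' > "$demo_dir/demo.en"
paste -d '\t' "$demo_dir/demo.de" "$demo_dir/demo.en" > "$demo_dir/demo.txt"

bad_rows=$(count_bad_rows "$demo_dir/demo.txt")
echo "rows with a missing side: $bad_rows"
```

Pointing `count_bad_rows` at `$DATA_PATH/train.txt` and `$DATA_PATH/$type.txt` should report 0 if both sides are aligned.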

2. Process

PROJECT_PATH="/home/npc/sk-mt-fairseq"
domain=law
DATA_PATH="/home/npc/sk-mt-fairseq/process-data/bpe-data/$domain"
tmp_dir="/home/npc/sk-mt-fairseq/process-data/tmp_dir/$domain/$type"
type=test
max_t=64

python $PROJECT_PATH/data_clean.py \
        --input $DATA_PATH/$domain.$type \
        --output $tmp_dir --subset $type \
        --max-t $max_t \
        --task translation


DEST_PATH="/home/npc/sk-mt-fairseq/process-data/data-bin/$domain/test_tm"
DICT_PATH="/home/npc/base-model"
for i in $(seq 1 $max_t)
do
    if [ $type == 'dev' ]
    then
        fairseq-preprocess --validpref $tmp_dir/${type}${i} -s de -t en --destdir $DEST_PATH/$i --srcdict $DICT_PATH/dict.de.txt --tgtdict $DICT_PATH/dict.en.txt --workers 20
    else
        fairseq-preprocess --testpref $tmp_dir/${type}${i} -s de -t en --destdir $DEST_PATH/$i --srcdict $DICT_PATH/dict.de.txt --tgtdict $DICT_PATH/dict.en.txt --workers 20
    fi
done
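One detail in the step 2 script may be worth checking, assuming it is run in a fresh shell rather than in the shell that already ran step 1: `tmp_dir` is built from `$type` one line before `type=test` is assigned. The shell expands a variable when the assignment executes, so a later assignment does not update a path that was built earlier. The sketch below only illustrates that behaviour; the names `tmp_dir_early` and `tmp_dir_late` and the `/tmp/sk-mt-demo` path are hypothetical.

```shell
#!/usr/bin/env bash
# Illustration only: variable names and the /tmp path are hypothetical.
unset type
domain=law

tmp_dir_early="/tmp/sk-mt-demo/$domain/$type"   # $type is still empty here
type=test
tmp_dir_late="/tmp/sk-mt-demo/$domain/$type"    # $type is now "test"

echo "assigned before type=test: $tmp_dir_early"
echo "assigned after  type=test: $tmp_dir_late"
```

If step 1 was run in the same shell session, `$type` is already set and both paths are identical, so this would not explain the score gap on its own.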

3. Inference with SK-MT

MODEL_PATH=/home/npc/base-model/wmt19.de-en.ffn8192.pt
domain=law
OUTPUT_PATH=/home/npc/sk-mt-fairseq/output
DATA_PATH=/home/npc/sk-mt-fairseq/process-data/data-bin/$domain
#DATA_PATH=/home/npc/sk-mt-fairseq/binarized_data/$domain
mkdir -p "$OUTPUT_PATH"

CUDA_VISIBLE_DEVICES=0 python3 experimental_generate.py $DATA_PATH \
    --gen-subset test \
    --path $MODEL_PATH --arch transformer_wmt19_de_en_with_datastore \
    --task translation_tm \
    --beam 4 --lenpen 0.6 --max-len-a 1.2 --max-len-b 10 --source-lang de --target-lang en \
    --scoring sacrebleu \
    --batch-size 16 \
    --tm-counts 16 \
    --fp16 \
    --tokenizer moses --remove-bpe \
    --model-overrides "{'load_knn_datastore': False, 'use_knn_datastore': True, 'dstore_fp16': True, 'k': 1, 'probe': 32,
    'knn_sim_func': 'do_not_recomp_l2', 'use_gpu_to_search': True, 'move_dstore_to_mem': True, 'no_load_keys': True,
    'knn_temperature_type': 'fix', 'knn_temperature_value': 100, 'knn_lambda_temperature_value': 100,
     }" \
    | tee "$OUTPUT_PATH"/generate_$domain.txt

Could you please help me find out what the problem is? Looking forward to your reply.

Wrong binarized data for fairseq

Thanks for your great code!
However, I found something wrong with the binarized data you provided for fairseq.
According to preprocess.log, the binarized data appears to be a preprocessed wikitext-103 dataset for a language-modeling task.

[None] Dictionary: 267743 types
[None] /apdcephfs/share_916081/dirkiedai/datasets/wikitext-103/wiki.test.tokens: 4358 sents, 245569 tokens, 0.0% replaced by <unk>
Wrote preprocessed data to /apdcephfs/share_916081/dirkiedai/data-bin/wikitext-103

I tried to run the preprocessing myself, but could not work out the required text data format from the code.
Could you please re-upload the correct dataset or release the script for fairseq preprocessing?
Looking forward to your reply.
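A quick way to confirm which corpus a preprocess.log describes is to pull the input path and sentence count out of the log line. The sketch below does this for the line quoted above; the `sed` patterns assume the log format shown in this issue, and the variable names are illustrative.

```shell
#!/usr/bin/env bash
# Log line copied from the issue above.
log_line='[None] /apdcephfs/share_916081/dirkiedai/datasets/wikitext-103/wiki.test.tokens: 4358 sents, 245569 tokens, 0.0% replaced by <unk>'

# Input file that was binarized: the text between the "[...]" prefix and the first ": ".
src_file=$(printf '%s\n' "$log_line" | sed -E 's/^\[[^]]*\] ([^:]+): .*/\1/')

# Number of sentences reported for that file.
n_sents=$(printf '%s\n' "$log_line" | sed -E 's/.*: ([0-9]+) sents.*/\1/')

echo "binarized file: $src_file"
echo "sentences:      $n_sents"
```

For a translation test set one would expect a path ending in a language suffix such as `.de` or `.en` rather than `wiki.test.tokens`.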

Process does not generate test{1-64}.de or test{1-64}.en files

Thanks for your great code!
Hi, when I run the data-processing code (python $PROJECT_PATH/data_clean.py --output $tmp_dir), it does not generate the "test.de" or "test.en" files that the fairseq preprocessing step requires.
My data processing script is as follows:
PROJECT_PATH=/home/npc/sk-mt-fairseq
DATA_PATH=/home/npc/sk-mt-fairseq/data_bpe/$domain
tmp_dir=/home/npc/sk-mt-fairseq/data_bpe/$domain/tm-dir
domain=koran
type=test
max_t=64

python $PROJECT_PATH/data_clean.py \
    --input $DATA_PATH/$domain.$type \
    --output $tmp_dir --subset $type \
    --max-t $max_t

DEST_PATH=/home/npc/sk-mt-fairseq/data_bpe/data-bin/koran/test_tm
DICT_PATH=/home/npc/wmt19-model
for i in $(seq 1 $max_t)
do
    if [ $type == 'dev' ]
    then
        fairseq-preprocess --validpref $tmp_dir/${type}${i} -s de -t en --destdir $DEST_PATH/$i --srcdict $DICT_PATH/dict.de.txt --tgtdict $DICT_PATH/dict.en.txt --workers 20
    else
        fairseq-preprocess --testpref $tmp_dir/${type}${i} -s de -t en --destdir $DEST_PATH/$i --srcdict $DICT_PATH/dict.de.txt --tgtdict $DICT_PATH/dict.en.txt --workers 20
    fi
done

Running python $PROJECT_PATH/data_clean.py produces no error, but also no output. I am sure my input file 'koran.test' exists and that there is no problem with it.
How can I solve this? Looking forward to your reply.
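Two things may help narrow this down. First, in the script as posted, `DATA_PATH` and `tmp_dir` are assigned before `domain=koran`, so in a fresh shell they would be built with an empty `$domain` and the output may land in a different directory than expected. Second, `fairseq-preprocess --testpref $tmp_dir/test1 -s de -t en` reads `$tmp_dir/test1.de` and `$tmp_dir/test1.en`, so it is easy to list exactly which of those files are missing. The helper `missing_tm_files` below is a hypothetical sketch of such a check and is not part of the repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every per-rank text file that is missing or
# empty under a directory, for ranks 1..max_t and both languages.
missing_tm_files() {
    local dir=$1 subset=$2 max_t=$3 i lang
    for i in $(seq 1 "$max_t"); do
        for lang in de en; do
            [ -s "$dir/${subset}${i}.${lang}" ] || echo "$dir/${subset}${i}.${lang}"
        done
    done
}

# Self-contained demo in a scratch directory (paths are illustrative).
demo_dir=$(mktemp -d)
printf 'x\n' > "$demo_dir/test1.de"
printf 'x\n' > "$demo_dir/test1.en"
printf 'x\n' > "$demo_dir/test2.de"

missing=$(missing_tm_files "$demo_dir" test 2)
echo "missing: $missing"
```

Running `missing_tm_files "$tmp_dir" "$type" "$max_t"` right after `data_clean.py` would show whether any files were written to the directory the loop reads from.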
