ghchen18 / cdalign
Code for AAAI 2021 paper "Lexically Constrained Neural Machine Translation with Explicit Alignment Guidance"
License: MIT License
I am getting this error while trying to run extract_alignment.sh:
Traceback (most recent call last):
  File "/data/yugaljain/translation-pipeline/deltalm_setup/unilm/deltalm/extras/cdalign/generate_align.py", line 128, in <module>
    cli_main()
  File "/data/yugaljain/translation-pipeline/deltalm_setup/unilm/deltalm/extras/cdalign/generate_align.py", line 125, in cli_main
    main(args)
  File "/data/yugaljain/translation-pipeline/deltalm_setup/unilm/deltalm/extras/cdalign/generate_align.py", line 33, in main
    task.load_dataset(args.gen_subset)
  File "/data/yugaljain/translation-pipeline/deltalm_setup/unilm/deltalm/extras/cdalign/fairseq/tasks/translation.py", line 217, in load_dataset
    self.datasets[split] = load_langpair_dataset(
  File "/data/yugaljain/translation-pipeline/deltalm_setup/unilm/deltalm/extras/cdalign/fairseq/tasks/translation.py", line 54, in load_langpair_dataset
    src_dataset = data_utils.load_indexed_dataset(prefix + src, src_dict, dataset_impl)
  File "/data/yugaljain/translation-pipeline/deltalm_setup/unilm/deltalm/extras/cdalign/fairseq/data/data_utils.py", line 73, in load_indexed_dataset
    dataset = indexed_dataset.make_dataset(
  File "/data/yugaljain/translation-pipeline/deltalm_setup/unilm/deltalm/extras/cdalign/fairseq/data/indexed_dataset.py", line 60, in make_dataset
    return MMapIndexedDataset(path)
  File "/data/yugaljain/translation-pipeline/deltalm_setup/unilm/deltalm/extras/cdalign/fairseq/data/indexed_dataset.py", line 448, in __init__
    self._do_init(path)
  File "/data/yugaljain/translation-pipeline/deltalm_setup/unilm/deltalm/extras/cdalign/fairseq/data/indexed_dataset.py", line 458, in _do_init
    self._index = self.Index(index_file_path(self._path))
  File "/data/yugaljain/translation-pipeline/deltalm_setup/unilm/deltalm/extras/cdalign/fairseq/data/indexed_dataset.py", line 408, in __init__
    self._dtype = dtypes[dtype_code]
KeyError: 9

Exception ignored in: <function MMapIndexedDataset.Index.__del__ at 0x7f0ed7d15b80>
Traceback (most recent call last):
  File "/data/yugaljain/translation-pipeline/deltalm_setup/unilm/deltalm/extras/cdalign/fairseq/data/indexed_dataset.py", line 423, in __del__
    self._bin_buffer_mmap._mmap.close()
AttributeError: 'Index' object has no attribute '_bin_buffer_mmap'

Exception ignored in: <function MMapIndexedDataset.__del__ at 0x7f0ed7d190d0>
Traceback (most recent call last):
  File "/data/yugaljain/translation-pipeline/deltalm_setup/unilm/deltalm/extras/cdalign/fairseq/data/indexed_dataset.py", line 465, in __del__
    self._bin_buffer_mmap._mmap.close()
AttributeError: 'MMapIndexedDataset' object has no attribute '_bin_buffer_mmap'
Looking forward to a response on this. @ghchen18
Thanks
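For what it's worth: in the fairseq 0.9-era `indexed_dataset.py`, the `dtypes` table only maps codes 1 through 8; code 9 (uint32) was added in later fairseq releases. So `KeyError: 9` usually means the `.bin`/`.idx` files were binarized with a newer fairseq than the one loading them, and re-binarizing with the repo's bundled fairseq typically resolves it. Below is a small diagnostic sketch, assuming the standard `MMapIndexedDataset` index header layout (9-byte magic, 8-byte little-endian version, 1-byte dtype code); the filename in the usage comment is a placeholder:

```python
import struct

def read_index_dtype_code(header: bytes) -> int:
    """Return the dtype code stored in a fairseq MMapIndexedDataset .idx header."""
    if header[:9] != b'MMIDIDX\x00\x00':
        raise ValueError('not an MMapIndexedDataset index file')
    (version,) = struct.unpack('<Q', header[9:17])  # format version, usually 1
    return header[17]  # the code fairseq looks up in its dtypes table

# Usage (path is a placeholder):
# with open('train.src-tgt.src.idx', 'rb') as f:
#     print(read_index_dtype_code(f.read(18)))
```

A code of 9 or above in your `.idx` files would confirm a fairseq version mismatch.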
I found many occurrences of "talp" in scripts/extract_alignment.sh. What is it, and how can I get it?
When training the EAM-Output model with the parameters of the vanilla Transformer frozen, I got "'int' object has no attribute 'backward'".
The scripts are as follows:
echo "Start processing alignment data into fairseq data format"
python preprocess.py -s $src -t $tgt --dataset-impl lazy \
    --workers 8 --destdir $fseq --align-suffix align --joined-dictionary \
    --trainpref $fseq/bpe/train --validpref $fseq/bpe/valid \
    --srcdict
What is the "greedy" file used in cdalign/scripts/extract_phrase.py? When I run it, I get an error that the greedy file is missing, even though that argument defaults to false.
Does the code implement "Decoding with One-to-many Constraints"?
What is the function of this script, and is there any specific usage example?
First of all, your work is very impressive. 😀
I encountered some problems while reproducing it and would like to get some help.
① I got a KeyError when training the EAMOUT model on the preprocessed alignment dataset.
I used an NVIDIA V100, and the environment settings are as follows:
fairseq 0.9.0
torch 1.11.0
Any idea about the error?
Traceback (most recent call last):
File "/cdaAlign/cdalign-main/train.py", line 337, in <module>
cli_main()
File "/cdaAlign/cdalign-main/train.py", line 333, in cli_main
main(args)
File "/cdaAlign/cdalign-main/train.py", line 93, in main
train(args, trainer, task, epoch_itr)
File "/cdaAlign/cdalign-main/train.py", line 132, in train
for i, samples in enumerate(progress, start=epoch_itr.iterations_in_epoch):
File "/cdaAlign/cdalign-main/fairseq/progress_bar.py", line 181, in __iter__
for i, obj in enumerate(self.iterable, start=self.offset):
File "/cdaAlign/cdalign-main/fairseq/data/iterators.py", line 314, in __next__
chunk.append(next(self.itr))
File "/cdaAlign/cdalign-main/fairseq/data/iterators.py", line 43, in __next__
return next(self.itr)
File "/cdaAlign/cdalign-main/fairseq/data/iterators.py", line 36, in __iter__
for x in self.iterable:
File "/envs/cda-align/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 530, in __next__
data = self._next_data()
File "/envs/cda-align/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1224, in _next_data
return self._process_data(data)
File "/envs/cda-align/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1250, in _process_data
data.reraise()
File "/envs/cda-align/lib/python3.8/site-packages/torch/_utils.py", line 457, in reraise
raise exception
KeyError: Caught KeyError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/envs/cda-align/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
data = fetcher.fetch(index)
File "/envs/cda-align/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/envs/cda-align/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/cdaAlign/cdalign-main/fairseq/data/language_pair_dataset.py", line 215, in __getitem__
example['alignment'] = self.align_dataset[index]
File "/cdaAlign/cdalign-main/fairseq/data/indexed_dataset.py", line 222, in __getitem__
ptx = self.cache_index[i]
KeyError: 2206735
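One possible culprit, sketched below: with `--dataset-impl lazy`, the cached dataset's `cache_index` is filled by `prefetch()` in the main process, but DataLoader worker processes (worker 0 in the trace above) start with their own copy of the dataset, so a lookup for an index that was never prefetched in that process raises a bare KeyError. Setting `--num-workers 0` or re-binarizing with `--dataset-impl mmap` are common workarounds. The `LazyCache` class here is a hypothetical stand-in for the pattern, not fairseq's actual code:

```python
# Hypothetical stand-in for a prefetch-based lazy dataset cache (not fairseq code):
# items are only readable after prefetch(), and a worker process that never ran
# prefetch() sees an empty (or stale) cache, producing exactly this KeyError shape.
class LazyCache:
    def __init__(self, data):
        self.data = data
        self.cache_index = {}

    def prefetch(self, indices):
        for i in indices:
            self.cache_index[i] = self.data[i]

    def __getitem__(self, i):
        # Raises KeyError when i was never prefetched in *this* process.
        return self.cache_index[i]

ds = LazyCache(['a', 'b', 'c'])
ds.prefetch([0, 1])
print(ds[0])  # prefetched, returns 'a'
try:
    ds[2]     # never prefetched: the KeyError seen in the worker
except KeyError:
    print('KeyError, as in DataLoader worker process 0')
```

This is only a hypothesis about the failure mode; the fix that matches the repo's scripts is to check which `--dataset-impl` the data was binarized with.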
② In the one-to-many decoding experiment, I tried to make the model select the constraint candidate as described in your paper, which I quote here:
the model runs another decoder forward pass and selects the constraint with the highest length-averaged log-probability as the target constraint.
Does it mean the following?
My implementation is simplified as follows, but I got a CSR score lower than decoding without constraints. I would really appreciate your help with locating my mistake.
# lprobs is the original probability distribution of current step
cur_max_prob = lprobs[idx, :].max().clone()
# tgt_p_toks is a list of current target constraint candidate tokens
tmp_cons_prob = cur_max_prob * len(tgt_p_toks)
# append a candidate target constraint after the original hypothesis
# run another decoder forward pass after the new hypothesis
# and I got cur_lprobs, the new probability distribution
cur_score = (cur_lprobs.max().clone() + tmp_cons_prob + scores.view(bsz*beam_size, -1)[idx, step-1]) / int(step + len(tgt_p_toks) + 1)
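For comparison, here is a minimal sketch of how "length-averaged log-probability" is usually computed: gather the log-prob assigned to each constraint token at that token's id from the extra forward pass, then sum and divide by the candidate length. Note that the snippet above instead multiplies the current step's max probability (`cur_max_prob`) by the constraint length, which scores the vocabulary argmax rather than the constraint tokens themselves; that is one place the mismatch could come from. All names here (`step_lprobs`, `cons_token_ids`) are illustrative, not from the paper's code:

```python
def length_avg_logprob(step_lprobs, cons_token_ids):
    """Length-averaged log-probability of one constraint candidate.

    step_lprobs: one row of vocabulary log-probs per constraint token position,
    taken from the extra decoder forward pass over that candidate.
    cons_token_ids: the candidate constraint's token ids.
    """
    # Gather the log-prob of each constraint token itself, not the row max.
    total = sum(row[tok] for row, tok in zip(step_lprobs, cons_token_ids))
    return total / len(cons_token_ids)

def pick_constraint(candidates):
    """candidates: list of (step_lprobs, cons_token_ids) pairs.
    Returns the index of the candidate with the highest length-averaged score."""
    scores = [length_avg_logprob(lp, ids) for lp, ids in candidates]
    return max(range(len(scores)), key=scores.__getitem__)
```

This is a sketch of the scoring rule only; wiring it into beam search (batching the extra forward pass, restoring decoder state) is left out.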
Sorry to bother you, but I'm stuck using extract_phrase.py to extract the constraints.
How can I use this script to extract constraints from my own dataset? It always seems like some files are missing. What other files do I need to provide besides the source and target BPE text, according to the arguments in extract_phrase.py?
It has stumped me for a long time; I would really appreciate any details on how to use it.
Hi, I am using your repository to train some experiments. I appreciate the documentation.
I had a question about this line in the scripts to extract constraints:
cdalign/scripts/extract_phrase.py
Line 184 in 74389f7
I wanted to confirm the behavior. This means the loop runs with values word_num = {0, 1, 2}. But does it make sense for max_src_len=0 in the call to phrase_extraction()? That is, there's no way to extract a phrase of length 0. I checked the value of cons_dicts[0] right before writing to file, and found that it is None.
So is this unintended behavior, and should the loop actually be for word_num in range(1, 3+1)? Or am I misunderstanding?
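For reference, the two loop bounds in question differ like this (a trivial check, independent of the repo's code):

```python
# range(3) covers phrase lengths 0, 1, 2 — the word_num = 0 iteration can never
# yield a phrase, which matches cons_dicts[0] ending up as None.
assert list(range(3)) == [0, 1, 2]
# Starting from 1 skips the empty-phrase case:
assert list(range(1, 3 + 1)) == [1, 2, 3]
```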
Hi @ghchen18 ,
Thanks for the code. Your work is very interesting.
I am trying to load a translation model trained on fairseq v0.10.0, which, as expected, gives errors since your paper's models were trained with fairseq v0.9.
Is there any way to load the v0.10 model (since that version has been available since Nov 2020)?
Thanks,