thunlp-mt / mask-align

Code for our paper "Mask-Align: Self-Supervised Neural Word Alignment" in ACL 2021

License: BSD 3-Clause "New" or "Revised" License

Python 98.90% Shell 1.10%

machine-translation self-supervised-learning word-alignment

mask-align's Issues

Training issue

An error occurs during training when I use my own training data, but training does not stop. Do you know what is going on?

Issue with namespace using train.sh

Hi,

I'm trying to run the training script with Python 3.8.10 and torch==1.10.2+cu113, and I obtain the following error:

>> bash thualign/bin/train.sh -s mask_align -e agree_deen
running mask_align
Traceback (most recent call last):
  File "/net/aistaff/sarti/Mask-Align/thualign/bin/trainer.py", line 21, in <module>
    import thualign.data as data
  File "/net/aistaff/sarti/Mask-Align/thualign/data/__init__.py", line 5, in <module>
    from thualign.data.dataset import Dataset, TextLineDataset
  File "/net/aistaff/sarti/Mask-Align/thualign/data/dataset.py", line 51, in <module>
    class Dataset(IterableDataset):
  File "/net/aistaff/sarti/Mask-Align/venv/lib/python3.8/site-packages/torch/utils/data/_typing.py", line 273, in __new__
    return super().__new__(cls, name, bases, namespace, **kwargs)  # type: ignore[call-overload]
  File "/usr/lib/python3.8/abc.py", line 85, in __new__
    cls = super().__new__(mcls, name, bases, namespace, **kwargs)
  File "/net/aistaff/sarti/Mask-Align/venv/lib/python3.8/site-packages/torch/utils/data/_typing.py", line 373, in _dp_init_subclass
    raise TypeError("Expected 'Iterator' as the return annotation for `__iter__` of {}"
TypeError: Expected 'Iterator' as the return annotation for `__iter__` of Dataset, but found thualign.data.iterator.Iterator

Do you have a specific pinned version of torch to make the script work?
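
The error message above suggests a name collision: the return annotation on `Dataset.__iter__` resolves to a project-local class called `Iterator` rather than to `typing.Iterator`. The sketch below reproduces that situation with the standard library only. All class names and the `passes_iterator_check` helper are hypothetical, and the check is a simplified imitation of what the traceback implies, not torch's actual code.

```python
import collections.abc
import typing


class Iterator:
    """Hypothetical stand-in for a project-local class that happens to be named Iterator."""


class LocalAnnotated:
    # The annotation below resolves to the local class, not typing.Iterator.
    def __iter__(self) -> Iterator:
        return iter(())


class TypingAnnotated:
    # Here the annotation is the generic alias from the typing module.
    def __iter__(self) -> typing.Iterator[int]:
        return iter(())


class Unannotated:
    # No return annotation at all.
    def __iter__(self):
        return iter(())


def passes_iterator_check(cls) -> bool:
    """Simplified imitation of an annotation check; NOT torch's actual code."""
    hint = typing.get_type_hints(cls.__iter__).get("return")
    if hint is None:
        return True  # nothing to check without a return annotation
    # typing.Iterator and typing.Iterator[...] both have this origin;
    # a plain class such as the local Iterator has origin None.
    return typing.get_origin(hint) is collections.abc.Iterator


for cls in (LocalAnnotated, TypingAnnotated, Unannotated):
    print(cls.__name__, passes_iterator_check(cls))
```

If the same reasoning applies to the repository, importing the local class under a different name or dropping the return annotation on `Dataset.__iter__` might avoid the error. This has not been verified against the code.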

How to generate token-level alignments?

I can use your command to generate BPE-level alignments, but how do I generate token-level (word-level) alignments?
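
One common way to go from BPE-level links to word-level links is to map every subword position back to the word it came from and then deduplicate the links. The sketch below assumes the subword-nmt convention in which a non-final subword ends in `@@`, 0-based indices, and that two words count as aligned if any of their subwords are aligned. The function names are hypothetical and this is not necessarily how the repository does it.

```python
def bpe_to_word_index(bpe_tokens, continuation="@@"):
    """Map each subword position to the index of the word it belongs to."""
    mapping = []
    word_idx = 0
    for tok in bpe_tokens:
        mapping.append(word_idx)
        # A token without the continuation marker closes the current word.
        if not tok.endswith(continuation):
            word_idx += 1
    return mapping


def merge_alignment(bpe_links, src_bpe, tgt_bpe):
    """Convert (src_subword, tgt_subword) links to sorted, unique word-level links."""
    src_map = bpe_to_word_index(src_bpe)
    tgt_map = bpe_to_word_index(tgt_bpe)
    return sorted({(src_map[i], tgt_map[j]) for i, j in bpe_links})


# Made-up example: two words on each side, the first split into two subwords.
src = ["new@@", "er", "house"]
tgt = ["neu@@", "eres", "Haus"]
links = [(0, 0), (1, 1), (2, 2)]
print(merge_alignment(links, src, tgt))
```

Stricter merging rules (for example, requiring a majority of subword links) are possible; the "any subword" rule is just the simplest choice.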

CUDA issues

Which environment was used for this project?
I ran it with the default config on 4 V100 GPUs, using CUDA 11.0 and torch 1.7.1, and some errors occurred.

> bash thualign/bin/train.sh -s thualign/configs/user/example.config

2022-03-22 17:53:44.193 -- Process 3 terminated with the following error:
2022-03-22 17:53:44.193 Traceback (most recent call last):
2022-03-22 17:53:44.193 File "/root/anaconda3/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
2022-03-22 17:53:44.193 fn(i, *args)
2022-03-22 17:53:44.193 File "/apdcephfs/private_yipinggao/Mask-Align-main/thualign/bin/trainer.py", line 367, in process_fn
2022-03-22 17:53:44.193 main(local_args)
2022-03-22 17:53:44.193 File "/apdcephfs/private_yipinggao/Mask-Align-main/thualign/bin/trainer.py", line 325, in main
2022-03-22 17:53:44.193 loss, log_info = model(features)
2022-03-22 17:53:44.193 File "/root/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
2022-03-22 17:53:44.193 result = self.forward(*input, **kwargs)
2022-03-22 17:53:44.193 File "/jizhi/jizhi2/worker/trainer/thualign/models/agreement_wrapper.py", line 62, in forward
2022-03-22 17:53:44.193 b_loss, b_log_output = self.b_model.cal_loss(b_logits, inverse_features["target"], inverse_features["target_mask"])
2022-03-22 17:53:44.193 File "/jizhi/jizhi2/worker/trainer/thualign/models/mask_align.py", line 182, in cal_loss
2022-03-22 17:53:44.193 loss = self.criterion(net_output, labels)
2022-03-22 17:53:44.193 File "/root/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
2022-03-22 17:53:44.193 result = self.forward(*input, **kwargs)
2022-03-22 17:53:44.193 File "/jizhi/jizhi2/worker/trainer/thualign/modules/losses.py", line 40, in forward
2022-03-22 17:53:44.193 sum_probs = torch.sum(log_probs.to(torch.float32), dim=-1)
2022-03-22 17:53:44.193 RuntimeError: CUDA out of memory. Tried to allocate 13.96 GiB (GPU 3; 31.75 GiB total capacity; 22.23 GiB already allocated; 7.34 GiB free; 23.07 GiB reserved in total by PyTorch)

Later I reduced the batch size, but the process was terminated in the middle of training.

2022-03-22 20:22:44.202 Traceback (most recent call last):
2022-03-22 20:22:44.202 File "/apdcephfs/private_yipinggao/Mask-Align-main/thualign/bin/trainer.py", line 395, in <module>
2022-03-22 20:22:44.202 cli_main()
2022-03-22 20:22:44.202 File "/apdcephfs/private_yipinggao/Mask-Align-main/thualign/bin/trainer.py", line 388, in cli_main
2022-03-22 20:22:44.202 torch.multiprocessing.spawn(process_fn, args=(parsed_args,),
2022-03-22 20:22:44.202 File "/root/anaconda3/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
2022-03-22 20:22:44.202 return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
2022-03-22 20:22:44.202 File "/root/anaconda3/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
2022-03-22 20:22:44.202 while not context.join():
2022-03-22 20:22:44.202 File "/root/anaconda3/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 105, in join
2022-03-22 20:22:44.202 raise Exception(
2022-03-22 20:22:44.202 Exception: process 0 terminated with signal SIGSEGV

Chinese-to-English result issue

I am having trouble reproducing the Chinese-to-English result in the paper (13.8%). The best result I have obtained is 14.6%.
I use the LDC dataset, apply BPE to train.ch and train.en, then shuffle the data and remove single-word sentences.
I use the Chinese-English evaluation set for test and dev; I preprocess the Chinese side with BPE and leave the English side unchanged.
I use the example config, modifying only the paths and setting the batch size to 5000 and the update cycle to 2.
I would be grateful if you could help me reproduce the result.

Question about BPE

Did you apply BPE to your training data? What do you mean by "We used a joint source and target Byte Pair Encoding (BPE) (Sennrich et al., 2016) with 40k merge operations." in Section 3.1 of your paper?

Question about BPE and inference OOM

I'm confused about when to use BPE:
The preprocessing instructions say that, for the validation and test sets, the target side does not need BPE and the source side does.
But in example.config, both the source and target sides of the test set have BPE applied.
I would like to know which files (train, valid, test; src or tgt) should have BPE applied and which should not.

Another question: when I run inference on a subset of the training data with batch_size set to 1, the GPU always runs out of memory. But when I train the model on the full training data with batch_size set to 4000, the GPU does not run out of memory. How can I solve this, or can I run inference on CPU only?

Thanks for any helpful suggestions.

Issue reproducing Ro-En

I got the Ro-En data from 'https://github.com/lilt/alignment-scripts/tree/master/preprocess'.
The original training set was split into a training set and a validation set.
Then I learned a joint BPE with 40k merges, shuffled the training set, and removed sentences of length 1 from it.
I used a batch size of 36K tokens and the settings from your paper.
I got the same results as the paper on Ch-En, En-De, and En-Fr, but I got 20.4 on Ro-En (19.5 in your paper).
Is there anything wrong?

Many thanks for your reply.
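
For reference, the shuffle-and-clean step described above can be sketched as follows. This assumes that "length of 1" means a sentence with a single whitespace-separated token on either side, and that source and target lines must be shuffled and filtered together so the pairs stay aligned. The function names and the seed are hypothetical.

```python
import random


def shuffle_pairs(src_lines, tgt_lines, seed=1234):
    """Shuffle source and target lines with the same permutation."""
    pairs = list(zip(src_lines, tgt_lines))
    random.Random(seed).shuffle(pairs)
    return [s for s, _ in pairs], [t for _, t in pairs]


def drop_short_pairs(src_lines, tgt_lines, min_len=2):
    """Keep only pairs where both sides have at least min_len tokens."""
    kept = [
        (s, t)
        for s, t in zip(src_lines, tgt_lines)
        if len(s.split()) >= min_len and len(t.split()) >= min_len
    ]
    return [s for s, _ in kept], [t for _, t in kept]


# Made-up toy corpus: the second pair has a one-token source sentence.
src = ["o casa mare", "da", "un om"]
tgt = ["a big house", "yes indeed", "a man"]
src, tgt = shuffle_pairs(src, tgt)
src, tgt = drop_short_pairs(src, tgt)
print(list(zip(src, tgt)))
```

Differences in this step (for example, filtering before or after BPE, or filtering only one side) change the training set slightly and could be one source of small score differences.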

Result question

What does "9467" mean in the final test result: alignment-soft.txt: 14.4% (87.7%/83.5%/9467)?
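
For context, the standard alignment error rate (AER) is computed from a hypothesis link set A, sure gold links S, and possible gold links P (with S contained in P): precision = |A∩P|/|A|, recall = |A∩S|/|S|, and AER = 1 − (|A∩S| + |A∩P|)/(|A| + |S|). The sketch below implements these textbook formulas. Reading the printed line as "AER (precision/recall/...)" is an assumption, and the sketch does not settle what the trailing integer stands for. The function name and the toy data are hypothetical.

```python
def alignment_scores(hypothesis, sure, possible):
    """Return (aer, precision, recall) for sets of (src_idx, tgt_idx) links."""
    possible = possible | sure  # sure links always count as possible
    a_s = len(hypothesis & sure)
    a_p = len(hypothesis & possible)
    precision = a_p / len(hypothesis)
    recall = a_s / len(sure)
    aer = 1.0 - (a_s + a_p) / (len(hypothesis) + len(sure))
    return aer, precision, recall


# Made-up example: three predicted links, two of them correct.
hyp = {(0, 0), (1, 1), (2, 3)}
sure = {(0, 0), (1, 1)}
possible = {(2, 2)}
aer, precision, recall = alignment_scores(hyp, sure, possible)
print(f"{aer:.1%} ({precision:.1%}/{recall:.1%})")
```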

Question about the analysis in your paper

I noticed the "Predict and Alignment" part of your paper.
You divided tokens into four categories: cPcA, wPcA, cPwA, and wPwA.
Can you explain how they are calculated?

Visualization issue

I set eval_plot = True in my config file (example.config), but there are no images in TensorBoard after training finishes.

TypeError: Can't instantiate abstract class MapDataset with abstract methods _inputs, set_inputs

When I run bash thualign/bin/train.sh -s thualign/configs/user/example.config, I get this error message:

Traceback (most recent call last):
  File "/apdcephfs/share_47076/lisalai/code/WordAlignment/Mask-Align/thualign/bin/trainer.py", line 390, in <module>
    cli_main()
  File "/apdcephfs/share_47076/lisalai/code/WordAlignment/Mask-Align/thualign/bin/trainer.py", line 384, in cli_main
    nprocs=world_size)
  File "/apdcephfs/share_47076/lisalai/anaconda3/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/apdcephfs/share_47076/lisalai/anaconda3/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/apdcephfs/share_47076/lisalai/anaconda3/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 2 terminated with the following error:
Traceback (most recent call last):
  File "/apdcephfs/share_47076/lisalai/anaconda3/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/apdcephfs/share_47076/lisalai/code/WordAlignment/Mask-Align/thualign/bin/trainer.py", line 363, in process_fn
    main(local_args)
  File "/apdcephfs/share_47076/lisalai/code/WordAlignment/Mask-Align/thualign/bin/trainer.py", line 280, in main
    dataset = data.AlignmentPipeline.get_train_dataset(params.train_input, params)
  File "/apdcephfs/share_47076/lisalai/code/WordAlignment/Mask-Align/thualign/data/pipeline.py", line 351, in get_train_dataset
    dataset = dataset.map(map_obj)
  File "/apdcephfs/share_47076/lisalai/code/WordAlignment/Mask-Align/thualign/data/dataset.py", line 82, in map
    return MapDataset(self, fn)
  File "/apdcephfs/share_47076/lisalai/anaconda3/lib/python3.6/typing.py", line 1223, in __new__
    return _generic_new(cls.__next_in_mro__, cls, *args, **kwds)
  File "/apdcephfs/share_47076/lisalai/anaconda3/lib/python3.6/typing.py", line 1184, in _generic_new
    return base_cls.__new__(cls)
TypeError: Can't instantiate abstract class MapDataset with abstract methods _inputs, set_inputs

My Python version is 3.6 and my torch version is 1.8.1.
Do you have any solution? Thanks!

Torch issues

Which environment was used for this project? I used CUDA 11.0 and torch 1.7.1, and some errors occurred.

2022-03-22 18:49:37.663 -- Process 2 terminated with the following error:
2022-03-22 18:49:37.663 Traceback (most recent call last):
2022-03-22 18:49:37.663 File "/root/anaconda3/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
2022-03-22 18:49:37.663 fn(i, *args)
2022-03-22 18:49:37.663 File "/apdcephfs/private_yipinggao/Mask-Align-main/thualign/bin/trainer.py", line 367, in process_fn
2022-03-22 18:49:37.663 main(local_args)
2022-03-22 18:49:37.663 File "/apdcephfs/private_yipinggao/Mask-Align-main/thualign/bin/trainer.py", line 325, in main
2022-03-22 18:49:37.663 loss, log_info = model(features)
2022-03-22 18:49:37.663 File "/root/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
2022-03-22 18:49:37.663 result = self.forward(*input, **kwargs)
2022-03-22 18:49:37.663 File "/jizhi/jizhi2/worker/trainer/thualign/models/agreement_wrapper.py", line 62, in forward
2022-03-22 18:49:37.663 b_loss, b_log_output = self.b_model.cal_loss(b_logits, inverse_features["target"], inverse_features["target_mask"])
2022-03-22 18:49:37.663 File "/jizhi/jizhi2/worker/trainer/thualign/models/mask_align.py", line 182, in cal_loss
2022-03-22 18:49:37.663 loss = self.criterion(net_output, labels)
2022-03-22 18:49:37.663 File "/root/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
2022-03-22 18:49:37.663 result = self.forward(*input, **kwargs)
2022-03-22 18:49:37.663 File "/jizhi/jizhi2/worker/trainer/thualign/modules/losses.py", line 26, in forward
2022-03-22 18:49:37.663 loss = log_probs[batch_idx, labels]
2022-03-22 18:49:37.663 IndexError: tensors used as indices must be long, byte or bool tensors

2022-03-22 16:27:17.919 -- Process 2 terminated with the following error:
2022-03-22 16:27:17.919 Traceback (most recent call last):
2022-03-22 16:27:17.919 File "/root/anaconda3/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
2022-03-22 16:27:17.919 fn(i, *args)
2022-03-22 16:27:17.919 File "/apdcephfs/private_yipinggao/Mask-Align-main/thualign/bin/trainer.py", line 367, in process_fn
2022-03-22 16:27:17.919 main(local_args)
2022-03-22 16:27:17.919 File "/apdcephfs/private_yipinggao/Mask-Align-main/thualign/bin/trainer.py", line 325, in main
2022-03-22 16:27:17.919 loss, log_info = model(features)
2022-03-22 16:27:17.919 File "/root/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
2022-03-22 16:27:17.919 result = self.forward(*input, **kwargs)
2022-03-22 16:27:17.919 File "/jizhi/jizhi2/worker/trainer/thualign/models/agreement_wrapper.py", line 53, in forward
2022-03-22 16:27:17.919 f_state = self.f_model.encode(features, f_state)
2022-03-22 16:27:17.919 File "/jizhi/jizhi2/worker/trainer/thualign/models/mask_align.py", line 108, in encode
2022-03-22 16:27:17.919 inputs = torch.nn.functional.embedding(src_seq, self.src_embedding)
2022-03-22 16:27:17.919 File "/root/anaconda3/lib/python3.8/site-packages/torch/nn/functional.py", line 1852, in embedding
2022-03-22 16:27:17.919 return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
2022-03-22 16:27:17.919 RuntimeError: Expected tensor for argument #1 'indices' to have scalar type Long; but got torch.cuda.HalfTensor instead (while checking arguments for embedding)

I fixed both errors by adding .long(), and it works.

Model does not converge

I am trying to train a mask_align model with the default config in the repo (changing only the data paths) on DE-EN training data from https://github.com/lilt/alignment-scripts. At some training steps the losses are NaN, and by the end of training the loss has increased from about 7 to about 70.

epoch = 5, step = 49980, loss: nan, f_loss: nan, b_loss: nan, agree_loss: nan, entropy_loss: nan (0.246 sec)
epoch = 5, step = 49990, loss: 64.210, f_loss: 67.750, b_loss: 60.188, agree_loss: 0.000, entropy_loss: 0.241 (0.507 sec)
epoch = 5, step = 50000, loss: 69.115, f_loss: 72.500, b_loss: 65.312, agree_loss: 0.000, entropy_loss: 0.240 (0.652 sec)