thunlp-mt / mask-align
Code for our paper "Mask-Align: Self-Supervised Neural Word Alignment" in ACL 2021
License: BSD 3-Clause "New" or "Revised" License
Hi,
I'm trying to run the training script with Python 3.8.10 and torch==1.10.2+cu113, and I get the following error:
>> bash thualign/bin/train.sh -s mask_align -e agree_deen
running mask_align
Traceback (most recent call last):
File "/net/aistaff/sarti/Mask-Align/thualign/bin/trainer.py", line 21, in <module>
import thualign.data as data
File "/net/aistaff/sarti/Mask-Align/thualign/data/__init__.py", line 5, in <module>
from thualign.data.dataset import Dataset, TextLineDataset
File "/net/aistaff/sarti/Mask-Align/thualign/data/dataset.py", line 51, in <module>
class Dataset(IterableDataset):
File "/net/aistaff/sarti/Mask-Align/venv/lib/python3.8/site-packages/torch/utils/data/_typing.py", line 273, in __new__
return super().__new__(cls, name, bases, namespace, **kwargs) # type: ignore[call-overload]
File "/usr/lib/python3.8/abc.py", line 85, in __new__
cls = super().__new__(mcls, name, bases, namespace, **kwargs)
File "/net/aistaff/sarti/Mask-Align/venv/lib/python3.8/site-packages/torch/utils/data/_typing.py", line 373, in _dp_init_subclass
raise TypeError("Expected 'Iterator' as the return annotation for `__iter__` of {}"
TypeError: Expected 'Iterator' as the return annotation for `__iter__` of Dataset, but found thualign.data.iterator.Iterator
Do you have a specific pinned version of torch to make the script work?
I can use your command to generate BPE-level alignments. But how do I generate token-level (word-level) alignments?
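One common way to get from subword-level to word-level links can be sketched as follows. This is a minimal illustration, not the repository's own conversion: it assumes subword-nmt style BPE where a trailing `@@` marks a token that continues into the next one, it assumes links are 0-based `(source, target)` index pairs, and it assumes two words count as aligned whenever any of their subwords are aligned. The function names `subword_to_word_map` and `bpe_to_word_alignment` are hypothetical.

```python
from typing import List, Set, Tuple

Link = Tuple[int, int]


def subword_to_word_map(bpe_tokens: List[str], marker: str = "@@") -> List[int]:
    """Map each subword position to the index of the word it belongs to.

    Assumes a token ending with `marker` continues into the next token.
    """
    mapping = []
    word_idx = 0
    for tok in bpe_tokens:
        mapping.append(word_idx)
        # Only advance to the next word when this subword closes the word.
        if not tok.endswith(marker):
            word_idx += 1
    return mapping


def bpe_to_word_alignment(
    src_bpe: List[str], tgt_bpe: List[str], links: Set[Link]
) -> Set[Link]:
    """Project subword-level links onto word-level links.

    Assumption: a word pair is linked if any of its subword pairs is linked.
    """
    src_map = subword_to_word_map(src_bpe)
    tgt_map = subword_to_word_map(tgt_bpe)
    return {(src_map[i], tgt_map[j]) for i, j in links}


src = "das ist ein Bei@@ spiel".split()
tgt = "this is an ex@@ ample".split()
bpe_links = {(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)}
word_links = bpe_to_word_alignment(src, tgt, bpe_links)
print(sorted(word_links))
```

Because the result is a set, the two subword links for "Bei@@ spiel" / "ex@@ ample" collapse into a single word-level link.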
Which environment was used for this project?
I ran it with the default config on 4 V100 GPUs, using CUDA 11.0 and torch 1.7.1, and several errors occurred.
> bash thualign/bin/train.sh -s thualign/configs/user/example.config
2022-03-22 17:53:44.193 -- Process 3 terminated with the following error:
2022-03-22 17:53:44.193 Traceback (most recent call last):
2022-03-22 17:53:44.193 File "/root/anaconda3/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
2022-03-22 17:53:44.193 fn(i, *args)
2022-03-22 17:53:44.193 File "/apdcephfs/private_yipinggao/Mask-Align-main/thualign/bin/trainer.py", line 367, in process_fn
2022-03-22 17:53:44.193 main(local_args)
2022-03-22 17:53:44.193 File "/apdcephfs/private_yipinggao/Mask-Align-main/thualign/bin/trainer.py", line 325, in main
2022-03-22 17:53:44.193 loss, log_info = model(features)
2022-03-22 17:53:44.193 File "/root/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
2022-03-22 17:53:44.193 result = self.forward(*input, **kwargs)
2022-03-22 17:53:44.193 File "/jizhi/jizhi2/worker/trainer/thualign/models/agreement_wrapper.py", line 62, in forward
2022-03-22 17:53:44.193 b_loss, b_log_output = self.b_model.cal_loss(b_logits, inverse_features["target"], inverse_features["target_mask"])
2022-03-22 17:53:44.193 File "/jizhi/jizhi2/worker/trainer/thualign/models/mask_align.py", line 182, in cal_loss
2022-03-22 17:53:44.193 loss = self.criterion(net_output, labels)
2022-03-22 17:53:44.193 File "/root/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
2022-03-22 17:53:44.193 result = self.forward(*input, **kwargs)
2022-03-22 17:53:44.193 File "/jizhi/jizhi2/worker/trainer/thualign/modules/losses.py", line 40, in forward
2022-03-22 17:53:44.193 sum_probs = torch.sum(log_probs.to(torch.float32), dim=-1)
2022-03-22 17:53:44.193 RuntimeError: CUDA out of memory. Tried to allocate 13.96 GiB (GPU 3; 31.75 GiB total capacity; 22.23 GiB already allocated; 7.34 GiB free; 23.07 GiB reserved in total by PyTorch)
Later I reduced the batch size, but the process was terminated in the middle of training:
2022-03-22 20:22:44.202 Traceback (most recent call last):
2022-03-22 20:22:44.202 File "/apdcephfs/private_yipinggao/Mask-Align-main/thualign/bin/trainer.py", line 395, in <module>
2022-03-22 20:22:44.202 cli_main()
2022-03-22 20:22:44.202 File "/apdcephfs/private_yipinggao/Mask-Align-main/thualign/bin/trainer.py", line 388, in cli_main
2022-03-22 20:22:44.202 torch.multiprocessing.spawn(process_fn, args=(parsed_args,),
2022-03-22 20:22:44.202 File "/root/anaconda3/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
2022-03-22 20:22:44.202 return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
2022-03-22 20:22:44.202 File "/root/anaconda3/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
2022-03-22 20:22:44.202 while not context.join():
2022-03-22 20:22:44.202 File "/root/anaconda3/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 105, in join
2022-03-22 20:22:44.202 raise Exception(
2022-03-22 20:22:44.202 Exception: process 0 terminated with signal SIGSEGV
I am having trouble reproducing the Chinese-to-English result from the paper (13.8%). The best result I have obtained is 14.6%.
I use the LDC dataset: I apply BPE to train.ch and train.en, shuffle the data, and then remove single-word sentences.
For the Chinese-English dev and test sets, I preprocess the Chinese side with BPE and leave the English side unchanged.
I use the example config, changing only the paths, the batch size (set to 5000), and the update cycle (set to 2).
I would be grateful if you could help me reproduce the result.
Did you apply BPE to your training data? What do you mean by "We used a joint source and target Byte Pair Encoding (BPE) (Sennrich et al., 2016) with 40k merge operations." in Section 3.1 of your paper?
I'm confused about when to use BPE:
The preprocessing instructions say that, for the valid and test sets, the target side does not need BPE and the source side does.
But in example.config, both the source and target sides of the test set have BPE applied.
I would like to know which files (train, valid, test; src or tgt) should have BPE applied and which should not.
Another question: when I run inference on a subset of the training data with batch_size set to 1, the GPU always runs out of memory. But when I train the model on the full training data with batch_size 4000, the GPU does not run out of memory. How can I solve this, or run inference on CPU only?
Thanks for any helpful suggestions.
I got the Ro-En data from 'https://github.com/lilt/alignment-scripts/tree/master/preprocess'.
The original training set was split into a training set and a validation set.
Then I learned a joint BPE with 40k merges, shuffled the training set, and removed sentences of length 1 from it.
I used a batch size of 36K tokens and the settings from your paper.
I reproduced your results on Ch-En, En-De, and En-Fr, but I got 20.4 on Ro-En (19.5 in your paper).
Is there anything wrong?
Many thanks for your reply.
What does "9467" mean in the final test result: alignment-soft.txt: 14.4% (87.7%/83.5%/9467)?
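For orientation, the standard alignment metrics can be sketched as below. Reading the line as `AER (precision/recall/number of predicted links)` is an assumption based on how the percentages relate to each other, not something the maintainers confirm here. The formulas themselves are the usual ones: with predicted links A, sure links S, and possible links P (where S is a subset of P), precision is |A∩P|/|A|, recall is |A∩S|/|S|, and AER is 1 − (|A∩S| + |A∩P|)/(|A| + |S|). The function name `alignment_scores` and the toy data are hypothetical.

```python
from typing import Set, Tuple

Link = Tuple[int, int]


def alignment_scores(
    hyp: Set[Link], sure: Set[Link], possible: Set[Link]
) -> Tuple[float, float, float, int]:
    """Return (AER, precision, recall, number of predicted links).

    Assumes `hyp` and `sure` are non-empty; no zero-division guard is included.
    """
    possible = possible | sure  # every sure link also counts as possible
    hits_sure = len(hyp & sure)
    hits_possible = len(hyp & possible)
    precision = hits_possible / len(hyp)
    recall = hits_sure / len(sure)
    aer = 1.0 - (hits_sure + hits_possible) / (len(hyp) + len(sure))
    return aer, precision, recall, len(hyp)


# Toy example: three predicted links, three sure links, one extra possible link.
hyp = {(0, 0), (1, 1), (2, 3)}
sure = {(0, 0), (1, 1), (2, 2)}
possible = {(2, 3)}
aer, precision, recall, n_links = alignment_scores(hyp, sure, possible)
print(f"{aer:.1%} ({precision:.1%}/{recall:.1%}/{n_links})")
```

Under this reading, a value such as 9467 would be a count of links, not a percentage, which is why it appears without a `%` sign.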
I noticed the "Predict and Alignment" part of your paper.
You divided tokens into four categories: cPcA, wPcA, cPwA, and wPwA.
Can you explain how they are calculated?
When I run bash thualign/bin/train.sh -s thualign/configs/user/example.config
I get this error message:
Traceback (most recent call last):
File "/apdcephfs/share_47076/lisalai/code/WordAlignment/Mask-Align/thualign/bin/trainer.py", line 390, in <module>
cli_main()
File "/apdcephfs/share_47076/lisalai/code/WordAlignment/Mask-Align/thualign/bin/trainer.py", line 384, in cli_main
nprocs=world_size)
File "/apdcephfs/share_47076/lisalai/anaconda3/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/apdcephfs/share_47076/lisalai/anaconda3/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
while not context.join():
File "/apdcephfs/share_47076/lisalai/anaconda3/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 150, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 2 terminated with the following error:
Traceback (most recent call last):
File "/apdcephfs/share_47076/lisalai/anaconda3/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "/apdcephfs/share_47076/lisalai/code/WordAlignment/Mask-Align/thualign/bin/trainer.py", line 363, in process_fn
main(local_args)
File "/apdcephfs/share_47076/lisalai/code/WordAlignment/Mask-Align/thualign/bin/trainer.py", line 280, in main
dataset = data.AlignmentPipeline.get_train_dataset(params.train_input, params)
File "/apdcephfs/share_47076/lisalai/code/WordAlignment/Mask-Align/thualign/data/pipeline.py", line 351, in get_train_dataset
dataset = dataset.map(map_obj)
File "/apdcephfs/share_47076/lisalai/code/WordAlignment/Mask-Align/thualign/data/dataset.py", line 82, in map
return MapDataset(self, fn)
File "/apdcephfs/share_47076/lisalai/anaconda3/lib/python3.6/typing.py", line 1223, in __new__
return _generic_new(cls.__next_in_mro__, cls, *args, **kwds)
File "/apdcephfs/share_47076/lisalai/anaconda3/lib/python3.6/typing.py", line 1184, in _generic_new
return base_cls.__new__(cls)
TypeError: Can't instantiate abstract class MapDataset with abstract methods _inputs, set_inputs
My Python version is 3.6 and my torch version is 1.8.1.
Do you have any solution? Thanks!
Which environment was used for this project? I used CUDA 11.0 and torch 1.7.1, and several errors occurred.
2022-03-22 18:49:37.663 -- Process 2 terminated with the following error:
2022-03-22 18:49:37.663 Traceback (most recent call last):
2022-03-22 18:49:37.663 File "/root/anaconda3/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
2022-03-22 18:49:37.663 fn(i, *args)
2022-03-22 18:49:37.663 File "/apdcephfs/private_yipinggao/Mask-Align-main/thualign/bin/trainer.py", line 367, in process_fn
2022-03-22 18:49:37.663 main(local_args)
2022-03-22 18:49:37.663 File "/apdcephfs/private_yipinggao/Mask-Align-main/thualign/bin/trainer.py", line 325, in main
2022-03-22 18:49:37.663 loss, log_info = model(features)
2022-03-22 18:49:37.663 File "/root/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
2022-03-22 18:49:37.663 result = self.forward(*input, **kwargs)
2022-03-22 18:49:37.663 File "/jizhi/jizhi2/worker/trainer/thualign/models/agreement_wrapper.py", line 62, in forward
2022-03-22 18:49:37.663 b_loss, b_log_output = self.b_model.cal_loss(b_logits, inverse_features["target"], inverse_features["target_mask"])
2022-03-22 18:49:37.663 File "/jizhi/jizhi2/worker/trainer/thualign/models/mask_align.py", line 182, in cal_loss
2022-03-22 18:49:37.663 loss = self.criterion(net_output, labels)
2022-03-22 18:49:37.663 File "/root/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
2022-03-22 18:49:37.663 result = self.forward(*input, **kwargs)
2022-03-22 18:49:37.663 File "/jizhi/jizhi2/worker/trainer/thualign/modules/losses.py", line 26, in forward
2022-03-22 18:49:37.663 loss = log_probs[batch_idx, labels]
2022-03-22 18:49:37.663 IndexError: tensors used as indices must be long, byte or bool tensors
2022-03-22 16:27:17.919 -- Process 2 terminated with the following error:
2022-03-22 16:27:17.919 Traceback (most recent call last):
2022-03-22 16:27:17.919 File "/root/anaconda3/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
2022-03-22 16:27:17.919 fn(i, *args)
2022-03-22 16:27:17.919 File "/apdcephfs/private_yipinggao/Mask-Align-main/thualign/bin/trainer.py", line 367, in process_fn
2022-03-22 16:27:17.919 main(local_args)
2022-03-22 16:27:17.919 File "/apdcephfs/private_yipinggao/Mask-Align-main/thualign/bin/trainer.py", line 325, in main
2022-03-22 16:27:17.919 loss, log_info = model(features)
2022-03-22 16:27:17.919 File "/root/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
2022-03-22 16:27:17.919 result = self.forward(*input, **kwargs)
2022-03-22 16:27:17.919 File "/jizhi/jizhi2/worker/trainer/thualign/models/agreement_wrapper.py", line 53, in forward
2022-03-22 16:27:17.919 f_state = self.f_model.encode(features, f_state)
2022-03-22 16:27:17.919 File "/jizhi/jizhi2/worker/trainer/thualign/models/mask_align.py", line 108, in encode
2022-03-22 16:27:17.919 inputs = torch.nn.functional.embedding(src_seq, self.src_embedding)
2022-03-22 16:27:17.919 File "/root/anaconda3/lib/python3.8/site-packages/torch/nn/functional.py", line 1852, in embedding
2022-03-22 16:27:17.919 return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
2022-03-22 16:27:17.919 RuntimeError: Expected tensor for argument #1 'indices' to have scalar type Long; but got torch.cuda.HalfTensor instead (while checking arguments for embedding)
I fixed both errors by adding .long() to the index tensors, and it works.
I tried to train a mask_align model with the default config in the repo (changing only the data paths) on DE-EN training data from https://github.com/lilt/alignment-scripts. At some training steps the losses are NaN, and by the end of training the loss has increased from about 7 to about 70.
epoch = 5, step = 49980, loss: nan, f_loss: nan, b_loss: nan, agree_loss: nan, entropy_loss: nan (0.246 sec)
epoch = 5, step = 49990, loss: 64.210, f_loss: 67.750, b_loss: 60.188, agree_loss: 0.000, entropy_loss: 0.241 (0.507 sec)
epoch = 5, step = 50000, loss: 69.115, f_loss: 72.500, b_loss: 65.312, agree_loss: 0.000, entropy_loss: 0.240 (0.652 sec)
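To see how often and where the loss becomes NaN, the training log can be scanned with a small script. The sketch below is only a diagnostic aid that assumes the log lines have exactly the shape shown above; the names `parse_log_line` and `nan_steps` are hypothetical and are not part of the repository.

```python
import math
import re
from typing import Dict, List, Optional

# Matches lines such as:
# "epoch = 5, step = 49980, loss: nan, f_loss: nan, ... (0.246 sec)"
LINE_RE = re.compile(r"epoch = (\d+), step = (\d+), (.*?) \(([\d.]+) sec\)")


def parse_log_line(line: str) -> Optional[Dict[str, float]]:
    """Parse one training log line into a dict, or return None if it does not match."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    record: Dict[str, float] = {
        "epoch": int(match.group(1)),
        "step": int(match.group(2)),
    }
    for field in match.group(3).split(", "):
        name, value = field.split(": ")
        record[name] = float(value)  # float("nan") yields NaN
    return record


def nan_steps(lines: List[str]) -> List[int]:
    """Return the step numbers whose total loss is NaN."""
    steps = []
    for line in lines:
        record = parse_log_line(line)
        if record is not None and math.isnan(record["loss"]):
            steps.append(int(record["step"]))
    return steps


log = [
    "epoch = 5, step = 49980, loss: nan, f_loss: nan, b_loss: nan, agree_loss: nan, entropy_loss: nan (0.246 sec)",
    "epoch = 5, step = 49990, loss: 64.210, f_loss: 67.750, b_loss: 60.188, agree_loss: 0.000, entropy_loss: 0.241 (0.507 sec)",
    "epoch = 5, step = 50000, loss: 69.115, f_loss: 72.500, b_loss: 65.312, agree_loss: 0.000, entropy_loss: 0.240 (0.652 sec)",
]
print(nan_steps(log))
```

Knowing the first step at which NaN appears makes it easier to tell whether the divergence starts early or only after the loss has already begun to climb.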