
pointer's People

Contributors

dreasysnail


pointer's Issues

Hi there, I have some questions and suggestions.

Thanks for your code!

  1. Q: The model suffers from error accumulation. I suggest adding a '[DEL]' token so the model learns when to delete which words, similar to the NAT method Levenshtein Transformer. The difference is that Levenshtein performs [where to insert / insert / delete] in separate steps, whereas yours could do all of it in one step. One step, however, brings instability: an insertion and a deletion can happen at the same place in the same step. In my tests, adding '[DEL]' decreased PPL by about 10 points.
  2. Q: Lack of knowledge. This showed up when I constrained the training data to metaphor or parallelism examples: compared to GPT-2, the output lacks strong internal logic and tends to generate words like "no", "can't", "doesn't", which can completely change the meaning of a sentence and leave different parts of it saying different things. I don't know how to fix this. Maybe something like knowledge-enhanced BERT? Or maybe this is a disadvantage of NAT methods, relative to AR models like GPT-2, that cannot be solved because of the unstable generation pattern?
  3. Your greedy_search inference code may be too slow when running inference on batched data. I suggest a torch-mask version (1. get the indices to mask; 2. use torch.scatter or Tensor.masked_fill to process the whole batch at once); a rough sketch follows below. In a few days I'll open a pull request; please review it.
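
To make point 3 concrete, here is a rough sketch of the batched masking idea. The function and variable names are mine for illustration, not identifiers from the POINTER repo:

```python
import torch

# Sketch of the two-step idea: (1) collect the positions to mask for every
# example in the batch, (2) fill them all at once with a single tensor op
# instead of looping over examples.
def mask_batch(input_ids: torch.Tensor, mask_positions: torch.Tensor,
               mask_token_id: int) -> torch.Tensor:
    # input_ids: (batch, seq_len); mask_positions: (batch, n_mask) column indices
    masked = input_ids.clone()
    masked.scatter_(1, mask_positions, mask_token_id)  # in-place batched fill
    return masked

ids = torch.tensor([[5, 6, 7, 8],
                    [9, 10, 11, 12]])
pos = torch.tensor([[1, 3],
                    [0, 2]])
print(mask_batch(ids, pos, mask_token_id=103))
# tensor([[  5, 103,   7, 103],
#         [103,  10, 103,  12]])
```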

Add check for conda installation in env_setup.sh

There is no check in env_setup.sh to see whether the user has conda installed. As a result, all the pip install commands inside env_setup.sh will install the packages into the base environment if conda is not present.
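
A minimal sketch of such a guard, written in Python for illustration (env_setup.sh itself could use an equivalent `command -v conda` shell test before the installs):

```python
# Hypothetical pre-flight check to run before any pip install: abort if conda
# is missing so packages do not land in the base Python environment.
import shutil
import sys

if shutil.which("conda") is None:
    sys.exit("conda not found on PATH; aborting env_setup so packages are not "
             "installed into the base environment.")
print("conda found at:", shutil.which("conda"))
```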

How to restrict some key words as a whole instead of separating them?

For example, I have the following key words:

Hello kitty blanket Fleece

I want to treat "Hello kitty" as a single keyword, and generate text similar to the following:

Cartoon Hello Kitty Printing Throw Blanket Soft Cover Flannel Cozy Plush Fleece Blanket for Boys Girls Kids

Sentences like '* hello * kitty ...', where the two words end up separated, are not correct.
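
One possible workaround (my assumption, not a feature of POINTER): join each multi-word keyword into a single placeholder token before building the constraints, then undo the join in the output. Whether the tokenizer keeps the placeholder as one unit is a further assumption that would need checking:

```python
# Hedged sketch: protect multi-word keywords with a placeholder before
# tokenization, then restore them in the generated text.
PHRASES = {"hello kitty": "hello_kitty"}  # phrase -> placeholder (assumed scheme)

def protect(text: str) -> str:
    for phrase, joined in PHRASES.items():
        text = text.replace(phrase, joined)
    return text

def restore(text: str) -> str:
    for phrase, joined in PHRASES.items():
        text = text.replace(joined, phrase)
    return text

print(protect("hello kitty blanket fleece"))   # -> "hello_kitty blanket fleece"
print(restore("hello_kitty throw blanket"))    # -> "hello kitty throw blanket"
```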

About pre-training wiki dataset

Thanks for this interesting project! I am wondering:

  1. Could you please share a link to the pre-training dataset (i.e., the wiki data)?
  2. Your paper mentions that the wiki data contains 1.99 million sentences, about 12.6 GB, but we found that the original wiki data (90 million sentences) is about 23 GB. These numbers look inconsistent; am I misunderstanding something?

slot nums

I don't understand: for a sequence {sos, x1, x2, ..., xt, eos}, shouldn't there be t+1 slots, as described in the Insertion Transformer, i.e. (sos, x1), (x1, x2), ..., (xt, eos)? But in your description there are always t slots, which confuses me a lot.
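
To state the question concretely, enumerating adjacent pairs over a small example gives t + 1 slots:

```python
# For t = 3 content tokens framed by sos/eos, adjacent pairs give t + 1 slots.
seq = ["<sos>", "x1", "x2", "x3", "<eos>"]
slots = list(zip(seq, seq[1:]))
print(slots)       # [('<sos>', 'x1'), ('x1', 'x2'), ('x2', 'x3'), ('x3', '<eos>')]
print(len(slots))  # 4 == t + 1
```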

Checkpoint Binary Hosting seems too slow

Hi, I was deeply impressed by your work on POINTER.
I tried to fine-tune your wiki-pretrained checkpoint on my custom dataset,
but downloading from the provided link is very slow or not working. Could I get this checkpoint through another route? Thanks in advance.

How to generate shorter sentences?

Hi,

first, thanks for your amazing work on POINTER!

I am working on paraphrase generation from source-sentence keywords for my PhD, but in my experience the generated paraphrases tend to be 100 to 160 words, i.e. 3-4 times longer than my sources, even after fine-tuning.

In your opinion, what would be the best way to generate shorter paraphrases? The [No insertion] probability knob (with the risk of falling outside the pre-training domain; a generic sketch of what I mean is below), retraining from scratch on shorter sentences, or some other idea?
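
To be clear about the first option, this is the kind of decode-time biasing I have in mind. It is a generic sketch; the token id and bias value are placeholders, not names from the POINTER code:

```python
import torch

# Generic sketch of a "[No insertion]" knob: bias the no-insert token's logit
# upward at decode time so the model stops expanding the sequence sooner.
NOI_ID = 0        # assumed vocabulary id of the no-insertion token
NOI_BIAS = 2.0    # larger bias -> earlier stopping -> shorter outputs

def bias_no_insertion(logits: torch.Tensor) -> torch.Tensor:
    # logits: (batch, seq_len, vocab_size)
    biased = logits.clone()
    biased[..., NOI_ID] += NOI_BIAS
    return biased
```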

Thanks!

Add `.gitignore` file to the repo

The __pycache__ directory present in the repo is not necessary. Creating a .gitignore file and adding __pycache__/ to it will do the trick.

About News Dataset

In your paper, the EMNLP 2017 WMT News dataset contains 268,586 sentences, but there are many datasets at http://www.statmt.org/wmt17/ and I cannot tell which one was used in the experiments. I would appreciate it if you could provide some details.

About constructing data

I think there is an error at line 444 of generate_training_data.py. It should be: `tokens = tokenizer.tokenize(line)`

About Yelp dataset

Could you provide the Yelp dataset, or explain how it was extracted? The Yelp dataset in the paper "Towards coherent and cohesive long-form text generation" is different from the one used in this paper.

Yelp dataset cannot be downloaded.

Sorry to be a bother. The Yelp dataset (restaurant reviews) cannot be downloaded. When I visit the URL, this error comes back:
AuthenticationFailed: Server failed to authenticate the request. Make sure the value of the Authorization header is formed correctly, including the signature. RequestId: 936f42d0-c01e-002b-3009-cad595000000 Time: 2020-12-04T06:49:53.3054174Z. Signature not valid in the specified time frame: Start [Wed, 02 Dec 2020 18:29:28 GMT] - Expiry [Thu, 03 Dec 2020 18:29:28 GMT] - Current [Fri, 04 Dec 2020 06:49:53 GMT]
I'd appreciate it if you could fix it.
