
pointer's People

Contributors

dreasysnail


pointer's Issues

Hi there, I have some questions and suggestions.

Thanks for your code!

  1. Q: The model suffers from error accumulation. I suggest adding a '[DEL]' token so the model learns when to delete which words, similar to the NAT method Levenshtein Transformer. The difference is that Levenshtein performs [where to insert / insert / delete] in separate steps, whereas yours could do all of it in one step. One step, however, brings instability: an insertion and a deletion can happen at the same place in the same step. In my tests, adding '[DEL]' decreased PPL by about 10 points.
  2. Q: Lack of knowledge. This showed up when I constrained the training data to metaphor or parallelism examples: compared to GPT-2, the output lacks strong internal logic and tends to generate words like "no", "can't", "doesn't", which can completely change the meaning of a sentence and leave different parts of it saying different things. I don't know how to fix this. Maybe something like knowledge-enhanced BERT? Or maybe this is a disadvantage of NAT methods, relative to AR models like GPT-2, that cannot be solved because of the unstable generation pattern?
  3. Your greedy_search inference code may be too slow when running inference on batched data. I suggest a torch-mask version (1. get the indices to mask; 2. use torch.scatter or Tensor.masked_fill to process the whole batch at once); a rough sketch follows below. In a few days I'll open a pull request; please review it.
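
To make point 3 concrete, here is a rough sketch of the batched masking idea. The function and variable names are mine for illustration, not identifiers from the POINTER repo:

```python
import torch

# Sketch of the two-step idea: (1) collect the positions to mask for every
# example in the batch, (2) fill them all at once with a single tensor op
# instead of looping over examples.
def mask_batch(input_ids: torch.Tensor, mask_positions: torch.Tensor,
               mask_token_id: int) -> torch.Tensor:
    # input_ids: (batch, seq_len); mask_positions: (batch, n_mask) column indices
    masked = input_ids.clone()
    masked.scatter_(1, mask_positions, mask_token_id)  # in-place batched fill
    return masked

ids = torch.tensor([[5, 6, 7, 8],
                    [9, 10, 11, 12]])
pos = torch.tensor([[1, 3],
                    [0, 2]])
print(mask_batch(ids, pos, mask_token_id=103))
# tensor([[  5, 103,   7, 103],
#         [103,  10, 103,  12]])
```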

Add check for conda installation in env_setup.sh

There is no check in env_setup.sh to see whether the user has conda installed. As a result, all the pip install commands inside env_setup.sh will install the packages into the base environment if conda is not present.
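
A minimal sketch of such a guard, written in Python for illustration (env_setup.sh itself could use an equivalent `command -v conda` shell test before the installs):

```python
# Hypothetical pre-flight check to run before any pip install: abort if conda
# is missing so packages do not land in the base Python environment.
import shutil
import sys

if shutil.which("conda") is None:
    sys.exit("conda not found on PATH; aborting env_setup so packages are not "
             "installed into the base environment.")
print("conda found at:", shutil.which("conda"))
```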

How to restrict some key words as a whole instead of separating them?

For example, I have the following key words:

Hello kitty blanket Fleece

I want to treat "Hello kitty" as a single keyword, and generate text similar to the following:

Cartoon Hello Kitty Printing Throw Blanket Soft Cover Flannel Cozy Plush Fleece Blanket for Boys Girls Kids

Sentences like '* hello * kitty ...', where the two words end up separated, are not correct.
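
One possible workaround (my assumption, not a feature of POINTER): join each multi-word keyword into a single placeholder token before building the constraints, then undo the join in the output. Whether the tokenizer keeps the placeholder as one unit is a further assumption that would need checking:

```python
# Hedged sketch: protect multi-word keywords with a placeholder before
# tokenization, then restore them in the generated text.
PHRASES = {"hello kitty": "hello_kitty"}  # phrase -> placeholder (assumed scheme)

def protect(text: str) -> str:
    for phrase, joined in PHRASES.items():
        text = text.replace(phrase, joined)
    return text

def restore(text: str) -> str:
    for phrase, joined in PHRASES.items():
        text = text.replace(joined, phrase)
    return text

print(protect("hello kitty blanket fleece"))   # -> "hello_kitty blanket fleece"
print(restore("hello_kitty throw blanket"))    # -> "hello kitty throw blanket"
```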

About pre-training wiki dataset

Thanks for this interesting project! I am wondering:

  1. Could you please share a link to the pre-training dataset (i.e., the wiki data)?
  2. Your paper mentions that the wiki data contains 1.99 million sentences, about 12.6 GB, but we found that the original wiki data (90 million sentences) is about 23 GB. These numbers look inconsistent; am I misunderstanding something?

slot nums

I don't understand: for a sequence {sos, x1, x2, ..., xt, eos}, shouldn't there be t+1 slots, as described in the Insertion Transformer, i.e. (sos, x1), (x1, x2), ..., (xt, eos)? But in your description there are always t slots, which confuses me a lot.
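
To state the question concretely, enumerating adjacent pairs over a small example gives t + 1 slots:

```python
# For t = 3 content tokens framed by sos/eos, adjacent pairs give t + 1 slots.
seq = ["<sos>", "x1", "x2", "x3", "<eos>"]
slots = list(zip(seq, seq[1:]))
print(slots)       # [('<sos>', 'x1'), ('x1', 'x2'), ('x2', 'x3'), ('x3', '<eos>')]
print(len(slots))  # 4 == t + 1
```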

Checkpoint Binary Hosting seems too slow

Hi, I was deeply impressed by your work on POINTER.
I tried to fine-tune your wiki-pretrained checkpoint on my custom dataset,
but downloading from the provided link is very slow or not working. Could I get this checkpoint through another route? Thanks in advance.

How to generate shorter sentences?

Hi,

first, thanks for your amazing work on POINTER!

I am working on paraphrase generation from source-sentence keywords for my PhD, but in my experience the generated paraphrases tend to be 100 to 160 words, i.e. 3-4 times longer than my sources, even after fine-tuning.

In your opinion, what would be the best way to generate shorter paraphrases? The [No insertion] probability knob (with the risk of falling outside the pre-training domain; a generic sketch of what I mean is below), retraining from scratch on shorter sentences, or some other idea?
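
To be clear about the first option, this is the kind of decode-time biasing I have in mind. It is a generic sketch; the token id and bias value are placeholders, not names from the POINTER code:

```python
import torch

# Generic sketch of a "[No insertion]" knob: bias the no-insert token's logit
# upward at decode time so the model stops expanding the sequence sooner.
NOI_ID = 0        # assumed vocabulary id of the no-insertion token
NOI_BIAS = 2.0    # larger bias -> earlier stopping -> shorter outputs

def bias_no_insertion(logits: torch.Tensor) -> torch.Tensor:
    # logits: (batch, seq_len, vocab_size)
    biased = logits.clone()
    biased[..., NOI_ID] += NOI_BIAS
    return biased
```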

Thanks!

Add `.gitignore` file to the repo

The __pycache__ directory present in the repo is not necessary. Creating a .gitignore file and adding __pycache__/ to it will do the trick.

About News Dataset

In your paper, the EMNLP 2017 WMT News dataset contains 268,586 sentences, but there are many datasets at http://www.statmt.org/wmt17/ and I cannot tell which one was used in the experiments. I would appreciate it if you could provide some details.

About constructing data

I think there is an error at line 444 of generate_training_data.py. It should be: `tokens = tokenizer.tokenize(line)`

About Yelp dataset

Could you provide the Yelp dataset, or explain how it was extracted? The Yelp dataset in the paper "Towards coherent and cohesive long-form text generation" is different from the one used in this paper.

Yelp dataset cannot be downloaded.

Sorry to be a bother. The Yelp dataset (restaurant reviews) cannot be downloaded. When I visit the URL, this error comes back:
AuthenticationFailed: Server failed to authenticate the request. Make sure the value of the Authorization header is formed correctly, including the signature. RequestId: 936f42d0-c01e-002b-3009-cad595000000 Time: 2020-12-04T06:49:53.3054174Z. Signature not valid in the specified time frame: Start [Wed, 02 Dec 2020 18:29:28 GMT] - Expiry [Thu, 03 Dec 2020 18:29:28 GMT] - Current [Fri, 04 Dec 2020 06:49:53 GMT]
I'd appreciate it if you could fix it.
