atpaino / deep-text-corrector

Deep learning models trained to correct input errors in short, message-like text

License: Apache License 2.0

Jupyter Notebook 29.92% Python 70.08%

deep-text-corrector's Introduction

Deep Text Corrector

Deep Text Corrector uses TensorFlow to train sequence-to-sequence models that are capable of automatically correcting small grammatical errors in conversational written English (e.g. SMS messages). It does this by taking English text samples that are known to be mostly grammatically correct and randomly introducing a handful of small grammatical errors (e.g. removing articles) to each sentence to produce input-output pairs (where the output is the original sample), which are then used to train a sequence-to-sequence model.

See this blog post for a more thorough write-up of this work.

Motivation

While context-sensitive spell-check systems are able to automatically correct a large number of input errors in instant messaging, email, and SMS messages, they are unable to correct even simple grammatical errors. For example, the message "I'm going to store" would be unaffected by typical autocorrection systems, when the user most likely intended to write "I'm going to the store". These kinds of simple grammatical mistakes are common in so-called "learner English", and constructing systems capable of detecting and correcting these mistakes has been the subject of multiple CoNLL shared tasks.

The goal of this project is to train sequence-to-sequence models that are capable of automatically correcting such errors. Specifically, the models are trained to provide a function mapping a potentially errant input sequence to a sequence with all (small) grammatical errors corrected. Given these models, it would be possible to construct tools to help correct these simple errors in written communications, such as emails, instant messaging, etc.

Correcting Grammatical Errors with Deep Learning

The basic idea behind this project is that we can generate large training datasets for the task of grammar correction by starting with grammatically correct samples and introducing small errors to produce input-output pairs, which can then be used to train sequence-to-sequence models. The details of how we construct these datasets, train models using them, and produce predictions for this task are described below.

Datasets

To create a dataset for Deep Text Corrector models, we start with a large collection of mostly grammatically correct samples of conversational written English. The primary dataset considered in this project is the Cornell Movie-Dialogs Corpus, which contains over 300k lines from movie scripts. This was the largest collection of conversational written English I could find that was mostly grammatically correct.

Given a sample of text like this, the next step is to generate input-output pairs to be used during training. This is done by:

1. Drawing a sample sentence from the dataset.
2. Setting the input sequence to this sentence after randomly applying certain perturbations.
3. Setting the output sequence to the unperturbed sentence.

where the perturbations applied in step (2) are intended to introduce small grammatical errors which we would like the model to learn to correct. Thus far, these perturbations are limited to the:

subtraction of articles (a, an, the)
subtraction of the second part of a verb contraction (e.g. "'ve", "'ll", "'s", "'m")
replacement of a few common homophones with one of their counterparts (e.g. replacing "their" with "there", "then" with "than")

The rates with which these perturbations are introduced are loosely based on figures taken from the CoNLL 2014 Shared Task on Grammatical Error Correction. In this project, each perturbation is applied in 25% of cases where it could potentially be applied.
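The pair-generation procedure described above can be sketched as follows. This is a minimal illustration, not the project's actual reader code: the function names `perturb` and `make_pair`, the contents of the `HOMOPHONES` table, and the assumption that contraction suffixes such as "'m" arrive as separate tokens are all hypothetical.

```python
import random

ARTICLES = {"a", "an", "the"}
CONTRACTION_SUFFIXES = {"'ve", "'ll", "'s", "'m"}
# Illustrative homophone table (an assumption); the project's list may differ.
HOMOPHONES = {"their": "there", "there": "their", "then": "than", "than": "then"}


def perturb(tokens, rate=0.25, rng=random):
    """Return a noisy copy of `tokens` with small grammatical errors.

    Each applicable perturbation fires with probability `rate`.
    """
    noisy = []
    for token in tokens:
        lowered = token.lower()
        if lowered in ARTICLES or lowered in CONTRACTION_SUFFIXES:
            if rng.random() < rate:
                continue  # drop the article or contraction suffix
        elif lowered in HOMOPHONES and rng.random() < rate:
            noisy.append(HOMOPHONES[lowered])  # swap in the counterpart
            continue
        noisy.append(token)
    return noisy


def make_pair(tokens, rate=0.25, rng=random):
    """Build one (input, output) training pair from a clean sentence."""
    return perturb(tokens, rate, rng), list(tokens)
```

Calling `make_pair` several times on the same sentence yields different noisy inputs for the same target, which is one way to obtain multiple pairs per source sentence.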

Training

To artificially increase the dataset when training a sequence model, we perform the sampling strategy described above multiple times to arrive at 2-3x the number of input-output pairs. Given this augmented dataset, training proceeds in a very similar manner to TensorFlow's sequence-to-sequence tutorial. That is, we train a sequence-to-sequence model using LSTM encoders and decoders with an attention mechanism as described in Bahdanau et al., 2014 using stochastic gradient descent.

Decoding

Instead of using the most probable decoding according to the seq2seq model, this project takes advantage of the unique structure of the problem to impose the prior that all tokens in a decoded sequence should either exist in the input sequence or belong to a set of "corrective" tokens. The "corrective" token set is constructed during training and contains all tokens seen in the target, but not the source, for at least one sample in the training set. The intuition here is that the errors seen during training involve the misuse of a relatively small vocabulary of common words (e.g. "the", "an", "their") and that the model should only be allowed to perform corrections in this domain.
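As a rough sketch of how such a "corrective" token set could be collected, the snippet below scans tokenized (source, target) pairs and keeps every token that appears in a target but not in its source. The function name `build_corrective_tokens` is hypothetical and is not taken from the project's code.

```python
def build_corrective_tokens(pairs):
    """Collect tokens seen in a target but not in the matching source.

    `pairs` is an iterable of (source_tokens, target_tokens) tuples.
    """
    corrective = set()
    for source, target in pairs:
        corrective.update(set(target) - set(source))
    return corrective
```

For a pair whose source is "i going to store" and whose target is "i 'm going to the store", this contributes "'m" and "the" to the set.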

This prior is carried out through a modification to the seq2seq model's decoding loop in addition to a post-processing step that resolves out-of-vocabulary (OOV) tokens:

Biased Decoding

To restrict the decoding such that it only ever chooses tokens from the input sequence or corrective token set, this project applies a binary mask to the model's logits prior to extracting the prediction to be fed into the next time step. This mask is constructed such that mask[i] == 1.0 if (i in input or corrective_tokens) else 0.0. Since this mask is applied to the result of a softmax transformation (which guarantees all outputs are non-negative), we can be sure that only input or corrective tokens are ever selected.

Note that this logic is not used during training, as this would only serve to eliminate potentially useful signal from the model.
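The masking step can be illustrated with plain Python lists standing in for tensors. This is only a sketch under stated assumptions: `build_mask` and `biased_argmax` are hypothetical names, token ids are assumed to be integer indices into the vocabulary, and `probs` is assumed to be the softmax output for a single decoding step.

```python
def build_mask(vocab_size, input_ids, corrective_ids):
    """Return a 0/1 mask that allows only input and corrective token ids."""
    allowed = set(input_ids) | set(corrective_ids)
    return [1.0 if i in allowed else 0.0 for i in range(vocab_size)]


def biased_argmax(probs, mask):
    """Pick the most probable token among those the mask allows.

    Because softmax outputs are positive, multiplying by the mask zeroes
    out disallowed tokens without reordering the allowed ones.
    """
    masked = [p * m for p, m in zip(probs, mask)]
    return max(range(len(masked)), key=masked.__getitem__)
```

In this toy form, a token with the highest raw probability is still skipped if it is neither in the input nor in the corrective set.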

Handling OOV Tokens

Since the decoding bias described above is applied within the truncated vocabulary used by the model, we will still see the unknown token in its output for any OOV tokens. The more generic problem of resolving these OOV tokens is non-trivial (e.g. see Addressing the Rare Word Problem in NMT), but in this project we can again take advantage of its unique structure to create a fairly straightforward OOV token resolution scheme. That is, if we assume the sequence of OOV tokens in the input is equal to the sequence of OOV tokens in the output sequence, then we can trivially assign the appropriate token to each "unknown" token encountered in the decoding. Empirically, and intuitively, this appears to be an appropriate assumption, as the relatively simple class of errors these models are being trained to address should never include mistakes that warrant the insertion or removal of a rare token.
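A minimal sketch of this resolution scheme is shown below. It is not the project's `decode` implementation: the function name `resolve_oov`, the `vocabulary` set, and the "UNK" placeholder string are assumptions made for illustration.

```python
UNK = "UNK"  # assumed placeholder for out-of-vocabulary tokens


def resolve_oov(input_tokens, output_tokens, vocabulary, unk=UNK):
    """Replace each unknown token in the output with the next OOV input token.

    Relies on the assumption that OOV tokens appear in the same order in
    the input and the output.
    """
    oov_queue = [t for t in input_tokens if t not in vocabulary]
    resolved = []
    next_oov = 0
    for token in output_tokens:
        if token == unk and next_oov < len(oov_queue):
            resolved.append(oov_queue[next_oov])
            next_oov += 1
        else:
            resolved.append(token)
    return resolved
```

If the decoder emits more unknown tokens than the input contains, the extra ones are left as the placeholder rather than guessed.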

Experiments and Results

Below are some anecdotal and aggregate results from experiments using the Deep Text Corrector model with the Cornell Movie-Dialogs Corpus. The dataset consists of 304,713 lines from movie scripts, of which 243,768 lines were used to train the model and 30,474 lines each were used for the validation and testing sets. The sets were selected such that no lines from the same movie were present in both the training and testing sets.

The model being evaluated below is a sequence-to-sequence model, with attention, where the encoder and decoder were both 2-layer, 512 hidden unit LSTMs. The model was trained with a vocabulary of the 2k most common words seen in the training set.

Aggregate Performance

Below are reported the BLEU scores and accuracy numbers over the test dataset for both a trained model and a baseline, where the baseline is the identity function (which assumes no errors exist in the input).

You'll notice that the model outperforms this baseline for all bucket sizes in terms of accuracy, and outperforms all but one in terms of BLEU score. This tells us that applying the Deep Text Corrector model to a potentially errant writing sample would, on average, result in a more grammatically correct writing sample. Anyone who tends to make errors similar to those the model has been trained on could therefore benefit from passing their messages through this model.

Bucket 0: (10, 10)
        Baseline BLEU = 0.8341
        Model BLEU = 0.8516
        Baseline Accuracy: 0.9083
        Model Accuracy: 0.9384
Bucket 1: (15, 15)
        Baseline BLEU = 0.8850
        Model BLEU = 0.8860
        Baseline Accuracy: 0.8156
        Model Accuracy: 0.8491
Bucket 2: (20, 20)
        Baseline BLEU = 0.8876
        Model BLEU = 0.8880
        Baseline Accuracy: 0.7291
        Model Accuracy: 0.7817
Bucket 3: (40, 40)
        Baseline BLEU = 0.9099
        Model BLEU = 0.9045
        Baseline Accuracy: 0.6073
        Model Accuracy: 0.6425
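To make the baseline comparison concrete, here is a small sketch of how an identity baseline could be scored. It assumes that "accuracy" means the fraction of test pairs whose prediction matches the target exactly, which the text above does not state; `exact_match_accuracy` and `identity` are hypothetical names.

```python
def exact_match_accuracy(predict, pairs):
    """Fraction of (source, target) pairs where predict(source) == target."""
    correct = sum(1 for source, target in pairs if predict(source) == target)
    return correct / len(pairs)


def identity(tokens):
    """Baseline that assumes the input contains no errors."""
    return tokens
```

Under this reading, the baseline's score is simply the share of test inputs that received no perturbation, and a trained model would be passed in place of `identity` for comparison.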

Examples

Decoding a sentence with a missing article:

In [31]: decode("Kvothe went to market")
Out[31]: 'Kvothe went to the market'

Decoding a sentence with then/than confusion:

In [30]: decode("the Cardinals did better then the Cubs in the offseason")
Out[30]: 'the Cardinals did better than the Cubs in the offseason'

Implementation Details

This project reuses and slightly extends TensorFlow's Seq2SeqModel, which itself implements a sequence-to-sequence model with an attention mechanism as described in https://arxiv.org/pdf/1412.7449v3.pdf. The primary contributions of this project are:

data_reader.py: an abstract class that defines the interface for classes which are capable of reading a source dataset and producing input-output pairs, where the input is a grammatically incorrect variant of a source sentence and the output is the original sentence.
text_corrector_data_readers.py: contains a few implementations of DataReader, one over the Penn Treebank dataset and one over the Cornell Movie-Dialogs Corpus.
text_corrector_models.py: contains a version of Seq2SeqModel modified such that it implements the logic described in Biased Decoding.
correct_text.py: a collection of helper functions that together allow for the training of a model and the usage of it to decode errant input sequences (at test time). The decode method defined here implements the OOV token resolution logic. This also defines a main method, and can be invoked from the command line. It was largely derived from TensorFlow's translate.py.
TextCorrector.ipynb: an IPython notebook which ties together all of the above pieces to allow for the training and evaluation of the model in an interactive fashion.

Example Usage

Note: this project requires TensorFlow version >= 0.11. See this page for setup instructions.

Preprocess Movie Dialog Data

python preprocessors/preprocess_movie_dialogs.py --raw_data movie_lines.txt \
                                                 --out_file preprocessed_movie_lines.txt

This preprocessed file can then be split up however you like to create training, validation, and testing sets.
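One possible way to do the split is sketched below as a simple random 80/10/10 division by line. The function names, fractions, and seed are assumptions, and a line-level shuffle like this does not keep lines from the same movie in the same set, unlike the split used for the reported experiments.

```python
import random


def split_lines(lines, val_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle lines and return (train, val, test) lists."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    n_test = int(len(shuffled) * test_fraction)
    val = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    train = shuffled[n_val + n_test:]
    return train, val, test


def write_split(in_path, train_path, val_path, test_path):
    """Read the preprocessed file and write the three subsets to disk."""
    with open(in_path, encoding="utf-8") as f:
        lines = f.readlines()
    subsets = split_lines(lines)
    for path, subset in zip((train_path, val_path, test_path), subsets):
        with open(path, "w", encoding="utf-8") as out:
            out.writelines(subset)
```

For example, `write_split("preprocessed_movie_lines.txt", "movie_dialog_train.txt", "movie_dialog_val.txt", "movie_dialog_test.txt")` would produce files with the names used in the commands that follow.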

Training:

python correct_text.py --train_path /movie_dialog_train.txt \
                       --val_path /movie_dialog_val.txt \
                       --config DefaultMovieDialogConfig \
                       --data_reader_type MovieDialogReader \
                       --model_path /movie_dialog_model

Testing:

python correct_text.py --test_path /movie_dialog_test.txt \
                       --config DefaultMovieDialogConfig \
                       --data_reader_type MovieDialogReader \
                       --model_path /movie_dialog_model \
                       --decode


deep-text-corrector's Issues

Cannot execute your code due to missing attribute '_linear'

Hi Alex, thanks for your great work! I tried executing your main notebook (TextCorrector.ipynb), but I keep getting this error message: AttributeError: module 'tensorflow.python.ops.rnn_cell' has no attribute '_linear'. I ran your code in Jupyter Notebook with Python 3.5 and TensorFlow 1.2.1 (the latest versions of both). I don't understand why it says this module lacks an attribute the code needs. Could you please help explain why this happens?

'Variable proj_w does not exist, or was not created with tf.get_variable(). ' on Google Colab

The code works in my local environment, but training is too slow, so I moved it to Google Colab. Then I got 'Variable proj_w already exists, disallowed. ' while executing the 4th block of the code.

I searched and found that tf.get_variable is always used inside a with tf.variable_scope block, so I thought it might work if I changed tf.get_variable to tf.Variable, but it didn't. The error became:

ValueError Traceback (most recent call last)
in ()
----> 1 train(data_reader, train_path, val_path, model_path)

/content/drive/My Drive/ColabNotebooks/grammarCorrection/correct_text.py in train(data_reader, train_path, test_path, model_path)
145 "Creating %d layers of %d units." % (
146 config.num_layers, config.size))
--> 147 model = create_model(sess, False, model_path, config=config)
148
149 # Read data into buckets and compute their sizes.

/content/drive/My Drive/ColabNotebooks/grammarCorrection/correct_text.py in create_model(session, forward_only, model_path, config)
122 use_lstm=config.use_lstm,
123 forward_only=forward_only,
--> 124 config=config)
125 ckpt = tf.train.get_checkpoint_state(model_path)
126 if ckpt and tf.gfile.Exists(ckpt.model_checkpoint_path):

/content/drive/My Drive/ColabNotebooks/grammarCorrection/text_corrector_models.py in init(self, source_vocab_size, target_vocab_size, buckets, size, num_layers, max_gradient_norm, batch_size, learning_rate, learning_rate_decay_factor, use_lstm, num_samples, forward_only, config, corrective_tokens_mask)
108 if self.target_vocab_size > num_samples > 0:
109 # w = tf.get_variable("proj_w", [size, self.target_vocab_size])
--> 110 w = tf.Variable([size, self.target_vocab_size], 'proj_w')
111 w_t = tf.transpose(w)
112 # b = tf.get_variable("proj_b", [self.target_vocab_size])

/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variable_scope.py in get_variable(name, shape, dtype, initializer, regularizer, trainable, collections, caching_device, partitioner, validate_shape, use_resource, custom_getter, constraint, synchronization, aggregation)
1485 constraint=constraint,
1486 synchronization=synchronization,
-> 1487 aggregation=aggregation)
1488
1489

/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variable_scope.py in get_variable(self, var_store, name, shape, dtype, initializer, regularizer, reuse, trainable, collections, caching_device, partitioner, validate_shape, use_resource, custom_getter, constraint, synchronization, aggregation)
1235 constraint=constraint,
1236 synchronization=synchronization,
-> 1237 aggregation=aggregation)
1238
1239 def _get_partitioned_variable(self,

/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variable_scope.py in get_variable(self, name, shape, dtype, initializer, regularizer, reuse, trainable, collections, caching_device, partitioner, validate_shape, use_resource, custom_getter, constraint, synchronization, aggregation)
538 constraint=constraint,
539 synchronization=synchronization,
--> 540 aggregation=aggregation)
541
542 def _get_partitioned_variable(self,

/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variable_scope.py in _true_getter(name, shape, dtype, initializer, regularizer, reuse, trainable, collections, caching_device, partitioner, validate_shape, use_resource, constraint, synchronization, aggregation)
490 constraint=constraint,
491 synchronization=synchronization,
--> 492 aggregation=aggregation)
493
494 # Set trainable value based on synchronization value.

/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variable_scope.py in _get_single_variable(self, name, shape, dtype, initializer, regularizer, partition_info, reuse, trainable, collections, caching_device, validate_shape, use_resource, constraint, synchronization, aggregation)
877 raise ValueError("Variable %s does not exist, or was not created with "
878 "tf.get_variable(). Did you mean to set "
--> 879 "reuse=tf.AUTO_REUSE in VarScope?" % name)
880
881 # Create the tensor to initialize the variable with default value.

ValueError: Variable proj_w does not exist, or was not created with tf.get_variable(). Did you mean to set reuse=tf.AUTO_REUSE in VarScope?

I'm still stuck in this error, anyone can help?

how to train customize word :

Hi, I like your model but I want to know how to train customize word :
like U.S..S.A -> U.S.A

train_path required for decode?

In the example at the end of the README, decode is called with test_path but not train_path. (That makes sense to me.)

However, in correct_text.py main, FLAGS.train_path is still required even for the code path that runs when FLAGS.decode is true.

Should I change the README, or correct_text.py?

file not found error

When I try to run python preprocessors/preprocess_movie_dialogs.py --raw_data movie_lines.txt
it gives me a file-not-found error:
Traceback (most recent call last):
File "preprocessors/preprocess_movie_dialogs.py", line 23, in
tf.app.run()
File "C:\Users\Sarve\AppData\Roaming\Python\Python37\site-packages\tensorflow\python\platform\app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "C:\Users\Sarve\AppData\Roaming\Python\Python37\site-packages\absl\app.py", line 299, in run
_run_main(main, args)
File "C:\Users\Sarve\AppData\Roaming\Python\Python37\site-packages\absl\app.py", line 250, in _run_main
sys.exit(main(argv))
File "preprocessors/preprocess_movie_dialogs.py", line 14, in main
open(FLAGS.out_file, "w") as out:
FileNotFoundError: [Errno 2] No such file or directory: ''

Can you explain how to run this?

KeyError: 'UNK'

When I run your project ,this error occurs. How to solve this problem?

Traceback (most recent call last):
File "correct_text.py", line 438, in
tf.app.run()
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 43, in run
sys.exit(main(sys.argv[:1] + flags_passthrough))
File "correct_text.py", line 413, in main
data_reader = MovieDialogReader(config, FLAGS.train_path)
File "/opt/yangzhanku/correct_text/deep-text-corrector-master/text_corrector_data_readers.py", line 88, in init
self.UNKNOWN_ID = self.token_to_id[MovieDialogReader.UNKNOWN_TOKEN]
KeyError: 'UNK'

sampled_loss() got an unexpected keyword argument 'logits'

@atpaino This error occured when running text_corrector_models.py

Why lowercase in preproc?

I noticed that the code lowers in preproc.

https://github.com/atpaino/deep-text-corrector/search?utf8=%E2%9C%93&q=lower%28%29&type=

Because of this:

The system can't use case as a clue.
The system can't correct case.

Did you try it without lowering at first, and there were problems?

(My instinct would be to avoid canonicalisation, and fight the out-of-dataset tokens with data.)

Module not found

Can someone provide me with a compiled and executable version of the project for i can not compile the file as it shows error of module not found for tensorflow and I need the project urgently?

What version of tensorflow does this code work on?

I tried running this code with multiple tensorflow versions (1.13, 1.1, 0.12) but it keeps giving some error or the other, specifically related to rnn_cell. (cannot import name rnn_cell). Even if I resolve it using contrib package, then I keep getting subsequent errors.
Can someone please tell me which version of tensorflow does this code work with without any errors?
Also, does it work with a specific version of python as well?

Thanks
Aayushee

I am getting movie_dialog_train.txt not found Error

When I run this command:

python correct_text.py --train_path /movie_dialog_train.txt \
                       --val_path /movie_dialog_val.txt \
                       --config DefaultMovieDialogConfig \
                       --data_reader_type MovieDialogReader \
                       --model_path /movie_dialog_model

this error shows up:

IOError: [Errno 2] No such file or directory: '/movie_dialog_train.txt'
Am I missing something here? I cannot find this text file in Cornell corpus also. I'm trying to build a grammar checker for my project. Can anyone help me with this issue?

I run decode ,then has a error

(env-0.12.0) root@op-System-Product-Name:/home/github/deep-text-corrector# cat predict.sh
python correct_text.py --test_path ./test.txt --config DefaultMovieDialogConfig --data_reader_type MovieDialogReader --model_path ./movie_dialog_model --decode

may I run "python correct_text.py --train_path ./movie_dialog_train.txt --test_path ./test.txt --config DefaultMovieDialogConfig --data_reader_type MovieDialogReader --model_path ./movie_dialog_model --decode"????

add -train_path ./movie_dialog_train.txt
????

'zip' object is not subscriptable

I am getting this error when I try to run data_reader.
"TypeError: 'zip' object is not subscriptable"

'zip' object is not subscriptable

I have the same problem as here

I changed line 46 to self.token_to_id = dict((k, self.full_token_to_id[k]) for k in list(self.full_token_to_id.keys())[:max_vocabulary_size])

But still got the error:

     44             full_token_and_id = zip(vocabulary, range(len(vocabulary)))
     45             self.full_token_to_id = dict(full_token_and_id)
---> 46             self.token_to_id = dict((k, self.full_token_to_id[k]) for k in list(self.full_token_to_id.keys())[:max_vocabulary_size])
     47 
     48         self.id_to_token = {v: k for k, v in self.token_to_id.items()}

TypeError: 'zip' object is not subscriptable

I run decode ,then has a error ?

(env-0.12.0) root@op-System-Product-Name:/home/github/deep-text-corrector# ./predict.sh
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally
Traceback (most recent call last):
File "correct_text.py", line 439, in
tf.app.run()
File "/home/env/python3.5/env-0.12.0/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 43, in run
sys.exit(main(sys.argv[:1] + flags_passthrough))
File "correct_text.py", line 414, in main
data_reader = MovieDialogReader(config, FLAGS.train_path)
File "/home/github/deep-text-corrector/text_corrector_data_readers.py", line 82, in init
dataset_copies=dataset_copies)
File "/home/github/deep-text-corrector/data_reader.py", line 32, in init
for tokens in self.read_tokens(train_path):
File "/home/github/deep-text-corrector/text_corrector_data_readers.py", line 114, in read_tokens
with open(path, "r") as f:
FileNotFoundError: [Errno 2] No such file or directory: 'train'

run script ?
python correct_text.py --test_path ./test.txt --config DefaultMovieDialogConfig --data_reader_type MovieDialogReader --model_path ./movie_dialog_model --decode

why???????????????????? but FLAGS.train_path is None

Problem in Run

Can someone tell me how to run this project???

'zip' object is not subscriptable

When i tried running
python correct_text.py --train_path /movie_dialog_train.txt \
                       --val_path /movie_dialog_val.txt \
                       --config DefaultMovieDialogConfig \
                       --data_reader_type MovieDialogReader \
                       --model_path /movie_dialog_model

it gives me error

File "correct_text.py", line 438, in
tf.app.run()
File "/home/abhinavsingh/anaconda3/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "correct_text.py", line 413, in main
data_reader = MovieDialogReader(config, FLAGS.train_path)
File "/home/abhinavsingh/deep-text-corrector-master/text_corrector_data_readers.py", line 82, in init
dataset_copies=dataset_copies)
File "/home/abhinavsingh/deep-text-corrector-master/data_reader.py", line 46, in init
self.token_to_id = dict(full_token_and_id[:max_vocabulary_size])
TypeError: 'zip' object is not subscriptable
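For context on this error: in Python 3, zip() returns a lazy iterator rather than a list, so it cannot be sliced directly. A possible workaround, sketched here as a standalone function rather than as a patch to data_reader.py, is to materialize the pairs with list() before slicing. The name `truncate_vocabulary` is hypothetical; the variable names mirror those in the traceback.

```python
def truncate_vocabulary(vocabulary, max_vocabulary_size):
    """Map tokens to ids and keep only the first max_vocabulary_size entries."""
    # list(...) turns the zip iterator into a sliceable list of (token, id).
    full_token_and_id = list(zip(vocabulary, range(len(vocabulary))))
    full_token_to_id = dict(full_token_and_id)
    token_to_id = dict(full_token_and_id[:max_vocabulary_size])
    return full_token_to_id, token_to_id
```

When trying a change like this in a notebook, the edited module usually has to be reloaded or the kernel restarted before the new code takes effect.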

module '_pywrap_tensorflow_internal' has no attribute 'TF_ListPhysicalDevices'

Getting this error when i run the preprocessing python itself.

How many steps does it need to run for to get decent results ?

I have run it for 30K steps, but I am not getting a corrected output. I get the same output as what's fed into the input.

Input : this is table
Output : this is table

I am expecting it to insert the article and give me "this is a table"
How many more steps should I run it for ?

KeyError: 'UNK'

def __init__(self, config, train_path=None, token_to_id=None,
             dropout_prob=0.25, replacement_prob=0.25, dataset_copies=2):
    super(MovieDialogReader, self).__init__(
        config, train_path=train_path, token_to_id=token_to_id,
        special_tokens=[
            PAD_TOKEN, GO_TOKEN, EOS_TOKEN,
            MovieDialogReader.UNKNOWN_TOKEN],
        dataset_copies=dataset_copies)

    self.dropout_prob = dropout_prob
    self.replacement_prob = replacement_prob
    self.UNKNOWN_ID = self.token_to_id[MovieDialogReader.UNKNOWN_TOKEN]

#last line gives error
#I dont understand where UNKNOWN_ID is coming from and what token_to_id actually is

txt files and model.

How do I create cleaned_dialog_val.txt, cleaned_dialog_test.txt, and this model: dialog_correcter_model_testnltk?

Decoding is repeating the same word

Hello,
I have an issue

decoded = decode_sentence(sess, model, data_reader, "you must have girlfriend", corrective_tokens=corrective_tokens)
Input: you must have girlfriend
Output: you you you you you you you you you you

Any one has an idea please?
Many thanks

Cannot replicate

I trained the model as specified in the readme but cannot replicate the results. The following is what I get.

Input: you must have girlfriend
Output: than than than than than than than than than than

Is this because of the training/dataset?

'str' object has no attribute 'decode'

When i tried running
python preprocessors/preprocess_movie_dialogs.py --raw_data movie_lines.txt
--out_file preprocessed_movie_lines.txt

it gives me error
python preprocessors/preprocess_movie_dialogs.py --raw_data movie_lines.txt --out_file preprocessed_movie_lines.txt
/home/abhinavsingh/anaconda3/lib/python3.6/site-packages/h5py/init.py:36: FutureWarning: Conversion of the second argument of issubdtype from float to np.floating is deprecated. In future, it will be treated as np.float64 == np.dtype(float).type.
from ._conv import register_converters as _register_converters
Traceback (most recent call last):
File "preprocessors/preprocess_movie_dialogs.py", line 24, in
tf.app.run()
File "/home/abhinavsingh/anaconda3/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "preprocessors/preprocess_movie_dialogs.py", line 18, in main
s = dialog_line.strip().lower().decode("utf-8", "ignore")
AttributeError: 'str' object has no attribute 'decode'

This is expected, since each line is already a string, but if I remove decode then it doesn't work.

I got the next error:
ModuleNotFoundError: No module named 'text_correcter_data_readers'

I tried to fix it to adding a path:

import sys
sys.path.append('C:\\my_path\\deep-text-corrector-master')

And adding an empty __init__.py file in the 'deep-text-corrector-master' directory.

But it didn't help either.

Result err

Hi atpaino,
I have run your project, but I cannot get the right result like the examples you give. My result looks like this:
input: you must have girlfriend
output: you must have

Could you help me analyze the reason for this?
Thanks a lot
