
realworldnlp's Introduction

Real-World Natural Language Processing

This repository contains example code for the book "Real-World Natural Language Processing."

AllenNLP (2.5.0 or above) is required to run the example code in this repository.

Examples included in this repository:

realworldnlp's People

Contributors

mathcass, mhagiwara


realworldnlp's Issues

Error in 2.8.1

Hi,

While trying

predictor = SentenceClassifierPredictor(model, dataset_reader=reader)

in Sec. 2.8.1, I'm getting the error

AttributeError: 'StanfordSentimentTreeBankDatasetReader' object has no attribute '_tokenizer'

I see you have made some changes in this commit.

Help: examples/mt/mt.py

I'm trying to reproduce examples/mt/mt.py, but I get a CPU/CUDA error:

File "/opt/conda/lib/python3.6/site-packages/allennlp/models/encoder_decoders/simple_seq2seq.py", line 212, in forward
    state = self._encode(source_tokens)
File "/opt/conda/lib/python3.6/site-packages/allennlp/models/encoder_decoders/simple_seq2seq.py", line 268, in _encode
    embedded_input = self._source_embedder(source_tokens)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
File "/opt/conda/lib/python3.6/site-packages/allennlp/modules/text_field_embedders/basic_text_field_embedder.py", line 123, in forward
    token_vectors = embedder(*tensors)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
File "/opt/conda/lib/python3.6/site-packages/allennlp/modules/token_embedders/embedding.py", line 143, in forward
    sparse=self.sparse)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/functional.py", line 1506, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected object of backend CPU but got backend CUDA for argument #3 'index'

I'm running this in a Kaggle environment.
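For what it's worth, this error usually means the embedding weights and the token indices ended up on different devices. A minimal generic PyTorch sketch of the fix (not the book's code; just the device-placement pattern):

```python
import torch
import torch.nn as nn

# The RuntimeError above means the embedding weights and the token indices
# live on different devices. The usual fix is to pick one device and move
# the model AND every input batch to it before calling forward.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

embedding = nn.Embedding(num_embeddings=10, embedding_dim=4).to(device)
token_ids = torch.tensor([[1, 2, 3]]).to(device)  # indices follow the weights

vectors = embedding(token_ids)
print(vectors.shape)  # torch.Size([1, 3, 4])
```

In AllenNLP the same idea applies: if the model is on the GPU, the data loader's batches must be moved there too before the forward pass.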

typos and errata (last updated 2021/05/18)

  • chapter 1, should be text generation

Finally, a third class of text classification is unconditional text generation, where natural language text is generated stochastically from a model. You can train models so that they can generate some random academic papers, Linux source code, or even some poems and play scripts. For example, Andrej Karpathy trained an RNN model from all works of Shakespeare and succeeded in generating pieces of text that look exactly like his work (http://realworldnlpbook.com/ch1.html#karpathy15):

  • 4.2.3: typo "swtich" (should be "switch") in the pseudocode
def update_gru(state, word):
    new_state = update_hidden(state, word)
 
    switch = get_switch(state, word)
 
    state = swtich * new_state + (1 – switch) * state
 
    return state
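For reference, here is the corrected update as a runnable sketch. The stand-in implementations of update_hidden and get_switch are made up for illustration (in the book these would be learned neural components); only the last line's structure matters:

```python
import numpy as np

def update_hidden(state, word):
    # toy stand-in for a learned transformation
    return np.tanh(state + word)

def get_switch(state, word):
    # toy stand-in for a learned sigmoid gate, with values in (0, 1)
    return 1.0 / (1.0 + np.exp(-(state + word)))

def update_gru(state, word):
    new_state = update_hidden(state, word)
    switch = get_switch(state, word)
    # interpolate between the old and new state, gated by switch
    # (with the typo fixed: "switch", not "swtich")
    return switch * new_state + (1 - switch) * state

state = np.zeros(3)
for word in [np.array([1.0, -1.0, 0.5])] * 2:
    state = update_gru(state, word)
print(state.shape)  # (3,)
```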
  • chapter 5: micro and macro should be switched.

original text:

If these metrics are computed while ignoring entity types, it’s called a micro average. For example, the micro-averaged precision is the total number of true positives of all types divided by the total number of retrieved named entities regardless of the type. On the other hand, if these metrics are computed per entity type and then get averaged, it’s called a macro average. For example, if the precision for PER and GPE is 80% and 90%, respectively, its macro average is 85%. What AllenNLP computes in the following is the micro average.
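To make the erratum concrete, here is a toy calculation using the corrected terminology. The per-type counts are hypothetical, chosen to reproduce the 80%/90% precisions from the quoted example:

```python
# (true positives, retrieved entities) per type: 8/10 = 80%, 27/30 = 90%
counts = {"PER": (8, 10), "GPE": (27, 30)}

# Macro average: compute precision per type, THEN average those precisions.
macro = sum(tp / ret for tp, ret in counts.values()) / len(counts)

# Micro average: pool the raw counts across all types, THEN compute one precision.
micro = sum(tp for tp, _ in counts.values()) / sum(ret for _, ret in counts.values())

print(round(macro, 4))  # 0.85
print(round(micro, 4))  # 0.875
```

Note the two values differ whenever the types have different numbers of retrieved entities, which is why swapping the terms matters.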

  • 5.6.1

The language detector in a previous chapter already used an RNN with characters as input.

original text:

In the first half of this section, we are going to build an English language model and train it using a generic English corpus. Before we start, we note that the RNN language model we build in this chapter operates on characters, not on words or tokens. All the RNN models we’ve seen so far operate on words, which means the input to the RNN was always sequences of words. On the other hand, the RNN we are going to use in this section takes sequences of characters as the input.

Using GPU

Great Tutorial!
It would be very cool if you could describe how to use a GPU to run it faster.

a lot of code is broken in allennlp 2.0

I'm now reading the book and have noticed a lot of bugs related to allennlp 2.0. Would the author consider upgrading the code to allennlp 2.0, to make it live up to the title "real world NLP"?

It's a pity, because this is, I think, the only book that uses allennlp to tackle a range of general NLP tasks, and I like it very much.

Some examples:

In the sst_classifier.ipynb one can note:

vocab = Vocabulary.from_instances(train_dataset + dev_dataset,
                                  min_count={'tokens': 3})

gives

unsupported operand type(s) for +: 'generator' and 'generator'

(easily fixable using list(reader.read('train.txt')))

The following two lines

train_dataset.index_with(vocab)
dev_dataset.index_with(vocab)

give

'generator' object has no attribute 'index_with'
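Both failures share the same root cause: in AllenNLP 2.x, reader.read returns a generator, and generators support neither + nor list-style methods. A pure-Python illustration (no AllenNLP needed; read below is a stand-in for DatasetReader.read):

```python
def read(path):
    # stand-in for an AllenNLP 2.x DatasetReader.read, which yields lazily
    for i in range(3):
        yield (path, i)

train, dev = read("train.txt"), read("dev.txt")
try:
    combined = train + dev  # reproduces the first error above
except TypeError as err:
    print(err)  # unsupported operand type(s) for +: 'generator' and 'generator'

# Fix: materialize the generators into lists before combining or indexing.
train, dev = list(read("train.txt")), list(read("dev.txt"))
combined = train + dev
print(len(combined))  # 6
```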

And, not specific to allennlp 2.0,

predictor = SentenceClassifierPredictor(model, dataset_reader=reader)

gives

AttributeError: 'StanfordSentimentTreeBankDatasetReader' object has no attribute '_tokenizer'

Positive label for F1 measure is not configured correctly

I reviewed the code: examples/sentiment/sst_classifier.py, and found a bug.

    self.f1_measure = F1Measure(4)

I think this code is intended to measure precision/recall/F1 for the label '4', which is the most positive sentiment. However, the integer 4 here is interpreted as an index into the label array. The label '4' must first be converted to its index using the label mapping stored in vocab.
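A pure-Python sketch of why the hard-coded 4 is wrong. The mapping below is hypothetical; in AllenNLP the lookup would be along the lines of vocab.get_token_index('4', namespace='labels'):

```python
# Labels are the *strings* '0'..'4', but F1Measure takes the label's index in
# the vocabulary, which depends on the order labels were first encountered,
# not on the label's string value.
label_to_index = {'3': 0, '4': 1, '2': 2, '1': 3, '0': 4}  # hypothetical order

wrong = 4                    # hard-coded: selects whatever sits at index 4 ('0' here)
right = label_to_index['4']  # look the label up in the vocabulary instead

print(wrong, right)  # 4 1
```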

Module Not Found Error for Machine Translation

ModuleNotFoundError: No module named 'allennlp.data.dataset_readers.seq2seq'
I'm trying to run the code examples/mt/mt.py with allennlp==1.0.0 and got this error. No code changes; this is a direct clone of the repo.
