
Comments (10)

lyeoni commented on July 22, 2024

@davidniki02
I wrote some simple sample code that shows how to save and load the model.
After loading the model, you can run prediction/classification as you did before.

# save model architecture 
model_json = model.to_json()
with open("model.json", "w") as json_file : 
    json_file.write(model_json)

# save weights 
model.save_weights("model.h5")

# load model architecture
from keras.models import model_from_json
with open("model.json", "r") as json_file:
    loaded_model_json = json_file.read()
loaded_model = model_from_json(loaded_model_json)

# load weights
loaded_model.load_weights("model.h5")

from nlp-tutorial.

davidniki02 commented on July 22, 2024

@lyeoni thanks, but I think we also need to store the tokenizer? I stored it using pickle.

Here is my code but it predicts the same category all the time:


from keras.models import load_model
import keras.preprocessing.text as kpt
import numpy as np
import pandas as pd
import pickle

# loading
model = load_model('model_20190519141343.h5')

handle = open('tokenizer.pickle', 'rb')
tk = pickle.load(handle)

data = pd.read_json('News_Category_Dataset_v2.json', lines=True).drop(['authors', 'date', 'link'], axis=1)

while 1:
	text = input("Say something: ")
	
	if len(text) == 0:
		break

	#tk.fit_on_texts(text)
	#converts the texts to the index equivalents in our dictionary
	pred = tk.texts_to_sequences([text])
	print(pred)
	
	#onehot representation of all words in the evaluation text, and how they appear in our dictionary
	#input = tk.sequences_to_matrix(pred, mode='binary')
	
	arr = np.zeros(50).reshape(1, 50)
	print(arr)
	for i, word in enumerate(pred[0]):
		arr[0][i] = word
	print(arr)

	prediction = model.predict(arr)
	print(prediction)
	cls = np.argmax(prediction)
	
	print(cls)
	print(data['category'][cls])


lyeoni commented on July 22, 2024

@davidniki02

It depends on what kind of tokenizer you use.
For example, if you use MosesTokenizer (in nltk.tokenize.moses), you don't need to save or load the tokenizer. Just instantiate it and use the returned tokenizer instance.

>>> m = MosesTokenizer()
>>> m.tokenize('2016, pp.')
    [u'2016', u',', u'pp', u'.']


davidniki02 commented on July 22, 2024

Thanks @lyeoni,
You use MosesTokenizer in tokenization_en.py, but data_loader.py uses Tokenizer from keras.preprocessing.text.
I updated token_to_index in data_loader to store the tokenizer:
self.tokenizer = tokenizer
and then saved it during training:

    # save model
    model.save('model_'+current+'.h5')
    print('MODEL SAVED')

    # save tokenizer
    with open('tokenizer.pickle', 'wb') as handle:
        pickle.dump(loader.tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)

The problem is that it predicts the same category all the time (with the same Huffpost dataset you used).
Also, MosesTokenizer has no texts_to_sequences method, so I don't think it can be used directly for prediction?


davidniki02 commented on July 22, 2024

Any luck, @lyeoni ?


lyeoni commented on July 22, 2024

Sorry for the delay in replying.

The reason I used two tokenizers (MosesTokenizer and the Keras Tokenizer) is:

  • The actual tokenization is performed only by MosesTokenizer in tokenization.py. The Keras Tokenizer in data_loader.py is used only to convert words into indexes and build the vocabulary.
  • So, if you tokenize a corpus that was already tokenized by MosesTokenizer with the Keras Tokenizer, there's not much difference.

@davidniki02,
I want to know whether your tokenizer is saved and loaded properly, because I'm not sure that this method of saving the tokenizer works well. Please check that a sample text is tokenized correctly after loading.
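One way to run that check is to encode the same sample text before saving and after loading and compare the results. The snippet below is only a minimal sketch: `word_index` and `to_sequence` are hypothetical stand-ins for the fitted tokenizer's vocabulary and its texts_to_sequences call, so that it runs without Keras installed. With the real tokenizer, the same comparison would be made on the output of texts_to_sequences.

```python
import pickle

# Hypothetical stand-in for a fitted tokenizer's vocabulary (word -> integer id).
word_index = {"UNK": 1, "the": 2, "market": 3, "falls": 4}

def to_sequence(text, vocab, oov_token="UNK"):
    """Map each lowercased whitespace token to its id; unknown words get the OOV id."""
    return [vocab.get(tok, vocab[oov_token]) for tok in text.lower().split()]

sample = "The market falls sharply"
before = to_sequence(sample, word_index)

# Round-trip through pickle in memory, as the file-based save/load would do.
restored = pickle.loads(pickle.dumps(word_index, protocol=pickle.HIGHEST_PROTOCOL))
after = to_sequence(sample, restored)

# If these differ, the saved object is not the one that was used for training.
print(before == after, before, after)
```

If the two sequences match, the save/load step is not the cause of the constant prediction, and the problem is more likely in how the sequence is fed to the model.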


davidniki02 commented on July 22, 2024

Thanks for replying @lyeoni
MosesTokenizer does not return the numeric representation that we need to pass to model.predict.

Here is the latest code, which does not work:

# loading
model = load_model('model_20190519141343.h5')

handle = open('tokenizer.pickle', 'rb')
#tk = pickle.load(handle)
tk = MosesTokenizer()

data = pd.read_json('News_Category_Dataset_v2.json', lines=True).drop(['authors', 'date', 'link'], axis=1)

while 1:
	text = input("Say something: ")
	
	if len(text) == 0:
		break

	#tk.fit_on_texts(text)
	#converts the texts to the index equivalents in our dictionary
	tokens = tk.tokenize(text.strip(), escape=False)
	print(tokens)

	arr = np.zeros(50).reshape(1, 50)
	print(arr)
	for i, word in enumerate(tokens):
		arr[0][i] = word
	print(arr)

	prediction = model.predict(arr)
	print(prediction)
	cls = np.argmax(prediction)
	
	print(cls)
	print(data['category'][cls])

I probably need to use keras tokenizer to convert it to numbers (and get rid of the numpy array), e.g.

tokenizer = Tokenizer(num_words = 50000+1, oov_token='UNK')
tokenizer.texts_to_sequences(tokens)

but I don't know whether I need to reload the corpus and concatenate the new text to it, whether I need to call fit_on_texts again, etc.

# token_to_index
tokens = tokenized_corpus.apply(lambda i: i.split())
tokenizer.fit_on_texts(tokens)
tokenizer.word_index = {word:index for word, index in tokenizer.word_index.items() if index <= 50000}

It's getting a bit confusing. Can you show how the code should actually look?


lyeoni commented on July 22, 2024

@davidniki02 ,

In your code, the tokenizer is initialized and fit every time.
However, fit_on_texts should be called once, on the entire corpus that you have.
(reference link: http://faroit.com/keras-docs/1.2.2/preprocessing/text/)

  • fit_on_texts(texts):
    • Arguments:
      • texts: list of texts to train on.

Also, you don't have to use MosesTokenizer, because the Keras Tokenizer works well enough.
Without MosesTokenizer, just use the following code (tokenized_corpus can be the original corpus, not pre-processed):

def token_to_index(self, tokenized_corpus, maximum_word_num):
    tokenizer = Tokenizer(num_words = maximum_word_num+1, oov_token='UNK')

    # tokenizer fitting (token to index number)
    tokens = tokenized_corpus.apply(lambda i: i.split())
    tokenizer.fit_on_texts(tokens)

    # build vocabulary
    tokenizer.word_index = {word:index for word, index in tokenizer.word_index.items() if index <= maximum_word_num}
    vocabulary = tokenizer.word_index

    # texts_to_sequences changes words into indexes
    return vocabulary, tokenizer.texts_to_sequences(tokens)
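To illustrate why the tokenizer should be fit once on the whole corpus and then reused, here is a small sketch in plain Python. `fit_vocab` and `encode` are hypothetical helpers that only imitate a frequency-ordered vocabulary with a reserved unknown-word id (the id 1 for 'UNK' is an assumption made for this sketch); they are not the Keras implementation.

```python
from collections import Counter

def fit_vocab(texts, oov_token="UNK"):
    """Build word -> id ordered by descending frequency; id 1 is reserved for OOV."""
    counts = Counter(tok for text in texts for tok in text.lower().split())
    vocab = {oov_token: 1}
    for word, _ in counts.most_common():
        vocab[word] = len(vocab) + 1
    return vocab

def encode(text, vocab, oov_token="UNK"):
    """Convert a text to ids with an already fitted vocabulary (no refitting)."""
    return [vocab.get(tok, vocab[oov_token]) for tok in text.lower().split()]

corpus = [
    "stocks fall as markets react",
    "markets rally after vote",
    "senate vote delayed",
]
new_text = "senate vote today"

# Fit once on the training corpus, then reuse the same vocabulary at prediction time.
corpus_vocab = fit_vocab(corpus)
ids_from_corpus = encode(new_text, corpus_vocab)

# Refitting on the new text alone assigns different ids to the same words.
ids_from_refit = encode(new_text, fit_vocab([new_text]))

print(ids_from_corpus)
print(ids_from_refit)
```

The model was trained on the ids from the corpus vocabulary, so ids produced by a freshly fitted tokenizer point at unrelated rows of the embedding, which can show up as meaningless or constant predictions.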


davidniki02 commented on July 22, 2024

Thanks @lyeoni , that is exactly the part I don't get:
How do I use the tokenizer on a new text (e.g. passed from the command prompt)?
The code from data_loader.py loads the corpus and fits the tokenizer, and I altered the code a bit to store the tokenizer for later access: self.tokenizer = tokenizer

But using the tokenizer to transform a new text (hence the text = input("Say something: ") code) into something the model would understand is the troublesome part. Can you show a sample for that, please?
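For reference, the remaining step after converting a new text to ids is padding it to the fixed input length (50 in this thread). The sketch below uses hypothetical helpers, `pad_pre` and `pad_post`, to show the two layouts in plain Python. Keras `pad_sequences` pads and truncates at the front by default, whereas the manual loop earlier in the thread fills the array from the left. Which layout the model expects depends on how the training data was padded, which is an assumption here.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad with `value`; if too long, keep only the last `maxlen` ids."""
    seq = list(seq)[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq

def pad_post(seq, maxlen, value=0):
    """Right-pad with `value`; if too long, keep only the first `maxlen` ids."""
    seq = list(seq)[:maxlen]
    return seq + [value] * (maxlen - len(seq))

ids = [5, 9, 3]  # hypothetical ids for a three-word input

front_padded = pad_pre(ids, 6)   # zeros first, ids at the end
back_padded = pad_post(ids, 6)   # ids first, zeros at the end

print(front_padded)
print(back_padded)
```

Feeding the model a layout different from the one used in training changes where the words sit in the input, so it is worth making the prediction-time padding match the training-time padding exactly.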


davidniki02 commented on July 22, 2024

@lyeoni, I think I got it right this time:

from keras.models import load_model
from nltk.tokenize.moses import MosesTokenizer
import keras.preprocessing.text as kpt
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer, text_to_word_sequence
import numpy as np
import pandas as pd
import pickle
import data_loader

# load model architecture
from keras.models import model_from_json

# loading
model = load_model('model_20190519141343.h5')

# load the tokenizer that was pickled during training
with open('tokenizer.pickle', 'rb') as handle:
    tk = pickle.load(handle)

loader = data_loader.DataLoader("corpus.tk.txt", "corpus.tk.vec.txt", "News_Category_Dataset_v2.json")
loader.load_cat()
print(loader.category_dict)
while 1:
	text = input("Say something: ")
	
	if len(text) == 0:
		break

	seq = tk.texts_to_sequences([text.strip()])
	x = pad_sequences(seq, 50)
	print(x)
	prediction = model.predict(x)
	print(prediction)
	cls = np.argmax(prediction)
	
	print(cls)
	print(loader.category_dict[cls])

That being said, the predictions are sometimes really off. I trained the model on the headlines, which yields a higher accuracy than summaries (80%), but when tested against something like "Facebook Accused Of Reading Texts And Accessing Microphones In Lawsuit" (which is even in the News dataset), the answer is "POLITICS".
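When a single prediction looks wrong, it can help to look at the few most probable categories instead of only the argmax, to see whether the expected label was a close second. The following is a rough sketch: `top_k` is a hypothetical helper, and the `category_dict` and probability values are made up for illustration, not results from the model. With the real model, the row would be `prediction[0]`.

```python
def top_k(probabilities, index_to_label, k=3):
    """Return the k most probable (label, probability) pairs, highest first."""
    ranked = sorted(enumerate(probabilities), key=lambda pair: pair[1], reverse=True)
    return [(index_to_label[i], p) for i, p in ranked[:k]]

# Hypothetical index -> category mapping and one row of softmax output.
category_dict = {0: "POLITICS", 1: "TECH", 2: "BUSINESS", 3: "SPORTS"}
row = [0.41, 0.38, 0.16, 0.05]

for label, p in top_k(row, category_dict):
    print(f"{label}: {p:.2f}")
```

A narrow margin between the first two entries suggests the model is uncertain about that headline rather than confidently wrong.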

What results do you get? How accurate is the model?

