intuitionengineeringteam / chars2vec

Character-based word embeddings model based on RNN for handling real world texts

License: Apache License 2.0

Python 100.00%

natural-language-processing natural-language-understanding language-model embeddings

chars2vec's Introduction

chars2vec

Character-based word embeddings model based on RNN

The chars2vec library can be very useful when you are dealing with texts that contain abbreviations, slang, typos, or other domain-specific vocabulary. The chars2vec language model is based on the symbolic representation of words: it maps each word to a vector of fixed length. These vector representations are produced by a custom neural network that is trained on pairs of similar and non-similar words. The network includes an LSTM that reads the sequence of characters in each word. The model maps similarly written words to nearby vectors, which makes it possible to embed any sequence of characters in the vector space. A chars2vec model does not keep a dictionary of embeddings; it generates embedding vectors on the fly using the pretrained model.

Pretrained models of dimensions 50, 100, 150, 200 and 300 are available for the English language. The library also provides a convenient API for training a model on an arbitrary set of characters. For more details about the architecture, read "Chars2vec: Character-based language model for handling real world texts with spelling errors and human slang" on Hacker Noon.

The model is available for Python 2.7 and 3.0+.

Installation

1. Build and install from source

Download the project source and run in your command line

>> python setup.py install

2. Via pip

Run in your command line

>> pip install chars2vec

Usage

The function chars2vec.load_model(str path) initializes the model from a directory and returns a chars2vec.Chars2Vec object. There are 5 pretrained English models with dimensions 50, 100, 150, 200 and 300. To load one of these pretrained models:

import chars2vec

# Load an Intuition Engineering pretrained model
# Model names: 'eng_50', 'eng_100', 'eng_150', 'eng_200', 'eng_300'
c2v_model = chars2vec.load_model('eng_50')

The method chars2vec.Chars2Vec.vectorize_words(words) returns a numpy.ndarray of shape (n_words, dim) containing the word embeddings.

words = ['list', 'of', 'words']

# Create word embeddings
word_embeddings = c2v_model.vectorize_words(words)
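
Since the model maps similarly written words to nearby vectors, a natural next step is to compare rows of word_embeddings. The following is a minimal sketch using cosine similarity, which is one common choice; the README does not say which metric the authors use. The helper cosine_similarity and the three-dimensional toy vectors are made up for illustration and stand in for real chars2vec output, which has 50 to 300 dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length numeric sequences."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)

# Made-up stand-ins for rows of `word_embeddings` (NOT real model output).
toy_embeddings = {
    'mechanizing': [0.9, 0.1, 0.3],
    'mecbanizing': [0.8, 0.2, 0.3],
    'afear': [-0.2, 0.9, -0.5],
}

sim_close = cosine_similarity(toy_embeddings['mechanizing'],
                              toy_embeddings['mecbanizing'])
sim_far = cosine_similarity(toy_embeddings['mechanizing'],
                            toy_embeddings['afear'])
```

With real output, the same function can be applied to any two rows of the array returned by vectorize_words.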

Training

The function chars2vec.train_model(int emb_dim, X_train, y_train, model_chars) creates and trains a new chars2vec model and returns a chars2vec.Chars2Vec object.

The parameter emb_dim is the dimension of the model.

The parameter X_train is a list or numpy.ndarray of word pairs. The parameter y_train is a list or numpy.ndarray of target values that describe the proximity of the words in each pair.

The training set (X_train, y_train) consists of pairs of "similar" and "not similar" words; a pair of "similar" words is labeled with a target value of 0, and a pair of "not similar" words with 1.
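
One possible way to build such a training set is to pair each word with a randomly corrupted copy of itself (target 0) and with a different word (target 1). The sketch below illustrates that idea only; it is not necessarily how the pretrained models' data was generated, and the helpers add_typo and make_pairs, as well as the small vocabulary, are hypothetical.

```python
import random

def add_typo(word, alphabet, rng):
    """Return `word` with one random character edit (substitute, delete, or insert)."""
    pos = rng.randrange(len(word))
    op = rng.choice(['substitute', 'delete', 'insert'])
    if op == 'substitute':
        return word[:pos] + rng.choice(alphabet) + word[pos + 1:]
    if op == 'delete' and len(word) > 1:
        return word[:pos] + word[pos + 1:]
    # Insert a random character (also the fallback for one-letter words).
    return word[:pos] + rng.choice(alphabet) + word[pos:]

def make_pairs(vocabulary, alphabet, seed=0):
    """Build (X, y): typo pairs labeled 0, pairs of different words labeled 1."""
    rng = random.Random(seed)
    X, y = [], []
    for word in vocabulary:
        # "Similar" pair: the word and a corrupted copy of it.
        X.append((word, add_typo(word, alphabet, rng)))
        y.append(0)
        # "Not similar" pair: the word and some other vocabulary word.
        other = rng.choice([w for w in vocabulary if w != word])
        X.append((word, other))
        y.append(1)
    return X, y

vocabulary = ['discovery', 'mechanizing', 'cirrhosis']  # illustrative only
alphabet = 'abcdefghijklmnopqrstuvwxyz'
X_train, y_train = make_pairs(vocabulary, alphabet)
```

The resulting X_train and y_train have the shape that chars2vec.train_model expects; a real training set would need far more than three words.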

The parameter model_chars is the list of characters the model recognizes. Characters that are not in the model_chars list are ignored by the model.
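
To make the effect of model_chars concrete, here is a minimal sketch of how a word could be turned into a sequence of per-character vectors while skipping unknown characters. It assumes a one-hot encoding, which the README does not state, and the function encode_word is hypothetical rather than part of the chars2vec API. The character set here is also smaller than the one in the training example.

```python
import string

# Illustrative character set: digits followed by lowercase ASCII letters.
model_chars = list(string.digits + string.ascii_lowercase)

def encode_word(word, model_chars):
    """One-hot encode each character of `word`, ignoring unknown characters."""
    index = {ch: i for i, ch in enumerate(model_chars)}
    vectors = []
    for ch in word:
        if ch not in index:
            continue  # characters outside model_chars are dropped
        one_hot = [0] * len(model_chars)
        one_hot[index[ch]] = 1
        vectors.append(one_hot)
    return vectors

encoded = encode_word('prot$oplasmatic', model_chars)  # '$' is not in this set
```

Note that uppercase letters are also absent from this set, so a word would need to be lowercased first if its capital letters should be kept.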

Read more about chars2vec training and the generation of a training dataset in the article about chars2vec.

The function chars2vec.save_model(c2v_model, str path_to_model) saves the trained model to a directory.

import chars2vec

dim = 50
path_to_model = 'path/to/model/directory'

X_train = [('mecbanizing', 'mechanizing'), # similar words, target is equal 0
           ('dicovery', 'dis7overy'), # similar words, target is equal 0
           ('prot$oplasmatic', 'prtoplasmatic'), # similar words, target is equal 0
           ('copulateng', 'lzateful'), # not similar words, target is equal 1
           ('estry', 'evadin6'), # not similar words, target is equal 1
           ('cirrfosis', 'afear') # not similar words, target is equal 1
          ]

y_train = [0, 0, 0, 1, 1, 1]

model_chars = ['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.',
               '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '<',
               '=', '>', '?', '@', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i',
               'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w',
               'x', 'y', 'z']

# Create and train chars2vec model using given training data
my_c2v_model = chars2vec.train_model(dim, X_train, y_train, model_chars)

# Save your pretrained model
chars2vec.save_model(my_c2v_model, path_to_model)

# Load your pretrained model 
c2v_model = chars2vec.load_model(path_to_model)

For full code examples of usage and model training, see the example_usage.py and example_training.py files.

chars2vec's Issues

Using vectorize_words leads to AttributeError

Hello,

I am using keras==2.9.0 and tensorflow==2.9.1. I'm using the pretrained eng_50 model like so:

c2v_model = chars2vec.load_model('eng_50')

However, when I use the vectorize_words method on my list of strings, I get the following AttributeError:

c2v_model.vectorize_words(std_job_list)

AttributeError Traceback (most recent call last)
in
----> 1 c2v_model.vectorize_words(std_job_list)

~/anaconda3/envs/JupyterSystemEnv/lib/python3.7/site-packages/chars2vec/model.py in vectorize_words(self, words, maxlen_padseq)
150 list_of_embeddings.append(np.array(current_embedding))
151
--> 152 embeddings_pad_seq = keras.preprocessing.sequence.pad_sequences(list_of_embeddings, maxlen=maxlen_padseq)
153 new_words_vectors = self.embedding_model.predict([embeddings_pad_seq])
154

AttributeError: module 'keras.preprocessing.sequence' has no attribute 'pad_sequences'

I'm not sure if there's a specific version of keras/tensorflow I need to be using, or if I'm missing something else. Any advice on this would be appreciated! Thanks
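
For context, the call that fails in the traceback is pad_sequences, which brings the per-word character sequences to a common length. Below is a minimal pure-Python sketch of that operation on flat lists, assuming the Keras defaults of padding and truncating at the front. The name pad_sequences_sketch is hypothetical and not part of chars2vec or Keras. In Keras releases from around 2.9 the function appears to be exposed as keras.utils.pad_sequences rather than under keras.preprocessing.sequence, so a version mismatch is a likely cause; that is an assumption to check against the installed version.

```python
def pad_sequences_sketch(sequences, maxlen=None, value=0):
    """Left-pad (and left-truncate) sequences to a common length."""
    if maxlen is None:
        # Default: pad everything to the length of the longest sequence.
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        # Keep the last `maxlen` items, i.e. truncate from the front.
        trimmed = list(seq)[-maxlen:] if maxlen > 0 else []
        # Fill the missing positions at the front with `value`.
        padded.append([value] * (maxlen - len(trimmed)) + trimmed)
    return padded

example = pad_sequences_sketch([[1, 2, 3], [4]])
```

In chars2vec the sequence items are per-character vectors rather than integers, but the padding logic is the same idea.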

Training the model without target values

Hi, I came across your article on hackernoon.com (Chars2vec: character-based language model for handling real world texts with spelling errors and…). It is very interesting.

I am wondering if I can train the chars2vec model without the targets (as an unsupervised model). Basically, I have a list of names and want to vectorize it. I have been using TF-IDF from sklearn to vectorize these names. Then I would do some plotting, clustering, and cosine similarity on these vectors.

Thanks.

Is it possible to fine-tune one of your pretrained models?

I've only seen the training example, with which we can create a new model from scratch using our training data. What if we wanted to make one of the pretrained models better suited to our domain data?

Chars2vec for text classification

Thanks for this wonderful library.

Are any examples available of using chars2vec for text classification?

OutOfMemory Error on SageMaker with 6 Million Data

When I run the code on SageMaker with a dataset of 6 million records on an ml.m5.4xlarge instance, I encounter an "OutOfMemory" issue, and the process is killed by SIGKILL (signal 9). This occurs due to high memory consumption during execution. However, it works fine locally on a Mac M1 with the tensorflow-macos package.

Triplet Loss

Just a query: have you tried triplet loss or lossless triplet loss? I think it would produce better embeddings, since we provide fewer examples and the resulting clusters would be visually better.
I am looking for this feature, if it is the right approach.

Also, this will only handle non-word errors and won't work for real-word errors, if error correction is your aim.

Although I liked the approach and guess it is similar to fastText from Facebook, here you are using an RNN whereas they use n-grams. As I am a beginner in all this, can you guide me on which method is preferable, and why?
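
For readers unfamiliar with the triplet loss mentioned in this issue, here is a minimal sketch of one common form of it: the loss is zero once the anchor is closer to the positive example than to the negative one by at least a margin. This is not part of chars2vec, which trains on labeled pairs; the function names and the margin value are illustrative assumptions.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """max(d(anchor, positive) - d(anchor, negative) + margin, 0)."""
    return max(euclidean(anchor, positive)
               - euclidean(anchor, negative)
               + margin, 0.0)

# Positive is near the anchor and negative is far away: no loss.
loss_good = triplet_loss([0, 0], [0, 1], [3, 4])
# Roles swapped: the far point is treated as the positive, so the loss is large.
loss_bad = triplet_loss([0, 0], [3, 4], [0, 1])
```

In an embedding setting, the anchor and positive would be embeddings of a word and a misspelling of it, and the negative an embedding of an unrelated word.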

Begging for the training data

the use case of self.model attribute in the Chars2Vec class?

I am quite new to this field and have just read the source code, but I am a little confused about the actual role of self.model in the Chars2Vec class. Can anyone explain how I could use self.model, or what its role is there? I don't see any method in Chars2Vec using this attribute (except the fit method).

        model_output = keras.layers.Dense(1, activation='sigmoid')(x)

        self.model = keras.models.Model(inputs=[model_input_1, model_input_2], outputs=model_output)
        self.model.compile(optimizer='adam', loss='mae')

not able to reproduce the training procedure

I followed the article below, where I found a piece of code to reproduce the training of chars2vec, but it produces an error. Can you please help? If it works, I want to train it for my own purpose on German text.
website: https://hackernoon.com/chars2vec-character-based-language-model-for-handling-real-world-texts-with-spelling-errors-and-a3e4053a147d

code:

import chars2vec

dim = 50

path_to_model = 'path/to/model/directory'

X_train = [('mecbanizing', 'mechanizing'), # similar words, target is equal 0
           ('dicovery', 'dis7overy'), # similar words, target is equal 0
           ('prot$oplasmatic', 'prtoplasmatic'), # similar words, target is equal 0
           ('copulateng', 'lzateful'), # not similar words, target is equal 1
           ('estry', 'evadin6'), # not similar words, target is equal 1
           ('cirrfosis', 'afear') # not similar words, target is equal 1
          ]
y_train = [0, 0, 0, 1, 1, 1]
model_chars = ['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.',
               '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '<',
               '=', '>', '?', '@', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i',
               'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w',
               'x', 'y', 'z']
my_c2v_model = chars2vec.train_model(dim, X_train, y_train, model_chars)
chars2vec.save_model(my_c2v_model, path_to_model)
words = ['list', 'of', 'words']
c2v_model = chars2vec.load_model(path_to_model)
word_embeddings = c2v_model.vectorize_words(words)

Error:

ValueError                                Traceback (most recent call last)
<ipython-input-14-23a592d19001> in <module>
      1 # Create and train chars2vec model using given training data
----> 2 my_c2v_model = chars2vec.train_model(dim, X_train, y_train, model_chars)
      3 
      4 # Save pretrained model
      5 chars2vec.save_model(my_c2v_model, path_to_model)

C:\ProgramData\Anaconda3\lib\site-packages\chars2vec\model.py in train_model(emb_dim, X_train, y_train, model_chars, max_epochs, patience, validation_split, batch_size)
    235 
    236     targets = [float(el) for el in y_train]
--> 237     c2v_model.fit(X_train, targets, max_epochs, patience, validation_split, batch_size)
    238 
    239     return c2v_model

C:\ProgramData\Anaconda3\lib\site-packages\chars2vec\model.py in fit(self, word_pairs, targets, max_epochs, patience, validation_split, batch_size)
    105         x_2_pad_seq = keras.preprocessing.sequence.pad_sequences(x_2)
    106 
--> 107         self.model.fit([x_1_pad_seq, x_2_pad_seq], targets,
    108                        batch_size=batch_size, epochs=max_epochs,
    109                        validation_split=validation_split,

~\AppData\Roaming\Python\Python38\site-packages\tensorflow\python\keras\engine\training.py in _method_wrapper(self, *args, **kwargs)
    106   def _method_wrapper(self, *args, **kwargs):
    107     if not self._in_multi_worker_mode():  # pylint: disable=protected-access
--> 108       return method(self, *args, **kwargs)
    109 
    110     # Running inside `run_distribute_coordinator` already.

~\AppData\Roaming\Python\Python38\site-packages\tensorflow\python\keras\engine\training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_batch_size, validation_freq, max_queue_size, workers, use_multiprocessing)
   1037       # `Tensor` and `NumPy` input.
   1038       (x, y, sample_weight), validation_data = (
-> 1039           data_adapter.train_validation_split(
   1040               (x, y, sample_weight), validation_split=validation_split))
   1041 

~\AppData\Roaming\Python\Python38\site-packages\tensorflow\python\keras\engine\data_adapter.py in train_validation_split(arrays, validation_split)
   1372   unsplitable = [type(t) for t in flat_arrays if not _can_split(t)]
   1373   if unsplitable:
-> 1374     raise ValueError(
   1375         "`validation_split` is only supported for Tensors or NumPy "
   1376         "arrays, found following types in the input: {}".format(unsplitable))

ValueError: `validation_split` is only supported for Tensors or NumPy arrays, found following types in the input: [<class 'float'>, <class 'float'>, <class 'float'>, <class 'float'>, <class 'float'>, <class 'float'>]

I also tried converting the targets to a NumPy array, but that did not resolve the error:

import numpy as np

y_train = [0, 0, 0, 1, 1, 1]
y_train = np.array(y_train)

chars2vec.load_model('eng_50')- got an unexpected keyword argument 'maximum_iterations'

import keras
import chars2vec
c2v_model = chars2vec.load_model('eng_50')

I am facing the error below while loading the model:

/anaconda3/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py in rnn(step_function, inputs, initial_states, go_backwards, mask, constants, unroll, input_length)
3009 parallel_iterations=32,
3010 swap_memory=True,
-> 3011 maximum_iterations=input_length)
3012 last_time = final_outputs[0]
3013 output_ta = final_outputs[1]

TypeError: while_loop() got an unexpected keyword argument 'maximum_iterations'

Release the new version

Please release a new version after the latest tensorflow update.

Error when installing through pip

I tried to install chars2vec by pip:

pip3 install chars2vec

But it shows

error: can't copy 'chars2vec/trained_models/eng_200': doesn't exist or not a regular file

Does anyone have the same issue?

The full log is:

v-chlou@SRGSSD-06:~$ pip3 install chars2vec
Collecting chars2vec
  Using cached https://files.pythonhosted.org/packages/04/0a/8c327aae23e0532d239ec7b30446aca765eb5d9547b4c4b09cdd82e49797/chars2vec-0.1.7.tar.gz
Building wheels for collected packages: chars2vec
  Running setup.py bdist_wheel for chars2vec ... error
  Complete output from command /usr/bin/python3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-_cllchv2/chars2vec/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" bdist_wheel -d /tmp/tmpeqgl8qr_pip-wheel- --python-tag cp35:
  Could not open requirements file: [Errno 2] No such file or directory: 'requirements.txt'
  You are using pip version 8.1.1, however version 19.3.1 is available.
  You should consider upgrading via the 'pip install --upgrade pip' command.
  running bdist_wheel
  running build
  running build_py
  creating build
  creating build/lib
  creating build/lib/chars2vec
  copying chars2vec/__init__.py -> build/lib/chars2vec
  copying chars2vec/model.py -> build/lib/chars2vec
  running egg_info
  writing top-level names to chars2vec.egg-info/top_level.txt
  writing dependency_links to chars2vec.egg-info/dependency_links.txt
  writing chars2vec.egg-info/PKG-INFO
  warning: manifest_maker: standard file '-c' not found
  
  reading manifest file 'chars2vec.egg-info/SOURCES.txt'
  reading manifest template 'MANIFEST.in'
  warning: no files found matching '*' under directory 'chars2vec/'
  writing manifest file 'chars2vec.egg-info/SOURCES.txt'
  copying chars2vec/.DS_Store -> build/lib/chars2vec
  creating build/lib/chars2vec/__pycache__
  copying chars2vec/__pycache__/__init__.cpython-36.pyc -> build/lib/chars2vec/__pycache__
  copying chars2vec/__pycache__/model.cpython-36.pyc -> build/lib/chars2vec/__pycache__
  creating build/lib/chars2vec/trained_models
  copying chars2vec/trained_models/.DS_Store -> build/lib/chars2vec/trained_models
  creating build/lib/chars2vec/trained_models/eng_100
  copying chars2vec/trained_models/eng_100/model.pkl -> build/lib/chars2vec/trained_models/eng_100
  copying chars2vec/trained_models/eng_100/weights.h5 -> build/lib/chars2vec/trained_models/eng_100
  creating build/lib/chars2vec/trained_models/eng_150
  copying chars2vec/trained_models/eng_150/model.pkl -> build/lib/chars2vec/trained_models/eng_150
  copying chars2vec/trained_models/eng_150/weights.h5 -> build/lib/chars2vec/trained_models/eng_150
  creating build/lib/chars2vec/trained_models/eng_200
  copying chars2vec/trained_models/eng_200/model.pkl -> build/lib/chars2vec/trained_models/eng_200
  copying chars2vec/trained_models/eng_200/weights.h5 -> build/lib/chars2vec/trained_models/eng_200
  creating build/lib/chars2vec/trained_models/eng_300
  copying chars2vec/trained_models/eng_300/model.pkl -> build/lib/chars2vec/trained_models/eng_300
  copying chars2vec/trained_models/eng_300/weights.h5 -> build/lib/chars2vec/trained_models/eng_300
  creating build/lib/chars2vec/trained_models/eng_50
  copying chars2vec/trained_models/eng_50/model.pkl -> build/lib/chars2vec/trained_models/eng_50
  copying chars2vec/trained_models/eng_50/weights.h5 -> build/lib/chars2vec/trained_models/eng_50
  error: can't copy 'chars2vec/trained_models/eng_200': doesn't exist or not a regular file