Kashgari

A simple and powerful NLP framework: build a state-of-the-art model in 5 minutes for named entity recognition (NER), part-of-speech tagging (PoS), and text classification tasks.

Kashgari is:

  • Human-friendly. Kashgari's code is straightforward, well documented, and tested, which makes it easy to understand and modify.
  • Powerful and simple. Kashgari lets you apply state-of-the-art natural language processing (NLP) models to your text, such as named entity recognition (NER), part-of-speech tagging (PoS), and classification.
  • Keras based. Kashgari builds directly on Keras, making it easy to train your models and experiment with new approaches using different embeddings and model structures.
  • Easy to fine-tune. Kashgari has built-in support for pre-trained BERT and Word2vec embeddings, which makes it simple to fine-tune your model on top of these embeddings.
  • Fully scalable. Kashgari provides a simple, fast, and scalable environment for experimentation.

Feature List

  • Embedding support
    • Classic word2vec embedding
    • BERT embedding
    • GPT-2 embedding
  • Sequence(Text) Classification Models
    • CNNModel
    • BLSTMModel
    • CNNLSTMModel
    • AVCNNModel
    • KMaxCNNModel
    • RCNNModel
    • AVRNNModel
    • DropoutBGRUModel
    • DropoutAVRNNModel
  • Sequence(Text) Labeling Models (NER, PoS)
    • CNNLSTMModel
    • BLSTMModel
    • BLSTMCRFModel
  • Model Training
  • Model Evaluation
  • GPU Support / Multi GPU Support
  • Customize Model

Performance

Task                      Language  Dataset                    Score       Detail
Named Entity Recognition  Chinese   People's Daily NER Corpus  92.20 (F1)  BERT-based Chinese named entity recognition
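
For readers unfamiliar with the metric, here is a minimal sketch of how an entity-level F1 score is commonly computed. The exact evaluation protocol behind the 92.20 figure is not stated here, so this assumes the usual definition where a predicted entity counts only if its span and label both match the gold annotation. The `Entity` type, the `f1_score` helper, and the sample spans are illustrative and not part of Kashgari.

```python
from typing import Set, Tuple

# (start, end, label): a hypothetical representation of one entity span
Entity = Tuple[int, int, str]


def f1_score(gold: Set[Entity], predicted: Set[Entity]) -> float:
    """Harmonic mean of precision and recall over exact-match entities."""
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Made-up example: the second prediction has the right span but the wrong label.
gold = {(0, 2, "PER"), (5, 7, "LOC"), (9, 12, "ORG")}
predicted = {(0, 2, "PER"), (5, 7, "ORG"), (9, 12, "ORG")}

score = f1_score(gold, predicted)
print(f"F1 = {score:.4f}")
```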

Roadmap

Tutorials

Here is a set of quick tutorials to get you started with the library:

There are also articles and posts that illustrate how to use Kashgari:

Quick start

Requirements and Installation

The project is based on Keras 2.2.0+ and Python 3.6+, because it is 2019 and type hints are cool.

pip install kashgari
# CPU
pip install tensorflow==1.12.0
# GPU
pip install tensorflow-gpu==1.12.0
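
Since the project needs Python 3.6+, a script that depends on it could check the interpreter version up front. This is only a sketch; `MIN_PYTHON` and `python_is_supported` are illustrative names and not part of Kashgari.

```python
import sys
from typing import Tuple

# Minimum interpreter version stated in the requirements above
MIN_PYTHON: Tuple[int, int] = (3, 6)


def python_is_supported(version: Tuple[int, int]) -> bool:
    """Return True if the (major, minor) version meets the minimum."""
    return version >= MIN_PYTHON


current = (sys.version_info[0], sys.version_info[1])
if not python_is_supported(current):
    raise SystemExit(
        "Python %d.%d+ is required, found %d.%d" % (MIN_PYTHON + current)
    )
```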

Example Usage

Let's run text classification with a CNN-LSTM model over the SMP 2017 ECDT Task 1 dataset.

>>> from kashgari.corpus import SMP2017ECDTClassificationCorpus
>>> from kashgari.tasks.classification import CNNLSTMModel

>>> x_data, y_data = SMP2017ECDTClassificationCorpus.get_classification_data()
>>> x_data[0]
['你', '知', '道', '我', '几', '岁']
>>> y_data[0]
'chat'

# Available classification models include `CNNModel`, `BLSTMModel`, and `CNNLSTMModel`
>>> classifier = CNNLSTMModel()
>>> classifier.fit(x_data, y_data)
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         (None, 10)                0         
_________________________________________________________________
embedding_1 (Embedding)      (None, 10, 100)           87500     
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 10, 32)            9632      
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 5, 32)             0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 100)               53200     
_________________________________________________________________
dense_1 (Dense)              (None, 32)                3232      
=================================================================
Total params: 153,564
Trainable params: 153,564
Non-trainable params: 0
_________________________________________________________________
Epoch 1/5
 1/35 [..............................] - ETA: 32s - loss: 3.4652 - acc: 0.0469

... 

>>> x_test, y_test = SMP2017ECDTClassificationCorpus.get_classification_data('test')
>>> classifier.evaluate(x_test, y_test)
              precision    recall  f1-score   support
         
        calc       0.75      0.75      0.75         8
        chat       0.83      0.86      0.85       154
    contacts       0.54      0.70      0.61        10
    cookbook       0.97      0.94      0.95        89
    datetime       0.67      0.67      0.67         6
       email       1.00      0.88      0.93         8
         epg       0.61      0.56      0.58        36
      flight       1.00      0.90      0.95        21
...
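
The parameter counts in the model summary above can be reproduced by hand, which is a handy sanity check when you change the architecture. The sketch below uses the standard per-layer formulas. The vocabulary size (875) and the convolution kernel size (3) are not stated in the summary; they are inferred here from the reported counts, so treat them as assumptions. The helper function names are illustrative.

```python
def embedding_params(vocab_size: int, dim: int) -> int:
    # One dim-sized vector per vocabulary entry
    return vocab_size * dim


def conv1d_params(kernel_size: int, in_channels: int, filters: int) -> int:
    # Kernel weights plus one bias per filter
    return kernel_size * in_channels * filters + filters


def lstm_params(input_dim: int, units: int) -> int:
    # Four gates, each with input weights, recurrent weights, and a bias
    return 4 * ((input_dim + units) * units + units)


def dense_params(input_dim: int, units: int) -> int:
    # Weight matrix plus one bias per output unit
    return input_dim * units + units


counts = {
    "embedding_1": embedding_params(vocab_size=875, dim=100),           # assumed vocab size
    "conv1d_1": conv1d_params(kernel_size=3, in_channels=100, filters=32),  # assumed kernel size
    "lstm_1": lstm_params(input_dim=32, units=100),
    "dense_1": dense_params(input_dim=100, units=32),
}
total = sum(counts.values())

for layer, n in counts.items():
    print(f"{layer:<12} {n:>8,}")
print(f"{'total':<12} {total:>8,}")
```

With these assumptions the per-layer numbers come out to 87,500, 9,632, 53,200, and 3,232, matching the summary's total of 153,564.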

Run with GPT-2 Embedding

from kashgari.embeddings import GPT2Embedding
from kashgari.tasks.classification import CNNLSTMModel
from kashgari.corpus import SMP2017ECDTClassificationCorpus

gpt2_embedding = GPT2Embedding('<path-to-gpt-model-folder>', sequence_length=30)                                 
model = CNNLSTMModel(gpt2_embedding)

train_x, train_y = SMP2017ECDTClassificationCorpus.get_classification_data()
model.fit(train_x, train_y)
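
The `sequence_length=30` argument suggests that every input is brought to a fixed length before it reaches the embedding. How Kashgari does this internally is not shown here; the sketch below only illustrates the general idea of character-level tokenization followed by padding or truncation. `PAD_TOKEN` and `pad_or_truncate` are hypothetical names.

```python
from typing import List

PAD_TOKEN = "<PAD>"  # hypothetical padding symbol, not taken from Kashgari


def pad_or_truncate(tokens: List[str], sequence_length: int,
                    pad_token: str = PAD_TOKEN) -> List[str]:
    """Clip the token list to sequence_length, then right-pad if it is shorter."""
    clipped = tokens[:sequence_length]
    return clipped + [pad_token] * (sequence_length - len(clipped))


# Character-level tokens, like the x_data[0] sample shown earlier
tokens = list("你知道我几岁")
padded = pad_or_truncate(tokens, sequence_length=30)
print(len(tokens), len(padded))
```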

Run with BERT Embedding

from kashgari.embeddings import BERTEmbedding
from kashgari.tasks.classification import CNNLSTMModel
from kashgari.corpus import SMP2017ECDTClassificationCorpus

bert_embedding = BERTEmbedding('<bert-model-folder>', sequence_length=30)                                   
model = CNNLSTMModel(bert_embedding)

train_x, train_y = SMP2017ECDTClassificationCorpus.get_classification_data()
model.fit(train_x, train_y)

Run with Word2vec Embedding

from kashgari.embeddings import WordEmbeddings
from kashgari.tasks.classification import CNNLSTMModel
from kashgari.corpus import SMP2017ECDTClassificationCorpus

word2vec_embedding = WordEmbeddings('sgns.weibo.bigram', sequence_length=30)
model = CNNLSTMModel(word2vec_embedding)

train_x, train_y = SMP2017ECDTClassificationCorpus.get_classification_data()
model.fit(train_x, train_y)

Support for Training on Multiple GPUs

from kashgari.embeddings import BERTEmbedding
from kashgari.tasks.classification import CNNLSTMModel

# Placeholder: replace with your own data loading code
train_x, train_y = prepare_your_classification_data()

# Build the model with a pre-trained embedding
bert_embedding = BERTEmbedding('bert-large-cased', sequence_length=128)
model = CNNLSTMModel(bert_embedding)

# Or build it without a pre-trained embedding:
# model = CNNLSTMModel()

# Build model with your corpus
model.build_model(train_x, train_y)

# Add multi gpu support
model.build_multi_gpu_model(gpus=8)

# Train, 256 / 8 = 32 samples for every GPU per batch
model.fit(train_x, train_y, batch_size=256)
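
The comment above relies on the global batch being split evenly across devices. As a quick way to pick a compatible `batch_size`, here is a small sketch of that arithmetic. `per_gpu_batch_size` is an illustrative helper and not part of Kashgari, and rejecting batch sizes that do not divide evenly is a choice made for this example.

```python
def per_gpu_batch_size(batch_size: int, gpus: int) -> int:
    """Return the number of samples each GPU receives per batch."""
    if gpus <= 0:
        raise ValueError("gpus must be a positive integer")
    if batch_size % gpus != 0:
        raise ValueError(
            f"batch_size={batch_size} does not split evenly across {gpus} GPUs"
        )
    return batch_size // gpus


# Matches the example above: 256 samples over 8 GPUs
print(per_gpu_batch_size(256, 8))
```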

Contributing

Thanks for your interest in contributing! There are many ways to get involved; start with the contributor guidelines and then check these open issues for specific tasks.

Reference

This library is inspired by and references the following frameworks and papers.

License
