ferdinandzhong / punctuator

A small seq2seq punctuator tool based on DistilBERT

License: Apache License 2.0

Makefile 0.64% Python 80.72% Jupyter Notebook 18.64%

bert nlp seq2seq punctuation deep-learning pytorch bert-ner chinese-nlp

punctuator's Introduction

Distilbert-punctuator

Introduction

Distilbert-punctuator is a Python package that provides a BERT-based punctuator (a fine-tuned version of the pretrained Hugging Face DistilBertForTokenClassification model) with the following three components:

- data process: functions for processing your own data to prepare it for training, if you prefer to fine-tune the model on your own data.
- training: a training and evaluation pipeline that you can use to fine-tune your own punctuator.
- inference: an easy-to-use interface for running a trained punctuator.

If you don't want to train a punctuator yourself, two pre-fine-tuned models are available from the Hugging Face model hub:
- Qishuai/distilbert_punctuator_en 📎 Model details
- Qishuai/distilbert_punctuator_zh 📎 Model details

Model examples on the Hugging Face web page:
- English model
- Simplified Chinese model

Installation

To use the punctuator directly, install the package from PyPI: pip install distilbert-punctuator
To also do data processing, install with the data_process option: pip install distilbert-punctuator[data_process]
To train and validate your own model, install with the training option: pip install distilbert-punctuator[training]
For development and contribution:
- clone the repo
- make install

Data Process

Component for pre-processing the training data. To use this component, install the package with pip install distilbert-punctuator[data_process]

The package provides a simple pipeline for generating training data in NER format.
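To illustrate what "NER format" means for punctuation, here is a minimal sketch that turns punctuated text into (token, tag) pairs, where each word is tagged with the mark that follows it. The function name text_to_ner_pairs and the PUNCT_TO_TAG table are hypothetical and not part of the package; the tag names simply mirror the English tags listed under Inference_arguments.

```python
import re
from typing import List, Tuple

# Hypothetical lookup table: tag names follow the English tags in this README,
# but this helper is an illustration, not the package's data pipeline.
PUNCT_TO_TAG = {
    ",": "COMMA",
    ".": "PERIOD",
    "?": "QUESTIONMARK",
    "!": "EXLAMATIONMARK",
}
NORMAL_TOKEN_TAG = "O"


def text_to_ner_pairs(text: str) -> List[Tuple[str, str]]:
    """Turn punctuated text into (lowercased token, tag) pairs.

    Each word is tagged with the punctuation mark that directly follows it,
    or with "O" when no mark follows.
    """
    pairs = []
    # A word (letters, digits, apostrophes), optional spaces, optional mark.
    for word, mark in re.findall(r"([\w']+)\s*([,.?!]?)", text):
        tag = PUNCT_TO_TAG.get(mark, NORMAL_TOKEN_TAG)
        pairs.append((word.lower(), tag))
    return pairs


if __name__ == "__main__":
    for token, tag in text_to_ner_pairs("Hello, how are you? I am fine."):
        print(token, tag)
```

The training pipeline expects integer tags, so in practice the string tags would still be mapped through a label2id dictionary.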

Example

examples/data_sample.py

Train

Component that provides a training pipeline for fine-tuning a pretrained DistilBertForTokenClassification model from Hugging Face. The latest version implements R-Drop enhanced training (see the R-Drop GitHub repo and the R-Drop paper).
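As a rough illustration of the R-Drop idea, the sketch below computes the loss for a single token from two forward passes of the same input (each with different dropout noise): the two cross-entropy terms plus alpha times the symmetric KL divergence between the two predicted distributions. It uses plain Python instead of PyTorch to stay self-contained, and all function names are hypothetical; how this package combines the terms internally is not shown here.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: List[float], q: List[float]) -> float:
    """KL(p || q) for two discrete distributions with non-zero q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def r_drop_loss(
    logits_1: List[float], logits_2: List[float], label: int, alpha: float
) -> float:
    """Sketch of the R-Drop loss for one token.

    logits_1 and logits_2 come from two passes over the same input.
    """
    p1, p2 = softmax(logits_1), softmax(logits_2)
    # Negative log-likelihood of the true label under both passes.
    nll = -math.log(p1[label]) - math.log(p2[label])
    # Symmetric KL term that pulls the two distributions together.
    sym_kl = 0.5 * (kl_divergence(p1, p2) + kl_divergence(p2, p1))
    return nll + alpha * sym_kl


if __name__ == "__main__":
    print(r_drop_loss([2.0, 0.5, 0.1], [1.5, 0.7, 0.2], label=0, alpha=1.0))
```

With alpha set to 0 the KL term vanishes and the loss reduces to the sum of the two cross-entropy terms.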

Example

examples/english_train_sample.py

Training_arguments:

Arguments required for the training pipeline.

basic arguments
- training_corpus(List[List[str]]): list of sequences for training; the longest sequence should be no longer than the pretrained LM's max_position_embeddings
- validation_corpus(List[List[str]]): list of sequences for validation; the longest sequence should be no longer than the pretrained LM's max_position_embeddings
- training_tags(List[List[int]]): tags(int) for training
- validation_tags(List[List[int]]): tags(int) for validation
- model_name_or_path(str): name or path of pre-trained model
- tokenizer_name(str): name of pretrained tokenizer
training arguments
- epoch(int): number of epoch
- batch_size(int): batch size
- model_storage_dir(str): fine-tuned model storage path
- label2id(Dict): the tags label and id mapping
- early_stop_count(int): number of epochs without a decrease in validation loss after which training stops early, default is 3
- gpu_device(int): specific gpu card index, default is the CUDA_VISIBLE_DEVICES from environ
- warm_up_steps(int): warm up steps.
- r_drop(bool): whether to train with r-drop
- r_alpha(int): alpha value for the KL divergence term in the loss, default is 0
- plot_steps(int): number of steps between recordings of the training status to tensorboard
- tensorboard_log_dir(Optional[str]): the tensorboard logs output directory, default is "runs"
model arguments
- addtional_model_config(Optional[Dict]): additional configuration for model
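Because the corpus sequences must fit within the pretrained model's maximum length, long token and tag sequences need to be split before training. The following is a minimal sketch of such a split; split_to_max_length is a hypothetical helper, not a function of this package, and it counts words rather than subword pieces, so a real limit should leave headroom for tokenizer expansion and special tokens.

```python
from typing import List, Tuple


def split_to_max_length(
    tokens: List[str], tags: List[int], max_length: int
) -> List[Tuple[List[str], List[int]]]:
    """Split one aligned (tokens, tags) pair into chunks of at most max_length.

    The chunks keep tokens and tags aligned position by position.
    """
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    return [
        (tokens[start : start + max_length], tags[start : start + max_length])
        for start in range(0, len(tokens), max_length)
    ]


if __name__ == "__main__":
    words = "okay so you see the original file is 820 kilobytes".split()
    word_tags = [0] * len(words)
    for chunk_tokens, chunk_tags in split_to_max_length(words, word_tags, 4):
        print(chunk_tokens, chunk_tags)
```

The resulting chunks can then be collected into training_corpus and training_tags (or their validation counterparts).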

You can also train your own NER models with the trainer provided in this repo. The example can be found in notebooks/R-drop NER.ipynb

Evaluation

Validation of fine-tuned model

Example

examples/train_sample.py

Validation_arguments:

evaluation_corpus(List[List[str]]): list of sequences for evaluation; the longest sequence should be no longer than the pretrained LM's max_position_embeddings (512)
evaluation_tags(List[List[int]]): tags(int) for evaluation (the ground truth)
model_name_or_path(str): name or path of fine-tuned model
tokenizer_name(str): name of tokenizer
batch_size(int): batch size
label2id(Optional[Dict]): label2id. Default one is from model config. Pass in this argument if your model doesn't have a label2id inside config
gpu_device(int): specific gpu card index, default is the CUDA_VISIBLE_DEVICES from environ

Inference

Component that provides an inference interface for using the punctuator.

Architecture

 +----------------------+              (child process)
 |   user application   |             +-------------------+
 +                      + <---------->| punctuator server |
 |   +inference object  |             +-------------------+
 +----------------------+

The punctuator is deployed in a child process that communicates with the main process through a pipe connection. You can therefore initialize an inference object and call its punctuation function whenever needed. The punctuator never blocks the main process except while it is punctuating. The punctuator shuts down gracefully, so you don't need to handle the shutdown yourself.
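The architecture described above can be sketched with the standard multiprocessing module. This is a toy model of the pattern only: ToyInference, _server_loop, and terminate are hypothetical names, the "model" just upper-cases its input, and the sketch assumes a POSIX system where the "fork" start method is available. It does not reproduce the package's actual implementation.

```python
import multiprocessing as mp


def _server_loop(conn) -> None:
    """Child-process loop: answer requests until a None sentinel arrives."""
    while True:
        request = conn.recv()
        if request is None:  # shutdown sentinel sent by the parent
            break
        # Placeholder for real model inference: just upper-case the text.
        conn.send(request.upper())
    conn.close()


class ToyInference:
    """Hypothetical stand-in for an inference object backed by a child process."""

    def __init__(self) -> None:
        # Assumption: "fork" is available (Linux, macOS); it is not on Windows.
        ctx = mp.get_context("fork")
        self._conn, child_conn = ctx.Pipe()
        self._process = ctx.Process(
            target=_server_loop, args=(child_conn,), daemon=True
        )
        self._process.start()

    def punctuation(self, text: str) -> str:
        """Send text to the child and block until the result comes back."""
        self._conn.send(text)
        return self._conn.recv()

    def terminate(self) -> None:
        """Ask the child to exit and wait briefly for it to finish."""
        self._conn.send(None)
        self._process.join(timeout=5)
        self._conn.close()


if __name__ == "__main__":
    inference = ToyInference()
    print(inference.punctuation("hello from the main process"))
    inference.terminate()
```

The main process only blocks inside punctuation while it waits for the reply, which matches the behavior described above.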

Example

examples/inference_sample.py

Inference_arguments

Arguments required for the inference pipeline.

model_name_or_path(str): name or path of pre-trained model
tokenizer_name(str): name of pretrained tokenizer

tag2punctuator(Dict[str, tuple]): tag to punctuator mapping. dbpunctuator.utils provides two default mappings for English and Chinese

NORMAL_TOKEN_TAG = "O"
DEFAULT_ENGLISH_TAG_PUNCTUATOR_MAP = {
    NORMAL_TOKEN_TAG: ("", False),
    "COMMA": (",", False),
    "PERIOD": (".", True),
    "QUESTIONMARK": ("?", True),
    "EXLAMATIONMARK": ("!", True),
}

DEFAULT_CHINESE_TAG_PUNCTUATOR_MAP = {
    NORMAL_TOKEN_TAG: ("", False),
    "C_COMMA": ("，", False),
    "C_PERIOD": ("。", True),
    "C_QUESTIONMARK": ("? ", True),
    "C_EXLAMATIONMARK": ("! ", True),
    "C_DUNHAO": ("、", False),
}

For your own fine-tuned model with different tags, pass in your own mapping.

tag2id_storage_path(Optional[str]): tag2id storage path. Default one is from model config. Pass in this argument if your model doesn't have a tag2id inside config
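To show how such a mapping could be applied, here is a minimal sketch that rebuilds punctuated text from tokens and predicted tags. It assumes that the boolean in each tuple means "this mark ends a sentence, so capitalize the next word"; that reading is an assumption, and apply_tags and ENGLISH_MAP are hypothetical names rather than the package's API.

```python
from typing import Dict, List, Tuple

TagMap = Dict[str, Tuple[str, bool]]

# Copy of the English mapping shown above, inlined to keep the sketch
# self-contained.
ENGLISH_MAP: TagMap = {
    "O": ("", False),
    "COMMA": (",", False),
    "PERIOD": (".", True),
    "QUESTIONMARK": ("?", True),
    "EXLAMATIONMARK": ("!", True),
}


def apply_tags(tokens: List[str], tags: List[str], tag_map: TagMap) -> str:
    """Attach each tag's punctuation mark to its token and join the words.

    Assumption: the boolean flag marks sentence-ending punctuation, after
    which the next word is capitalized.
    """
    words = []
    capitalize_next = True  # the first word starts a sentence
    for token, tag in zip(tokens, tags):
        mark, ends_sentence = tag_map[tag]
        word = token[:1].upper() + token[1:] if capitalize_next else token
        words.append(word + mark)
        capitalize_next = ends_sentence
    return " ".join(words)


if __name__ == "__main__":
    sample_tokens = ["hello", "how", "are", "you", "i", "am", "fine"]
    sample_tags = ["COMMA", "O", "O", "QUESTIONMARK", "O", "O", "PERIOD"]
    print(apply_tags(sample_tokens, sample_tags, ENGLISH_MAP))
```

A mapping for a model with different tags would have the same shape: one (mark, flag) tuple per tag name.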

punctuator's Issues

So many examples but no example for punctuate

Hello. I want to punctuate a big chunk of text, e.g. like the one below. How can I do that? Thank you.

Could you write simple Python code to punctuate the text below?

The text is from my lecture video (https://www.youtube.com/watch?v=_nKwisL8dTs), for which I am trying to generate subtitles. Whisper does very well but fails to punctuate some parts.

okay sorry about this confusion what I did is when I have forgotten to unpause the video is simply I have coded a test button and the test button is using our original static file cmd and gif file cmd and I also fixed something in gif file cmd which is I have removed the loses command because it was giving an error now they are working I am using a wait for exists so let me show you how it works okay okay let me start test so the first process is started it is taking some time because that image is pretty big then it is starting the other one and now they are generated okay so you see original file is 820 kilobytes and let's see how much did we gain okay so 820 minus 572 over 820 you see 30 percent gain we have in this file it is significant and it has zero difference how can I be so sure about that we can be sure about that with a comparison okay so I am going to only make a single line of single pixel of difference here on this web p file and I will save it as a test on my desktop here as a png so I will name it as test to png okay and then I will save my original file as test png on the desktop here then I will use online comparison website let me show you compare image difference okay there are several pages for that so first try with diff checker diff checker is awesome website believe me okay so when I see check the difference there is a single line of difference here on this image so how they achieve this I wonder yeah so here when I hover and when I zoom in okay like this you see there is a single line a single pixel of difference here and no other differences it is exactly same and let's compare with another website okay online diff so first image and the second image so I will make the fuzziness zero and it will show as a red color okay so on this image there is a single pixel difference here which is what I have made and there is no other red dot okay so I can copy this image to zoom in so you see there is no other red dot because they are exactly same 
except the single line single pixel that I have made myself so basically we gain 35 percent 30 percent size in this image and on this gift image we gained from minus to 26.9 over this 35 percent you see with on the gift image we gain 35 percent and let's test if they are working or not so this is our WebP GIF and this is our iponic GIF this is original GIF file and this is WebP file they are looking pretty much same to me we can also use some online websites online GIF to WebP there is one website which I have found working very well this one or yeah let's try this I think it was this one so let's open our debug test so here our GIF upload it then you see there is losing compression mixed compression I unmark them and convert the WebP so this website generated a little bit higher kilobyte because probably it is not using the best compression and that's it okay so we are able to properly convert GIF and static PNG and probably GPX as well we haven't tested GPX so let's also test the GPX for example yeah this wallpaper it's pretty big so it will probably take a lot of time okay let's copy and paste this okay so I will remove this probably we don't even need it right now what is the file name it is this I am not sure if it if it can produce better than GPX because GPX is already losing compression as you know okay let's try it so all processes started at the same time because we are not waiting them and they are running right now as they get completed it will close the window and why it takes so long is that we are using the best possible algorithm and let's see the output okay so yes the WebP file is bigger than the original GPX it is because GPX is already losing and when I save this GPX as a PNG let's see the size okay size of the PNG is this we can of course optimize it a little bit more with PNG out win and I am pretty sure there will be still significant difference between PNG version and WebP version this is a software that I have purchased to optimize my PNG 
files previously but it is not anymore necessary because now we can use WebP format which is much better format okay so this software is single threaded on a single image so it is taking some time it has so many passes okay so the optimized PNG file is 2.53 megabytes and minus 1.52 megabytes over or not this one actually since GPX files are already losing we shouldn't convert them to WebP probably we we cannot we cannot achieve same quality I wonder if there is an losing but no point of converting GPX into WebP let me check that first okay okay same quality for GPX I think we need to have some losing compression probably for GPX compression we need to use some other methodology so let's see which which options we can use okay let's see okay so there is version loses near loses int so we can use near loses for GPX I think okay so which which option should we use I'm not sure I think I will try near loses yeah let's try it with so for that I'm going to have another file it will be for GPX for GPX I'm going to remove loses and change it with near loses with zero and I think I have to remove z9 as well so yeah I have to remove z9 okay let's try this way for GPX okay and this is the file name okay let's test GPX SR or a GPX and let's comment out this is and let's make it like this yeah okay let's see what kind of results we are going to get with GPX command okay so it is done oh wow now we have a better result than original GPX so let's compare two images quality of course I am not expecting them to be same yeah I can see the difference there is already some difference but I am not sure if we have lost some quality or not yeah we have lost some quality as you can see definitely and it is not small as well okay I wonder if it is possible to compress GPX losing quality is this even possible I'm not sure compress GPX okay okay
