
Introduction

Post-OCR Document Correction with Large Ensembles of Character Sequence-to-Sequence Models

This is the source code for the paper Post-OCR Document Correction with Large Ensembles of Character Sequence-to-Sequence Models by Ramirez-Orta et al. (2021).

Abstract

In this paper, we propose a novel method to extend sequence-to-sequence models to accurately process sequences much longer than the ones used during training while being sample- and resource-efficient, supported by thorough experimentation. To investigate the effectiveness of our method, we apply it to the task of correcting documents already processed with Optical Character Recognition (OCR) systems using sequence-to-sequence models based on characters. We test our method on nine languages of the ICDAR 2019 competition on post-OCR text correction and achieve a new state-of-the-art performance in five of them. The strategy with the best performance involves splitting the input document into character n-grams and combining their individual corrections into the final output using a voting scheme that is equivalent to an ensemble of a large number of sequence models. We further investigate how to weigh the contribution of each member of this ensemble.
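The sliding-window voting scheme described above can be sketched as follows. This is a self-contained toy, not the paper's implementation: `correct_window` stands in for the character sequence-to-sequence model (here it only fixes a toy "0" → "o" confusion), and per-window corrections are assumed to preserve length so that votes align by position.

```python
from collections import Counter

def correct_window(window: str) -> str:
    # Stand-in for a character seq2seq model; here it just fixes a toy
    # confusion ("0" -> "o") so the example is self-contained.
    return window.replace("0", "o")

def correct_by_voting(text: str, n: int = 5) -> str:
    # Each character position is covered by up to n overlapping windows;
    # collect the candidate character each window proposes for that position
    # and keep the majority vote. Assumes len(text) >= n.
    votes = [Counter() for _ in text]
    for start in range(len(text) - n + 1):
        corrected = correct_window(text[start:start + n])
        for offset, ch in enumerate(corrected):
            votes[start + offset][ch] += 1
    return "".join(c.most_common(1)[0][0] for c in votes)

print(correct_by_voting("c0rrecti0n"))  # -> "correction"
```

Because every window votes on each position it covers, a single bad correction is outvoted by its neighbors, which is what makes the scheme behave like a large ensemble.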

Usage

Contents

  • The data folder contains the model parameters and architecture specifications to reconstruct the models for each language (this is created after running download_data.py).
  • The evaluate folder contains the scripts to reproduce the evaluation results from the paper.
  • The lib folder contains the source code of the sequence-to-sequence models, the code to use them to correct very long strings of characters, and the code to compute the metrics used in the paper.
  • The notebooks folder contains the Jupyter Notebooks to build the datasets required to train the sequence-to-sequence models, as well as the exploratory data analysis of the data from the ICDAR 2019 competition.
  • The tests folder contains scripts to test the installation of the repository.
  • The train folder contains the scripts with hyper-parameters to train the models shown in the paper.
  • The tutorials folder contains use cases on how to use the library.

Installation

git clone https://github.com/jarobyte91/post_ocr_correction.git
cd post_ocr_correction
pip install .

To download the datasets and models

python download_data.py

To reproduce the results from the paper

pip install -r requirements.txt
cd notebooks

To install the Python package

pip install post_ocr_correction

Contribute & Support

License

The project is licensed under the MIT License.

People

Contributors

jarobyte91


Issues

For the inference, how can we do the alignment?

When we do inference, we do not have an aligned input. If we input "Dcmen", the output should be "Documen", which will be longer than the maximum length allowed. Could you please answer this question? Thanks!

vocabulary not found

Hello,

Thank you very much for the published work, it's very interesting.

When trying to use the lib part:

# string is the string of characters to correct
# model is a PyTorch sequence-to-sequence model
# vocabulary is a correspondence between integers and tokens
# (see the preprocessing notebook in the notebooks folder for each language)

from ocr_correction import correct_by_disjoint_window, correct_by_sliding_window

corrected_string = correct_by_disjoint_window(string, model, vocabulary)
# or
corrected_string = correct_by_sliding_window(string, model, vocabulary)
corrected_string = corrected_string.replace("@", "")  # remove the padding character

The vocabulary is absent. After checking the preprocessing notebook (for the French language), I noticed that the associated pickles are not uploaded.

Could you please provide the vocabulary so that I can test the approach?

Thanks in advance,
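While the pickles shipped with the paper are not in the repository, the integer-to-token correspondence they contain can be approximated from training text. The sketch below is a minimal stand-in, not the library's code; the function name and the choice of special tokens (which appear in the decoding snippets elsewhere in this page) are illustrative.

```python
def build_vocabulary(corpus):
    # Special tokens first, then every character seen in the training text.
    specials = ["<PAD>", "<START>", "<END>", "<UNK>"]
    chars = sorted({ch for text in corpus for ch in text})
    token_to_id = {tok: i for i, tok in enumerate(specials + chars)}
    id_to_token = {i: tok for tok, i in token_to_id.items()}
    return token_to_id, id_to_token

token_to_id, id_to_token = build_vocabulary(["docum3nt", "document"])
print(token_to_id["<PAD>"])            # 0
print(id_to_token[token_to_id["d"]])   # "d"
```

For real use the vocabulary must match the one the model was trained with, so a rebuilt mapping only works with a model trained from scratch on it.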

Discussion

I have noticed that after using your method to correct the text, the CER in English is below the baseline. Why is this happening?
Also, as you mentioned in the Discussion section, the weight function does not seem to be very useful. Have you considered using other weight functions, such as introducing word frequency?

Dealing with long text

Hi, may I ask a question about dealing with long texts?

I tried replicating the "Load one of the pre-trained models" example in README.md to correct a long OCR text. My code is below:

text = 'here is the long text ... '
text = correct_punctuation_error(text)
new_source = [list(text)]
X_new = source_index.text2tensor(new_source)
predictions, log_probabilities = seq2seq.beam_search(model, X_new, progress_bar = 0)
just_beam = target_index.tensor2text(predictions[:, 0, :])[0]
just_beam = re.sub(r"<START>|<PAD>|<UNK>|<END>.*", "", just_beam)

However, this raises an error.

How can I fix it?

Thank you so much!
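The usual cause of such failures is feeding the model a sequence longer than the length it was trained on. The windowing idea behind `correct_by_disjoint_window` can be sketched as follows; this is a self-contained illustration, not the library's implementation, and `correct_fn` stands in for a full beam-search decode of one chunk.

```python
def correct_in_disjoint_windows(text, correct_fn, window_size=10):
    # Split the input into non-overlapping chunks no longer than the
    # length the model was trained on, correct each chunk independently,
    # and concatenate the results.
    chunks = [text[i:i + window_size] for i in range(0, len(text), window_size)]
    return "".join(correct_fn(chunk) for chunk in chunks)

# Toy "model": fix the common 1 -> i and 0 -> o OCR confusions.
fixed = correct_in_disjoint_windows("th1s 1s a l0ng 0cr str1ng",
                                    lambda s: s.replace("1", "i").replace("0", "o"))
print(fixed)  # -> "this is a long ocr string"
```

In practice the library's sliding-window variant is preferable, since disjoint chunks can split a corrupted word across a boundary where neither half has enough context to fix it.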
