
Introduction

Post-OCR Document Correction with Large Ensembles of Character Sequence-to-Sequence Models

This is the source code for the paper Post-OCR Document Correction with Large Ensembles of Character Sequence-to-Sequence Models by Ramirez-Orta et al. (2021).

Abstract

In this paper, we propose a novel method to extend sequence-to-sequence models to accurately process sequences much longer than the ones used during training while being sample- and resource-efficient, supported by thorough experimentation. To investigate the effectiveness of our method, we apply it to the task of correcting documents already processed with Optical Character Recognition (OCR) systems using sequence-to-sequence models based on characters. We test our method on nine languages of the ICDAR 2019 competition on post-OCR text correction and achieve a new state-of-the-art performance in five of them. The strategy with the best performance involves splitting the input document into character n-grams and combining their individual corrections into the final output using a voting scheme that is equivalent to an ensemble of a large number of sequence models. We further investigate how to weigh the contribution of each member of this ensemble.
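The sliding-window voting scheme described above can be sketched as follows. This is a self-contained toy, not the paper's implementation: `correct_window` stands in for the character sequence-to-sequence model (here it only fixes a toy "0" → "o" confusion), and per-window corrections are assumed to preserve length so that votes align by position.

```python
from collections import Counter

def correct_window(window: str) -> str:
    # Stand-in for a character seq2seq model; here it just fixes a toy
    # confusion ("0" -> "o") so the example is self-contained.
    return window.replace("0", "o")

def correct_by_voting(text: str, n: int = 5) -> str:
    # Each character position is covered by up to n overlapping windows;
    # collect the candidate character each window proposes for that position
    # and keep the majority vote. Assumes len(text) >= n.
    votes = [Counter() for _ in text]
    for start in range(len(text) - n + 1):
        corrected = correct_window(text[start:start + n])
        for offset, ch in enumerate(corrected):
            votes[start + offset][ch] += 1
    return "".join(c.most_common(1)[0][0] for c in votes)

print(correct_by_voting("c0rrecti0n"))  # -> "correction"
```

Because every window votes on each position it covers, a single bad correction is outvoted by its neighbors, which is what makes the scheme behave like a large ensemble.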

Usage

Contents

  • The data folder contains the model parameters and architecture specifications to reconstruct the models for each language (this is created after running download_data.py).
  • The evaluate folder contains the scripts to reproduce the evaluation results from the paper.
  • The lib folder contains the source code of the sequence-to-sequence models, the code to use them to correct very long strings of characters, and the code to compute the metrics used in the paper.
  • The notebooks folder contains the Jupyter Notebooks to build the datasets required to train the sequence-to-sequence models, as well as the exploratory data analysis of the data from the ICDAR 2019 competition.
  • The tests folder contains scripts to test the installation of the repository.
  • The train folder contains the scripts with hyper-parameters to train the models shown in the paper.
  • The tutorials folder contains use cases on how to use the library.

Installation

git clone https://github.com/jarobyte91/post_ocr_correction.git
cd post_ocr_correction
pip install .

To download the datasets and models

python download_data.py

To reproduce the results from the paper

pip install -r requirements.txt
cd notebooks

To install the Python package

pip install post_ocr_correction

Contribute & Support

License

The project is licensed under the MIT License.

People

Contributors

jarobyte91


Issues

For the inference, how can we do the alignment?

When we do inference, we do not have an aligned input. If we input "Dcmen", the output should be "Documen", which will be longer than the maximum length allowed. Could you please answer this question? Thanks!

vocabulary not found

Hello,

Thank you very much for the published work, it's very interesting.

When trying to use the lib part:

# string is the string of characters to correct
# model is a PyTorch sequence-to-sequence model
# vocabulary is a correspondence between integers and tokens
# (see the preprocessing notebook in the notebooks folder for each language)

from ocr_correction import correct_by_disjoint_window, correct_by_sliding_window

corrected_string = correct_by_disjoint_window(string, model, vocabulary)
# or
corrected_string = correct_by_sliding_window(string, model, vocabulary)
corrected_string = corrected_string.replace("@", "")  # remove the padding character

The vocabulary is absent. After checking the preprocessing notebook (for the French language), I noticed that the associated pickles are not uploaded.

Could you please provide the vocabulary so that I can test the approach?

Thanks in advance,
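While the pickles shipped with the paper are not in the repository, the integer-to-token correspondence they contain can be approximated from training text. The sketch below is a minimal stand-in, not the library's code; the function name and the choice of special tokens (which appear in the decoding snippets elsewhere in this page) are illustrative.

```python
def build_vocabulary(corpus):
    # Special tokens first, then every character seen in the training text.
    specials = ["<PAD>", "<START>", "<END>", "<UNK>"]
    chars = sorted({ch for text in corpus for ch in text})
    token_to_id = {tok: i for i, tok in enumerate(specials + chars)}
    id_to_token = {i: tok for tok, i in token_to_id.items()}
    return token_to_id, id_to_token

token_to_id, id_to_token = build_vocabulary(["docum3nt", "document"])
print(token_to_id["<PAD>"])            # 0
print(id_to_token[token_to_id["d"]])   # "d"
```

For real use the vocabulary must match the one the model was trained with, so a rebuilt mapping only works with a model trained from scratch on it.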

Discussion

I have noticed that after using your method to correct the text, the CER in English is below the baseline. Why is this happening?
Also, as you mentioned in the Discussion section, the weight function does not seem to be very useful. Have you considered using other weight functions, such as introducing word frequency?

Dealing with long text

Hi, may I ask a question about dealing with long texts?

I tried replicating the "Load one of the pre-trained models" example in README.md to correct a long OCR text. My code is below:

text = 'here is the long text ... '
text = correct_punctuation_error(text)
new_source = [list(text)]
X_new = source_index.text2tensor(new_source)
predictions, log_probabilities = seq2seq.beam_search(model, X_new, progress_bar = 0)
just_beam = target_index.tensor2text(predictions[:, 0, :])[0]
just_beam = re.sub(r"<START>|<PAD>|<UNK>|<END>.*", "", just_beam)

However, this raises an error.

How can I fix it?

Thank you so much!
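The usual cause of such failures is feeding the model a sequence longer than the length it was trained on. The windowing idea behind `correct_by_disjoint_window` can be sketched as follows; this is a self-contained illustration, not the library's implementation, and `correct_fn` stands in for a full beam-search decode of one chunk.

```python
def correct_in_disjoint_windows(text, correct_fn, window_size=10):
    # Split the input into non-overlapping chunks no longer than the
    # length the model was trained on, correct each chunk independently,
    # and concatenate the results.
    chunks = [text[i:i + window_size] for i in range(0, len(text), window_size)]
    return "".join(correct_fn(chunk) for chunk in chunks)

# Toy "model": fix the common 1 -> i and 0 -> o OCR confusions.
fixed = correct_in_disjoint_windows("th1s 1s a l0ng 0cr str1ng",
                                    lambda s: s.replace("1", "i").replace("0", "o"))
print(fixed)  # -> "this is a long ocr string"
```

In practice the library's sliding-window variant is preferable, since disjoint chunks can split a corrupted word across a boundary where neither half has enough context to fix it.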
