Context-Dependent Confusion Rules for OCR Post-Processing


Statistical Language Modeling for Historical Documents using Weighted Finite-State Transducers and Long Short-Term Memory about awesome-ocr HOT 5 CLOSED

wanghaisheng commented on August 15, 2024

Statistical Language Modeling for Historical Documents using Weighted Finite-State Transducers and Long Short-Term Memory

from awesome-ocr.

Comments (5)

wanghaisheng commented on August 15, 2024

error model for OCR

Context-dependent confusion rules

Rules are extracted from the OCR confusions in the recognition results and converted, using the Levenshtein edit distance algorithm, into edit operations (e.g., insertions, deletions, and substitutions).
The edit operations are extracted as rules that take the context of the incorrect string into account, and these rules are used to build an error model with WFSTs. The context-dependent rules help the language model find the best candidate corrections. They avoid the calculations that occur in searching the language model, and they also enable the language model to correct incorrect words. The context-dependent error model is applied to the University of Washington (UWIII) dataset and to a dataset of Urdu in the Nastaleeq script. It improves the OCR results from an error rate of 1.14% to 0.68%, and it performs better than the state-of-the-art single rule-based approach, which returns an error rate of 1.0%.
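The rule extraction step described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names `align` and `confusion_rules`, the one-character context window, and the rule tuple layout are hypothetical and not taken from the thesis, and the sketch stops at counting rules rather than compiling them into a WFST.

```python
from collections import Counter


def align(ocr: str, truth: str):
    """Return a Levenshtein edit script that turns `ocr` into `truth`.

    Each step is (op, ocr_index, ocr_char, truth_char) with op in
    {"match", "sub", "del", "ins"}; "del" drops a spurious OCR character,
    "ins" restores a character the OCR missed.
    """
    n, m = len(ocr), len(truth)
    # dist[i][j] is the edit distance between ocr[:i] and truth[:j].
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ocr[i - 1] == truth[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + cost,
                             dist[i - 1][j] + 1,
                             dist[i][j - 1] + 1)

    # Walk back from the bottom-right corner to recover the operations.
    steps = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if ocr[i - 1] == truth[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                op = "match" if cost == 0 else "sub"
                steps.append((op, i - 1, ocr[i - 1], truth[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            steps.append(("del", i - 1, ocr[i - 1], ""))
            i -= 1
        else:
            steps.append(("ins", i, "", truth[j - 1]))
            j -= 1
    steps.reverse()
    return steps


def confusion_rules(ocr: str, truth: str, context: int = 1) -> Counter:
    """Count rules (left_context, wrong, right, right_context).

    The context is read from the OCR string, i.e. from the incorrect side.
    """
    rules = Counter()
    for op, pos, wrong, right in align(ocr, truth):
        if op == "match":
            continue
        end = pos + len(wrong)
        left_ctx = ocr[max(0, pos - context):pos]
        right_ctx = ocr[end:end + context]
        rules[(left_ctx, wrong, right, right_ctx)] += 1
    return rules


print(confusion_rules("tbe cat", "the cat"))
# Counter({('t', 'b', 'h', 'e'): 1})
```

Counting the rules over a whole training set would give the frequencies from which rule weights could be derived; how the thesis weights its rules is not stated in this thread.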

wanghaisheng commented on August 15, 2024

A new, simple, fast, and accurate system for generating correspondences between real scanned historical books and their transcriptions. The alignment has many challenges. First, the transcriptions might have different modifications and layout variations than the original book. Second, the recognition of the historical books has misrecognition and segmentation errors, which make the alignment more difficult, especially because the line breaks and pages will not have the same correspondences. Adapted WFSTs are designed to represent the transcription. The WFSTs process Fraktur ligatures and adapt the transcription with a hyphenation model that allows alignment with respect to the varieties of hyphenated words at the line breaks of the OCR documents. In this work, several approaches are implemented for the alignment: text-segments, page-wise, and book-wise approaches. The approaches are evaluated on a dataset of historical documents in German calligraphic (Fraktur) script from the volumes of “Wanderungen durch die Mark Brandenburg” (1862-1889). The text-segmentation approach returns an error rate of 2.33% without a hyphenation model and 2.0% with a hyphenation model.

Dehyphenation methods are presented to remove the hyphen from the transcription. They provide the transcription in a readable and reflowable format to be used for alignment purposes. The task is treated as a classification problem: the hyphens are classified from the given patterns as hyphens for line breaks, combined words, or noise. The methods are applied to clean and noisy transcriptions in different languages. The Decision Trees classifier performs better, returning an accuracy of 98% on the UWIII dataset and 97% on Fraktur script.
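To make the three hyphen classes concrete, here is a rough sketch of dehyphenation. Note the substitution: a plain lexicon lookup stands in for the Decision Trees classifier, whose features are not described in this thread. The names `classify_hyphen` and `dehyphenate`, the toy lexicon, and the rule that a hyphen not flanked by letters counts as noise are all assumptions made for illustration.

```python
def classify_hyphen(left: str, right: str, lexicon: set) -> str:
    """Label a line-end hyphen as 'line_break', 'compound', or 'noise'.

    Heuristic stand-in for a trained classifier (assumption):
    - not flanked by letters on both sides -> noise
    - joined form is a known word          -> line-break hyphen
    - otherwise                            -> hyphen of a combined word
    """
    if not (left.isalpha() and right.isalpha()):
        return "noise"
    if (left + right).lower() in lexicon:
        return "line_break"
    return "compound"


def dehyphenate(lines, lexicon):
    """Merge lines into one reflowable string, resolving line-end hyphens."""
    words = []
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue
        if words and len(words[-1]) > 1 and words[-1].endswith("-"):
            left = words[-1][:-1]
            # Separate trailing punctuation so "tation," is still recognized.
            core = tokens[0].rstrip(".,;:!?")
            tail = tokens[0][len(core):]
            label = classify_hyphen(left, core, lexicon)
            if label == "line_break":
                words[-1] = left + core + tail
                tokens = tokens[1:]
            elif label == "compound":
                words[-1] = left + "-" + core + tail
                tokens = tokens[1:]
            # "noise": leave both tokens untouched.
        words.extend(tokens)
    return " ".join(words)


lexicon = {"segmentation"}  # toy lexicon, for illustration only
lines = ["the segmen-", "tation errors and well-", "known methods"]
print(dehyphenate(lines, lexicon))
# the segmentation errors and well-known methods
```

A trained classifier would replace `classify_hyphen`; the reflow loop around it could stay the same.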

wanghaisheng commented on August 15, 2024

An in-depth investigation has been carried out into constructing high-performance language models for improving recognition systems. A new method that constructs a language model using LSTM is designed to correct OCR results. The method is applied to UWIII and Urdu script. The LSTM approach outperforms the state-of-the-art, especially for tokens unseen during training. On the UWIII dataset, the LSTM reduces the OCR error rate from 1.14% to 0.48%. On the Urdu dataset in Nastaleeq script, the LSTM reduces the error rate from 6.9% to 1.58%.
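The error rates quoted above are easier to compare as relative reductions. The helper below is just arithmetic on the reported figures; the name `relative_reduction` is hypothetical.

```python
def relative_reduction(before: float, after: float) -> float:
    """Relative error-rate reduction in percent."""
    return (before - after) / before * 100


# Figures as reported in this thread (error rates in percent).
print(round(relative_reduction(1.14, 0.48), 1))  # UWIII: 57.9
print(round(relative_reduction(6.9, 1.58), 1))   # Urdu Nastaleeq: 77.1
```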

https://github.com/dansoutner/LSTM
https://github.com/sherjilozair/char-rnn-tensorflow
https://github.com/hunkim/word-rnn-tensorflow
https://github.com/mbartoli/docker-char-rnn
https://github.com/michaelcapizzi/NN-NLP
https://github.com/karpathy/char-rnn

https://github.com/jcjohnson/torch-rnn

wanghaisheng commented on August 15, 2024

The integration of multiple recognition outputs can give higher performance than a single recognition system. Therefore, a new method for combining the results of OCR systems is explored using WFSTs and LSTM. It uses multiple OCR outputs and votes for the best output to improve the OCR results. It performs better than the ISRI tool and the Pairwise of Multiple Sequence. The purpose is to provide a correct transcription that can be used for digitizing books, linguistic purposes, N-grams, and part-of-speech tagging. The method consists of two alignment steps. First, two recognition systems are aligned using WFSTs. The transducers are designed to be flexible and compatible with the different symbols at line and page breaks, to avoid segmentation and misrecognition errors. The LSTM model is then used to vote for the best candidate correction of the two systems and to improve the incorrect tokens produced during the first alignment. The approaches are evaluated on OCR output from the English UWIII and historical German Fraktur datasets, obtained from state-of-the-art OCR systems. The experiments show error rates of 1.45% for ISRI-Voting, 1.32% for the Pairwise of Multiple Sequence, and 1.26% for the Line-to-Page alignment; the LSTM approach performs best with 0.40%.

wanghaisheng commented on August 15, 2024

Multiple Sequence Alignment [x]

https://github.com/eddieantonio/isri-ocr-evaluation-tools

heuristic search [x]

Line-to-Page alignment with edit rules using WFSTs

Edit rules are based on the edit operations: insertion, deletion, and substitution.
An approach is designed using an RNN with LSTM to predict these types of errors and to solve the mentioned problems, as shown in Figure 7.1. A novel method is designed to normalize the size of the strings for the LSTM alignment. The LSTM returns the best voting, especially when the heuristic approaches are unable to vote amongst various OCR engines. The LSTM predicts the correct characters even if the OCR could not produce these characters in its outputs.
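Size normalization can be pictured as padding two OCR strings with an epsilon symbol until they line up position by position. The sketch below is an assumption about the general idea, not the thesis's Character-Epsilon alignment: it uses Python's `difflib.SequenceMatcher` for the alignment, and the names `EPS` and `epsilon_align` are hypothetical.

```python
from difflib import SequenceMatcher

EPS = "ε"  # padding symbol; assumed not to occur in the OCR text


def epsilon_align(a: str, b: str):
    """Pad `a` and `b` with EPS so that both have the same length.

    Matching blocks stay in the same columns; where one string has
    fewer characters than the other, EPS fills the gap.
    """
    out_a, out_b = [], []
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for _tag, i1, i2, j1, j2 in matcher.get_opcodes():
        seg_a, seg_b = a[i1:i2], b[j1:j2]
        width = max(len(seg_a), len(seg_b))
        out_a.append(seg_a.ljust(width, EPS))
        out_b.append(seg_b.ljust(width, EPS))
    return "".join(out_a), "".join(out_b)


norm_a, norm_b = epsilon_align("hello wrld", "helo world")
print(norm_a)
print(norm_b)
```

After this step both strings have equal length, so they can be encoded as fixed-position input and target sequences for a network.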

A Line-Page approach is designed to cope with differences in line order between the two OCR systems. Each line of an OCR page is represented as a line in the WFST, and the OCR page is represented as parallel lines in the page WFST. This approach aligns each line of the first OCR with every line in the page WFST of the second OCR. The best matches between the first and the second OCR systems are represented in a composed graph. From the composed graph, the best valid path is then chosen.
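The matching logic of the Line-Page approach can be imitated without transducers. In the sketch below, a string similarity score from `difflib.SequenceMatcher` replaces WFST composition and best-path search, so it shows only the line-against-page idea; `line_to_page` and `align_pages` are hypothetical names.

```python
from difflib import SequenceMatcher


def line_to_page(line: str, page_lines):
    """Return (index, score) of the page line most similar to `line`."""
    best_idx, best_score = -1, -1.0
    for idx, candidate in enumerate(page_lines):
        score = SequenceMatcher(None, line, candidate, autojunk=False).ratio()
        if score > best_score:
            best_idx, best_score = idx, score
    return best_idx, best_score


def align_pages(page_a, page_b):
    """Match every line of the first OCR page against the whole second page."""
    matches = []
    for i, line in enumerate(page_a):
        j, score = line_to_page(line, page_b)
        matches.append((i, j, score))
    return matches


ocr1 = ["The quick brown fox", "jumps over the lazy dog"]
ocr2 = ["jumps ovcr the lazy dog", "Tbe quick brown fox"]  # different line order
print([(i, j) for i, j, _ in align_pages(ocr1, ocr2)])
# [(0, 1), (1, 0)]
```

Because every line is compared with the whole page, the result does not depend on the two OCR systems emitting lines in the same order.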

C1 Line-to-Page Alignment approach as described in Section 7.2

(a) Solving the problem of segmentation and various line breaks by aligning line to page.
(b) Solving the problem of the recognition error by using edit operations as rules. The usual edit operations such as insertion, deletion, and substitution are used.
(c) Flexible combination and adaptation using WFST.
(d) Improving recognition results using multiple outputs from different recognition systems without using a dictionary.

C2 LSTM approach as described in Section 7.3

(a) Normalization of the strings of different OCR outputs.
(b) Yields prediction of unknown strings and can vote for the best output amongst various errors.
(c) Solves the problems that previous approaches leave unsolved. Heuristic approaches are unable to return characters that did not appear in the OCR results. They also fail to vote for a correct character if all the OCR systems provide misrecognized versions of this character.
(d) Flexible and adaptable approach.
(e) Improves the recognition results by combining the output of many OCRs.
(f) Approaches in C1 and C2 are language independent.
(g) Performance is better than the state-of-the-art [RJN96,WYM13].
(h) The approaches are evaluated on OCR output from the English UWIII and historical German Fraktur datasets, obtained from state-of-the-art OCR systems.
(i) Experiments show that the error rate of ISRI-Voting is 1.45%, the Pairwise of Multiple Sequence is 1.32%, the Line-to-Page alignment is 1.26%, and the LSTM approach has the best performance with 0.40% for English script from the UWIII dataset.
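The limitation of heuristic voting noted in item (c) is easy to reproduce. The following sketch is a plain per-column majority vote over already aligned strings, not the ISRI tool or the thesis's method; `majority_vote` and the epsilon padding symbol are assumptions made for illustration.

```python
from collections import Counter


def majority_vote(aligned, eps: str = "ε") -> str:
    """Pick the most frequent character in each column of aligned strings."""
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("all strings must be aligned to the same length")
    voted = []
    for column in zip(*aligned):
        char, _count = Counter(column).most_common(1)[0]
        if char != eps:  # a winning padding symbol means "no character here"
            voted.append(char)
    return "".join(voted)


# One engine is wrong: the vote recovers the correct character.
print(majority_vote(["hcllo", "hello", "hello"]))  # hello

# Every engine makes the same mistake: voting cannot recover the 'e'.
print(majority_vote(["hcllo", "hcllo", "hcllo"]))  # hcllo
```

The second call shows why a predictive model is attractive here: a vote can only return characters that at least one engine produced.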

In Section 7.1, the state-of-the-art Pairwise of Multiple Sequence Alignment and ISRI OCR voting tools are described. Section 7.2 presents the contributed method and the construction of the Line-to-Page Alignment using WFSTs. Section 7.3 explains the second contributed approach, which uses alignment and LSTMs. In Section 7.3.1, the newly designed Character-Epsilon alignment for size normalization is shown. Section 7.3.2 describes the string encoding, and Section 7.3.3 shows the configuration of the LSTM network. Section 7.4 presents the experimental results. Section 7.5 summarizes the conclusions.
