
Comments (7)

wanghaisheng commented on August 15, 2024

4.2 Semi-automated Database Generation
This section describes an approach to automate the process of OCR database preparation from scanned documents. The proposed automation is expected to greatly reduce the manual effort required to develop OCR databases for cursive scripts. The basic idea is to apply ligature clustering prior to manual labeling. Two prototype datasets for the Urdu Nastaleeq script have been developed using the proposed technique.
Urdu belongs to the family of cursive scripts in which words mainly consist of ligatures. Ligatures are formed by joining individual characters, and the shape of a character in a ligature depends on its position within the word. Moreover, there are dots and diacritics associated with certain characters. Each ligature in Urdu is separated from other ligatures or its own diacritics by vertical, horizontal, or diagonal (slanted) space. The properties of this script, along with the challenges it poses for OCR, have been described in Section 3.6. In the context of the present section, the smallest unit of the script is assumed to be a ligature, which may be either a combination of many characters or a single non-joinable character. There are around 26,000 ligatures in the Urdu Nastaleeq script, and a reasonable database must cover all of them.
The method to semi-automatically generate a database for Urdu Nastaleeq starts with binarization as the preprocessing step. Urdu ligatures are then extracted from the text images and clustered prior to manual labeling of the correct ligatures. The following sub-sections present a detailed view of the proposed method.


wanghaisheng commented on August 15, 2024

4.2.1 Preprocessing
Binarization is the only preprocessing step in the proposed method; however, skew detection and correction may be included as further preprocessing steps. A local thresholding technique [SP00] is used for binarization. The fast implementation of this algorithm proposed by Shafait et al. [SKB08] has been used to speed up the process. Two parameters, namely the local window size and the k-parameter, need to be set empirically according to the documents. The local window size is taken to be 70 × 70 and the k-parameter is set to 0.3.
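For illustration, here is a minimal sketch of this binarization step using scikit-image's Sauvola implementation; it is a stand-in rather than the exact implementation of [SKB08], and since `threshold_sauvola` requires an odd window size, 71 approximates the 70 × 70 window quoted above.

```python
from skimage.io import imread
from skimage.filters import threshold_sauvola

def binarize(path, window_size=71, k=0.3):
    """Sauvola local thresholding. scikit-image requires an odd window,
    so 71 stands in for the 70x70 window quoted in the text."""
    gray = imread(path, as_gray=True)
    thresh = threshold_sauvola(gray, window_size=window_size, k=k)
    return gray < thresh  # True for ink pixels (darker than local threshold)
```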
4.2.2 Ligature Extraction
Ligature extraction may be carried out in two ways: one is to apply the ligature extraction algorithm directly on the binarized image, while the second is to extract text-lines before applying ligature extraction. The former is suitable for clean documents with well-defined text-line spacing (see Figure 4.1-(a)), while the latter is suitable when text-lines are not well separated (Figure 4.1-(b)) and in the case of degraded historical documents.
Narrow line separation results in poor connected component analysis, whereby many ligatures from adjacent text-lines merge together. The decision to apply text-line segmentation is taken on the basis of the line spacing in a particular document. Ligature extraction starts by applying connected component analysis. The list of connected components is first divided into two parts: base components and diacritics (including dots). This division is based on a connected component's height, its width, and the ratio of the two. In the context of this chapter, font variations are not considered and the primary focus is on the fonts typically used in Urdu books and magazines. Therefore, the thresholds for separating main ligatures and diacritics are set empirically for the primary font size and remain the same for all the document images in our dataset.
It is not possible to separate Urdu ligatures by a single threshold value. Therefore, different thresholds have been employed according to the properties of a particular ligature. For a ligature consisting of a single ‘ ’ (alif), the average height-to-width ratio is 4.0 and the average width is around 6 pixels. For ligatures like ‘ ’ (bay), ‘ ’ (tay) and ‘ ’ (say), the average height-to-width ratio is 0.4 and the average width is around 30 pixels. For all other ligatures, it is sufficient to employ a width greater than 10 pixels.
If there are no diacritics in a ligature, e.g., in ‘ ’, then no further processing is needed. However, if one or more diacritics are present, e.g., in ‘ ’, then these diacritics must be associated with the base component to completely extract a ligature. Diacritics are searched for in the neighborhood of a base component by extending the bounding box of the base connected component. This window size depends on the font size; since only documents with one dominant font size have been used, the window is set according to that font size. Presently, the bounding box of the base component is extended by 15 pixels on the top and bottom and by 10 pixels on the right and left. Ligatures extracted in this manner are then saved to a database file for further clustering and labeling.
4.2.3 Clustering
As mentioned earlier, due to the huge number of ligatures present in a cursive script, labeling individual ligatures is highly impractical. Hence, it is proposed that the extracted ligatures be clustered according to similar shapes. For this purpose, the epsilon-net clustering technique is employed. By simply changing the value of epsilon, the number of clusters can be controlled. The value of epsilon is set empirically to obtain a moderate number of clusters, so that they can be managed easily in the manual validation step. The features used for epsilon clustering are the bitmaps of the ligatures. Moreover, this method is relatively faster than k-means clustering.
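A minimal sketch of greedy epsilon-net clustering over ligature bitmaps, assuming all bitmaps have been rescaled to a common size; the thesis does not specify the distance measure, so Euclidean distance on flattened bitmaps is an illustrative choice here.

```python
import numpy as np

def cluster_ligatures(bitmaps, epsilon):
    """Greedy epsilon-net clustering: each bitmap joins the first center
    within epsilon, otherwise it becomes a new cluster center.
    `bitmaps` is a list of equally-sized 2-D binary arrays."""
    centers, clusters = [], []
    for i, bmp in enumerate(bitmaps):
        vec = bmp.astype(float).ravel()
        for j, c in enumerate(centers):
            if np.linalg.norm(vec - c) <= epsilon:
                clusters[j].append(i)
                break
        else:
            centers.append(vec)   # smaller epsilon -> more, tighter clusters
            clusters.append([i])
    return clusters
```

Because each item is compared only against the cluster centers found so far, a single pass over the data suffices, which is why this scheme runs faster than iterative k-means.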

4.2.4 Ligature Labeling
The next step is to verify the clusters and modify them manually if needed. The OCRopus framework provides a proficient graphical user interface to do this without much hassle. It is also possible that the clustering splits a single ligature across more than one cluster (see Figure 4.2-(a)); hence, one needs to merge different clusters to save time at a later stage of labeling. Moreover, one can also modify the cluster-division step so as to retain only valid members (those with the same label as the representative), assign a null class to incorrect members, and then apply further iterations of clustering on the null class. In the current work, merging of same-ligature clusters precedes the manual labeling and only a single clustering iteration is employed. After this verification step, each cluster is examined individually to identify invalid clusters, which are then discarded. Again, the OCRopus framework is used for this purpose (see Figure 4.2-(b)).
At the end of this labeling process, we have a database whose entries record the following information about a ligature (a minimal record sketch follows the list):
• Image file name from which this ligature was originally extracted.
• Bounding box information regarding the location of a ligature in the document image.
• Unicode string corresponding to the characters forming this ligature.
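As a sketch, such an entry could be represented as a small Python record; the field names and example values here are hypothetical, chosen only to mirror the three items above.

```python
from dataclasses import dataclass

@dataclass
class LigatureEntry:
    """One database entry, mirroring the three fields listed above."""
    image_file: str                   # document image the ligature came from
    bbox: tuple[int, int, int, int]   # (top, left, bottom, right) in pixels
    text: str                         # Unicode string of the ligature's characters

# Illustrative values only:
entry = LigatureEntry("page_001.png", (120, 340, 165, 410), "کتاب")
```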


wanghaisheng commented on August 15, 2024

4.2.5 Experiments and Results
This section describes the experimental setup and the evaluation of results. Two prototype datasets for Urdu script have been developed using the proposed technique. One dataset consists of clean documents such as that shown in Figure 4.1-(a). At present, only 20 such document images have been used; this dataset is referred to as DB-I. The second dataset (referred to as DB-II) consists of 15 documents written by a calligrapher, such as that shown in Figure 4.1-(b). An important property of calligraphic documents is that the shape of a ligature does not remain identical within the document, and minor differences in shape may persist throughout. GT information for DB-II is available and is used to gauge the accuracy of the line segmentation algorithm. These two datasets were chosen to evaluate the upper and lower bounds on the performance of the proposed algorithm. The performance evaluation metric used in the present work is ligature coverage, which refers to the fraction of ligatures in the dataset that are correctly labeled by the clustering step followed by the manual validation step.
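Expressed as a formula (a restatement of the definition above, not an equation quoted from the thesis):

```latex
\text{ligature coverage} =
  \frac{\#\{\text{ligatures correctly labeled after clustering and manual validation}\}}
       {\#\{\text{ligatures extracted from the dataset}\}} \times 100\%
```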
The ligature extraction algorithm is able to find 16,857 ligatures in the DB-I dataset. The epsilon-net based clustering then groups these ligatures into 778 clusters. Each individual cluster is subsequently examined to verify the clustering, and Unicode values are assigned to the new clusters. Invalid ligatures are discarded at this point. The ligature coverage achieved by this process is 82.3%. This high coverage is due to sufficient line spacing and non-touching ligatures.
The inherent difficulty with any method based on connected component analysis is poor accuracy in the case of overlapping lines and touching ligatures. To address the first problem in the DB-II dataset, a line segmentation algorithm [BSB11] is employed; its segmentation accuracy is over 90%. The second problem, touching ligatures, could be mitigated by more sophisticated techniques. However, fine separation of individual ligatures is not essential here, as such errors can be corrected at the later manual labeling stage. Hence, this problem is not tackled in this work.
From the DB-II dataset, a total of 18,914 ligatures are extracted. Clustering these ligatures results in 1,132 clusters. After the labeling process, the total ligature coverage is around 62.7%. Inconsistency in ligature shapes due to the calligrapher's handwriting results in poor clustering accuracy for the DB-II dataset. In this case, simple shape-based clustering methods do not work sufficiently well, and other methods need to be explored.


wanghaisheng commented on August 15, 2024

4.2.6 Conclusion
A semi-automated methodology is proposed to generate a GT database from scanned documents for cursive scripts at the ligature level. The same methodology can also be used for rapid generation of character-level datasets for other scripts. One unsatisfactory aspect of this methodology is the use of heuristics to extract ligatures from the document images; these heuristics need to be adapted accordingly for other scripts. It is also observed that the performance of this method is directly affected by the choice of segmentation method and the quality of the document images.
The second approach to developing a large-scale GT OCR database is to use image degradation models. This approach is described in the following section.
4.3 Synthetic Text-Line Generation
The use of artificial data is becoming popular in the computer vision domain for object recognition purposes. A similar path is taken in this thesis to address the issue of limited GT data. Baird [Bai92] proposed several degradation models to generate artificial data from text (ASCII) form. There are many parameters that can be altered to make the artificially generated text-line images closely resemble those obtained from a scanning process. Some of the significant parameters are:
Blur: It is the pixel-wise spread in the output image, and is modeled as a circular Gaussian filter.
Threshold: It is used to distort the image by randomly removing text pixels. If a pixel value is greater than this threshold, it is treated as a background pixel.
Size: It is the height and width of individual characters in the image. It is modeled by image scaling operations.
Skew: It is the rotation angle of the output symbol. The output is skewed to the right or left by specifying the ‘skew’ parameter.
In this thesis, a utility based on these degradation models from OCRopus [OCR15] (an open-source OCR framework) is used to generate the artificial data. This OCRopus utility requires UTF-8-encoded text-lines, along with ttf-type font files, to generate the corresponding text-line images. The process of line image generation is shown in Figure 4.3. The user can specify the parameter values or use the defaults.
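For illustration, a minimal sketch of such a degradation pipeline using NumPy/SciPy, assuming a grayscale text-line image with values in [0, 1] (ink dark, background white); it mimics the four parameters listed above and is not the OCRopus utility itself, so all names and default values are illustrative.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, rotate, zoom

def degrade(line_img, blur_sigma=1.0, threshold=0.5, scale=1.0,
            skew_deg=0.0, rng=None):
    """Baird-style degradation of a grayscale text-line image
    (illustrative sketch, not the OCRopus implementation)."""
    rng = rng or np.random.default_rng()
    img = zoom(line_img, scale)                          # Size: rescale characters
    img = rotate(img, skew_deg, reshape=True, cval=1.0)  # Skew: rotate left/right
    img = gaussian_filter(img, blur_sigma)               # Blur: circular Gaussian filter
    img = img + rng.normal(0.0, 0.05, img.shape)         # pixel noise before thresholding
    return img > threshold                               # Threshold: above it -> background
```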


wanghaisheng commented on August 15, 2024

4.4 OCR Databases
This section lists various datasets that have been developed using the synthetic generation process described in the previous section. These datasets are freely available for research purposes and can be obtained from the author.
4.4.1 Deva-DB – OCR Database for Devanagari Script
A new database for the Devanagari script is presented to advance research in Devanagari OCR. It can provide a common platform for researchers in this domain to benchmark their algorithms.
This database, named Deva-DB, consists of two parts. The first part comprises text-lines taken from the work of Setlur et al. [KNSG05], in which the GT information is represented in transliterated form. We have manually transcribed 621 text-lines into standard Unicode form. This data is used for evaluation purposes only, that is, to test the LSTM model trained with the artificial data.
The second part of this database consists of synthetically generated text-line images produced with the OCRopus framework. The source text-lines were chosen from various online resources covering the fields of current affairs, religion, classical literature, and science. Some sample images from this database are shown in Figure 4.4.

To check the quality of the training set, a comparison is made between the character and word statistics of this set and those published by Chaudhuri et al. [CP97]. The ten most frequently used characters in Hindi, based on three million Hindi words, are shown in Table 4.1. We collected similar statistics for our training data based on approximately one million words. As seen from Table 4.1, the relative frequency of the characters in the proposed training set is similar to that reported by Chaudhuri et al. [CP97]. IIIT Hyderabad, India [III14] has published word frequencies from a Hindi language corpus containing three million words; the proposed training set also matches the top ten frequent words of that work.
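A minimal sketch of how such relative character frequencies can be computed, assuming the training corpus is a single UTF-8 text file; this mirrors the comparison described above rather than the exact tooling used in the thesis, and the example output values are illustrative.

```python
from collections import Counter

def char_frequencies(corpus_path, top_n=10):
    """Relative frequency of the top_n most common characters in a corpus."""
    with open(corpus_path, encoding="utf-8") as f:
        text = f.read()
    counts = Counter(ch for ch in text if not ch.isspace())
    total = sum(counts.values())
    return [(ch, n / total) for ch, n in counts.most_common(top_n)]

# e.g. char_frequencies("hindi_training_corpus.txt") -> [('क', 0.061), ...]
```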


wanghaisheng commented on August 15, 2024

4.4.2 Polyton-DB – Greek Polytonic Database
A collection of transcribed data is available under the OldDocPro project [GSL+] for the recognition of machine-printed and handwritten polytonic documents. The major issue of obtaining a large amount of transcribed data for training is overcome by the use of synthetic data (generated using the OCRopus framework). The contribution of this thesis is the introduction of synthetic text-line images as part of the freely available Greek Polytonic Database, called Polyton-DB, which includes printed polytonic Greek documents from three main sources:
Greek Parliament Proceedings: This database (a sample is shown in Figure 4.5-a) contains 3,203 text-line images taken from 33 scanned pages of the Greek Parliament proceedings. The documents in this database correspond to speeches of Greek politicians of various eras in the 19th and 20th centuries.
Greek Official Government Gazette: These documents consist of 687 text-line images (a sample is shown in Figure 4.5-b), which are picked from five scanned pages of the Greek Official Government Gazette.
Appian's Roman History: 315 documents from Appian's Roman History (written in Greek before AD 165) are used to create this database (a sample is shown in Figure 4.5-c). It contains 11,799 text-line images. The Appian's Roman History scans are clean images, which is not usually the case with historical documents. Moreover, the writing style differs from that of the other two resources. To better train the OCR models for historical documents, synthetically degraded text-lines are generated from this corpus.


wanghaisheng commented on August 15, 2024

4.4.3 Databases for Multilingual OCR (MOCR)
MOCR is a relatively new field of research, and there are not many publicly available databases for testing OCR algorithms in this regard. Moreover, efforts so far have focused on script identification rather than complete processing of such documents. Most of the reported works have utilized datasets that are either private or no longer available. This thesis makes three main contributions to the testing of MOCR algorithms:

• A database to gauge the cross-language performance of any OCR algorithm. The text-lines in this corpus are generated synthetically using the procedure described in Section 4.3. This database covers three European languages, namely English, German, and French, plus a mixed set of all these languages. It consists of 370,799 text-line images (the number of text-lines in each language is given in Table 4.2). Some sample text-lines from this versatile database are shown in Figure 4.6.
• A large set of text-lines used to measure the performance of the script identification methodology (reported in Chapter 7) has been generated artificially. Different text corpora have been used to develop separate training and test data. There are 90,000 text-line images in the training set and 9,500 text-line images in the test set. The GT information is available in the form of both the actual text and the assigned class labels.
• To validate the hypothesis of the generalized OCR framework (reported in Chapter 8), a database from an English-Greek bilingual document has been created using the synthetic text-line image protocol. A total of 90,000 text-line images for the training phase and 9,900 text-line images for evaluation purposes have been generated.

