Benchmarking-Chinese-Text-Recognition

This repository contains datasets and baselines for benchmarking Chinese text recognition. Please see the corresponding paper for more details regarding the datasets, baselines, the empirical study, etc.

Highlights

🌟 All datasets are transformed to lmdb format for convenient usage.

🌟 The experimental results of all baselines are available at link with format (index [pred] [gt]).

🌟 The code and trained weights of all baselines are available at link for direct use.

Updates

Dec 2, 2022: An updated version of the corresponding paper is available at arXiv.

Aug 22, 2022: We upload the lmdb datasets of hard cases.

Jun 15, 2022: The experimental settings are modified. We upload the code and trained weights of all baselines.

Jan 3, 2022: This repo is made publicly available. The corresponding paper is available at arXiv.

Nov 26, 2021: We upload the lmdb datasets publicly to Google Drive and BaiduCloud.

Download

  • The lmdb Scene, Web, and Document datasets are available on BaiduCloud (psw:v2rm) and GoogleDrive (see the reading sketch after this list).

  • The lmdb datasets of hard cases can be downloaded from BaiduCloud (psw:n6nu) and GoogleDrive; the lmdb dataset with examples of synthetic CTR data is available on BaiduCloud (psw:c4sl).

  • For the handwriting setting, please first download the data from SCUT-HCCDoc and divide it into training, validation, and testing sets following link.

  • We also collected the HWDB2.0-2.2 and ICDAR2013 handwriting datasets from CASIA and the ICDAR2013 competition for further research. The datasets are available on BaiduCloud (psw:lfaq) and GoogleDrive.
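For readers who want to inspect the downloaded data, the sketch below shows one way to read a single sample from an lmdb dataset in Python. The database path and the key scheme ('num-samples', 'image-%09d', 'label-%09d', 1-indexed) follow the common CRNN-style convention and are assumptions; please verify them against the released files.

```python
# A minimal sketch for reading one sample from a downloaded lmdb dataset.
# The path 'scene_train_lmdb' and the key scheme are assumptions based on the
# common CRNN-style convention, not confirmed details of this release.
import io
import lmdb
from PIL import Image

env = lmdb.open('scene_train_lmdb', readonly=True, lock=False, readahead=False)
with env.begin() as txn:
    n = int(txn.get(b'num-samples'))                      # total number of samples
    image_bytes = txn.get(b'image-%09d' % 1)              # encoded image bytes of sample 1
    label = txn.get(b'label-%09d' % 1).decode('utf-8')    # UTF-8 Chinese text label
    image = Image.open(io.BytesIO(image_bytes)).convert('RGB')
    print(n, label, image.size)
```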

Datasets

Figure: The four datasets used in our benchmark, including the Scene, Web, Document, and Handwriting datasets, each of which is introduced next.

Scene Dataset

We first collect the publicly available scene datasets, including RCTW, ReCTS, LSVT, ArT, and CTW, resulting in 636,455 samples, which are randomly shuffled and divided at a ratio of 8:1:1 to construct the training, validation, and testing sets. Details of each scene dataset are introduced as follows:

  • RCTW [1] provides 12,263 annotated Chinese text images from natural scenes. We derive 44,420 text lines from the training set and use them in our benchmark. The testing set of RCTW is not used because its text labels are not available.
  • ReCTS [2] provides 25,000 annotated street-view Chinese text images, mainly derived from natural signboards. We only adopt the training set and crop 107,657 text samples in total for our benchmark.
  • LSVT [3] is a large-scale Chinese and English scene text dataset, providing 50,000 fully labeled samples (polygon boxes and text labels) and 400,000 partially labeled samples (only one text instance per image). We only utilize the fully labeled training set and crop 243,063 text line images for our benchmark.
  • ArT [4] contains text samples captured in natural scenes with various text layouts (e.g., rotated and curved text). We obtain 49,951 cropped text images from the training set and use them in our benchmark.
  • CTW [5] contains 30,000 annotated street-view images with rich diversity, including planar, raised, and poorly illuminated text. It provides not only character boxes and labels but also character attributes such as background complexity and appearance. We crop 191,364 text lines from both the training and testing sets.

We combine all the subdatasets, resulting in 636,455 text samples. We randomly shuffle these samples and split them at a ratio of 8:1:1, leading to 509,164 samples for training, 63,645 samples for validation, and 63,646 samples for testing.
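As a worked example, the following sketch (not the authors' actual script; the rounding rule is an assumption) shows how an 8:1:1 split of 636,455 samples yields the reported counts.

```python
# A minimal sketch showing how an 8:1:1 split reproduces the reported Scene counts;
# the shuffle seed and rounding rule are assumptions, not the authors' exact procedure.
import random

def split_8_1_1(samples, seed=0):
    samples = list(samples)
    random.Random(seed).shuffle(samples)      # shuffle before splitting
    n_train = len(samples) * 8 // 10          # 80% for training
    n_val = (len(samples) - n_train) // 2     # half of the remainder for validation
    return (samples[:n_train],
            samples[n_train:n_train + n_val],
            samples[n_train + n_val:])

train, val, test = split_8_1_1(range(636455))
print(len(train), len(val), len(test))        # 509164 63645 63646
```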

Web Dataset

To collect the web dataset, we utilize MTWI [6], which contains 20,000 Chinese and English web text images from 17 different categories on the Taobao website. The text samples appear in various scenes, typographies, and designs. We derive 140,589 text images from the training set and manually divide them at a ratio of 8:1:1, resulting in 112,471 samples for training, 14,059 for validation, and 14,059 for testing.

Document Dataset

We use the public repository Text Render [7] to generate document-style synthetic text images. More specifically, we uniformly sample text lengths from 1 to 15. The corpus comes from wiki, films, amazon, and baike. The dataset contains 500,000 images in total and is randomly divided into training, validation, and testing sets at a ratio of 8:1:1 (400,000 / 50,000 / 50,000).
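The sketch below illustrates only the text-sampling step, i.e., drawing corpus snippets whose lengths are uniform in [1, 15] before handing them to a renderer such as text_renderer; the file corpus.txt and the slicing strategy are hypothetical, not the authors' exact pipeline.

```python
# A minimal sketch (not the authors' generation pipeline) of drawing corpus snippets
# whose lengths are uniform in [1, 15]; 'corpus.txt' is a hypothetical merged corpus
# file assumed to be much longer than 15 characters.
import random

with open('corpus.txt', encoding='utf-8') as f:
    corpus = f.read().replace('\n', '')

def sample_text(rng: random.Random) -> str:
    length = rng.randint(1, 15)                   # uniform length in [1, 15]
    start = rng.randrange(len(corpus) - length)   # random starting position
    return corpus[start:start + length]

rng = random.Random(0)
print([sample_text(rng) for _ in range(5)])       # five example strings to render
```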

Handwriting Dataset

We collect the handwriting dataset based on SCUT-HCCDoc [8], which captures Chinese handwritten images with cameras in unconstrained environments. Following the official settings, we derive 93,254 text lines for training and 23,389 for testing. To support more rigorous research, we manually split the original training set into two sets at a ratio of 4:1, resulting in 74,603 samples for training and 18,651 samples for validation. For convenience, we keep the original 23,389 samples for testing.

Overall, the number of text samples in each dataset is shown below:

| Setting     | Training | Validation | Testing |
|-------------|----------|------------|---------|
| Scene       | 509,164  | 63,645     | 63,646  |
| Web         | 112,471  | 14,059     | 14,059  |
| Document    | 400,000  | 50,000     | 50,000  |
| Handwriting | 74,603   | 18,651     | 23,389  |

Baselines

We manually select six representative methods as baselines, which will be introduced as follows.

  • CRNN [9] is a typical CTC-based method that is widely used in academia and industry. It first sends the text image to a CNN to extract image features, then adopts a two-layer LSTM to encode the sequential features. Finally, the output of the LSTM is fed to a CTC (Connectionist Temporal Classification) decoder to maximize the probability of all paths towards the ground truth. (A minimal greedy-decoding sketch follows this list.)

  • ASTER [10] is a typical rectification-based method aiming at tackling irregular text images. It introduces a Spatial Transformer Network (STN) to rectify the given text image into a more recognizable appearance. Then the rectified text image is sent to a CNN and a two-layer LSTM to extract the features. In particular, ASTER takes advantage of the attention mechanism to predict the final text sequence.

  • MORAN [11] is a representative rectification-based method. It first adopts a multi-object rectification network (MORN) to predict rectified pixel offsets in a weakly supervised way (distinct from ASTER, which utilizes an STN). The predicted pixel offsets are used to generate the rectified image, which is then sent to an attention-based decoder (ASRN) for text recognition.

  • SAR [12] is a representative method that takes advantage of 2-D feature maps for more robust decoding. In particular, it is mainly proposed to tackle irregular texts. On one hand, SAR adopts more powerful residual blocks in the CNN encoder for learning stronger image representation. On the other hand, different from CRNN, ASTER, and MORAN compressing the given image into a 1-D feature map, SAR adopts 2-D attention on the spatial dimension of the feature maps for decoding, resulting in a stronger performance in curved and oblique texts.

  • SEED [13] is a representative semantics-based method. It introduces a semantics module to extract a global semantic embedding and utilizes it to initialize the first hidden state of the decoder. Specifically, while inheriting the structure of ASTER, the decoder of SEED takes the semantic embedding as a prior for the recognition process, thus showing superiority in recognizing low-quality text images.

  • TransOCR [14] is one of the representative Transformer-based methods. It was originally designed to provide text priors for the super-resolution task. It employs ResNet-34 as the encoder and self-attention modules as the decoder. Distinct from RNN-based decoders, the self-attention modules are more effective at capturing the semantic features of the given text images.
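As referenced in the CRNN entry above, here is a minimal sketch of CTC greedy decoding: take the most likely class at each time step, collapse repeated labels, and drop the blank symbol. The toy alphabet, blank index 0, and tensor shapes are illustrative assumptions rather than this repository's actual configuration.

```python
# A minimal sketch of CTC greedy decoding: argmax per time step, collapse repeats,
# drop blanks. Alphabet, blank index, and shapes are illustrative assumptions.
import torch

def ctc_greedy_decode(logits: torch.Tensor, alphabet: str, blank: int = 0) -> str:
    """logits: (T, num_classes) scores for one image; class 0 is the CTC blank."""
    best = logits.argmax(dim=-1).tolist()     # best class at each time step
    chars, prev = [], None
    for idx in best:
        if idx != prev and idx != blank:      # collapse repeats, skip the blank
            chars.append(alphabet[idx - 1])   # classes 1..N map to alphabet characters
        prev = idx
    return ''.join(chars)

# toy usage with a three-character alphabet
alphabet = "中文字"
logits = torch.randn(10, len(alphabet) + 1)
print(ctc_greedy_decode(logits, alphabet))
```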

Here are the results of the baselines on the four datasets. ACC is reported as a percentage and NED as a decimal. Please click the hyperlinks to see the detailed experimental results, which follow the format (index [pred] [gt]). A sketch of how these two metrics can be computed follows the table.

| Baseline      | Year | Scene         | Web           | Document      | Handwriting   |
|---------------|------|---------------|---------------|---------------|---------------|
| CRNN [9]      | 2016 | 54.94 / 0.742 | 56.21 / 0.745 | 97.41 / 0.995 | 48.04 / 0.843 |
| ASTER [10]    | 2018 | 59.37 / 0.801 | 57.83 / 0.782 | 97.59 / 0.995 | 45.90 / 0.819 |
| MORAN [11]    | 2019 | 54.68 / 0.710 | 49.64 / 0.679 | 91.66 / 0.984 | 30.24 / 0.651 |
| SAR [12]      | 2019 | 53.80 / 0.738 | 50.49 / 0.705 | 96.23 / 0.993 | 30.95 / 0.732 |
| SEED [13]     | 2020 | 45.37 / 0.708 | 31.35 / 0.571 | 96.08 / 0.992 | 21.10 / 0.555 |
| TransOCR [14] | 2021 | 67.81 / 0.817 | 62.74 / 0.782 | 97.86 / 0.996 | 51.67 / 0.835 |
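As noted above, ACC and NED can be computed roughly as follows. This is a sketch under the assumption that ACC is exact-match accuracy (in percent) and NED is one minus the Levenshtein distance normalized by the longer string's length, averaged over samples; the paper's exact normalization may differ in small details.

```python
# A sketch of the two evaluation metrics; the exact definitions used in the paper
# may differ slightly from these assumptions.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))                 # classic dynamic-programming edit distance
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def acc_and_ned(preds, gts):
    acc = sum(p == g for p, g in zip(preds, gts)) / len(gts)
    ned = sum(1 - levenshtein(p, g) / max(len(p), len(g), 1)
              for p, g in zip(preds, gts)) / len(gts)
    return acc * 100, ned                          # ACC in percent, NED as a decimal

print(acc_and_ned(["你好世界", "中文识别"], ["你好世界", "中文识刖"]))  # (50.0, 0.875)
```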

References

Datasets

[1] Shi B, Yao C, Liao M, et al. ICDAR2017 competition on reading Chinese text in the wild (RCTW-17). ICDAR, 2017.

[2] Zhang R, Zhou Y, Jiang Q, et al. ICDAR 2019 robust reading challenge on reading Chinese text on signboard. ICDAR, 2019.

[3] Sun Y, Ni Z, Chng C K, et al. ICDAR 2019 competition on large-scale street view text with partial labeling-RRC-LSVT. ICDAR, 2019.

[4] Chng C K, Liu Y, Sun Y, et al. ICDAR2019 robust reading challenge on arbitrary-shaped text-RRC-ArT. ICDAR, 2019.

[5] Yuan T L, Zhu Z, Xu K, et al. A large Chinese text dataset in the wild. Journal of Computer Science and Technology, 2019.

[6] He M, Liu Y, Yang Z, et al. ICPR2018 contest on robust reading for multi-type web images. ICPR, 2018.

[7] text_render: https://github.com/Sanster/text_renderer

[8] Zhang H, Liang L, Jin L. SCUT-HCCDoc: A new benchmark dataset of handwritten Chinese text in unconstrained camera-captured documents. Pattern Recognition, 2020.

Methods

[9] Shi B, Bai X, Yao C. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. TPAMI, 2016.

[10] Shi B, Yang M, Wang X, et al. Aster: An attentional scene text recognizer with flexible rectification. TPAMI, 2018.

[11] Luo C, Jin L, Sun Z. Moran: A multi-object rectified attention network for scene text recognition. PR, 2019.

[12] Li H, Wang P, Shen C, et al. Show, attend and read: A simple and strong baseline for irregular text recognition. AAAI, 2019.

[13] Qiao Z, Zhou Y, Yang D, et al. Seed: Semantics enhanced encoder-decoder framework for scene text recognition. CVPR, 2020.

[14] Chen J, Li B, Xue X. Scene Text Telescope: Text-Focused Scene Image Super-Resolution. CVPR, 2021.

Citation

Please consider citing this paper if you find it useful in your research. The bibtex-format citations of all relevant datasets and baselines are at link.

@article{chen2021benchmarking,
  title={Benchmarking Chinese Text Recognition: Datasets, Baselines, and an Empirical Study},
  author={Chen, Jingye and Yu, Haiyang and Ma, Jianqi and Guan, Mengnan and Xu, Xixi and Wang, Xiaocong and Qu, Shaobo and Li, Bin and Xue, Xiangyang},
  journal={arXiv preprint arXiv:2112.15093},
  year={2021}
}

Acknowledgements

We sincerely thank the researchers who collected the subdatasets for Chinese text recognition. Besides, we would like to thank Teng Fu, Nanxing Meng, Ke Niu, and Yingjie Geng for their feedback on this benchmark.

Copyright

The team includes Jingye Chen, Haiyang Yu, Jianqi Ma, Mengnan Guan, Xixi Xu, Xiaocong Wang, and Shaobo Qu, advised by Prof. Bin Li and Prof. Xiangyang Xue.

Copyright © 2021 Fudan-FudanVI. All Rights Reserved.

benchmarking-chinese-text-recognition's Issues

The HWDB and ICDAR2013

Thank you very much for your work. Could you please supplement the experimental results on HWDB and ICDAR2013? These two datasets are very important in Chinese handwriting recognition and have a relatively large body of related work, which would make it easier to compare the performance of different methods.

Bug in divide_scut.py

On line 62, index should be cnt; otherwise the numbers of training and validation samples do not match the paper.
On line 64, train_save_path should be validation_save_path.

Please upload the supporting files.

Some supporting files have not been uploaded to this repo:

parser.add_argument('--alpha_path', type=str, default='./data/benchmark.txt', help='')
parser.add_argument('--alpha_path_radical', type=str, default='./data/radicals.txt', help='')
parser.add_argument('--decompose_path', type=str, default='./data/decompose.txt', help='')

Thanks!

requirements issues

Could you please provide a requirements file, as many packages become unusable after an upgrade.

What is needed to use a custom dataset?

Thanks for sharing the code. I would like to use my own data to train the model.
What should I do?
My dataset looks like this:

Dataset
││
└───TrainDataset
│   │   TrainImg1.jpg
│   │   TrainImg2.jpg
│   │   ......
└───TestDataset
│   │   TestImg1.jpg
│   │   TestImg2.jpg
│   │   ......
└───Label.json ( includes the bbox and the character labels for train and test dataset)
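Not an official answer, but one plausible way to turn such a layout into an lmdb dataset compatible with CRNN-style loaders is sketched below. The 'num-samples'/'image-%09d'/'label-%09d' key scheme and the assumed structure of Label.json (a simple file-name-to-text mapping of already-cropped lines) are assumptions; since Label.json also contains bounding boxes, the images would first need to be cropped to text lines.

```python
# A rough sketch (not an official answer), assuming Label.json maps image file names to
# text strings of already-cropped lines, and the conventional lmdb key scheme
# ('num-samples', 'image-%09d', 'label-%09d') used by CRNN-style data loaders.
import json
import lmdb

def write_lmdb(image_label_pairs, out_path):
    env = lmdb.open(out_path, map_size=1 << 34)               # generous map size; adjust as needed
    with env.begin(write=True) as txn:
        for i, (img_path, label) in enumerate(image_label_pairs, 1):
            with open(img_path, 'rb') as f:
                txn.put(b'image-%09d' % i, f.read())           # store encoded image bytes
            txn.put(b'label-%09d' % i, label.encode('utf-8'))  # UTF-8 text label
        txn.put(b'num-samples', str(len(image_label_pairs)).encode())

labels = json.load(open('Dataset/Label.json', encoding='utf-8'))  # hypothetical {name: text} mapping
pairs = [('Dataset/TrainDataset/' + name, text) for name, text in labels.items()]
write_lmdb(pairs, 'custom_train_lmdb')
```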

SCUT-HCCDoc

Hello, is there a decompression password for this handwriting dataset? Regarding handwriting data, SCUT-EPT is also handwritten; why was it not included in training?

Cannot reproduce the results reported in the paper

Hello, thanks for sharing such an awesome research work.

About this work, I have one question. When I run the baseline experiments with the default configurations in the released code, the results I obtain are much lower than those reported in the paper. To reproduce the paper's results, could the authors provide the detailed training configurations of all models?

Thanks! Waiting for your reply.

Vertical and irregular text recognition

Hello, thank you very much for your work. I tried the demo you provided and found that the model does not recognize vertical or irregular character images accurately. Could you explain why, and what additional preprocessing is needed? Thanks.

Datasets

Hello, can the prepared lmdb datasets only be loaded as a single file? My data is scattered across many folders and is hard to merge. Is there a good way to handle this? Thanks!

About decompose.txt

I would like to ask how decompose.txt was generated, because the radical-level decomposition here is finer than the usual IDS method. I would like to know how the radicals were decided and how the decomposition was performed. Thanks.

About the data format of the datasets

Hello, I would like to ask about the dataset you provided (see attached image). Can this data be used directly for training? What is its data format, and what modifications do I need to make? Thanks.

Some curiosity about the baseline training configuration.

Thank you for your contribution to the community, the creation of the Chinese recognition benchmark is critical to the advancement of the field.

I would like to know more details about the training configuration of some baseline models on the sub-dataset "Scene", such as the specific number of epochs, batch size, learning rate, weight decay, max_length, grad_clip, etc.

In particular, I noticed that the TransOCR results reported in the paper (arXiv:2112.15093) differ from the results on GitHub. Were better experimental results obtained with different hyperparameters after the paper was submitted to arXiv?

Thank you again for your outstanding work, and I hope you can get back to me. I could not find a specific configuration file, and the default parameters in TransOCR/main.py (e.g., epoch = 1000) imply a training time that my current hardware cannot support.

Baseline models

Hello, are the trained models of each baseline on the Scene dataset available? Thank you very much.

Datasets

I would like to look at the original Scene dataset and analyze its characteristics. Is there a way to convert the lmdb dataset back into ordinary images and labels, or somewhere to directly download the images and labels corresponding to the lmdb datasets in this project?
