
M6Doc_Dataset_Release

The M6Doc dataset for research on layout analysis of modern documents has been released by the Deep Learning and Visual Computing Lab of South China University of Technology. The test dataset can be downloaded through the following link:

Note: The M6Doc dataset can only be used for non-commercial research purposes. Scholars or organizations who want to use the M6Doc dataset should first fill in this Application Form and send it to us via email ([email protected] or [email protected]). When submitting the application form, please list or attach 1-2 of your publications from the past 6 years to show that you (or your team) do research in the related fields of OCR, handwriting analysis and recognition, document image processing, or visual information extraction. At present, this dataset is freely available only to scholars in the above-mentioned fields. We will send you the download links and decompression passwords for the training and validation datasets after your application has been received and approved.

License

The M6Doc dataset should be used and distributed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) License for non-commercial research purposes.

M6Doc Dataset

The M6Doc dataset contains a total of 9,080 modern document images, which are categorized into seven subsets, i.e., scientific article (11%), textbook (23%), test paper (22%), magazine (22%), newspaper (11%), note (5.5%), and book (5.5%) according to their content and layouts. It contains three formats: PDF (64%), photographed documents (5%), and scanned documents (31%). The dataset includes a total of 237,116 annotated instances.
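As a quick sanity check on the composition above, the following minimal sketch converts the stated percentages into approximate image counts. The percentages and the total of 9,080 come from the text; the function and variable names are illustrative only. Because the published percentages are rounded, the resulting counts are estimates and may differ slightly from the exact subset sizes.

```python
# Figures taken from the description above; names are illustrative.
TOTAL_IMAGES = 9080

# Share of each subset, in percent of all images.
SUBSET_SHARES = {
    "scientific article": 11.0,
    "textbook": 23.0,
    "test paper": 22.0,
    "magazine": 22.0,
    "newspaper": 11.0,
    "note": 5.5,
    "book": 5.5,
}

# Share of each source format, in percent of all images.
FORMAT_SHARES = {
    "PDF": 64.0,
    "photographed": 5.0,
    "scanned": 31.0,
}


def approximate_counts(total, shares):
    """Convert percentage shares into rounded image counts."""
    return {name: round(total * pct / 100) for name, pct in shares.items()}


subset_counts = approximate_counts(TOTAL_IMAGES, SUBSET_SHARES)
format_counts = approximate_counts(TOTAL_IMAGES, FORMAT_SHARES)

for name, n in subset_counts.items():
    print(f"{name:20s} ~{n} images")
for name, n in format_counts.items():
    print(f"{name:20s} ~{n} images")
```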

Dataset Source

The M6Doc dataset was collected from various sources, including arXiv, the official website of the Chinese People's Daily, and VKontakte. The source and composition of each subset are described below.

  • The scientific article subset includes articles obtained by searching with the keywords "Optical Character Recognition" and "Document Layout Analysis" on arXiv. PDF files were then downloaded and converted to images.
  • The textbook subset contains 2,080 scanned document images from textbooks for three grades (elementary, middle, and high school) and nine subjects (Chinese, Math, English, Physics, Chemistry, Biology, History, Geography, and Politics).
  • The test paper subset consists of 2,000 examination papers covering the same nine subjects as the textbook subset.
  • The magazine subset includes 1,000 Chinese and 1,000 English magazine images in PDF format. The Chinese magazines were sourced from five publishers: Global Science, The Mystery, Youth Digest, China National Geographic, and The Reader. The English magazines were sourced from five American publishers: The New Yorker, New Scientist, Scientific American, The Economist, and Time USA.
  • The newspaper subset contains 500 PDF document images from the Chinese People's Daily and the Wall Street Journal.
  • The note subset consists of 500 scanned pages of students' handwritten notes in nine subjects.
  • The book subset contains 500 photographed images, which were acquired from 50 books with 10 pages each. Each book has a distinct layout, resulting in considerable diversity in this subset.

Data Annotation

Label definition

To ensure that the definitions of document layout elements are reasonable and traceable, we reviewed relevant material on layout knowledge and layout design. We also drew on the book "Page Design: New Layout & Editorial Design" (2019) and referred to YouTube video explanations of magazine and newspaper layouts. In most cases, we followed the Wikipedia definition. Consequently, we defined 74 detailed document annotation labels. The key factors in selecting these labels were (1) the commonality of labels across document types, (2) the specificity of labels to particular document types, (3) the frequency of labels, and (4) the recognition of independent pages. We first unified the labels across document types to the maximum extent and then defined additional labels specific to certain document types to capture their differences. Commonality and specificity ensure that the defined labels can adapt to multiple document types, which in turn allows a more detailed logical layout analysis for a given type of document. This differs from how labels are defined in DocBank, PubLayNet, and DocLayNet, none of which define labels specific to different document types. Figure 1 shows annotation samples of M6Doc.

Figure 1. Example annotations of M6Doc. Zoom in for a better view.

For a fair evaluation, we divided the dataset into training, validation, and test sets in a ratio of 6:1:3. We also ensured that each label appears in the same proportion across the three sets. Table 2 summarizes the overall frequency and distribution of labels in the different sets.

Table 2. M6Doc dataset overview.
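To make the 6:1:3 ratio concrete, here is a minimal sketch that derives split sizes from the total of 9,080 images. The helper name `split_sizes` is hypothetical, and the sketch only does the arithmetic; it does not reproduce the label-balanced assignment of images to splits described above, and Table 2 remains the authoritative source for per-split figures.

```python
def split_sizes(total, ratio=(6, 1, 3)):
    """Split `total` items into parts proportional to `ratio`.

    Any remainder left by integer division is assigned to the first
    (training) part so that the sizes always sum to `total`.
    """
    parts = sum(ratio)
    sizes = [total * r // parts for r in ratio]
    sizes[0] += total - sum(sizes)
    return tuple(sizes)


train_size, val_size, test_size = split_sizes(9080)
print(f"train={train_size}, val={val_size}, test={test_size}")
```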

Annotation guideline

We provide a detailed annotation guideline of over 170 pages (guideline_chinese; guideline_en will be updated later) and some typical annotation examples. A total of 47 annotators performed the annotation task strictly according to the guideline.

Directory Format

The dataset is organized in the following directory format:

├── M6Doc
    ├── annotations
    │   ├── instances_train2017.json
    │   ├── instances_val2017.json
    │   └── instances_test.json
    ├── train2017
    │   ├── xxx.jpg
    │   └── ...
    ├── val2017
    │   ├── xxx.jpg
    │   └── ...
    └── test2017
        ├── xxx.jpg
        └── ...
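The file names in this layout suggest COCO-style annotations. Assuming that is the case (that is, each JSON file has top-level "images", "annotations", and "categories" lists, which is not confirmed in this README), the following sketch counts annotated instances per category for one split using only the Python standard library. The function name and the `M6Doc` root path are illustrative.

```python
import json
from collections import Counter
from pathlib import Path


def count_instances_per_category(annotation_path):
    """Return {category name: instance count} for one annotation file.

    Assumes a COCO-style JSON layout with "categories" entries holding
    "id" and "name", and "annotations" entries holding "category_id".
    """
    with open(annotation_path, encoding="utf-8") as f:
        coco = json.load(f)
    id_to_name = {c["id"]: c["name"] for c in coco["categories"]}
    counts = Counter(id_to_name[a["category_id"]] for a in coco["annotations"])
    return dict(counts)


if __name__ == "__main__":
    # Assumed location of the extracted dataset; adjust as needed.
    ann_file = Path("M6Doc") / "annotations" / "instances_train2017.json"
    if ann_file.exists():
        per_category = count_instances_per_category(ann_file)
        # Print categories from most to least frequent.
        for name, n in sorted(per_category.items(), key=lambda kv: -kv[1]):
            print(f"{name}: {n}")
    else:
        print(f"Annotation file not found: {ann_file}")
```

If the files do follow this layout, they should also load with standard COCO tooling.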

Citation and Contact

Please consider citing our paper if you use our dataset:

@InProceedings{Cheng_2023_CVPR,
    author    = {Cheng, Hiuyi and Zhang, Peirong and Wu, Sihang and Zhang, Jiaxin and Zhu, Qiyuan and Xie, Zecheng and Li, Jing and Ding, Kai and Jin, Lianwen},
    title     = {M6Doc: A Large-Scale Multi-Format, Multi-Type, Multi-Layout, Multi-Language, Multi-Annotation Category Dataset for Modern Document Layout Analysis},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2023},
    pages     = {15138-15147}
}

For any questions about the dataset, please contact the authors by emailing Prof. Jin ([email protected] or [email protected]).
