Dedoc

Dedoc is an open universal system for converting documents to a unified output format. It extracts a document's logical structure and content, along with its tables, text formatting and metadata. The document's content is represented as a tree that stores headings and lists of any nesting level. Dedoc can be integrated as a separate module into a system that analyzes document content and structure.
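The tree representation described above can be pictured with a small sketch. The `Node` class, its `kind` labels and the `outline` helper are hypothetical names chosen for illustration; they are not Dedoc's actual output schema, which is described in the project documentation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical tree node; NOT Dedoc's real output schema."""
    text: str
    kind: str  # illustrative labels: "root", "header", "list_item"
    children: List["Node"] = field(default_factory=list)


def outline(node: Node, depth: int = 0) -> List[str]:
    """Return one indented line per node, visiting the tree depth-first."""
    lines = ["  " * depth + f"[{node.kind}] {node.text}"]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


# A toy document: headings contain list items, which may nest further.
doc = Node("Report", "root", [
    Node("1. Introduction", "header", [
        Node("Goals", "list_item", [Node("a) speed", "list_item")]),
    ]),
    Node("2. Results", "header"),
])

print("\n".join(outline(doc)))
```

Because every node holds its own list of children, headings and list items can nest to any depth without changing the data structure.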

Features and advantages

Dedoc is implemented in Python and works with semi-structured data formats (DOC/DOCX, ODT, XLS/XLSX, CSV, TXT, JSON) and unstructured data formats such as images (PNG, JPG etc.), archives (ZIP, RAR etc.), PDF and HTML. Document structure extraction is fully automatic regardless of the input data type. Metadata and text formatting are also extracted automatically.
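As a rough illustration of the two format groups listed above, the following sketch sorts a file name into one of them by its extension. The `format_group` function and the extension sets are assumptions made for this example; Dedoc's own format detection may work differently and supports more extensions than the ones named here.

```python
from pathlib import Path

# Extensions taken from the formats named in the text; "etc." entries are omitted.
SEMI_STRUCTURED = {".doc", ".docx", ".odt", ".xls", ".xlsx", ".csv", ".txt", ".json"}
UNSTRUCTURED = {".png", ".jpg", ".zip", ".rar", ".pdf", ".html"}


def format_group(filename: str) -> str:
    """Hypothetical helper: classify a file by its (case-insensitive) extension."""
    suffix = Path(filename).suffix.lower()
    if suffix in SEMI_STRUCTURED:
        return "semi-structured"
    if suffix in UNSTRUCTURED:
        return "unstructured"
    return "unknown"
```

For example, `format_group("Report.DOCX")` falls into the semi-structured group, while a file with an unlisted extension is reported as unknown.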

In 2022, the system won a grant to support the development of promising AI projects from the Innovation Assistance Foundation (Фонд содействия инновациям).

Dedoc provides:

  • Extensibility: new document formats can be added flexibly, and the output data format is easy to change.
  • Support for extracting document structure out of nested documents having different formats.
  • Extracting various text formatting features (indentation, font type, size, style etc.).
  • Working with documents of various origins (statements of work, legal documents, technical reports, scientific papers), with flexible tuning for new domains.
  • Working with PDF documents containing a text layer:
    • Automatically determining whether the text layer of a PDF document is correct;
    • Extracting content and formatting from PDF documents with a text layer using a purpose-built interpreter of the virtual stack machine for printing graphics, following the format specification.
  • Extracting table data from DOC/DOCX, PDF, HTML, CSV and image formats:
    • Recognizing the physical structure and cell text of complex multipage tables with explicit borders by means of contour analysis.
  • Working with scanned documents (image formats and PDF without text layer):
    • Using Tesseract, an actively developed OCR engine from Google, together with image preprocessing methods.
    • Using modern machine learning approaches to detect document orientation, to distinguish single-column from multicolumn pages, to detect bold text, and to extract hierarchical structure by classifying features extracted from document images.

This project may be useful as the first step of an automatic document analysis pipeline (e.g. before the NLP part).

This project has a REST API, and you can run it in a Docker container. To read the full Dedoc documentation, run the project and go to localhost:1231.

Run the project

How to build and run the project

Ensure you have Git and Docker installed

Clone the project

git clone https://github.com/ispras/dedoc.git

cd dedoc/

Start Dedoc on port 1231:

docker-compose up --build

Start Dedoc with tests:

tests="true" docker-compose up --build

Now you can open localhost:1231 in a browser to view the documentation and examples.

You can change the port and host in the config file 'dedoc/config.py'.
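The container can take a while to build and start, so a script that depends on the service may want to wait until the port accepts connections. The sketch below uses only the Python standard library; `wait_for_service` is a hypothetical helper, and the default host and port simply mirror the values mentioned above. It checks only that a TCP connection succeeds, not that the API itself is healthy.

```python
import socket
import time


def wait_for_service(host: str = "localhost", port: int = 1231,
                     attempts: int = 30, delay: float = 1.0) -> bool:
    """Hypothetical helper: return True once a TCP connection to host:port succeeds.

    Tries up to `attempts` times, sleeping `delay` seconds after each failure.
    """
    for _ in range(attempts):
        try:
            # create_connection raises OSError (e.g. connection refused) on failure.
            with socket.create_connection((host, port), timeout=2.0):
                return True
        except OSError:
            time.sleep(delay)
    return False
```

Calling `wait_for_service()` after `docker-compose up --build` returns `True` as soon as something is listening on port 1231, or `False` if nothing answers within the allotted attempts.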

Contributors

anfedotoff, dronperminov, ilyakozlov, nastyboget, oksidgy, sunveil
