prodigy-pdf-custom-recipe's Introduction

🪐 spaCy Project: Prodigy recipes for document processing and layout understanding

This repository contains recipes that show how to use Prodigy and Hugging Face to annotate, train on, and review document layout datasets. We'll be fine-tuning a LayoutLMv3 model on FUNSD, a dataset of noisy scanned documents.

This also serves as an illustration of how to design document processing solutions. I attempted to generalize this approach into a framework, which you can read more about on my blog.

📋 project.yml

The project.yml defines the data assets required by the project, as well as the available commands and workflows. For details, see the spaCy projects documentation.

⏯ Commands

The following commands are defined by the project. They can be executed using spacy project run [name]. Commands are only re-run if their inputs have changed.

| Command | Description |
| --- | --- |
| install | Install dependencies |
| hydrate-db | Hydrate the Prodigy database with annotated data from FUNSD |
| review | Review hydrated annotations |
| train | Train FUNSD model |
| qa | Perform QA for the test dataset using a trained model |
| clean-db | Drop all generated Prodigy datasets |
| clean-files | Clean all intermediary files |

⏭ Workflows

The following workflows are defined by the project. They can be executed using spacy project run [name] and will run the specified commands in order. Commands are only re-run if their inputs have changed.

| Workflow | Steps |
| --- | --- |
| all | install → hydrate-db → train |
| clean-all | clean-db → clean-files |
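
As a rough illustration of what a workflow amounts to, the sketch below expands a workflow name into the individual `spacy project run` invocations, in the order listed in the table. The names `WORKFLOWS` and `expand_workflow` are hypothetical helpers made up for this example and are not part of the project. The sketch only mirrors the step order; it does not reproduce spaCy's check for whether a command's inputs have changed.

```python
from __future__ import annotations

import shlex

# Workflow names and step order copied from the workflows table.
# WORKFLOWS and expand_workflow are illustrative names, not project code.
WORKFLOWS = {
    "all": ["install", "hydrate-db", "train"],
    "clean-all": ["clean-db", "clean-files"],
}


def expand_workflow(name: str) -> list[str]:
    """Return one `spacy project run` command line per step, in order."""
    steps = WORKFLOWS[name]
    return [
        shlex.join(["python", "-m", "spacy", "project", "run", step])
        for step in steps
    ]


if __name__ == "__main__":
    for line in expand_workflow("all"):
        print(line)
```

Note that `review` and `qa` do not appear in either workflow, so they would need to be run individually with `spacy project run review` or `spacy project run qa`.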

🗂 Assets

The following assets are defined by the project. They can be fetched by running spacy project assets in the project directory.

| File | Source | Description |
| --- | --- | --- |
| assets/funsd.zip | URL | FUNSD dataset - noisy scanned documents for layout understanding |

prodigy-pdf-custom-recipe's People

Contributors

ljvmiranda921

prodigy-pdf-custom-recipe's Issues

Support for multi-page PDF documents?

Hi @ljvmiranda921, I came here after reading your beautifully written article "A framework for designing document processing solutions". Thank you for sharing! I have some PDF documents that I want to perform custom NER on; they include both single-page and multi-page documents.

I have a few questions:

  • Does your workflow support NER on multi-page documents as well?
  • Would I have to convert all the documents into images first, store them in a directory, and then feed the images into your pipeline for annotation and training?
  • Will I need to split the dataset for training and testing myself, or will Prodi.gy do it for me?

Just got my Prodi.gy license today and still working on learning the tool. Thanks!
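
On the splitting question: nothing on this page settles whether the tooling does the split, so the sketch below only shows one way to do it by hand. It assumes each page image is represented as a dict with a hypothetical `doc_id` key, and it keeps all pages of a multi-page document on the same side of the split so that pages from one document never land in both train and test. The function name `split_by_document` and the record layout are made up for this example.

```python
import random
from typing import Dict, List, Tuple


def split_by_document(
    pages: List[Dict], test_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[Dict], List[Dict]]:
    """Split page records into (train, test), grouping pages by document.

    Each record is assumed to carry a "doc_id" key (a hypothetical field).
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")

    # Sort before shuffling so the result depends only on the seed.
    doc_ids = sorted({page["doc_id"] for page in pages})
    if len(doc_ids) < 2:
        raise ValueError("need at least two documents to split")

    random.Random(seed).shuffle(doc_ids)
    # Hold out at least one document, but never all of them.
    n_test = min(len(doc_ids) - 1, max(1, round(len(doc_ids) * test_fraction)))
    test_ids = set(doc_ids[:n_test])

    train = [page for page in pages if page["doc_id"] not in test_ids]
    test = [page for page in pages if page["doc_id"] in test_ids]
    return train, test
```

A fixed `seed` keeps the split reproducible between runs, which makes evaluation numbers comparable.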

Mac support

When running on macOS, the command should be changed from:

"sudo apt install tesseract-ocr -y"

to:

"brew install tesseract"

Maybe there's a way to detect the platform?
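
One possible way to detect the platform, as a minimal sketch: Python's standard `platform.system()` returns `"Linux"` on Linux and `"Darwin"` on macOS, so the install command can be looked up from that value. The mapping `INSTALL_COMMANDS` and the function `tesseract_install_command` are hypothetical names for this example, and the Linux entry assumes a Debian- or Ubuntu-style system where `apt` is available.

```python
import platform
from typing import Optional

# The two commands quoted in this issue, keyed by platform.system() values.
# The "Linux" entry assumes an apt-based distribution.
INSTALL_COMMANDS = {
    "Linux": "sudo apt install tesseract-ocr -y",
    "Darwin": "brew install tesseract",
}


def tesseract_install_command(system: Optional[str] = None) -> str:
    """Return the Tesseract install command for the given or current OS."""
    system = system or platform.system()
    try:
        return INSTALL_COMMANDS[system]
    except KeyError:
        raise RuntimeError(
            f"No Tesseract install command known for {system!r}"
        ) from None


if __name__ == "__main__":
    print(tesseract_install_command())
```

Failing loudly on an unknown platform seems safer than guessing a package manager.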