
mindee / doctr


docTR (Document Text Recognition) - a seamless, high-performing & accessible library for OCR-related tasks powered by Deep Learning.

Home Page: https://mindee.github.io/doctr/

License: Apache License 2.0

Python 99.67% Dockerfile 0.22% Makefile 0.11%
ocr deep-learning document-recognition tensorflow2 text-detection-recognition text-detection text-recognition optical-character-recognition pytorch

doctr's People

Contributors

aminemindee, atomme1, carl-krikorian, charlesmindee, chunyuan-w, dependabot[bot], eikaramba, eltociear, felixdittrich92, felixt2k, ffalkenberg, fg-mindee, fharper, fmobrj, frgfm, hamzagbada, ianardee, jsn5, kforcodeai, khalidmindee, khanfarhan10, mara004, mtvch, mzeidhassan, odulcy-mindee, osanseviero, rbmindee, rob192, siddhantbahuguna, skaarfacee


doctr's Issues

[models] Implement an Artefact object detector

Following up on #15, I believe we should consider artefact detection as a major feature. For now, the different artefacts we would consider are:

  • Check boxes
  • QR Codes (#259)
  • Bar codes (#260)
  • Signature / Signed initials
  • Pictures (faces #258)
  • Logo / watermarks

Training an object detection model might be a good option to start from!

[models] Add text recognition module

Design a model subpart responsible for identifying text strings inside the regions of interest of an image.

Input

  • images: Numpy-style encoded (cropped) images (already read), expected to hold a single character sequence

Output

  • text: list of N strings, where N = number of cropped input images

The following components would be required:

  • Preprocessor (#20, #33)
  • RecognitionModel (#35)
  • RecognitionProcessor (#37)
  • RecognitionPredictor (#39)
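To make the expected interface concrete, here is a minimal sketch (not the actual implementation) of how such a predictor could chain these components; the component names mirror the list above and the call signatures are assumptions:

import numpy as np
from typing import List

class RecognitionPredictor:
    # Hypothetical wrapper chaining pre-processing, model and post-processing
    def __init__(self, pre_processor, model, post_processor):
        self.pre_processor = pre_processor
        self.model = model
        self.post_processor = post_processor

    def __call__(self, crops: List[np.ndarray]) -> List[str]:
        # Batch & normalize the crops, run the model, then decode the outputs into strings
        batches = self.pre_processor(crops)
        raw = [self.model(batch) for batch in batches]
        return [word for batch in raw for word in self.post_processor(batch)]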

[utils] Move doctr dependencies out of doctr.utils

The doctr.utils.visualization module introduces a dependency on other modules, which might be troublesome later. Several options are at our disposal:

  • remove these imports, and implement the specific version in the modules of the former dependency
  • use exported dictionary versions of the elements to plot (typing would then not require imports from doctr.documents)

Any other suggestion is welcome!

[models] detect page orientation

We should be able to detect the rotation of a page (angle from 0 to 359°) so as to straighten it before sending it to the OCR.
This would greatly improve our predictor on "tricky" datasets where pages are often rotated.

Some documents are very complex and have areas of text with different orientations, but even for these documents we can define a main orientation for the page (most of the lines would be oriented this way).

This leads us to define 2 levels of orientation:

  • Page orientation: for most documents, the orientation shared by all text lines; for tricky documents, the main orientation of the lines (the one most lines follow).
  • Box/line/block-level orientation: once the document has been rotated (after page orientation detection), find the areas that are still rotated and straighten only those areas.

Any suggestion is welcome! I think we should first implement page orientation, which should be relatively easy, and then focus on the other part, which is far trickier.

@fg-mindee
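As a starting point for the first level, here is a rough OpenCV sketch of small-skew estimation (an assumption, not a retained approach): it only recovers angles close to 0°, so the full 0-359° case would still need a dedicated model.

import cv2
import numpy as np

def estimate_small_skew(page: np.ndarray) -> float:
    # Threshold the page so text pixels become foreground
    gray = cv2.cvtColor(page, cv2.COLOR_BGR2GRAY)
    _, thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    # Fit a minimum-area rectangle around all foreground pixels
    coords = np.column_stack(np.where(thresh > 0)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]
    # Map the rectangle angle to a deskew angle (OpenCV < 4.5 angle convention)
    return -(90 + angle) if angle < -45 else -angle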

[ci] Setup basic PR checks

The following should be setup:

  • lint checking using flake8
  • typing annotation check with mypy
  • unittests using pytest and coverage
  • package installation verification

[models] Models should accept list of pages as inputs and not list of documents

The current behaviour overcomplicates things since a document is already a list of pages.
The only advantage of the current behaviour is to save potentially 1 forward. Assuming we have 1 document with A pages and a second document with B pages, and our batch size N:

  • currently, we do math.ceil((A + B) / N) forwards
  • the proposition would have math.ceil(A / N) + math.ceil(B / N) forwards

The real advantage only shows when A and B are smaller than N: for instance, with A = B = 2 and N = 4, the current behaviour needs a single forward while the proposition needs two.

The interface would be much cleaner though.

Write a conversion function from Image + bounding boxes to cropped images

The detection block is currently returning a list of bounding boxes, while the recognition block is actually using cropped images. The recognition pre-processing needs to handle this.

Input

  • images: Numpy-style encoded images (already read)
  • bounding boxes: list of tensors of predictions of size N*4 (xmin, ymin, xmax, ymax)

Output

  • cropped images: N cropped images
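A minimal sketch of such a conversion, assuming the boxes are already expressed in absolute pixel coordinates (relative coordinates would simply need to be rescaled first):

import numpy as np
from typing import List

def extract_crops(image: np.ndarray, boxes: np.ndarray) -> List[np.ndarray]:
    # boxes is expected to be of shape (N, 4) with (xmin, ymin, xmax, ymax) rows
    crops = []
    for xmin, ymin, xmax, ymax in boxes.round().astype(int):
        crops.append(image[ymin:ymax, xmin:xmax])
    return crops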

[models] BN layers call behaviour depends on a 2nd argument

According to the TF documentation, BN layers have a second argument in their call method which changes their behaviour. This raises the following questions:

  • how do we pass down this information in a Sequential for instance?
  • if possible, shouldn't this be a layer attribute that we can switch (bn.training = False), since we won't want to change this behaviour upon each call?
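For reference, the behaviour in question can be illustrated with tf.keras as below; whether a Sequential propagates the flag cleanly to every sub-layer is exactly the open question above.

import tensorflow as tf

bn = tf.keras.layers.BatchNormalization()
x = tf.random.normal((4, 8))

# The `training` argument of call() switches the behaviour: batch statistics
# are used when training=True, the moving averages otherwise.
y_train = bn(x, training=True)
y_eval = bn(x, training=False)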

[conda] Unable to make a conda build

Unfortunately, one of the project dependencies does not have any conda release or any way to make one. I opened an issue on their repo pymupdf/PyMuPDF#938 to track this, but so far I haven't found any way to release the project on anaconda with this dependency.

ImportError: cannot import name 'DocumentFile' from 'doctr.documents'

Whenever I try to execute the same code presented in the main page, I get this error : "ImportError: cannot import name 'DocumentFile' from 'doctr.documents' (/Users/Aksol/miniconda3/lib/python3.8/site-packages/doctr/documents/init.py)".
I can't find a solution around it, I tried cloning the repo in Google Colab and I still get the same error.
Has anyone come across the same problem and found a solution for it?
Thank you

[documents] Harmonize file reading

A package user has to import different functions to read a file depending on its extension and reading means (from path, from bytes). This needs serious refactoring.

  • Handle reading means (#172)
  • Handle extensions (#172)

[models] Profiling models for temporal optimization

Line Profiler (https://github.com/pyutils/line_profiler) is used to perform model profiling, highlighting the most time-consuming lines when our models are running.
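For context, a typical way to use it looks like the sketch below; the profiled function here is a toy placeholder, but any model call can be wrapped the same way.

from line_profiler import LineProfiler

def toy_workload(n: int) -> int:
    # Stand-in for a model call; profile any function the same way
    total = 0
    for i in range(n):
        total += i * i
    return total

lp = LineProfiler()
profiled = lp(toy_workload)  # wrap the function to inspect
profiled(100_000)            # run the workload as usual
lp.print_stats()             # per-line hit counts and timings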

For an OCRPredictor with a sar_vgg16_bn coupled with a db_resnet50, we have the following analysis:

  • the whole recognition model (SAR) accounts for 85% of the total execution time (OCR predictor)
  • the whole detection model (DB) accounts for 15% of the total execution time

More precisely:

  • Recognition task: almost 100% of the time is spent inside the model (without pre/post-processing). The distribution inside the model is as follows: 22.5% for the feature extractor (VGG16 with BN), 17.5% for the encoder (2 LSTM layers with 512 hidden units), and 60% for the decoder. The decoder has 2 main time-consuming tasks: 31% goes to the LSTM decoder (2 layers of stacked LSTM cells with 512 hidden units), and 62% goes to the attention module; inside this module, 87% of the time is spent in the conv2D operation (kernel 3x3, stride=1) which encodes the feature map (N, H, W, feature_units) --> (N, H, W, attention_units=512). We can conclude for the recognition task that:
  1. 27.5% of the total execution time (OCR end-to-end) is spent in this conv2D layer, which can be reduced by decreasing the number of attention units from 512 to 256, maybe less.
  2. 15% of the total execution time is spent in the encoder and another 16% in the LSTM decoder. Thus, LSTM layers account for more than 30% of the total execution time. This can be improved by reducing the number of encoding/decoding layers from 2 to 1, and by reducing the number of hidden RNN units inside the LSTM cells from 512 to 256 for instance.

It is important to highlight that we have significant leverage on the execution time of the whole model by tuning only 2 hyper-parameters, the attention units and the hidden units: almost 60% of the end-to-end execution time is directly impacted.

  • Detection task: 99% of the time is spent inside the model and 1% in the post-processing. Inside the model, 65% is spent in the feature extractor (ResNet50), which accounts for almost 10% of the whole pipeline. This time can be reduced by using a lighter ResNet such as ResNet18 for instance. The remaining 35% is spent in the pyramidal module and in the computation of the probability map.

TypeError: Expected Ptr<cv::UMat> for argument 'array' when using read_pdf()

Can't execute read_pdf() function.

See the pdf file sent over slack to reproduce.

python 3.6.9

VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
  return array(a, dtype, copy=False, order=order)
Traceback (most recent call last):
  File "/home/jonathan/mindee/dev/client_tests//test_classifier/main.py", line 17, in <module>
    result = model([doc])
  File "/home/jonathan/mindee/dev/client_tests//test_classifier/doctr/doctr/models/core.py", line 51, in __call__
    boxes = self.det_predictor(pages, **kwargs)
  File "/home/jonathan/mindee/dev/client_tests//test_classifier/doctr/doctr/models/detection/core.py", line 140, in __call__
    out = [self.post_processor(batch) for batch in out]
  File "/home/jonathan/mindee/dev/client_tests//test_classifier/doctr/doctr/models/detection/core.py", line 140, in <listcomp>
    out = [self.post_processor(batch) for batch in out]
  File "/home/jonathan/mindee/dev/client_tests//test_classifier/doctr/doctr/models/detection/differentiable_binarization.py", line 173, in __call__
    boxes = self.bitmap_to_boxes(pred=p_, bitmap=bitmap_)
  File "/home/jonathan/mindee/dev/client_tests//test_classifier/doctr/doctr/models/detection/differentiable_binarization.py", line 140, in bitmap_to_boxes
    _box = self.polygon_to_box(points)
  File "/home/jonathan/mindee/dev/client_tests//test_classifier/doctr/doctr/models/detection/differentiable_binarization.py", line 106, in polygon_to_box
    x, y, w, h = cv2.boundingRect(expanded_points)  # compute a 4-points box from expanded polygon
TypeError: Expected Ptr<cv::UMat> for argument 'array'

ValueError on model call

🐛 Bug

When calling the model on a document, I get a ValueError.

To Reproduce

Steps to reproduce the behavior:

from doctr.documents import read_pdf 
from doctr.models import ocr_db_crnn 
model = ocr_db_crnn(pretrained=True)                                                                                                                                                                
doc = read_pdf("path/to/pdf") # write your path to the pdf 
result = model([doc]) 
~/tensorflow_2/lib/python3.6/site-packages/doctr/models/core.py in __call__(self, documents, **kwargs)
     53         # Reorganize
     54         num_pages = [len(doc) for doc in documents]
---> 55         results = self.doc_builder(boxes, char_sequences, num_pages, [page.shape[:2] for page in pages])
     56 
     57         return results

~/tensorflow_2/lib/python3.6/site-packages/doctr/models/core.py in __call__(self, boxes, char_sequences, num_pages, page_shapes)
    204                         self._build_blocks(
    205                             page_boxes[:num_crops[page_idx]],
--> 206                             char_sequences[crop_idx: crop_idx + num_crops[page_idx]]
    207                         ),
    208                         page_idx,

~/tensorflow_2/lib/python3.6/site-packages/doctr/models/core.py in _build_blocks(self, boxes, char_sequences)
    164                         ((boxes[idx, 0], boxes[idx, 1]), (boxes[idx, 2], boxes[idx, 3]))
    165                     ) for idx in line]
--> 166                 ) for line in lines]
    167             )
    168         ]

~/tensorflow_2/lib/python3.6/site-packages/doctr/models/core.py in <listcomp>(.0)
    164                         ((boxes[idx, 0], boxes[idx, 1]), (boxes[idx, 2], boxes[idx, 3]))
    165                     ) for idx in line]
--> 166                 ) for line in lines]
    167             )
    168         ]

~/tensorflow_2/lib/python3.6/site-packages/doctr/documents/elements.py in __init__(self, words, geometry)
    108         # Resolve the geometry using the smallest enclosing bounding box
    109         if geometry is None:
--> 110             geometry = resolve_enclosing_bbox([w.geometry for w in words])
    111 
    112         super().__init__(words=words)

~/tensorflow_2/lib/python3.6/site-packages/doctr/utils/geometry.py in resolve_enclosing_bbox(bboxes)
     20 
     21 def resolve_enclosing_bbox(bboxes: List[BoundingBox]) -> BoundingBox:
---> 22     x, y = zip(*[point for box in bboxes for point in box])
     23     return ((min(x), min(y)), (max(x), max(y)))

ValueError: not enough values to unpack (expected 2, got 0)

Example of pdf causing the error

000f51e2-b734-48d9-968f-e77e3b022844.pdf

Add __repr__ to main classes of the package

Existing classes of the package don't have any __repr__ set, which makes it difficult for new users to understand what each object is composed of.

The following classes would greatly benefit from adding a repr:

  • Elements and all classes inheriting from it (#102)
  • Models (#102)
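As an illustration only (the class and attribute names are hypothetical), a repr along these lines would make printed objects self-describing:

class Word:
    # Hypothetical element carrying a value and a confidence score
    def __init__(self, value: str, confidence: float):
        self.value = value
        self.confidence = confidence

    def __repr__(self) -> str:
        # Expose the main attributes so that print(word) is informative
        return f"Word(value='{self.value}', confidence={self.confidence:.2f})"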

Pb: unittest test_export_sizes not passing on tf 2.3.1

The unittest test_export_sizes fails locally on tf 2.3.1:

def test_export_sizes(test_convert_to_tflite, test_convert_to_fp16, test_quantize_model):
        assert sys.getsizeof(test_convert_to_tflite) > sys.getsizeof(test_convert_to_fp16)
>       assert sys.getsizeof(test_convert_to_fp16) > sys.getsizeof(test_quantize_model)
E       AssertionError: assert 3041 > 3041

[scripts] Add a script for environment collection

To be able to systematically identify sources of reported issues, each bug report should come with a description of the user's environment. A script would be required to collect among others:

  • Python version
  • Tensorflow version
  • NVIDIA driver version
  • CUDA version

The user would only have to paste the result in the issue description.
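A minimal sketch of what such a script could look like (the exact fields and commands are assumptions; CUDA detection is omitted here):

import platform
import subprocess

def collect_env() -> str:
    lines = [f"Python: {platform.python_version()}"]
    try:
        import tensorflow as tf
        lines.append(f"TensorFlow: {tf.__version__}")
    except ImportError:
        lines.append("TensorFlow: not installed")
    try:
        # Query the NVIDIA driver version through nvidia-smi
        smi = subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        )
        lines.append(f"NVIDIA driver: {smi.stdout.strip()}")
    except (FileNotFoundError, subprocess.CalledProcessError):
        lines.append("NVIDIA driver: not found")
    return "\n".join(lines)

if __name__ == "__main__":
    print(collect_env())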

[encoding] Unify string encoding for datasets and models

As of the latest commit, models are trained with a string encoder that is not particularly ordered (TF records). To clean all of this up, I suggest the following:

  • Define a pre-established vocab that we will use as the encoder for our future models (#116)
  • Address the topic of vocab mapping (converting an accented character to its raw version for instance)
  • Standardize dataset label encoding using a given vocab (#116)

On the topic of vocab definition, there are two aspects to consider: the character selection and their order in the vocab.

cc @charlesmindee
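To make the proposal concrete, here is a minimal sketch of vocab-based label encoding; the vocab content and its order are placeholders, since they are precisely what this issue proposes to standardize.

from typing import List

VOCAB = "abcdefghijklmnopqrstuvwxyz0123456789"  # placeholder character set & order

def encode_sequence(text: str, vocab: str = VOCAB) -> List[int]:
    # Map every character of a label onto its index in the shared vocab
    return [vocab.index(char) for char in text]

def decode_sequence(indices: List[int], vocab: str = VOCAB) -> str:
    # Inverse mapping, so that datasets and models agree on the same encoding
    return "".join(vocab[idx] for idx in indices)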

[documents] Check input PDFs with PyMuPDF when they contain source content

So far, we have used the content reading feature of PyMuPDF. When a source PDF is read, the library actually extracts all the localization and text information from the document. Two options are then available:

  • skip the model inference and use this information directly
  • combine it with the model inference to improve our predictions
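For reference, a short sketch of how the text layer could be read with PyMuPDF (assuming a recent release where get_text is available); the words come back with their bounding boxes, which could either replace or refine the model predictions.

import fitz  # PyMuPDF

doc = fitz.open("path/to/source.pdf")  # placeholder path
for page in doc:
    # Each entry is (x0, y0, x1, y1, word, block_no, line_no, word_no)
    for x0, y0, x1, y1, word, *_ in page.get_text("words"):
        print(word, (x0, y0, x1, y1))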

[documents] Page elements could also be QR Code, Pictures, etc.

The current design for page elements only considers text, while actual documents contain a much wider variety of page elements. Among others, the structure should integrate non-text elements such as:

  • QR Codes
  • Bar codes
  • Pictures
  • Signature / Signed initials
  • Watermarks / logo

In the end, we would need to:

  • Select the artefact types to be supported by the library
  • Implement the integration within the existing structure (#26)

[models] Add full OCR module

Design an object that will wrap all DL model components and be responsible for localizing and identifying all text elements in documents.

Inputs

  • a collection of documents, where each document is a list of pages, themselves expressed as numpy-encoded images.

Outputs

  • a collection of Document objects

The following components will be required:

  • DetectionPredictor (#39)
  • RecognitionPredictor (#39)
  • OCRPredictor (#39)
  • ElementExporter (#16, #26, #40)
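To fix ideas, here is a minimal sketch of how these components could be chained; the component behaviours and signatures are assumptions, and the crop extraction relies on a helper like the one sketched in the cropping issue above.

from typing import List
import numpy as np

class OCRPredictor:
    # Hypothetical wrapper around the detection & recognition predictors
    def __init__(self, det_predictor, reco_predictor, doc_builder):
        self.det_predictor = det_predictor
        self.reco_predictor = reco_predictor
        self.doc_builder = doc_builder

    def __call__(self, documents: List[List[np.ndarray]]):
        pages = [page for doc in documents for page in doc]  # flatten into pages
        boxes = self.det_predictor(pages)                    # localize text regions
        crops = [crop for page, page_boxes in zip(pages, boxes)
                 for crop in extract_crops(page, page_boxes)]
        words = self.reco_predictor(crops)                   # read the character sequences
        num_pages = [len(doc) for doc in documents]
        return self.doc_builder(boxes, words, num_pages, [p.shape[:2] for p in pages])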

[models] Add a pretrained checkpoint loading mechanism

Architecture definitions are already available in the repo, but it still lacks a way to load a set of pretrained parameters. This should be tackled for the upcoming release. Here are the requirements:

  • Select a checkpoint format: .ckpt
  • Implement necessary methods/functions to save and load the checkpoint: keras inherited (#49)
  • Ensure data integrity (SHA256 hash most likely) (#49)
  • Add a first pretrained model (#49)
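A minimal sketch of such a loading mechanism, assuming the pretrained parameters ship as a single file whose SHA256 is published alongside it (the real format and hashing granularity are still to be decided):

import hashlib
from pathlib import Path

import tensorflow as tf

def load_pretrained(model: tf.keras.Model, ckpt_path: str, expected_sha256: str) -> None:
    # Verify the integrity of the downloaded file before restoring the weights
    digest = hashlib.sha256(Path(ckpt_path).read_bytes()).hexdigest()
    if digest != expected_sha256:
        raise ValueError("Checkpoint hash mismatch, the file may be corrupted")
    # Keras-inherited loading, as suggested in the checklist above
    model.load_weights(ckpt_path)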

[metrics] Benchmark python options for text-distance computing

Options for computing distance between 2 character sequences in python:

  • textdistance: full python lib
  • jellyfish: full python lib
  • strsimpy: full python lib
  • python-Levenshtein: C lib
  • edlib: C++ lib (python binding)
  • RapidFuzz: C++/python lib
  • polyleven: C/python lib

All these libs provide algorithms such as Levenshtein distance, Hamming distance, Jaro-Winkler distance, etc. to compute the distance between 2 strings (character sequences).

This graph taken from the RapidFuzz documentation highlights the strong dependence of runtime on string length:


We can see, for example, that python-Levenshtein is faster than edlib on short sequences, but much slower on long ones. In our case, we typically use character sequences of 0 to 30 characters, very often between 0 and 15 characters and almost never more than 30. According to these plots, polyleven, python-Levenshtein and RapidFuzz seem to be the fastest solutions on these typical lengths.

I conducted a short study: I called the Levenshtein distance 1000 times for each of these packages, on 2 strings of between 15 and 20 characters. Here are the results for 1000 iterations:

  • RapidFuzz / python-Levenshtein: 0.0002 s
  • Polyleven: 0.0003 s
  • Jellyfish: 0.0009 s
  • edlib: 0.002 s
  • textdistance: 0.0064 s
  • strsimpy: 0.09 s

Conclusion:
Jellyfish is fast for a full-Python lib. However, as expected, C libs are faster, especially RapidFuzz and python-Levenshtein on these sequence lengths. In terms of dependencies, python-Levenshtein is really light whereas RapidFuzz is heavier (C++ files).
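For reproducibility, the timing study roughly amounts to the sketch below (assuming a recent RapidFuzz release where the Levenshtein helpers live in rapidfuzz.distance, and the python-Levenshtein package installed):

import time

import Levenshtein as py_levenshtein         # python-Levenshtein (C)
from rapidfuzz.distance import Levenshtein as rf_levenshtein

a, b = "reconnaissance", "recognition"       # sequences of ~15 characters

for name, dist in [("rapidfuzz", rf_levenshtein.distance),
                   ("python-Levenshtein", py_levenshtein.distance)]:
    start = time.perf_counter()
    for _ in range(1000):
        dist(a, b)
    print(f"{name}: {time.perf_counter() - start:.4f} s")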

@fg-mindee

[documents] Add basic export module

The output object type of a document analysis should be defined as follows:

  • a structured hierarchy as stated in the design doc
  • export methods to different formats

[utils] Add visualization utilities

Users cannot easily visualize the results of OCR models. Some steps would have to be taken to handle this issue properly:

  • Establish a visualization design (considering the density of predictions to display)
  • Implement it in the doctr.utils module (#54)

Here is a nice thread about matplotlib dynamic display: https://stackoverflow.com/questions/7908636/possible-to-make-labels-appear-when-hovering-over-a-point-in-matplotlib
And a matplotlib-compatible library: https://mplcursors.readthedocs.io/en/stable/examples/hover.html
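As a static starting point (the dynamic/hover part would build on mplcursors), here is a minimal matplotlib sketch overlaying boxes on a page; the relative-coordinate convention is an assumption.

import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle

def show_boxes(image, boxes):
    # boxes are assumed relative: (xmin, ymin, xmax, ymax) in [0, 1]
    height, width = image.shape[:2]
    fig, ax = plt.subplots()
    ax.imshow(image)
    for xmin, ymin, xmax, ymax in boxes:
        ax.add_patch(Rectangle((xmin * width, ymin * height),
                               (xmax - xmin) * width, (ymax - ymin) * height,
                               fill=False, edgecolor="red", linewidth=1))
    plt.show()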

[models] Implement HRGAN or MASTER

This paper suggests a new architecture: a Holistic Representation Guided Attention Network for text recognition, inspired by transformers, which outperforms SAR in both accuracy & speed.

We should implement this model, but the impressive speed results should be handled carefully (8x speed-up compared to SAR), since the experiments were conducted on a GPU and this model is highly parallelizable (no recurrence). Is this new model as fast on CPU?

[models] Add model compression utils

Add a doctr.models.utils module to compress existing models and improve their latency / memory load for inference purposes on CPU. Some interesting leads to investigate:

Optional: TensorRT export (cf. https://developer.nvidia.com/blog/speeding-up-deep-learning-inference-using-tensorflow-onnx-and-tensorrt/)
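One such lead, sketched below with a placeholder model, is TFLite post-training (dynamic-range) quantization; the exact compression strategy for doctr.models.utils is still open.

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(32,)),   # placeholder model, any Keras model works
    tf.keras.layers.Dense(10),
])

# Convert to TFLite with default (dynamic-range) quantization
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_bytes = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_bytes)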

[api] Add a minimal API setup with DocTR

Since the library can be used to build an API, we could add a minimal codebase for users to deploy a light API for OCR on documents.

Here is a more detailed suggestion:

  • put everything in an api folder
  • implement it with FastAPI
  • implement a POST method "/analyze" that returns the OCR results

spread into 3 PRs:

  • Text recognition route (#242)
  • Text detection route (#245)
  • OCR route (#247)
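A minimal sketch of what the analyze route could look like with FastAPI; the OCR call itself is stubbed out, since the exact doctr entry point is part of the design discussion.

from fastapi import FastAPI, File, UploadFile

app = FastAPI(title="docTR API")

def run_ocr(content: bytes) -> dict:
    # Placeholder for the actual doctr pipeline call
    return {"pages": []}

@app.post("/analyze")
async def analyze(file: UploadFile = File(...)):
    # Read the uploaded document and return the OCR result as JSON
    return run_ocr(await file.read())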

[models] Properly design the preprocessing for detection & recognition

Let's review the raw inputs, and all the transformations we need to apply for the model to process them correctly.
Inputs

  • List of images, where each image is expressed as a numpy.ndarray of arbitrary shape (H, W, 3) and encoded in np.uint8

Transformations
Please note that the order below is important:

Detection

  • convert to tf.Tensor
  • resize each image to a fixed height & width by bilinear interpolation using TF
  • batch
  • cast to tf.float32
  • divide by 255
  • normalize

Recognition

  • convert to tf.Tensor
  • resize each image to a fixed height using bilinear interpolation using TF
  • pad with zeros
  • batch
  • cast to tf.float32
  • divide by 255
  • normalize
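Sketched below for the detection branch only (the target size and normalization statistics are placeholders), the listed transforms map to TF operations quite directly; the recognition branch would add an aspect-ratio-preserving resize and zero padding before batching.

import tensorflow as tf

def preprocess_detection(images, out_size=(1024, 1024), mean=0.5, std=1.0):
    # images: list of np.uint8 arrays of arbitrary (H, W, 3) shapes
    tensors = [tf.image.resize(tf.convert_to_tensor(img), out_size, method="bilinear")
               for img in images]
    batch = tf.stack(tensors)                    # batch once all images share a shape
    batch = tf.cast(batch, tf.float32) / 255.0   # cast & rescale to [0, 1]
    return (batch - mean) / std                  # normalize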

Let's proceed as follows:

  • Discuss & select the list of transforms
  • Implement the modifications / additions to the existing codebase (#50)

[models] Add detection module

Design a model subpart responsible for localizing regions of interest in the document.

Input

  • Numpy-style encoded images (already read)

Output

  • localization: list of tensors of predictions. Each prediction is of size 5 (xmin, ymin, xmax, ymax, objectness)

Core components
DetectionPredictor is the main class, built from the following components:

  • Preprocessor (#20)
  • DetectionModel (#32)
  • Postprocessor (#24)
  • DetectionPredictor (#39)

[data] Handle different cases of vertical text

We need to keep in mind that we will come across 2 cases of vertical text:

  • "Rotated" vertical text: horizontal text with a +/-90° rotation (rotated letters)

(example image of rotated vertical text)

  • "Truly" vertical text: a text with horizontal letters (unrotated letters), written from top to bottom

(example image of truly vertical text)

[docs] Add documentation building dependencies

Some basic documentation with proper installation and usage instructions would be greatly beneficial to the library. This would be the primary means of communication with non-developer audiences.

Having it built automatically using something similar to Sphinx would be efficient, considering all docstrings will have a compatible format.

[docs] Add performance benchmark for all pretrained models

There are a few tables in the documentation that still need filling:

  • Text detection (#143)
  • Text recognition (#143)
  • End-to-End OCR (#143)
  • Comparison with similar solutions (#149)

We need to select a public dataset for each task and run the evaluation on their respective test sets, then report the results back in these tables.

[documents] PDF & image input page have different dimensions data formats

If we consider that the document analysis output is used for document reconstruction, a problem arises for pages. Simply put, image pages have their dimensions in pixels, while PDFs have theirs in inches/centimeters and hold a DPI parameter.

Two questions have to be tackled:

  • uniformity: should we enforce some uniformity of dimensioning for pages whatever the format?
  • export: either way, which export format do we use to avoid information loss?

Demo app error when analyzing my first document

🐛 Bug

I tried to analyze a PNG and a PDF and got the same error. I tried changing the model, but it didn't change anything.

To Reproduce

Steps to reproduce the behavior:

  1. Upload a PNG
  2. Click on analyze document
KeyError: 0
Traceback:
File "/Users/thibautmorla/opt/anaconda3/lib/python3.8/site-packages/streamlit/script_runner.py", line 337, in _run_script
    exec(code, module.__dict__)
File "/Users/thibautmorla/Downloads/doctr/demo/app.py", line 93, in <module>
    main()
File "/Users/thibautmorla/Downloads/doctr/demo/app.py", line 77, in main
    seg_map = predictor.det_predictor.model(processed_batches[0])[0]

Additional context

First image upload

[documents] improve line detection

An OCR predictor must detect lines, and our current version is too weak:

  • many lines overlap
  • some lines stop in the middle of a dense block whereas other lines are kind of "bridging" between separated blocks.

(example image of faulty line detection)

[models] Add minimum pretrained models

For the library to be used as an end-to-end tool, some assets need to be provided:

  • detection pretrained model (#62)
  • recognition pretrained model (#62)

Only then will we be able to consider this as a potential MVP.

[utils] Add visualization capabilities for independent tasks

Visualization is currently dynamic and end-to-end only, which means that no static version is available, nor is there a visualization option for text detection or text recognition alone. We should discuss and add visualization for the following blocks:

  • Text detection: display bounding boxes of detected items over the image
  • Text recognition: display the label and confidence in a corner of the crop
