How to get the bounding boxes of the extracted entities? (donut, closed)

WeiquanWa commented on September 7, 2024

How to get the bounding boxes of the extracted entities?

Comments (6)

SamSamhuns commented on September 7, 2024 7

@gwkrsrch, could you give us the code that was used to generate the heatmap visualization in Figure 8 of the DONUT paper?

gwkrsrch commented on September 7, 2024 3

Hi, thanks to @logan-markewich for the helpful comment :)

Donut does not require any bounding box annotation or supervision during training. As a result, there are no actual boxes in the model output. Instead, you can get a cross-attention heatmap that could be used for your purpose; see also Figure 8 of https://arxiv.org/abs/2111.15664. The related code line is at:

https://github.com/clovaai/donut/blob/1.0.5/donut/model.py#L492

You may convert the heatmap to bounding boxes. The following link might be useful to you:

https://stackoverflow.com/a/58421765
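
A minimal sketch of such a conversion, assuming the heatmap is already available as a 2D array: keep the cells above a fraction of the maximum and take the extent of what remains. This is a simple stand-in and is not taken from the linked answer; `heatmap_to_box` and the 0.5 threshold are illustrative choices, not part of donut.

```python
import numpy as np

def heatmap_to_box(heatmap, rel_threshold=0.5):
    """Return an inclusive (x_min, y_min, x_max, y_max) box around the cells
    whose value is at least rel_threshold * max, or None for an empty map."""
    heatmap = np.asarray(heatmap, dtype=float)
    peak = heatmap.max()
    if peak <= 0:
        return None  # nothing attended to
    ys, xs = np.nonzero(heatmap >= rel_threshold * peak)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

# Toy heatmap: one hot 2x3 region plus weak noise below the threshold.
demo = np.zeros((6, 8))
demo[2:4, 3:6] = 1.0
demo[0, 0] = 0.2
print(heatmap_to_box(demo))  # (3, 2, 5, 3)
```

This yields a single box per heatmap. If the thresholded map contains several separate blobs, a connected-component or contour step would be needed to split them.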

Hope this helps. Please let me know if you are still confused.

SamSamhuns commented on September 7, 2024 2

@WeiquanWa, did you manage to get some semblance of bounding boxes or the cross-attention heatmap from the outputs?

I cannot interpret the structure of the output attention maps from "cross_attentions": decoder_output.cross_attentions.

I see it is a tuple of tuples with the outer length being equal to the number of tokens (len(decoder_output.sequences)) but there are 4 sub-tuples inside each of shape torch.Size([1, 16, 1, 1200]). Not sure how to get representative heatmaps from these tensors.
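
One possible reading of that structure, offered as a sketch rather than documented behaviour: the outer tuple has one entry per generated token, the inner tuple one entry per decoder layer, and each tensor is laid out as (batch, heads, query_len, num_patches). The 1200 would then be the flattened encoder feature grid. The 40 x 30 grid below is an assumption (for example a 1280 x 960 input downsampled by 32), and `token_heatmap` is a hypothetical helper. Random NumPy arrays stand in for the real tensors, which would first need `.cpu().numpy()`.

```python
import numpy as np

def token_heatmap(cross_attentions, token_idx, grid_hw=(40, 30), layer=-1):
    """Collapse one token's cross-attention into a 2D map over encoder patches.

    cross_attentions: sequence (per token) of sequences (per layer) of arrays
    shaped (batch, heads, query_len, num_patches).
    grid_hw is an ASSUMPTION and must multiply out to num_patches.
    """
    attn = np.asarray(cross_attentions[token_idx][layer])
    attn = attn[0, :, -1, :]   # first batch item, last query position -> (heads, patches)
    attn = attn.mean(axis=0)   # average over attention heads -> (patches,)
    h, w = grid_hw
    if attn.size != h * w:
        raise ValueError(f"grid {grid_hw} does not match {attn.size} patches")
    return attn.reshape(h, w)

# Synthetic stand-in with the shapes described above: 5 tokens, 4 layers.
rng = np.random.default_rng(0)
fake = tuple(
    tuple(rng.random((1, 16, 1, 1200)) for _ in range(4)) for _ in range(5)
)
heatmap = token_heatmap(fake, token_idx=2)
print(heatmap.shape)  # (40, 30)
```

Averaging over heads and using only the last layer are arbitrary choices here; taking the maximum over heads or averaging across layers are equally plausible variants.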

logan-markewich commented on September 7, 2024

As far as I know, there aren't actually any bounding boxes. The image is encoded into features, but not actual boxes.

If you need boxes, you are better off using traditional OCR + modelling (LayoutLMv2/v3 are great options for this approach).

leitouran commented on September 7, 2024

I second this question. I assume these attention masks are to be translated to the prior stage (before they were encoded) to be able to match with the actual image shape, but I can't seem to figure out how to do this.
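
One way to sketch that mapping, assuming the attention is laid out on a regular patch grid over the model's input image: scale grid cells by the ratio of image size to grid size. `grid_box_to_pixels` is a hypothetical helper, and the 40 x 30 grid and 1280 x 960 input are assumed values. Any resizing or padding applied during preprocessing would have to be undone separately.

```python
def grid_box_to_pixels(box, grid_hw, image_hw):
    """Map an inclusive (x_min, y_min, x_max, y_max) box given in grid cells
    to pixel coordinates on an image of size image_hw = (height, width)."""
    x0, y0, x1, y1 = box
    grid_h, grid_w = grid_hw
    img_h, img_w = image_hw
    sx, sy = img_w / grid_w, img_h / grid_h  # pixels per grid cell
    # +1 on the max side so the box covers the whole of the last cell.
    return (round(x0 * sx), round(y0 * sy),
            round((x1 + 1) * sx), round((y1 + 1) * sy))

print(grid_box_to_pixels((3, 2, 5, 3), grid_hw=(40, 30), image_hw=(1280, 960)))
# (96, 64, 192, 128)
```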

As suggested in #31, I think Donut would benefit greatly from returning bounding boxes, which would allow further post-processing and output validation using fuzzy matching against OCR results.

SamSamhuns commented on September 7, 2024

I've found updates at #45
