
docextractor's Introduction

docExtractor

PyTorch implementation of the paper "docExtractor: An off-the-shelf historical document element extraction" (accepted at ICFHR 2020 as an oral presentation).

Check out our paper and webpage for details!


If you find this code useful, don't forget to star the repo ⭐ and cite the paper:

@inproceedings{monnier2020docExtractor,
  title={{docExtractor: An off-the-shelf historical document element extraction}},
  author={Monnier, Tom and Aubry, Mathieu},
  booktitle={ICFHR},
  year={2020},
}

Installation 👷

Prerequisites

Make sure you have Anaconda installed (version 4.7.10 or later; older versions may fail to install the correct dependencies). If not, follow the installation instructions provided at https://docs.anaconda.com/anaconda/install/.

1. Create conda environment

conda env create -f environment.yml
conda activate docExtractor

2. Download resources and models

The following command will download:

  • our trained model
  • SynDoc dataset: 10k generated images with line-level page segmentation ground truth
  • synthetic resources needed to generate SynDoc (manually collected resources and the WikiArt dataset)
  • IlluHisDoc dataset
./download.sh

NB: gdown may occasionally hang; if so, download the files by hand, then unzip and move them to the appropriate folders (see the corresponding scripts).

How to use 🚀

There are three main use cases you may be interested in:

  1. perform element extraction (off-the-shelf using our trained network or a fine-tuned one)
  2. build our segmentation method from scratch
  3. fine-tune network on custom datasets

Demo: in the demo folder, we provide a Jupyter notebook (and its HTML version) detailing a step-by-step pipeline to predict segmentation maps for a given image.


1. Element extraction

CUDA_VISIBLE_DEVICES=gpu_id python src/extractor.py --input_dir inp --output_dir out

Main args

  • -i, --input_dir: directory where images to process are stored (e.g. raw_data/test)
  • -o, --output_dir: directory where extracted elements will be saved (e.g. results/output)
  • -t, --tag: model tag to use for extraction (default is our trained network)
  • -l, --labels: labels to extract (default corresponds to illustration and text labels)
  • -s, --save_annot: whether to save full resolution annotations while extracting

Additional arguments

  • -sb, --straight_bbox: whether to use straight bounding boxes instead of rotated ones to fit connected components
  • -dm, --draw_margin: whether to draw the margins added during extraction (for visual inspection or debugging)

NB: check src/utils/constant.py for the label mapping
NB: the code will automatically run on CPU if no GPU is provided or found
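For example, the following run (on GPU 0) processes the images in raw_data/test and also saves full-resolution annotation maps, using only the flags documented above:

CUDA_VISIBLE_DEVICES=0 python src/extractor.py -i raw_data/test -o results/output -s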

2. Build our segmentation method from scratch

This would result in a model similar to the one located in models/default/model.pkl.

a) Generate SynDoc dataset

You can skip this step if you have already downloaded SynDoc using the script above.

python src/syndoc_generator.py -d dataset_name -n nb_train --merged_labels
  • -d, --dataset_name: name of the resulting synthetic dataset
  • -n, --nb_train: number of training samples to generate (0.1 x nb_train samples are generated for each of the val and test splits)
  • -m, --merged_labels: whether to merge all graphical and textual labels into unique illustration and text labels
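For instance, a command like the following should produce a SynDoc-sized dataset (10k training images, plus roughly 1k each for the val and test splits):

python src/syndoc_generator.py -d syndoc -n 10000 --merged_labels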

b) Train neural network on SynDoc

CUDA_VISIBLE_DEVICES=gpu_id python src/trainer.py --tag tag --config syndoc.yml

3. Fine-tune network on custom datasets

To fine-tune on a custom dataset, you have to:

  1. annotate a dataset pixel-wise: we recommend using the VGG Image Annotator (VIA) and our ViaJson2Image tool
  2. split the dataset into train, val and test, and move it to the datasets folder
  3. create a configs/example.yml config with the corresponding dataset name and a pretrained model name (e.g. default); a sketch is given below
  4. train the segmentation network with python src/trainer.py --tag tag --config example.yml

Then you can perform extraction with the fine-tuned network by specifying the model tag.

NB: the val and test splits can be empty; you just won't get evaluation metrics
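As a starting point, configs/example.yml might look like the sketch below. The key names here are assumptions for illustration only; mirror the repo's configs/syndoc.yml for the exact schema.

# hypothetical sketch -- copy configs/syndoc.yml and adapt it
dataset:
  name: my_dataset   # folder name under datasets/
model:
  name: default      # pretrained model tag to fine-tune from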

Annex tools 🛠️

1. ViaJson2Image - Convert VIA annotations into segmentation map images

python src/via_converter.py --input_dir inp --output_dir out --file via_region_data.json

NB: by default, it converts regions using the illustration label
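For reference, via_region_data.json is expected in VIA's flat export format, roughly like the illustrative entry below (file name, size and coordinates are made up):

{
  "page_001.jpg123456": {
    "filename": "page_001.jpg",
    "size": 123456,
    "regions": [
      {
        "shape_attributes": {
          "name": "polygon",
          "all_points_x": [10, 220, 220, 10],
          "all_points_y": [40, 40, 90, 90]
        },
        "region_attributes": {}
      }
    ],
    "file_attributes": {}
  }
}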

2. SyntheticLineGenerator - Generate a synthetic line dataset for OCR training

python src/synline_generator.py -d dataset_name -n nb_doc

NB: you may want to remove text translation augmentations by modifying synthetic.element.TEXT_FONT_TYPE_RATIO.

3. IIIFDownloader - Download IIIF image data from manifest URLs

python src/iiif_downloader.py -f filename.txt -o output_dir --width W --height H
  • -f, --file: file where each line contains a URL to a JSON manifest
  • -o, --output_dir: directory where downloaded images will be saved
  • --width and --height: image width and height respectively (default is full resolution)

NB: be aware that if both width and height are specified, the aspect ratio won't be kept. Common usage is to specify a fixed height only.
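For example, with a manifests.txt listing one manifest URL per line (the URL below is a placeholder), a fixed-height download would be:

echo "https://example.org/iiif/book1/manifest.json" > manifests.txt
python src/iiif_downloader.py -f manifests.txt -o raw_data/iiif --height 1000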


docextractor's Issues

from line level to word level?

Hi. docExtractor is doing a great job! Any suggestions for getting from the line level to the individual word level? Does this capability perhaps exist already? Or could you make any recommendations -- either for augmenting docExtractor or perhaps something that already exists in Python elsewhere?
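One classical-CV option, independent of docExtractor: binarize each extracted line crop, dilate horizontally so characters merge into word blobs, and take connected-component bounding boxes. A minimal sketch (the gap width is an assumption to tune per corpus):

import cv2
import numpy as np

def word_boxes(line_crop_path, gap=7):
    """Return left-to-right (x, y, w, h) word boxes for a line crop."""
    img = cv2.imread(line_crop_path, cv2.IMREAD_GRAYSCALE)
    # Otsu binarization with ink as foreground
    _, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    # horizontal dilation bridges intra-word gaps but not inter-word ones
    blobs = cv2.dilate(binary, np.ones((1, gap), np.uint8))
    n, _, stats, _ = cv2.connectedComponentsWithStats(blobs)
    return sorted(tuple(stats[i, :4]) for i in range(1, n))  # label 0 is background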

Post-processing step

Hi @monniert

Thank you very much for sharing the code with us.

When running the "tester.py" script, are the results obtained using only the trained model, or using the trained model followed by the post-processing step?

Thank you in advance

Demo website down

Hi!
I read your paper and viewed your video with interest, and I would like to explore using your code for my application - getting layout segmentation from ~100-year-old newspapers. So I downloaded the repo, but while trying to set up the Anaconda environment, I discovered that you are using a number of dependencies that are Linux-specific and not available for Windows. If there are no Windows versions available, I can set up Windows Subsystem for Linux (WSL) and use it that way. But I would really like to see how your code handles some example images of newspaper pages before I go to the trouble of setting up WSL. So I went to your demo website - https://enherit.paris.inria.fr/ - to see if I could use it for this evaluation, but it is down. Could you please set up a new demo website so I can evaluate your repo?
Thanks!

[suggestion] loading the data on the fly

A good suggestion would be an option to load the data on the fly: instead of loading all the images for training/prediction into memory at once, the data would be loaded in portions.
Even though this option might increase processing time, it is surely beneficial when dealing with a huge number of images for training/prediction, ensuring the ability to handle huge datasets while keeping RAM usage bounded.

examples:

--train_on_fly_images 100: 100 images to be loaded into RAM at a single time
--train_on_fly_json 10: 10 complete JSON files to be loaded into RAM at a single time

Note: train_on_fly_json would be used when having multiple .json files for training; a single .json file can contain multiple images.
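A minimal sketch of this lazy-loading pattern with a standard PyTorch Dataset (illustrative only; not docExtractor's actual data pipeline):

from pathlib import Path
from PIL import Image
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms

class LazyImageDataset(Dataset):
    """Opens each image only when its index is requested."""
    def __init__(self, root, transform=transforms.ToTensor()):
        self.paths = sorted(Path(root).glob('*.jpg'))
        self.transform = transform

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        img = Image.open(self.paths[idx]).convert('RGB')  # loaded on demand
        return self.transform(img)

# worker processes stream batches from disk, so RAM holds only a few images at a time
loader = DataLoader(LazyImageDataset('datasets/train'), batch_size=1, num_workers=4)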

conda conflicts

Hi guys,

I really like your work and want to use it in my project. But a big problem I have right now is the conda environment... I tried to set up the environment with your environment.yml but I got a HUGE BUNCH of package conflicts.

Is there anything I'm missing? Or is the environment indeed inconsistent, so that I can just do it with pip while ignoring all the conflicts?

I'm using Anaconda 4.7.10

Thanks for your time!

Greetings from Germany,
Nicole

Training a Text-Line detector and want to create annotations with x-height+ border automatically

@monniert
I trained a text-line detector model; the accuracy seemed high, but when I tested it, the results were very bad. I even tried training at different image sizes, but the results were still not good.
My guess is that the ground truth should not be all in the same color ("cyan"); you might need to choose 2 colors, e.g. first line "cyan", second line "red", third line "cyan", fourth line "red", and so on. This might help in separating close regions.

Possible solution: [image attached]

Example: [image attached: 19699_annotated]

Groundtruth: [image attached: 19699_seg]

Original image: [image attached: 19699]

The process of GT generation

First of all, thank you very much for your work! docExtractor extracts text lines very well out of the box.
But I want to fine-tune the model with custom data and I think my question is related to the process of creating the GT.
As you stated in #10, you recommend adding some border around the annotated text lines. Going through the examples on https://enherit.paris.inria.fr/, it seems that borders are not annotated explicitly.

  • Should borders be annotated (labeled as border) or do you recommend adding them via morphological operations in some kind of a post-process?

Moreover, I'm not quite sure which labels were used when the model was trained. On https://enherit.paris.inria.fr/ the text lines are labeled as text, but the paper states labels like paragraph or table.

  • What labels should be used when using the default model as a pretrained model?

In my case I work with tabular data. Should text inside the cells therefore be labeled as table?
[screenshot attached]

Thank you in advance!

Problem with PolynomialLR

Thank you very much for sharing the code with us. Recently, I inspected and tested all the code that you provided. I noticed that your custom PolynomialLR inside the schedulers package returns the same result as a ConstantLR does. Please let me know if it is a bug. Thank you in advance.
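For comparison, a reference polynomial-decay schedule could look like the sketch below (illustrative, not the repo's code). If get_lr ignores last_epoch, the schedule indeed degenerates to a constant learning rate:

from torch.optim.lr_scheduler import _LRScheduler

class PolynomialLR(_LRScheduler):
    """LR decays as (1 - t / max_iter) ** power, reaching 0 at max_iter."""
    def __init__(self, optimizer, max_iter, power=0.9, last_epoch=-1):
        self.max_iter = max_iter
        self.power = power
        super().__init__(optimizer, last_epoch)

    def get_lr(self):
        t = min(self.last_epoch, self.max_iter)
        factor = (1 - t / self.max_iter) ** self.power
        return [base_lr * factor for base_lr in self.base_lrs]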

[bug] translation.exception.TranslateError: No translation get, you may retry

@monniert The error:

(docExtractor) home@home-lnx:~/programs/docExtractor$ python src/syndoc_generator.py -d testing -n 100 --merged_labels
[2020-11-29 00:49:37] Creating train set...
[2020-11-29 00:49:37]   Generating random document with seed 0...
Traceback (most recent call last):
  File "src/syndoc_generator.py", line 62, in <module>
    gen.run(args.nb_train)
  File "src/syndoc_generator.py", line 46, in run
    d = SyntheticDocument(**kwargs)
  File "/home/home/programs/docExtractor/src/utils/__init__.py", line 74, in wrapper
    return f(*args, **kw)
  File "/home/home/programs/docExtractor/src/synthetic/document.py", line 126, in __init__
    self.elements, self.positions = self._generate_random_layout()
  File "/home/home/programs/docExtractor/src/utils/__init__.py", line 74, in wrapper
    return f(*args, **kw)
  File "/home/home/programs/docExtractor/src/synthetic/document.py", line 238, in _generate_random_layout
    element = choice(self.available_elements, p=weights)(width, height, **element_kwargs)
  File "/home/home/programs/docExtractor/src/synthetic/element.py", line 151, in __init__
    self.generate_content(seed=seed)
  File "/home/home/programs/docExtractor/src/utils/__init__.py", line 74, in wrapper
    return f(*args, **kw)
  File "/home/home/programs/docExtractor/src/synthetic/element.py", line 597, in generate_content
    self.text, content_width, content_height = self.format_text(text)
  File "/home/home/programs/docExtractor/src/synthetic/element.py", line 624, in format_text
    text = google(text, src='en', dst='ar')
  File "/home/home/anaconda3/envs/docExtractor/lib/python3.6/site-packages/translation/__init__.py", line 19, in google
    dst = dst, proxies = proxies)
  File "/home/home/anaconda3/envs/docExtractor/lib/python3.6/site-packages/translation/main.py", line 33, in get
    if r == '': raise TranslateError('No translation get, you may retry')
translation.exception.TranslateError: No translation get, you may retry

Listing the folder's tree:

(docExtractor) home@home-lnx:~/programs/docExtractor$ tree -d
.
├── configs
├── demo
├── models
│   └── default
├── raw_data
│   └── illuhisdoc
│       ├── msd
│       ├── msi
│       ├── mss
│       ├── p
│       └── via_json
├── scripts
├── src
│   ├── datasets
│   ├── loss
│   ├── models
│   ├── optimizers
│   ├── schedulers
│   ├── synthetic
│   │   └── __pycache__
│   └── utils
│       └── __pycache__
└── synthetic_resource
    ├── background
    │   ├── 0
    │   ├── 10
    │   ├── 100
    │   ├── 110
    │   ├── 120
    │   ├── 20
    │   ├── 30
    │   ├── 40
    │   ├── 50
    │   ├── 60
    │   ├── 70
    │   ├── 80
    │   └── 90
    ├── context_background
    ├── drawing
    ├── drawing_background
    ├── font
    │   ├── arabic
    │   │   ├── Amiri
    │   │   ├── Arial
    │   │   ├── Cairo
    │   │   ├── dejavu_dejavu-sans
    │   │   ├── El_Messiri
    │   │   ├── gnu-freefont_freeserif
    │   │   └── st-gigafont-typefaces_code2003
    │   ├── chinese
    │   │   ├── Liu_Jian_Mao_Cao
    │   │   ├── Long_Cang
    │   │   ├── Ma_Shan_Zheng
    │   │   ├── Noto_Sans_SC
    │   │   ├── Noto_Serif_SC
    │   │   ├── ZCOOL_KuaiLe
    │   │   ├── ZCOOL_QingKe_HuangYou
    │   │   ├── ZCOOL_XiaoWei
    │   │   └── Zhi_Mang_Xing
    │   ├── foreign_like
    │   │   ├── alhambra
    │   │   ├── barmee_afarat-ibn-blady
    │   │   ├── bizancia
    │   │   ├── catharsis_bedouin
    │   │   ├── catharsis_catharsis-bedouin
    │   │   ├── k22_timbuctu
    │   │   ├── kingthings_conundrum
    │   │   ├── meifen
    │   │   ├── ming_imperial
    │   │   ├── running_smobble
    │   │   ├── samarkan
    │   │   ├── selamet_lebaran
    │   │   ├── uddi-uddi_running-smobble
    │   │   ├── yozakura
    │   │   └── zilap_oriental
    │   ├── handwritten
    │   │   ├── Alako
    │   │   ├── Angelina
    │   │   ├── anke-print
    │   │   ├── atlandsketches-bb
    │   │   ├── bathilda
    │   │   ├── BlackJack_Regular
    │   │   ├── blzee
    │   │   ├── bromello
    │   │   ├── calligravity
    │   │   ├── Carefree
    │   │   ├── conformity
    │   │   ├── Cursive_standard
    │   │   ├── Damion
    │   │   ├── Elegant
    │   │   ├── emizfont
    │   │   ├── hoffmanhand
    │   │   ├── honey_script
    │   │   ├── hurryup
    │   │   ├── irezumi
    │   │   ├── james-tan-dinawanao
    │   │   ├── JaneAusten
    │   │   ├── Jellyka_-_Love_and_Passion
    │   │   ├── jr-hand
    │   │   ├── Juergen
    │   │   ├── khand
    │   │   ├── kosal-says-hy
    │   │   ├── Learning_Curve
    │   │   ├── Learning_Curve_Pro
    │   │   ├── maddison_signature
    │   │   ├── may-queen
    │   │   ├── mistis-fonts_october-twilight
    │   │   ├── mistis-fonts_stylish-calligraphy-demo
    │   │   ├── mistis-fonts_watermelon-script-demo
    │   │   ├── Monika
    │   │   ├── mumsies
    │   │   ├── nymphont_xiomara
    │   │   ├── otto
    │   │   │   └── Otto
    │   │   ├── Pacifico
    │   │   ├── paul_signature
    │   │   ├── pecita
    │   │   ├── popsies
    │   │   ├── quigleywiggly
    │   │   ├── rabiohead
    │   │   ├── roddy
    │   │   ├── Saginaw
    │   │   ├── Saginaw 2
    │   │   ├── santos-dumont
    │   │   ├── scribble
    │   │   ├── scriptina
    │   │   ├── sf-burlington-script
    │   │   │   └── TrueType
    │   │   ├── sf-foxboro-script
    │   │   │   └── TrueType
    │   │   ├── shadows-into-light
    │   │   ├── shartoll-light
    │   │   ├── shelter-me
    │   │   │   └── kimberly-geswein_shelter-me
    │   │   ├── shorelines_script
    │   │   ├── signerica
    │   │   ├── sild
    │   │   ├── silent-fighter
    │   │   │   └── Silent Fighter
    │   │   ├── sillii_willinn
    │   │   ├── silverline-script-demo
    │   │   ├── simple-signature
    │   │   ├── snake
    │   │   │   └── Snake
    │   │   ├── somes-style
    │   │   ├── sophia-bella-demo
    │   │   ├── spitter
    │   │   │   └── Spitter
    │   │   ├── stalemate
    │   │   ├── standard-pilot-demo
    │   │   │   └── standard pilot demo
    │   │   ├── stingray
    │   │   │   └── Stingray
    │   │   ├── stylish-marker
    │   │   ├── Sudestada
    │   │   ├── sunshine-in-my-soul
    │   │   │   └── kimberly-geswein_sunshine-in-my-soul
    │   │   ├── sweet-lady
    │   │   ├── Tabitha
    │   │   ├── the-girl-next-door
    │   │   ├── the-great-escape
    │   │   │   └── kimberly-geswein_the-great-escape
    │   │   ├── the-illusion-of-beauty
    │   │   ├── theodista-decally
    │   │   ├── the-only-exception
    │   │   │   └── kimberly-geswein_the-only-exception
    │   │   ├── the-queenthine
    │   │   │   └── The Queenthine demo
    │   │   ├── the_wave
    │   │   ├── think-dreams
    │   │   ├── toubibdemo
    │   │   ├── turkeyface
    │   │   ├── typhoon-type-suthi-srisopha_sweet-hipster
    │   │   ├── undercut
    │   │   ├── variane-script
    │   │   ├── velocity-demo
    │   │   ├── vengeance
    │   │   ├── victorisa
    │   │   ├── waiting-for-the-sunrise
    │   │   ├── watasyina
    │   │   ├── westbury-signature-demo-version
    │   │   │   └── Westbury-Signature-Demo-Version
    │   │   ├── white_angelica
    │   │   ├── wiegel-kurrent
    │   │   ├── wiegel-latein
    │   │   ├── Windsong
    │   │   ├── winkdeep
    │   │   ├── wolgast-two
    │   │   ├── wonder_bay
    │   │   ├── written-on-his-hands
    │   │   │   └── kimberly-geswein_written-on-his-hands
    │   │   ├── you-wont-bring-me-down
    │   │   └── zeyada
    │   └── normal
    │       ├── alexey-kryukov_theano
    │       ├── daniel-johnson_didact-gothic
    │       ├── david-perry_cardo
    │       ├── dejavu_dejavu-sans
    │       ├── dejavu_dejavu-serif
    │       ├── ek-type_ek-mukta
    │       ├── georg-duffner_eb-garamond
    │       ├── gnu-freefont_freemono
    │       ├── gnu-freefont_freesans
    │       ├── gnu-freefont_freeserif
    │       ├── google_noto-sans
    │       ├── google_noto-serif
    │       ├── google_roboto
    │       ├── gust-e-foundry_texgyreschola
    │       ├── gust-e-foundry_texgyretermes
    │       ├── james-kass_code2000
    │       ├── kineticplasma-fonts_din-kursivschrift
    │       ├── kineticplasma-fonts_falling-sky
    │       ├── kineticplasma-fonts_mechanical
    │       ├── kineticplasma-fonts_trueno
    │       ├── linux-libertine_linux-libertine
    │       ├── m-fonts_m-2p
    │       ├── nymphont_aver
    │       ├── red-hat-inc_liberation-sans
    │       ├── sil-international_charis-sil
    │       ├── sil-international_doulos-sil
    │       ├── sil-international_doulos-sil-compact
    │       ├── sil-international_gentium-book-basic
    │       ├── sil-international_gentium-plus
    │       └── st-gigafont-typefaces_code2003
    ├── glyph_font
    │   ├── ababil-script-demo
    │   │   └── MJ Ababil Demo
    │   ├── aldus_regal
    │   ├── aldus_romant
    │   ├── aldus_royal
    │   ├── anglo-text
    │   ├── art-designs-by-sue_fairies-gone-wild
    │   ├── art-designs-by-sue_fairies-gone-wild-plus
    │   ├── camelotcaps
    │   ├── cameoappearance
    │   ├── character_cherubic-initials
    │   ├── character_masselleam
    │   ├── character_romantique-initials
    │   ├── cheap-stealer
    │   │   └── cheap stealer
    │   ├── cheshire-initials
    │   ├── chung-deh-tien-chase-zen_chase-zen-jingletruck-karachi
    │   ├── cloutierfontes_british-museum-1490
    │   ├── colchester
    │   ├── dan-roseman_chaucher
    │   ├── decorated-roman-initials
    │   ├── digital-type-foundry_burton
    │   ├── dominatrix
    │   ├── ds-romantiques
    │   ├── egyptienne-zierinitialien
    │   ├── ehmcke-fraktur-initialen
    │   ├── ehmcke-schwabacher-initialen
    │   ├── elzevier-caps
    │   ├── eva-barabasne-olasz_kahirpersonaluse
    │   ├── extraornamentalno2
    │   ├── fleurcornercaps
    │   ├── flowers-initials
    │   ├── gate-and-lock-co_metalover
    │   ├── gemfonts_gothic-illuminate
    │   ├── genzsch-initials
    │   ├── george-williams_andrade
    │   ├── george-williams_floral-caps-nouveau
    │   ├── george-williams_morris
    │   ├── george-williams_square-caps
    │   ├── germanika-personal-use
    │   │   └── Germanika Personal Use
    │   ├── griffintwo
    │   ├── house-of-lime_fleurcornercaps
    │   ├── house-of-lime_german-caps
    │   ├── house-of-lime_gothic-flourish
    │   ├── house-of-lime_lime-blossom-caps
    │   ├── house-of-lime_limeglorycaps
    │   ├── intellecta-design_centennialscriptfancy-three
    │   ├── intellecta-design_hard-to-read-monograms
    │   ├── intellecta-design_holbeinchildrens
    │   ├── intellecta-design_intellecta-monograms-random-eight
    │   ├── intellecta-design_intellecta-monograms-random-sam
    │   ├── intellecta-design_intellecta-monograms-random-six
    │   ├── intellecta-design_intellecta-monograms-random-two
    │   ├── intellecta-design_jaggard-two
    │   ├── intellecta-design_nardis
    │   ├── jlh-fonts_apex-lake
    │   ├── kaiserzeitgotisch
    │   ├── kanzler
    │   ├── kr-keltic-one
    │   ├── lime-blossom-caps
    │   ├── lord-kyl-mackay_floral-majuscules-11th-c
    │   ├── lord-kyl-mackay_gothic-leaf
    │   ├── lorvad_spatz
    │   ├── manfred-klein_delitschinitialen
    │   ├── manfred-klein_lombardi-caps
    │   ├── manfred-klein_vespasiancaps
    │   ├── manfred-klein_vespasiansflorials
    │   ├── medici-text
    │   ├── medievalalphabet
    │   ├── morris-initialen
    │   ├── napoli-initialen
    │   ├── neugotische-initialen
    │   ├── nouveau-drop-caps
    │   ├── paisleycaps
    │   ├── pamela
    │   ├── panhead
    │   ├── paulus-franck-initialen
    │   ├── pau-the-1st
    │   ├── precious
    │   ├── rediviva
    │   ├── rothenburg-decorative
    │   ├── royal-initialen
    │   ├── rudelsberg
    │   ├── sentinel
    │   ├── sniper
    │   ├── spring
    │   ├── the-black-box_seven-waves-sighs-salome
    │   ├── tulips
    │   ├── typographerwoodcutinitialsone
    │   ├── unger-fraktur-zierbuchstaben
    │   ├── victorian-initials-one
    │   ├── vtks-deja-vu
    │   ├── vtks-focus
    │   ├── vtks-mercearia
    │   ├── vtks-simplex-beauty-2
    │   ├── vtks-sonho
    │   ├── vtks-velhos-tempos
    │   ├── waste-of-paint
    │   ├── west-wind-fonts_exotica
    │   ├── west-wind-fonts_leafy
    │   ├── zallman-caps
    │   └── zamolxis_zamolxisornament
    ├── noise_pattern
    │   ├── border_hole
    │   ├── center_hole
    │   ├── corner_hole
    │   └── phantom_character
    ├── text
    └── wikiart
        ├── Abstract_Expressionism
        ├── Action_painting
        ├── Analytical_Cubism
        ├── Art_Nouveau_Modern
        ├── Baroque
        ├── Color_Field_Painting
        ├── Contemporary_Realism
        ├── Cubism
        ├── Early_Renaissance
        ├── Expressionism
        ├── Fauvism
        ├── High_Renaissance
        ├── Impressionism
        ├── Mannerism_Late_Renaissance
        ├── Minimalism
        ├── Naive_Art_Primitivism
        ├── New_Realism
        ├── Northern_Renaissance
        ├── Pointillism
        ├── Pop_Art
        ├── Post_Impressionism
        ├── Realism
        ├── Rococo
        ├── Romanticism
        ├── Symbolism
        ├── Synthetic_Cubism
        └── Ukiyo_e

362 directories
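A possible stopgap (illustrative, not part of the repository): wrap the translation call from the traceback so generation falls back to the untranslated text instead of crashing. The imports below match those shown in the traceback:

from translation import google
from translation.exception import TranslateError

def safe_translate(text, src='en', dst='ar'):
    """Fall back to the source text when the online service returns nothing."""
    try:
        return google(text, src=src, dst=dst)
    except TranslateError:
        return text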

via_converter.py: generate with borders

@monniert
I generated the ground-truth masks using the via_converter.py script that you included; they are created without borders, since when I used the VIA annotator I was boxing the text lines.
I think it would be better to add an option to generate with borders when using via_converter.py.

bug -- tester.py

Hi @monniert

Thank you very much for sharing the code with us.

When running the "tester.py" script, there is the following bug related to Image.blend (I have checked that pred_img.size == img.size). Is it related to the resize function?

(926, 1280)
(926, 1280)
Traceback (most recent call last):
  File "tester.py", line 122, in <module>
    tester.run()
  File "tester.py", line 66, in run
    self.save_prob_and_seg_maps()
  File "tester.py", line 101, in save_prob_and_seg_maps
    blend_img = Image.blend(img, pred_img, alpha=0.4)
  File "/home/pejc/anaconda2/envs/layout/lib/python3.8/site-packages/PIL/Image.py", line 3011, in blend
    return im1._new(core.blend(im1.im, im2.im, alpha))
ValueError: images do not match

Thank you in advance
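Note that PIL's Image.blend requires the two images to match in both size and mode, so equal sizes alone are not enough; a mode mismatch (e.g. RGB vs. L or RGBA) raises exactly this error. An illustrative guard, not the repo's code:

from PIL import Image

def safe_blend(img, pred_img, alpha=0.4):
    """Align mode and size before blending the prediction over the input."""
    if pred_img.mode != img.mode or pred_img.size != img.size:
        pred_img = pred_img.convert(img.mode).resize(img.size)
    return Image.blend(img, pred_img, alpha)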

can't get wikiart.zip to download

Hi. Thanks for the great tool!

I can't get the wikiart.zip to download. First I was running the script included in the package, but it would always time out. Then I went directly to , but I get a message (in Chrome) that "this site can't be reached."

Is there any other way to obtain this resource?

Thanks in advance.

[donate] include FUNDING.yml to accept donations

This project seems interesting, and in order to ensure the continuity of development and improvement I would suggest adding a FUNDING.yml so that you can accept donations.

My main interest in this project:

  • Simplicity and straightforwardness: directly input .json files and start training/fine-tuning; directly predict and output a .json file.
  • Scalability and efficiency: the training/prediction data might contain millions of images, so instead of preloading all the data during training, it would be reasonable to only load the data on the fly, i.e. only the required number of files.

Do these suggestions align with the goal of this project?
Do you accept donations?

where is the UI?

I'm going through the code and I can't figure out how to run the UI.

I saw that the demo is down and that you are not able to bring it back up (issue #18). So I tried to do it myself, but I only see the code to segment the regions. In the video it seems that there is some sort of UI to navigate the book and the annotated regions. How can I start that program/server?

quality/resolution of image results

Hello again.

The images that result from running docExtractor (as found in the "text" and "illustration" folders) -- should I expect these to be full resolution with respect to the original? Or is some reduction or loss involved?

(I did my own little (if crude) test. Here is a clip resulting from docExtractor:

[image attached: Folio_015v_22]

The file size is 33.7 KB.

Here is a crop I made from the original:

[image attached: Folio_015_linetest]

The file size is 70.4 KB.

I confess I don't know enough about digital imagery, how files are saved, etc., to know whether this demonstrates anything. :) Regardless, my goal is to have crops made using docExtractor that are lossless.)

[bug] KeyError: 'filename'

[attachment: input.zip]

(docExtractor) home@home-lnx:~/programs/docExtractor$ python src/via_converter.py --input_dir input/ --output_dir output/ --file ./test.json 
Traceback (most recent call last):
  File "src/via_converter.py", line 80, in <module>
    conv.run()
  File "src/via_converter.py", line 36, in run
    img = self.convert(annot)
  File "src/via_converter.py", line 41, in convert
    name = annot['filename']
KeyError: 'filename'
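One likely cause (an assumption worth checking): VIA project files saved from the editor nest the per-image annotations under a _via_img_metadata key, whereas the flat export format puts the entries, each with a filename field, at the top level. An illustrative unwrapping step before running via_converter.py:

import json

with open('test.json') as f:
    data = json.load(f)
# handle both the "save project" and the "export annotations" layouts
annots = data.get('_via_img_metadata', data)
with open('test_flat.json', 'w') as f:
    json.dump(annots, f)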

[suggestion] Store datasets and models in a data archive

It would be great if you could store generated models and datasets that you collected in an archive like Zenodo. Zenodo would give the dataset (which requires a bit of metadata) a DOI for easier citation and it provides stronger long-term promises than Google Drive or Dropbox. (IMHO providing Google Drive links for important data looks phishy, although I know many other projects do the same.)

Thanks for your consideration!

Parallel Prediction?

Hi @monniert,

Good day :)

I'm using docExtractor in my project to extract textlines. Problem is, I have a LOT of pages. Therefore I'm planning on using multiprocessing to do it in parallel. Is it already an option in this project? Or do you have any suggestions on improving the efficiency of such extraction for me?

Thanks a lot!

Best,
Nicole
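Since the extractor CLI takes an input directory, one illustrative workaround is to split the pages into shard directories and run one process per GPU:

CUDA_VISIBLE_DEVICES=0 python src/extractor.py -i shards/0 -o results/0 &
CUDA_VISIBLE_DEVICES=1 python src/extractor.py -i shards/1 -o results/1 &
wait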
