dhSegment allows you to extract content (segment regions) from different types of documents. See some examples here.
The corresponding paper is now available on arXiv, to be presented as an oral at ICFHR 2018.
It was created by Benoit Seguin and Sofia Ares Oliveira at DHLAB, EPFL.
See INSTALL.md to install the environment and to use the `dh_segment` package.
NB: a good NVIDIA GPU (with at least 6 GB of RAM) is most likely necessary to train your own models. We assume CUDA and cuDNN are installed.
- You need to have your training data in a folder containing an `images` folder and a `labels` folder. The pairs (image, label) need to have the same name (it is not mandatory to have the same file extension, however we recommend saving the label images as `.png` files).
- The annotated images in the `labels` folder are (usually) RGB images with the regions to segment annotated with a specific color.
- The file containing the classes has the format shown below, where each row corresponds to one class (including the 'negative' or 'background' class) and contains the 3 RGB values of that class's color code. Of course, each class needs to have a different code.

```
0 0 0
0 255 0
...
```
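To make the label format concrete, here is a minimal sketch (not part of dhSegment's API; the function names are hypothetical) that parses such a classes file and converts an RGB label image into a map of class indices:

```python
import numpy as np

def parse_classes(path):
    # Each row of the classes file holds the 3 RGB values of one class code
    return np.loadtxt(path, dtype=np.int32).reshape(-1, 3)

def label_image_to_classes(label_img, class_codes):
    # label_img: (H, W, 3) uint8 RGB annotation image
    # Returns an (H, W) array where each pixel holds its class index
    class_map = np.zeros(label_img.shape[:2], dtype=np.int32)
    for idx, code in enumerate(class_codes):
        class_map[np.all(label_img == code, axis=-1)] = idx
    return class_map

# Tiny 2x2 example: background (0 0 0) is class 0, green (0 255 0) is class 1
codes = np.array([[0, 0, 0], [0, 255, 0]])
img = np.array([[[0, 0, 0], [0, 255, 0]],
                [[0, 255, 0], [0, 0, 0]]], dtype=np.uint8)
print(label_image_to_classes(img, codes).tolist())  # [[0, 1], [1, 0]]
```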
The `sacred` package is used to deal with experiments and trainings. Have a look at its documentation to use it properly.
In order to train a model, you should run `python train.py with <config.json>`.
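To give an idea of what such a file contains, here is an illustrative sketch only: the keys below are assumptions about a typical dhSegment configuration, not the actual schema — the authoritative example is `demo/demo_config.json` in the repository.

```json
{
  "train_data": "demo/pages/train",
  "eval_data": "demo/pages/val",
  "classes_file": "demo/pages/classes.txt",
  "model_output_dir": "demo/page_model",
  "training_params": {
    "n_epochs": 30,
    "batch_size": 1,
    "learning_rate": 5e-5
  }
}
```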
This demo shows the usage of dhSegment for page document extraction. It trains a model from scratch (optional) using the READ-BAD dataset and the annotations of pagenet (annotator1 is used). In order to limit memory usage, the images in the dataset we provide have been downsized to have 1M pixels each.
How to
- Get the annotated dataset here, which already contains the `images` and `labels` folders for the training, validation and testing sets. Unzip it into `demo/pages`:

```
cd demo/
wget https://github.com/dhlab-epfl/dhSegment/releases/download/v0.2/pages.zip
unzip pages.zip
cd ..
```
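For reference, after unzipping, the folder layout should look roughly like the sketch below. The subfolder names are an assumption based on the description above (train/validation/test splits, each with `images` and `labels`); check the actual archive contents.

```
demo/pages/
├── train/
│   ├── images/
│   └── labels/
├── val/
│   ├── images/
│   └── labels/
└── test/
    ├── images/
    └── labels/
```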
- (Only needed if training from scratch) Download the pretrained weights for ResNet:

```
cd pretrained_models/
python download_resnet_pretrained_model.py
cd ..
```
- You can train the model from scratch with `python train.py with demo/demo_config.json`, but since this takes quite some time, we recommend you skip this step and download the provided model instead (download and unzip it into `demo/model`):

```
cd demo/
wget https://github.com/dhlab-epfl/dhSegment/releases/download/v0.2/model.zip
unzip model.zip
cd ..
```
- (Only if training from scratch) You can visualize the training progress in TensorBoard by running `tensorboard --logdir .` in the `demo` folder.
- Run `python demo.py`.
- Have a look at the results in `demo/processed_images`.
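The demo's post-processing can be sketched roughly as: threshold the network's page-probability map, keep the largest connected foreground region, and take its bounding box. Below is a minimal NumPy-only illustration of that idea — the real demo uses dhSegment's own post-processing utilities, and the function name here is hypothetical:

```python
import numpy as np

def largest_region_bbox(probs, threshold=0.5):
    # Binarize the probability map, then return the bounding box of the
    # largest 4-connected foreground component (simple flood fill).
    mask = probs > threshold
    h, w = mask.shape
    labels = np.zeros((h, w), dtype=np.int32)
    sizes, current = {}, 0
    for i in range(h):
        for j in range(w):
            if mask[i, j] and labels[i, j] == 0:
                current += 1
                labels[i, j] = current
                stack, count = [(i, j)], 0
                while stack:
                    y, x = stack.pop()
                    count += 1
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and labels[ny, nx] == 0:
                            labels[ny, nx] = current
                            stack.append((ny, nx))
                sizes[current] = count
    if not sizes:
        return None  # nothing above the threshold
    best = max(sizes, key=sizes.get)
    ys, xs = np.where(labels == best)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())  # (x0, y0, x1, y1)

probs = np.zeros((6, 6))
probs[1:4, 1:5] = 0.9   # large "page" region
probs[5, 0] = 0.9       # small noise blob, ignored
print(largest_region_bbox(probs))  # (1, 1, 4, 3)
```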