SmartDOC-MOC Normalization Tools

The programs included in this project form the basis for checking and normalizing participants' results for Challenge 2 ("Mobile OCR Challenge") of the SmartDOC competition at ICDAR 2015.

The official website for the competition is at http://l3i.univ-larochelle.fr/icdar2015smartdoc.

Summary for competition participants

As a participant, the only file you need is check.py: it verifies that your results do not contain illegal characters.

To check one of your files, simply run:

python check.py /path/to/some/result.txt

Example with a good file:

$ python check.py test/input-dosEOL-utf8.txt 
check     INFO   : Input file contains only legal characters. Great!

Example with a bad file:

$ python check.py test/extra_chars.txt 
check     ERROR  : Got 13 illegal character(s) in line 215 : 
check     ERROR  :  l:215 c:003 LATIN SMALL LETTER DOTLESS I
check     ERROR  :  l:215 c:023 CARON
check     ERROR  :  l:215 c:025 BREVE
check     ERROR  :  l:215 c:027 DOT ABOVE
check     ERROR  :  l:215 c:029 RING ABOVE
check     ERROR  :   ... and 8 other(s).
[...]
check     ERROR  : --------------------------------------------------------------------------------
check     ERROR  : Input file contains 91 illegal characters.
check     ERROR  : Please review previous error messages and fix them before submitting your results.
check     ERROR  : --------------------------------------------------------------------------------
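
The report above pairs a line number (l:) and a column (c:) with the Unicode name of each offending character. The following is a minimal sketch of that kind of scan, written for Python 3; it is not taken from check.py. The ALLOWED set and the find_illegal helper are hypothetical, and counting lines and columns from 1 is an assumption. The actual legal character set is the one documented in char_mapping.ods.

```python
import string
import unicodedata

# Hypothetical allowed set, for illustration only: printable ASCII.
# The real legal character set is described in char_mapping.ods.
ALLOWED = set(string.printable)


def find_illegal(text, allowed=ALLOWED):
    """Return (line, column, unicode_name) for each character not in `allowed`.

    Lines and columns are counted from 1 (an assumption of this sketch).
    """
    problems = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col_no, char in enumerate(line, start=1):
            if char not in allowed:
                # Some code points have no name; fall back to the code point.
                name = unicodedata.name(char, "UNNAMED U+%04X" % ord(char))
                problems.append((line_no, col_no, name))
    return problems


if __name__ == "__main__":
    sample = "plain ascii\nna\u0131ve"
    for line_no, col_no, name in find_illegal(sample):
        print(" l:%03d c:%03d %s" % (line_no, col_no, name))
```

Looking up names with unicodedata.name makes invisible or look-alike characters (such as a dotless i) easy to spot in a report.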

Requirements:

  • This program requires Python 2 (>= 2.6) and was tested on recent versions of Windows, Linux and Mac OS X.
  • Your files MUST be encoded in UTF-8.
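
If you are unsure whether a result file is valid UTF-8, a quick check along the following lines may help before running check.py. This is a sketch, not part of the package; the utf8_error helper is a hypothetical name, and the code relies only on the standard bytes.decode method.

```python
import sys


def utf8_error(path):
    """Return None if the file decodes as UTF-8, else describe the first error."""
    with open(path, "rb") as handle:
        data = handle.read()
    try:
        data.decode("utf-8")  # strict decoding: raises on any invalid byte
    except UnicodeDecodeError as exc:
        return "byte offset %d: %s" % (exc.start, exc.reason)
    return None


if __name__ == "__main__":
    # Report on every path given on the command line.
    for path in sys.argv[1:]:
        error = utf8_error(path)
        print("%s: %s" % (path, "valid UTF-8" if error is None else error))
```

A file saved in a legacy encoding such as Latin-1 typically fails at the first accented character, and the reported byte offset shows where to look.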

Package content

The current package contains the following programs:

  • check.py: checks text files to ensure they contain only legal characters
  • normalize.py: checks and normalizes participants' results, and will be used before computing OCR accuracy
  • explore.py: gives line-by-line, character-by-character information about the content of a UTF-8 encoded file
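
To give an idea of what character-by-character information can look like, here is a small Python 3 sketch built on the standard unicodedata module. It is an illustration only: the describe helper and the output layout are hypothetical and do not reproduce the actual output of explore.py.

```python
import unicodedata


def describe(text):
    """Yield (line, column, code_point, category, name) for every character.

    Lines and columns are counted from 1 (an assumption of this sketch).
    """
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col_no, char in enumerate(line, start=1):
            yield (
                line_no,
                col_no,
                "U+%04X" % ord(char),
                unicodedata.category(char),  # e.g. "Ll" for a lowercase letter
                unicodedata.name(char, "<unnamed>"),  # control chars have no name
            )


if __name__ == "__main__":
    for row in describe("caf\u00e9"):
        print("l:%03d c:%03d %s %s %s" % row)
```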

It also contains several documents:

  • LICENCE: GPL-v3 license details
  • README.md: this file
  • char_mapping.ods: spreadsheet file (LibreOffice Calc format) containing details about the allowed character set and the normalization performed

Installation

To use the programs, you will need Python 2 (>= 2.6). They were tested on recent versions of Windows, Linux and Mac OS X.

Then, simply check out or download the programs you need, and call them from the command line:

# Check whether a result is valid
python check.py /path/to/some/result.txt

# Perform the same normalization as the organizers
python normalize.py /path/to/some/result.txt /path/to/normalized/output.txt

# Review the Unicode content of a file
python explore.py /path/to/some/result.txt

You can review the command line syntax with the -h option for all programs.

Design choices

We chose to implement this solution as independent Python 2 scripts for several reasons:

  • Portability: Python 2 is widely available on many platforms
  • Simplicity: No compilation required, works from any directory, no configuration
  • Robustness: Python has excellent Unicode support
  • Openness: Participants can review, reuse and improve our methods

Licenses

All programs (check.py, explore.py, and normalize.py) are licensed under the GPL v3 license. We recommend you check http://choosealicense.com/licenses/gpl-3.0/ for an uncomplicated explanation of what is required, permitted and forbidden when you redistribute these programs.

The character mapping table (char_mapping.ods) is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

Contacting the authors

Please check the official competition website at http://l3i.univ-larochelle.fr/icdar2015smartdoc to contact the authors.
