
deep_reader's Introduction

This code assesses a deep neural network's ability to read text from images with an end-to-end, fully differentiable architecture. That is: no bounding boxes, no preprocessing, just an image and a neural net.

Details:

  1. A text of up to 4 characters is generated and placed into a 40x40 grayscale image. Examples are shown below:

[Images: sample image 0, sample image 1, sample image 2]
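As a rough illustration of step 1, here is a minimal sketch of how a sample's text, target sequence and position could be drawn. It does not render glyphs and is not taken from main.py. The character set, the terminal symbol index, the margin and the glyph size are all assumptions, since the README gives only the maximum length and the image size.

```python
import random
import string

MAX_LEN = 4                        # from the README: up to 4 characters
IMG_SIZE = 40                      # from the README: 40x40 image
ALPHABET = string.ascii_lowercase  # assumption: the character set is not stated
END = len(ALPHABET)                # assumption: terminal symbol index
MARGIN = 4                         # assumption: the README says only "some margin"
GLYPH_W, GLYPH_H = 6, 10           # assumption: hypothetical glyph size in pixels


def random_sample(rng):
    """Draw a random text, its padded target sequence and a top-left position."""
    length = rng.randint(1, MAX_LEN)  # assumption: at least one character
    text = "".join(rng.choice(ALPHABET) for _ in range(length))

    # Character indices followed by the terminal symbol, padded with the
    # terminal symbol to a fixed length of MAX_LEN + 1.
    target = [ALPHABET.index(c) for c in text] + [END] * (MAX_LEN + 1 - length)

    # Pick the top-left corner so that the text box stays inside the margin.
    box_w = length * GLYPH_W
    x = rng.randint(MARGIN, IMG_SIZE - MARGIN - box_w)
    y = rng.randint(MARGIN, IMG_SIZE - MARGIN - GLYPH_H)
    return text, target, (x, y)


if __name__ == "__main__":
    print(random_sample(random.Random(0)))
```

Padding every target to the same length keeps batches rectangular; the positions after the first terminal symbol carry no information.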

  2. Training is performed with batches of 32 images. The architecture of the network includes a recurrent LSTM and attention, as shown in the sketch below:

[Image: architecture of the net]

At first, the attention mask is set to 1 everywhere. It is multiplied with the input image, and the result goes through 3 convolutions and then into the LSTM (which lets the network remember where the last character was). The output of the LSTM is processed by two separate fully connected layers: one generates a new attention mask, the other generates a classification.
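The loop described above could look roughly like the following PyTorch sketch. It is not the code from main.py: the class name, channel counts, pooling, hidden size, number of classes, number of steps and the sigmoid on the mask are all assumptions chosen only to make the example run on 40x40 inputs.

```python
import torch
import torch.nn as nn


class AttentionReader(nn.Module):
    """Hypothetical sketch: mask -> 3 convolutions -> LSTM -> (new mask, class)."""

    def __init__(self, n_classes=27, img_size=40, hidden=128):
        super().__init__()
        self.img_size = img_size
        self.hidden = hidden
        # Three convolutions; widths and pooling are assumptions.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 40 -> 20
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 20 -> 10
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 10 -> 5
        )
        features = 32 * (img_size // 8) ** 2
        self.lstm = nn.LSTMCell(features, hidden)
        # Two separate fully connected heads on the LSTM output.
        self.to_mask = nn.Linear(hidden, img_size * img_size)
        self.to_class = nn.Linear(hidden, n_classes)

    def forward(self, image, steps=5):
        batch = image.size(0)
        mask = torch.ones_like(image)  # the first mask is all ones
        h = image.new_zeros(batch, self.hidden)
        c = image.new_zeros(batch, self.hidden)
        logits, masks = [], []
        for _ in range(steps):
            masks.append(mask)
            x = self.conv(image * mask).reshape(batch, -1)
            h, c = self.lstm(x, (h, c))
            logits.append(self.to_class(h))
            # Assumption: a sigmoid keeps the next mask in [0, 1].
            mask = torch.sigmoid(self.to_mask(h)).reshape(
                batch, 1, self.img_size, self.img_size)
        return torch.stack(logits, dim=1), torch.stack(masks, dim=1)


if __name__ == "__main__":
    model = AttentionReader()
    out, used_masks = model(torch.rand(32, 1, 40, 40))
    print(out.shape, used_masks.shape)
```

Because the mask is produced by a differentiable layer and applied by multiplication, gradients from the classification loss flow back into the mask head, which is what makes the whole pipeline trainable end to end.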

Training uses a cross-entropy loss, and outputs after the terminal symbol are ignored. The program computes accuracy as the ratio of completely correctly transcribed words, not individual letters.
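To make the loss masking and the whole-word accuracy concrete, here is a small pure-Python sketch. It is an illustration, not the repository's implementation: the terminal symbol index and the function names are hypothetical, and it assumes the terminal step itself still counts toward the loss.

```python
import math

END = 26  # assumption: hypothetical index of the terminal symbol


def truncate(seq, end=END):
    """Return the symbols before the first terminal symbol."""
    out = []
    for s in seq:
        if s == end:
            break
        out.append(s)
    return out


def masked_cross_entropy(logits, target, end=END):
    """Mean cross-entropy over the steps up to and including the first
    terminal symbol in the target; later steps are ignored."""
    total, count = 0.0, 0
    for step_logits, t in zip(logits, target):
        m = max(step_logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(v - m) for v in step_logits))
        total += log_z - step_logits[t]
        count += 1
        if t == end:
            break
    return total / count


def word_accuracy(predictions, targets, end=END):
    """Fraction of words whose full transcription matches the target."""
    correct = sum(truncate(p, end) == truncate(t, end)
                  for p, t in zip(predictions, targets))
    return correct / len(targets)


if __name__ == "__main__":
    preds = [[3, 1, END, 7, 7], [3, 2, END, END, END]]
    golds = [[3, 1, END, END, END], [3, 1, END, END, END]]
    print(word_accuracy(preds, golds))
```

Under this metric a single wrong letter makes the whole word count as wrong, so the reported accuracy is stricter than per-character accuracy.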

  3. Performance: The model achieves 87% accuracy (for whole words) in about 10000 epochs (which takes 5 minutes on an NVIDIA Titan Xp).

  4. Discussion: The model is clearly using the attention mask to select one character at a time, as shown in the images below (blue is the mask):

[Images: four mask visualisations]

The text is placed into the image with some margin, hence the edges are not important to the model. An interesting fact is that the model selects a whole column instead of a precise letter location. I'd expect that with larger noise or another image as a background, the model would be forced to select more precisely.

How to run:

You will need Python 3.6, PyTorch 0.4.1 and a GPU. Train with python3.6 main.py, which saves a model file. Explore the results with python3.6 visualize.py, which creates several vis_?.gif files in the img folder.

Contributors: jaromiru
