
notmnist-to-mnist's Introduction

The notMNIST dataset is an image recognition dataset of font glyphs for the letters A through J, useful with simple neural networks. It is quite similar to the classic MNIST dataset of handwritten digits 0 through 9.

Unfortunately, the notMNIST data is not provided in the same format as the MNIST data, so you can't just swap in the notMNIST data files and run your neural network on them unaltered. This repo solves that problem: each of the four *.gz files here has the same number of entries, in the same data format, as the same-named file from the MNIST dataset. But instead of handwritten digits, the images are letters from A to J (the labels are still 0 through 9). (These files are posted here with permission of the original author of the notMNIST data set.)
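
For reference, the MNIST files use the IDX layout: a big-endian header (two zero bytes, a type code, the number of dimensions, then one 32-bit size per dimension) followed by raw unsigned bytes. A minimal sketch of a reader using only the standard library might look like the following. `read_idx` and `label_to_letter` are hypothetical helper names, not part of this repo, and the mapping 0 -> A, 1 -> B, ... is an assumption about how the labels were assigned.

```python
import gzip
import struct


def read_idx(path):
    """Read a gzipped IDX file and return (dims, data), with data as raw bytes."""
    with gzip.open(path, "rb") as f:
        # Header: 2 zero bytes, 1 type-code byte, 1 byte giving the number of dimensions.
        zero, type_code, ndim = struct.unpack(">HBB", f.read(4))
        if zero != 0 or type_code != 0x08:  # 0x08 means unsigned byte
            raise ValueError("not an unsigned-byte IDX file: %s" % path)
        # One big-endian 32-bit size per dimension.
        dims = struct.unpack(">" + "I" * ndim, f.read(4 * ndim))
        data = f.read()
    expected = 1
    for d in dims:
        expected *= d
    if len(data) != expected:
        raise ValueError("expected %d data bytes, found %d" % (expected, len(data)))
    return dims, data


def label_to_letter(label):
    """Map a label 0..9 to a letter A..J (assumed order: 0 -> A, 1 -> B, ...)."""
    if not 0 <= label <= 9:
        raise ValueError("label out of range: %r" % (label,))
    return "ABCDEFGHIJ"[label]
```

For example, calling `read_idx("data/t10k-labels-idx1-ubyte.gz")` returns a one-element `dims` tuple holding the number of labels, while the images file returns three dimensions (count, rows, columns).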

If you have a neural network that uses MNIST, you should be able to substitute the data files from this repo and run the program without making any changes. Note that the notMNIST dataset is harder and less clean than MNIST. A simple net with two hidden layers that gets 98% accuracy on MNIST gets about 93% to 94% accuracy with these notMNIST files.

The notMNIST dataset is much larger than the MNIST set, so the data files here are a random sample of the notMNIST data. If you want to take a different sample or a larger sample, you can use the Python script in this directory (convert_to_mnist_format.py) to process notMNIST yourself.
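
The idea of drawing a fixed-size random sample per letter and writing it out in MNIST's layout can be sketched roughly as follows. This is not the repo's convert_to_mnist_format.py, whose internals are not described here. `sample_per_class` and `write_idx_pair` are hypothetical names, and the images are assumed to be already decoded into 28x28 grayscale byte strings (784 bytes each).

```python
import random
import struct


def sample_per_class(images_by_label, per_class, seed=0):
    """Pick `per_class` random images for each label; return shuffled (label, image) pairs."""
    rng = random.Random(seed)
    pairs = []
    for label, images in sorted(images_by_label.items()):
        if len(images) < per_class:
            raise ValueError("label %d has only %d images" % (label, len(images)))
        pairs.extend((label, img) for img in rng.sample(images, per_class))
    rng.shuffle(pairs)  # mix the letters so the files are not grouped by label
    return pairs


def write_idx_pair(pairs, labels_path, images_path, rows=28, cols=28):
    """Write labels and images as uncompressed IDX files (gzip them afterwards)."""
    for _, img in pairs:
        if len(img) != rows * cols:
            raise ValueError("image has %d bytes, expected %d" % (len(img), rows * cols))
    with open(labels_path, "wb") as f:
        # Type code 0x08 (unsigned byte), 1 dimension: the entry count.
        f.write(struct.pack(">HBBI", 0, 0x08, 1, len(pairs)))
        f.write(bytes(label for label, _ in pairs))
    with open(images_path, "wb") as f:
        # Type code 0x08 (unsigned byte), 3 dimensions: count, rows, cols.
        f.write(struct.pack(">HBBIII", 0, 0x08, 3, len(pairs), rows, cols))
        for _, img in pairs:
            f.write(img)
```

Using a seeded `random.Random` keeps a given sample reproducible; changing the seed yields a different sample of the same size.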

Instructions:

  1. If you already have an MNIST data/ directory, rename it and create a new one with commands like these:
     mv data data.original_mnist
     mkdir data
  2. Download and unpack the notMNIST data. The files are not particularly large, but unpacking them can take a long time because there are well over 500,000 individual image files.
     curl -o notMNIST_small.tar.gz http://yaroslavvb.com/upload/notMNIST/notMNIST_small.tar.gz
     curl -o notMNIST_large.tar.gz http://yaroslavvb.com/upload/notMNIST/notMNIST_large.tar.gz
     tar xzf notMNIST_small.tar.gz
     tar xzf notMNIST_large.tar.gz
  3. Finally, run the conversion script to write the data as MNIST files in your data/ directory, then compress them:
     python convert_to_mnist_format.py notMNIST_small 1000 data/t10k-labels-idx1-ubyte data/t10k-images-idx3-ubyte
     python convert_to_mnist_format.py notMNIST_large 6000 data/train-labels-idx1-ubyte data/train-images-idx3-ubyte
     gzip data/*ubyte

The first python command above says that the test files should include 1000 entries for each of the 10 letters, and the second says that the training files should include 6000 entries for each of the 10 letters. This matches the size of the MNIST files (10,000 test entries and 60,000 training entries). notMNIST is significantly bigger than MNIST, however, and you can probably use numbers as large as 1800 for the test files and 50000 for the training files.
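
As a quick check of the arithmetic, 10 letters times 1000 entries gives 10,000 test entries, and 10 times 6000 gives 60,000 training entries. A small sketch that compares this against the count stored in an IDX header is shown below. `expected_entries` and `idx_entry_count` are hypothetical helpers, and the example path in the note assumes the files were written to data/ and gzipped as in the instructions.

```python
import gzip
import struct

NUM_LETTERS = 10  # A through J


def expected_entries(per_letter):
    """Total entries when every letter contributes `per_letter` images."""
    return NUM_LETTERS * per_letter


def idx_entry_count(path):
    """Return the first dimension (the entry count) from a gzipped IDX header."""
    with gzip.open(path, "rb") as f:
        # First 4 bytes: magic number; next 4 bytes: number of entries (big-endian).
        _magic, count = struct.unpack(">II", f.read(8))
    return count


assert expected_entries(1000) == 10000  # test set size
assert expected_entries(6000) == 60000  # training set size
```

After running the conversion, a comparison such as `idx_entry_count("data/train-labels-idx1-ubyte.gz") == expected_entries(6000)` can confirm that the generated file has the intended size.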
