
oldnyc's Introduction

Old NYC

Live site: http://www.oldnyc.org

The static content for this site lives in oldnyc/oldnyc.github.io. In particular, you may be interested in the giant JSON data file, which contains all the data served on the site.

This repo contains Python code used to generate the data for the site.

To get going on development:

git clone https://github.com/danvk/oldnyc.git
cd oldnyc
virtualenv env
source env/bin/activate
pip install -r requirements.txt

See nyc/howto.md for more details on how to perform specific tasks.

If you're interested in building your own "Old" site using this code, check out this great writeup on Old Ravenna.


oldnyc's Issues

Run OCR on higher-resolution imagery

Most of the backing text was transcribed from 1349x2048 images. In these, the individual lines have an x-height of ~12–13px.

There is 3x higher-resolution imagery available, at ~3958x6144. This gives an x-height of 35–40px. A model trained on this imagery instead of the lower-resolution version would presumably produce more accurate transcriptions.

Improve page load time

  • Combine & minify JavaScript
  • Try moving the initial Google Maps constructor call earlier
  • Try making the JS data bundle async

Color markers based on the dynamic range of visible dots?

Suggestion from David: once you zoom in far enough, all the markers look roughly the same shade of gray. It would be nice to dynamically change the range based on visible markers. If they're all between 1 and 10 photos, then 10 should be extremely dark.
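A minimal sketch of the rescaling idea, assuming each visible marker is represented only by its photo count. The function name, the opacity range, and the linear scale are all illustrative assumptions, not the site's actual code:

```python
def marker_shades(visible_counts, lightest=0.2, darkest=1.0):
    """Map each visible marker's photo count to a shade in [lightest, darkest].

    The scale is recomputed from the markers currently in view, so the
    largest visible count always gets the darkest shade.
    """
    if not visible_counts:
        return []
    lo, hi = min(visible_counts), max(visible_counts)
    if hi == lo:
        # Every visible marker has the same count; show them all as dark.
        return [darkest] * len(visible_counts)
    span = hi - lo
    return [lightest + (darkest - lightest) * (c - lo) / span
            for c in visible_counts]
```

With visible counts between 1 and 10, the marker with 10 photos gets the darkest shade, whatever the counts are elsewhere on the map.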

Move comments behind a click

With text for most images (#12), there will be more useful information to occupy the space next to expanded images:

screen shot 2015-02-25 at 1 53 09 pm

The Facebook comments UI is a bit awkward there, especially on smaller screens. It should go behind a click, and probably be done using Disqus instead.

Train Ocropus model for more generations

The Ocropus model I'm using for OCR never actually bottomed out:

screen shot 2015-02-23 at 1 27 07 pm

It may keep improving with more training. By the end, though, there were relatively few errors left for it to learn from, so improving it further will require more training data.

This should be easy to generate using the existing model. Correcting transcriptions is far easier than generating them from scratch.

Reclaim top space in the grid view

There's 55px of vertical space at the top of the grid being used for absolutely nothing:

screen shot 2015-04-29 at 4 40 17 pm

If the grid went to the very top of the page, it would free up space for larger images and more metadata, especially on smaller screens.

The "X" becomes an issue if I do this. I might have to put a background on it to make it visible over the grid.

Geocode addresses which no longer exist

Old NYC uses the Google Maps geocoding API to convert cross streets to latitudes and longitudes. This mostly works fine, since the street grid hasn't changed too much in the last 100+ years. But there are situations where it's clearly an anachronism, e.g. Stuy Town:
stuytown

Surely there are photos from 15th Street and Avenue A; it's just that they can't be geocoded.

A workaround would be to implement an NYC grid geocoder, or to knock off some of the more popular intersections by hand.

Lots of warnings

I see many warnings of the form: Received message of type object from http://oldnyc.github.io, expected a string:

screen shot 2015-04-29 at 4 47 10 pm

I believe this has something to do with xdomain. The errors are coming from all.js, which is from Facebook.

Stuytown is missing

Stuytown & Peter Cooper Village are conspicuously devoid of markers:
screen shot 2014-09-16 at 3 41 27 pm

Surely there are images in these locations. But they're at cross-streets that no longer exist. This is a situation where the Google Maps geocoder is clearly an anachronism.

Possible solutions would be hand-coding this area, or building a "pure grid" geocoder for Manhattan which extends the existing grid.
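One way a "pure grid" geocoder could work is to extrapolate linearly from three intersections that still exist. This is a rough sketch under that assumption. `make_grid_geocoder` and the avenue indexing are hypothetical; lettered avenues such as Avenue A would need their own index values. Real Manhattan avenues are not evenly spaced, so the result is only approximate:

```python
def make_grid_geocoder(origin, along_avenues, along_streets):
    """Build a geocoder that extrapolates a regular street grid.

    Each argument is ((avenue_index, street_number), (lat, lon)).
    `along_avenues` must share the origin's street and `along_streets`
    must share its avenue, so each one gives a single grid direction.
    """
    (a0, s0), (lat0, lon0) = origin
    (a1, _), (lat_a, lon_a) = along_avenues
    (_, s1), (lat_s, lon_s) = along_streets

    # Change in (lat, lon) per one avenue and per one street.
    per_avenue = ((lat_a - lat0) / (a1 - a0), (lon_a - lon0) / (a1 - a0))
    per_street = ((lat_s - lat0) / (s1 - s0), (lon_s - lon0) / (s1 - s0))

    def geocode(avenue_index, street_number):
        da = avenue_index - a0
        ds = street_number - s0
        lat = lat0 + da * per_avenue[0] + ds * per_street[0]
        lon = lon0 + da * per_avenue[1] + ds * per_street[1]
        return lat, lon

    return geocode
```

The reference intersections would come from the existing geocoder, using cross streets just outside the area where the grid was removed.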

Remove gibberish lines

Many of these come from attempts to transcribe hand-written text, e.g.:

010009 bin → "S1417S" (720042b)
010002 bin → "D)" (720042b)
01000a bin → "e4" (720042b)
01000f bin → "S1N
7W" (720042b)

Removing these would be no big loss and would make the output look better.
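A simple heuristic filter might look like the sketch below. The rule (a line is kept only if it contains a run of three or more letters that includes a vowel) is an assumption chosen to match the examples above, not a tested classifier:

```python
import re


def looks_like_gibberish(line):
    """Return True if the line contains no plausible word.

    A "plausible word" here is a run of 3+ letters with at least one vowel.
    """
    words = re.findall(r"[A-Za-z]{3,}", line)
    return not any(re.search(r"[aeiouAEIOU]", w) for w in words)


def remove_gibberish(lines):
    """Drop the lines that the heuristic flags as gibberish."""
    return [line for line in lines if not looks_like_gibberish(line)]
```

This rule would also drop short numeric lines such as a bare year, so it would need tuning against real transcriptions before use.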

Detect upside-down images

This obviously breaks OCR:

010008 bin
7w Hupaouc ''soav (iaectg pts MaoV aeN ussE teowloc 'Teeag sntunC

An example is 705124b. The thumbnail can't save me here, unfortunately; it's also rotated 180°.

The first step is to estimate how common this is. But writing a detector for upside-down text should be easy -- generating training data is as simple as flipping good images upside-down!
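Generating the training data could be sketched as follows, assuming each image is a list of pixel rows. The function names and the 0/1 labels are illustrative; a real pipeline would work on image files rather than nested lists:

```python
def rotate_180(image):
    """Rotate an image (a list of pixel rows) by 180 degrees."""
    return [row[::-1] for row in reversed(image)]


def make_training_pairs(upright_images):
    """Return (image, label) pairs: 0 for upright, 1 for upside-down."""
    pairs = []
    for image in upright_images:
        pairs.append((image, 0))
        pairs.append((rotate_180(image), 1))
    return pairs
```

Every known-good image yields one positive and one negative example, so the two classes are balanced by construction.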

Detect wrapped lines

Line breaks are essential for legible OCR. OldNYC currently mirrors the original line breaks from the typewritten text, e.g.

E. 38th Street, east from Third Avenue. At the left
is the GQuaker House (No 201) a renovated (1937) former
tenement. The Third Avenue buildings en the right bear
No's 577 - 5 and 3; the latter being the Pet Shop.
January 7, 1939
Somach Photo Service
New York City Tunnel Authority
CREDIT LINE IMPERATIVE

This looks fine when the window is wide enough, but when it's narrow, additional line breaks have to be inserted, leading to a jagged right edge.

The solution is to "unwrap" the text, i.e. by merging consecutive lines which go most of the way to the right edge:

E. 38th Street, east from Third Avenue. At the left is the GQuaker House (No 201) a renovated (1937) former tenement. The Third Avenue buildings en the right bear No's 577 - 5 and 3; the latter being the Pet Shop.
January 7, 1939
Somach Photo Service
New York City Tunnel Authority
CREDIT LINE IMPERATIVE

This could be done by counting characters per line, or by using the original bounding boxes for each line in the image.
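The character-counting approach might be sketched like this. The 0.85 threshold is a guess, and the extra rule that a line ending in sentence punctuation closes a paragraph is an added assumption: in the example above, the last line of the paragraph is nearly as long as the first, so length alone cannot separate them:

```python
def unwrap(lines, threshold=0.85):
    """Merge consecutive lines that look like one wrapped paragraph.

    A line is joined to the next one when it reaches most of the way to
    the right edge (relative to the longest line) and does not end a
    sentence.
    """
    if not lines:
        return []
    width = max(len(line) for line in lines)
    paragraphs, current = [], []
    for line in lines:
        current.append(line.strip())
        reaches_edge = len(line) >= threshold * width
        ends_sentence = line.rstrip().endswith((".", "!", "?"))
        if not (reaches_edge and not ends_sentence):
            # Short line or end of sentence: close the paragraph here.
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs
```

Using the bounding boxes from the image would avoid both guesses, since the right margin would be known directly.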

Try Leaflet

http://leafletjs.com/

A few upsides I could see:

  • Faster page loads?
  • Faster zoom in/out with lots of pins?
  • Could rotate Manhattan to be vertical
  • Would enable offline development

Add the time slider back

Multiple people have requested this feature from OldSF:
screen shot 2014-09-06 at 3 16 58 pm

Challenges to doing this:

  • Pull in photo date information without bloating the initial page load too much.
  • Make the filtering happen quickly enough to be pleasant.

Log user rotations

Users clicking the "rotate" button will be a great source of data for which images are rotated.

Cluster close points

I did this for OldSF. It's cleaner and reduces page load time & memory footprint. Here's an example:

screen shot 2014-09-06 at 3 43 45 pm

or another at 49th & Lex:
screen shot 2014-09-06 at 3 45 33 pm
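A minimal sketch of one clustering approach, snapping nearby points to a grid cell and merging them. This is not necessarily what OldSF did; the cell size and the `(lat, lon, photo_count)` tuple layout are assumptions:

```python
from collections import defaultdict


def cluster_points(points, cell_size=0.0002):
    """Merge points that fall into the same grid cell.

    `points` is an iterable of (lat, lon, photo_count) with counts >= 1.
    Each cluster is placed at the count-weighted mean of its members.
    """
    cells = defaultdict(list)
    for lat, lon, count in points:
        key = (round(lat / cell_size), round(lon / cell_size))
        cells[key].append((lat, lon, count))

    clusters = []
    for members in cells.values():
        total = sum(count for _, _, count in members)
        lat = sum(la * count for la, _, count in members) / total
        lon = sum(lo * count for _, lo, count in members) / total
        clusters.append((lat, lon, total))
    return clusters
```

Fewer markers means less data to ship and fewer objects for the map to draw, which is where the load-time and memory savings would come from.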

Get alternative recognitions for characters

At some point it may be possible to get these from Ocropus, possibly using my abandoned PR. In cases where a language model indicates that a recognized letter is highly unlikely (e.g. Plsce), a strong alternative letter recognition (e.g. sa) would be helpful.

Add a mechanism to correct OCR

This is a great place to get community help. There should be a link along the lines of "Does this transcription look weird? Improve it!" which takes you to a page with an image of the back of the photograph and a textarea containing the current transcription.

Corrections could get dumped into a Google Spreadsheet for later review.

Cluster of points for "China News"

A bunch of images with titles like "Newspapers - China Daily News Series No. 5." are getting geocoded to different points in Chinatown. I believe Google Maps is interpreting the "5" as a street address.

china news cluster

Style the map

Tone down the map colors (e.g. the water) to emphasize the photos, not the map.

Feedback on clicking a marker should be immediate

Noticed this as an issue while demoing on a spotty WiFi connection. If it takes many seconds to load data for the images, then there's no indication that work is being done. And if the request fails, nothing happens.

Flesh out About page

To do:

  • Discuss OCR work and link to blog posts
  • Incorporate Maira's content about Photographic Views of NYC
  • Call out the data for download
  • Link off to other sites with photographs of NYC

Add a tagline

A tagline would help make it clear what you're looking at. It should go under the "Old NYC" logo.

Host on github pages

GitHub Pages is great for hosting static content.

Hosting OldNYC on GitHub Pages would involve putting all the data in .json files which could be fetched via XHR:

  • One JSON file containing "thick" metadata for each image (e.g. OCR text)
  • One JSON file containing "thin" information for each point (e.g. dimensions of each image)
  • One JSON file containing global information, i.e. {(lat, lon)→number of photos}

This would get me entirely off of AppEngine.
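The global file could be generated along these lines. JSON object keys must be strings, so this sketch encodes each point as a "lat,lon" key; the input schema (dicts with `lat` and `lon` fields) and the function name are assumptions:

```python
import json
from collections import Counter


def build_counts_json(photos, precision=6):
    """Serialize {"lat,lon": number of photos} as a JSON string.

    `photos` is an iterable of dicts with 'lat' and 'lon' keys
    (a hypothetical schema for illustration).
    """
    counts = Counter(
        f"{p['lat']:.{precision}f},{p['lon']:.{precision}f}" for p in photos
    )
    return json.dumps(counts, sort_keys=True)
```

Rounding to a fixed precision keeps photos at the same location under one key even if their coordinates differ in the last floating-point digits.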

Still to do:

  • Drop xdomain
  • Fix links to about page (/about vs. /about.html)
  • Move content from xyz.html to /index.html
  • Swap CNAMEs to serve from GH pages on www
  • Replace http://oldnyc.github.io/ references with /
  • Point /rec_feedback at AppEngine

Apply spell correction to OCR

There are many trivial errors in the transcribed text which could be fixed using some knowledge of English and NYC-area nouns.

Trivial errors fall into a few classes:

  • Problems with dates
    • "JdTy 5, 1930."→"July 5, 1930"
    • "Cune 18, i930"→"June 18, 1930"
    • "Anril 1923"
    • "arch 1925."
    • "April 39"→"April 30"
    • "Augst 27, 1933."
  • Number/letter mixups:
    • "S9th Sttreet"
    • "3O1" (O vs. 0)
    • "186O" (O vs. 0)
    • "196h."→"1964."
    • "8treets"→"Streets"
    • "Sth"→"8th"
    • "18V8"→"1898"
    • "1s"→"is"
    • "193S"→"1935"
  • Wrong repetitions:
    • "SSecond"
    • "Sttreet"
    • "IIn"
    • "IIt"
    • "lIsland"
  • Dumb typos:
    • "Sduare"→"Square"
    • "Suare"→"Square"
    • "autoaobile"→"automobile"
    • "subsecuently"→"subsequently"
    • "New YoNk"→"New York"
    • "nrooklyn"→"Brooklyn"
    • "dorner"→"corner"
    • "fhe"→"the"
    • "Nobth"→"North"
    • "NoNth"→"North"
    • "aultiple"→"multiple"
    • "Tsland"→"Island"
    • "nearlv"→"nearly"
    • "colontal"→"colonial"
    • "wiews"→"views"
    • "Hhdson"→"Hudson"
    • "Paralueling"→"Paralleling"
    • "antioipates"→"anticipates"
    • "3roadway"→"Broadway"
    • "Muthority"→"Authority"
    • "Vest 51st"→"West 51st"
    • "14th Str et"→"14th Street"
    • "Gatholic Ghurch"→"Catholic Church"
    • "Viewa"→"Views"
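Two of these classes lend themselves to simple rules. The sketch below uses fuzzy matching against month names and a letter-to-digit substitution for year-like tokens. The cutoff, the substitution table, and the function names are assumptions; badly mangled words such as "JdTy" fall below this cutoff and would need a different approach:

```python
import difflib
import re

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

# Letters that OCR commonly confuses with digits (an assumed, partial table).
CONFUSABLE = str.maketrans({"O": "0", "S": "5", "i": "1", "l": "1", "I": "1"})


def fix_month(word, cutoff=0.7):
    """Return the closest month name, or the word unchanged if none is close."""
    matches = difflib.get_close_matches(word, MONTHS, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def fix_years(text):
    """Repair four-character tokens that look like years with letter mixups."""
    def repair(match):
        token = match.group(0)
        if sum(ch.isdigit() for ch in token) < 2:
            return token  # Mostly letters: probably a real word.
        fixed = token.translate(CONFUSABLE)
        return fixed if fixed.startswith("1") else token

    return re.sub(r"\b[0-9OSilI]{4}\b", repair, text)
```

`fix_month` is meant for a token already known to sit in a date position, since an ordinary word such as "arch" would otherwise be rewritten to "March".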

Make sure the Like/Tweet buttons work

Currently they don't: hitting "Like" pops up a dialog that sits under the "Popular Photos" panel and runs off the right edge of the screen.

  • Like home screen
  • Share home screen
  • Tweet home screen
  • Like image details
  • Share image details
  • Tweet image details
  • Comment widget
