danvk / oldnyc

Mapping photos of Old New York

License: Apache License 2.0

Python 76.65% HTML 5.32% Shell 0.59% JavaScript 16.45% CSS 0.99%

oldnyc's Introduction

Old NYC

The static content for this site lives in oldnyc/oldnyc.github.io. In particular you may be interested in the giant JSON data file which contains all the data served on the site.

This repo contains Python code used to generate the data for the site.

To get going on development:

git clone https://github.com/danvk/oldnyc.git
cd oldnyc
virtualenv env
source env/bin/activate
pip install -r requirements.txt

See nyc/howto.md for more details on how to perform specific tasks.

If you're interested in building your own "Old" site using this code, check out this great writeup on Old Ravenna.

oldnyc's Issues

Do something sensible on mobile

This could just be showing popular photos, mentioning that OldNYC is better on desktop & linking to the NYPL.

Turn off CloudFlare development mode

"History is out of whack" exceptions

I see this with some frequency while developing. It's unclear what causes them. It seems to be related to reloading the page.

Run OCR on higher-resolution imagery

Most of the backing text was transcribed from 1349x2048 images. In these, the individual lines have an x-height of ~12–13px.

There is 3x higher-resolution imagery available, at ~3958x6144. This gives an x-height of 35–40px. A model trained on this data instead of the lower-resolution version would presumably transcribe more accurately.

Get rid of the pickle file format

It's 2014, I should just use JSON.
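
A one-off conversion could look something like the sketch below. It assumes the pickled object contains only JSON-compatible types (dicts with string keys, lists, strings, numbers); the function name and file paths are hypothetical and not taken from the repo.

```python
import json
import pickle


def pickle_to_json(pickle_path, json_path):
    """Load a pickled object and re-save it as JSON.

    Assumption: the object is JSON-compatible. Tuples are written
    as JSON arrays, so they come back as lists when reloaded.
    """
    with open(pickle_path, "rb") as f:
        data = pickle.load(f)
    with open(json_path, "w") as f:
        # sort_keys keeps the output stable, which makes diffs readable.
        json.dump(data, f, indent=2, sort_keys=True)
    return data
```

Only unpickle files you produced yourself, since loading a pickle can execute arbitrary code.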

Improve page load time

Combine & minify JavaScript
Try moving the initial Google Maps constructor call earlier
Try making the JS data bundle async

Make the map "full bleed"

like OldSF. i.e. get rid of the header bar.

Drop "NO REPRODUCTIONS" line

NYPL is fine with this. It's just noise.

Color markers based on the dynamic range of visible dots?

Suggestion from David: once you zoom in far enough, all the markers look roughly the same shade of gray. It would be nice to dynamically change the range based on visible markers. If they're all between 1 and 10 photos, then 10 should be extremely dark.
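
The rescaling itself is a small piece of arithmetic. Here is a rough sketch in Python, purely to show the mapping (the site's marker code is JavaScript). The function name, the gray-level range, and the linear scale are all assumptions.

```python
def marker_shade(count, visible_counts, lightest=200, darkest=0):
    """Map a photo count to a gray level (0 = black, 255 = white).

    The scale is stretched over the counts of the currently visible
    markers, so the largest visible count is always the darkest shade.
    Assumption: visible_counts is non-empty and the scale is linear.
    """
    lo, hi = min(visible_counts), max(visible_counts)
    if hi == lo:
        # Every visible marker has the same count; use the darkest shade.
        return darkest
    t = (count - lo) / (hi - lo)  # 0.0 at the smallest count, 1.0 at the largest
    return round(lightest + t * (darkest - lightest))
```

With visible counts between 1 and 10, a marker with 10 photos maps to the darkest shade and a marker with 1 photo maps to the lightest. A logarithmic scale might work better if counts are heavily skewed.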

Move comments behind a click

With text for most images (#12), there will be more useful information to occupy the space next to expanded images:

The Facebook comments UI is a bit awkward there, especially on smaller screens. It should go behind a click, and probably be done using Disqus instead.

Make "About" page part of the site

It currently takes you to another tab which is styled differently. It should feel like it's part of the same site.

Train Ocropus model for more generations

The Ocropus model I'm using for OCR never actually bottomed out, so it may keep improving with more training. There were relatively few errors it was learning from by the end. To improve it, it will need more training data.

This should be easy to generate using the existing model. Correcting transcriptions is far easier than generating them from scratch.

Migrate images to production S3 bucket

Migrate from development S3 to NYPL production S3 bucket

Resizing the window closes the expanded image panel

Link to back of photo should go to NYPL digital collections page

Reclaim top space in the grid view

There's 55px of vertical space at the top of the grid being used for absolutely nothing:

If the grid went to the very top of the page, it would free up space for larger images and more metadata, especially on smaller screens.

The "X" becomes an issue if I do this. I might have to put a background on it to make it visible over the grid.

Geocode addresses which no longer exist

Old NYC uses the Google Maps geocoding API to convert cross streets to latitudes and longitudes. This mostly works fine, since the street grid hasn't changed too much in the last 100+ years. But there are situations where it's clearly an anachronism, e.g. Stuy Town:

Surely there are photos from 15th Street and Avenue A; it's just that they can't be geocoded.

A workaround would be to implement an NYC grid geocoder, or to knock off some of the more popular intersections by hand.

Lots of warnings

I see many warnings of the form: Received message of type object from http://oldnyc.github.io, expected a string:

I believe this has something to do with xdomain. The errors are coming from all.js, which is from Facebook.

Stuytown is missing

Stuytown & Peter Cooper Village are conspicuously devoid of markers:

Surely there are images in these locations. But they're at cross-streets that no longer exist. This is a situation where the Google Maps geocoder is clearly an anachronism.

Possible solutions would be hand-coding this area, or building a "pure grid" geocoder for Manhattan which extends the existing grid.

Reduce initial page load time

There's a lot of JS that gets loaded and parsed in <head>. This should all be deferred.

Missing photo of Union Square

Photo 723627f is a nice one of Union Square, but it's not on OldNYC. I wonder why?

Clicking on image thumbnail sometimes returns to map

David says "Right now, when I click one of the images here, it returns me to the map."

I see this sometimes during development, though not consistently.

Remove gibberish lines

Many of these come from attempts to transcribe hand-written text, e.g.:

→ "S1417S" (720042b)
→ "D)" (720042b)
→ "e4" (720042b)
→ "S1N7W" (720042b)

Removing these would be no big loss and would make the output look better.
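
One possible filter is sketched below. The rule is an assumption, not something from the repo: a line counts as gibberish if it is not purely numeric and contains no run of three or more letters. That catches the four examples above, but it would also drop short legitimate lines such as a lone "St" or "NY", so it would need checking against real transcriptions.

```python
import re


def looks_like_gibberish(line):
    """Heuristic (an assumption): no run of 3+ letters and not purely numeric."""
    text = line.strip()
    if not text:
        return False  # keep blank lines; they may separate paragraphs
    if re.fullmatch(r"[\d\s.,/-]+", text):
        return False  # purely numeric lines are probably dates or numbers
    return re.search(r"[A-Za-z]{3,}", text) is None


def remove_gibberish(text):
    """Drop the lines of an OCR transcription that look like gibberish."""
    kept = [line for line in text.split("\n") if not looks_like_gibberish(line)]
    return "\n".join(kept)
```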

Detect upside-down images

This obviously breaks OCR:

→ 7w Hupaouc ''soav (iaectg pts MaoV aeN ussE teowloc 'Teeag sntunC

An example is 705124b. The thumbnail can't save me here, unfortunately; it's also rotated 180°.

The first step is to estimate how common this is. But writing a detector for upside-down text should be easy -- generating training data is as simple as flipping good images upside-down!
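
Generating the training data could be as simple as the sketch below. It assumes each image is a 2D grid of pixel values (a list of rows) that is known to be upright; the function names and the 0/1 labels are hypothetical.

```python
def rotate_180(image):
    """Rotate a 2D grid of pixels by 180 degrees.

    Reversing the row order and then each row is the same as
    turning the image upside-down.
    """
    return [row[::-1] for row in reversed(image)]


def make_training_pairs(upright_images):
    """Return (image, label) pairs: 0 = upright, 1 = upside-down.

    Assumption: every input image is known to be correctly oriented.
    """
    pairs = []
    for image in upright_images:
        pairs.append((image, 0))
        pairs.append((rotate_180(image), 1))
    return pairs
```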

Detect wrapped lines

Line breaks are essential for legible OCR. OldNYC currently mirrors the original line breaks from the type-written text, e.g.

E. 38th Street, east from Third Avenue. At the left
is the GQuaker House (No 201) a renovated (1937) former
tenement. The Third Avenue buildings en the right bear
No's 577 - 5 and 3; the latter being the Pet Shop.
January 7, 1939
Somach Photo Service
New York City Tunnel Authority
CREDIT LINE IMPERATIVE

This looks fine when the window is wide enough, but when it's narrow, additional line breaks have to be inserted, leading to a jagged right edge.

The solution is to "unwrap" the text, i.e. by merging consecutive lines which go most of the way to the right edge:

E. 38th Street, east from Third Avenue. At the left is the GQuaker House (No 201) a renovated (1937) former tenement. The Third Avenue buildings en the right bear No's 577 - 5 and 3; the latter being the Pet Shop.
January 7, 1939
Somach Photo Service
New York City Tunnel Authority
CREDIT LINE IMPERATIVE

This could be done by counting characters per line, or by using the original bounding boxes for each line in the image.

Try Leaflet

http://leafletjs.com/

A few upsides I could see:

Faster page loads?
Faster zoom in/out with lots of pins?
Could rotate Manhattan to be vertical
Would enable offline development

OCR text on the back of the images

e.g. this image. The text is fairly small, too small for off-the-shelf OCR to work very well, but perhaps a custom solution could work. Or we could Mechanical Turk it.

Add the time slider back

Multiple people have requested this feature from OldSF:

Challenges to doing this:

Pull in photo date information without bloating the initial page load too much.
Make the filtering happen quickly enough to be pleasant.

Log user rotations

Users clicking the "rotate" button will be a great source of data for which images are rotated.

Hitting ESC should close the grid view

and take you back to the map. (via jsvine)

Cluster close points

I did this for OldSF. It's cleaner and reduces page load time & memory footprint. Here's an example:

or another at 49th & Lex:

Re-style item view

It currently looks cobbled together, especially on larger screens.

Clicking to the right of the last image in the grid doesn't close it

Picture says it all. Marcus kept running into this during casual usage tonight.

Get alternative recognitions for characters

At some point it may be possible to get these from Ocropus, possibly using my abandoned PR. In cases where a language model indicates that a recognized letter is highly unlikely (e.g. Plsce), a strong alternative letter recognition (e.g. s→a) would be helpful.

Add a mechanism to correct OCR

This is a great place to get community help. There should be a link along the lines of "Does this transcription look weird? Improve it!" which takes you to a page with an image of the back of the photograph and a textarea containing the current transcription.

Corrections could get dumped into a Google Spreadsheet for later review.

Add structured ways for users to give feedback on particular photos

Would be nice for users to be able to specifically say that a photo is:

Rotated
Cut in half
Has a large border
Is actually multiple photos

Cluster of points for "China News"

A bunch of images with titles like "Newspapers - China Daily News Series No. 5." are getting geocoded to different points in Chinatown. I believe Google Maps is interpreting the "5" as a street address.

Style the map

Tone down the map colors (e.g. the water) to emphasize the photos, not the map.

Popular image thumbnails sometimes don't load

This could be fixed by using jquery.appear.js to defer loading until the images are visible, rather than my hacked-up solution.

Hitting arrow to go to the first image on the next row scrolls to the wrong place

Go to http://www.oldnyc.org/#711304f-a (or any last image in a row, depends on your window resolution)
Hit the right arrow (either the key or the on-screen arrow).
The expanded image is scrolled to the wrong spot:

Feedback on clicking a marker should be immediate

Noticed this as an issue while demoing on a spotty WiFi connection. If it takes many seconds to load data for the images, then there's no indication that work is being done. And if the request fails, nothing happens.

When you visit a photo URL directly, the photo should be the first thing to load

e.g. when you go to http://www.oldnyc.org/#722078f-b, the focused image is the very last thing on the page to visibly load. Ideally it would be the first.

Switch to static handler for root URL

See TODO: https://github.com/danvk/oldnyc/blob/master/viewer/app.yaml#L43

Flesh out About page

To do:

Discuss OCR work and link to blog posts
Incorporate Maira's content about Photographic Views of NYC
Call out the data for download
Link off to other sites with photographs of NYC

Add a tagline

to help make it clear what you're looking at. Should go under the "Old NYC" logo.

Include @nypl on Tweet link

Host on github pages

GitHub pages is great for hosting static content.

Hosting OldNYC on GH-pages would involve sticking all the data in .json files which could be XHRed:

One JSON file containing "thick" metadata for each image (e.g. OCR text)
One JSON file containing "thin" information for each point (e.g. dimensions of each image)
One JSON file containing global information, i.e. {(lat, lon)→number of photos}

This would get me entirely off of AppEngine.
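
As a rough illustration of the three-file split, here is a sketch that builds the three payloads from a flat list of photo records. The record fields (`id`, `lat`, `lon`, `width`, `height`, `text`) and the `"lat,lon"` string key are assumptions made for the example, not the site's actual schema.

```python
import json
from collections import defaultdict


def build_bundles(photos):
    """Split photo records into thick, thin and global payloads.

    Assumption: each record is a dict with id, lat, lon, width,
    height and text keys (a hypothetical schema).
    """
    thick = {}                # photo id -> heavy metadata (e.g. OCR text)
    thin = defaultdict(list)  # "lat,lon" -> light per-photo info
    for p in photos:
        # JSON object keys must be strings, so the point becomes "lat,lon".
        key = "%.6f,%.6f" % (p["lat"], p["lon"])
        thick[p["id"]] = {"text": p["text"]}
        thin[key].append({"id": p["id"], "width": p["width"], "height": p["height"]})
    counts = {key: len(items) for key, items in thin.items()}
    return thick, dict(thin), counts


def write_bundles(photos, prefix):
    """Write the three payloads to <prefix>-thick/thin/counts.json."""
    names = ("thick", "thin", "counts")
    for name, payload in zip(names, build_bundles(photos)):
        with open("%s-%s.json" % (prefix, name), "w") as f:
            json.dump(payload, f)
```

Only the small counts file would be needed for the initial map render; the other two could be fetched on demand.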

Still to do:

Drop xdomain
Fix links to about page (/about vs. /about.html)
Move content from xyz.html to /index.html
Swap CNAMEs to serve from GH pages on www
Replace http://oldnyc.github.io/ references with /
Point /rec_feedback at AppEngine

Make sure "Popular Photos" are visible by default

Seems to not be the case right now.

Apply spell correction to OCR

There are many trivial errors in the transcribed text which could be fixed using some knowledge of English and NYC-area nouns.

Trivial errors fall into a few classes:

Problems with dates:
- "JdTy 5, 1930." → "July 5, 1930"
- "Cune 18, i930" → "June 18, 1930"
- "Anril 1923"
- "arch 1925."
- "April 39" → "April 30"
- "Augst 27, 1933."
Number/letter mixups:
- "S9th Sttreet"
- "3O1" (O vs. 0)
- "186O" (O vs 0)
- "196h."→"1964."
- "8treets"→"Streets"
- "Sth"→"8th"
- "18V8"→"1898"
- "1s"→"is"
- "193S"→"1935"
Wrong repetitions:
- "SSecond"
- "Sttreet"
- "IIn"
- "IIt"
- "lIsland"
Dumb typos:
- "Sduare"→"Square"
- "Suare"→"Square"
- "autoaobile"→"automobile"
- "subsecuently"→"subsequently"
- "New YoNk"→"New York"
- "nrooklyn"→"Brooklyn"
- "dorner"→"corner"
- "fhe"→"the"
- "Nobth"→"North"
- "NoNth"→"North"
- "aultiple"→"multiple"
- "Tsland"→"Island"
- "nearlv"→"nearly"
- "colontal"→"colonial"
- "wiews"→"views"
- "Hhdson"→"Hudson"
- "Paralueling"→"Paralleling"
- "antioipates"→"anticipates"
- "3roadway"→"Broadway"
- "Muthority"→"Authority"
- "Vest 51st"→"West 51st"
- "14th Str et"→"14th Street"
- "Gatholic Ghurch"→"Catholic Church"
- "Viewa"→"Views"
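
Many of the "dumb typos" are one character away from a known word, so a fuzzy match against a vocabulary may catch them. The sketch below uses `difflib.get_close_matches` from the Python standard library. The tiny vocabulary and the 0.8 cutoff are assumptions for illustration; a real list of English words and NYC place names would be needed, along with a check for false positives on short words.

```python
import difflib
import re

# Hypothetical vocabulary; a real one would hold English words and NYC nouns.
VOCAB = ["Square", "Brooklyn", "Island", "North", "Street", "Broadway", "Union"]


def correct_word(word, vocab=VOCAB, cutoff=0.8):
    """Return the closest vocabulary word, or the word unchanged."""
    if word in vocab:
        return word
    matches = difflib.get_close_matches(word, vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def correct_text(text, vocab=VOCAB, cutoff=0.8):
    """Apply correct_word to each run of word characters in the text."""
    return re.sub(r"\w+", lambda m: correct_word(m.group(0), vocab, cutoff), text)
```

For example, `correct_text("Union Sduare, nrooklyn")` yields `"Union Square, Brooklyn"`. Date and digit mix-ups such as "196h." would need separate rules, since they are not close to any vocabulary word.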

Make sure the Like/Tweet buttons work

Currently they don't: hitting "Like" pops up a dialog that sits under the "Popular Photos" panel and runs off the right edge of the screen.