Map Vectorizer – Map polygon and feature extractor (github.com)
64 points by linux_devil 1400 days ago | hide | past | web | 19 comments | favorite

I've always wanted the equivalent of * Maps with a time dimension. The data is constantly changing, but imagine if you could push the slider back hundreds of years and see how a city evolved. There is a whole "dark" dimension of data out there that is only captured in print. (e.g. NYC business directories from the 19th/20th century -- being able to see what a particular address used to be) Adding historical map data is the base layer for capturing this data from print.

I'm part of the team at NYPL Labs that's been working on these historical geospatial projects for the past few years (The Vectorizer is the work of our own @MGA), and that's EXACTLY what we've been working toward. For almost 4 years, we've had staff and volunteers going over scans of geo-rectified (stitching and stretching a raster image so it aligns with geospatial coordinates) historical insurance maps of NYC meticulously extracting the information on there. Namely we go after the outlines of buildings to capture the amazingly detailed datapoints these maps had about every building in the city all the way back to the first half of the 19th century.
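For anyone curious about the "geo-rectified" step: at its core it's a coordinate-fitting problem. Here's a toy pure-Python sketch (my own illustration, not the NYPL pipeline, and simplified to a plain affine fit — real warpers use many control points and higher-order or thin-plate transforms) that fits a pixel-to-lon/lat transform from three ground control points:

```python
# Toy sketch of the math behind geo-rectification (NOT the NYPL code):
# fit an affine transform mapping scan pixels to lon/lat from three
# ground control points (GCPs), then convert any pixel with it.

def fit_affine(gcps):
    """gcps: three ((px, py), (lon, lat)) pairs. Solves for the affine
    coefficients so that lon = a*px + b*py + c and lat = d*px + e*py + f."""
    (p0, g0), (p1, g1), (p2, g2) = gcps

    def det3(m):
        return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
              - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
              + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

    def solve3(rows, rhs):
        # Cramer's rule on a 3x3 system: replace each column with the
        # right-hand side and take determinant ratios.
        d = det3(rows)
        out = []
        for col in range(3):
            m = [r[:] for r in rows]
            for i in range(3):
                m[i][col] = rhs[i]
            out.append(det3(m) / d)
        return out

    rows = [[p0[0], p0[1], 1.0], [p1[0], p1[1], 1.0], [p2[0], p2[1], 1.0]]
    abc = solve3(rows, [g0[0], g1[0], g2[0]])   # longitude coefficients
    def_ = solve3(rows, [g0[1], g1[1], g2[1]])  # latitude coefficients
    return abc, def_

def pixel_to_lonlat(transform, px, py):
    (a, b, c), (d, e, f) = transform
    return (a * px + b * py + c, d * px + e * py + f)
```

With three GCPs pinning the scan's corners to known coordinates, every other pixel gets a lon/lat for free; stitching multiple sheets is then just doing this per sheet into a shared coordinate system.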

These are nice to have for researchers, but the real purpose of collecting this is, just as you note, to unlock the hidden historical geospatial data in textual materials. Once we've got all those names of places, their addresses, their lat/lon coordinates, and their timeframes of existence, we can start to search through texts to find linkages. Old city directories (they're basically books of ghosts) start to show you who lived and worked where [1] (and in the process starts to get you more names you can associate with these places), address matches in historical newspapers start to show you what happened in these places, and the maps start to become this geospatial backbone to traverse across tons of different datasets.

The Vectorizer is so freaking cool for so many reasons, but mostly because it's going to let us actually get through these insurance atlases to collect this data before we all die (one of our favorites is the 1854 William Perris Atlas [2][3] but it took nearly 3 years to actually get through the 64,000+ buildings in Manhattan south of 42nd st) so we can start doing this kind of querying with it. The real geniuses behind all this, our Geospatial Librarian Matt Knutzen and the team at Topomancy, have been working on an experimental gazetteer [4] so that we'll finally have this as a public web service for people to hack on all these places as we collect and conflate them. Give us a few months...

In the meantime, sign up for the Open Historical Maps project listserv [5] that some of the OSM crew is working on (including the geniuses at Topomancy).

Also, this came out of a historical geospatial hack day [5] we threw a few months back, which you should check out if you want to play around with some of our data sources for this kind of work or for building something else out of historical NYC's geospatial footprint.

[1]: http://andrewxhill.github.io/cartodb-examples/scroll-story/b...
[2]: http://maps.nypl.org/warper/layers/861 (tileserver, please forgive me for linking to you)
[3]: http://aaronland.info/nypl-perris/ YEAH SHAPEFILES!
[4]: http://vimeopro.com/openstreetmapus/state-of-the-map-us-2013... Schuyler Erle's presentation at State of the Map US 2013 on the historical gazetteer they're building for the Library of Congress
[5]: http://www.nypl.org/blog/2013/07/12/maphack-hacking-nycs-pas...

This sounds awesome. I've spent a lot of time digging around in the records at 30 Chambers and it is really hard to comprehend how much historical stuff is lying around in decaying pages.

Even just something as simple as showing an OSM with all the historical election districts / assembly districts over time for each census/election as map layers would visually convey to someone looking at an address what would otherwise take a decent amount of time to look up.

The NYC Dept. of Records has all the tax lot photos from the 1940s and 1980s canvasses, so you could even build up a historical "street view" for the five boroughs. I've always found it annoying that they keep this data locked up and charge a decent sum for each photo. It's something a decent microfilm scanner could make quick work of, but I don't know if they have plans to liberate all of that image data as part of the open data efforts.

This could be a boon for environmentalists too: figuring out how many trees in a given area are cut down in a given month, zeroing in on places where deforestation is happening rapidly, and publishing that data for governments to look into.

NYPL does a lot of great open source mapping work. They also funded an open source raster map warper that I've forked.


I literally love that map warper! You can upload your own images at mapwarper.net as well.

You LITERALLY built that map warper! ^^ this guy ^^ is one of the partners at Topomancy, the crew who envisaged and built our whole historical geospatial stack with our Geospatial Librarian. And the Map Warper/Digitizer is absolutely amazing.

What a cool project. Kudos to the NYPL for releasing this.

I've been working on something similar recently using OpenCV (though haven't done much on it yet). My use case is to find paths through electrical schematics. I figured my problem was close to map vectorization so I searched around for an existing library in that space but didn't have any luck.

Will be interesting to see how they've approached it and what sort of results I could get using their library.

OpenCV is used here only for the "has dot/cross" aspect of feature detection (and it is very primitive still). The polygons themselves are more a work of R and GDAL.

This is a great idea, but I question the process in the example - shouldn't they be capturing the lines first and then dividing the result into areas, rather than trying to draw the areas directly (the areas being inside the lines, with random-width gaps between them)? Also, a good process for capturing the information will probably vary significantly from map to map.

It does vary from map to map although it is optimized for insurance maps (or any map made mostly with clearly delineated polygons). We've included a config file that you would modify to suit the maps you are working with. The current process takes advantage of preexisting features in mapping tools and adds feature detection (e.g. polygon is/isn't a building) and concave hulls.
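To illustrate the "has dot/cross" idea in spirit - this is a toy pure-Python sketch of my own, not the actual OpenCV code, and the template and threshold here are made up - you can slide a small binary template over a rasterized polygon interior and flag any window that matches:

```python
# Toy template-matching sketch (NOT the Vectorizer's OpenCV code):
# scan a binary grid for a small "cross" marker by sliding a window
# and counting agreeing cells against a hypothetical threshold.

CROSS = [
    [0, 1, 0],
    [1, 1, 1],
    [0, 1, 0],
]

def has_feature(grid, template=CROSS, min_score=1.0):
    """True if any window of `grid` agrees with `template` on at least
    min_score of its cells (1.0 = exact match)."""
    th, tw = len(template), len(template[0])
    gh, gw = len(grid), len(grid[0])
    cells = th * tw
    for y in range(gh - th + 1):
        for x in range(gw - tw + 1):
            agree = sum(
                1
                for dy in range(th)
                for dx in range(tw)
                if grid[y + dy][x + dx] == template[dy][dx]
            )
            if agree / cells >= min_score:
                return True
    return False
```

A config file in this scheme would mostly be tuning `template` shapes and `min_score` per atlas, which is roughly the role the real config plays: the pixel conventions for "this polygon is a building" differ from map series to map series.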

We do welcome input and merge requests to improve the tool!

This is pretty cool. A lot of the manual work on OSM, back when I was with some OSM nerds in DC, ran into the problem of manual imports, and a tool like this would have been very, very useful.

Thank you NYPL!

OSM's been pretty hardcore about its no-import policy because of accuracy and validation issues. They want to map these places themselves and don't want to trust others for accuracy, let alone deal with the legal issues that come with importing data (ugh).

We, however, are far more lenient (mostly because we can't afford to build a time machine to map the past ourselves).

On the other hand, Mike Migurski's Green Means Go [1] project is fantastic for figuring out where batch imports into OSM will be greeted with confetti and parades for filling out parts of the US without enough coverage to warrant anti-import protectionism.

[1]: http://mike.teczno.com/notes/green-means-go.html

The OSM position is not quite as black-and-white as that. Though they did react badly to the TIGER import in the US (with good reason), other areas have imported other types of data at quite large scales, e.g. Danish addresses from government data.


But most of that's importing already digitised info. Relevant to this topic, vectorising from images, there seems to be renewed work on autotracing from satellite imagery coming to the iD editor:


edit: just noticed you linked to that elsewhere in this thread.

I am well aware, but there is a dedicated group of very advanced OSM users using JOSM, QGIS, and manual vetting of data, so it is a semi-automated import.

That being said, thank you for spelling this out for the uninitiated. Some of the emails to the OSM-main and OSM-US lists were hilarious, like people importing entire countries' worth of data. Boy, did people yell at them.




Very very cool. Anybody know of similar projects?

I think I'll give it a whirl on some old '60s fenceline maps I have around.

Mapbox is working on a really slick tool for guided feature extraction from satellite maps [1] that'll be part of the iD editor for OpenStreetMap. I'm hoping this will provide a HUGE help in adding more buildings into OSM.

And while not mapping, John Resig's Ukiyo-e [2] project is one of the coolest projects I've seen in the digital cultural heritage space in some time. He's been applying image recognition to these incredible Japanese woodblock prints from museums, galleries, dealers, universities and libraries all around the world. Because these things are prints, there could be hundreds of prints from the same block master all around the world, but because the expertise in the field is so divergent, the cataloging practices are really inconsistent. Different institutions might call artists by totally different names (or think a print is by totally different artists). So he built a search-by-image search engine of hundreds of thousands of Ukiyo-e that finds and reunifies prints totally independently of their metadata. Which is cool when you find 5-10 of the same print in places around the world. But it's cooler when it matches 2 prints with totally different artists and publishers and dates because at some point after the first print was made, someone bought the block master, cut out the face and replaced it with another, then did the same for the signature [3][4].

Actually, The Vectorizer owes a big debt of gratitude to John and his brother Mike. Mike Resig is a geographer and was the first to show us a process for how this kind of automated identification is possible.

[1]: http://www.mapbox.com/blog/user-friendly-guided-feature-extr... [2]: http://ukiyo-e.org [3]: http://ukiyo-e.org/image/met/DP134583 [4]: http://ukiyo-e.org/image/mfa/sc214530

MapBox is working on a user-assisted vectorizer for satellite images: http://www.mapbox.com/blog/user-friendly-guided-feature-extr...

There are lots of commercial packages. (e.g. The ArcScan extension to ArcGIS: http://resources.arcgis.com/en/help/main/10.1/index.html#//0... )

A naive version is also fairly easy to build up from existing tools (e.g. OpenCV), but getting it to work reliably is often a pain. Even with the best tools, this sort of thing requires a lot of manual QC. It still greatly speeds things up, though!
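For a sense of what "naive" looks like here, this is a pure-Python sketch of my own (hypothetical helper names; real tools like OpenCV's findContours or GDAL's polygonize are far more robust): flood-fill connected foreground regions of a binary raster, then keep each region's boundary pixels.

```python
# Naive raster-to-region sketch (illustrative only): find 4-connected
# regions of 1s with BFS flood fill, then extract each region's
# boundary (pixels touching background or the grid edge).

from collections import deque

def regions(grid):
    """Yield lists of (row, col) pixels, one per 4-connected region of 1s."""
    h, w = len(grid), len(grid[0])
    seen = [[False] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            if grid[r][c] == 1 and not seen[r][c]:
                comp, q = [], deque([(r, c)])
                seen[r][c] = True
                while q:
                    y, x = q.popleft()
                    comp.append((y, x))
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if 0 <= ny < h and 0 <= nx < w and grid[ny][nx] == 1 and not seen[ny][nx]:
                            seen[ny][nx] = True
                            q.append((ny, nx))
                yield comp

def boundary(grid, comp):
    """Pixels of `comp` with at least one background (or off-grid) 4-neighbor."""
    h, w = len(grid), len(grid[0])
    out = []
    for y, x in comp:
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if not (0 <= ny < h and 0 <= nx < w) or grid[ny][nx] == 0:
                out.append((y, x))
                break
    return out
```

Everything hard about real vectorization starts after this point - simplifying the boundary into clean polygon edges, surviving broken linework and scanning noise - which is where the manual QC comes in.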
