A fast, offline reverse geocoder in Python (github.com/thampiman)
181 points by jp_sc on March 28, 2015 | 43 comments



If you're interested in making both forward and reverse geocoding better, please consider contributing to a project I started and help maintain called OpenAddresses:

http://openaddresses.io

The goal is to collect address datasets so that forward and reverse geocoding is an easier problem to solve. A contributor wrote an excellent overview of the project the other day:

https://medium.com/colemanm/creating-an-open-database-of-add...


It's a nice overview but it glosses over the fact that while the OA database is governed by a CC0 license, the individual address collections in the database are still governed by their own licenses which can be (much) more restrictive. The fact that you can download the data doesn't mean you can use it the way you want. The OA web site hints at this but doesn't address (ah-hah) that underlying problem. That doesn't mean that OA is not valuable - quite the contrary - but I think the fact that it's presented as one big free and open dataset can be misleading.


As the (dead) sibling comment points out, my (non-lawyer) understanding of US copyright law is that simple collections of facts, when compiled in a way that requires no creativity, do not enjoy any copyright protection at all.

I would be surprised if a simple list of addresses, even a very large one, is something that could be subject to copyright.


Very cool project. Out of curiosity, how do you test the validity of the addresses?


All of the addresses come from "authoritative" datasets (i.e. from local or state governments) so we're assuming them to be correct.


Interesting project!

I assume you don't have a manual procedure to report and correct individual addresses that are in error, and the trick would be to contact the agency that provided the erroneous dataset, is that right?

It would be useful to have a way to find out "where did the data for this location come from, and who do I contact to correct it?"

For example, our house doesn't have an outline on your sample map, although there is a red dot there. Our street address shows up on the house next door, and the neighboring houses in that direction are shifted similarly.

Most other online maps for our location have the same error or similar. I think the confusion happened because we are on a corner and some years ago a previous owner changed our street address from one street to the other. So whatever agency maintains this data goofed something up in the process.

If there were an easy way to find out who this mystery agency is, then I could help them sort out this mixup. :-)


It isn't really user-friendly, but I would say it's fairly easy to find what information they're using by looking at the source files:

https://github.com/openaddresses/openaddresses/tree/master/s...

(In the US the data might come from someone who accepts updates from someone else, but it is usually the state, county, or city.)


You can't really use K-D trees with raw lat/lon coordinates; at least, you can't use Euclidean distance on them for nearest-neighbor search.

First, longitude wraps from -180 to +180 at the antimeridian, so distance calculations fail there. Second, and I'd say more importantly, the length in meters of one degree of longitude shrinks as you move away from the equator, so this library will be heavily biased toward longitudinal neighbors for locations far from the equator.
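A standalone numeric sketch of that second problem (nothing to do with the library's internals): at 60° N, a point 1.5° to the east is nearer on the ground than a point 1° to the north, yet Euclidean distance in degree space ranks them the other way around.

    import math

    EARTH_RADIUS_KM = 6371.0

    def haversine_km(lat1, lon1, lat2, lon2):
        # Great-circle distance in km on a spherical Earth
        dlat = math.radians(lat2 - lat1)
        dlon = math.radians(lon2 - lon1)
        a = (math.sin(dlat / 2) ** 2 +
             math.cos(math.radians(lat1)) * math.cos(math.radians(lat2)) *
             math.sin(dlon / 2) ** 2)
        return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

    # Query point at 60 N, 10 E; candidates 1.0 deg north and 1.5 deg east
    print(math.hypot(61.0 - 60.0, 10.0 - 10.0))  # 1.0 "degrees" to north point
    print(math.hypot(60.0 - 60.0, 11.5 - 10.0))  # 1.5 "degrees" to east point

    print(haversine_km(60.0, 10.0, 61.0, 10.0))  # ~111 km to north point
    print(haversine_km(60.0, 10.0, 60.0, 11.5))  # ~83 km to east point

A degree-space K-D tree would return the northern point as the nearest neighbour, even though the eastern one is about 28 km closer.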


I can recommend the excellent GeographicLib::Geodesic for this, having used it in the past.

http://geographiclib.sourceforge.net

There is a Python implementation available as well: http://pypi.python.org/pypi/geographiclib
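If I remember the Python API right, the inverse (distance) calculation is only a couple of lines; treat this as a sketch and double-check against the docs:

    # pip install geographiclib
    from geographiclib.geodesic import Geodesic

    # Solve the inverse geodesic problem on the WGS84 ellipsoid
    g = Geodesic.WGS84.Inverse(51.507222, -0.1275,    # London
                               37.966667, 23.716667)  # Athens
    print(g['s12'] / 1000.0)  # geodesic distance in km
    print(g['azi1'])          # initial bearing in degrees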


Thanks!


I've fixed this by converting the geodetic coordinates to ECEF. See v1.2 release notes here: https://github.com/thampiman/reverse-geocoder
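For anyone unfamiliar with the trick: in Earth-Centred, Earth-Fixed (x, y, z) space, chord distance ranks points the same way as great-circle distance (exactly so on a sphere), so a K-D tree with plain Euclidean distance works again. The conversion is the standard WGS84 formula; here is a sketch (not necessarily the library's exact code):

    import math

    # WGS84 ellipsoid constants
    A  = 6378137.0           # semi-major axis, metres
    F  = 1 / 298.257223563   # flattening
    E2 = F * (2 - F)         # first eccentricity squared

    def geodetic_to_ecef(lat, lon, h=0.0):
        # lat/lon in degrees, height h in metres above the ellipsoid
        phi, lam = math.radians(lat), math.radians(lon)
        n = A / math.sqrt(1 - E2 * math.sin(phi) ** 2)  # prime vertical radius
        x = (n + h) * math.cos(phi) * math.cos(lam)
        y = (n + h) * math.cos(phi) * math.sin(lam)
        z = (n * (1 - E2) + h) * math.sin(phi)
        return x, y, z

    # The (x, y, z) points can then be fed straight into a K-D tree, e.g.:
    #   from scipy.spatial import cKDTree
    #   tree = cKDTree([geodetic_to_ecef(lat, lon) for lat, lon in cities])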


Good point. I would need to use this distance function instead: http://www.movable-type.co.uk/scripts/latlong.html

On it!


Here's a python function we use to calculate haversine distances:

  import math
  from collections import namedtuple

  def haversine_distance(origin, destination):
      """ Haversine formula to calculate the distance between two lat/long points on a sphere """

      radius = 6371 # FAA approved globe radius in km

      dlat = math.radians(destination.lat-origin.lat)
      dlon = math.radians(destination.lng-origin.lng)
      a = math.sin(dlat/2) * math.sin(dlat/2) + math.cos(math.radians(origin.lat)) \
          * math.cos(math.radians(destination.lat)) * math.sin(dlon/2) * math.sin(dlon/2)
      c = 2 * math.atan2(math.sqrt(a), math.sqrt(1-a))
      d = radius * c

      # Return distance in km
      return int(math.floor(d))

  LatLng = namedtuple('LatLng', 'lat, lng')
  
  origin = LatLng(51.507222, -0.1275) # London
  destination = LatLng(37.966667, 23.716667) # Athens

  print "Distance (km): %d" % haversine_distance(origin, destination)


Kudos for a very well done README (and it's not just cribbed from the original project, it explains the new stuff very well and tells what the project is, and gives credit back). So many projects neglect the README.

One question - is it OK to put an MIT license on something that is based on LGPL code? I don't know enough about how the LGPL works (I do know it is less "infective" than plain GPL).

Well two questions: python2, or python3?


Thanks for that comment!

Good question regarding the license. I'm not too sure about that. I'd appreciate it if someone could shed some light on it.

Regarding the version, I've only tested it on python2. I should add that in the README. Thanks!


If it's based on LGPL code, then the original code needs to retain its license.

Beyond that, everything regarding the licenses looks good if you ask me. MIT is compatible whether you modify the LGPL code or just link against it, so one can use either of the two licenses.


Hi, I'm the author of the original library, which uses the LGPL license. My reason for using the LGPL was so people would be obligated to share their modifications, so I would expect this is not compatible with MIT.


As of now, it does not work on Python 3, but it seems the fixes are not hard: https://github.com/thampiman/reverse-geocoder/issues/2


UPDATE: I've just released v1.2 which supports Python 3. For details: https://github.com/thampiman/reverse-geocoder


Sweet!


While we're on this subject, is there a good, free street-address parser that works for at least the US, Canada, UK, and the major EU countries? I've tried most of the available ones, and they can only parse about 90-95% of business addresses.

(Regular expressions don't work well for this. Neither does starting from the beginning of the address. Proper address parsing starts at the end of the address and works backwards, with the information found near the end, such as country name and postal code, used to disambiguate the information found earlier.)


The most principled approach I've seen on this is at https://github.com/datamade/usaddress. They use tagged training data and conditional random fields. I haven't seen comparisons with other systems, but it's worked well enough for my projects.

Though as the name suggests, it's only trained for US addresses.
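For a flavour of the API (this mirrors the example in their README, as far as I remember; double-check the docs):

    # pip install usaddress
    import usaddress

    # tag() returns an OrderedDict of labelled components plus an address type
    parsed, addr_type = usaddress.tag(
        'Robie House, 5757 South Woodlawn Avenue, Chicago, IL 60637')
    print(addr_type)  # e.g. 'Street Address'
    print(parsed)     # AddressNumber, StreetName, PlaceName, StateName, ...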


Have you tried Nominatim? I've found it to work for most of the above. I'm only disappointed it doesn't work well for developing countries, particularly a lot of Asia.


A very good companion for Geocoder: https://github.com/DenisCarriere/geocoder. Glad to see Python getting more geo libraries for non-GIS users.


Are there any offline Geocoders that work for the whole world, even if not free? Nominatim doesn't work for a lot of Asian addresses.


Very impressive, I'll be looking closer at K-D trees.

I wrote a quick (500k lookups/sec) offline geocoder for Ruby: https://github.com/bronson/geolocal to comply with the silly EU cookie rules. It precompiles the statements you're interested in:

    Geolocal.in_eu?(request.ip)
    Geolocal.in_us?('8.8.8.8')

Glad to see that my lib has a role model if it ever grows up. :)


Awesome, one more thing that can run standalone instead of relying on the Google Maps service.


Thanks! I'm the developer of this library and I hope you find it useful.


Looks really interesting!

Would it be possible to use OpenStreetMap data?

http://planet.openstreetmap.org/


OSM data doesn't contain an easy way to find the top 1000 cities; you'd end up with hundreds of thousands of places. Looking for Wikipedia tags, population (which often comes from Wikipedia), and 'admin' tags might be a good guide.

(I work on an OSM geocoder; not offline, but it has a Python library: http://geocoder.opencagedata.com/)


That is interesting. I haven't looked at that data set. At the moment, this only looks at cities with a population > 1000 obtained from GeoNames.


Nice! Shameless plug for a SQLite no network geocoder that uses (I believe) the same text files to seed everything. https://github.com/NickStefan/no-network-geocoder


On a related note: An efficient geolocation encoder/decoder with error correction using Reed-Solomon. 3m accuracy with error correction in 10 symbols. 20mm accuracy with 5-nines certainty in 15 symbols:

https://github.com/pjkundert/ezpwd-reed-solomon


Starred. We're currently using nominatim + osm data + postgis on our own hosted servers. Can this be a good alternative?


I should think so. I've tried Nominatim with OSM data, but it took forever to query a large set of coordinates. I was only interested in knowing the nearest city and admin regions 1/2, and this library is really fast: ~20s to look up 10M coordinates on my MBP. If, however, you'd like the full address, then this is maybe not a good fit.
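For reference, a lookup is just a couple of lines (see the README for details):

    import reverse_geocoder as rg

    # Single query: nearest GeoNames city plus admin 1/2 regions
    print(rg.search((51.5214588, -0.1729636)))

    # Batch query: pass a list of (lat, lon) tuples
    for r in rg.search([(51.5214588, -0.1729636), (9.936033, 76.259952)]):
        print(r['name'], r['admin1'], r['cc'])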


That's fast. Though yes, our use case involves reverse geocoding of full street-level addresses. Currently we do several tens to hundreds of req/s on Nominatim.

The sibling thread asked about using OSM data; it'd be awesome if street-level OSM data were workable.


For reverse geocoding only, as it does not seem to do forward geocoding. Pelias[1] might be a better alternative once they simplify the install process. I had to "reverse engineer" (i.e., read and understand) their Chef cookbook, as Vagrant was not an option for me. Not that complicated, but time-consuming when you don't know Chef.

[1] - https://github.com/pelias


We are currently using PostGIS + GeoNames (the same DB used by this project) and it works very well. I'm curious to see benchmarks comparing the two.

edit: We are only interested in knowing the nearest big city though
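For context, the nearest-big-city lookup in a setup like this is essentially a single PostGIS KNN query. A sketch (table and column names are illustrative, adjust to your schema):

    import psycopg2

    conn = psycopg2.connect('dbname=geo')
    cur = conn.cursor()

    # The PostGIS KNN operator <-> walks the spatial index to find the
    # nearest GeoNames place above a population cut-off
    cur.execute("""
        SELECT name, admin1, population
          FROM geonames
         WHERE population > 100000
         ORDER BY geom <-> ST_SetSRID(ST_MakePoint(%s, %s), 4326)
         LIMIT 1
    """, (-0.1275, 51.507222))  # note: lon, lat order
    print(cur.fetchone())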


This is great! Does anyone know of a JS version? I'm currently using http://nominatim.openstreetmap.org/reverse in my Node app, but I'd rather not rely on a third party, especially under heavy load.



Thanks for posting this. Would appreciate pull requests and feature requests or simply general feedback if you use this in practice.


This is super cool! Shameless plug: if you're looking for street-level reverse (or forward) geocoding, we offer[1] a super-affordable API and CSV upload tool.

[1] http://geocod.io


Hello, I read a little bit about geocoding on Wikipedia but was hoping to learn more. Is there a good beginner's guide to geocoders/geocoding?



