Ask HN: what features do you want in a geocoder?
50 points by freyfogle on May 12, 2014 | 66 comments
Hi,

we're building a public-facing geocoding service (forward and reverse) on top of our own technology, OpenStreetMap, and various other open geo services. What features would make such a service compelling for developers? What is your wishlist?

Thanks for taking the time to answer.




Depends on what your target customers are doing!

For a delivery company, inaccurate results can send drivers to the wrong place, so it's important to get the best accuracy available.

You want it to work way out in the country, where buildings are few and far between; and with named buildings as well as numbered ones. So a search like [1] should come back accurate to less than 100m.

Fuzzy/imprecise matching should be used with care. If there's a search for Manor Close in London, it should ask which of the four Manor Closes you mean [2]. If the only part of the address that matches is London, that's not enough information to send a delivery driver - the address should be rejected.

If there are parts of the address you can't match that's sometimes a problem - you don't want to map 1 Hopton Parade, Streatham High Road to 1 Streatham High Road. But you do want to map Some Company Ltd, 1 Streatham High Road to the latter.

On the other hand, if your target users are dating websites wanting to show rough distances between members, just matching city might be plenty accurate enough; property search engines like Zoopla will show any shitty approximation on their maps if they don't recognise a street or postcode.

[1] https://maps.google.co.uk/maps?q=Paradise+Wildlife+Park,+Whi... [2] https://maps.google.co.uk/maps?q=from:E17+5RT+to:NW7+3NG+to:...


A downloadable bulk geocoding service. Some address databases are not licensed for exposure to 3rd parties, but geocoding them is still very valuable.

If you added CASS or an address validation (deliverability) service, it would increase the value even more.

Having recently installed PostGIS and imported loads of TIGER data, it would be useful if you provided some discussion about data sets backing your geocoder, especially if you do better than just TIGER.

I am only interested in U.S./Canada addresses (thus the mentions of CASS/verification) but I understand OSM is a global project.


Thanks for the feedback. If I may ask: there are several providers who supply exactly what you're asking for. Why are you not using them?

To be open with you, the North American market is well served and highly competitive; many other parts of the world are the opposite. Thus the US/CA probably won't be our area of focus. That being said, we're very keen to learn what isn't working for you in current solutions.


I am building a PostGIS/TIGER/PAGC/other data engine to improve address data. It comes in dirty, and I want to improve it as much as possible: normalize, geocode, verify, correct. Correction is basically a pipe dream for bulk operation, but it's needed for point fixes. Everything else is feasible in bulk. I haven't found all the features under one hood yet.

It has to be downloadable because I am not allowed to send my data out to third parties. It has to be free because I have a LOT of data. The North American markets for shipping packages are well served by multiple products in this space; efforts to contextualize/augment poor quality North American address data are not well served.

Hope that helps!


Ability to report inaccurate addresses without just telling the customer to go to OpenStreetMap and edit it.

We have a lot of problems with Google Maps not knowing about odd addresses like 235B Whatever St. or 113-76 XYZ Street. This is perhaps because it's a new construction and the address isn't in the database yet.

Town names in the Hamptons are quite contentious. What the post office knows it as is not what the locals call it, and nothing like what the real estate agents call it (but they make up their own and they lie about the locations).

But primarily you need to parse native-language and messy addresses: 2 Ave, First Street, Fifth Ave, Madison Ave (same as 5th), CPW (Central Park West).

This is where Google wins.


It's funny you mention that, as our main business is the real estate search engine Nestoria (http://www.nestoria.com). We parse about 15M listing addresses in 9 different countries every day (though not the US). We work in pretty chaotic markets like India and Brazil. The world is a very diverse place, but there is one constant - agents do not feel the need to let themselves be bound by the "on the ground" truth of where a listing is.

Regarding Google, you are right, they do a good job, especially in the US. Full credit to them. The problem is the cost and usage restrictions.

Coming back to your initial point about reporting inaccuracies: what would be your preferred way to report problems? Some sort of API you could automate? Would you just tell us there is a problem, or want to also tell us the solution?


If you are using OSM then it would be for you to take the change request and then perform it yourself. Don't plan to do this at scale - just do it so that your client thinks they are getting full service. They wouldn't bother to do it very often.

But the real estate agents do get frustrated when it's a real mapping error.

India - in Chennai they have 2 numbers on each house: old numbering and new numbering. So addresses are shown as 17 : 34

And the rickshaw drivers have never seen a map in their lives. I hold up a Google map on the phone and they stare at it, fascinated, for minutes, but it really has nothing to do with the reality of getting there. Many people don't even know the names of the streets, and they change the names all the time. It's done by corners and what landmarks are there.

Curious - how did you come up with the name Nestoria?


Regarding the name, we did some brainstorming and a member of the team just came up with it, nothing fancier than that. Our requirements were:

- domain available

- written the way it is spoken

- shorter is better

In our old logo we used the "nest" image, but that only makes sense in some languages. Last year we went with a more modern look.


One thing that I make use of but don't see too many services providing is some kind of 'match level' - where the geocoder returns a code indicating how confident it is about the quality of its result. A result of 1 might mean a building-level match, while 100 might mean street level, etc.
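
To make that concrete, a tiny sketch of how a client might act on such a code - the response shape and scale here are made up, purely illustrative:

  // Hypothetical response shape: 1 = building-level match, larger = coarser.
  interface GeocodeResult {
    lat: number;
    lng: number;
    matchLevel: number; // 1 = building, ~100 = street level (made-up scale)
  }

  // e.g. only accept precise matches before dispatching a delivery driver
  function preciseEnoughForDelivery(r: GeocodeResult): boolean {
    return r.matchLevel <= 10;
  }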

IIRC Google's geocoder does something like this, but it's pretty inaccurate, consistently overstating its match level.

As others have said, geocoding is very hard to do well, but I commend the efforts being made with Nominatim and komoot/photon.


Agree, a simple-to-understand confidence score is critical.

Also agree Nominatim and Photon are impressive.


FWIW I ran millions of addresses through PostGIS geocoders (using both TIGER and PAGC's normalizer functions) and found that most addresses geocoded with confidence level 0 or 1: 60% were 0 or 1, and the other 40% were spread across ratings 2-100.

I don't have a great way to characterize the geographic coverage or data quality of the geocoder, but it is clear that it has a data set which must be maintained to support geocoding into the future. Soon I'll have to start figuring out how long my current data is useful, and how long it will be before the next update from the census bureau.

I'm starting to think that it's crazy for so many businesses to need reliable GIS data and have so few sources to go to for it. With the right organizational structure, we could be crowdsourcing it daily.

But I digress.


One thing that our application (processing real-estate data feeds) needs is the ability to figure out an approximate location if the address is a little vague.

For example, when we get a vague address such as "Ichabod Crane Circle", we still want to get a position because that road is very short. However, if the address is something like "Sleepy Hollow Road" or "Murders Kill Road" (Coxsackie sure has some weird names), those are very long roads, and placing a marker anywhere on them would be meaningless.

Google solves this for us by providing, in the results, the bounding box of the result. When the match is not "street_address" but something else such as "route", "premise", "point_of_interest" or the like, what we do is take the bounding box, calculate its area, and use the location only if the area is less than 500x500 meters. It's not optimal, but it's better than having no location at all.
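
Roughly, the heuristic looks like this (the northeast/southwest field names mirror what I recall of Google's bounds object, so treat them as an assumption):

  interface LatLng { lat: number; lng: number; }
  interface Bounds { northeast: LatLng; southwest: LatLng; }

  const METERS_PER_DEG_LAT = 111320; // rough global average

  // Approximate the bounding box area in square meters.
  function boundsAreaSqMeters(b: Bounds): number {
    const heightM = (b.northeast.lat - b.southwest.lat) * METERS_PER_DEG_LAT;
    const midLatRad = ((b.northeast.lat + b.southwest.lat) / 2) * Math.PI / 180;
    const widthM = (b.northeast.lng - b.southwest.lng) * METERS_PER_DEG_LAT * Math.cos(midLatRad);
    return Math.abs(heightM * widthM);
  }

  // Use a non-"street_address" result only if its box is under 500x500 m.
  function usableApproximateLocation(bounds: Bounds): boolean {
    return boundsAreaSqMeters(bounds) <= 500 * 500;
  }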

Another thing that Google does semi-well is constrain the search to a specific area, like a country or a state. Unfortunately, Google doesn't let you pass in more than one state, but other than that, it works well. Some of the addresses we get are so vague that they would geocode to other countries (Oxford, England instead of Oxford, NY) if it were not for this filtering ability.


We have a long history in real estate, so we're very familiar with exactly the problems you describe.


Awesome. We need bulk requests (one or more lat/lng) and reverse geocoding with locale components (state, county, city, neighborhood). Extending tzaman's localization request: when reverse geocoding, a globally unique identifier (e.g. an ISO code) for every locale component is critical for us. When storing reverse geocoded points in our own database I want to key off the unique values but look up the locale-specific versions later on client devices (ideally via REST or an offline API if possible).

/geocode?latlng=47.639548,-122.356957&language=en,fr

{ "ISO":{ Country:"US", Administrative:"WA", SubAdministrative:"King", Locality:"Seattle", SubLocality:"Queen Anne" }, "fr":{ Country:"Etas-Unis", Administrative:"Washington", SubAdministrative:"Roi County", Locality:"Seattle", SubLocality:"Renne Anne" } "en":{ Country:"US", Administrative:"Washington", SubAdministrative:"King County", Locality:"Seattle", SubLocality:"Queen Anne" } }


Can I dig a little deeper into this? Your example has me indulging in some furious head scratching.

Providing language synonyms makes perfect sense where these exist (cf: London in English, Londra in Italian, Londres in French).

But your example implies translation of place names into their language-specific equivalent. King County in Washington state is, unless I'm mistaken, King County in all other languages. Although the local residents may disagree, this county isn't blessed with a language synonym, as it doesn't fall into the (ill-defined) category of "well known place with a language variant".

Unless you're suggesting that if, say, French is requested as a language, a geocoder should translate place names, so "King County" would (maybe) become "Comté Roi" in French. This approach sounds odd to me, as (AFAIK) no one else refers to this place that way.


Sorry for the confusion, let's see if I can clear this up. We'd like to see these locales translated to the same names that the native map program on a device would show (which is what I rely on now). For example, on my iPhone, when I switch to French and I'm in the Mission District of SF it says "Etats-Unis" / "Quartier de la Mission". Actually, that's a bad/rare example; a better one is a multilingual country like Switzerland, where you can have 3 languages at once for some cities/neighborhoods. I want to pass the locale of the speaker to the API and get back what they'd expect in their local dialect.

Make sense?


Makes a lot more sense, yes. Thanks for this.

I do have to ask how realistic this is. It makes sense for places that do have multi language versions. So "Etats-Unis" for the US, when in French, makes total sense.

But does translating the Mission to "Quartier de la Mission" make sense? It makes me go "errrr?".

To put it another way, take the ever-present UK "High Street" as an example. I'd expect to see this as "High Street" regardless of language and not as "Grande Rue", because no one ever says or uses that.

So yes, I think passing the locale to the service makes a lot of sense. Supporting those places which have multi language versions makes a lot of sense too. Translating all place names to a specific language makes less sense.

Or am I missing the point? It's always probable.


God, no. Everyone is trying to build a geocoder and everyone is failing, because no one actually realizes that geocoding is probably the most complex topic in GIS.

"Yeah, let's put some Elasticsearch and PostgreSQL and it will work out fine." No, it won't - you have no idea. And of course you won't believe me, but let me list some problems you will have that you don't realize right now:

* There are a lot of different charsets: Latin, Cyrillic, umlauts, RTL scripts, weird abbreviations, language standards that you don't know because you don't know enough about foreign cultures.

* It's a shitload of data: OpenStreetMap is about 700 GB when expanded (not including history). And you will want autocompletion or autosuggestion, so response times will have to be < 100 ms.

* Ranking. Your user types "Tokyo". Is it the restaurant next to the user, is it the capital of Japan or is it some village next to Shitfuckistan?

No matter what, it will take you about a year to get any usable result. So I suggest you look into Nominatim (the standard geocoder of OpenStreetMap, which has actually got a lot better) or Photon (a geocoder based on the Nominatim DB, but with auto suggestion).


Another new player that looks promising is Pelias (http://mapzen.com/pelias/). Early days but it's open source (https://github.com/mapzen/pelias) and it seems to be improving at a good clip.


> or is it some village next to Shitfuckistan?

Wow, where did that come from?


I'm going to guess it comes from the frustration of getting dumb results rankings that don't take into account the likelihood that the result is actually something you wanted.



Agree that Nominatim is awesome. The way it handles disambiguation is thoughtful.

In my opinion, it's better than any closed source competitor.


Thanks, but it's too late, we've fallen under the geo spell. We're very familiar with Nominatim (its many strengths, but also significant weaknesses), and with the challenge of geocoding across many different parts of the world, which as you mention is not trivial.


Geocoding is a problem that needs to be solved well once, and open for everyone to use.

Since you seem to share Nominatim's goals, why have you decided to create your own solution rather than work to improve Nominatim's weaknesses? Genuinely curious.


What makes you think we will not be working with and improving Nominatim, or many of the other good open tools and datasets (for example GeoNames, to mention just one of several)? Sorry if I've implied that.

For some context, my company, Lokku, has sponsored many a SotM (including the first one in 2007, and we're sponsoring SotM-EU next month), has repeatedly donated what I like to think are significant sums to the OSM foundation, and this year our company xmas gift to clients was a donation on their behalf to HOT-OSM. We're members of the UK's Open Data Institute, were one of the first companies to move to using OSM tiles in place of Google, run #geomob (a geo innovation meetup in London, hope to see you at our next one which happens to be tomorrow), and actively invest in geo start ups like SplashMaps. So I think I can safely say: we get it.

We've looked at Nominatim and contributed to the code. Like any complex codebase it has strengths and weaknesses. It is a significant improvement on what came before; congrats to all who contributed. It does not follow from that that all effort at inventing a better future has to fall under the Nominatim umbrella.

My question is not whether we should try to build a better geocoding service (be it an extension of Nominatim, a replacement of it, or whatever). It is: what would it look like?

Some links to the various points above about what we've been up to at Lokku http://geomobldn.org/ http://blog.lokku.com/post/77055320403/investing-in-splashma... http://blog.lokku.com/post/70479246283/donating-to-the-human...


> It does not follow from that that all effort at inventing a better future has to fall under the Nominatim umbrella.

But that is exactly what I'm asking you. To put it simply: Why does this not follow?

It seems to me that it would be far better if everyone was working in the same direction, rather than forking the effort.

If you have some specific/commercial application in mind, I would understand, but this seems to contrast with your general request made here.


I guess we just differ on philosophy. I think a diversity of approaches leads to more positive outcomes. Why limit ourselves to the structure and mental paths that are already there? In the same way that OSM has many editors, each with strengths and weaknesses, why should it have only one geocoder? That said, rest assured we have no plans to fork Nominatim.


I don't think we differ on philosophy... clearly there are different geocoders which are competing and I am interested to know more about what you're doing. That's why I'm asking about your motivation.

If you think the problems with Nominatim cannot be overcome within Nominatim, what makes you think that you can do as good a job and overcome those problems without Nominatim? That doesn't fully make sense to me yet.

I've annoyed you enough, sorry :) Thanks for your patience.


Nominatim is not built to handle autocomplete, which is a requirement for many consumer apps. Thus the focus on Lucene, where you get that for free, instead of building your own full-text search on top of PostGIS.


I've used pretty much all of the big geocoding services, and here are the problems I've run into.

1. Rate limiting: I get it, you have to make money and/or limit freeloading, but rate limiting has killed things I've built in the past, especially Google's hard rate limit. A soft rate limit, or an alternate way to monetize, would be huge.

2. Accuracy: MapBox's geocoder is not good. Aside from inaccurate map tiles, their geocoder misses entire US zip codes. PLEASE at least include helpful error messages and a path to report incorrect results.

3. A solution for shared IPs and rate limiting. I have helped several small websites that do not come close to approaching Google's daily rate limit, but because their IP was used by someone else, they are not allowed to make geocoding calls. This forced us to use a different service.

Honorable mention: It would be nice to be able to specify what data I get back from a call. If all I need is lat/lng, I don't need another kilobyte of neighborhood/city/time zone info in my result.

Hope this helps.


> rate limiting has killed things I've built in the past, especially Google's hard rate limit.

Sometimes it's possible to shift queries to the client, and then build in enough intelligence to: run only ten queries at a time, delay queries by a period that backs off, save results in localStorage, etc.

This won't solve all problems, and perhaps it annoys users to see the first ten locations pop up immediately while subsequent locations have some random delay the first time they visit a particular resource, but it does make some things possible that would not be otherwise.
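
A minimal sketch of what I mean, assuming a hypothetical geocoding endpoint (the URL and the 429 retry policy here are placeholders, not any particular provider's behaviour):

  const geocodeUrl = "https://example.com/geocode"; // placeholder endpoint

  async function cachedGeocode(query: string, attempt = 0): Promise<unknown> {
    const cacheKey = "geocode:" + query;
    const cached = localStorage.getItem(cacheKey);
    if (cached !== null) return JSON.parse(cached); // reuse earlier results

    const res = await fetch(geocodeUrl + "?q=" + encodeURIComponent(query));
    if (res.status === 429 && attempt < 5) {
      // back off: 1s, 2s, 4s, ... before retrying
      await new Promise(resolve => setTimeout(resolve, 1000 * 2 ** attempt));
      return cachedGeocode(query, attempt + 1);
    }
    const result = await res.json();
    localStorage.setItem(cacheKey, JSON.stringify(result));
    return result;
  }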


It helps a lot, thanks.

re: alternate ways to monetize, what do you propose?


Two options the way I see it:

* Offer unlimited access for more money (popular)

* Offer a cheap, simple way to batch requests of very large and/or unlimited sizes (e.g., CSV upload)

The latter is nice, because it's not worth it for me to upload a CSV file with 1 address in it -- I might as well use your regular API. And it's not worth it for just 25 addresses, either. There's some threshold where it becomes more useful for me to submit my addresses in bulk, and that's where the CSV files come in. It should be way cheaper for you to process a file with 1 million rows than it would be to process 1 million API requests, so if you were to do that, it would be a gold mine for businesses like mine that require geocoding capabilities for millions of addresses at a time.
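
For illustration only, the client side of that could be as simple as this (the /batch endpoint and jobId response are hypothetical):

  import { readFile } from "node:fs/promises";

  // Upload a CSV of addresses and get back a job id to poll later.
  async function submitBatch(csvPath: string): Promise<string> {
    const csv = await readFile(csvPath, "utf8");
    const res = await fetch("https://example.com/batch", { // hypothetical endpoint
      method: "POST",
      headers: { "Content-Type": "text/csv" },
      body: csv,
    });
    const { jobId } = await res.json() as { jobId: string };
    return jobId;
  }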


I really don't want to hijack the thread, but I couldn't help noticing that the company I work for[1] recently added a lot of these features, which might be of interest.

We have support for both bulk CSV upload and an API endpoint for batch geocoding. We are also starting to introduce unlimited access for a flat monthly fee (with no limits to requests per sec), please contact us if you're interested [2].

[1] http://geocod.io [2] hello@geocod.io


Hi Mathias,

big fan of your company's approach, congrats on your progress. The one weakness is that it's limited to just the US.


Make less money. Make it a contribution back to the Internet.


A noble intention, but it doesn't mesh perfectly with the reality of my lifestyle, i.e. feeding my kids.


Mine either but I have a day job for that.


An understanding of colloquial geography, such as:

- informal place names

- boundaries of neighbourhoods

- nesting of those things within administrative boundaries

Yahoo's Where On Earth database had a lot of this, but it doesn't seem to be available to download any more, and they didn't accept updates. GeoNames is pretty messy and inaccurate, and the copyright status has never been cleared up.


"Yahoo's Where On Earth database had a lot of this, but it doesn't seem to be available to download any more, and they didn't accept updates."

Hi Tom - as Ed says, big fan of the work you did with Flickr's shapefiles and I still use your boundaries site on a regular basis.

Yes, GeoPlanet has vanished for download from the YDN site but all versions up to 7.10 are still on archive.org (http://archive.org/search.php?query=geoplanet) thanks to the combination of Aaron of Montreal and the CC-BY-SA license we released the data under.


Hi Tom,

thanks for commenting. Big fan of your work on flickr neighbourhood boundaries.

We hear you and are on it, which doesn't mean we'll be perfect of course, but definitely aware of this issue.


Also, beware of places like Carmel-by-the-Sea, CA, where there are no street addresses! None of the houses in downtown Carmel have mailboxes - they all have P.O. boxes at the post office downtown. If you try to geocode these, you wind up with the post office's lat/lng and not the house's! Frustrating as hell...


* Support for, and intelligent detection of, a variety of coordinate formats including lat/lon, UTM, and township and range (a toy sketch follows this list). This would be really useful when dealing with old well or surveyor's logs, etc.

* Better support for terrain features. Google is getting better here but for a while "Mount Rainier" was sending you to the parks business office.

* Better support for localized search. This ties into the last one, a frequent use for me is to be zoomed in on a general area and want to find an obscure creek or peak.

* Better support for non-driving use cases. Google has a nasty habit of resolving things like unquoted locations to the nearest drivable street address, which is really stupid when you are using it to find a wilderness lake or something.

* Finer grained search by type.
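
To illustrate the first point, a toy sketch of format detection - real inputs are far messier than these three regexes assume:

  type CoordFormat = "latlon" | "utm" | "township-range" | "unknown";

  function detectCoordFormat(input: string): CoordFormat {
    const s = input.trim();
    // e.g. "46.8523, -121.7603"
    if (/^-?\d{1,2}(\.\d+)?\s*,\s*-?\d{1,3}(\.\d+)?$/.test(s)) return "latlon";
    // e.g. "10T 594151 5185593" (zone + band, easting, northing)
    if (/^\d{1,2}[C-X]\s+\d{6}\s+\d{7}$/i.test(s)) return "utm";
    // e.g. "T15N R8E S22" (township, range, section)
    if (/^T\d+[NS]\s+R\d+[EW](\s+S\d+)?$/i.test(s)) return "township-range";
    return "unknown";
  }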

(FWIW I run hillmap.com so most of my desires spring from the needs of a service targeted at hikers and backcountry skiers.)


Thanks, good feedback.


Simple.

I've worked in GIS for a number of years. I've worked on marine and scientific data management on top of GIS support, from Google Maps/Earth to ArcGIS, and pulling data from KML to OGC services.

If a service takes me nearly a month to learn to use, I'm going to push adamantly to use something else.


* API client with batching and parallelization built in (100 queries in a single request, multiple requests run in parallel, etc.) - see the sketch after this list

* robustness in the face of bad street suffixes (for example, in Burlington, VT, you may find data with "CR" meaning "CIRCLE" instead of the official USPS "CREEK")

* fuzzy street name matching (PAKCER -> PACKER)

* accurate geocoding in rural United States

* fuzzy international place matching (like "ST PANCRAS ST STATION" in London)
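
Something along these lines, say - the batch endpoint and 100-query limit are assumptions, just to show the shape of it:

  const batchUrl = "https://example.com/geocode/batch"; // hypothetical endpoint

  async function geocodeAll(addresses: string[]): Promise<unknown[]> {
    // chunk into groups of 100 queries per request
    const chunks: string[][] = [];
    for (let i = 0; i < addresses.length; i += 100) {
      chunks.push(addresses.slice(i, i + 100));
    }
    // fire the chunked requests in parallel
    const responses = await Promise.all(
      chunks.map(chunk =>
        fetch(batchUrl, {
          method: "POST",
          headers: { "Content-Type": "application/json" },
          body: JSON.stringify({ queries: chunk }),
        }).then(r => r.json() as Promise<unknown[]>)
      )
    );
    return responses.flat();
  }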


In your experience is the rural US geocoding problem a software problem or a lack of underlying data?


Don't forget desktop and mobile applications. Most mapping and geocoding services do.

- I have to ship my API keys to end users. Someone could grab a key and repurpose it.

- Rate limiting by API key penalizes one end customer for another's misbehaving.

I would love an API that is aware of the end user: applies rate limiting on a per-user basis and allows for anonymized user-based usage reports, e.g. number of end users, average number of API calls per user, …
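
For example, something like this - the X-End-User-Token header is purely hypothetical; the point is just that the service sees an opaque, anonymized per-user value rather than my raw user ids:

  import { createHash } from "node:crypto";

  // Hash a local install/user id so the service never sees the raw value.
  function endUserToken(installId: string): string {
    return createHash("sha256").update(installId).digest("hex");
  }

  async function geocodeForUser(query: string, installId: string) {
    const res = await fetch(
      "https://example.com/geocode?q=" + encodeURIComponent(query),
      { headers: { "X-End-User-Token": endUserToken(installId) } } // hypothetical header
    );
    return res.json();
  }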


Your comments generated a lot of discussion for us here, thanks. Our conclusion: if you build an app (mobile, desktop, whatever) that becomes popular and depends on a third-party service, in our case a geocoder, it generates real costs for the third-party service. So there are three potential groups who can pay the cost:

a. the third-party service provider can just provide the service for free. We can't, at least not indefinitely.

b. the end consumer can somehow be billed by the third-party service. Feels complicated, especially as the use of the service may be deep in the internals and behind the scenes of the app. The consumer may well have no idea it is being used.

c. the application developer can pay. Either directly or via billing the end consumer.

Option c. feels like the only sustainable one. Happy to hear your thoughts on it though.


Option c is the way to go. Up to the app developer to see how to monetize the app.

Bonus points if the end user can be identified. E.g. if the app can pass an opaque token to the web service. Reporting / billing from the web service provider groups usage by token.


Thanks everyone for the feedback, very useful. Please keep it coming. I need to be offline for a bit, but will check in later. If you're interested in learning more about our progress, please follow us on Twitter. Ta.

https://twitter.com/opencagedata


Apart from the most obvious (being accurate), I would say a well-documented API and properly localised results.


Thanks for commenting. If you don't mind, what exactly do you mean by "properly localised"? Can you give me an example, ideally via a service you're currently using that is doing it badly? Cheers.


The Google geocoding service returns a normalized address in addition to the lat/lon. There are times I need that address localized for the area it is in, and other times I need it localized for another culture. It would be very useful to be able to specify which locale the return value should be.


Makes sense; can you give me a specific example?


In China all the maps are in Chinese. There are times when I want them in English. I assume this is what the poster means.


Partially yes, plus the formatting, for example, some locales have zip in front of the city while some do it the other way around.


Get the timezone (and related information: UTC offset, local time, etc.) from a lat/lng point or address.
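
Assuming the geocoder returned an IANA timezone name for the point (e.g. "America/Los_Angeles"), the local time and offset could then be derived client-side with the standard Intl API - a small sketch:

  // Format the current moment in the zone the geocoder reported.
  function localTimeIn(ianaZone: string, at: Date = new Date()): string {
    return new Intl.DateTimeFormat("en-US", {
      timeZone: ianaZone,
      dateStyle: "medium",
      timeStyle: "long", // includes the zone abbreviation, e.g. "PDT"
    }).format(at);
  }

  // localTimeIn("America/Los_Angeles") -> e.g. "May 12, 2014, 9:30:00 AM PDT"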


Hi Hernantz,

Shameless plug, but this is something we recently started offering [1] for our geocoding service. I'm happy to help if you have any questions.

[1] http://geocod.io/docs/#toc_21


Quick and cheap daily batch geocoding with any type of export option, from CSV to JSON to XML.


Not sure what your definitions are of cheap or quick, nor what country your data is in, but there are lots of people who do bulk geocoding. Why don't you use them?


I do use others. The OP here was asking what we want in a geocoding service, so I told him.

But to elaborate on what I mean by quick and cheap:

Quick means I don't have to wait for an email notifying me that it's done, and I don't want my requests queued for a couple hours. I want them upon request.

Cheap means better than the average price of the competitors. It would be nice to maybe pay $100 and be able to geocode 100K addresses, for example (but I don't really have a huge sample of competitor prices).


Relax dude, I am the OP. Just trying to get specifics, not general terms like "cheap". For one person $100 is a rounding error; for the next it's a meaningful chunk of project budget. Thanks for clarifying.


Publicly visible and flexible pricing. Not $10k+ per year as Google Maps API.


Fast for single queries, not for batch geocoding?

So a web query can come from a user of a web service, the external(?) geocoding API can be called - and the reply can go back to the user (applying lon/lat processing) without waiting too long.

(I haven't done anything like this for a while, so please apply NaCl.)



