We're building a public-facing geocoding service (forward and reverse) on top of our own technology, OpenStreetMap, and various other open geo services. What features would make such a service compelling for developers? What is your wishlist?
Thanks for taking the time to answer.
For a delivery company, inaccurate results can send drivers to the wrong place, so it's important to get the best accuracy available.
You want it to work way out in the country, where buildings are few and far between; and with named buildings as well as numbered ones. So a search like  should come back accurate to less than 100m.
Fuzzy/imprecise matching should be used with care. If there's a search for Manor Close in London, it should ask which of the four Manor Closes you mean. If the only part of the address that matches is London, that's not enough information to send a delivery driver - the address should be rejected.
If there are parts of the address you can't match, that's sometimes a problem: you don't want to map 1 Hopton Parade, Streatham High Road to 1 Streatham High Road. But you do want to map Some Company Ltd, 1 Streatham High Road to the latter.
On the other hand, if your target users are dating websites wanting to show rough distances between members, just matching city might be plenty accurate enough; property search engines like Zoopla will show any shitty approximation on their maps if they don't recognise a street or postcode.
If you added a CASS or address validation (deliverability) service, it would increase the value even more.
Having recently installed PostGIS and imported loads of TIGER data, it would be useful if you provided some discussion about data sets backing your geocoder, especially if you do better than just TIGER.
I am only interested in U.S./Canada addresses (thus the mentions of CASS/verification) but I understand OSM is a global project.
To be open with you, the North American market is well served and highly competitive, many other parts of the world are the opposite. Thus the US/CA probably won't be our area of focus. That being said, very keen to learn what isn't working for you in current solutions.
It has to be downloadable because I am not allowed to send my data out to third parties. It has to be free because I have a LOT of data. The North American markets for shipping packages are well served by multiple products in this space; efforts to contextualize/augment poor quality North American address data are not well served.
Hope that helps!
We have a lot of problems with Google Maps not knowing about odd addresses like 235B Whatever St. or 113-76 XYZ street. This is perhaps because it's new construction and the address isn't in the database yet.
Town names in the Hamptons are quite contentious. What the post office knows it as is not what the locals call it, and nothing like what the real estate agents call it (but they make up their own and they lie about the locations).
But primarily you need to parse native language and messy addresses. 2 Ave, First Street, Fifth Ave, Madison Ave (same as 5th), CPW (central park west)
This is where Google wins.
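To make the "messy addresses" point concrete, here is a minimal sketch of the kind of normalisation step a geocoder might run before matching. The alias, ordinal and suffix tables are tiny illustrative assumptions (a real one would be far larger and locale-specific), and `normalize_street` is a made-up name, not any library's API.

```python
import re

# Illustrative lookup tables only; real ones would be much larger.
ALIASES = {"CPW": "CENTRAL PARK WEST"}
ORDINAL_WORDS = {"FIRST": "1ST", "SECOND": "2ND", "THIRD": "3RD",
                 "FOURTH": "4TH", "FIFTH": "5TH"}
SUFFIXES = {"AVE": "AVENUE", "ST": "STREET"}


def ordinal(n):
    """Turn 2 into '2ND', 11 into '11TH', and so on."""
    if 10 <= n % 100 <= 20:
        return f"{n}TH"
    return f"{n}" + {1: "ST", 2: "ND", 3: "RD"}.get(n % 10, "TH")


def normalize_street(text):
    """Normalise a street name (hypothetical helper, not a library call)."""
    key = re.sub(r"[^\w\s]", "", text).upper().strip()
    if key in ALIASES:
        return ALIASES[key]
    tokens = key.split()
    out = []
    for i, tok in enumerate(tokens):
        nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
        if tok in ORDINAL_WORDS:
            tok = ORDINAL_WORDS[tok]
        elif tok.isdecimal() and (nxt in SUFFIXES or nxt in SUFFIXES.values()):
            # A bare number directly before a street type is read as an ordinal.
            tok = ordinal(int(tok))
        elif tok in SUFFIXES and i == len(tokens) - 1:
            tok = SUFFIXES[tok]
        out.append(tok)
    return " ".join(out)
```

With this, "2 Ave" and "Second Avenue" normalise to the same string, as do "Fifth Ave" and "5th Avenue". It does nothing for cases where two different names refer to the same street; that needs a data set, not string rules.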
Regarding Google, you are right, they do a good job, especially in the US. Full credit to them. The problem is the cost and usage restrictions.
Coming back to your initial point of reporting inaccuracies. What would be your preferred way to report problems? Some sort of API you could automate? Would you just tell us there is a problem, or want to also tell us the solution?
But the real estate agents do get frustrated when it's a real mapping error.
India - in Chennai they have 2 numbers on each house: old numbering and new numbering. So addresses are shown as 17 : 34.
And the rickshaw drivers have never seen a map in their lives. I hold up a Google map on the phone and they stare fascinated at it for minutes, but it really has nothing to do with the reality of getting there. Many people don't even know the names of the streets and they change the names all the time. It's done by corners and what landmarks are there.
Curious - how did you come up with the name Nestoria?
- domain available
- written the way it is spoken
- shorter is better
In our old logo we used to use the "nest" image, but that only makes sense in some languages. Last year we went with a more modern look.
IIRC Google's geocoder does something like this, but it's pretty inaccurate, consistently overstating its match level.
As others have said, geocoding is very hard to do well, but I commend the efforts being made with Nominatim and komoot/photon.
Also agree, Nominatim and Photon are impressive.
I don't have a great way to characterize the geographic coverage or data quality of the geocoder, but it is clear that it has a data set which must be maintained to support geocoding into the future. Soon I'll have to start figuring out how long my current data is useful, and how long it will be before the next update from the census bureau.
I'm starting to think that it's crazy for so many businesses to need reliable GIS data and have so few sources to go to for it. With the right organizational structure, we could be crowdsourcing it daily.
But I digress.
For example, when we get a vague address such as "Ichobod Crane Circle", we still want to get a position because that road is very short. However, if the address is something like "Sleepy Hollow Road" or "Murders Kill Road" (Coxsackie sure has some weird names), those are very long roads, and placing a marker anywhere on them would be meaningless.
Google solves this for us by providing, in the results, the bounding box of the result. When the match is not "street_address" but something else such as "route", "premise", "point_of_interest" or the like, what we do is take the bounding_box, calculate the area, and use the area if it's less than 500x500 meters. It's not optimal, but it's better than having no location at all.
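The bounding-box check described above can be sketched roughly like this. The result shape (`type`, `lat`, `lng`, `bbox` as south/west/north/east) is an assumption for illustration, not Google's actual response format, and the size estimate is a flat approximation that ignores boxes crossing the antimeridian.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def bbox_size_m(south, west, north, east):
    """Approximate width and height of a lat/lng box in metres."""
    mid_lat = math.radians((south + north) / 2)
    height = math.radians(north - south) * EARTH_RADIUS_M
    width = math.radians(east - west) * EARTH_RADIUS_M * math.cos(mid_lat)
    return width, height


def usable_position(result, max_side_m=500):
    """Return a (lat, lng) worth plotting, or None if the match is too vague.

    `result` is a hypothetical dict; exact street matches are used as-is,
    anything else only if its box covers at most max_side_m squared.
    """
    if result["type"] == "street_address":
        return result["lat"], result["lng"]
    south, west, north, east = result["bbox"]
    width, height = bbox_size_m(south, west, north, east)
    if width * height <= max_side_m ** 2:
        return (south + north) / 2, (west + east) / 2
    return None
```

A short circle or cul-de-sac passes the check and gets a marker at the centre of its box; a road several kilometres long returns None and can be flagged for manual review instead.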
Another thing that Google does semi-well is constrain the search to a specific area, like a country or a state. Unfortunately, Google doesn't let you pass in more than one state, but other than that, it works well. Some of the addresses we get are so vague that they would geocode to other countries (Oxford, England instead of Oxford, NY) if it were not for this filtering ability.
Providing language synonyms makes perfect sense where these exist (cf: London in English, Londra in Italian, Londres in French).
But your example implies translation of place names into their language specific equivalent. Kings County in Washington state is, unless I'm mistaken, Kings County in all other languages. Although the local residents may disagree, this county isn't blessed with a language synonym as it doesn't fall into the (ill defined) category of "well known place with a language variant".
Unless you're suggesting that if, say, French is requested as a language, a geocoder should translate place names so "Kings County" would (maybe) be "Comté Roi" in French. Although this approach sounds odd to me as (AFAIK) no one else refers to this place in this way?
I do have to ask how realistic this is. It makes sense for places that do have multi language versions. So "Etats-Unis" for the US, when in French, makes total sense.
But does translating the Mission to "Quartier de la Mission" make sense? It makes me go "errrr?".
To put it another way, taking the ever present UK "High Street" as an example. I'd expect to see this as "High Street" regardless of language and not as "Grande Rue" because no one ever says this or uses it.
So yes, I think passing the locale to the service makes a lot of sense. Supporting those places which have multi language versions makes a lot of sense too. Translating all place names to a specific language makes less sense.
Or am I missing the point? It's always probable.
"Yeah, let's put some Elasticsearch and PostgreSQL and it will work out fine." No, it won't - you have no idea. And of course you won't believe me, but let me list some problems you will have that you don't realize right now:
* There are a lot of different character sets: Latin, Cyrillic, umlauts, RTL, weird abbreviations, and language standards that you don't know, because you don't know enough about foreign cultures.
* It's a shitload of data: OpenStreetMap is about 700 GB when expanded (not including history). And you will want to have autocompletion or autosuggestion, so response times will have to be < 100 ms.
* Ranking. Your user types "Tokyo". Is it the restaurant next to the user, is it the capital of Japan or is it some village next to Shitfuckistan?
No matter what, it will take you about a year to get any usable result. So I suggest you look into Nominatim (the standard geocoder of OpenStreetMap, which has actually got a lot better) or Photon (a geocoder based on the Nominatim DB, but with auto suggestion).
Wow, where did that come from?
In my opinion, it's better than any closed source competitor.
Since you seem to share Nominatim's goals, why have you decided to create your own solution rather than work to improve Nominatim's weaknesses? Genuinely curious.
For some context, my company, Lokku, has sponsored many a SotM (including the first one in 2007, and we're sponsoring SotM-EU next month), has repeatedly donated what I like to think are significant sums to the OSM foundation, and this year our company xmas gift to clients was a donation on their behalf to HOT-OSM. We're members of the UK's Open Data Institute, were one of the first companies to move to using OSM tiles in place of Google, run #geomob (a geo innovation meetup in London, hope to see you at our next one which happens to be tomorrow), and actively invest in geo start ups like SplashMaps. So I think I can safely say: we get it.
We've looked at Nominatim and contributed to the code. Like any complex codebase it has strengths and weaknesses. It is a significant improvement on what came before, congrats to all who contributed. It does not follow from that that all effort at inventing a better future has to fall under the Nominatim umbrella.
My question is not "should we try to build a better geocoding service?" (be it an extension of Nominatim, a replacement of Nominatim, or whatever). It is "what would it look like?"
Some links to the various points above about what we've been up to at Lokku
But that is exactly what I'm asking you.
To put it simply: Why does this not follow?
It seems to me that it would be far better if everyone was working in the same direction, rather than forking the effort.
If you have some specific/commercial application in mind, I would understand, but this seems to contrast with your general request made here.
If you think the problems with Nominatim cannot be overcome within Nominatim, what makes you think that you can do as good a job and overcome those problems without Nominatim?
That doesn't fully make sense to me yet.
I've annoyed you enough, sorry :) Thanks for your patience.
1. Rate limiting: I get it, you have to make money and/or limit your freeloading, but rate limiting has killed things I've built in the past, especially Google's hard rate limit. A soft rate limit, or an alternate way to monetize, would be huge.
2. Accuracy: MapBox's geocoder is not good. Aside from inaccurate map tiles, their geocoder misses entire US zip codes. PLEASE at least include helpful error messages and a path to report incorrect results.
3. A solution for shared IPs and rate limiting. I have helped several small websites that do not come close to approaching Google's daily rate limit, but because their IP was used by someone else, they are not allowed to make geocoding calls. This forced us to use a different service.
Honorable mention: It would be nice to be able to specify what data I get back from a call. If all I need is lat/lng, I don't need another kilobyte of neighborhood/city/time zone info in my result.
Hope this helps.
Sometimes it's possible to shift queries to the client, and then build in enough intelligence to: run only ten queries at a time, delay queries by a period that backs off, save results in localStorage, etc.
This won't solve all problems, and perhaps it annoys users to see the first ten locations pop up immediately while subsequent locations have some random delay the first time they visit a particular resource, but it does make some things possible that would not be otherwise.
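For what it's worth, the client-side approach above (a cap of ten queries in flight, a delay that backs off, and a local cache) might look something like the sketch below. It is written in Python with asyncio rather than browser JavaScript, the dict stands in for localStorage, and `RateLimited`, `ThrottledGeocoder` and the `lookup` callable are all made-up names for illustration.

```python
import asyncio


class RateLimited(Exception):
    """Raised by the lookup callable when the service refuses a request (hypothetical)."""


class ThrottledGeocoder:
    def __init__(self, lookup, max_in_flight=10, base_delay=0.5, max_retries=4):
        self._lookup = lookup              # async callable: address -> result
        self._max_in_flight = max_in_flight
        self._base_delay = base_delay      # seconds; doubles on each retry
        self._max_retries = max_retries
        self._cache = {}                   # stands in for localStorage

    async def geocode_all(self, addresses):
        # At most max_in_flight lookups run at once; results keep input order.
        sem = asyncio.Semaphore(self._max_in_flight)
        return await asyncio.gather(
            *(self._geocode_one(a, sem) for a in addresses)
        )

    async def _geocode_one(self, address, sem):
        if address in self._cache:
            return self._cache[address]
        async with sem:
            for attempt in range(self._max_retries + 1):
                try:
                    result = await self._lookup(address)
                except RateLimited:
                    if attempt == self._max_retries:
                        raise
                    # Exponential backoff: base, 2x base, 4x base, ...
                    await asyncio.sleep(self._base_delay * 2 ** attempt)
                else:
                    self._cache[address] = result
                    return result
```

Repeat visits hit the cache and never touch the network, which is where most of the benefit comes from. Duplicate addresses inside a single batch are still looked up separately in this sketch.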
re: alternate ways to monetize, what do you propose?
* Offer unlimited access for more money (popular)
* Offer a cheap, simple way to batch requests of very large and/or unlimited sizes (e.g., CSV upload)
The latter is nice, because it's not worth it for me to upload a CSV file with 1 address in it -- I might as well use your regular API. And it's not worth it for just 25 addresses, either. There's some threshold where it becomes more useful for me to submit my addresses in bulk, and that's where the CSV files come in. It should be way cheaper for you to process a file with 1 million rows than it would be to process 1 million API requests, so if you were to do that, it would be a gold mine for businesses like mine that require geocoding capabilities for millions of addresses at a time.
We have support for both bulk CSV upload and an API endpoint for batch geocoding. We are also starting to introduce unlimited access for a flat monthly fee (with no limits to requests per sec), please contact us if you're interested.
Big fan of your company's approach, congrats on your progress. The one weakness is that it's limited to just the US.
- informal place names
- boundaries of neighbourhoods
- nesting of those things within administrative boundaries
Yahoo's Where On Earth database had a lot of this, but it doesn't seem to be available to download any more, and they didn't accept updates. GeoNames is pretty messy and inaccurate, and the copyright status has never been cleared up.
Hi Tom - as Ed says, big fan of the work you did with Flickr's shapefiles and I still use your boundaries site on a regular basis.
Yes, GeoPlanet has vanished for download from the YDN site but all versions up to 7.10 are still on archive.org (http://archive.org/search.php?query=geoplanet) thanks to the combination of Aaron of Montreal and the CC-BY-SA license we released the data under.
Thanks for commenting. Big fan of your work on Flickr neighbourhood boundaries.
We hear you and are on it, which doesn't mean we'll be perfect of course, but definitely aware of this issue.
* Better support for terrain features. Google is getting better here but for a while "Mount Rainier" was sending you to the parks business office.
* Better support for localized search. This ties into the last one: a frequent use for me is to be zoomed in on a general area and want to find an obscure creek or peak.
* Better support for non-driving use cases. Google has a nasty habit of resolving things like unquoted locations to the nearest drivable street address, which is really stupid when you are using it to find a wilderness lake or something.
* Finer grained search by type.
(FWIW I run hillmap.com so most of my desires spring from the needs of a service targeted at hikers and backcountry skiers.)
I've worked in GIS for a number of years. I've worked on marine and scientific data management on top of GIS support, from Google Maps/Earth to ArcGIS, and pulling data from KML to OGC services.
If a service takes me nearly a month to learn to use, I'm going to push adamantly to use something else.
* robustness in the face of bad street suffixes (for example, in Burlington, VT, you may find data with "CR" meaning "CIRCLE" instead of the official USPS "CREEK")
* fuzzy street name matching (PAKCER -> PACKER)
* accurate geocoding in rural United States
* fuzzy international place matching (like "ST PANCRAS ST STATION" in London)
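As a rough illustration of the fuzzy street name item in the list above, Python's standard `difflib` already gets you the PAKCER -> PACKER case. This is only a sketch: `match_street` and the 0.75 cutoff are assumptions, and a production geocoder would more likely use an index built for this (trigrams, phonetic keys) than a linear scan.

```python
import difflib


def match_street(name, known_streets, cutoff=0.75):
    """Return the closest known street name, or None if nothing is close enough.

    `known_streets` is assumed to be a list of upper-case street names for
    one locality; the cutoff is a guess that would need tuning on real data.
    """
    candidates = difflib.get_close_matches(
        name.upper(), known_streets, n=1, cutoff=cutoff
    )
    return candidates[0] if candidates else None
```

Keeping the candidate list limited to one town or postcode matters more than the exact cutoff: the smaller the list, the less likely a typo lands on the wrong street.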
- I have to ship my API keys to end users. Someone could grab it and repurpose it
- Rate limiting by API key penalizes one end customer for the other's misbehaving
I would love an API that is aware of the end user. Applies rate limiting on a user basis. Allows for anonymized user-based usage report. E.g. number of end users, average number of API calls by users, …
a. the third party service provider can just provide service for free. We can't, at least not indefinitely.
b. the end consumer can somehow be billed by the third party service. Feels complicated, especially as the use of the service may be deep in the internals and behind the scenes of the app. The consumer may well have no idea it is being used
c. the application developer can pay. Either directly or via billing the end consumer.
Option c. feels like the only sustainable one. Happy to hear your thoughts on it though.
Bonus points if the end user can be identified. E.g. if the app can pass an opaque token to the web service. Reporting / billing from the web service provider groups usage by token.
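A minimal sketch of what "aware of the end user" could mean on the service side, assuming the app passes an opaque token with each call. The fixed-window limit and the class and method names are invented for illustration and do not describe any existing geocoding API.

```python
from collections import defaultdict


class PerUserLimiter:
    """Fixed-window rate limit and usage totals, keyed by an opaque end-user token."""

    def __init__(self, limit, window_s):
        self.limit = limit            # allowed calls per token per window
        self.window_s = window_s      # window length in seconds
        self._windows = {}            # token -> (window_start, calls_in_window)
        self.totals = defaultdict(int)

    def allow(self, token, now):
        """Record a call at time `now` (seconds) and say whether it is allowed."""
        start, count = self._windows.get(token, (now, 0))
        if now - start >= self.window_s:
            start, count = now, 0     # a new window begins for this token
        if count >= self.limit:
            self._windows[token] = (start, count)
            return False
        self._windows[token] = (start, count + 1)
        self.totals[token] += 1
        return True

    def report(self):
        """Anonymised usage summary: tokens are counted, never listed."""
        users = len(self.totals)
        calls = sum(self.totals.values())
        return {
            "end_users": users,
            "avg_calls_per_user": calls / users if users else 0.0,
        }
```

One misbehaving user runs into their own limit without affecting anyone else on the same API key, and the report gives the developer the two numbers asked for above without the service ever learning who the users are.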
Shameless plug, but this is something we recently started offering for our geocoding service. I'm happy to help if you have any questions.
But to elaborate on what I mean by quick and cheap:
Quick means I don't have to wait for an email notifying me that it's done, and I don't want my requests queued for a couple hours. I want them upon request.
Cheap means better than the average price of the competitors. It would be nice to maybe pay $100 and be able to geocode 100K addresses, for example (but I don't really have a huge sample of competitor prices).
So a web query can come from a user of a web service, the external(?) geocoding API can be called, and the reply can go back to the user [applying lon/lat processing] without waiting too long.
(I haven't done anything like this for a while, so please apply NaCl.)