
Ask HN: what features do you want in a geocoder? - freyfogle
Hi,

we're building a public facing geocoding service (forward and reverse) on top of our own technology, OpenStreetMap, and various other open geo services. What features would make such a service compelling for developers? What is your wishlist?

Thanks for taking the time to answer.
======
michaelt
Depends on what your target customers are doing!

For a delivery company, inaccurate results can send drivers to the wrong
place, so it's important to get the best accuracy available.

You want it to work way out in the country, where buildings are few and far
between; and with named buildings as well as numbered ones. So a search like
[1] should come back accurate to less than 100m.

Fuzzy/imprecise matching should be used with care. If there's a search for
Manor Close in London, it should ask which of the four Manor Closes you mean
[2]. If the only part of the address that matches is London, that's not enough
information to send a delivery driver - the address should be rejected.

If there are parts of the address you can't match that's sometimes a problem -
you don't want to map 1 Hopton Parade, Streatham High Road to 1 Streatham High
Road. But you do want to map Some Company Ltd, 1 Streatham High Road to the
latter.

On the other hand, if your target users are dating websites wanting to show
rough distances between members, just matching city might be plenty accurate
enough; property search engines like Zoopla will show any shitty approximation
on their maps if they don't recognise a street or postcode.

[1]
[https://maps.google.co.uk/maps?q=Paradise+Wildlife+Park,+Whi...](https://maps.google.co.uk/maps?q=Paradise+Wildlife+Park,+White+Stubbs+Lane,+Broxbourne)
[2]
[https://maps.google.co.uk/maps?q=from:E17+5RT+to:NW7+3NG+to:...](https://maps.google.co.uk/maps?q=from:E17+5RT+to:NW7+3NG+to:NW9+9HD+to:SE28+8EY)

------
spacemanmatt
A downloadable bulk geocoding service. Some address databases are not licensed
for exposure to 3rd parties, but geocoding is very interesting.

If you added CASS or address validation (deliverability) service it would
increase the value even more.

Having recently installed PostGIS and imported loads of TIGER data, it would
be useful if you provided some discussion about data sets backing your
geocoder, especially if you do better than just TIGER.

I am only interested in U.S./Canada addresses (thus the mentions of
CASS/verification) but I understand OSM is a global project.

~~~
freyfogle
thanks for the feedback. If I may ask: there are several providers who supply
exactly what you're asking for. Why are you not using them?

To be open with you, the North American market is well served and highly
competitive, many other parts of the world are the opposite. Thus the US/CA
probably won't be our area of focus. That being said, very keen to learn what
isn't working for you in current solutions.

~~~
spacemanmatt
I am building a PostGIS/TIGER/PAGC/other data engine to improve address data.
It comes in dirty, I want to improve it as much as possible: Normalize,
geocode, verify, correct. Correction is basically a pipe dream for bulk
operation but it's needed for point-fixes. Everything else is feasible in
bulk. I haven't found all the features under one hood yet.

It has to be downloadable because I am not allowed to send my data out to
third parties. It has to be free because I have a LOT of data. The North
American markets for shipping packages are well served by multiple products in
this space; efforts to contextualize/augment poor quality North American
address data are not well served.

Hope that helps!

------
crucialfelix
Ability to report inaccurate addresses without just telling the customer to go
to OpenStreetMap and edit it.

We have a lot of problems with Google Maps not knowing about odd addresses
like 235B Whatever St. or 113-76 XYZ Street. This is perhaps because it's new
construction and the address isn't in the database yet.

Town names in the Hamptons are quite contentious. What the post office knows
it as is not what the locals call it, and nothing like what the real estate
agents call it (but they make up their own and they lie about the locations).

But primarily you need to parse native-language and messy addresses: 2 Ave,
First Street, Fifth Ave, Madison Ave (same as 5th), CPW (Central Park West).

This is where Google wins.

~~~
freyfogle
It's funny you mention that, as our main business is the real estate search
engine Nestoria [http://www.nestoria.com](http://www.nestoria.com). We parse
the addresses of about 15M listings in 9 different countries every day (though
not the US). We work in pretty chaotic markets like India and Brazil. The
world is a very diverse place, but there is one constant - agents do not feel
the need to let themselves be bound by the "on the ground" truth of where a
listing is.

Regarding Google, you are right, they do a good job, especially in the US.
Full credit to them. The problem is the cost and usage restrictions.

Coming back to your initial point of reporting inaccuracies: what would be
your preferred way to report problems? Some sort of API you could automate?
Would you just tell us there is a problem, or would you also want to tell us
the solution?

~~~
crucialfelix
If you are using OSM then it would be for you to take the change request and
then perform it yourself. Don't plan to do this at scale - just do it so that
your client thinks they are getting full service. They wouldn't bother to do
it very often.

But the real estate agents do get frustrated when it's a real mapping error.

India - in Chennai they have 2 numbers on each house: old numbering and new
numbering. So addresses are shown as 17 : 34

And the rickshaw drivers have never seen a map in their lives. I hold up a
Google map on the phone and they stare, fascinated, at it for minutes, but it
really has nothing to do with the reality of getting there. Many people don't
even know the names of the streets, and they change the names all the time.
It's done by corners and what landmarks are there.

Curious - how did you come up with the name Nestoria?

~~~
freyfogle
Regarding the name, we just did some brainstorming and a member of the team
came up with it, nothing fancier than that. Our requirements were

\- domain available

\- written the way it is spoken

\- shorter is better

in our old logo we used to use the "nest" image, but that only makes sense in
some languages. Last year we went with a more modern look.

------
ManAboutCouch
One thing that I make use of but don't see too many services providing is some
kind of 'match level' - where the geocoder returns a code indicating how
confident it is about the quality of its result. A result of 1 might mean a
building-level match, while 100 might be street level, etc.
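A sketch of how a client might consume such a code - the numeric levels and the `match_level` field name here are purely illustrative, not any real provider's scheme:

```python
# Illustrative (hypothetical) scale: lower code = more precise match.
# 1 = building, 25 = street, 50 = neighbourhood, 100 = city.
def usable_for_delivery(result, max_level=25):
    """Keep only results the geocoder claims are precise enough to route a driver."""
    return result.get("match_level", 100) <= max_level

results = [
    {"address": "1 Streatham High Rd, London", "match_level": 1},
    {"address": "Streatham High Rd, London", "match_level": 25},
    {"address": "London", "match_level": 100},   # city-only match: reject
]
precise = [r for r in results if usable_for_delivery(r)]
print([r["address"] for r in precise])  # only the street-or-better matches
```

The point being that a consistent, documented scale lets clients make this accept/reject decision mechanically.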

IIRC Google's geocoder does something like this, but it's pretty inaccurate,
consistently overstating its match level.

As others have said, geocoding is very hard to do well, but I commend the
efforts being made with Nominatim and komoot/photon.

~~~
freyfogle
agreed, a simple-to-understand confidence score is critical.

Also agree Nominatim and Photon are impressive.

~~~
spacemanmatt
FWIW I ran millions of addresses through PostGIS geocoders (using both TIGER
and PAGC's normalizer functions) and found that most addresses geocoded with
confidence level 0 or 1: about 60%, with the other 40% spread across ratings
2-100.

I don't have a great way to characterize the geographic coverage or data
quality of the geocoder, but it is clear that it has a data set which must be
maintained to support geocoding into the future. Soon I'll have to start
figuring out how long my current data is useful, and how long it will be
before the next update from the census bureau.

I'm starting to think it's crazy for so many businesses to need reliable
GIS data and have so few sources to go to for it. With the right
organizational structure, we could be crowdsourcing it daily.

But I digress.

------
lobster_johnson
One thing that our application (processing real-estate data feeds) needs is
the ability to figure out an approximate location if the address is a little
vague.

For example, when we get a vague address such as "Ichabod Crane Circle", we
still want to get a position, because that road is very short. However, if the
address is something like "Sleepy Hollow Road" or "Murders Kill Road"
(Coxsackie sure has some weird names), those are very long roads, and placing
a marker anywhere on them would be meaningless.

Google solves this for us by providing the bounding box of the result. When
the match is not "street_address" but something else such as "route",
"premise", or "point_of_interest", we take the bounding box, calculate its
area, and use the location only if the area is less than 500x500 meters. It's
not optimal, but it's better than having no location at all.
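That area check can be sketched as follows - a rough equirectangular approximation, using the 500x500 m threshold from the comment above (the coordinates are made-up examples):

```python
import math

def bbox_area_m2(south, west, north, east):
    """Approximate area of a small lat/lng bounding box in square metres."""
    lat_m = (north - south) * 111_320                    # ~metres per degree of latitude
    mid_lat = math.radians((north + south) / 2)
    lng_m = (east - west) * 111_320 * math.cos(mid_lat)  # longitude degrees shrink with latitude
    return abs(lat_m * lng_m)

def usable_location(bbox, max_side_m=500):
    """Accept a vague ("route"/"premise") match only if its box is under ~500 m a side."""
    return bbox_area_m2(*bbox) < max_side_m ** 2

short_road = (42.4000, -73.8000, 42.4010, -73.7990)  # roughly 110 m x 80 m
long_road = (42.30, -73.90, 42.45, -73.75)           # tens of kilometres
print(usable_location(short_road), usable_location(long_road))  # True False
```

Good enough at these scales; for boxes spanning whole regions you'd want proper geodesic math.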

Another thing that Google does semi-well is constrain the search to a specific
area, like a country or a state. Unfortunately, Google doesn't let you pass in
more than one state, but other than that, it works well. Some of the addresses
we get are so vague that they would geocode to other countries (Oxford,
England instead of Oxford, NY) if it were not for this filtering ability.

~~~
freyfogle
we have a long history in real estate, very familiar with exactly the problems
you describe.

------
troysandal
Awesome. We need bulk requests (one or more lat/lng) and reverse geocoding
with locale components (state, county, city, neighborhood). Extending tzaman's
localization request: a globally unique identifier, e.g. an ISO code, for
every piece of locale when reverse geocoding is critical for us. When storing
reverse-geocoded points in our own database I want to key off the unique
values but look up the locale-specific versions later on client devices
(ideally via REST or an offline API if possible).

/geocode?latlng=47.639548,-122.356957&language=en,fr

{
  "ISO": { "Country": "US", "Administrative": "WA", "SubAdministrative": "King", "Locality": "Seattle", "SubLocality": "Queen Anne" },
  "fr": { "Country": "Etats-Unis", "Administrative": "Washington", "SubAdministrative": "Roi County", "Locality": "Seattle", "SubLocality": "Renne Anne" },
  "en": { "Country": "US", "Administrative": "Washington", "SubAdministrative": "King County", "Locality": "Seattle", "SubLocality": "Queen Anne" }
}

~~~
vicchi
Can I dig a little deeper into this? Your example has me indulging in some
furious head scratching.

Providing language synonyms makes perfect sense where these exist (cf: London
in English, Londra in Italian, Londres in French).

But your example implies translation of place names into their language-
specific equivalents. King County in Washington state is, unless I'm mistaken,
King County in all other languages. Although the local residents may disagree,
this county isn't blessed with a language synonym, as it doesn't fall into the
(ill-defined) category of "well known place with a language variant".

Unless you're suggesting that if, say, French is requested as a language, a
geocoder should translate place names so "King County" would (maybe) become
"Comté Roi" in French. That approach sounds odd to me, as (AFAIK) no one else
refers to this place that way.

~~~
troysandal
Sorry for the confusion, let's see if I can clear this up. We'd like to see
these locales translated to the same names that the native map program on a
device would show (which is what I rely on now). For example, on my iPhone,
when I switch to French and I'm in the Mission District of SF it says
"Etats-Unis" / "Quartier de la Mission". Actually, that's a bad/rare example;
a better one is a multilingual country like Switzerland, where some
cities/neighborhoods can have 3 languages at once. I want to pass the locale
of the speaker to the API and get back what they'd expect in their local
dialect.

Make sense?

~~~
vicchi
Makes a lot more sense, yes. Thanks for this.

I do have to ask how realistic this is. It makes sense for places that do have
multi language versions. So "Etats-Unis" for the US, when in French, makes
total sense.

But does translating the Mission to "Quartier de la Mission" make sense? It
makes me go "errrr?".

To put it another way, take the ever-present UK "High Street" as an example.
I'd expect to see this as "High Street" regardless of language, and not as
"Grande Rue", because no one ever says or uses that.

So yes, I think passing the locale to the service makes a lot of sense.
Supporting those places which have multi language versions makes a lot of
sense too. Translating all place names to a specific language makes less
sense.

Or am I missing the point? It's always probable.

------
thomersch_
God, no. Everyone is trying to build a geocoder and everyone is failing,
because no one actually realizes that geocoding is probably the most complex
topic in GIS.

"Yeah, let's put some Elasticsearch and PostgreSQL and it will work out fine."
No, it won't - you have no idea. And of course you won't believe me, but let
me list some problems you will have that you don't realize right now:

* There are a lot of different charsets: Latin, Cyrillic; there are umlauts, RTL scripts, weird abbreviations, and language standards you don't know, because you don't know enough about foreign cultures.

* It's a shitload of data: OpenStreetMap is about 700 GB when expanded (not including history). And you will want autocompletion or autosuggestion, so response times will have to be < 100 ms.

* Ranking. Your user types "Tokyo". Is it the restaurant next to the user, is it the capital of Japan or is it some village next to Shitfuckistan?

No matter what, it will take you about a year to get any usable result. So I
suggest you look into Nominatim (the standard geocoder of OpenStreetMap, which
has actually gotten a lot better) or Photon (a geocoder based on the Nominatim
DB, but with autosuggestion).

~~~
nodata
> or is it some village next to Shitfuckistan?

Wow, where did that come from?

~~~
natch
I'm going to guess it comes from the frustration of getting dumb result
rankings that don't take into account the likelihood that the result is
actually something you wanted.

------
jessebushkar
I've used pretty much all of the big geocoding services, and here are problems
I've run into.

1\. Rate limiting: I get it, you have to make money and/or limit your
freeloading, but rate limiting has killed things I've built in the past,
especially Google's hard rate limit. A soft rate limit, or an alternate way to
monetize, would be huge.

2\. Accuracy: MapBox's geocoder is not good. Aside from inaccurate map tiles,
their geocoder misses entire US zip codes. PLEASE at least include helpful
error messages and a path to report incorrect results.

3\. A solution for shared IPs and rate limiting. I have helped several small
websites that do not come close to approaching Google's daily rate limit, but
because their IP was used by someone else, they are not allowed to make
geocoding calls. This forced us to use a different service.

Honorable mention: It would be nice to be able to specify what data I get back
from a call. If all I need is lat/lng, I don't need another kilobyte of
neighborhood/city/time zone info in my result.

Hope this helps.

~~~
freyfogle
It helps a lot, thanks.

re: alternate ways to monetize, what do you propose?

~~~
Jemaclus
Two options the way I see it:

* Offer unlimited access for more money (popular)

* Offer a cheap, simple way to batch requests of very large and/or unlimited sizes (e.g., CSV upload)

The latter is nice, because it's not worth it for me to upload a CSV file with
1 address in it -- I might as well use your regular API. And it's not worth it
for just 25 addresses, either. There's some threshold where it becomes more
useful for me to submit my addresses in bulk, and that's where the CSV files
come in. It should be way cheaper for you to process a file with 1 million
rows than it would be to process 1 million API requests, so if you were to do
that, it would be a gold mine for businesses like mine that require geocoding
capabilities for millions of addresses at a time.
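The client side of that bulk workflow can be sketched like this; `geocode_batch` here is a stub standing in for one POST of up to 100 addresses to a hypothetical batch endpoint:

```python
from concurrent.futures import ThreadPoolExecutor

def chunked(items, size):
    """Split a big address list into fixed-size batches."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def geocode_batch(batch):
    """Stub for one bulk request to a (hypothetical) batch endpoint.
    A real client would POST the batch and parse the JSON response."""
    return [{"query": addr, "lat": None, "lng": None} for addr in batch]

def bulk_geocode(addresses, batch_size=100, workers=4):
    """One request per batch, with several batches in flight at once -
    far cheaper for the provider than one HTTP call per address."""
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for part in pool.map(geocode_batch, chunked(addresses, batch_size)):
            results.extend(part)
    return results

rows = [f"{n} Main St" for n in range(1, 251)]
out = bulk_geocode(rows)   # 250 addresses -> 3 batched requests
```

The CSV-upload variant is the same idea with the chunking done server-side.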

~~~
thecodemonkey
I really don't want to hijack the thread, but I couldn't help notice that the
company I work for[1] recently added a lot of these features which might be of
interest.

We have support for both bulk CSV upload and an API endpoint for batch
geocoding. We are also starting to introduce unlimited access for a flat
monthly fee (with no limits to requests per sec), please contact us if you're
interested [2].

[1] [http://geocod.io](http://geocod.io) [2] hello@geocod.io

~~~
freyfogle
Hi Mathias,

big fan of your company's approach, congrats on your progress. The one
weakness is that it's limited to just the US.

------
scraplab
An understanding of colloquial geography, such as:

\- informal place names

\- boundaries of neighbourhoods

\- nesting of those things within administrative boundaries

Yahoo's Where On Earth database had a lot of this, but it doesn't seem to be
available to download any more, and they didn't accept updates. GeoNames is
pretty messy and inaccurate, and the copyright status has never been cleared
up.

~~~
freyfogle
Hi Tom,

thanks for commenting. Big fan of your work on flickr neighbourhood
boundaries.

We hear you and are on it, which doesn't mean we'll be perfect of course, but
definitely aware of this issue.

~~~
Jemaclus
Also, beware of places like Carmel-by-the-Sea, CA, where there are no street
addresses! None of the houses in downtown Carmel have mailboxes -- they all
have P.O. boxes at the post office downtown. If you try to geocode these, you
wind up with the post office lat/lng and not the house! Frustrating as hell...

------
micro_cam
* Support for and intelligent detection of a variety of coordinate formats, including Lat/Lon, UTM, and township and range. This would be really useful when dealing with old well or surveyor's logs, etc.

* Better support for terrain features. Google is getting better here, but for a while "Mount Rainier" was sending you to the park's business office.

* Better support for localized search. This ties into the last one: a frequent use for me is to be zoomed in on a general area and want to find an obscure creek or peak.

* Better support for non-driving use cases. Google has a nasty habit of resolving things like unquoted locations to the nearest drivable street address, which is really stupid when you are using it to find a wilderness lake or something.

* Finer grained search by type.
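Detecting the input format might start with a sketch like this - the patterns are rough and deliberately incomplete (no DMS, MGRS, or township-and-range rules yet):

```python
import re

# Rough patterns only; a real implementation needs many more formats.
LATLON = re.compile(r"^\s*[+-]?\d{1,2}(\.\d+)?\s*,\s*[+-]?\d{1,3}(\.\d+)?\s*$")
# UTM zone letters C-X, skipping I and O: "10T 550000 5270000"
UTM = re.compile(r"^\s*\d{1,2}[C-HJ-NP-X]\s+\d{6}(\.\d+)?\s+\d{1,7}(\.\d+)?\s*$", re.I)

def detect_format(query):
    """Guess which coordinate notation a query string uses, if any."""
    if LATLON.match(query):
        return "latlon"
    if UTM.match(query):
        return "utm"
    return "freetext"   # hand off to the normal name-based geocoder
```

The win is that a coordinate-looking query never gets mangled by fuzzy name matching.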

(FWIW I run hillmap.com so most of my desires spring from the needs of a
service targeted at hikers and backcountry skiers.)

~~~
freyfogle
thanks, good feedback

------
lukecampbell
Simple.

I've worked in GIS for a number of years, on marine and scientific data
management on top of GIS support - from Google Maps/Earth to ArcGIS, and
pulling data from KML to OGC services.

If a service takes me nearly a month to learn to use, I'm going to push
adamantly to use something else.

------
seamusabshere
* API client with batching and parallelization built in (100 queries in a single request, multiple requests run in parallel, etc.)

* robustness in the face of bad street suffixes (for example, in Burlington, VT, you may find data with "CR" meaning "CIRCLE" instead of the official USPS "CREEK")

* fuzzy street name matching (PAKCER -> PACKER)

* accurate geocoding in rural United States

* fuzzy international place matching (like "ST PANCRAS ST STATION" in London)
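For the PAKCER -> PACKER case, even the standard library gets surprisingly far. A minimal sketch against a known street list (the list here is made up):

```python
import difflib

KNOWN_STREETS = ["PACKER ST", "PARKER AVE", "BAKER ST", "MAIN ST"]

def correct_street(name, candidates=KNOWN_STREETS, cutoff=0.6):
    """Snap a misspelled street name to the closest known one, or None."""
    hits = difflib.get_close_matches(name.upper(), candidates, n=1, cutoff=cutoff)
    return hits[0] if hits else None

print(correct_street("PAKCER ST"))    # PACKER ST
print(correct_street("Qxzzqk Blvd"))  # None
```

Production systems usually layer USPS-style suffix normalization on top, so that "CR" vs "CIRCLE" is resolved before the fuzzy pass runs.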

~~~
freyfogle
In your experience is the rural US geocoding problem a software problem or a
lack of underlying data?

------
gloubibou
Don't forget desktop and mobile applications. Most mapping and geocoding
services do.

\- I have to ship my API keys to end users. Someone could grab one and
repurpose it.

\- Rate limiting by API key penalizes one end customer for another's
misbehaving.

I would love an API that is aware of the end user: applies rate limiting on a
per-user basis and allows for anonymized user-based usage reports, e.g. number
of end users, average number of API calls per user, …

~~~
freyfogle
your comments generated a lot of discussion for us here, thanks. Our
conclusion: if you build an app (mobile, desktop, whatever) that becomes
popular and depends on a third-party service, in our case a geocoder, it
generates real costs for the third-party service. So there are three potential
groups who can pay the cost:

a. the third party service provider can just provide service for free. We
can't, at least not indefinitely.

b. the end consumer can somehow be billed by the third-party service. Feels
complicated, especially as the use of the service may be deep in the internals
and behind the scenes of the app. The consumer may well have no idea it is
being used.

c. the application developer can pay. Either directly or via billing the end
consumer.

Option c. feels like the only sustainable one. Happy to hear your thoughts on
it though.

~~~
gloubibou
Option c is the way to go. Up to the app developer to see how to monetize the
app.

Bonus points if the end user can be identified, e.g. if the app can pass an
opaque token to the web service, and reporting/billing from the web service
provider groups usage by token.

------
freyfogle
Thanks everyone for the feedback, very useful. Please keep it coming. I need
to be offline for a bit, but will check in later. If you're interested in
learning more about our progress please follow us on twitter. ta.

[https://twitter.com/opencagedata](https://twitter.com/opencagedata)

------
tzaman
Apart from the most obvious (being accurate), I would say a well-documented
API and properly localised results.

~~~
freyfogle
thanks for commenting. If you don't mind, what exactly do you mean by
"properly localised"? Can you give me an example, ideally via a service you're
currently using that is doing it badly. Cheers.

~~~
singlow
The Google geocoding service returns a normalized address in addition to the
lat/lon. There are times I need that address localized for the area it is in,
and other times I need it localized for another culture. It would be very
useful to be able to specify which locale the return value should be in.

~~~
freyfogle
makes sense, can you give me a specific example?

~~~
natch
In China all the maps are in Chinese. There are times when I want them in
English. I assume this is what the poster means.

~~~
tzaman
Partially yes, plus the formatting - for example, some locales put the zip in
front of the city while others do it the other way around.

------
hernantz
Get the timezone (and related information: UTC offset, local time, etc.) from
a lat/lng point or an address.
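Mapping the point to an IANA zone name needs a timezone-boundary dataset (which would presumably be the service's job); once the zone name is known, the rest is standard-library work. A sketch, with the point-to-zone lookup assumed to happen upstream:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def tz_details(tz_name, at=None):
    """Given an IANA zone name (derived upstream from lat/lng),
    return the UTC offset and local time at the given instant."""
    at = at or datetime.now(timezone.utc)
    local = at.astimezone(ZoneInfo(tz_name))
    return {
        "timezone": tz_name,
        "utc_offset_hours": local.utcoffset().total_seconds() / 3600,
        "local_time": local.isoformat(),
    }

# e.g. a point the geocoder resolved to America/New_York, in winter (EST):
info = tz_details("America/New_York", datetime(2024, 1, 15, 12, 0, tzinfo=timezone.utc))
print(info["utc_offset_hours"])  # -5.0
```

Note the offset depends on the instant (DST), so a useful API would take an optional timestamp like this one does.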

~~~
thecodemonkey
Hi Hernantz,

Shameless plug, but this is something we recently started offering [1] for our
geocoding service. I'm happy to help if you have any questions.

[1] [http://geocod.io/docs/#toc_21](http://geocod.io/docs/#toc_21)

------
dpcan
Quick and cheap daily batch geocoding with any type of export option, from CSV
to JSON to XML.

~~~
freyfogle
Not sure what your definitions of cheap or quick are, nor what country your
data is in, but there are lots of people who do bulk geocoding. Why don't you
use them?

~~~
dpcan
I do use others. The OP here was asking what we want in a geocoding service,
so I told him.

But to elaborate on what I mean by quick and cheap:

Quick means I don't have to wait for an email notifying me that it's done, and
I don't want my requests queued for a couple hours. I want them upon request.

Cheap means better than the average price of the competitors. It would be nice
to maybe pay $100 and be able to geocode 100K addresses, for example (but I
don't really have a huge sample of competitor prices).

~~~
freyfogle
relax dude, I am the OP. Just trying to get specifics, not general terms like
"cheap". For one person $100 is a rounding error; for the next it's a
meaningful chunk of a project budget. Thanks for clarifying.

------
vgrichina
Publicly visible and flexible pricing. Not $10k+ per year like the Google Maps API.

------
BugBrother
Fast for single queries, not for batch geocoding?

So a web query can come in from a user of a web service, the external(?)
geocoding API can be called -- and the reply can go back to the user [after
applying lon/lat processing] without waiting too long.

(I haven't done anything like this in a while, so please apply NaCl.)

