
How to Start Your Own GeoCoder in Less Than 48 Hours - bitboxer
http://klaustopher.github.com/blog/2012/10/15/how-to-start-your-own-geocoder-in-48-hours/
======
codewright
Kinda weird to see people going through this immense pain to get the whole
import-pipeline and efficient search problem solved with OSM and Postgres +
PostGIS when it's very much a "solved" problem.

Incremental updates and intelligent, flexible, efficient search are all
immediately doable with existing open source software.

Why are people hellbent on using Postgres where it's suboptimal for a dataset
this large that needs intelligent searching?

Seriously people: <https://github.com/ncolomer/elasticsearch-osmosis-plugin>

Learn the JSON document and query formats, and then proceed to jump with glee
whenever you encounter a problem well served by a search engine instead of
doing hackneyed manual indices and full-text queries and poorly managed
shards on Postgres.
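
For example, a geocoding-style lookup is just a name match plus a distance
filter against the query DSL. A rough Python sketch (the index and field names
below are made up for illustration, not necessarily what the osmosis plugin
produces, and the exact filter syntax shifts between ElasticSearch versions):

    # Sketch only: assumes the official elasticsearch-py client and an index
    # named "osm" whose documents have a "name" field and a geo_point "location".
    from elasticsearch import Elasticsearch

    es = Elasticsearch()

    query = {
        "query": {
            "bool": {
                "must": {"match": {"name": "Alexanderplatz"}},
                "filter": {
                    "geo_distance": {
                        "distance": "10km",
                        "location": {"lat": 52.52, "lon": 13.41},
                    }
                },
            }
        }
    }

    for hit in es.search(index="osm", body=query)["hits"]["hits"]:
        print(hit["_source"].get("name"), hit["_score"])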

Postgres is for important operational data. Excellent at that. Not so great
for search, bulk, or static datasets.

ElasticSearch is so well-designed that I just expose a raw query interface to
our frontend JS guy, and he builds his own queries from the user inputs.

ElasticSearch is probably like 20-30% of the technological secret sauce of my
startup.

~~~
klaustopher
I agree, the data is not optimized for searching since it's supposed to serve
all sorts of purposes.

We didn't look into ElasticSearch, since we wanted to give ArangoDB a try. We
will have a look at it, thanks for the hint!

~~~
muxxa
Just wondering if you played with the hStore column in the postgres ways/nodes
tables before diving into ArangoDB? I see hStore as nosql-on-demand within a
relational schema: <http://www.postgresql.org/docs/current/static/hstore.html>
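
Roughly what I mean, as a sketch in Python with psycopg2 (this assumes an
osmosis pgsnapshot-style schema where nodes keep their tags in an hstore
column; adjust table and column names to whatever your import produced):

    import psycopg2

    conn = psycopg2.connect("dbname=osm")  # placeholder connection string
    cur = conn.cursor()

    # One-time setup: a GIN index on the hstore column keeps key/value lookups
    # usable at OSM scale, e.g.:
    #   CREATE INDEX nodes_tags_idx ON nodes USING gin (tags);

    # '?' tests for key presence, '->' pulls out a value.
    cur.execute("""
        SELECT id, tags -> 'name' AS name
        FROM nodes
        WHERE tags ? 'amenity'
          AND tags -> 'amenity' = 'cafe'
          AND tags ? 'name'
        LIMIT 20
    """)
    for node_id, name in cur.fetchall():
        print(node_id, name)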

~~~
klaustopher
Nope, we didn't. As I said in the post, using ArangoDB was one of the things we
decided up front.

It is developed locally, and we wanted to see whether it scales up and works
for us, or whether we should go the "traditional" Postgres way that everybody
else goes.

------
robmil
_Another problem I see is that we have a snapshot of data from friday. We
cannot really link our data back to any of the original OSM data. So if we
want to upgrade our dataset, we have to throw everything away that we have and
start a new import._

This is the biggest hurdle to overcome, in my experience. A custom data format
is typically essential (most location databases arrive as CSVs or XML, which
are useless for real-time querying), but imports can take forever.

It's sometimes, counterintuitively, been more worthwhile to concentrate on the
performance of importing than of querying; the out-of-the-box query
performance you get with (no)SQL often isn't terrible, but your import script
usually starts out pretty awful.
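
As a concrete (if simplified) sketch of what "work on import performance" tends
to mean in practice: stream the XML instead of loading it, free elements as you
go, and hand the database large batches rather than row-at-a-time inserts. The
bulk_insert callback here is a hypothetical stand-in for whatever bulk-load
path your store offers (COPY, multi-row INSERT, a bulk API, ...):

    import xml.etree.ElementTree as ET

    BATCH_SIZE = 10000

    def import_nodes(path, bulk_insert):
        batch = []
        for _event, elem in ET.iterparse(path, events=("end",)):
            if elem.tag == "node":
                tags = {t.get("k"): t.get("v") for t in elem.findall("tag")}
                batch.append((int(elem.get("id")),
                              float(elem.get("lat")),
                              float(elem.get("lon")),
                              tags))
                if len(batch) >= BATCH_SIZE:
                    bulk_insert(batch)
                    batch = []
                elem.clear()  # drop parsed children to keep memory bounded
        if batch:
            bulk_insert(batch)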

~~~
klaustopher
Yeah, it basically is ... We use osmosis to preparse the data and then just
parse the XML ... It is a pain.

I think what we can try is to import the data in the same format OSM has it in
and use smart indexes within the database to answer queries quickly. I think
this will be the major thing to investigate going forward.

~~~
michaelt
Last weekend I did some work with the OSM planet file - the thing with the XML
format is that it took several hours just to decompress it - even though I was
reading it from RAM on an EC2 m2.2xlarge instance. And after that it still
took an age to parse all the XML. All told it took 24 hours just to decompress
the file and do a three-pass parse.

With the benefit of this experience, I decided it was worth switching to OSM's
alternative 'PBF' format [1]. It's a dense binary format that doesn't require
additional compression. It's also reportedly 6 times faster to read than
gzipped XML. Honestly it seems very complicated to parse, but if you're
willing to work with Java or C there's a parser already available. [2]

[1] <http://wiki.openstreetmap.org/wiki/PBF_Format>
[2] <https://github.com/scrosby/OSM-binary>
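
If Java or C isn't an option, there are also bindings around existing C/C++
readers in other languages. For illustration only, a minimal Python sketch
using the 'osmium' (pyosmium) package - a different library than the parser
linked above, and the exact API may vary by version - looks roughly like this:

    import osmium

    class NamedNodeCounter(osmium.SimpleHandler):
        # Counts nodes that carry a 'name' tag while streaming a PBF file.
        def __init__(self):
            super().__init__()
            self.named = 0

        def node(self, n):
            if 'name' in n.tags:
                self.named += 1

    handler = NamedNodeCounter()
    handler.apply_file("planet.osm.pbf")  # path is a placeholder
    print(handler.named)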

~~~
klaustopher
Yeah, next time we will use the PBF format, but there's currently no Ruby
parser for this. If we continue to regularly parse the data, we will have to
write a wrapper around the C parser.

Thanks for the link

~~~
pygy_
One of the best protobuf parsers is Haberman's upb. There are no Ruby bindings,
but it is built with dynamic languages in mind. There are already Lua and
Python bindings; you could use them as examples.

<https://github.com/haberman/upb/wiki>

<https://github.com/haberman/upb/tree/master/bindings>

------
peteretep
For parsing US and UK addresses, you can look at the internal address
identification and cleaning routines of Ziprip -
<http://zipripjs.com/>
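
As a toy illustration of what the identification step boils down to (rough
pattern matching on postcode/ZIP-shaped strings; these regexes are
simplifications for the sketch, not Ziprip's actual routines):

    import re

    US_ZIP = re.compile(r"\b\d{5}(?:-\d{4})?\b")
    UK_POSTCODE = re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}\b", re.I)

    def looks_like_address(text):
        # Crude signal: does the text contain something postcode/ZIP shaped?
        return bool(US_ZIP.search(text) or UK_POSTCODE.search(text))

    print(looks_like_address("10 Downing Street, London SW1A 2AA"))              # True
    print(looks_like_address("1600 Pennsylvania Ave NW, Washington, DC 20500"))  # True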

------
xradionut
People who write their own geocoder from crappy sources are like people who
write their own crypto libraries. It's a task best left to experts. There's a
reason we pay for quality GIS software with good data or use services like
Google, Bing, etc...

~~~
jsnell
What a bizarre thing to say.

First, there is real value in having code like this available as open source,
and working using open data. The analogy to crypto would only hold if there
were already good open alternatives around. It doesn't sound like that's the
case. But second, crypto is basically a solved problem with clearly defined
but subtle best practices. Geocoding is totally different. There's plenty of
room for experimentation and totally new approaches that wouldn't fit into
existing frameworks. I don't see how discouraging that experimentation can
possibly be in anyone's best interest.

(Since credentials are being asked for: I used to work on the geocoder of
Google Maps, as well as on geocoding data quality issues.)

~~~
xradionut
My comments may have stemmed from some discussions I had with programmers at a
conference this weekend. Try explaining address cleanup/standardization and
geocoding to people who think it's an easy problem since Google Maps does it
"magically" for them.

