How to Start Your Own GeoCoder in Less Than 48 Hours (github.com)
34 points by bitboxer 949 days ago | 29 comments

Kinda weird to see people going through this immense pain to get the whole import-pipeline and efficient search problem solved with OSM and Postgres + PostGIS when it's very much a "solved" problem.

Incremental updates and intelligent, flexible, efficient search are all immediately doable with existing open source software.

Why are people hellbent on using Postgres where it's suboptimal for a dataset this large that needs intelligent searching?

Seriously people: https://github.com/ncolomer/elasticsearch-osmosis-plugin

Learn the JSON document and query formats, and then proceed to jump with glee whenever you encounter a problem well served by a search engine instead of doing hackneyed manual indices and full-text queries and poorly managed shards on Postgres.

Postgres is for important operational data. Excellent at that. Not so great for search, bulk, or static datasets.

ElasticSearch is so well-designed that I just expose a raw query interface to it to our frontend JS guy and he just builds his own queries from the user inputs.

ElasticSearch is probably like 20-30% of the technological secret sauce of my startup.


A key feature of spatial databases, for me, is being able to store and query shapes more complicated than just points.

In particular, roads with long straight segments don't have many nodes, so the nearest road isn't always the road the nearest node is on - and a road that travels through a bounding box won't always have a node in the bounding box.

Does ElasticSearch support indexing on geometry more complicated than points?


ElasticSearch isn't a spatial database per se, it's an exceptional search engine. It natively understands geo points, radii, bounding boxes, polygonal geo filters, geo faceting, etc.

If it can be represented as a geo point or a composite of multiple geo points, then ElasticSearch can grok it. Otherwise, no.

If you want to query arbitrary paths, that's on you to bridge the gap between a spatial database and a graph store.

I'm not really sure what you're looking for. This post was about OpenStreetMap.


Then, in answer to the question "Why are people hellbent on using Postgres" I'd say it's to do the kind of searches I described, which are a native feature of spatial databases like PostGIS and Oracle Spatial.


You never actually clarified as to what you wanted that wasn't included in the list I provided.


Here is a diagram: http://imgur.com/yHbS7

I store paths which are comprised of an ordered sequence of points. Depending on what spatial tool you're using, you might call this a path, a linestring, a line, or an ordinate array. In the diagram the points are black dots and the path is shown in purple.

I want to do a bounding box query - finding the paths that are entirely or partly inside a given box. In the diagram, the box I'm querying for is shown in red. As you can see in the diagram, the purple path passes through the red box, but none of the black points defining the path are within the box.

I can accomplish this with a single query using Oracle Spatial [1] or PostGIS [2]. It requires that the spatial database understand shapes more complicated than just points. Can ElasticSearch do that? I can't find any examples of this in the documentation.

[1] http://docs.oracle.com/cd/B12037_01/appdev.101/b10826/sdo_op... [2] http://postgis.refractions.net/docs/ST_Intersects.html
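
The case in the diagram can be sketched in a few lines of Python. This is not how PostGIS or Oracle Spatial implement their intersection operators; it is only a minimal illustration, assuming planar coordinates and an axis-aligned box, of why testing the vertices alone misses the path. All function names are made up for this example.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


def any_vertex_in_box(path: List[Point], box: Box) -> bool:
    """Point-only test: True if at least one vertex lies inside the box."""
    min_x, min_y, max_x, max_y = box
    return any(min_x <= x <= max_x and min_y <= y <= max_y for x, y in path)


def segment_intersects_box(p: Point, q: Point, box: Box) -> bool:
    """Liang-Barsky style clipping test for the segment p-q."""
    min_x, min_y, max_x, max_y = box
    dx, dy = q[0] - p[0], q[1] - p[1]
    t0, t1 = 0.0, 1.0  # parameter range of the segment still inside the box
    # One (direction, distance) pair per box edge: left, right, bottom, top.
    for d, dist in ((-dx, p[0] - min_x), (dx, max_x - p[0]),
                    (-dy, p[1] - min_y), (dy, max_y - p[1])):
        if d == 0:
            if dist < 0:
                return False  # parallel to this edge and outside it
            continue
        t = dist / d
        if d < 0:
            t0 = max(t0, t)  # segment enters the half-plane at t
        else:
            t1 = min(t1, t)  # segment leaves the half-plane at t
        if t0 > t1:
            return False
    return True


def path_intersects_box(path: List[Point], box: Box) -> bool:
    """True if any segment of the path is entirely or partly inside the box."""
    return any(segment_intersects_box(p, q, box)
               for p, q in zip(path, path[1:]))


path = [(0.0, 0.0), (10.0, 10.0)]  # two vertices, one long straight segment
box = (4.0, 4.0, 6.0, 6.0)         # query box between the two vertices
print(any_vertex_in_box(path, box), path_intersects_box(path, box))
```

A spatial index does the same kind of geometry test, but only against candidates whose bounding boxes overlap the query box, which is what keeps it fast on large datasets.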


You could do this pretty easily with ElasticSearch given that vectors and paths are composed of start/end points, with varying degrees of customizability, but whether or not that would be a good idea depends a great deal on your query patterns and how much customizability and scalability you need.

The strengths of ElasticSearch are in trivial sharding and replication, and in intelligent, fast, soft real-time search.

It's also got a very powerful, easy to understand, highly programmable query syntax that is very easy to generate in code.

It's not a spatial database and what I was originally talking about wasn't designed to solve pathing/graph traversal, but you could still do n-dimensional indexed spatial search in ElasticSearch and that is something I do on a regular basis although it's not the "base" use-case for their geo API.


I agree, the data is not optimized for searching since it's supposed to serve all sorts of purposes.

We didn't look into ElasticSearch, since we wanted to give ArangoDB a try. We will have a look into it, thanks for the hint!


ArangoDB isn't designed to solve the same problems as ElasticSearch, it's a database/data store.

ElasticSearch is a search engine, first and foremost, and while you could use it as a database-of-first-resort, I'd be hesitant to recommend as much. For one thing, it doesn't take durability very seriously.

As a result, I have to assume you chose wisely if you're using ArangoDB for a standard database use-case.


Just wondering if you played with the hStore column in the postgres ways/nodes tables before diving into ArangoDB? I see hStore as nosql-on-demand within a relational schema: http://www.postgresql.org/docs/current/static/hstore.html


Nope, we didn't. As I said in the post, using ArangoDB was one of the things we decided up front.

It is developed locally, and we wanted to see whether it scales up and assists us, or whether we should go the "traditional" Postgres way that everybody else goes.


Another problem I see is that we have a snapshot of data from Friday. We cannot really link our data back to any of the original OSM data. So if we want to upgrade our dataset, we have to throw everything away that we have and start a new import.

This is the biggest hurdle to overcome, in my experience. A custom data format is typically essential (most location databases arrive as CSVs or XML, which are useless for real-time querying), but imports can take forever.

It's sometimes, counterintuitively, been more worthwhile to concentrate on the performance of importing than of querying; the out-of-the-box query performance you get with (no)SQL often isn't terrible, but your import script usually starts out pretty awful.


The export formats of geospatial data tend to be extremely inefficient to parse. Unfortunately, the very high computational cost of parsing is multiplied by the very large size of the data. Formats like XML, JSON, and CSV are only convenient when the absolute size of the data being parsed is relatively tiny.

I consider efficient export formats for geospatial data (or similar rich/complex data sources) to be a bit of an unsolved problem. It is not difficult to design storage formats that are literally a couple of orders of magnitude faster to process, but the formats most people are using were designed for files small enough that parsing efficiency doesn't matter. Consequently, at my company we spend time designing highly optimized parsers for inherently inefficient formats and designing non-standard internal export formats that nothing else understands but which are nonetheless vastly faster to use at scale. It is a big problem that seems like it should have been solved by now.
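
As a rough illustration of the kind of storage format meant here, the sketch below packs nodes into fixed-width binary records with Python's `struct` module. The record layout (64-bit id, latitude and longitude as 32-bit integers in units of 1e-7 degrees) is an assumption made for this example, not any company's or project's actual format.

```python
import struct

# Hypothetical record layout: little-endian, no padding.
#   q = signed 64-bit node id
#   i = latitude  in units of 1e-7 degrees (signed 32-bit)
#   i = longitude in units of 1e-7 degrees (signed 32-bit)
NODE = struct.Struct("<qii")
SCALE = 10_000_000


def pack_nodes(nodes):
    """Serialize (id, lat, lon) tuples into fixed-width 16-byte records."""
    return b"".join(
        NODE.pack(node_id, round(lat * SCALE), round(lon * SCALE))
        for node_id, lat, lon in nodes
    )


def unpack_nodes(buf):
    """Read the records back; no tokenizing or number parsing from text."""
    return [(node_id, lat / SCALE, lon / SCALE)
            for node_id, lat, lon in NODE.iter_unpack(buf)]


nodes = [(1, 52.5200066, 13.404954), (2, -33.8688197, 151.2092955)]
blob = pack_nodes(nodes)
print(len(blob), unpack_nodes(blob))
```

Because every record has the same size, a reader can also jump straight to record `n` at byte offset `n * NODE.size`, which text formats such as XML, JSON, or CSV do not allow.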


Yeah, it basically is ... We use osmosis to preparse the data and then just parse the XML ... It is a pain.

I think what we can try is to import the data in the same format they have it in OSM and use smart indexes within the database to issue queries quickly. I think this will be the major investigation when going forward.
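
For anyone who stays with the XML for now, a streaming parse at least avoids building the whole document in memory. The sketch below uses Python's standard `xml.etree.ElementTree.iterparse` on a tiny hand-written sample in the OSM XML layout; the function name `iter_nodes` and the sample data are made up for this example, and this is not the import code from the post.

```python
import io
import xml.etree.ElementTree as ET

SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<osm version="0.6">
  <node id="1" lat="52.5" lon="13.4">
    <tag k="name" v="Example"/>
  </node>
  <node id="2" lat="52.6" lon="13.5"/>
  <way id="10"><nd ref="1"/><nd ref="2"/></way>
</osm>"""


def iter_nodes(source):
    """Yield (id, lat, lon, tags) for each <node> as soon as it is complete."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "node":
            tags = {t.get("k"): t.get("v") for t in elem.findall("tag")}
            yield (int(elem.get("id")), float(elem.get("lat")),
                   float(elem.get("lon")), tags)
        if elem.tag in ("node", "way", "relation"):
            # Drop the finished element's children and attributes. The empty
            # element itself stays attached to the root, which is acceptable
            # for a sketch but still grows slowly on a planet-sized file.
            elem.clear()


for node in iter_nodes(io.BytesIO(SAMPLE)):
    print(node)
```

The same generator accepts a file name or an open file object, so it can sit behind a decompressing stream instead of requiring the file to be unpacked first.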


Last weekend I did some work with the OSM planet file - the thing with the XML format is it took several hours just to decompress it - even though I was reading it from RAM on an EC2 m2.2xlarge instance. And after that it still took an age to parse all the XML. All told it took 24 hours just to decompress the file and do a three-pass parse.

With the benefit of this experience, I decided it was worth switching to OSM's alternative 'PBF' format [1]. It's a dense binary format that doesn't require additional compression. It's also reportedly 6 times faster to read than gzipped XML. Honestly it seems very complicated to parse, but if you're willing to work with Java or C there's a parser already available. [2]

[1] http://wiki.openstreetmap.org/wiki/PBF_Format [2] https://github.com/scrosby/OSM-binary
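
Part of the reason the format is dense is that node ids and coordinates in its dense node blocks are stored as delta-coded, variable-length integers. The sketch below shows only that idea in plain Python, assuming values that fit in signed 64 bits. It does not read or write real PBF files, and the function names are made up for this example.

```python
def zigzag(n):
    """Map a signed 64-bit integer to unsigned so small magnitudes stay small."""
    return (n << 1) ^ (n >> 63)


def unzigzag(z):
    """Inverse of zigzag()."""
    return (z >> 1) ^ -(z & 1)


def encode_varint(value):
    """Encode a non-negative integer in 7-bit groups, low group first."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def delta_encode(values):
    """Store each value as the zigzag varint of its difference to the previous one."""
    out, prev = bytearray(), 0
    for v in values:
        out += encode_varint(zigzag(v - prev))
        prev = v
    return bytes(out)


def delta_decode(buf):
    """Inverse of delta_encode()."""
    values, current, shift, prev = [], 0, 0, 0
    for byte in buf:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            prev += unzigzag(current)
            values.append(prev)
            current, shift = 0, 0
    return values


ids = [1000000, 1000001, 1000005, 1000003]
encoded = delta_encode(ids)
print(len(encoded), delta_decode(encoded) == ids)
```

Neighbouring ids and coordinates in OSM data tend to be close together, so most deltas fit in one or two bytes instead of a fixed eight.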


Yeah, next time we will use the PBF, but there's currently no Ruby parser for this. If we continue to regularly parse the data, we will have to write a wrapper around the C parser.

Thanks for the link


One of the best protobuf parsers is Haberman's upb. There are no Ruby bindings, but it is built with dynamic languages in mind. There are already Lua and Python bindings; you could use them as an example.




Sounds interesting. It would definitely be worth a follow-up post if you do manage to get that working!


For parsing of US and UK addresses, you can look at the internal address identification and cleaning routines of Ziprip - http://zipripjs.com/


People that write their own geocoder from crappy sources are like people that write their own crypto libraries. It's a task best left up to experienced experts. There's a reason we pay for quality GIS software with good data or use services like Google, Bing, etc...


What a bizarre thing to say.

First, there is real value in having code like this available as open source, and working using open data. The analogy to crypto would only hold if there were already good open alternatives around. It doesn't sound like that's the case. But second, crypto is basically a solved problem with clearly defined but subtle best practices. Geocoding is totally different. There's plenty of room for experimentation and totally new approaches that wouldn't fit into existing frameworks. I don't see how discouraging that experimentation can possibly be in anyone's best interest.

(Since credentials are being asked for: I used to work on the geocoder of Google Maps, as well as on geocoding data quality issues.)


My comments may have stemmed from some discussions I had with some programmers at a conference this weekend. Try explaining address cleanup/standardization and geocoding to people who think it's an easy problem since Google Maps does it "magically" for them.


OpenStreetMap is no longer a crappy source. Have you looked at the data recently?


How much experience with GIS and mapping do you have to make a statement like that? OpenStreetMap is still not good enough to satisfy many of our clients, nor does it meet the expectations of government regulators. With GIS layers and geocoding there's value in paying for quality tools, data and support. Apple found out the hard way.


Whoever thinks they can create something that has "government approved quality" in 48 hours is crazy. We are not crazy.

As stated before, we wanted to use the data that is already there and make it available in a user-friendly way, something the current geocoders for OSM don't really do.

As for the claim I made: my wish that OSM becomes the first go-to place when people want to search for something on a map is still valid. It's my wish, not a general statement I'm making. And hey, it worked with Wikipedia. For many people it's the first place they go when they are looking for information, so I think this can also work for OSM.


I do admire your effort in making OSS and data more usable to the end user. I'm probably biased since I have some experience in this field, but OSM isn't usually an option I would select for personal projects or casual lookups.


Where I found my "love" for OSM was when I bought a Garmin GPS and used the OSM bike maps. Here in Germany they are very detailed, and I have made bike tours of 500 km or so relying only on OSM maps, whereas Google still doesn't have that much information about bike tracks.

When I need to look up stuff I usually go to Google Maps, too, but that's only because the OSM solutions out there frustrate me with irrelevant results and I have to scroll through endless amounts of data to get to the information I'm looking for. :)


Apple what?

Apple didn't really use OSM. The current OSM is pretty good, by the way. It's just a pity some data is still wrong (for example, many postcodes in Madrid, ES).


Apple underestimated the effort required to have a quality mapping product. When I first started dealing with GIS, I used to cringe when I saw the bills for the software and the various data/layers; now I understand the pain in the creation process.

