I work for UPS as a software developer and, surprisingly, I work for the department responsible for parsing addresses and matching them with actual physical addresses. We cover the US, CA & EU (incl. the UK).
In our department, we have a guy whose entire career at UPS has been nothing but maintaining a library that parses addresses. It is very hard to get things right unless you maintain that library all the time.
Genuinely wish you good luck!
Anyways, good luck to this developer. I don't think anyone will ever produce a solution that works better than all others, but it is better if more of us try.
I think that there is a good solution: supervised learning/segmentation with direct user verification.
You have the user enter a free form address, and then translate it into a structured address. If they correct any fields, you look at those and try to figure out if the final result is correct or not, and integrate that.
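The feedback loop described here could be sketched roughly like this (all names are hypothetical, and the "parser" is a trivial stand-in for a real one): parse the free-form input into fields, merge in the user's corrections, and keep the corrected pairs as labeled training data.

```python
# Sketch of the correction-feedback idea (all names hypothetical):
# a naive parser stands in for the real one; user corrections become
# labeled examples for later supervised retraining.

def naive_parse(freeform: str) -> dict:
    """Stand-in for a real parser: split on commas into fixed slots."""
    parts = [p.strip() for p in freeform.split(",")]
    fields = ["street", "city", "state_zip"]
    return dict(zip(fields, parts))

training_examples = []

def confirm_with_user(freeform: str, corrections: dict) -> dict:
    """Apply the user's field corrections; keep the pair as a label."""
    parsed = naive_parse(freeform)
    final = {**parsed, **corrections}
    if corrections:  # only corrected addresses are interesting labels
        training_examples.append({"input": freeform, "label": final})
    return final

result = confirm_with_user("100 Main St, Springfield, IL 62701",
                           {"city": "Springfield"})
```

The interesting part is that every correction is a free, high-quality label: the user is telling you exactly which field the parser got wrong.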
Maybe this could be done as a service with iframes (like ReCaptcha); and since the information in a full address is basically entirely public knowledge (at least in the U.S.), you can keep all of it around in full detail.
Not your issue, but it makes me so angry that UPS charges me for address corrections ($12 a pop). The UPS supplied software often fails to identify missing suite number, can't figure out "State Route 123" vs "SR 123", etc. UPS is financially rewarded for bugs. Argh :)
Sorry to hear that.
I would advise you to call UPS and dispute those charges. If the address has existed for a while and you can Google it, then it is definitely a bug.
Appreciate it. It's more the policy that's irritating. For example, "missing suite" usually just means the driver has to look at the business name on the shipping label and identify it in the strip mall. $12 charged to the shipper for a few seconds of thought where the UPS software didn't identify "missing suite".
Already the name vs. description reveals confusion: a street address and a postal address do not have a 1:1 correlation, even before taking postal codes/ZIP codes into account...
EDIT: examples include differences between, e.g., the visitor address and where mail delivery should happen; or leaving out or adding details for one or the other (e.g. in many rural places you don't need to include road details for postal addresses).
Different people also address the same location differently. E.g. I regularly have to tell delivery companies my address is in Surrey, even though my house has been in London for more than 50 years.
libpostal is a pretty incredible open source project, but addresses are so complicated and nuanced that depending on what you’re doing, it might not be able to keep up. I work for a real estate tech company where we do a lot of address parsing and we had to move away from it because it’s just not quite powerful enough to handle all of the edge cases you find in US addresses. Right now we use SmartyStreets because their address parser is a bit better for our use case. Libpostal is a great general purpose library but depending on what level of accuracy you need, you might have to look for alternatives.
I spent time trying to use libpostal and build USPS address normalization rules on top of it but there are so many edge cases it was more cost effective to just purchase a solution from a vendor.
That is not to take away from this project — it’s quite good for a broad set of addresses across the world — but for narrow use cases such as ours it just couldn’t quite cut it.
Nice work, although it is a bit slow (avg. 5 seconds per address parsed):
date && perl geo.pl && date
Sat Dec 29 16:18:30 UTC 2018
country => united kingdom
suburb => shoreditch
house => the book club
city => london
postcode => ec2a 4rh
road => leonard st
house_number => 100-106
Sat Dec 29 16:18:35 UTC 2018
It seems to have a whole lot of data that would probably need to be loaded. Maybe a lot of that time is spent in initialization. Is it faster at parsing a second, third, etc. address once loaded?
By the way, a more convenient way to benchmark Perl:
perl -MBenchmark -e 'timethis(500, sub { ... your code here ... });'
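The same separation of one-time load cost from per-parse cost can be measured in any language. Here's a generic sketch using Python's stdlib `timeit`, with a dummy setup function and parser standing in for the real library (real code would initialize libpostal's data once outside the timed loop):

```python
import timeit

# Dummy stand-ins: real code would import and initialize the parser here.
def expensive_setup():
    return {"loaded": True}  # imagine multi-second data loading

MODEL = expensive_setup()  # paid once, not per address

def parse(address: str) -> list:
    return address.lower().split()  # trivial stand-in for actual parsing

# Time only the per-call cost, with setup excluded from the loop.
per_call_total = timeit.timeit(
    lambda: parse("100-106 Leonard St, London"), number=500)
print(f"avg per parse: {per_call_total / 500:.6f}s")
```

If the second and later parses are fast, the 5 seconds above was almost entirely data loading.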
> Street addresses are among the more quirky artifacts of human language, yet they are crucial to the increasing number of applications involving maps and location.
The main goal seems to be positioning a point on a map.
As pointed out by the other comments, it’s fairly different from dealing with delivery addresses or legal addresses.
In particular, locations inside buildings (e.g. “3 appt of 2nd floor”, “Building 103 - code 17234, 34 foobar street”), with random human-oriented info baked in, could easily trip it up, and such strings can't really be expected to parse properly either.
Still looks like a pretty ambitious and interesting effort.
It explicitly is not a geocoder (which is address->location), it just parses and normalizes addresses.
It's meant to deal with strings like "3 appt of 2nd floor", parsing and tagging "3 appt" as unit=apt. 3 and "of 2nd floor" as level=2, even when that string is mixed with further info like street, city, and so on.
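A toy illustration of that kind of tagging: the two regexes below are nothing like libpostal's actual statistical parser, but they show the target labels for the example string.

```python
import re

# Toy sub-building tagger -- two regexes, purely illustrative,
# nothing like libpostal's real parser.
PATTERNS = {
    "unit":  re.compile(
        r"\b(?:appt|apt|apartment)\s*\.?\s*(\d+)|(\d+)\s+appt\b", re.I),
    "level": re.compile(r"\b(\d+)(?:st|nd|rd|th)?\s+floor\b", re.I),
}

def tag_subbuilding(text: str) -> dict:
    """Return whichever unit/level labels the patterns can find."""
    tags = {}
    for label, pat in PATTERNS.items():
        m = pat.search(text)
        if m:
            tags[label] = next(g for g in m.groups() if g)
    return tags

print(tag_subbuilding("3 appt of 2nd floor"))  # {'unit': '3', 'level': '2'}
```

The real value of a statistical parser is exactly that you don't have to hand-maintain rules like these for every language and format.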
Building strings in C can be painful, but parsing them is not too bad. I suppose security is another big concern, especially with something built explicitly to process user-supplied data.
The libpostal developers have released bindings for a number of different languages, which can be found at the Github organization page: https://github.com/openvenues
Well, this already fails for places that don't address by street. You might think it's only some pre-industrial villages in the jungle, but examples would be some eastern European countries and Japan - some (but not all) buildings simply don't have a street address. Instead they have a number within a district. But sometimes it's a building number on a street, but it's distinct from the street's numbering system, so you can have Building 5 on st. Foo as a distinct address from 5th Foo st., where there is a completely different building. And of course there's no number on Foo st. that corresponds to Building 5. Another fun case is when there's a district Foo and a street Foo and e.g. Google Maps resolves "district Foo, building 5" as "No. 5, Foo st.". Or when the district has a number in it, so "district Foo 3, building 275" resolves to "district Foo, building 3", because of course the first Foo doesn't have a number in it - there's no Foo 1, only Foo, Foo 2, etc.
Generally, all residential buildings built by the communist regime follow that system, while older buildings follow street numbers. OpenStreetMap actually deals amazingly well with our addresses, while Google Maps fails miserably most of the time. This is starting to become a problem as online services here integrate Google's mapping technology, e.g. an app for hailing taxis will ask you to type in your starting and destination addresses, and if Google can't make sense of them, the taxi can go to some completely wrong place. I can deal with it fine since I live here, but woe betide any foreigner who relies on Google Maps.
The postal system works just fine, but sometimes I have to enter a district name in the street field in online forms. As long as it arrives in the right country, the postal workers here can make sense of the address just fine.
All of these shenanigans do work in a hierarchical way, so you can pretty much always expect City, City sub-unit, Building designator as your address schema, but the actual category of the sub-unit and building designator is sometimes "street/number" and sometimes "district/building number". You can of course simply ignore that and not have your system work in weird places, but if you're making a library for wide use and publishing it, I would appreciate it if you took into account that not everybody addresses by street and number.
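The hierarchical fallback described above could be modeled as a tagged union rather than fixed street/number fields. A minimal sketch (all type names made up for illustration, not from any real library):

```python
from dataclasses import dataclass
from typing import Union

# Two competing "building designator" schemes under one city sub-unit,
# as described above: street/number vs. district/building-number.

@dataclass
class StreetNumber:
    street: str
    number: str

@dataclass
class DistrictBuilding:
    district: str
    building: str

@dataclass
class Address:
    city: str
    designator: Union[StreetNumber, DistrictBuilding]

# "5th Foo St." and "Building 5, district Foo" are distinct addresses:
old_town = Address("Somewhere", StreetNumber("Foo St.", "5"))
panel_block = Address("Somewhere", DistrictBuilding("Foo", "Building 5"))
```

With a schema like this, the two addresses the comment distinguishes can never collapse into one record, which is exactly the failure mode described for Google Maps.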
It worked quite well for fairly approximate things like "Tetuán, Madrid, España" (district, city, country). However, it seemed to have a tendency to label suburbs and districts as houses, at least with the handful of Madrid addresses I happened to have at hand.
If you have a use case for it and data to match, why not give it a try?