Parsing city of origin and destination city from a string (stackoverflow.com)
104 points by polm23 on July 6, 2020 | 31 comments



After reading Steven Pinker's book The Stuff of Thought, in which he describes verbs as the chassis of a sentence, I thought we would see various Natural Language Understanding algorithms that leveraged "verb classes". Beth Levin published a book called English Verb Classes and Alternations which is effectively a database/index of a subset of verbs. Each verb takes one or two arguments and the arguments can appear on the LHS, RHS, or both (i.e. alternations). You can get a feel for them from a 2011 paper by Levin [1].

Normally you would start with a part-of-speech tagging algorithm, but with the limited context of flights (international only?) you could focus only on a "to" clause and a "from" clause, with the city/country found on the RHS of each clause.
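
A rough sketch of that clause idea (the keyword patterns and the test string are my own invention, not anything from the question):

    import re

    # Grab whatever follows "from" and "to"; a real system would need many more patterns.
    text = "Flights from Kuala Lumpur to Venezia"

    origin = re.search(r"\bfrom\s+(.+?)(?=\s+to\b|$)", text, re.IGNORECASE)
    dest = re.search(r"\bto\s+(.+?)(?=\s+from\b|$)", text, re.IGNORECASE)

    print(origin.group(1) if origin else None)  # Kuala Lumpur
    print(dest.group(1) if dest else None)      # Venezia

It falls apart as soon as the keywords are missing or a city name itself contains "to", which is where the ranking below comes in.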

Resolving ambiguous cities could start with a simple ranking algorithm based on a list of International/Domestic destinations and their corresponding traffic level.

NLU is hard but constrained domains like this are tractable, in my overly optimistic opinion.

[1] https://web.stanford.edu/~bclevin/mpeva11vclass.pdf


I think the top answer is right philosophically in terms of putting entity recognition first and predicate recognition later.

Trying to do it with grammar is exactly why you get into the rabbit hole where you get mixed up about Old York and New York. There is that "New York City" that daddy doesn't live in no more, and the "New York State" I live in where you frequently find dirt roads that look passable on GPS and paper maps but aren't.

It is an ontological problem, not a grammatical problem.

You are right that "air traffic level" is key in curating your city and/or airport database. You would start with a list like

https://en.wikipedia.org/wiki/List_of_busiest_airports_by_pa...

and work down from the top. At some point you can put together a "proof" that it works for the top 20% of the airports that get 80% of the traffic and work up from there to cover 95% of the traffic.

For any project like this you need to build a test set if not a training or evaluation set. If you are not willing to look at 1000 examples by hand then don't get started.

People are willing to spend six months fighting with everything from procurement to device drivers to run the same models on the same digits that everybody else runs, but it is hard to get people to make evaluation sets.

I can't guarantee that the people who make those sets will succeed, but it is certain that those who don't make them will fail.


If only someone made a version of this answer that was more, let's say, board-room accessible. So much conflict could be avoided if engineers were not seen as "non-believing naysayers" when top brass wants some of that hot new AI.


"I will personally pay $100 to anyone who can make this without a bug and passes every case out there. But if they take up the offer and fail, they have to pay me $1000"


A: You have a finite number of strings a > b

B: Each has a finite number of possible spellings.

A x B is, like, a large number. It would take a lot of typewriters, and there might be new spellings, new destinations and new places to abandon... ehh, I mean leave.



...except that recognizing birds is pretty trivial these days if you let somebody else do the heavy lifting: https://cloud.google.com/vision

Which is the OP's issue as well: they'd be better off letting somebody else like Google parse their text instead of trying to reinvent the wheel.


and indeed that comic (published September 2014) is more than 5 years old now. Seems accurate.


If this was not frowned upon on StackOverflow, the correct answer would recommend a commercial API.

That said, if I were to implement this from scratch, I would not start with a positive list of cities. Rather, I would implement a couple of patterns like: "from X to Y", "X to Y", "to Y from X".

But invariably there are going to be ambiguities. How do you deal with a city called "from to" :-D? How do you parse "from from to to to from"? I'd probably tokenize it using place lists, like

    "from (from to) to (to from)"
    "from (from to) (to to) from"
    "from from (to to) to from"
    ...
and then, maybe by assigning a probability to each parse, decide that the first one is most likely.
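
A tiny sketch of that enumeration (the place list is obviously invented), restricted to strict "from X to Y" readings:

    # Enumerate "from X to Y" readings of a token list against a place list.
    PLACES = {"from to", "to from", "to to", "from from"}  # invented "cities"

    def candidate_parses(tokens):
        parses = []
        if tokens and tokens[0] == "from":
            for k in range(2, len(tokens) - 1):   # position of the "to" keyword
                if tokens[k] == "to":
                    origin = " ".join(tokens[1:k])
                    dest = " ".join(tokens[k + 1:])
                    if origin in PLACES and dest in PLACES:
                        parses.append((origin, dest))
        return parses

    print(candidate_parses("from from to to to from".split()))
    # [('from to', 'to from')] -- i.e. the first parse above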

A side comment: for someone like me who isn't an NLP expert, it seems strange that there are no libraries that let me parse a string according to some rules in all possible ways (like a tree) and pick the one that is most probable. I'm sure this exists, but it is hard to search for...


> A side comment: for someone like me who isn't an NLP expert, it seems strange that there are no libraries that let me parse a string according to some rules in all possible ways (like a tree) and pick the one that is most probable. I'm sure this exists, but it is hard to search for...

I think you're describing the Viterbi Algorithm.


It's a longstanding problem that a very large number of parse trees (1000s) can be found for sentences of any real complexity: there are so many of them that the one with the highest probability of being right is still likely to be wrong!


Any natural language parser will let you get the most probable parse (for some definition of probable); some can also return a list of the n most probable parses. There are a few different types of parser implemented here: https://nlp.stanford.edu/software/lex-parser.shtml
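
To the grandparent's side comment: NLTK will happily do this with a hand-written probabilistic grammar. A toy sketch, with a grammar and probabilities invented purely for illustration:

    import nltk

    grammar = nltk.PCFG.fromstring("""
        S    -> ORIG DEST  [0.6]
        S    -> DEST ORIG  [0.4]
        ORIG -> FROM CITY  [1.0]
        DEST -> TO CITY    [1.0]
        FROM -> 'from'     [1.0]
        TO   -> 'to'       [1.0]
        CITY -> 'york'     [0.5]
        CITY -> 'venice'   [0.5]
    """)

    tokens = "from york to venice".split()

    # Every parse the grammar allows...
    for tree in nltk.ChartParser(grammar).parse(tokens):
        print(tree)

    # ...or just the most probable one (the Viterbi algorithm mentioned above).
    for tree in nltk.ViterbiParser(grammar).parse(tokens):
        print(tree, tree.prob())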


Yes, I'm slightly surprised that it doesn't use any of the tricks of parser interactive fiction, or even SHRDLU.


Rather than treating this as a general parsing problem, isn't it possible to reformulate it in a way that is much easier to solve, using the fact that these are strings describing flights from which we want to extract origins and destinations?

So what you could do is take a general database of all flights (which is structured data and easy to work with), and then the problem becomes labelling the unstructured input data with the probability that it relates to a particular (source, destination) pair.

For example, there are no flights from York (lovely though York is) to Venice, so that mistake isn't one such a model could even make. Whereas there are a lot of flights to and from New York (including to Venice), and since that number is large it's a much higher probability labeling than some others.
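
Roughly like this (the route table and weights here are invented; a real one could be built from something like the OpenFlights routes data):

    # Toy route prior: (origin, destination) -> rough traffic weight.
    ROUTES = {
        ("New York", "Venice"): 300,
        ("London", "Venice"): 250,
        # no ("York", "Venice") entry: that route doesn't exist
    }

    def route_prior(origin, dest):
        return ROUTES.get((origin, dest), 0)

    candidates = [("York", "Venice"), ("New York", "Venice")]
    print(max(candidates, key=lambda pair: route_prior(*pair)))
    # ('New York', 'Venice')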


This would help but it's not foolproof, since there is no way to be sure the strings refer only to direct flights between the two locations. Flight deals often involve transfers and/or complicated routings.


Doesn't have to be direct. It would at least show up as a possible route. Bonus if you score by airport size (where it relates to the city, not the country, of course).


If you have the money, give it to mechanical turk. If you have interesting content, use it as a captcha.


Didn't some "AI" companies get caught doing basically this? Even Alexa had people listening in to the audio commands, AFAIK if the command could not be understood.

The question smells like one of those "Solve my homework/work assignment for me" questions.


"We're using human-assisted-AI in phase one to enable a faster market entry, and are confident fully-AI-enabled technology is ready for rollout soon" is a surprisingly reliable way to get massively funded.

The horde of mechanical turks is really expensive, but with near-infinite VC capital, many labor-intensive white-collar markets can indeed be thoroughly disrupted.

Selling ten-dollar bills for five dollars each is guaranteed to gain traction, and you just need to convince investors that you're just a few quarters away from being able to print your own.


This is a bog-standard modern NLP problem; I love how all the comments here are about how impossible it is using traditional heuristics, quoting various pop-neuroscience authors and anecdotes. You only need a sufficiently large language model (fine-tuned on short location strings) and to run Named Entity Recognition on the input. Tricky? Yes. But there is only a finite number of prefix names in the English language, and should automated solutions fail, you can always fall back to a natural-language form. A large neural-network-based solution running in the cloud would be able to get to 95+% accuracy. It would not be perfect, but it is good enough for most use cases.
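
For anyone curious what off-the-shelf NER already gives you, a minimal spaCy sketch (assumes the small English model has been downloaded; the input string is made up):

    import spacy

    # Requires: python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Cheap return flights from Kuala Lumpur to Venice in November")

    # GPE = countries/cities/states in spaCy's label scheme
    print([(ent.text, ent.label_) for ent in doc.ents if ent.label_ in ("GPE", "LOC")])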


Although the SO question is focused on one specific goal, this broad kind of task is quite a common requirement for chatbots, so tools like Rasa and spaCy are good places to explore (for those not opting for the commercial API approach others sensibly suggested).

https://rasa.com/docs/rasa/nlu/entity-extraction/

spaCy has a handy demo of NER (although it appears down for me right now): https://explosion.ai/demos/displacy-ent/
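
And when the statistical model misses a place name, spaCy's EntityRuler (spaCy 3.x assumed) lets you bolt a known-place list on top. A minimal sketch with invented patterns that would really be generated from a city/airport database:

    import spacy

    nlp = spacy.blank("en")
    ruler = nlp.add_pipe("entity_ruler")
    ruler.add_patterns([
        {"label": "GPE", "pattern": "Kuala Lumpur"},
        {"label": "GPE", "pattern": "Venice"},
        {"label": "GPE", "pattern": [{"LOWER": "new"}, {"LOWER": "york"}]},
    ])

    doc = nlp("from new york to Venice")
    print([(ent.text, ent.label_) for ent in doc.ents])
    # [('new york', 'GPE'), ('Venice', 'GPE')]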


Just checked spaCy's NER demo and it seems to be working fine now. :)


In a former life I was tasked with exploring the complexity of extracting discrete information from resumes. Any format of resume. "100% complex" was my finding. It is amazing how much processing the human mind can do naturally.


The answer to this is the main attraction: a great overview of the process of a data project.


And when you've dealt with the problems of the text extraction like in the long answer, you're left with the next huge problem: Context and inferring intent.

In the example output, the poster shows inferring the full location, not just the name.

Some examples where that gets tricky:

If I want driving instructions "from London to Hamilton" odds are I want London, Ontario to Hamilton, Ontario.

But what about "from London to York"? Without knowing more, I can't tell if that is in England or Ontario, and "knowing more" here involves a lot of guesses about in what context someone would qualify a name based on where they live relative to different locations with those names.

If I want "London to Paris" I could mean London, Ontario to Paris, Arkansas, but odds are very much I don't, because it seems unlikely someone far from either place would leave it unqualified. If that someone is close to either Ontario or Arkansas, it's tempting to think they'd still refer to the bigger place... But I see enough Candians on e.g. London Craigslist assuming it's for London, Ontario that this seems to depend on additional context.

But here be dragons. Qualified names are often still tricky to interpret unambiguously. Let's say I've against all odds been to Paris, Wisconsin once, and I want to go back there. Chances are I might write "to Paris, Wisconsin" or similar... But which one? Paris, Grant County, or Paris, Kenosha County? Or I might even write "to Paris, USA" if I wasn't painfully aware there's at least a dozen Paris-es in the US.

And what about when I mess them up? When I started writing the paragraph above, I first misremembered and wrote Paris, Wyoming... Joy...

From experience, you can get decent results on average with a dataset of place sizes and populations, assuming that the bigger or more populated a place is, the larger the region in which it is likely to be the best choice. But there will be many that remain ambiguous.
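
The rough shape of that heuristic (the gazetteer rows, populations and scoring function here are invented/approximate; real data would come from something like GeoNames):

    import math

    # name -> list of (label, population, lat, lon); numbers are approximate.
    GAZETTEER = {
        "paris": [
            ("Paris, France", 2_100_000, 48.86, 2.35),
            ("Paris, Ontario", 12_000, 43.20, -80.38),
            ("Paris, Arkansas", 3_500, 35.29, -93.73),
        ],
    }

    def disambiguate(name, user_lat, user_lon):
        """Bigger places win unless a small one is much closer to the user."""
        def score(entry):
            _, pop, lat, lon = entry
            dist = math.hypot(lat - user_lat, lon - user_lon) + 1.0  # crude, in degrees
            return pop / dist ** 2
        return max(GAZETTEER[name.lower()], key=score)[0]

    print(disambiguate("Paris", 43.0, -81.2))  # near London, Ontario -> Paris, Ontario
    print(disambiguate("Paris", 50.8, 4.35))   # near Brussels -> Paris, France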

But when you look at names that are very common and where many alternatives exist that are of a wide range of sizes, spread all over the place, that too becomes difficult. Examples: "from Santa Cruz to San Jose". California, right, because you have both names in the same state? Except that also applies to Arizona. Or New Mexico. To know whether people there would be likely to talk about them without specifying more detail, you need additional knowledge for each of these locations, and e.g. try to guess when some of the locations are small enough that even relative "locals" would ignore them. And those are just the examples in the US that have both names in the same state, but there are dozens of both names across the Americas and further afield.

(As you might tell, I've spent too much time having to deal with location data...)


Just a tiny piece of extra information for your example: there's also a town called Paris in Ontario within ~100km of London (roughly on the path between London and Toronto) that has a picturesque appeal, so it's not a totally unlikely local car tourist destination.

And sitting here in Toronto, with an available-to-google frequent travel history to London, Ontario, searching "london ON to paris ON" is not enough for google to know what I mean.

This stuff is hard.


Thanks - I was totally unaware there's a Paris in Ontario too.

I was literally just thinking earlier that if someone finds something to add/correct, it will be awesome, because it will emphasise just how hard this problem is even for humans.


I just asked Google Maps for directions from "Birmingham" to "Leeds" to see what would come up, and got the message: 'Sorry, we could not calculate driving directions from "Birmingham, Alabama" to "Leeds, United Kingdom"'. Sure, but there's a Leeds in Alabama (a suburb of Birmingham) and a Birmingham in the UK.

(I'm in Atlanta; it's probably a safe guess that if I were making this query legitimately I'm asking for the Alabama one.)

Similarly for Birmingham to Oxford; there are reasonably prominent towns called Oxford in Alabama, Mississippi, and maybe Georgia, but it still insists on attempting to find a transatlantic driving route and failing.


I have requested directions from Vienna to Paris before. ( https://www.bing.com/maps?osid=0dd3fe2f-ccc9-4899-8923-a5859... )

With enough context I could leave off any additional info when talking with a human, but there would likely never be "enough context" when asking a computer.


You can get quite far using a CRF solution, but nothing is close to perfect here.
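
A bare-bones sketch of what that looks like with sklearn-crfsuite (the features and the two training strings are invented; a real system needs far more data and richer features):

    import sklearn_crfsuite

    def feats(tokens, i):
        return {
            "word": tokens[i].lower(),
            "prev": tokens[i - 1].lower() if i > 0 else "<s>",
            "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
        }

    # Tiny invented training set with BIO tags for origin/destination spans.
    train = [
        ("trip from New York to Venice".split(),
         ["O", "O", "B-ORIG", "I-ORIG", "O", "B-DEST"]),
        ("cheap flights to Paris from London".split(),
         ["O", "O", "O", "B-DEST", "O", "B-ORIG"]),
    ]
    X = [[feats(toks, i) for i in range(len(toks))] for toks, _ in train]
    y = [tags for _, tags in train]

    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
    crf.fit(X, y)

    test = "from London to Venice".split()
    print(list(zip(test, crf.predict([[feats(test, i) for i in range(len(test))]])[0])))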


While I enjoyed the long answer by alvas (well, they're suggesting you can only hit 100% accuracy with a rulebase; a man or woman after my own heart), I think the real problem the user asking the question, Merv Merzoug, has is not as complicated as it's made out to be. Specifically, the user wants to parse successfully only a particular kind of English sentence - not arbitrary natural English sentences. What's more, those are sentences that are found in a particular, restricted context, that of strings that record travel details (most likely for some kind of travel agency and the like).

Given that those strings are going to be on the short side (you don't need to anticipate someone writing an epic saga in dactylic verse to describe a flight), the whole set of such sentences is probably finite and relatively small, and can be represented by an equally finite and much smaller set of rules. So while alvas' comment is spot on regarding the amount of work one needs to do to cover all the exceptions and edge cases that might crop up even in such a small subset of natural English, it is not a problem so big as to be impossible to solve with some elbow grease.

There is plenty of precedent for solving much more complex problems with hand-crafted grammar rules. For a very recent example, the latest online version of Magic: The Gathering, M:tG Arena, apparently uses a hand-crafted context-free grammar to represent the language used on Magic cards [1]. The grammar is used in a parser that executes cards as scripts in a scripting language. Creating a grammar for the entire M:tG language is a job that a dedicated programming team can accomplish in a reasonable amount of time, but even representing a small but substantial subset of that language is a goal that is in the range of, er, let's say, a degree student of Computer Science in her final year :)

_______________________

[1] Just let me find that link. I got it around here somewhere...

Yep, here we go:

Magic Arena's GRE: The Groundwork for future digital products, in Chris Clay's Interview

To dispel any misconceptions: the GRE is not a machine learning approach to Natural Language Processing. It's more of a compiler/parser that uses rules and logic we've crafted. We have a dictionary that defines the parts of speech and some semantic data for "Magic Words". We also have a huge list of syntax rules for "Magic Sentences", similar to a Context Free Grammar. With this, and our good friend the Natural Language Toolkit, we generate a parse tree for ability text. The tree groups individual words into phrases, multiple phrases into sentences, and all the sentences under the ability as a whole. We then process that representation of an ability into higher-level ideas, specific to Magic. “Draw a card” is no longer just a verb and an object, but is an action and entity with regards to game play. We take verb 'atoms' (destroy, discard, exile) and use them to construct molecules ("triggered ability", "sequence of effects", "replacement effect"). With that semantic representation of the ability, we can then basically compile it into code. That code is then added to the GRE, and executes when you play Arena!

https://www.reddit.com/r/magicTCG/comments/74hw1z/magic_aren...



