
Parsing city of origin and destination city from a string - polm23
https://stackoverflow.com/questions/59956670/parsing-city-of-origin-destination-city-from-a-string
======
sradman
After reading Steven Pinker's book _The Stuff of Thought_ in which he
describes verbs as the chassis of a sentence, I thought we would see various
Natural Language Understanding algorithms that leveraged "verb classes". Beth
Levin published a book called _Verb Classes and Alternations_ which is
effectively a database/index of a subset of verbs. Each verb takes one or two
arguments and the arguments can appear on the LHS, RHS, or both (i.e.
alternations). You can get a feel for them from a 2011 paper by Levin [1].

Normally you would start with a Parts of Speech algorithm but with the limited
context of flights (international only?) you could focus only on a "to" clause
and a "from" clause with the city/country found in the RHS of each clause.

Resolving ambiguous cities could start with a simple ranking algorithm based
on a list of International/Domestic destinations and their corresponding
traffic level.

NLU is hard but constrained domains like this are tractable, in my overly
optimistic opinion.

[1]
[https://web.stanford.edu/~bclevin/mpeva11vclass.pdf](https://web.stanford.edu/~bclevin/mpeva11vclass.pdf)

~~~
PaulHoule
I think the top answer is right philosophically in terms of putting entity
recognition first and predicate recognition later.

Trying to do it with grammar is exactly why you get into the rabbit hole where
you get mixed up about Old York and New York. There is that "New York City"
that daddy doesn't live in no more, and the "New York State" I live in where
you frequently find dirt roads that look passable on GPS and paper maps but
aren't.

It is an ontological problem, not a grammatical problem.

You are right that "air traffic level" is key in curating your city and/or
airport database. You would start with a list like

[https://en.wikipedia.org/wiki/List_of_busiest_airports_by_pa...](https://en.wikipedia.org/wiki/List_of_busiest_airports_by_passenger_traffic)

and work down from the top. At some point you can put together a "proof" that
it works for the top 20% of the airports that get 80% of the traffic and work
up from there to cover 95% of the traffic.
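
Checking that kind of coverage claim is easy to mechanize. A sketch with invented passenger counts (the real figures are in the Wikipedia list above):

```python
def airports_for_coverage(traffic, target=0.8):
    """How many of the busiest airports are needed to cover `target`
    of total passenger traffic? `traffic` maps airport -> passengers."""
    total = sum(traffic.values())
    covered = 0
    for count, pax in enumerate(sorted(traffic.values(), reverse=True), 1):
        covered += pax
        if covered >= target * total:
            return count
    return len(traffic)

# Invented counts (millions), deliberately skewed the way the real
# list is: a handful of big hubs dominate the long tail.
toy_traffic = {"ATL": 110, "PEK": 100, "HND": 85, "LHR": 80, "BHX": 13,
               "SMF": 12, "LBA": 4, "INV": 3, "EXT": 2, "ACK": 1}
```

With these toy numbers, 4 of the 10 airports already cover 80% of the traffic, which is the shape of curve you'd exploit when working down the list.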

For any project like this you need to build a test set if not a training or
evaluation set. If you are not willing to look at 1000 examples by hand then
don't get started.

People are willing to spend six months fighting with everything from
procurement to device drivers to run the same models on the same digits that
everybody else runs, but it is hard to get people to make evaluation sets.

I can't guarantee that the people who make those sets will succeed, but it is
certain that those who don't make them will fail.

------
nikanj
If only someone made a version of this answer that was more, let's say,
boardroom-accessible. So much conflict could be avoided if engineers were not
seen as "non-believing naysayers" when top brass wants some of that hot new AI

~~~
Pinus
[https://xkcd.com/1425/](https://xkcd.com/1425/)

~~~
9nGQluzmnq3M
...except that recognizing birds is pretty trivial these days if you let
somebody else do the heavy lifting:
[https://cloud.google.com/vision](https://cloud.google.com/vision)

Which is the OP's issue as well: they'd be better off letting somebody else
like Google parse their text instead of trying to reinvent the wheel.

~~~
tominsam
and indeed that comic (published September 2014) is more than 5 years old now.
Seems accurate.

------
captainmuon
If this was not frowned upon on StackOverflow, the correct answer would
recommend a commercial API.

That said, if I were to implement this from scratch, I would not start with a
positive list of cities. Rather, I would implement a couple of patterns like:
"from X to Y", "X to Y", "to Y from X".

But invariably there are going to be ambiguities. How do you deal with a city
called "from to" :-D? How do you parse "from from to to to from"? I'd probably
tokenize it using place lists, like

    
    
    "from (from to) to (to from)"
    "from (from to) (to to) from"
    "from from (to to) to from"
    ...

and then, maybe assigning probabilities to each parsing, decide that the first
one is most likely.
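
That tokenize-and-enumerate idea can be sketched in pure Python. The single "from X to Y" template and the gazetteer of pathological city names below are illustrative assumptions; the other templates from the comment above would need their own loops:

```python
def parses(tokens, places):
    """Enumerate every way to read `tokens` as "from <origin> to <dest>",
    where both spans must be known place names from the gazetteer."""
    results = []
    if len(tokens) < 4 or tokens[0] != "from":
        return results
    for k in range(2, len(tokens) - 1):  # candidate position of connective "to"
        if tokens[k] != "to":
            continue
        origin = " ".join(tokens[1:k])
        dest = " ".join(tokens[k + 1:])
        if origin in places and dest in places:
            results.append((origin, dest))
    return results

# A town really could be called "from to"; rank the surviving parses
# downstream (e.g. by place popularity) to pick the most likely one.
places = {"from to", "to from", "to to"}
```

On `"from from to to to from".split()` this yields the single parse `("from to", "to from")`, matching the first reading listed above.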

A side comment: for someone like me who isn't an NLP expert, it seems strange
that there are no libraries that let me parse a string according to some rules
in all possible ways (like a tree) and pick the one that is most probable. I'm
sure this exists, but it is hard to search for...

~~~
polm23
> A side comment: for someone like me who isn't an NLP expert, it seems
> strange that there are no libraries that let me parse a string according to
> some rules in all possible ways (like a tree) and pick the one that is most
> probable. I'm sure this exists, but it is hard to search for...

I think you're describing the Viterbi Algorithm.
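
For the curious, a minimal textbook Viterbi over a toy two-state tagger. All the probability tables are invented for illustration; a real system would estimate them from labeled data:

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most probable hidden-state sequence for `obs` (textbook Viterbi)."""
    # V[t][s] = (best probability of any path ending in state s at time t,
    #            previous state on that path)
    V = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
    for t in range(1, len(obs)):
        V.append({})
        for s in states:
            prob, prev = max(
                (V[t - 1][p][0] * trans_p[p][s] * emit_p[s][obs[t]], p)
                for p in states
            )
            V[t][s] = (prob, prev)
    # Trace the best path backwards from the most probable final state.
    best = max(states, key=lambda s: V[-1][s][0])
    path = [best]
    for t in range(len(obs) - 1, 0, -1):
        path.append(V[t][path[-1]][1])
    return path[::-1]

# Toy tagger: is each token part of a place name ("PLACE") or not ("O")?
states = ("O", "PLACE")
start_p = {"O": 0.8, "PLACE": 0.2}
trans_p = {"O": {"O": 0.6, "PLACE": 0.4}, "PLACE": {"O": 0.7, "PLACE": 0.3}}
emit_p = {"O": {"fly": 0.4, "to": 0.5, "paris": 0.1},
          "PLACE": {"fly": 0.05, "to": 0.05, "paris": 0.9}}
```

Running `viterbi(["fly", "to", "paris"], states, start_p, trans_p, emit_p)` returns `["O", "O", "PLACE"]`: the tag sequence with the highest joint probability among all 2³ candidates.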

------
seanhunter
Rather than treating this as a general parsing problem, isn't it possible to
reformulate it in a way that is much easier to solve, using the fact that
these are strings describing flights, from which we want to extract sources
and destinations?

So what you could do is take a general database of all flights (which is
structured data and easy to work with) and then the problem is to label the
unstructured input data with the probability it relates to a particular
source, destination pair.

For example, there are no flights from York (lovely though York is) to Venice,
so that mistake isn't one such a model could even make. Whereas there are a
lot of flights to and from New York (including to Venice), and since that
number is large it's a much higher probability labeling than some others.
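
That reformulation can be sketched as a lookup: score each candidate (origin, destination) reading against a route table. The table and frequencies below are invented stand-ins for a real flight-schedule database:

```python
def label_route(candidates, flights):
    """Pick the (origin, destination) reading best supported by routes
    that actually exist; return None if none of them do.

    `flights` maps (origin, dest) -> a frequency score, e.g. weekly
    departures, so popular routes win over obscure ones.
    """
    best = max(candidates, key=lambda pair: flights.get(pair, 0))
    return best if flights.get(best, 0) > 0 else None

# Invented frequencies; the point is only that York has no service to
# Venice, so that reading scores zero and can never be chosen.
flights = {
    ("New York", "Venice"): 14,
    ("London", "Venice"): 21,
}
```

So given the ambiguous readings `[("York", "Venice"), ("New York", "Venice")]`, the model picks `("New York", "Venice")`, exactly the mistake-avoidance described above.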

~~~
talliedthoughts
This would help but it's not foolproof, since there is no way to be sure the
strings refer only to direct flights between the two locations. Flight deals
often involve transfers and/or complicated routings.

~~~
robk
Doesn't have to be direct. It would at least show up as a possible route.
Bonus if you score by airport size (where it relates to city, not country, of
course).

------
lixtra
If you have the money, give it to mechanical turk. If you have interesting
content, use it as a captcha.

~~~
netsharc
Didn't some "AI" companies get caught doing basically this? Even Alexa had
people listening in to the audio commands, AFAIK when a command could not be
understood.

The question smells like one of those "Solve my homework/work assignment for
me" questions.

~~~
nikanj
"We're using human-assisted-AI in phase one to enable a faster market entry,
and are confident fully-AI-enabled technology is ready for rollout soon" is a
surprisingly reliable way to get massively funded.

The horde of mechanical turks is really expensive, but with near-infinite VC
capital, many labor-intensive white-collar markets can indeed be thoroughly
disrupted.

Selling ten-dollar bills for five dollars each is guaranteed to gain traction,
and you just need to convince investors that you're just a few quarters away
from being able to print your own.

------
sangfroid_bio
This is a bog-standard modern NLP problem, and I love how all the comments
here are about how impossible it is using traditional heuristics, quoting
various pop neuroscience authors and anecdotes. You only need a sufficiently
large language model (fine-tuned on short location strings) and to run Named
Entity Recognition on the input. Tricky? Yes. But there is only a finite
number of prefix names in the English language, and should automated solutions
fail, you can always fall back to a natural language form. A large neural
network based solution running in the cloud would be able to get to 95+%
accuracy. It would not be perfect, but it is good enough for most use cases.

------
nmstoker
Although the SO question is focused on the goal, this broad kind of task is
quite a common requirement for chatbots, so tools like Rasa and spaCy are good
places to explore (for those not opting for the commercial API approach others
sensibly suggested).

[https://rasa.com/docs/rasa/nlu/entity-extraction/](https://rasa.com/docs/rasa/nlu/entity-extraction/)

spaCy have a handy demo of NER (although it appears down for me right now):
[https://explosion.ai/demos/displacy-ent/](https://explosion.ai/demos/displacy-ent/)

~~~
walterkrankheit
Just checked spaCy's NER demo and it seems to be working fine now. :)

------
jnsie
In a former life I was tasked with exploring the complexity of extracting
discrete information from resumes. Any format of resume. "100% complex" was my
finding. It is _amazing_ how much processing the human mind can do naturally.

------
polm23
The answer to this is the main attraction, a great overview of the process of
a data project.

------
vidarh
And when you've dealt with the problems of the text extraction like in the
long answer, you're left with the next huge problem: Context and inferring
intent.

In the example, the poster's sample output shows inferring the full location,
not just the name.

Some examples where that gets tricky:

If I want driving instructions "from London to Hamilton" odds are I want
London, Ontario to Hamilton, Ontario.

But what about "from London to York"? Without knowing more, I can't tell if
that is in England or Ontario, and "knowing more" here involves a lot of
guesses about in what context someone would qualify a name based on where they
live relative to different locations with those names.

If I want "London to Paris" I _could_ mean London, Ontario to Paris, Arkansas,
but odds are very much that I don't, because it seems unlikely someone far
from either place would leave it unqualified. If that someone is close to
either Ontario or Arkansas, it's tempting to think they'd still refer to the
bigger place... But I see enough Canadians on e.g. London Craigslist assuming
it's for London, Ontario that this seems to depend on additional context.

But here be dragons. Qualified names are often still tricky to interpret
unambiguously. Let's say I've against all odds _been_ to Paris, Wisconsin
once, and I want to go back there. Chances are I might write "to Paris,
Wisconsin" or similar... But which one? Paris, Grant County, or Paris, Kenosha
County? Or I might even write "to Paris, USA" if I wasn't painfully aware
there's _at least_ a dozen Paris-es in the US.

And what about when I mess them up? When I started writing the paragraph
above, I first mis-remembered and wrote Paris, Wyoming... Joy...

From experience, you can get decent results _on average_ with a dataset of
sizes and populations, assuming that the bigger or more populated a place is,
the larger the region in which it is likely to be the best choice. But many
will remain ambiguous.
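
A rough sketch of that size-and-population heuristic. The coordinates and populations are approximate illustrative values, and plain Euclidean distance on lat/long degrees is a crude stand-in for proper great-circle distance:

```python
from math import dist

def resolve(name, user_loc, gazetteer):
    """Pick the most likely place with this name for a user at `user_loc`:
    weight each candidate by population, discounted by squared distance,
    so a big faraway city can still beat a tiny nearby one."""
    def score(cand):
        # +1 avoids division by zero when the user is at the place itself.
        return cand["pop"] / (dist(user_loc, cand["loc"]) + 1.0) ** 2
    return max(gazetteer[name], key=score)

# Rough (lat, lon) coordinates and populations, for illustration only.
gazetteer = {
    "London": [
        {"where": "UK", "loc": (51.5, -0.1), "pop": 9_000_000},
        {"where": "Ontario", "loc": (43.0, -81.2), "pop": 400_000},
    ],
}
```

A user near Toronto resolves "London" to Ontario, while a user in southern England gets the UK one, which matches the Craigslist observation above; the exponent on distance is exactly the "how large is each place's region" knob, and no single setting gets every case right.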

But when you look at names that are very common, with many alternatives of a
wide range of sizes spread all over the place, that too becomes difficult.
Example: "from Santa Cruz to San Jose". California, right, because you have
both names in the same state? Except that also applies to Arizona. Or New
Mexico. To know whether people there would be likely to talk about them
without specifying more detail, you need additional knowledge for each of
these locations, and e.g. to guess when some of the locations are so small
that even relative "locals" would ignore them. And those are just the examples
in the US that have both names in the same state; there are dozens of both
names across the Americas and further afield.

(As you might tell, I've spent too much time having to deal with location
data...)

~~~
wcarss
Just a tiny piece of extra information for your example: there's also a town
called Paris in Ontario within ~100km of London (roughly on the path between
London and Toronto) that has a picturesque appeal, so it's not a totally
unlikely local car tourist destination.

And sitting here in Toronto, with an available-to-google frequent travel
history to London, Ontario, searching "london ON to paris ON" is not enough
for google to know what I mean.

This stuff is hard.

~~~
vidarh
Thanks - I was totally unaware there's a Paris in Ontario too.

I was literally just thinking earlier that if someone finds something to add
or correct, it will be awesome, because it will just emphasise how hard this
problem is even for humans.

------
renewiltord
You can get quite far using a CRF-based solution. But nothing is close to
perfect here.

------
YeGoblynQueenne
While I enjoyed the long answer by alvas (well, they're suggesting you can
only hit 100% accuracy with a rulebase; a man or woman after my own heart), I
think the real problem the user asking the question, Merv Merzoug, has is not
as complicated as it's made out to be. Specifically, the user wants to
successfully parse only a particular kind of English sentence, not arbitrary
natural English sentences. What's more, those are sentences found in a
particular, restricted context: that of strings that record travel details
(most likely for some kind of travel agency and the like).

Given that those strings are going to be on the short side (you don't need to
anticipate someone writing an epic saga in dactylic verse to describe a
flight) the whole set of such sentences is probably finite and relatively
small and can be represented by an equally finite and much smaller set of
rules. So while alvas' comment is spot on regarding the amount of work one
needs to do to cover all the exceptions and edge cases that might crop up even
in such a small subset of natural English, it is not a problem so big as to be
impossible to solve with some elbow grease.

There is much precedent for solving much more complex problems with hand-
crafted grammar rules. For a very recent example, the latest online version of
Magic: the Gathering, M:tG Arena, apparently uses a hand-crafted context-free
grammar to represent the language used on Magic cards [1]. The grammar is used
in a parser that executes cards as scripts in a scripting language. Creating a
grammar for the entire M:tG language is a job that a dedicated programming
team can accomplish in a reasonable amount of time, but even representing a
small yet substantial subset of that language is a goal that is in the range
of, er, let's say, a degree student of Computer Science in her final year :)

_______________________

[1] Just let me find that link. I got it around here somewhere...

Yep, here we go:

 _Magic Arena's GRE: The Groundwork for future digital products, in Chris
Clay's Interview_

 _To dispel any misconceptions: the GRP is not a machine learning approach to
Natural Language Processing. It's more of a compiler/parser that uses rules
and logic we've crafted. We have a dictionary that defines the parts of speech
and some semantic data for "Magic Words". We also have a huge list of syntax
rules for "Magic Sentences", similar to a Context Free Grammar. With this, and
our good friend the Natural Language Toolkit, we generate a parse tree for
ability text. The tree groups individual words into phrases, multiple phrases
into sentences, and all the sentences under the ability as a whole. We then
process that representation of an ability into higher-level ideas, specific to
Magic. “Draw a card” is no longer just a verb and an object, but is an action
and entity with regards to game play. We take verb 'atoms' (destroy, discard,
exile) and use them to construct molecules ("triggered ability", "sequence of
effects", "replacement effect"). With that semantic representation of the
ability, we can then basically compile it into code. That code is then added
to the GRE, and executes when you play Arena!_

[https://www.reddit.com/r/magicTCG/comments/74hw1z/magic_aren...](https://www.reddit.com/r/magicTCG/comments/74hw1z/magic_arenas_gre_the_groundwork_for_future/do0rvfe/)

