
Things I Learned Building The Most Powerful Language Processing Engine - drakaal
http://www.xyhd.tv/2013/04/industry-news/things-i-learned-building-the-most-powerful-language-processing-engine-on-the-planet/
======
gliese1337

      Linguists don’t know squat about grammar in modern times. Everything is a verb these days. I Google things. I FedEx things. My game gets nerfed.
    

That's a bit broad. Talk to the right kind of linguists, and yes, they do know
these things.

Every problem he talks about is well-known. Not to say that they aren't still
problems- they are problems, and hard ones, that linguists (computational and
otherwise) don't have good standard solutions for. Thing is, most of the time,
you can get away with not bothering to solve the real problem. Most of the
time, you can pretend that white-space separates words and periods separate
sentences and that 8 parts of speech are good enough, and it will be good
_enough_. And thus the incentive to spend lots of time and research money on
solving all of the big problems in natural language processing is reduced.

Except when it isn't _good enough_, and we wonder "why the heck hasn't this
been solved yet? Doesn't anybody realize that this library is totally broken?"
But yeah, we know that it's broken, we know what the problems are, they're
just really frickin' hard.

~~~
drakaal
You are right, there are a select few. But it is very few. I linked to the
Quora post as just one example. The challenge is that they focus on what is
"right", not what is "common".

It is like "swum". It is a real word, and I have swum many times in my pool,
but it is not how humans write. Practical linguistics is more about how people
use words than how they are supposed to use them.

~~~
curiousdannii
Linguists get descriptivism drilled into them from day one! If anyone goes on
about what is "right", I would highly doubt they were trained as a linguist.

Perhaps you mean "standard" rather than "right" - that they're describing a
standard dialect, which doesn't match reality. That could be true; I'm sure it
is in places.

Btw, I'd be interested to know how many parts of speech the CGEL describes.
Anyone got easy access?

------
gavinh
I've been looking at stremor.com to find some justification for your bold
claims. I'm still looking; these are some ancillary points:

-The copy throughout the site does not inspire confidence. Your point that many texts are written poorly is valid. That does not require you to also write poorly. I've seen far too many sentence fragments on your site. Your site also includes some embarrassing misuses of words, like "Some may believe using heuristic science in language analysis infers it is a learning system." These incidents discredit your claims about your technology.

-I am having difficulty finding specific information about how your technology works.

-The people responsible for your graphics, visual design, and the video about the summarization app should not have those responsibilities. Summly had a nice aesthetic; your aesthetic is jeopardizing your credibility.

I apologize if my comments sound hostile. When you make claims as bold as
yours you should prepare for scrutiny. The problems you are addressing are
interesting; I wish you good luck.

~~~
drakaal
You judge the technology based on the marketing? Summly had no tech and lots
of marketing. SRI had no marketing and lots of tech. I'd rather be SRI than
Summly. Note which one got a billion-dollar valuation.

~~~
gavinh
I don't judge your technology based on your marketing; as I mentioned, I
looked for specific information about your technology but was not able to find
any. The .pdf you linked is what I quoted above; I'm interested more in your
papers or patents.

SRI is a respected brand with a great reputation and history. I wouldn't be so
dismissive of marketing since, as you mention, you have recently exited
stealth mode.

~~~
drakaal
We started 14 months ago. Papers and Patents take a bit longer than that to
get through. We have two pending patents. Much of the "magic" is trade secret
rather than patent, because changes in patent law now require disclosing
enough information for you to build the tech. We don't have the money for
enforcement, so making that much IP public scares us, especially if someone
like Google decided that the patent was "obvious". So we are balancing our
portfolio of IP: enough to be worth acquiring, not so much that you could
duplicate the work from the parts that are in the public record.

~~~
gavinh
I didn't realize that you had started so recently. I appreciate that patents
and papers take time to produce.

I am trying to be helpful when I say that for a company that produces language
processing technology, your customer-facing content shows a strange disregard
for language. If your writing does not demonstrate that you have mastered
English grammar, why should I believe that your software has?

~~~
drakaal
The challenge with the website is saying things in such a way that we don't
scare off the non-technical, and still appeal to the technical.

All cards on the table, there is a lot of discussion internally about who we
are writing for. The result is we get a mish-mash of copy from Engineering and
Marketing.

The PDF I linked to is much better, because it was targeted at Engineering
managers. The website still doesn't know who it is targeting.
<http://www.tldrstuff.com> knows its audience, so it sucks much less.

~~~
jorah
Simple, clear, and grammatically correct writing will not intimidate non-
technical users.

------
bane
"200,000 words. Ha! 400K words. I laugh. 3.2 Million words. I still know I am
missing stuff. Single word nouns in just the singular form exceeds 150k. 40k
verbs and conjugates. 37k adjectives. 10K adverbs. I know I am still missing
things."

I was all set to hate this post, but I found I ended up largely agreeing with
it, especially the quoted bit above. I remember working on a very focused NLP
tool a few years ago and needing a comprehensive English lexicon. "No problem,
I'll just scrape WordNet or similar" - not even close. Then you start dealing
with stemming and conjugations and such and realize that almost all of the
algorithms for dealing with this kind of thing would barely even count as
hacks in software terms, yet there they sit, regurgitated in countless
libraries, generating garbage stems all over the place. It ends up being
easier to collect all the inflected forms as well and just build some smart
in-memory indexes and data structures for searching millions of words.
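The "store every form" approach can be sketched as a plain lookup table - all
entries here are illustrative, not a real lexicon:

```python
# Instead of an algorithmic stemmer, map each surface form directly
# to its lemma in an in-memory index. Entries below are a toy sample.
LEXICON = {
    "swim": "swim", "swims": "swim", "swam": "swim",
    "swum": "swim", "swimming": "swim",
    "run": "run", "runs": "run", "ran": "run", "running": "run",
}

def lemma(word: str) -> str:
    # Fall back to the surface form for unknown words, rather than
    # emitting a garbage stem like "swimm".
    return LEXICON.get(word.lower(), word.lower())

print(lemma("Swum"))  # swim
```

A real index needs millions of entries, but lookup stays O(1) and never
produces a stem that isn't a real word.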

Vector space models? Why do they work? Nobody really seems to know! Just jam
all your words into some matrices, run some simple calculations, and voila,
you get something that kinda sorta works some of the time.
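The "jam words into matrices" recipe really is about that simple - a
bag-of-words sketch with made-up sentences and plain cosine similarity, no
claim that this is what any particular engine does:

```python
import math
from collections import Counter

# Each document becomes a term-count vector; similarity is the cosine
# of the angle between vectors. Counter returns 0 for missing terms.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "stock prices fell sharply today",
]
vectors = [Counter(d.split()) for d in docs]

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# The two "sat on the" sentences come out far more similar to each
# other than either is to the finance sentence.
print(cosine(vectors[0], vectors[1]) > cosine(vectors[0], vectors[2]))  # True
```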

Sentence tokenization is stupid hard, but shouldn't be. Parentheses have all
kinds of different meanings, commas are a mess... English is stupid.

The worst bit, though, is that most research-turned-software assumes
astonishingly brittle models of the language that almost never describe any
_actual_ usage, which means very frustrating _almost_-right results out of
NLP systems. My previous sentence, for example, would cause most NLP software
to blow a gasket.

~~~
drakaal
I ran it through our stuff; we do fine with segmentation of your comments. We
still have trouble with certain poorly formatted numbers, or if you do
something like forget a space after a sentence that ends in a number: "I'll
take 2.In case I need one later."
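For concreteness, here's roughly what that failure mode looks like with a
naive splitter, plus one extra rule that catches the missing-space case. This
is a sketch, not Stremor's actual segmentation rules - and note the extra rule
would wrongly split abbreviations like "U.S.A", which is why real rule systems
get hairy:

```python
import re

text = "I'll take 2.In case I need one later."

# Naive: split only on period + whitespace. Misses "2.In" entirely.
naive = re.split(r'(?<=\.)\s+', text)

# One extra rule: also break where a period is immediately followed
# by an uppercase letter, even with no space in between.
smarter = re.split(r'(?<=\.)\s+|(?<=\.)(?=[A-Z])', text)

print(naive)    # ["I'll take 2.In case I need one later."]
print(smarter)  # ["I'll take 2.", 'In case I need one later.']
```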

Typos are really hard for rules systems.

PS Glad you didn't hate it.

~~~
mchaver
Aren't you arguing against prescriptivism rather than against linguists (as
in people with a background in Linguistics)? My background is Linguistics,
Computer Science and NLP. From what I was taught, and what seems to be the
norm, linguists tend to do descriptive linguistics. You learn from your data
and change the rules and methods accordingly, not the other way around.
Anyway, interesting read.

I am curious about one thing. Are you just handling white-spaced words, or do
you also handle lexical items such as polywords (inside out), phrasal verbs
(put up with, beat up), idioms, etc.?

~~~
drakaal
We handle polywords. We had to. Noun adjuncts are among the worst in terms of
how often they show up, and I'll be honest, I don't even know what you call
gerunds used as adjectives, like "running shorts", but those are a pain too.
But you can kind of find ways to get lists of those: parts catalogs, shopping
sites. People want to buy polyword nouns. There are fewer of the verbs, but
they are harder to find lists of.
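Multiword items like these are often handled with greedy longest-match lookup
against a list. A minimal sketch - the tiny lexicon is illustrative, not
anyone's real inventory:

```python
# Greedy longest-match tokenizer for multiword lexical items
# (polywords, phrasal verbs, noun compounds).
POLYWORDS = {
    ("inside", "out"), ("put", "up", "with"),
    ("running", "shorts"), ("beat", "up"),
}
MAX_LEN = max(len(p) for p in POLYWORDS)

def tokenize(words):
    out, i = [], 0
    while i < len(words):
        # Try the longest possible multiword item first.
        for n in range(min(MAX_LEN, len(words) - i), 1, -1):
            if tuple(words[i:i + n]) in POLYWORDS:
                out.append(" ".join(words[i:i + n]))
                i += n
                break
        else:
            out.append(words[i])
            i += 1
    return out

print(tokenize("i cannot put up with these running shorts".split()))
# ['i', 'cannot', 'put up with', 'these', 'running shorts']
```

Longest-match-first matters: it keeps "put up with" from being eaten by a
shorter entry like "put up", if both exist in the list.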

Idioms we haven't addressed much. Fortunately, in news they come up less than
in other kinds of writing. Sarcasm we deal with in most cases, but not all.
There are hints that things are sarcastic, but if they are written like The
Onion, which doesn't change tone because it is really parody rather than
sarcasm, we can't tell.

I like to say that if your 5-year-old would figure it out, we will figure it
out. If your 5-year-old wouldn't, we probably won't either. But this is a
huge leap forward, since the competition is about the level where your dog
would understand.

------
Groxx
Interesting blog post, though I wish they would give us some meat rather than
what's essentially a rant. I do find it a bit amusing that it doesn't perform
well in their own system:

> _I am the CTO of Stremor, we make TLDR Reader. Sentence objects
> Sentenceobjects represent information extracted from a single sentence
> within a document. Attributes and methods available on Sentenceobjects: ∗
> text: The raw text of the sentence as a string. ∗ names: A list of all names
> detected in the sentence._

Yeah, this is similar to picking at Google Translate not translating Google
into the same text as they use on their other-language homepages. I'm honestly
not complaining, I just found it funny :) To be fair, it does a significantly
better job at other blocks of text I've thrown at it. Not great, but
surprisingly good - I'll have to prod it more.

~~~
drakaal
As I mentioned in comments, I didn't mark up my "code paste" so it decided the
documentation was more important than my comments.

The TLDR version is optimized for news at up to 5000 words. We have some other
stuff for Fiction but it isn't public, and a version that is specific to
politics.

~~~
Groxx
I'll aim to feed it some 5k-word articles, thanks! Definitely interesting,
but it's such a tease of an article :/

------
akavlie
One of the Stremor devs here. I'm helping to build Liquid Helium, and wrote
the cited README. Never expected that it would make its way to a public blog
post :-).

Yes, language processing is hard. There are two challenges here:

1) Understanding the ambiguities of language, when every word in the sentence
can be 2+ parts of speech.

2) Making it fast, and making it fit within the relatively constrained RAM
limits of App Engine instances.

We're wrestling with both while we greatly expand on what Liquid Helium can
do. It's not easy, but some of the things we're able to do with it are pretty
magical.
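To make challenge 1 concrete: if each word in a sentence admits several parts
of speech, candidate tag sequences multiply per word. A toy illustration on
the classic garden-path sentence (the tag inventory is made up):

```python
# Candidate part-of-speech assignments per word. Ambiguity compounds:
# the number of tag sequences is the product of the per-word counts.
TAGS = {
    "time": ["NOUN", "VERB"],   # "time flies" vs "time your laps"
    "flies": ["NOUN", "VERB"],  # the insects vs the motion
    "like": ["PREP", "VERB"],   # "like an arrow" vs "flies like honey"
    "an": ["DET"],
    "arrow": ["NOUN"],
}

sentence = "time flies like an arrow".split()
combos = 1
for w in sentence:
    combos *= len(TAGS[w])

print(combos)  # 2 * 2 * 2 * 1 * 1 = 8 candidate tag sequences
```

Eight readings for five words; real sentences with richer lexicons blow up
much faster, which is why disambiguation dominates the compute budget.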

~~~
drakaal
To akavlie's RAM limit comment: we run on Google App Engine. It gives us
great scalability, but it limits us to about 512 MB of RAM.

We made this choice because with a small team it gave us the most freedom to
focus on the code, rather than the scalability and the infrastructure. We
don't manage servers, or routers, or load balancers.

~~~
aidos
I was going to ask about that; it seemed strange that a self-imposed
constraint was a challenge you had to work around.

There are other options outside of App Engine that could help you achieve the
same thing. I outsource most of my processing to PiCloud (specifically
because I need more RAM than I have).

I don't work in your field but I think it's a really interesting problem
you're working on.

~~~
drakaal
If you are going to compete with Google, you might as well step into the
arena running on the same hardware. :-)

Also, Google is high on our list of hoped-for acquirers, so building things
the way they would makes sense.

------
kurumo
What precisely makes this particular engine 'the most powerful in the world'?
Does it do domain independent named entity recognition with an F score better
than 0.8? For what classes of entities? Is it at least adaptable without
oodles of training data? Does it do syntactic parsing? With F scores of 0.9 or
better? Faster than 200ms per sentence? Across domains? Does it do anything at
all in languages other than English? If there is a page on that site where it
answers these types of questions, I couldn't find it.
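For reference, the F score being asked about is just the harmonic mean of
precision and recall over predicted versus gold entities. A tiny worked
example with invented data:

```python
# Gold-standard and predicted entity spans (made up for illustration).
gold = {("Stremor", "ORG"), ("Google", "ORG"), ("Tom Sawyer", "WORK")}
pred = {("Stremor", "ORG"), ("Google", "ORG"), ("Tom", "PERSON")}

tp = len(gold & pred)          # exact-match true positives
precision = tp / len(pred)     # how many predictions were right
recall = tp / len(gold)        # how many gold entities were found
f1 = 2 * precision * recall / (precision + recall)

print(round(f1, 3))  # 0.667
```

An F1 of 0.8 across domains, as asked above, is a genuinely strong bar for
exact-span NER.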

~~~
drakaal
Install the TLDR plugin. Pick a web site. Or better yet, go to Project
Gutenberg and pick a book. Tom Sawyer. Push TLDR. Way faster than 200ms per
sentence.

Yes it does most Germanic and Romance languages.

Yes, it does domain-independent named entities with a higher score than
anything else on the planet. All English classes: medical, dental, animal
(that doesn't include Latin uses of animal names), technical.

As I said, we are just stepping out of stealth. I linked a PDF in the
comments here.

~~~
kurumo
Thanks, that's somewhat helpful. I am not particularly interested in the
summarizer plugin itself (mostly because we have one, built in house), but I
would love to talk about the underlying pipeline. If you have e.g. a named
entity recognition library that performs as well as you say in Romance
languages on standard data sets, you have material for at least one conference
paper, and furthermore a product much more valuable than the summarizer
itself.

My question about speed referred to syntactic parsing specifically. I am sure
you can do entropy scoring faster than 200ms per sentence, but unless you have
access to parses you are unlikely to be able to do more than purely extractive
summarization. That's what Summly does, and every other summarizer on the
planet as well. (Except perhaps Columbia's Newsblaster, but that's a bit of a
different story).

~~~
drakaal
We do extractive summarization because we don't feel that changing the
author's words is fair use. We could do rewriting. We actually have an
in-house demo that, for lack of a better word, builds Wikipedia pages for
animals. (Animals have fixed traits, so it is easier than if we were to try
to do general people, and the information on them changes much less
frequently.)

I don't have time to do conference papers.

Our pipeline requires almost every one of our capabilities in order to do
TLDR.

We have to grab the page. We have to separate the content from the theme. We
have to convert the HTML to a non-HTML "thing" that lets us work on the text
but maintain the HTML. Then we have to disambiguate/segment the sentences.
Then we have to analyze the type of content to pick how we are going to
summarize it, which requires all the noun, stemming, and keyword analysis.
Then we have to rank the sentences in importance based on concepts,
causation, readability, and emotion. Then we have to put all the HTML back
and present it to the user.

We set the goal that Tom Sawyer can't take more than 45 seconds to run.
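As an aside, the ranking step of a pipeline like this can be approximated
generically with frequency-based extractive scoring. The sketch below is
that generic toy, not Stremor's ranking (which, per the comment above, also
weighs concepts, causation, readability, and emotion):

```python
import re
from collections import Counter

# Toy extractive summarizer: score each sentence by the average
# document frequency of its content words, keep the top fraction,
# and return the survivors in original document order.
STOP = {"the", "a", "an", "of", "to", "and", "in", "is", "it", "we"}

def summarize(text, ratio=0.25):
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    words = [w for w in re.findall(r"[a-z']+", text.lower())
             if w not in STOP]
    freq = Counter(words)

    def score(s):
        toks = [w for w in re.findall(r"[a-z']+", s.lower())
                if w not in STOP]
        return sum(freq[w] for w in toks) / (len(toks) or 1)

    keep = max(1, int(len(sentences) * ratio))
    top = sorted(sentences, key=score, reverse=True)[:keep]
    return [s for s in sentences if s in top]

print(summarize("Cats eat fish. Cats sleep. Dogs bark loudly. "
                "Cats eat mice.", 0.5))
# ['Cats eat fish.', 'Cats sleep.']
```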

~~~
kurumo
Fair use or not, if you could do it I would buy it :) Fine, forget conference
papers. If you can demonstrate fast NER in multiple languages, across domains,
with competitive precision/recall metrics, I will buy it. The rest of it is
not particularly interesting to me because it's frankly not that hard.

------
zwegner
Aside from the NLP-related aspects, which are pretty interesting by
themselves, I was glad to see this:

> The biggest thing I learned. The thing I also hope my team has learned.
> Everyone else has hit the limits of what they can do because they weren’t
> willing to burn it all to the ground and start over. We start over a lot. We
> code for 3 days and then decide this won’t work, and we do it over again. We
> take the lessons we learned but little of the code. The second, or thrid
> time we do it right based on what we now know.

I am a big fan of rewrites, and am sad to see how much of the software
engineering world has centered around the Spolskyesque idea that rewriting is
a horrible waste of time. Truth is, 99.9...% of software just sucks. The more
infrastructure we build up around it, the more constrained we become. Just
look at the most common languages/tools/technologies we use on a daily basis.
C++/Java/JS/Python? These are just awful. Hell, I even like Python compared
to most languages, but there's still so much historical cruft, and it's
ridiculously slow to boot.

I think this is mostly because people just suck at writing software in
general. Rewriting in the industry usually doesn't buy you anything since it's
the same people writing it, and they're still constrained by all the other
software that they interact with. But if everyone was more willing to rewrite,
focusing on code quality, I believe our tools would be less intertwined, we
could achieve much faster rates of progress, and the life of a software
engineer would be a lot more tolerable.

~~~
drakaal
I fixed my misspelling of third...

I think for us the biggest thing is that often we don't know what approach
will work best until we try, and often we have to balance more than just
readability or performance. Memory usage issues meant that we prototyped one
way of getting all the words into memory, proved that we had enough
information about the words for 99% of what we were going to do (and that we
could get the other 1% later), and then rewrote with a different "loader" and
different information about the words loaded.

I like functional programming, but it made sense to make parts of the code
object-oriented, not just for readability or ease of programming, but because
it was more performant.

So our rewrites are partly for performance, partly because the end goal moved
slightly, and partly for maintainability. As we stack on new features often
the way we should do things changes drastically.

~~~
zwegner
> I think for us the biggest thing is that often we don't know what approach
> will work best until we try

Absolutely--and I think pretty much all software is like this.
Engineering/optimization is, in a vague sense, all about how much shit you can
throw at a wall and how well you can tell what sticks. The faster/easier/more
accurately you can do this, the better. If you're going to be bound to the
first pile of shit you throw due to business constraints or whatever, you're
not likely to end up with something very nice.

------
jbg4
When smart people get cocky about the hard work they've done, it gives me
great confidence in their claims. Cockiness for smart people is reserved for
only such occasions when the problem has been solved and tested, born into a
growth state, ready to evolve. I am following this story with great interest
because I'm rooting for the creative geniuses at Stremor and for the
technological advances they are producing.

------
homosaur
"I am sure I will have a lot of mistakes for the grammar Nazi’s to point out
in this post."

I'm hoping this was a subtle and brilliant joke and not just a typo.

~~~
drakaal
In a VC pitch you always include a small, easy-to-fix, minor issue - one you
know the answer to. That way they point it out, you tell them how smart they
were for mentioning it, and then give the solution. They feel like they
proved you aren't perfect, and you avoid them poking hard enough to find your
real issues.

------
zomgbbq
It would be cool to be able to download an SDK and build experiments with
this without having to email sales and engage in a licensing agreement for
the software first. I've always appreciated that SaaS products like
parse.com, twilio.com, and stripe.com have a low barrier to experimentation,
which probably explains why so many solutions today use their technology.

~~~
drakaal
We are working on getting it into the Azure Data Marketplace with a free
monthly tier of a certain number of API calls. Microsoft is being slow (more
than 5 weeks). We are looking at Mashape, but we have not heard good things
about their uptime or their accuracy in billing.

As a developer I feel your pain. Balancing our building, business concerns,
and support for an API is hard. We are a small company and are doing our best.
But if you do email sales we will make sure to get you access to the API as
soon as it is available.

------
drakaal
We were mostly flying under the radar. But with two companies doing
summarization being bought for $30M in the past month, we are becoming less
stealthy.

I know it sounds like a bold statement, but I believe it to be true. Doing
things right required that we build our own tools and not rely on libraries
from third parties. I think we benefited a lot from that philosophy.

------
anigbrowl
I installed the Chrome plugin with interest, but it only seems to work on the
stremor.com page. However, for all I know it's in alpha/beta stage.

I like the general approach, which looks very promising.

~~~
drakaal
Did you restart Chrome after? It should work everywhere.

~~~
anigbrowl
And now it does, I'm happy to say.

------
esperluette
I can't be the only person who wanted to write this:

tl;dr

~~~
drakaal
We have an app for that. <http://www.tldrstuff.com>

The "short" 25% view is pretty good on this. The summary doesn't read as well
because it picks up the ReadMe because I didn't mark it as code snippet
because I'm sucky at WordPress editing.

------
jorah
I am supposed to buy tools for analyzing writing from people who cannot write?

~~~
drakaal
People become editors at book companies not because they can write, but
because they know how others should.

I'm confident few of the Engineers for F1 racing are spectacular drivers.

I know my PE teacher couldn't touch his toes.

~~~
jorah
If you can determine whether someone else's writing is good, it follows that
you can determine whether your own writing is good.

Engineering race cars well is distinct from driving race cars well.

If your PE teacher were capable of touching his toes, he would discredit
himself by choosing not to.

------
jaytaylor
Any chance this project will ever be open-sourced?

~~~
drakaal
Parts of it maybe. But as a whole not likely. Much like SRI we are hoping that
we will license the tech to companies. We could build full products, but we
think that the way this can be best applied is as enhancements to other
products. Search, Technical and Customer Support, Content Authoring, News
Aggregation.

We also have tech that could vastly improve book and movie recommendations
based on the content and themes of books, not just "did someone like you like
this" kinds of systems.

~~~
sinkasapa
So I guess whatever scientific insights you've gained that make you claim that
linguists know nothing about modern English will remain unknown. At least the
linguists publish.

It is too bad because apparently we wouldn't need to study language and
cognition any more if we had access to your work. I mean, why bother with
neurolinguistics or anything having the stink of biology or psychology? Brain
mapping projects and experimentation have nothing to tell us now that we know
that there are 3.2 million words in English. That has been the burning
question after all.

You've made some progress working on the most widely and deeply studied
language in the world with a richness of written data available that is
unprecedented. But the linguist working to record, preserve and study a small
Amazonian language that works nothing like English is probably just an idiot
because they haven't figured out how language works yet.

Do you even know what linguists are?

~~~
drakaal
I can rephrase. Google released their billions of n-grams. They did so
because they knew it would scare anyone off from competing, not because they
thought it would help. They aren't that nice.

Will I ever share? Possibly. But like Google, it will likely be when I have a
competitive advantage. I don't not share because I'm mean. I keep things
close because, as someone on the thread mentioned, we are competing; if I
published all the information you mention, then Google would build what I
have built and we'd be out of business.

Those brain mapping projects aren't done purely for the good of mankind;
they are trying to make money and get funding. My goals are the same.

~~~
sinkasapa
I wasn't implying that there was anything wrong with making money, just that
if you're claiming that you're leaving the scientific study of language in the
dust and insulting the people working in that field, it seems a little strange
at the same time not to publish your insights. If you're competing with
scientists, you kind of have to produce science, right? You made great
software, but why insult people working toward the understanding of
language? Be more specific about which particular linguists you're talking
about if you have a beef with someone. Why insult an entire discipline if
you aren't even taking part in the discussion by publishing?

