
Offer HN: Advice from PhD on NLP, machine learning, and data monetization - bravura
Do you have questions about machine learning, natural language processing, or data monetization in general?

In particular, I'll field questions on:

* recommender systems

* profit optimization in ecommerce sites

* autotagging

* creative ways that NLP + ML can add value to your product and give you a competitive advantage

* whatever else you like.

If you have a specific technical question about ML or NLP, please post it on the MetaOptimize Q+A site (<http://metaoptimize.com/qa/>) and post a link here as a comment. I'll give you a detailed answer over there, since the Q+A site is designed to be archival and is more searchable. Other sorts of questions (like "How do I turn this data into money?") can go in this thread directly. If it seems like a longer discussion, email me at joseph at metaoptimize dot com and we can talk about setting up a Skype chat.

p.s. I'm also hiring people for remote project work, if you are kick-ass at ML or NLP, or you can simply ship correct code really fast. Email me at joseph at metaoptimize dot com.

_Who am I?_

My name is Joseph Turian, and I head MetaOptimize LLC. I consult on NLP, ML, and data monetization. I also run the MetaOptimize Q&A site (<http://metaoptimize.com/qa/>), where ML and NLP experts share their knowledge. I recently demoed autotagging of Hacker News to make it automagically browsable (<http://metaoptimize.com/projects/autotag/hackernews/>).

* I am a data expert, holding a Ph.D. in natural language processing and machine learning. I have a decade of experience in these topics. I specialize in large data sets.

* I'm business-minded, so I focus on business goals and the most direct path of execution to achieve them.

* I am also a technology generalist who has been hacking since age 10 and has programmed competitively at a world-class level.
======
eggbrain
Wow, this is exactly what I have been searching for for months!

I've got a pretty basic grasp of machine learning and collaborative filtering,
so I understand some things, but I'm very confused about others, such as:

1) List-wise vs. pair-wise approaches to machine learning. Can you explain in
simple terms what the main differences between them are, and in what cases it
would be better to use one over the other? I've read a few sources about the
differences, but it often goes over my head.

2) When you don't have many users on your site, from my (basic) knowledge,
you can't really use KNN algorithms to help with recommendations, because you
only have a few people to compare your (let's say movie) preferences to.
What is the best way, then, to get good recommendations when your user base
is small as well as when it's large?

Those are just a few off the top of my head, but I'll be sure to add more
later on.

~~~
bravura
1) Pair-wise approaches, if I understand what you mean, are those in which you
look at pairs of examples. You want to avoid naive implementations, in which
you train on a quadratic number of instances (all pairs). How you get around
this really depends on the specific problem. For example, if you have a model
that operates over pairs, do example selection in an intelligent way: use all
positive pairs, but sample only one negative pair per positive pair.
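
A minimal sketch of that sampling scheme (my own illustration, assuming a
simple list-of-pairs setup, not any particular library's API):

```python
# Negative sampling for pairwise training (illustrative): instead of
# generating all O(n^2) pairs, keep every positive pair and draw one
# random negative pair per positive pair.
import random

def sample_training_pairs(positives, items):
    """positives: list of (a, b) pairs known to match; items: all items."""
    pairs = []
    positive_set = set(positives)
    for a, b in positives:
        pairs.append(((a, b), 1))              # keep every positive pair
        neg = (a, random.choice(items))
        while neg in positive_set or neg[0] == neg[1]:
            neg = (a, random.choice(items))    # re-draw until truly negative
        pairs.append((neg, 0))                 # one negative per positive
    return pairs
```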

2) You are talking about the cold-start problem. Basically, the problem is how
to deliver good recommendations to new users as quickly as possible. I am
actually currently talking to a big company about this problem.

Google has talked about how they do this for Google News, especially because
they have high item churn (news is not news in a day).
[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.80....](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.80.4329&rep=rep1&type=pdf)
Essentially, they do clustering of users and news items.

Another approach is to give users a Hunch-like quiz when they join.
Basically, you use a decision tree to cluster the users, getting as much
information from them with as few questions as possible.

Another approach that requires almost NO user interaction is to use some
ambient information about the user. Maybe you have a cookie for the user and
other sites will share the user's behavior with you. Maybe you ask them to log
in through Facebook or Twitter, and then look at their profile and activity
history to build a user profile. Getting creative about possible sources of
ambient information is another approach.
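
To make the clustering flavor of these approaches concrete, here is a minimal
sketch (my own illustration with toy data, not Google's actual method;
scikit-learn's KMeans stands in for whatever clustering you use): cluster
existing users by their rating vectors, then place each new user into the
nearest cluster from whatever early signal you have (quiz answers, ambient
information) and serve that cluster's top items.

```python
# Cold-start sketch (illustrative): cluster existing users, then recommend
# new users the top items of the nearest cluster. Toy dense data; in practice
# the rating matrix is sparse and you would impute or use implicit signals.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
ratings = rng.random((1000, 50))               # 1000 users x 50 items
km = KMeans(n_clusters=10, n_init=10).fit(ratings)

# Per-cluster "top items": the highest-valued items in each centroid.
top_items = km.cluster_centers_.argsort(axis=1)[:, ::-1][:, :5]

def recommend_for_new_user(partial_profile):
    """partial_profile: length-50 vector, unknowns filled (e.g. global mean)."""
    cluster_id = km.predict(partial_profile.reshape(1, -1))[0]
    return top_items[cluster_id]
```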

~~~
eggbrain
I need some clarification on #1:

Could you provide easy examples of where you would use a pair-wise approach
versus a list-wise approach? Let's say I wanted to build a recommendation
system for college football pick-ems (e.g., will Iowa beat Michigan State?).
Which method would be preferred, and why?

Another question:

Let's say (again) that I want to build a recommendation engine for college
football picks. How do I know how much (more) data will help for training
purposes? I.e., will win/loss records alone suffice (probably not), or should
I add in the opponents they beat (or were beaten by) as well? What about
whether the game was away or at home? What about factoring in the individual
players on each team and their stats? What about the temperature that day?
How do you know when you can stop looking at smaller and smaller variables in
your training data and have something that gives the best results?

~~~
alextp
You can do greedy feature selection: make stable train, development, and test
sets (and maybe another, truer test set for use later) and define an
appropriate quality measure. Then implement the simplest thing you can think
of, train on the training data, calibrate the results on the development set
(tuning hyperparameters, etc.), and look at the performance on the test data.
That's your baseline. Now think up a nice new feature you'd like to add,
implement it, and see if you can get the error on the development set to go
down. If it goes down a lot (you can decide an appropriate threshold, maybe
depending on the computational and human cost of using those features), keep
it. Otherwise, throw it away. Now repeat this process for every set of
features you can think of, documenting the combinations you've tried as you
go, to see where the best effort/performance tradeoff is. Then test your best
model on the test set to see if the performance has really improved; a sketch
of this loop follows below.
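
```python
# A sketch of the greedy forward-selection loop described above. `train`,
# `dev_error`, and `candidates` are placeholders for your own model,
# quality measure, and candidate feature sets.
def greedy_feature_selection(candidates, train, dev_error, threshold=0.001):
    kept = []
    best = dev_error(train(kept))            # the simplest-thing baseline
    for feature_set in candidates:
        model = train(kept + [feature_set])  # retrain with one candidate added
        error = dev_error(model)             # measured on the development set
        if best - error > threshold:         # keep only clear improvements
            kept.append(feature_set)
            best = error
    return kept  # finally, evaluate the chosen model once on the test set
```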

If your model is linear, you will probably see diminishing returns as you
implement more and different features. It also helps to look at model errors
to see which features are pulling things the wrong way, so you can add other
features to compensate.

(Feature engineering + linear models feels a lot like writing an old-school AI
heuristic program, except you have a computer program assign a numeric weight
to the rules you're writing, so you're free to write a lot of rules (millions)
and still get a manageable model.)

------
jwr
This is a very general question.

I looked through the feature lists of many e-commerce analytics packages. It
struck me that while they offer impressive-looking graphs and gather huge
amounts of data, the "analytics" part seems to be very simple. They mostly
present you with the data that has been measured.

I am in a situation where I will want to provide e-commerce analytics as part
of a more complete solution. But I would like to approach this from a
different angle: provide _digested_ data, with insight into patterns that
could lead to increased profits for online stores.

What do you think could be extracted from e-commerce data that would be of
value to online stores? I'm looking for information that is difficult to
extract (e.g. requires a tech advantage), but can be extracted automatically
or with minimal human curation.

An example of an interesting direction: e-commerce analytics companies offer
"customer segmentation" views. But they are all just simple data views: you
can segment based on location, time of visit, etc. I think doing a factor
analysis on customer buying patterns would be much more interesting and would
provide real value to the store by automatically extracting "customer types"
from data. One could then market directly to these customer groups.

So, to rephrase, I'm looking for ideas/pointers on what kinds of analyses
would be difficult to create, but very useful for e-commerce.

~~~
noelwelsh
I agree that most so-called analytics has little value. Your customer
segmentation idea is a good one. You could also approach this as a
classification problem, where you attempt to classify customers as purchasers
or non-purchasers for each product or product group (a sketch below). Another
idea is to look at high- and low-value pathways through a site. If you can
identify pathways that don't lead to purchases, that might suggest where
design effort should be spent.
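
As a toy illustration of that classification framing (my sketch, with made-up
session features), using scikit-learn:

```python
# Predict purchase probability in a product group from simple session
# features (illustrative data only).
import numpy as np
from sklearn.linear_model import LogisticRegression

# Features per visitor: pages viewed, seconds on site, arrived via search (0/1)
X = np.array([[12, 340, 1], [2, 35, 0], [8, 150, 1], [1, 20, 0]])
y = np.array([1, 0, 1, 0])   # 1 = purchased in this product group

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([[5, 90, 1]])[0, 1])  # estimated purchase probability
```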

------
syllogism
I might also answer some questions here. I'm a post-doc in NLP, but more from
the linguistics side of things --- less business-focused, and less of a
hacker.

(Hi Joseph --- we met briefly at ACL. I was one of James Curran's students. My
name's Matthew Honnibal.)

------
random42
Any good materials to read? I work in a text-analytics startup, so I do have a
basic understanding about NLP and related fields.

~~~
icco
Also, any good materials for those of us who have no background in NLP besides
basic computer science theory?

~~~
bravura
_Good materials for those of us who have no background in NLP besides basic
computer science theory?_

My main proposal is to post a "How do I build X" question on MetaOptimize. (I
don't know of any place on the web where experienced NLP people chat, besides
MetaOptimize, but I would be interested to hear other options.) NLP isn't
really very hard once you know what you're doing; it's just that, as an
inexperienced person, you will spend a lot of time going down dead ends and
blind alleys, and there are a lot of pitfalls to avoid. So getting the advice
of experienced people, right from the beginning, is the best way to do this.

See also:

* What are the best resources to use when starting machine learning for an experienced programmer? <http://metaoptimize.com/qa/questions/334/>

* How do I get started understanding statistics & data analysis? <http://metaoptimize.com/qa/questions/154/>

* New to data mining - where to start? <http://metaoptimize.com/qa/questions/362/>

------
tomh-
I posted two questions.

<http://metaoptimize.com/qa/questions/3084/what-components-do-you-need-to-build-something-like-google-news>

<http://metaoptimize.com/qa/questions/3085/which-techniques-in-named-entity-recognition-offer-the-best-results>

thanks!

(ps please don't upvote me, I like my 1337 karma)

~~~
bravura
I just posted answers for both your questions.

~~~
tomh-
Thanks! Have some reading to do :)

------
oakmad
Here is my question: <http://metaoptimize.com/qa/questions/3086/what-is-the-best-way-to-break-into-an-nlp-career>

For my MSc in Computer Science, I wrote my dissertation on sentiment analysis,
using NLTK for my NLP needs. Although my work was nothing earth-shattering, I
really enjoyed it and have continued it as a side project, investigating
different techniques and algorithms. I'm now considering ways to turn this
interest into my full-time career (I'm a .NET developer currently). I don't
particularly wish to send out a lot of unsolicited resumes, so I was wondering
if anyone had advice on who best to approach. Are there any recruiters and the
like that specialize in NLP?

Thanks

Adam

p.s. I am a US permanent resident and an Australian citizen. The website with
my details and some example work is www.emptyorfull.com

~~~
nl
Write an open source library.

I still get contacted about an ML library I haven't touched for 5 years.

------
retroryan
What are some of the unique business models around data monetization? The most
obvious are paywalls and generating content to drive advertising. Any thoughts
on how successful paid access to data is, as with IMDb Pro?

~~~
bravura
I don't think people really want to pay for data.

I also think that people don't like consuming ML or NLP tools through a SaaS
API, because you want a tool that you manage on your own hardware.

I believe that we'll see something I call the "machine learning business
model". Essentially, you release your code as open source and give away a
basic model, but then you charge for a premium model. The model is essentially
like compiled code: no one knows how you trained it, or why it works. So you
get people using your tools by open-sourcing them, and then you upsell them on
increased accuracy.

------
RiderOfGiraffes
Clickables:

+ <http://metaoptimize.com/qa/>

+ <http://metaoptimize.com/projects/autotag/hackernews/>

------
eel
I am a first semester grad student seeking an M.S. in computer science. I am
interested in NLP and hoping to choose it or a subfield of it as my research
topic. Do you have any advice? For instance, is there background knowledge
that I should be learning that would be helpful? What about good books or
important papers? Other tips for CS/NLP grad students?

I know this is a rather general question, but any advice that you might have
would be useful to me and students in similar situations. Thank you!

~~~
Tycho
Hopefully the OP will be able to answer your question, but I thought I might
offer my 2 cents:

I'm currently working on an MSc in IT. My actual project does not involve NLP
directly, but all the 'related research' does (i.e. all previous systems which
achieve the same goal as mine), so I had to read up on it. This was a
struggle... YMMV. I found a few good resources along the way, though. I
recommend reading this first:
<http://www.staff.ncl.ac.uk/hermann.moisl/ell236/manual1.htm>. Then, on
iTunesU/YouTube you'll find a good course on computational linguistics from
MIT; watch the first half dozen lectures at least (don't worry if you don't
fully understand them all). A lot of the theory can be traced back to
Chomsky's work on grammars, so I suggest reading some of his papers (his style
is quite clear). I found I needed to do all this just so I could properly
follow the recent papers on NLP (specifically NL interfaces to DBs) that I
must review. I really had to start from the beginning; light Wikipedia'ing
wasn't cutting it. I'm currently reading The Oxford Handbook of Computational
Linguistics to consolidate my knowledge; it's pretty good.

So anyway... tough subject, try not to get bogged down. If all the reading's
too dry, then 'Godel Escher Bach' is a fun alternative that touches on similar
theory.

If I were to choose a pure NLP project myself, I'd maybe do something like
'How NLP can add value to existing systems/software', i.e. look for
imaginative practical applications, rather than theorizing yet another parser
with some new trick/quirk.

------
acconrad
Can I reach you by email? I have a question on data monetization.

------
bluemetal
Ending my 2nd year of a CS degree and I have enough technical knowledge to
begin coding up websites and services like many of the other people here on
HN. I was wondering whether or not it would be better to run ads on these
sites, or to just try and generate large/high quality datasets and then sell
those. How does someone like me monetize their data?

~~~
bravura
This question is far too open-ended to answer.

I would avoid an ad or data-selling business up front. I would focus on a
simple transactional business (i.e. they give you money and you give them a
product, or a service for a month), and on learning about your market and why
they care about what you're offering.

~~~
iworkforthem
Are you able to provide examples of that? E.g. stock exchanges selling data
feeds, etc.

------
apurva
One question that has definitely plagued me for a while is that of evaluation
schemes for recommender systems. On multiple occasions I have designed
recommender systems that give a fairly good user experience, but I just have
not been able to quantify these results. Any pointers on that would be really
appreciated.

~~~
ogrisel
xlvector has a blog on recsys that discusses a good but overlooked feature of
these systems from a user's point of view: serendipity, the ability to let the
user make fortunate discoveries while looking for something seemingly
unrelated:

+ <http://xlvector.net/blog/?tag=serendipity>

I have no easy answer on how to quantify that myself, but I guess the papers
cited by xlvector would be a good place to start.
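
For the quantification side of the original question, one common offline
starting point is precision@k against held-out interactions; a minimal sketch
(my own, not from the papers above):

```python
# Offline evaluation sketch: hold out some of each user's liked items,
# recommend k items, and measure the overlap.
def precision_at_k(recommended, held_out, k=10):
    """Fraction of the top-k recommendations found in the held-out set."""
    hits = sum(1 for item in recommended[:k] if item in held_out)
    return hits / float(k)

# Averaged over users:
# mean(precision_at_k(recs[u], test_items[u]) for u in users)
```

Note that this rewards predicting the already-known, which is exactly why
serendipity is easy to overlook in offline metrics.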

------
hamilcarbarca
Looking at things from a legal standpoint, how do you go about creating your
database? What are you most conservative about when using APIs for data
mining?

How do you approach social websites without APIs about scraping their data?

What criteria do you use to determine whether an algorithm you are developing
qualifies for legal copyright protection?

------
flannell
Can any of the principles of ML be applied to KPI / balanced scorecard
applications? I have done some work calculating simple linear regression, but
would like to move towards decision tree learning. Is this a good idea, or is
there something more applicable? BTW, I'm a noob at this! Thanks in advance.

------
sthatipamala
I'm working on a shopping startup and need to get a basic recommender system
built in the next few weeks. So far, I've implemented SlopeOne to determine
the order of results every time a user searches. Is there any way to
cache/memoize this repeated computation? Also, is SlopeOne the best way to go
for this?

~~~
helwr
have you tried DirectedEdge?
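
On the caching question: the expensive part of Slope One is the pairwise
deviation table, and that is exactly what you can precompute, cache, and
update incrementally as new ratings arrive. A minimal weighted Slope One
sketch (illustrative, not the poster's code):

```python
from collections import defaultdict

def build_deviations(ratings):
    """ratings: {user: {item: rating}} -> cached (deviation, count) tables."""
    dev = defaultdict(float)
    counts = defaultdict(int)
    for user_ratings in ratings.values():
        for i, ri in user_ratings.items():
            for j, rj in user_ratings.items():
                if i != j:
                    dev[(i, j)] += ri - rj   # sum of (rating_i - rating_j)
                    counts[(i, j)] += 1
    return dev, counts

def predict(user_ratings, item, dev, counts):
    """Predict one user's rating for `item` from the cached tables."""
    num = den = 0.0
    for j, rj in user_ratings.items():
        c = counts.get((item, j), 0)
        if c:
            # dev[(item, j)] + rj * c == (average deviation + rating) * c
            num += dev[(item, j)] + rj * c
            den += c
    return num / den if den else None
```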

------
elqueso
What's the best way to normalize product names? Specifically, I would like to
be able to take a recipe and get all the ingredients in that recipe. The
ingredients are going to be written by humans, so I need the ability to
understand each ingredient and map it to something normalized that I know how
to deal with.

Thanks
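
One simple baseline (a suggestion, assuming you maintain a canonical
ingredient list yourself): strip quantities and units, then fuzzy-match
against the list, e.g. with Python's difflib:

```python
# Toy ingredient normalizer (illustrative): lowercase, drop leading
# quantities/units, then fuzzy-match to a hand-maintained canonical list.
import re
import difflib

CANONICAL = ["butter", "all-purpose flour", "granulated sugar", "olive oil"]

def normalize(raw):
    # Drop leading quantities/units like "3 tbsp " or "2 cups of "
    text = re.sub(r"^[\d/.\s]+(cups?|tbsp|tsp|oz|g|ml)?\s*(of\s+)?", "",
                  raw.lower()).strip()
    match = difflib.get_close_matches(text, CANONICAL, n=1, cutoff=0.6)
    return match[0] if match else None

print(normalize("3 tbsp Olive Oil"))   # -> "olive oil"
```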

------
Wilfred
I've built a little morphology parser in my spare time, and I'd love your
thoughts.

<http://metaoptimize.com/qa/questions/3103/morphology-parsing-with-an-esperanto-focus>

------
yesimahuman
My question about improving NER results, or finding a better library for
tagging: <http://metaoptimize.com/qa/questions/3092/improving-ner-results>

Thanks!

~~~
syllogism
If you're considering implementing a tagger, then I think this is a good paper
to read for approaches that will improve your results in a
developer-time-efficient way:

<http://l2r.cs.uiuc.edu/~danr/Papers/RatinovRo09.pdf>

~~~
ogrisel
Thanks for posting this: I read this paper some time ago but completely forgot
the names of the authors.

It is indeed a good "tricks of the trade" paper. I really enjoy such papers.

~~~
syllogism
Yeah, it's nice when a paper boils down a task into the best bang-for-buck
approach you should take.

A recent such paper I'm excited about is this one on coreference resolution
from EMNLP 2010: <http://www.aclweb.org/anthology/D/D10/D10-1048.pdf> , by the
Stanford group. It's a rule-based coreference resolution system you could
implement in a couple of days that gets state-of-the-art results. Best of all
it looks like you could get it to run about as fast as the parser that's
backing it. I'm planning to write a version for the C&C parser.

------
braindead_in
I need to build a phrase extraction system. We have a corpus of text, and we
need to find the most relevant phrases in that corpus. Is there a simple
algorithm/library I can use? Thanks in advance.
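
One simple starting point (a suggestion, not from the thread) is NLTK's
collocation finders, which rank candidate phrases by association measures
such as PMI:

```python
# Rank bigram phrase candidates from a corpus by pointwise mutual information.
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

text = "machine learning is fun and machine learning is useful"
tokens = text.split()                  # use a real tokenizer on real text

finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(2)            # drop phrases seen only once
print(finder.nbest(BigramAssocMeasures.pmi, 5))   # top candidate phrases
```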

------
sciboy
What is the single best result you have had applying your techniques? I don't
care about who it was for, but I would love to know details about how much of
an impact this type of work has in the real world.

------
nightlifelover
Heh, data monetization sounds good... what about skill monetization? I'm
getting an MS in computer science but am unsure what to do with it.

I'm particularly frustrated with brainless Java / .NET development jobs.

~~~
ogrisel
Learn Hadoop / Lucene / Pig / Hive / Clojure / Python / Ruby and put your code
on GitHub; if you can digest all of these, you will get plenty of interesting
job offers.

~~~
pragetruif
What are ways for someone without access to a Hadoop cluster (and who doesn't
want to continually pay for, say, Amazon's Elastic MapReduce) to learn
Hadoop/Pig/Hive? I'm a stats/data analysis guy, so I've been wanting to get
into big data for a while, but I'm not sure how to progress beyond reading a
bunch of tutorials (which I have).

------
geezer
My question here:

<http://metaoptimize.com/qa/questions/3096/how-to-choose-a-supervised-learning-method>

------
ntoshev
What are the biggest challenges you meet in terms of missing software
infrastructure to support your NLP/ML work? Things you wish someone would
implement?

~~~
ogrisel
Something that is really missing, in my opinion, is a good open source OpenCL
[1] runtime for regular x86 CPUs, using both multicore optimizations (a la
OpenMP) and vector instructions (SSE). Bonus points if your OpenCL runtime can
also leverage the nouveau and/or the open source ATI drivers.

Having such a runtime available as a standard lib, packaged in all Linux
distros, would make it interesting for low-level vector math libs such as
BLAS / LAPACK / ATLAS, and for convex optimization solvers, to have an
implementation based on OpenCL kernels. It would run (almost) as fast as the
current manually optimized C + SIMD intrinsics code, but could also run 10 to
100 times as fast whenever a GPU is available on the machine, without having
to re-compile anything.

Some advanced and very promising machine learning algorithms (e.g. algorithms
of the deep learning family) can really benefit from the computing power of
vector processors such as GPUs.

Right now, everybody who wants the performance boost of GPUs uses CUDA, but:

- it's only useful on machines with NVIDIA GPUs (e.g. not on Amazon EC2
instances)

- it's not open source, and hence has to be manually installed on every single
machine you want to use it on (no apt-get install from the official repos),
which makes it a heavy dependency for your pet machine learning library: be
prepared to support CUDA installation problems on your project mailing list.

[1] <http://en.wikipedia.org/wiki/OpenCL>

~~~
ntoshev
If you like Python, Theano may be useful to you (haven't actually tried it):

<http://deeplearning.net/software/theano/>

_Theano is a Python library that allows you to define, optimize, and evaluate
mathematical expressions involving multi-dimensional arrays efficiently.
Theano features:

- tight integration with numpy

- transparent use of a GPU

- symbolic differentiation

- speed and stability optimizations

- dynamic C code generation

- extensive unit-testing and self-verification_
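
For a flavor of what that looks like in practice, a minimal sketch (my own
illustration, not from the Theano docs quoted above):

```python
# Define a symbolic expression, get its gradient symbolically, compile,
# and run. With Theano's device flag set to a GPU, the same code runs
# there transparently.
import theano
import theano.tensor as T

x = T.dvector('x')                    # symbolic double vector
y = (x ** 2).sum()                    # symbolic expression
g = T.grad(y, x)                      # symbolic differentiation

f = theano.function([x], [y, g])      # compiles to C (or CUDA) code
print(f([1.0, 2.0, 3.0]))             # -> [14.0, [2.0, 4.0, 6.0]]
```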

~~~
ogrisel
Yes, I know and use Theano for some deep learning experiments. It is really a
great tool. But Theano will probably never be considered a default dependency
for common machine learning libs as long as there is no good open source
OpenCL runtime pre-packaged in major POSIX distros (Linux and BSD; OS X
already has OpenCL by default, but it's not open source, hence it cannot be
reused in Linux AFAIK).

------
acconrad
Ok I guess my last question wasn't well-worded, so a better question:

What do you recommend reading to learn about data monetization basics?

------
jlees
What in your opinion is the best and most extendable open source toolkit or
library for machine learning?

NLTK? Weka? Other?

~~~
bravura
It really depends upon the task. For each task (e.g. POS tagging), I maintain
a list of current open source implementations. When I attack a problem that
involves that task, I weigh the tradeoff between how easy the tool is to use
and how accurate it is. Most of the time, I favor ease of use. If I need high
accuracy, I start hand-rolling my own tools and keeping them around for later.

p.s. I don't actually use NLTK or Weka that often.

------
jonmc12
What kind of data and applications are best suited for monetization using data
mining? Now? In 5 years?

------
brosephius
thanks for doing this! my question:
<http://metaoptimize.com/qa/questions/3089/how-far-can-you-get-in-nlpml-without-a-graduate-degree>

------
lenley
Do you know of any ML / NLP libraries in erlang?

thx.

------
korch
How can the present state of the art in NLP/ML be more than a toy of limited
scope, if the most interesting and truly useful real-world applications
require impossible amounts of hardware resources, along with algorithmic
running times that would last until the heat death of the universe?

Mainly I'm talking about the topics in and related to the famous book _What
Computers Can't Do_: <http://en.wikipedia.org/wiki/What_Computers_Cant_Do>

Before I get nitpicked for generalizing NLP/ML to all-out "Hard AI", I would
like to note that _almost all_ of the most trivial language processing tasks
we _meatsacks_ perform a billion times per day in our daily lives involve the
_semantic interpretation of abstract representations_. It would be trivial to
sit down and begin enumerating specific types of word and association problems
that a child could perform, but which no supercomputer with 1,000 engineers
toiling on it could solve in the general sense.

By no means am I saying that everyone in the field should not full-court-press
ahead on the state of the art, as we'll never get there unless we try. I am
asking about the extent to which real-world NLP/ML use is constrained enough
to be nearly worthless hokum buzzwords (i.e. analytics, recommendation engines
for pre-conditioned, overtrained and context-free representations, behavioral
profiling for ads, statistical models of dynamical systems, etc.).

I hope this doesn't come off as negative, as I'm really more interested in the
question of Hard AI and how it relates to computational linguistics problems
we can solve today or in the near future, without some kind of unforeseen
"singularity"-level breakthrough in AI. Or is the general state still as bad
as when Minsky demolished perceptrons, causing the "nuclear winter decade" in
AI?

I don't have a PhD, just a pure math undergrad degree and high motivation to
have kept going at it, and this topic is so obscure and difficult that it's
not often one has the chance to ask someone who knows, so I am very interested
in the perspective of present-day researchers.

