Hacker News
Offer HN: Advice from PhD on NLP, machine learning, and data monetization
112 points by bravura 2166 days ago | 63 comments
Do you have questions about machine learning, natural language processing, or data monetization in general?

In particular, I'll field questions on:

* recommender systems

* profit optimization in ecommerce sites

* autotagging

* creative ways that NLP + ML can add value to your product and give you a competitive advantage

* whatever else you like.

If you have a specific technical question about ML or NLP, please post it on the MetaOptimize Q+A site (http://metaoptimize.com/qa/) and post a link here as a comment. I'll give you the detailed answer over there, since the Q+A there is designed to be archival and is more searchable. Other sorts of questions (like "How do I turn this data into money?") can go in this thread directly. If it seems like a longer discussion, email me at joseph at metaoptimize dot com and we can talk about setting up a Skype chat.

p.s. I'm also hiring people for remote project work, if you are kick-ass at ML or NLP, or you can simply ship correct code really fast. Email me at joseph at metaoptimize dot com.

Who am I?

My name is Joseph Turian, and I head MetaOptimize LLC. I consult on NLP, ML, and data monetization. I also run the MetaOptimize Q&A site (http://metaoptimize.com/qa/), where ML and NLP experts share their knowledge. I recently demo'ed autotagging of Hacker News to make it automagically browsable (http://metaoptimize.com/projects/autotag/hackernews/).

* I am a data expert, holding a Ph.D. in natural language processing and machine learning. I have a decade of experience in these topics. I specialize in large data sets.

* I'm business-minded, so I focus on business goals and the most direct path of execution to achieve these goals.

* I am also a technology generalist who has been hacking since age 10 and has programmed competitively at a world-class level.

Wow, this is exactly what I have been searching for for months!

I've got a pretty basic grasp on machine learning and collaborative filtering, so I understand some things, but I'm very confused on others, such as:

1) List-wise vs. pair-wise approaches to machine learning. Can you explain in simple terms what the main differences between them are, and in what cases it would be better to use one over the other? I've read a few sources about the differences, but it goes over my head a lot of the time.

2) When you don't have many users on your site, from my (basic) knowledge, you can't really use KNN algorithms to help with recommendations, because you only have a few people to compare your (let's say, movie) preferences to. What is the best way, then, to get the best recommendations when your userbase is large, and when it is small?

Those are just a few off the top of my head, but I'll be sure to add more later on.

1) Pair-wise approaches, if I understand what you mean, are those in which you look at pairs of examples. You want to avoid naive implementations, in which you train on a quadratic number of instances (all pairs). How you get around this really depends upon the specific problem. For example, if you have a model that operates over pairs, do example selection in an intelligent way: use all positive pairs, but sample only one negative pair per positive pair.
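A minimal sketch of that example-selection idea, assuming the positive pairs are known up front (the function and variable names here are my own, not from the thread):

```python
import random

def sample_training_pairs(positives, items, neg_per_pos=1, seed=0):
    """Build a training set from all positive pairs plus a sampled
    subset of negative pairs, avoiding the quadratic all-pairs blowup."""
    rng = random.Random(seed)
    pos = set(positives)
    data = [(a, b, 1) for (a, b) in positives]
    for (a, _b) in positives:
        for _ in range(neg_per_pos):
            # rejection-sample an item that does NOT form a positive pair with a
            while True:
                c = rng.choice(items)
                if (a, c) not in pos:
                    data.append((a, c, 0))
                    break
    return data
```

With `neg_per_pos=1` the training set stays linear in the number of positives instead of quadratic in the number of items.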

2) You are talking about the cold start problem. Basically, the problem is how do you deliver good recommendations to new users as quickly as possible. I am actually currently talking to a big company about this problem.

Google has talked about how they do this for Google News, especially because they have high item churn (news is not news in a day). http://citeseerx.ist.psu.edu/viewdoc/download?doi= Essentially, they do clustering of users and news items.

Another approach is to give the user a Hunch-like quiz when they join. Basically, you use a decision tree to cluster the users, getting as much information from them with as few questions as possible.
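The quiz idea can be sketched as picking, at each step, the question with the largest information gain about which cluster a user belongs to. This is a toy illustration with an assumed data layout (users carry a known cluster and yes/no answers), not a production decision-tree learner:

```python
from collections import Counter
import math

def entropy(labels):
    """Shannon entropy of a list of cluster labels (0.0 for an empty list)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_quiz_question(users, questions):
    """Pick the question whose yes/no answer most reduces our
    uncertainty about which cluster a user falls into."""
    base = entropy([u["cluster"] for u in users])
    best, best_gain = None, -1.0
    for q in questions:
        yes = [u["cluster"] for u in users if u["answers"][q]]
        no = [u["cluster"] for u in users if not u["answers"][q]]
        # expected entropy remaining after asking q
        rem = (len(yes) * entropy(yes) + len(no) * entropy(no)) / len(users)
        if base - rem > best_gain:
            best, best_gain = q, base - rem
    return best
```

Asking the best question first, then recursing on each answer group, gives you the decision tree.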

Another approach that requires almost NO user interaction is to use some ambient information about the user. Maybe you have a cookie for the user, and other sites will share the user's behavior with you. Maybe you ask them to log in through Facebook or Twitter, and then you can look at their profile and activity history to build a user profile. So getting creative about possible sources of ambient information is another approach.

I need some clarification on #1:

Could you provide easy examples where you would use a pair-wise approach, and a list-wise approach? Like, let's say I wanted to build a recommendation system for college football pick-ems (i.e., will Iowa beat Michigan State? etc.). Which method would be preferred, and why?

Another question:

Let's say (again) that I want to build a recommendation engine for college football picks. How do I know how much (more) data will help for training purposes? I.e., will only win/loss records suffice [probably not], or should I add in the opponents they beat (or were beaten by) as well? What about whether the game was away/at home? What about factoring in the individual players on each team, and their stats? What about the temperature that day? How do you end up knowing when you can stop looking at smaller and smaller variables with your training data, and have something that gives the best results?

You can do greedy feature selection: make stable train, development, and test sets (and maybe another, truer test set for use later) and define an appropriate quality measure. Then implement the simplest thing you can think of, train on the training data, calibrate the results on the development set (tuning hyperparameters, etc.), and look at the performance on the test data. That's your baseline. Now think up a nice new feature you'd like to add, implement it, and see if you can get the error on the development set to go down. If it goes down a lot (you can decide an appropriate threshold, maybe depending on the computational and human cost of using those features), keep it. Otherwise throw it away. Now repeat this process for every set of features you can think of, documenting the combinations you've tried as you go to see where the best effort/performance tradeoff is. Then test your best model on the test set to see if the performance has really improved.
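That greedy loop can be sketched as follows. `train_eval` is a stand-in for the train-and-calibrate step described above (it trains with a feature set and returns a development-set score, higher being better); the threshold plays the role of "goes down a lot":

```python
def greedy_feature_selection(candidates, train_eval, threshold=0.001):
    """Greedy forward selection: keep each candidate feature only if it
    improves the development-set score by more than `threshold`."""
    selected = []
    best = train_eval(selected)  # baseline with no extra features
    for feat in candidates:
        score = train_eval(selected + [feat])
        if score > best + threshold:
            selected.append(feat)
            best = score
    return selected, best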

If your model is linear, you will probably see diminishing returns as you implement more, different features. It also helps to look at model errors to see which features are pulling things the wrong way, and then you can add other features to compensate.

(Feature engineering + linear models feels a lot like writing an old-school AI heuristic program, except you have a computer program assign a numeric weight to the rules you're writing, so you're free to write a lot (millions) of rules and still get a manageable model.)

Re "Could you provide easy examples where you would use a pair-wise approach, and a list-wise approach? Like, let's say I wanted to build a recommendation system for college football pick-ems (i.e., will Iowa beat Michigan State? etc.). Which method would be preferred, and why?"

I had written a longer reply detailing what are listwise methods and how they differ from pairwise methods, but apparently it was eaten by the web hyenas.

Turian's explanation doesn't actually cover what listwise methods are. I've seen them presented in the learning to rank context (learning to rank is when you want to build a machine learning system that given a set of documents, say search results, ranks them from more to less relevant to a query), so I'll follow this context here.

There are two very obvious approaches to designing a learning system that outputs sorted data: the first is to learn to assign a real number to each "document" (and then you sort according to these numbers), and the second is to learn a classifier that predicts, given a pair of "documents", whether a <= b (and then you can use this classifier as a comparison function in a sorting algorithm). Both of these approaches have a common flaw, however, which is that they are easily myopic, and will make decisions looking only at a very small window. This is clearly suboptimal in the learning-to-rank context because, for example, mistakes in the top elements of the ranked list are a lot more important than mistakes further down the list, and sometimes you want to maximize diversity or something like that, and it's hard to do that in an elementwise or pairwise approach.
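The second (pairwise) approach above can be sketched as a learned preference function plugged into an ordinary sort. Here `prefers` is a stand-in for a trained classifier (in this sketch it is a hand-written lambda); note that a learned comparator need not be transitive, which is part of the myopia just mentioned:

```python
from functools import cmp_to_key

def rank_with_pairwise_classifier(docs, prefers):
    """Sort documents using a pairwise preference function.
    `prefers(a, b)` is assumed to be a trained classifier returning
    True when document `a` should rank above document `b`."""
    def cmp(a, b):
        if prefers(a, b):
            return -1  # a ranks above b
        if prefers(b, a):
            return 1
        return 0
    return sorted(docs, key=cmp_to_key(cmp))
```

The elementwise approach would instead score each document independently and sort by score.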

So they invented listwise approaches, and there are actually two sorts of these: you can either learn a classifier that scores entire sorted sequences of documents (with features like document trigrams, features connecting similar documents, etc) or you can learn one of the above models with a loss function that depends on the entire sorted list of documents.

So essentially a listwise approach is better if you can do it, as listwise approaches come closer to minimizing error measures you actually care about (like precision among the top 10 documents) instead of bogus measures (like number of document pairs misclassified). On the other hand, precisely because listwise approaches allow you to be more specific, they are less generic, and it might be cumbersome to adapt one of them to your recommendation system.

Also, it fundamentally depends on how the results of your recommendation system are used. If you present something like top k recommendations then listwise approaches can be better, but in some other scenarios you will actually care about all the individual decisions, so a pairwise approach can do just as well.

Google News

Much as I love them, I've been getting so frustrated with the declining quality of their news page that I'm not sure if this is a good example. I know I'm not the only one.

This is a very general question.

I looked through the feature lists of many e-commerce analytics packages. It struck me that while they offer impressive-looking graphs and gather huge amounts of data, the "analytics" part seems to be very simple. They mostly present you the data that has been measured.

I am in a situation where I will want to provide e-commerce analytics as part of a more complete solution. But I would like to approach this from a different angle: provide digested data, with insight into patterns that could lead to increased profits for online stores.

What do you think could be extracted from e-commerce data that would be of value to online stores? I'm looking for information that is difficult to extract (e.g. requires a tech advantage), but can be extracted automatically or with minimum human curation.

An example of an interesting direction: e-commerce analytics companies offer "customer segmentation" views. But they are all just simple data views: you can segment based on location, time of visit, etc. I think doing a factor analysis on customer buying patterns would be much more interesting and would provide real value to the store by automatically extracting "customer types" from data. One could then market directly to these customer groups.

So, to rephrase, I'm looking for ideas/pointers on what kinds of analyses would be difficult to create, but very useful for e-commerce.

I agree that most so-called analytics has little value. Your customer segmentation idea is a good one. You could also approach this as a classification problem, where you attempt to classify customers as purchasers or non-purchasers for each product or product group. Another idea is to look at high- and low-value pathways through a site. If you can identify pathways that don't lead to purchases, that might suggest where design effort should be spent.

I am currently working on this problem with one client. We are doing different sorts of market analysis, by analyzing products. In particular, we are looking at optimizing profits by choosing product titles more effectively, as well as optimizing profit by choosing price points more effectively. This can be used purely automatically, as black box decisions, or this information can be presented in a visualization interface for a human analyst.

We've also talked about several other ways of turning historical sales data into actionable business intelligence.

One thing we haven't looked at is doing market analysis based upon customer profiles. This is an interesting idea, and I have some thoughts on how to do it. Essentially, finding different ways to cluster users, and then "explaining" who these customer segments are. The explainability is based upon figuring out the most important features they have in common, and exposing these features in plain English.
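One simple way to do that "explaining" step, as a sketch: score each feature by its lift (within-segment frequency over population frequency) and surface the top one as the segment's label. The data layout (cluster id mapping to per-customer feature sets) is an assumption for illustration:

```python
from collections import Counter

def explain_segments(segments):
    """Given customer segments (cluster id -> list of per-customer
    feature sets), return the feature most over-represented in each
    segment relative to the whole population."""
    overall, n_total = Counter(), 0
    for members in segments.values():
        n_total += len(members)
        for feats in members:
            overall.update(feats)
    labels = {}
    for cid, members in segments.items():
        local = Counter()
        for feats in members:
            local.update(feats)
        # lift = within-segment frequency / population frequency
        lift = {f: (local[f] / len(members)) / (overall[f] / n_total)
                for f in local}
        labels[cid] = max(lift, key=lift.get)
    return labels
```

The top-lift feature per segment ("urban", "mobile shopper", ...) is what you would then phrase in plain English for the analyst.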

I might also answer some questions here. I'm a post-doc in NLP, but more from the linguistics side of things --- less business focussed, and less of a hacker.

(Hi Joseph --- we met briefly at ACL. I was one of James Curran's students. My name's Matthew Honnibal.)

Any good materials to read? I work in a text-analytics startup, so I do have a basic understanding about NLP and related fields.


I think having a solid ML understanding can serve you well if you are doing NLP.

* Good Freely Available Textbooks on Machine Learning? http://metaoptimize.com/qa/questions/186/

Besides that, I guess it depends specifically on what NLP tasks you are looking at. Can you tell me more?

Thanks! We primarily work on sentiment analysis, anaphora resolution, and topic identification. We also work on topic classification and clustering.

If you do anaphora resolution you should maybe look at this recent paper: http://citeseerx.ist.psu.edu/viewdoc/download?doi=

It's simple enough to get running, fast, and outperforms the state of the art of a couple of years ago.

Also, any good materials for those of us who have no background in NLP besides basic computer science theory?


My main proposal is to post a "How do I build X" question on MetaOptimize. (I don't know any place on the web where experienced NLP people chat, besides MetaOptimize, but I would be interested to hear other options.) NLP isn't really very hard once you know what you're doing, it's just that as an inexperienced person you will spend a lot of time going down dead ends and blind alleys, and there are a lot of pitfalls to avoid. So getting the advice of experienced people, right from the beginning, is the best way to do this.

See also:

* What are the best resources to use when starting machine learning for an experienced programmer? http://metaoptimize.com/qa/questions/334/

* How do I get started understanding statistics & data analysis? http://metaoptimize.com/qa/questions/154/

* New to data mining - where to start? http://metaoptimize.com/qa/questions/362/

Read the NLTK book available online for free here: http://nltk.org

The NLTK tools themselves might not always be state of the art (e.g. compare with the Java libs from the Stanford NLP group), but at least you will get to know all the major concepts.

Another interesting book, although not completely finished yet: http://www.manning.com/ingersoll/ Taming Text, by Grant Ingersoll, Thomas Morton, and Drew Farris, who are involved in the Apache Lucene/Mahout and OpenNLP communities.

I am just a beginner in this field, but I would really like to venture into NLP and ML. What are the best resources that I should start with?

I just posted answers for both your questions.

Thanks! Have some reading to do :)

Here is my question: http://metaoptimize.com/qa/questions/3086/what-is-the-best-w...

For MSc in Computer Science I wrote my dissertation on sentiment analysis using NLTK for my NLP needs. Although my work was nothing earth shattering, I really enjoyed it and have continued it as a side project investigating different techniques and algorithms. I'm now considering ways that I can turn this interest into my full time career (I'm a .Net developer currently). I don't particularly wish to send out a lot of unsolicited resumes and so was wondering if anyone had some advice on who best to approach? Are there any recruiters and the like that specialize in NLP?


Adam. P.s. I am a US permanent resident and an Australian citizen; the website that has my details and some example work is www.emptyorfull.com

Write an open source library.

I still get contacted about a ML library I haven't touched for 5 years.

What are some of the unique business models around data monetization? The most obvious are paywalls and generating content to drive advertising. Any thoughts on how successful paid access to data is, like with IMDb Pro?

I don't think people really want to pay for data.

I also think that people don't like having ML or NLP tools through a SaaS API, because you want a tool that you manage on your own hardware.

I believe that we'll see something I call the "machine learning business model". Essentially, you release your code under open source, and you give away a basic model, but then you charge for a premium model. The model is essentially like compiled code. No one knows how you trained it up, or why it works. So you get people using your tools by open-sourcing them, and then you upsell them on increased accuracy.

I am a first semester grad student seeking an M.S. in computer science. I am interested in NLP and hoping to choose it or a subfield of it as my research topic. Do you have any advice? For instance, is there background knowledge that I should be learning that would be helpful? What about good books or important papers? Other tips for CS/NLP grad students?

I know this is a rather general question, but any advice that you might have would be useful to me and students in similar situations. Thank you!

Hopefully the OP will be able to answer your question, but I thought I might offer my 2 cents:

I'm currently working on an MSc in IT. My actual project does not involve NLP directly, but all the 'related research' does (i.e. all previous systems which achieve the same goal as mine), so I had to read up on it. This was a struggle... YMMV. I found a few good resources along the way though - I recommend reading this first: http://www.staff.ncl.ac.uk/hermann.moisl/ell236/manual1.htm Then on iTunesU/YouTube you'll find a good course on computational linguistics from MIT; watch the first half dozen lectures at least (don't worry if you don't fully understand them all). A lot of the theory can be traced back to Chomsky's work on grammars, so I suggest reading some of his papers (his style is quite clear). I found I needed to do all this just so I could properly follow the recent papers on NLP (specifically NL interfaces to DBs) that I must review. I really had to start from the beginning; light Wikipedia'ing wasn't cutting it. I'm currently reading The Oxford Handbook of Computational Linguistics to consolidate my knowledge; it's pretty good.

So anyway... tough subject, try not to get bogged down. If all the reading's too dry, then 'Gödel, Escher, Bach' is a fun alternative that touches on similar theory.

If I were to choose a pure NLP project myself, I'd maybe do something like 'How NLP can add value to existing systems/software', i.e. look for imaginative practical applications, rather than theorizing yet another parser with some new trick/quirk.

I would focus on statistical, data-driven approaches to NLP, rather than formal or linguistic approaches. I would develop a solid foundation in ML. Lastly, I would focus on building large-scale systems. The main reason is that this will increase the generality and breadth of what you can build and ship. But my approach is very applied.

Good books or important papers is an involved question. If you ask on MetaOptimize Q+A, we can discuss it there.

Can I reach you by email? I have a question on data monetization.

Ending my 2nd year of a CS degree and I have enough technical knowledge to begin coding up websites and services like many of the other people here on HN. I was wondering whether or not it would be better to run ads on these sites, or to just try and generate large/high quality datasets and then sell those. How does someone like me monetize their data?

This question is far too open-ended to answer.

I would avoid an ad or data-selling business upfront. I would focus on a simple transactional business (i.e. they give you money and you give them a product or a service for a month), and on learning about your market and why they care about what you're offering.

Are you able to provide examples of that? E.g. a stock exchange selling a data feed, etc.

One question that definitely has plagued me for a while is that of evaluation schemes of recommender systems... On multiple occasions I have designed recommender systems that give a fairly good user experience, but I just have not been able to quantify these results... any pointers on that will be really appreciated.

xlvector has a blog on recsys that talks about a good but overlooked feature of those systems from a user point of view: the ability to have the user make fortunate discoveries while looking for something seemingly unrelated:

+ http://xlvector.net/blog/?tag=serendipity

I have no easy answer on how to quantify that myself but I guess the papers quoted by xlvector must be a good place to start.
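There is no single number for serendipity, but for basic recommendation quality a common offline starting point (my suggestion, not from the thread) is precision-at-k: hide part of each user's history, recommend from the rest, and measure how much of the hidden part the top-k recommendations recover:

```python
def precision_at_k(recommended, held_out, k=10):
    """Fraction of the top-k recommended items that appear in the
    user's held-out (hidden) interactions."""
    top_k = recommended[:k]
    if not top_k:
        return 0.0
    return sum(1 for item in top_k if item in held_out) / len(top_k)
```

Averaging this over users gives a single comparable score per recommender, which at least lets you quantify "fairly good" across design changes.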

Looking at things from a legal standpoint, how do you go about creating your database? What are you most conservative about when using APIs for data mining?

How do you approach scraping data from social websites that don't have APIs?

What criteria do you use to determine whether an algorithm you are developing qualifies for legal copyright protection?

Can any of the principles of ML be applied to KPI / balanced scorecard applications? I have done some work with calculating simple linear regression but would like to move towards decision tree learning. Is this a good idea? Or is there something more applicable? BTW, I'm a noob at this! Thanks in advance.

I'm working on a shopping startup and need to get a basic recommender system built in the next few weeks. So far, I've implemented SlopeOne to determine the order of results every time a user searches. Is there any way to cache/memoize this repeated computation? Also, is SlopeOne the best way to go for this?
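For reference, the core of Slope One can be sketched in a few lines. This is a toy, recompute-everything version (not the commenter's implementation); the item deviations in `diffs` are exactly the part worth precomputing and caching, since they only change when ratings change:

```python
from collections import defaultdict

def slope_one_predict(ratings, user, target):
    """Minimal Slope One: predict `user`'s rating for `target` from the
    average rating difference between `target` and each item the user
    has already rated.  `ratings` maps user -> {item: rating}."""
    diffs = defaultdict(list)  # item -> list of (target - item) deviations
    for r in ratings.values():
        if target in r:
            for item, val in r.items():
                if item != target:
                    diffs[item].append(r[target] - val)
    num, den = 0.0, 0
    for item, val in ratings[user].items():
        if item in diffs and item != target:
            dev = sum(diffs[item]) / len(diffs[item])
            num += (val + dev) * len(diffs[item])  # weight by support
            den += len(diffs[item])
    return num / den if den else None
```

Caching the per-item-pair average deviation and count (rather than re-scanning all users per query) is the standard way to avoid repeating this computation on every search.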

Have you tried DirectedEdge?

What's the best way to normalize product names? Specifically, I would like to be able to take a recipe and get all the ingredients in that recipe. The ingredients are going to be written by humans so I need the ability to understand that ingredient and map it to something normalized that I know how to deal with.
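A rough first pass at this (before anything statistical) is rule-based normalization plus an alias table: strip quantities, units, and modifiers, then map the remainder to a canonical name. The regexes and the alias entries below are illustrative assumptions, not a real ingredient vocabulary:

```python
import re

# hypothetical alias table mapping surface forms to canonical ingredients
ALIASES = {
    "scallions": "green onion",
    "spring onions": "green onion",
    "green onions": "green onion",
}

UNITS = r"(cups?|tbsp|tablespoons?|tsp|teaspoons?|oz|ounces?|lbs?|pounds?|grams?|kg|ml)"

def normalize_ingredient(line):
    """Normalize a human-written ingredient line: drop quantities,
    units, and common modifiers, then map through the alias table."""
    s = line.lower()
    s = re.sub(r"[\d/.,]+", " ", s)                              # "2 1/2" -> ""
    s = re.sub(r"\b" + UNITS + r"\b", " ", s)                    # drop units
    s = re.sub(r"\b(chopped|fresh|finely|diced|of)\b", " ", s)   # drop modifiers
    s = re.sub(r"\s+", " ", s).strip()
    return ALIASES.get(s, s)
```

Beyond this, fuzzy matching or a learned tagger can handle the surface forms a hand-written table misses.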


I've built a little morphology parser in my spare time, and I'd love your thoughts.


My question about improving NER results, or finding a better library for tagging: http://metaoptimize.com/qa/questions/3092/improving-ner-resu...


If you're considering implementing a tagger, then I think this is a good paper to read for approaches that will improve your results in a developer time efficient way:


Thanks for posting this: I read this paper some time ago but completely forgot the name of the authors.

It is indeed a good "tricks of the trade" paper. I really enjoy such papers.

Yeah, it's nice when a paper boils down a task into the best bang-for-buck approach you should take.

A recent such paper I'm excited about is this one on coreference resolution from EMNLP 2010: http://www.aclweb.org/anthology/D/D10/D10-1048.pdf , by the Stanford group. It's a rule-based coreference resolution system you could implement in a couple of days that gets state-of-the-art results. Best of all it looks like you could get it to run about as fast as the parser that's backing it. I'm planning to write a version for the C&C parser.

My requirement is that I need to build a phrase extraction system. We have a corpus of text and we need to find the most relevant phrases from that corpus. Is there a simple algorithm/library I can use? Thanks in advance.
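One simple baseline for this (my suggestion, not from the thread) is to score adjacent word pairs by pointwise mutual information (PMI) and keep the highest-scoring bigrams as candidate phrases:

```python
import math
from collections import Counter

def top_phrases(tokens, n=5, min_count=2):
    """Score adjacent word pairs by PMI and return the top-n bigrams.
    `min_count` filters out pairs too rare to score reliably."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)
    scored = []
    for (a, b), c in bigrams.items():
        if c < min_count:
            continue
        # PMI: log of observed pair probability over chance co-occurrence
        pmi = math.log((c / total) / ((unigrams[a] / total) * (unigrams[b] / total)))
        scored.append((pmi, a + " " + b))
    return [p for _, p in sorted(scored, reverse=True)[:n]]
```

NLTK's collocations module implements this and fancier association measures (chi-squared, likelihood ratio) out of the box.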

What is the single best result you have had applying your techniques? I don't care about who it was, but I would love to know details about how much of an impact this type of work has in the real world.

Heh data monetization sounds good.. what about skill monetization? I'm getting a MS in computer science but am unsure what to do with it..

I'm particularly frustrated with brainless Java / .net development jobs.

Learn hadoop / lucene / pig / hive / clojure / python / ruby and put your code on github and if you can digest all of these you will get plenty of interesting job offers.

What are ways for someone without access to a Hadoop cluster (and doesn't want to continually pay for, say, Amazon's Elastic Map Reduce) to learn Hadoop/Pig/Hive? I'm a stats/data analysis guy, so I've been wanting to get into big data for a while, but I'm not sure how to progress beyond reading a bunch of tutorials (which I have).

What are the biggest challenges you meet in terms of missing software infrastructure to support your NLP/ML work? Things you wish someone would implement?

Something that is really missing, in my opinion, is a good open-source OpenCL [1] runtime for regular x86 CPUs using both multicore optimizations (a la OpenMP) and vector instructions (SSE). Bonus points if your OpenCL runtime can also leverage the nouveau and/or the open source ATI drivers.

Having such a runtime available as a standard lib packaged in all Linux distros would make it interesting for low-level vector math libs such as BLAS / LAPACK / ATLAS and convex optimization solvers to have an implementation based on OpenCL kernels that would work (almost) as fast as the currently manually optimized C + SIMD intrinsics code, but could also run 10 to 100 times as fast whenever a GPU is available on the machine, without having to re-compile anything.

Some advanced and very promising machine learning algorithms (e.g. algorithms of the deep learning family) can really benefit from the computing power of vector processors such as the GPUs.

Right now everybody who wants the perf boost of GPUs uses CUDA, but:

- it's only useful on NVidia GPU machines (e.g. not on Amazon EC2 machines, for instance)

- it's not open source, hence it has to be manually installed on every single machine you want to use it on (no apt-get install from the official repos). This makes it a heavy dependency for your pet machine learning library: be prepared to support CUDA installation problems on your project mailing list.

[1] http://en.wikipedia.org/wiki/OpenCL

If you like Python, Theano may be useful to you (haven't actually tried it):


Theano is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. Theano features:

- tight integration with numpy

- transparent use of a GPU

- symbolic differentiation

- speed and stability optimizations

- dynamic C code generation

- extensive unit-testing and self-verification

Yes, I know and use Theano for some deep learning experiments. It is really a great tool. But Theano will probably never be considered a default dependency for common machine learning libs as long as there is no good open-source OpenCL runtime pre-packaged in major POSIX distros (Linux and BSD; OS X already has OpenCL by default, but it's not open source, hence it cannot be reused in Linux, AFAIK).

Ok I guess my last question wasn't well-worded, so a better question:

What do you recommend reading to learn about data monetization basics?

What in your opinion is the best and most extendable open source toolkit or library for machine learning?

NLTK? Weka? Other?

It really depends upon the task. For each task (e.g. POS tagging), I maintain a list of current open source implementations. When I attack a problem that involves this task, I weigh the tradeoff between how easy the tool is to use and how accurate the tool is. Most of the time, I favor ease-of-use. If I need high accuracy, I start hand-rolling my tools and keeping them around for later.

p.s. I don't actually use NLTK or Weka that often.

I would advise against buying into a large framework too much, because in my experience they aren't very performant, and you usually do need to care about speed here.

I think you're better off taking the modules you need from open source libraries, favouring fast C and C++ implementations, and then writing the glue yourself. That way, you won't pay for abstractions you don't need. The glue shouldn't take very long to write.

What kind of data and applications are best suited for monetization using data mining? Now? In 5 years?

Do you know of any ML / NLP libraries in Erlang?


How can the present state of the art in NLP/ML be more than a toy with limited scope, if the most interesting and truly useful real-world applications require impossible amounts of hardware resources, along with algorithmic running times that would take until the heat death of the Universe?

Mainly I'm talking about the topics in and related to the famous paper What Computers Can't Do. http://en.wikipedia.org/wiki/What_Computers_Cant_Do

Before I get nitpicked for generalizing NLP/ML to all-out "Hard AI", I would like to note that almost all of the most trivial language processing tasks we meatsacks perform a billion times per day in our daily lives involve the semantic interpretation of abstract representations. It would be trivial to sit down and begin enumerating specific types of word and association problems that a child could perform, but which no supercomputer with 1,000 engineers toiling on it could solve in the general sense.

By no means am I saying that everyone in the field should not full-court-press ahead on the state of the art, as we'll never get there unless we try. I am asking about the extent to which real-world NLP/ML use is so constrained as to be nearly worthless hokum buzzwords (i.e. analytics, recommendation engines for pre-conditioned, overtrained and context-free representations, behavioral profiling for ads, statistical models of dynamical systems, etc.).

I hope this doesn't come off as being negative, as I'm really more interested in the question of Hard AI and how it relates to computational linguistics problems we can solve today or in the near future without some kind of unforeseen "singularity"-level breakthrough in AI. Or is the general state still as bad as when Minsky demolished perceptrons, causing the "nuclear winter decade" in AI?

I don't have a PhD, just a pure math undergrad with a high motivation to have kept going at it, and this topic itself is so obscure & difficult that it's not often one has the chance to ask someone who knows, so I am very interested in the perspective of present day researchers.
