Boosting Sales with Machine Learning (medium.com/xeneta)
323 points by ingve on June 8, 2016 | 75 comments

As someone with little experience in this area, I appreciated the step-by-step examples of cleaning and transforming the data into a suitable vector format for the training step. Very nice!

Is it typical that ~2000 samples (1000 leads, 1000 non-leads) is "enough"? I see that accuracy on the training set was ~86% and that the results are "starting to become useful for our sales team", but I would have thought more samples would have been needed. (I guess that since you're always collecting more real-world data you can continue to train the model and so it should get better?)

What's "enough" depends entirely on what you're doing with it and how much accuracy you require. In this case, 86% was enough to improve their batting average and so was worth using.

If they were trying to detect pedestrians from a self-driving car then they'd need a lot more accuracy and so a lot more training data.

86% accurate should be a solid 7x speedup compared to manual selection, not to mention that the algo could also pre-populate description fields and only require a visual check-up instead of manual typing.

> Is it typical that ~2000 samples (1000 leads, 1000 non-leads) is "enough"?

In terms of practical machine learning, plotting learning curves is a great way to know if samples are "enough" [0, 1]. If your algorithm underfits (high bias), both errors (training and validation) will be high, and in this case adding training samples would not help.

[0] http://www.astroml.org/sklearn_tutorial/practical.html

[1] http://blog.fliptop.com/blog/2015/03/02/bias-variance-and-ov...
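
A minimal sketch of the learning-curve check described above, using scikit-learn's `learning_curve`. The synthetic dataset from `make_classification`, the random forest settings, and the training sizes are assumptions for illustration; they are not the article's data or model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# Assumption: a synthetic 2000-sample dataset stands in for the real leads data.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

train_sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X,
    y,
    train_sizes=np.linspace(0.1, 1.0, 5),  # 10% to 100% of each training fold
    cv=5,
    scoring="accuracy",
)

# Average over the 5 cross-validation folds for each training-set size.
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)

for n, tr, va in zip(train_sizes, train_mean, val_mean):
    print(f"{int(n):5d} samples: train accuracy={tr:.3f}, validation accuracy={va:.3f}")
```

If the validation accuracy is still climbing at the largest size, more data will probably help. If both curves are flat and low, the model is underfitting and extra samples alone are unlikely to fix it.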

2000 samples is what we needed to get going, yes. We're also noticing variable accuracy on different real-world datasets, meaning the algorithm is biased towards our current dataset.

Getting more data is one of our top priorities going forward with this project.

A follow-up question: In your experience, what's the cost difference between:

1) Mis-identifying a lead as a non-lead.

2) Mis-identifying a non-lead as a lead.

I would guess that (1) would be more costly if non-leads go to somewhere where they aren't followed-up on. But I'd appreciate the insight.

Thanks again!

Hi, since Per, who wrote this, is a developer and I do the sales that come out of it, I'll answer this:

Mis-identifying a lead as a non-lead is potentially losing out on a big deal that can make or break your company. You never know what email will lead to a quickly closed $30k ARR sale, which is golden for any SaaS startup.

The reverse has almost no consequences unless you're really going to town with the emailing and end up being flagged for spam. Usually people just ignore you (not so smart) or write back that it's not relevant.

Your assumption is right, in terms of business cost, we'd rather see a false positive than a false negative.

I don't think the scikit-learn algorithm differentiates between the two types of errors in terms of cost.

Though it seems to give fewer false negatives than false positives overall when testing on new datasets.

I'll put in the f1 score in the article when I have time.
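
A small sketch of how precision, recall, and F1 separate the two kinds of error discussed above. The labels below are hypothetical and written by hand for illustration; they are not results from the article.

```python
def confusion_counts(y_true, y_pred):
    """Count (tp, fp, fn, tn), treating label 1 as "lead" and 0 as "non-lead"."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # non-lead flagged as lead
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # lead missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and their harmonic mean (F1)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical labels: 4 real leads, 6 real non-leads.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
accuracy = (tp + tn) / len(y_true)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# accuracy=0.70 precision=0.60 recall=0.75 f1=0.67
```

Recall is the number to watch if a missed lead is the expensive mistake. Separately, many scikit-learn classifiers, including `RandomForestClassifier`, accept a `class_weight` argument, which is one way to make a missed lead count for more than a wasted email.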

Here's a whole post on that subject (or it might be Part 1 :)


This is exactly the link I was going to post. I think it does a great job showing that accuracy is not always the most appropriate metric for a model.

Statistical uncertainty grows like sqrt(N), and percent statistical uncertainty shrinks like sqrt(N) / N. So 2000 samples has a percent statistical uncertainty of about 2% - this is likely on par with or smaller than the systematic uncertainty (errors which don't depend on the size of the data). This means 2000 samples is probably fine for what they are trying to do, especially because the dataset is balanced with each class having 1000 samples. However the more data you have the better your model will do.
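
The arithmetic behind the "about 2%" figure, as a quick sketch. The helper name `relative_uncertainty` is made up for this example.

```python
import math


def relative_uncertainty(n):
    """Relative statistical uncertainty sqrt(n) / n, which equals 1 / sqrt(n)."""
    return math.sqrt(n) / n


for n in (1000, 2000, 20000):
    print(f"N={n:6d}: about {relative_uncertainty(n):.1%}")
# N=  1000: about 3.2%
# N=  2000: about 2.2%
# N= 20000: about 0.7%
```

Going from 2000 to 20000 samples only shrinks this term from roughly 2.2% to 0.7%, which is why it stops mattering once it falls below the systematic uncertainty.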

The author mentioned that 2000 was the minimum, and they intend to keep training the model with future data.

What I find amazing is that they even need to process leads this way. That is, their TAM (companies shipping over 500 containers a year) is surely something that is already well-defined and readily available. Possibly even contained entirely within the head of an experienced salesperson in that space.

Hi, in the US this data is openly available as shipping manifests are public info.

In the EU this is not public information, and as such there's no scalable way other than experience to find out if a company ships 500+ TEU.

After nine months I can easily spot what stands out about companies that ship 500+ TEU from a description, but this is far faster.

I think it's time for us to start building better systems to protect consumers from this escalating arms race of advertising tools and trickery.

Some sort of little financial coach that asks you about purchase patterns, your state of mind before and after, how you feel about the whole thing in retrospect, and why it is you think you 'need' this thing right now.

If it turns out you compulsively buy baseball hats for the wrong team when you're drunk, maybe your phone should ping you if you walk into a sporting goods shop after you've been in a bar for three hours. Then it shows you a picture of your daughter and reminds you that you PROMISED that this summer you'd take her to Disney World.

Ummm, did you read the same article I read?

Because I read an article about a team using publicly available data about public companies, then training an algorithm to comb that data to determine which ones were likely customers (which would save their salespeople time and, probably, save time for unlikely customers who are no longer receiving an unwanted solicitation).

We're talking about enterprise sales in the logistics and transportation industry here. I doubt the final decision whether or not to buy this particular freight rate benchmarking tool is being done on an impulse. There are whole teams responsible for enterprise purchases who have already ripped apart this offer and know every common sales trick in the book.

Also, I can't help but smirk at the thought of mentioning Disney a couple sentences after talking about an "escalating arms race of advertising tools and trickery". Perhaps the phone should have pinged before you made that promise too.

I feel like a lot of these articles are popping up lately, this one just brought me to the point of expressing my frustration with the whole notion of trying to squeeze another 10% out of our customer base. Apologies for the borderline non sequitur.

>We're talking about enterprise sales in the logistics and transportation industry here. I doubt the final decision whether or not to buy this particular freight rate benchmarking tool is being done on an impulse. There are whole teams responsible for enterprise purchases who have already ripped apart this offer and know every common sales trick in the book.

Within the software industry this is a cliche. Someone who hasn't touched code in 15-infinity years makes a buy order for a demonstrably inferior product, and we waste $500k in labor and overhead costs so that he doesn't look bad for making a $200k order for solutions nobody wanted. Whether they intend to or not, tools like these are going to pick up on patterns of weak judgement and exploit them. Really the same problem with A/B testing.

> Also, I can't help but smirk at the thought of mentioning Disney a couple sentences after talking about an "escalating arms race of advertising tools and trickery". Perhaps the phone should have pinged before you made that promise too.

Haha. Touche. It was the first thing that popped into my head when trying to think of a common social obligation that is difficult to fulfill if you can't manage your finances.

If we take the broad context and not just this particular article then yes, I can understand the growing frustration. It's a sentiment shared by others. There was an article here about Facebook's latest language understanding tool. The comments focused mostly on filter bubbles and how tools like that fuel advertising sales. Only a handful of comments were about the main topic.

Reading the article, I couldn't help but wonder about the cost of doing something like this for a typical small business, i.e. where this work had to be done by a contractor. It looks like a big line item for a company where the in-house tech expert knows Excel and how to plug in an ethernet cable.

It's not that I don't think it's useful, I just wonder about the ROI for cash constrained businesses.

I'm by no means an expert at machine learning, but, given how organized the scikit-learn libraries are, building a simple classifier as shown in the example would be a few days of work at the max. In fact, an initial first version can be built within a day. After that, one has to tune the hyperparameters and spend time with feature selection to improve the baseline accuracy.

The most important thing will be the training data. You need a good number of samples, and the data also needs to be reasonably "clean".

At US rates, that smells like a few tens of thousands of dollars. At the core of my "concern" is that magnitudes of turnover, margins, and required increases in sales due to the analysis make application of the idea uneconomical.

To put it another way, the business case feels weak for most [i.e. small] profit-seeking enterprises.

"few days of work" for "tens of thousands of dollars" seems a bit absurd. Are you assuming they are making 10K per day? Seems a bit high. I would assume 200-300/hour tops.

The amount of time you can spend preparing the training data is unbounded. The number of times you can do the training with data that ends up not actually looking like what you see in-the-wild a week later is unbounded. When all is said and done, the yak-shaving alone will be tens of thousands.

At the rate of a couple of hundred bucks an hour that I'd expect to pay for a qualified consultant, 200 hours works out toward the high end of the few in "a few tens of thousands".

How do you cram 200 hours into a few days?

Just like with the vast majority of project forecasts, the "few days" is what you say to get the sale - internally or to outside clients. If you think it is that simple, well, I would like to sell you just a few days of machine learning expertise if you have a project... :) Even very simple tasks that you can let the intern do can - and often do - take days longer than projected.

I think I misread the comment as hours.

There are lots of tools to help with the hyperparameter problem to make that faster/cheaper as well. This problem is often orthogonal to the domain expertise required to do good feature selection.

Scikit-learn implements things like grid search natively [0], and tools like SigOpt [1] (YC W15, disclosure: I'm a founder) do this automatically as well.

[0]: http://scikit-learn.org/stable/modules/generated/sklearn.gri...

[1]: https://github.com/sigopt/sigopt_sklearn
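
A minimal grid search sketch with scikit-learn. In current scikit-learn releases `GridSearchCV` is imported from `sklearn.model_selection`. The synthetic data and the small parameter grid are assumptions chosen so the example runs quickly; they are not tuned for the leads problem.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Assumption: synthetic data stands in for the vectorized company descriptions.
X, y = make_classification(n_samples=400, n_features=20, random_state=0)

# Every combination in this grid is trained and scored with cross-validation.
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [3, None],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)

print("best parameters:", search.best_params_)
print(f"best cross-validated accuracy: {search.best_score_:.3f}")
```

The cost grows with the product of the grid sizes (here 2 x 2 combinations times 3 folds), which is where smarter search tools start to pay off.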

Yup, hence the need to have it open source, otherwise that's just too much time spent for little return (leads datasets are too limited for significant uplift)

Bag of Words is not actually a great approach to understand text because it ignores the semantics of the word. For example, 'hotel' and 'motel' which are similar words have completely different vector representations in the BoW model.

A popular alternative is to use a distributed word embedding such as word2vec[1], where similar words are grouped together in the vector space.

Edit: If there are few observations, like in this case, we don't need to train the word2vec model on the dataset itself. We can use pre-trained word embeddings such as the one publicly released by Google which was trained on the Google News dataset.
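
To make the 'hotel' vs 'motel' point concrete, here is a small pure-Python sketch of bag-of-words vectors and their cosine similarity. The vocabulary and sentences are made up for illustration.

```python
import math
from collections import Counter


def bag_of_words(text, vocabulary):
    """Return a count vector for `text` over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


def cosine_similarity(a, b):
    """Cosine of the angle between two count vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


vocabulary = ["cheap", "downtown", "hotel", "motel"]

hotel = bag_of_words("hotel", vocabulary)  # [0, 0, 1, 0]
motel = bag_of_words("motel", vocabulary)  # [0, 0, 0, 1]
print(cosine_similarity(hotel, motel))     # 0.0

doc_a = bag_of_words("cheap hotel downtown", vocabulary)  # [1, 1, 1, 0]
doc_b = bag_of_words("cheap motel downtown", vocabulary)  # [1, 1, 0, 1]
print(round(cosine_similarity(doc_a, doc_b), 3))          # 0.667
```

The two words get orthogonal vectors, so any similarity between the two sentences comes only from the words they literally share. An embedding would instead place 'hotel' and 'motel' close together.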


Grouping doesn't always yield better results, and I think it would probably do pretty poorly in this case because there are few observations.

Random Forest won in the samples tried, but I wager a support vector machine with a histogram kernel would do fantastic.

We could always use pre-trained word embeddings, the few observations won't matter then.

This is very industry-specific; the chance of finding pre-trained embeddings that are meaningful for the data at hand seems small.

Edit: I'm not saying don't try it, I certainly would! Let's look at the data on GitHub? Maybe we could have a whack at the data.

Awesome post, thanks for writing it up!

I haven't been very involved in using random forest at work (yet, I hope to), but I've done various mathematical programming work in the past to generate business insights (mainly through regressions and linear programming/optimization).

One thing that you make very clear through this blog post is how much value comes from the "non-technical" aspects of mathematical decision analysis. You have to see the application, find the data, clean the data, figure out what to actually put into the model, and get results in a way that can lead to an actual outcome with value.

Here's the thing, the reason I put "non-technical" in quotations is that it's actually a mix of technical and non-technical. You need to be aware of how these algorithms work and how they are implemented in order to have that insight. There's the old statement that everything looks like a nail to someone with a hammer, but knowing what tools are and what they can do can help frame issues in a way that you can approach them from new angles. This is why I do think it's worth learning various ML and other algorithms (like LP, NLP, etc) through contrived examples - once you understand them, you'll start to see the opportunity to apply them.

One last thing - kaggle. Kaggle is super fun, and I highly recommend it for people looking for an opportunity to try this out and learn it. However, good real world data science probably has less to do with making exceptional refinements to models. You know that data set you get when you are doing a kaggle competition? That's a huge amount of the actual work, right there.

You can do so much with basic RF and KNN (and with LP for that matter). This post is a pretty good illustration of this.

Anyway, pretty cool, thanks for sharing.

As another plug for Kaggle - it lets you know what is state of the art. For instance, from my Kaggle experience I know that gradient boosted decision trees (specifically xgboost) are virtually always superior to random forests. They are also basically just as easy to use, in contrast to neural nets which are not as user friendly. The machine learning step was probably the easiest step in this process, but it is also easy to leave simple gains on the table. Gradient boosted decision trees don't get nearly enough hype.

I would even argue it's easier to use than random forests. The standalone R package is super slick, with built-in cross validation. And if you want to tune the hyperparameters you can just do it in caret. Really accessible stuff.

Yeah on Kaggle I found myself getting routinely outperformed by people with scripts that simply ran a gradient boosted decision tree and nothing else. And yet the topic was never mentioned in my modeling and stats courses!

That's because it's practically new. Tianqi Chen authored the R package for it (original release Aug 2015) and actually posts about it on the Kaggle forums quite frequently.

Slightly off topic, but I am curious if Elasticsearch could be used instead for the cleaning and transformation stages? You only need to configure your index to get stemming, stop word removal, etc. It seems like it would take less time to implement. You could also play with different TF/IDF algorithms by changing the config and reindexing.

Stemming and stop-word removal aren't that hard. Probably less time to set up say NLTK and write the 4-10 lines of python required than setting up Elasticsearch.
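
Roughly what those few lines could look like with NLTK's `PorterStemmer`. The stop-word set below is a tiny hand-written assumption so the sketch runs without downloading NLTK's stop-word corpus; the `clean` helper is a made-up name.

```python
import re

from nltk.stem import PorterStemmer

# Assumption: a small hand-written stop-word list, not NLTK's full corpus.
STOP_WORDS = {"a", "an", "and", "are", "the", "of", "in", "is", "for", "to", "we"}

stemmer = PorterStemmer()


def clean(text):
    """Lowercase, split into alphabetic tokens, drop stop words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stemmer.stem(token) for token in tokens if token not in STOP_WORDS]


print(clean("We are shipping containers to the ports of Europe."))
```

The resulting token list can be joined back into a string and handed to whatever vectorizer comes next.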

Hey everyone, this is Edvard from Xeneta.

If anyone has any questions about our sales process and how we use this day to day, fire away!

Did you experiment with any other ways to get company descriptions than FullContact? Their bio data from 'social profiles' seems a bit hit and miss or sparse for the companies I tried it out on.

Their social data isn't great, but it was the best alternative and their simple API made us choose it. It must be noted that all of this is a thousand times better than googling each individual company name.

If you're interested in giving our Company Search API (by name) a go, email me michael[at]fullcontact.com and I'll hook you up.

We're constantly trying to improve our company data, social especially, stay tuned for that. That said, I'd love to hear any feedback you have at the same address.

Hey Edvard, great writeup! Do you have any (anec)data about how well this is working? As in sales outcomes.

Any plans to open-source this?

Hey, on an anecdotal level this saves me a lot of time. As an example I can take the participants list from a logistics fair, run it through the script and come out with a prioritised list of companies to contact. It's quite new so we don't have any hard numbers, so I can't say anything other than "really good". The code is already open-source and can be found here: https://github.com/xeneta/LeadQualifier

I feel naive bayes works pretty well with such text classification tasks. You might want to give it a try.

Vowpal Wabbit is also pretty easy to use and fast. I train and test on 100K text examples in under 1 minute. Works better than random forests and other things.

The choice of algorithm completely depends on the type of classification task at hand. Naive Bayes would be good for certain types of classification problems, but there could be better ones for other types of text classification problems.

naive bayes does well on small data sets, but would in general do poorly on larger text sets due to its independence assumption (which is horribly wrong in language).
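
For anyone wanting to try the Naive Bayes suggestion, a minimal scikit-learn sketch. The company descriptions and labels are hypothetical, written by hand for illustration, and far too few to say anything about real accuracy.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical descriptions, invented for this example; 1 = lead, 0 = non-lead.
descriptions = [
    "global freight shipping and container logistics",
    "ocean freight forwarding for importers",
    "container shipping line serving european ports",
    "local bakery and cafe",
    "web design studio for small businesses",
    "yoga studio and wellness cafe",
]
labels = [1, 1, 1, 0, 0, 0]

# Bag-of-words counts feed directly into multinomial Naive Bayes.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(descriptions, labels)

print(model.predict(["freight shipping company", "cafe and bakery"]))  # [1 0]
```

Swapping `MultinomialNB()` for another estimator in the pipeline is a one-line change, which makes side-by-side comparisons on the same features cheap.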

Many companies think that machine learning techniques are reserved for the big guys only - wrong! I have some interesting cases where a special offer is presented to chosen clients based on multiple variables and patterns, and this approach makes some nice money each day. Another example: an anti-fraud system working in a quite niche domain, saving about a thousand dollars each day. The real fun starts when you look at your data from a different perspective, and most times it is worth the hassle!

Many times a simple logistic regression or SVM could do wonders, especially on datasets <100K examples. It's a matter of being aware such applications are possible. The code is usually less than a couple hundred lines, but takes some experimentation to get it right.

Interesting read. I was speaking with some colleagues just yesterday about a potential pet project to identify which of our customers have eCommerce websites.

The concept would involve processing millions of company names found in the "Bill To" field of sales records, then using these records to populate an Elasticsearch index for use with the Graph Query API to help further normalize/dedup the company names that share similar string semantics. The next stage of the process would be to scan the normalized, deduped list of company names and attempt to locate the company website URL by crawling the first page of Google search results. This would need to be metered because I assume Google would block me if I performed rapid attempts. After gathering a list of company URLs the plan would then shift gears into attempting to identify if any of the companies' websites contain the typical components that make up an eCommerce website. Think searching the HTML for all variations of "add to cart", "shopping cart", "my account", etc.
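
The last stage of that plan, searching a page's HTML for typical eCommerce phrases, could start out as simply as the sketch below. The three phrases come from the comment above; the crude tag stripping, the `min_hits` threshold, and the function names are assumptions for illustration.

```python
import re

# Phrases from the comment above; a real list would need many more variants.
ECOMMERCE_PHRASES = ["add to cart", "shopping cart", "my account"]


def find_ecommerce_phrases(html):
    """Return the phrases found in the tag-stripped page text, case-insensitively."""
    text = re.sub(r"<[^>]+>", " ", html)      # crude tag stripping, not a real HTML parser
    text = re.sub(r"\s+", " ", text).lower()  # collapse whitespace so phrases match
    return [phrase for phrase in ECOMMERCE_PHRASES if phrase in text]


def looks_like_ecommerce(html, min_hits=2):
    """Assumed heuristic: call it eCommerce if at least `min_hits` phrases appear."""
    return len(find_ecommerce_phrases(html)) >= min_hits


shop_page = '<div><button>Add to   Cart</button> <a href="/account">My Account</a></div>'
blog_page = "<p>We wrote about our shopping trip.</p>"

print(find_ecommerce_phrases(shop_page))  # ['add to cart', 'my account']
print(looks_like_ecommerce(shop_page), looks_like_ecommerce(blog_page))  # True False
```

The phrase hits could also be kept as features and fed to a classifier instead of a fixed threshold, once a few hundred sites have been labeled by hand.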

Machine learning / big data work is currently comparatively costly. This is a nice example of the kind of benefits that lowering its cost brings. I expect we'll see a lot more of this in the future in sales and marketing (and probably other areas, but sales and marketing are particularly easy to measure). One example is lead scoring. This is the process of working out which leads (potential customers) to pursue. Currently most people do this in an entirely ad-hoc way. E.g. download the whitepaper = 5 points, sign up to the newsletter = 10 points. It's crying out for simple statistical modelling and validation, but currently that's too expensive (in time, $s, and mental energy) for most people to undertake.

The only cost tied to these issues is the salary costs for a statistician/data scientist. The best tools of the trade are open source. You won't find lowering costs until more people are educated or trained up in these fields.

The biggest one in marketing is cross-channel attribution and valuing hard-to-measure channels like display, video, etc. Better data lets you avoid wasting dollars and better allocate your budget by properly assigning the weight of how much revenue contribution a given channel/campaign/placement/creative/etc. should receive.

This is arguably the toughest problem in the industry, and solutions by Google, Adobe, etc. are just starting to make headway, but are still very expensive and very custom with few exceptions (see the new data-driven attribution release for AdWords for example).

But many companies are making it available at a much less costly entry point. If you are interested in lower cost options check out hopdata.com (I work here), monkeylearn.com, algorithmia.com, Google Cloud Machine Learning, Amazon Machine Learning...

When I read the article, in the beginning I thought he was going to do linear regression; later I believed that it was going to be logical regression. In the end, it was classification with clustering.

I guess you meant logistic and not logical regression?

GP certainly did, but "logic regression" is a real thing:


Yes, you are correct.

Nice analysis, I remember combining tf-idf with many other NLP techniques for my final year project and it was super easy to implement with nlp_compromise's tokenize():

    nlp.tokenize(text, { dont_combine: true }).reduce(function(sentences, nlp, key){

      var words = nlp.tokens.reduce(function(tokens, token){
        // ...

      // ...


Awesome post! And a great example of how you can use machine learning to make salespeople's lives easier! Have you tried MonkeyLearn? You can easily create machine learning models on the fly, it has great tools to improve your models (like exploring which samples are creating confusion), and you don't have to worry about deploying the model on your servers, maintenance, etc.

Wonderful exercise and writeup.

Just fyi, you can usually buy lists of buyers in target companies. They exist for most markets, though it's possible such a list may not exist for your market. These lists will give you actual names and contact information, and are probably a more efficient way to contact potential buyers.

You might want to check out Clearbit Company API. As far as I remember, you can search companies by string using their API. https://clearbit.com/discovery

Edit: Added URL

I'd like to see this method compared to random sampling and something really simple like, say, probability matching via tinkering. Because to me it looks a bit too complex to work well.

There must be a lot of cost to this, not just in data processing, but the time to set this up. Still, I expect to see more and more ML projects until it becomes cheap.

No offense, but this is very very basic stuff (PhD in Statistics)...

wingcommander, can you let us know what you would have done differently to improve the accuracy?

I'm willing to bet most startups or businesses in general do not have a statistician on staff. Or even as a personal contact.

I kind of think the basic-ness is the point here. The article is mainly about identifying the opportunity, finding and cleaning the data, making some choices about what to feed into the model, and applying results.

Having worked in business-related mathematics (did an MS in industrial engineering/ops research), I have definitely noticed how critical the "non-technical" aspects are, and how much mileage you can get out of relatively basic stuff if you do those steps well (and how little mileage you get out of sophisticated stuff if you haven't).

you must be fun at parties

I didn't intend any disrespect.

I just think the title was a bit sensationalist.

When you know more than someone else does about a topic, the best way to comment on that topic here is to share some of what you know. Then we all learn.

Comments that are only dismissive, or are supercilious about knowing more than others, are deprecated here. It would be a good idea to read the following which describe what we're looking for on the site:



Oh, and welcome to Hacker News! (I'm a moderator here.)
