
Boosting Sales with Machine Learning - ingve
https://medium.com/xeneta/boosting-sales-with-machine-learning-fbcf2e618be3
======
stygiansonic
As someone with little experience in this area, I appreciated the step-by-step
examples of cleaning and transforming the data into a suitable vector format
for the training step. Very nice!

Is it typical that ~2000 samples (1000 leads, 1000 non-leads) is "enough"? I
see that accuracy on the training set was ~86% and that the results are
"starting to become useful for our sales team", but I would have thought more
samples would have been needed. (I guess that since you're always collecting
more real-world data you can continue to train the model and so it should get
better?)

~~~
taneq
What's "enough" depends entirely on what you're doing with it and how much
accuracy you require. In this case, 86% was enough to improve their batting
average and so was worth using.

If they were trying to detect pedestrians from a self-driving car then they'd
need a lot more accuracy and so a lot more training data.

~~~
visarga
86% accuracy should give a solid 7x speedup compared to manual selection, not
to mention that the algo could also pre-populate description fields and only
require a visual check instead of manual typing.

------
OhHeyItsE
What I find amazing is that they even need to process leads this way. That is,
their TAM (companies shipping over 500 containers a year) is something that's
already well-defined and readily available. Possibly even contained entirely
within the head of an experienced salesperson in that space.

~~~
chipsanddip
Hi, in the US this data is openly available as shipping manifests are public
info.

In the EU this is not public information, and as such there's no scalable way
other than experience to find out if a company ships 500+ TEU.

After nine months I can easily spot what stands out about companies that ship
500+ TEU from a description, but this is far faster.

------
hinkley
I think it's time for us to start building better systems to protect consumers
from this escalating arms race of advertising tools and trickery.

Some sort of little financial coach that asks you about purchase patterns,
your state of mind before and after, how you feel about the whole thing in
retrospect, and why it is you think you 'need' this thing right now.

If it turns out you compulsively buy baseball hats for the wrong team when
you're drunk, maybe your phone should ping you if you walk into a sporting
goods shop after you've been in a bar for three hours. Then it shows you a
picture of your daughter and reminds you that you PROMISED that this summer
you'd take her to Disney World.

~~~
duckmysick
We're talking about enterprise sales in the logistics and transportation
industry here. I doubt the final decision whether or not to buy this
particular freight rate benchmarking tool is being done on an impulse. There
are whole teams responsible for enterprise purchases who have already ripped
apart this offer and know every common sales trick in the book.

Also, I can't help but smirk at the thought of mentioning Disney a couple
sentences after talking about an "escalating arms race of advertising tools
and trickery". Perhaps the phone should have pinged before you made that
promise too.

~~~
hinkley
I feel like a lot of these articles have been popping up lately; this one just
brought me to the point of expressing my frustration with the whole notion of
trying to squeeze another 10% out of our customer base. Apologies for the
borderline non sequitur.

>We're talking about enterprise sales in the logistics and transportation
industry here. I doubt the final decision whether or not to buy this
particular freight rate benchmarking tool is being done on an impulse. There
are whole teams responsible for enterprise purchases who have already ripped
apart this offer and know every common sales trick in the book.

Within the software industry this is a cliche. Someone who hasn't touched code
in 15-infinity years makes a buy order for a demonstrably inferior product,
and we waste $500k in labor and overhead costs so that he doesn't look bad for
making a $200k order for solutions nobody wanted. Whether they intend to or
not, tools like these are going to pick up on patterns of weak judgement and
exploit them. It's really the same problem as with A/B testing.

> Also, I can't help but smirk at the thought of mentioning Disney a couple
> sentences after talking about an "escalating arms race of advertising tools
> and trickery". Perhaps the phone should have pinged before you made that
> promise too.

Haha. Touche. It was the first thing that popped into my head when trying to
think of a common social obligation that is difficult to fulfill if you can't
manage your finances.

~~~
duckmysick
If we take the broad context and not just this particular article, then yes, I
can understand the growing frustration. It's a sentiment shared by others.
There was an article here about Facebook's latest language understanding tool.
The comments focused mostly on the filter bubble and how tools like that fuel
advertising sales. Only a handful of comments were about the main topic.

------
brudgers
Reading the article, I couldn't help but wonder about the cost of doing
something like this for a typical small business...i.e. where this work had to
be done by a contractor. It looks like a big line item for a company where the
in-house tech expert knows Excel and how to plug in an ethernet cable.

It's not that I don't think it's useful, I just wonder about the ROI for cash
constrained businesses.

~~~
karterk
I'm by no means an expert at machine learning, but, given how organized the
scikit-learn libraries are, building a simple classifier as shown in the
example would be a few days of work at most. In fact, an initial first version
can be built within a day. After that, one has to tune the hyperparameters and
spend time on feature selection to improve the baseline accuracy.

The most important thing will be the training data. You need a good number of
samples, and the data also needs to be reasonably "clean".
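
That "first version within a day" really is short with scikit-learn. A minimal
sketch, with made-up example texts and labels standing in for real company
descriptions (the article's actual feature pipeline is more involved):

```python
# Bag-of-words features plus a random forest, the combination the article
# describes, wired together with a sklearn Pipeline. The training texts
# and labels here are invented for illustration.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

descriptions = [
    "global freight forwarding and container shipping services",
    "artisanal coffee roaster and cafe",
    "international logistics and ocean freight solutions",
    "neighborhood yoga studio and wellness center",
]
labels = [1, 0, 1, 0]  # 1 = qualified lead, 0 = not

model = make_pipeline(
    CountVectorizer(),  # text -> sparse word-count vectors
    RandomForestClassifier(n_estimators=100, random_state=0),
)
model.fit(descriptions, labels)

print(model.predict(["ocean container shipping company"]))
```

With real data you would of course hold out a test set before trusting any
accuracy number.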

~~~
brudgers
At US rates, that smells like a few tens of thousands of dollars. At the core
of my "concern" is that the magnitudes of turnover, margins, and the increase
in sales required to pay for the analysis make applying the idea uneconomical.

To put it another way, the business case feels weak for most [i.e. small]
profit-seeking enterprises.

~~~
up_and_up
"few days of work" for "tens of thousands of dollars" seems a bit absurd. Are
you assuming they are making 10K per day? Seems a bit high. I would assume
200-300/hour tops.

~~~
brudgers
At the rate of a couple of hundred bucks an hour that I'd expect to pay for a
qualified consultant, 200 hours works out toward the high end of the few in "a
few tens of thousands".

~~~
wlievens
How do you cram 200 hours into a few days?

~~~
LionessLover
Just like with the vast majority of project forecasts, the "few days" is what
you say to get the sale - internally or to outside clients. If you think it is
that simple, well, I would like to sell you just _a few days_ of machine
learning expertise if you have a project... :) Even very simple tasks that you
could let the intern do can - and often do - take days longer than projected.

------
ankeshanand
Bag of Words is not actually a great approach to understanding text, because
it ignores the semantics of the words. For example, 'hotel' and 'motel', which
are similar words, have completely different vector representations in the BoW
model.

A popular alternative is to use a distributed word embedding such as
word2vec[1], where similar words are grouped together in the vectorspace.

Edit: If there are few observations, like in this case, we don't need to train
the word2vec model on the dataset itself. We can use pre-trained word
embeddings such as the one publicly released by Google which was trained on
the Google News dataset.

[1][https://word2vec.googlecode.com/](https://word2vec.googlecode.com/)
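
The 'hotel'/'motel' point can be made concrete with cosine similarity. A toy
sketch - the 3-d "embedding" vectors below are invented for illustration (real
word2vec vectors have hundreds of dimensions):

```python
# In a bag-of-words model each word gets its own dimension, so any two
# distinct words are orthogonal; an embedding places similar words near
# each other in the vector space.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Bag-of-words: one-hot over the vocabulary ["hotel", "motel"]
bow_hotel = np.array([1.0, 0.0])
bow_motel = np.array([0.0, 1.0])
print(cosine(bow_hotel, bow_motel))   # 0.0 -- no similarity at all

# Hypothetical embedding vectors for the same two words
emb_hotel = np.array([0.71, 0.21, 0.48])
emb_motel = np.array([0.68, 0.25, 0.44])
print(cosine(emb_hotel, emb_motel))   # close to 1.0
```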

~~~
autokad
Grouping doesn't always yield better results, and I think it would probably do
pretty poorly in this case because there are so few observations.

Random Forest won among the models tried, but I wager a support vector machine
with a histogram kernel would do fantastically well.

~~~
ankeshanand
We could always use pre-trained word embeddings, the few observations won't
matter then.

~~~
autokad
This is very industry-specific; finding embeddings that are meaningful for the
data at hand seems like a long shot.

edit: I'm not saying don't try it, I certainly would! Shall we look at the
data on GitHub? Maybe we could have a whack at it.

------
geebee
Awesome post, thanks for writing it up!

I haven't been very involved in using random forest at work (yet, I hope to),
but I've done various mathematical programming work in the past to generate
business insights (mainly through regressions and linear
programming/optimization).

One thing that you make very clear through this blog post is how much value
comes from the "non-technical" aspects of mathematical decision analysis. You
have to see the application, find the data, clean the data, figure out what to
actually put into the model, and get results in a way that can lead to an
actual outcome with value.

Here's the thing: the reason I put "non-technical" in quotes is that it's
actually a mix of technical and non-technical. You need to be aware of how
these algorithms work and how they are implemented in order to have that
insight. There's the old statement that everything looks like a nail to
someone with a hammer, but knowing what tools are and what they can do can
help frame issues in a way that you can approach them from new angles. This is
why I do think it's worth learning various ML and other algorithms (like LP,
NLP, etc) through contrived examples - once you understand them, you'll start
to see the opportunity to apply them.

One last thing - kaggle. Kaggle is super fun, and I highly recommend it for
people looking for an opportunity to try this out and learn it. However, good
real world data science probably has less to do with making exceptional
refinements to models. You know that data set you get when you are doing a
kaggle competition? That's a huge amount of the actual work, right there.

You can do so much with basic RF and KNN (and with LP for that matter). This
post is a pretty good illustration of this.

Anyway, pretty cool, thanks for sharing.

~~~
selectron
As another plug for Kaggle - it lets you know what is state of the art. For
instance, from my Kaggle experience I know that gradient boosted decision
trees (specifically xgboost) are virtually always superior to random forests.
They are also basically just as easy to use, in contrast to neural nets which
are not as user friendly. The machine learning step was probably the easiest
step in this process, but it is also easy to leave simple gains on the table.
Gradient boosted decision trees don't get nearly enough hype.
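
Swapping in that baseline is a one-line change. xgboost's XGBClassifier exposes
the same fit/predict interface as scikit-learn estimators; the sketch below
uses sklearn's own GradientBoostingClassifier on synthetic data so it runs
without extra dependencies:

```python
# Compare a random forest and a gradient-boosted-trees baseline via
# cross-validation on synthetic classification data (a stand-in for
# real lead features).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

for clf in (RandomForestClassifier(n_estimators=100, random_state=0),
            GradientBoostingClassifier(random_state=0)):
    score = cross_val_score(clf, X, y, cv=5).mean()
    print(type(clf).__name__, round(score, 3))
```

Which one wins depends on the data and tuning; the point is only that trying
both costs almost nothing.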

~~~
bllguo
Yeah on Kaggle I found myself getting routinely outperformed by people with
scripts that simply ran a gradient boosted decision tree and nothing else. And
yet the topic was never mentioned in my modeling and stats courses!

~~~
cshenton
That's because it's practically new. Tianqi Chen authored the R package for it
(original release Aug 2015) and actually posts about it on the Kaggle forums
quite frequently.

------
rpedela
Slightly off topic, but I am curious whether Elasticsearch could be used instead for
the cleaning and transformation stages? You only need to configure your index
to get stemming, stop word removal, etc. It seems like it would take less time
to implement. You could also play with different TF/IDF algorithms by changing
the config and reindexing.
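
Roughly the index configuration being described, as a sketch. The analyzer and
field names are made up; `lowercase`, `stop`, and `porter_stem` are built-in
Elasticsearch token filters, and the mapping uses current Elasticsearch syntax:

```python
# Index settings (expressed as the JSON body you would PUT when creating
# the index): a custom analyzer that lowercases, drops stop words, and
# Porter-stems the description field.
import json

settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "description_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "stop", "porter_stem"],
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "description": {"type": "text", "analyzer": "description_analyzer"}
        }
    },
}
print(json.dumps(settings, indent=2))
```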

~~~
50CNT
Stemming and stop-word removal aren't that hard. It's probably less time to
set up, say, NLTK and write the 4-10 lines of Python required than to set up
Elasticsearch.
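
Roughly what those few lines look like, sketched here with toy stand-ins:
NLTK's PorterStemmer and its stopwords corpus would replace the crude
suffix-stripper and tiny stopword list used below:

```python
# Lowercase, drop stop words, and stem - the cleaning steps the comment
# refers to, in self-contained form.
STOPWORDS = {"the", "a", "an", "and", "of", "in", "to", "is"}

def crude_stem(word):
    # Strip a few common suffixes; a real stemmer handles many more cases.
    for suffix in ("ing", "ers", "er", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    tokens = text.lower().split()
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]

print(preprocess("The company is shipping containers to European ports"))
# ['company', 'shipp', 'contain', 'european', 'port']
```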

------
chipsanddip
Hey everyone, this is Edvard from Xeneta.

If anyone has any questions about our sales process and how we use this day to
day, fire away!

~~~
jschuur
Did you experiment with any other ways to get company descriptions than
FullContact? Their bio data from 'social profiles' seems a bit hit and miss or
sparse for the companies I tried it out on.

~~~
chipsanddip
Their social data isn't great, but it was the best alternative and their
simple API made us choose it. It must be noted that all of this is a thousand
times better than googling each individual company name.

~~~
Xorlev
If you're interested in giving our Company Search API (by name) a go, email me
michael[at]fullcontact.com and I'll hook you up.

We're constantly trying to improve our company data, social especially; stay
tuned for that. That said, I'd love to hear _any_ feedback you have at the
same address.

------
priyankt
I feel Naive Bayes works pretty well on text classification tasks like this.
You might want to give it a try.
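
A minimal naive Bayes sketch along those lines, using scikit-learn's
MultinomialNB with made-up example texts:

```python
# TF-IDF features feeding a multinomial naive Bayes classifier - a
# common, very fast baseline for text classification.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["ocean freight and container logistics",
         "boutique bakery and coffee shop",
         "global shipping and supply chain services",
         "local florist and gift shop"]
labels = [1, 0, 1, 0]  # 1 = qualified lead, 0 = not

nb = make_pipeline(TfidfVectorizer(), MultinomialNB())
nb.fit(texts, labels)
print(nb.predict(["container shipping company"]))
```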

~~~
visarga
Vowpal Wabbit is also pretty easy to use and fast. I train and test on 100K
text examples in under 1 minute. Works better than random forests and other
things.

------
synweap15
Many companies think that machine learning techniques are reserved for the big
guys only - wrong! I have some interesting cases where a special offer is
presented to chosen clients based on multiple variables and patterns, and this
approach makes some nice dollars each day. Another example: an anti-fraud
system working in a quite niche domain, saving about a thousand dollars each
day. Real fun starts when you look at your data from a different perspective,
and most of the time it's worth the hassle!

~~~
visarga
Many times a simple logistic regression or SVM could do wonders, especially on
datasets <100K examples. It's a matter of being aware such applications are
possible. The code is usually less than a couple hundred lines, but takes some
experimentation to get it right.

------
tegansnyder
Interesting read. I was speaking with some colleagues just yesterday about a
potential pet project to identify which of our customers have eCommerce
websites.

The concept would involve processing millions of company names found in the
"Bill TO" field of sales records, then using those records to populate an
Elasticsearch index for use with the Graph Query API, to help further
normalize/dedup the company names that share similar string semantics. The
next stage of the process would be to scan the normalized, dedupped list of
company names and attempt to locate each company's website URL by crawling the
first page of Google search results. This would need to be metered, because I
assume Google would block me if I made rapid-fire requests. After gathering a
list of company URLs, the plan would then shift gears into attempting to
identify whether any of the companies' websites contain the typical components
that make up an eCommerce website. Think searching the HTML for all variations
of "add to cart", "shopping cart", "my account", etc.
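
The normalize/dedup step could be prototyped with nothing but the standard
library. A sketch using difflib as a stand-in for the fuzzy matching an
Elasticsearch graph query would provide (company names invented):

```python
# Normalize company names by stripping common legal suffixes, then
# group names whose normalized forms are nearly identical.
from difflib import SequenceMatcher

def normalize(name):
    name = name.lower().strip()
    for suffix in (" inc.", " inc", " llc", " ltd", " co."):
        if name.endswith(suffix):
            name = name[: -len(suffix)]
    return name.strip()

def similar(a, b, threshold=0.85):
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold

names = ["Acme Widgets Inc.", "ACME Widgets LLC", "Acme Widgets", "Bolt Fasteners Co."]
groups = []
for name in names:
    for group in groups:
        if similar(name, group[0]):
            group.append(name)
            break
    else:
        groups.append([name])

print(groups)
```

At "millions of names" scale you'd need blocking/indexing rather than pairwise
comparison, which is where Elasticsearch earns its keep.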

------
noelwelsh
Machine learning / big data work is currently comparatively costly. This is a
nice example of the kind of benefits that lowering its cost brings. I expect
we'll see a lot more of this in the future in sales and marketing (and
probably other areas, but sales and marketing are particularly easy to
measure). One example is lead scoring: the process of working out
which leads (potential customers) to pursue. Currently most people do this in
an entirely ad-hoc way. E.g. download the whitepaper = 5 points, sign up to
the newsletter = 10 points. It's crying out for simple statistical modelling
and validation, but currently that's too expensive (in time, $s, and mental
energy) for most people to undertake.
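
The ad-hoc scheme described above fits in a few lines, which makes clear what
a fitted model would replace: the hand-picked weights. A sketch (the
`requested_demo` action and its point value are invented to extend the
comment's example):

```python
# Ad-hoc lead scoring: hand-picked points per action. A logistic
# regression fit on actual conversion data would learn these weights
# instead of someone guessing them.
POINTS = {
    "downloaded_whitepaper": 5,
    "newsletter_signup": 10,
    "requested_demo": 25,  # hypothetical extra action
}

def score(lead_actions):
    return sum(POINTS.get(action, 0) for action in lead_actions)

print(score(["downloaded_whitepaper", "newsletter_signup"]))  # 15
```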

~~~
dandermotj
The only cost tied to these issues is the salary of a statistician/data
scientist; the best tools of the trade are open source. You won't see costs
fall until more people are educated or trained up in these fields.

------
feylikurds
When I read the beginning of the article, I thought he was going to do linear
regression; later I believed it was going to be logical regression. In the
end, it was classification with clustering.

~~~
190807
I guess you meant logistic and not logical regression?

~~~
XFrequentist
GP certainly did, but "logic regression" is a real thing:

[http://kooperberg.fhcrc.org/logic/](http://kooperberg.fhcrc.org/logic/)

------
franciscop
Nice analysis, I remember combining tf-idf with many other NLP techniques for
my final year project and it was super easy to implement with nlp_compromise's
tokenize():

    
    
    nlp.tokenize(text, { dont_combine: true }).reduce(function(sentences, nlp, key){
      var words = nlp.tokens.reduce(function(tokens, token){
        // ...
      });
      // ...
    });

------
feconroses
Awesome post! And a great example of how you can use machine learning to make
salespeople's lives easier! Have you tried MonkeyLearn? You can easily create
machine learning models on the fly, it has great tools for improving your
models (like exploring which samples are causing confusion), and you don't
have to worry about deploying the model on your servers, maintenance, etc.

------
not_that_noob
Wonderful exercise and writeup.

Just fyi, you can usually buy lists of buyers in target companies. They exist
for most markets, though it's possible such a list may not exist for your
market. These lists will give you actual names and contact information, and
are probably a more efficient way to contact potential buyers.

------
gsharma
You might want to check out Clearbit Company API. As far as I remember, you
can search companies by string using their API.
[https://clearbit.com/discovery](https://clearbit.com/discovery)

Edit: Added URL

------
juskrey
I'd like to see this method compared to random sampling and to something
really simple like, say, probability matching via tinkering. Because to me it
looks a bit too complex to work well.

------
taranw85
There must be a lot of cost to this, not just in data processing but in the
time to set it up. Still, I expect to see more and more ML projects as this
becomes cheaper.

------
wingcommander
No offense, but this is very very basic stuff (PhD in Statistics)...

~~~
instakill
you must be fun at parties

~~~
wingcommander
I didn't intend any disrespect.

I just think the title was a bit sensationalist.

~~~
dang
When you know more than someone else does about a topic, the best way to
comment on that topic here is to share some of what you know. Then we all
learn.

Comments that are only dismissive, or are supercilious about knowing more than
others, are deprecated here. It would be a good idea to read the following
which describe what we're looking for on the site:

[https://news.ycombinator.com/newsguidelines.html](https://news.ycombinator.com/newsguidelines.html)

[https://news.ycombinator.com/newswelcome.html](https://news.ycombinator.com/newswelcome.html)

Oh, and welcome to Hacker News! (I'm a moderator here.)

