My app, 7 Second Meditation, is solid 5 stars, 100+ reviews because I use bandits to optimize my content.
By having the system automatically separate the wheat from the chaff, I am free to just spew out content regardless of its quality. This allows me to let go of perfectionism and just create.
There is an interesting study featured in "Thinking Fast and Slow" where they had two groups in a pottery class. The first group's entire grade was based on the creativity of a single piece they submitted. The second group was graded only on the total number of pounds of clay they threw.
The second group crushed the first group in terms of creativity.
Moreover, the book "Thinking Fast and Slow" does not contain the word "pottery" (nor "ceramics").
Please excuse me for lumping Kahneman, Tversky and Taleb into the same bin (stating that beforehand; feel free to dismiss this or form your own opinions because of it). My justification is that they cite each other all the time, write about the same topics, and are quoted as doing their research by "bouncing ideas off each other" (only to later dig up surveys or concoct experiments to confirm these ideas; psych and the social sciences should take some cues from psi research and do pre-registration).
This is now the second time I've noticed that one of their anecdotes posed as "research" doesn't quite check out to be as juicy and clear-cut (or even to describe the same thing) as the citation given.
The other one was the first random cite I checked in Taleb's Black Swan: a statistics puzzle about a big and a smaller hospital and the number of baby boys born in them on a specific day. The claim was that research (from a meta-study by Kahneman & Tversky) showed a large percentage of professional statisticians getting the answer to this puzzle wrong. Which is quite hard to believe, because the puzzle isn't very tricky at all. Checking the cited publication (and the actual survey it summarized), it turns out it was a much harder statistical question, and it wasn't professional statisticians but psychologists at a conference who got the answer wrong.
Good thing I already finished the book before checking the cite. I was so put off by this finding, I didn't bother to check anything else (most papers I read are on computational science, where this kind of fudging of details is neither worth it, nor easy to get away with).
Which is too bad because the topics they write about are very interesting, and worthwhile/important areas of research. I still believe it's not unlikely that a lot of their ideas do in fact hold kernels of truth, but I'm fairly sure that also a number of them really do not. And their field of research would be stronger for it to figure out which is which. Unfortunately this is not always the case for the juicy anecdotes that sell pop-sci / pop-psych books.
Fooled by Randomness also doesn't needlessly bash people for wearing ties or being French, like Black Swan does :) ... Only later did I learn that Taleb was a personal friend of Benoit Mandelbrot (who was French), which put it in a bit more context (a friendly poke, I'm assuming), but at the time I found it a bit off-putting and weird, as it really had nothing to do with the subject matter at all.
In the past I have used the share rate to optimize, but I've realized that retention is more important.
e.g. (guesses here):
- new notification types:
  "think for 9 seconds about a loved one"
  "what was your favorite vacation, go there"
- if people act -> reward and reuse
- if not -> reuse less (shown only by random chance)
- an alternative to notifications could be the "completed meditation" screen
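A sketch of what "reward and reuse / reuse less" could look like as a Beta-Bernoulli (Thompson sampling) bandit. The message texts are the guesses above; the class and field names are invented for illustration, not the app's real implementation:

```python
import random

class ContentBandit:
    def __init__(self, messages):
        # one (successes, failures) pair per message, starting at a Beta(1, 1) prior
        self.stats = {m: [1, 1] for m in messages}

    def pick(self):
        # sample a plausible act-rate for each message, send the best draw;
        # weak messages still get shown occasionally, by random chance
        draws = {m: random.betavariate(a, b) for m, (a, b) in self.stats.items()}
        return max(draws, key=draws.get)

    def record(self, message, acted):
        # acted -> reward and reuse; ignored -> reused less, automatically
        self.stats[message][0 if acted else 1] += 1

bandit = ContentBandit([
    "think for 9 seconds about a loved one",
    "what was your favorite vacation, go there",
])
msg = bandit.pick()
bandit.record(msg, acted=True)
```

The nice property is that culling happens on its own: messages that never get acted on drift toward low sampled rates and stop being picked, without anyone judging them by hand.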
I feel like that works partly because an important part of practice is the feedback loop between continually practicing and having a sense of whether you did well or not.
Your strategy of not evaluating your own work sounds a bit like mushing clay into shapes with a blindfold on and then tossing it in the kiln before you even check whether it's shaped like a pot. The users can sort through them later!
If the end goal is just ending up with a volume of work that's been culled down to the better ones, I guess you still get that. But it's inherently different from the Thinking Fast and Slow example where they're in a class and the goal is to learn and get better, rather than see who's made the nicest pot by the end of the semester.
We don't know from this study whether the second group was more or less likely to get trapped in the Expert Beginner phase of development.
You definitely don't get anywhere without practice, but you are likely to get nowhere fast without theory.
So then the "Expert Beginner phase" = a local minimum, which might be reached quickly compared to a more systematic search (practice informed/infused with theory). Once in that local state, the "expert beginner" might be unable to escape because of over-reliance on small, poorly guided steps; the real prize, anyway, is the global minimum.
It's a great app, but were the ratings lower in the beginning, before the optimization? How do you know the optimization helped the ratings? I'm asking because it seems the app could have good ratings regardless of the content optimization, because it's a "feel good" app. Are there counterexamples of other meditation apps where the UI is good but the reviews are bad because of low-quality content?
Oh, that's a great goal to have...
"This app gives a text reminder to do what everyone wants to do: relax, love one's self and others, and bring peace and light into the world. I smile when the alert message appears and feel grateful to the makers of this app for creating a pleasant, quick way to meditate at the most stressful point of my day. I have recommended it to many people"
You have to do that sort of "curation" anyways anytime you make something. You have to continually decide whether it's worth it to keep working on something, and then decide if it's good enough to release. People tend to be pretty bad judges of this, especially for more creative tasks (there are many examples of someone's most famous song/book/painting not being their own favorite, or even being their least favorite!).
So why not relieve some of that stress, and let the users pick what they really like, instead of you guessing for them?
Efficiency is great, but there still needs to be some quality standards.
He might have leaned toward a little bit of hyperbole in "just spew out content regardless of its quality" for the purposes of clarity and expressiveness, which most of the readers have correctly parsed, but I don't think you have the right to tag orasis' output as shit based solely on the information of the page. That's a little bit rude, in fact.
For small changes, like changing the color or appearance of a button, the difference in conversion rate is not measurable. Maybe if you can test with traffic in the range of >100K unique visitors (from the same sources), you can say with confidence which button performed better.
But how many websites / apps really have >100K uniques? If you have a long-running test just to gather enough traffic, chances are some other factors have changed as well, like the weather, weekdays / weekends, time of month, etc.
And if you have <100K uniques, does the increase in conversion pay for the amount of time you have invested in setting up the test?
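As a rough sanity check on those numbers, here is a back-of-the-envelope sample-size calculation for a two-proportion test (alpha = 0.05 two-sided, 80% power). The 2% baseline conversion rate and the lifts are illustrative assumptions, not data from this thread:

```python
from math import ceil, sqrt

def n_per_variant(p_base, p_new, z_alpha=1.96, z_power=0.84):
    # standard two-proportion sample-size formula: pooled term for alpha,
    # unpooled term for power, divided by the absolute difference squared
    p_bar = (p_base + p_new) / 2
    delta = abs(p_new - p_base)
    num = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
           + z_power * sqrt(p_base * (1 - p_base) + p_new * (1 - p_new)))
    return ceil((num / delta) ** 2)

p0 = 0.02  # assumed 2% baseline conversion rate
for lift in (0.10, 0.25, 0.50):  # relative lift over baseline
    print(lift, n_per_variant(p0, p0 * (1 + lift)))
```

At a 2% baseline, detecting a 10% relative lift needs on the order of 80,000 visitors per variant, which is roughly in line with the >100K figure above; only big lifts are detectable on small sites.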
In my experience, only when you test some completely different pages will you see some significant differences.
I have been doing design optimization for nearly a decade now and virtually every example I've seen made public is incredibly poor. Sometimes they do not even include any numbers, but just say: look, first I had a 1% conversion rate and now I get 1.2%! I wish I had a good example of an optimization case study, but I can't recall the last time I saw one.
Without giving anything away, I have done tests this year which required over a million impressions to find out in a statistically significant way not to make any changes.
My general theory is that a never optimized design with a confusing UI has a lot of low hanging fruit. You start cleaning up the bad elements and conversion rates can double or triple. Even if the sample size is less than ideal, these really big pops will be apparent.
Design knowledge is better now than it was in 2003. Mobile forces designers to use one column and leave out a lot of crap. There are a lot of good examples that get copied and good out of the box software. That means when you start optimizing, the low hanging fruit is gone and you need really big sample sizes. Once the low hanging fruit is gone, often those big samples just tell you not to make any changes.
Thinking about free-to-play mobile games recently (an area I have no experience in.) If 1% or fewer of users are converting you really do need a huge install base to beat your competitors at optimization. You need millions of users just to get to 20,000 or 30,000 paying players to test behavioral changes on. That means there actually is some staying power for the winners, at least for a while.
You have better things to do.
They tend to forget that Google has a billion unique users per day, and their website a few hundred.
Here's the results of me changing an "add to cart" button from a branded looking maroon button to a simple yellow (Amazon style) button: http://cl.ly/0d440I3T333m
That's 26% sales increase from changing a button's color.
If you've got a good eye for usability, your intuition is going to lead to a lot of fun and great results with A/B testing. If not, you'll futz with things that make no difference most of the time and eventually give up. Experimentation is not about testing random permutations (unless you've got an infinite amount of time). It's about coming up with a reasonable hypothesis and then testing it. I study a web page until I come to a conclusion in my mind that something could really use some improvement. Then, even though I'm sure of it in my mind, I test it, because "sure of it in my mind" is wrong about half the time.
One note to those who use Optimizely: you do need to wait at least twice as long as Optimizely thinks, because statistical significance can bounce around with too little traffic. Optimizely thought the above test was done at about 2000 views, which was far too few results to be conclusive, with only ~20 sales.
Assuming the usual sqrt(n+1) error on a count, since n is low: combining the uncertainties on the ratio 83/66 gives 1.258 ± 0.231 (1 sd). It's only just likely that it is an actual improvement; there is about a 30% chance that the original was better.
If we also include the errors on the total counts and use the ratio (83/6362) / (66/6392), the result is an improvement of 1.264 ± 0.234. Statistically: nowt.
using Distributions  # provides Beta and rand

# Posterior for each button's conversion rate, with a uniform Beta(1, 1) prior
b_old = Beta(66 + 1, 6392 - 66 + 1)
b_yel = Beta(83 + 1, 6362 - 83 + 1)
N = 1_000_000
# Sample from both posteriors; the fraction of draws where the old button
# beats the yellow one estimates P(old better than yellow)
sum(rand(b_old, N) .> rand(b_yel, N)) / N
I did an A/B test on an older framework which didn't automate statistical significance at all, but the website was getting more than 2000-3000 orders per day, so after a single week we had enough data to determine that sales had increased by 36% (we reduced the page's load time by almost half, changed the checkout to use Ajax, and made a few other small changes) without the need to quantify things. In fact, at the time, I didn't even know what "statistical significance" was... not that I know too much more about statistics now than I did then.
Anyway, in theory all of the exact p values, etc. matter, but in practice, the bottom line is all that matters, because the p values can change in a moment based on something I haven't factored in. That's where, at least for the time being, intuition still plays a great part in being actually correct, which is why we still have people with repeated successes.
So: probably better, but should have run it longer.
I wanted Optimizely to say it was 100% significant for a full week of it running before I ended the test, but the chart was interesting to me, because the conversion rate difference between the two remained the same for the entire test, rather than there being a specific period where "yellow" excelled.
That's not how statistical significance works...
They'll need to reeducate their statisticians right away!
Optimizely is in a rough spot. People don't like having to think through experimental design, and they are really, really bad at reasoning about p-values. To try to fix the people part, they came out with the sequential stopping rule stuff (their "stats engine"), but they never really published much justifying it. The other alternative would be to move the experiments into a Bayesian framework, but that has a lot of its own problems. When they acquired Synference, that was one of the likely directions to take (along with offering bandits), but that didn't work out and those guys have since left.
Optimizely has a bandit-based 'traffic auto-allocation' feature in production on select enterprise plans; bandits are excellent in a wide range of situations and have many advantages, but like anything, they have design parameters and some caveats you have to be aware of to make sure you are using them effectively.
On Frequentist and Bayesian:
Optimizely's stats engine combines elements of both Frequentist and Bayesian statistics. They have a blog post that tries to touch on this issue.
But this is subtle stuff - and there are a lot of trade-offs, and different perspectives; look at the Bayesian/frequentist debate which has been going on for decades among statisticians.
But, FWIW, I definitely saw Optimizely as an organisation make a big investment to produce a stats engine which had the right trade-offs for how their customers were trying to test; and I think the end result was way more suitable than 'traditional' statistics were.
"Traffic Auto-allocation automatically adjusts your traffic allocation over time to maximize the number of conversions for your primary goal. [...]
To learn more about how algorithms like this work, you might want to read about a popular statistics problem called the “multi-armed bandit.”"
"Yet as we developed a statistical model that would more accurately match how Optimizely’s customers use their experiment results to make decisions (Stats Engine), it became clear that the best solution would need to blend elements of both Frequentist and Bayesian methods to deliver both the reliability of Frequentist statistics and the speed and agility of Bayesian ones."
I didn't realize that the auto-allocation ever shipped, but I'm glad it finally did. Hopefully there was work done to empirically show that they solved a lot of the issues around time to convert and other messy parts of the data that killed earlier efforts, but I think everyone who knew about those was gone before you joined :)
There are very subtle issues with both frequentist and Bayesian stats, which makes combining them sound insane to me.
What are you up to these days?
Not that I'm trying to defend Optimizely (I'm not a huge fan, but for other reasons...).
I can't vouch for the quality either, but they did publish something about it - that at least looks quite scientific. Happy to read any critique of course.
They are still having people make very fundamentally flawed assumptions about the data, which results in incorrect conclusions, and they are still not presenting the results in a way that people correctly interpret them. That being said, those are really hard to solve, and models that would try to correct for them would likely require a lot more data and be overly conservative for more people.
What are your reasons for disliking Optimizely?
My issue with Optimizely is mainly how they essentially ditched us as (paying) customers. We were admittedly small fish, but we were paying and were willing to pay more; then they switched to Enterprise-vs-Free without any middle ground. Enterprise was way too expensive for us. Free didn't include essential features, so we were just stuck.
Should have everything you would ever want to know about the method.
I agree with you that the problem of inference and interpretation between A/B data, algorithms, and the people who make decisions from them is a hard one and worth working on.
That said, I do think the two sources of error our stats engine addresses - repeatedly checking results, and cherry-picking from many metrics and variations - did make progress in having folks correctly interpret A/B tests. This did result in more conservative results, but the benefit was that the variations that do become significant are more trustworthy. I think this was absolutely the right tradeoff to make for our customers, and trustworthiness is a pretty important aspiration for stats/ML/data science in general.
Of course I did write the thing, so I'm not very impartial.
Disclaimer: I do stats work at VWO, an Optimizely competitor.
(Also if you want to read our tech paper, here it is: https://cdn2.hubspot.net/hubfs/310840/VWO_SmartStats_technic... This describes our Bayesian approach, which we believe to be less likely to be wrongly interpreted by non-statisticians.)
Years ago my team's statistician did a competitive review of various AB test apps, and reported various ways in which the UIs make statistically invalid statements to the user.
Being smug and condescending really backfires when you don't know what you're talking about.
> Being smug and condescending really backfires when you don't know what you're talking about.
How's that working out for you?
I'd serve them at the same time, randomize which clients see which one.
Simple example: you're comparing your standard site's theme with a new pumpkin orange theme. In October, the pumpkin orange theme might go great, whereas in December, it might perform worse. There's clearly a seasonality interaction you need to account for.
What you're thinking of is more like testing e-commerce button colors right after Christmas. After Christmas, it's likely that all your groups perform worse, but equally so.
You're now running an A/B for 1 week with option A and another version, option B. You get a conversion rate of 6.5% for option A and 5.9% for option B. Normally, you'd say that 6.5% is better than 5.9%. But how sure can you be if you don't control the other factors and both are performing better than before? How many visitors do you need to offset the influence of offsite factors?
In your case, option A started doing a lot better when you started running the test, which is moderately surprising, so maybe you messed something up in user assignment, tracking, or something. But if your framework is good, then it sounds like an external change. You still want to look at how A's rate and B's rate compare for the time period when you were randomizing, so 6.5% vs 5.9%, with the earlier 5.7% being more or less irrelevant.
You do need to do significance testing to see how likely you would be to get this wide a difference by chance. The easiest way to do that is to enter your #samples and #conversions into a significance calculator, and the p-value tells you how likely this is (assuming you didn't have any prior reason to expect one side of the experiment to be any better).
Like this one: https://vwo.com/ab-split-test-significance-calculator/ , but multiply the p-values it gives you by 2 to account for them incorrectly using a one-tailed test.
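For reference, the arithmetic such a calculator performs is just a pooled two-proportion z-test. A minimal version, using the 83/6362 vs 66/6392 button counts quoted upthread purely as example inputs:

```python
from math import erf, sqrt

def two_tailed_p(conv_a, n_a, conv_b, n_b):
    # pooled two-proportion z-test, two-tailed
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = abs(p_a - p_b) / se
    # convert |z| to a two-tailed p-value via the normal CDF
    return 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))

print(round(two_tailed_p(83, 6362, 66, 6392), 3))
```

For those counts it comes out around p ≈ 0.15 (two-tailed), which matches the "probably better, but should have run it longer" reading above.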
It seems like you could "bucket" results by week, to get some kind of date factor considered. Of course it's also possible your Christmas creatives are simply garbage and will never get promoted, even when they are in-season. ;)
For example - optimizing a content site for AdSense; many folks would gravitate to AdSense $$ as the target metric, which is admittedly an intuitive solution (since that's how you're ultimately getting paid).
But if you think about it....
AdSense Revenue =
  (1 - Bounce Rate) × Pages/Visit × % Ads Clicked × CPC
Bounce rate is a binomial probability with a relatively high rate p (15%+), so you can get statistically solid reads on results with a relatively small sample.
Pages / Visit is basically the aggregate of a Markov chain (1 - exit probability); also relatively stable.
% ads clicked - a binomial probability with a low p; large samples become important.
$ CPC - so the ugly thing here is there's a huge range in the value of a click... often as low as $.05 for a casual mobile phone click, or $30 for a well-qualified financial or legal click (think retargeting, with multiple bidders). And you're usually dealing with a small sample of clicks (since the average CTR is very low), so there's HUGE natural variation in results. Oh, and Google likes to penalty-price sites with a large, rapid increase in click-through rate (for a few days), so your short-term CPC may not resemble what you would earn in steady state.
So while it may make ECONOMIC sense to use $ RPM as the test metric, you've injected tremendous variation into the test. You can accurately read bounce rate, page activity, and % click-through on a much smaller sample, and feel comfortable making a move if you're confident nothing major has changed in terms of the ad quality (and CPC value) you will get.
The problem I frequently wonder about is that you have to assume independence of the stable variables to be comfortable testing them. In reality, the bounce rate of the people who make you lots of money is probably driven by different factors than the bounce rate of the overall population.
I guess what you should really do is optimize the bounce rate / pages per visit / etc. for just the population of people that could make you money, but you don't typically have access to that information.
As the parent poster says, it's best to focus on individual funnel steps that provide fast feedback, at least initially.
Once the whole funnel is optimized (does this ever happen?), you could start feeding in end-to-end $$ metrics.
From the first link below: "They do make a compelling case that A/B testing is superior to one particular not very good bandit algorithm, because that particular algorithm does not take into account statistical significance.
However, there are bandit algorithms that account for statistical significance."
Even in the tests shown, conversion rate was higher for the MABA algorithms than simple A/B testing. "Oh but you get higher statistical significance!" thanks, but that doesn't pay my bills, conversion pays.
Those tables are showing the conversion rate during the test, up to the time when statistical significance is achieved. You generally then stop the test and go with the winning option for all your traffic.
In the two-way test where the two paths have real conversion rates of 10% and 20%, all of the MAB variations did win. Here is how many conversions there would be after 10000 visitors for that test, and how they compare to the A/B test:

  MAB-10  1997  +9
  MAB-24  2001  +13
  MAB-50  1996  +8
  MAB-90  1994  +6

And for the second test:

  MAB-10  1969  -18
  MAB-50  1987  +0
  MAB-77  1988  +1

(The third column in the above two tables remains the same if you change 10000 to something else, as long as that something else is the same for both runs. MAB-10 beats RAND in the first test by 9 conversions, and loses by 18 conversions in the second test.)
Just a note, don't literally do this:
I want an A/B testing tool that won't let you see results until it's done.
This suggests to me that, similarly to a lot of algorithms, you might want to change your parameters during training.
So start with MAB-100 (RAND) and then decrease that percentage over time.
I find it funny when some people think A/B (or MAB) testing will solve major usability problems.
You can definitely try genetic-programming your website to conversion; it's probably going to be fun to watch.
"10% of the time, we choose a lever at random. The other 90% of the time, we choose the lever that has the highest expectation of rewards."
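That quoted rule is only a few lines of code. A sketch with two levers whose true rates (10% and 20%) mirror the test discussed above (the seed, pull count, and function name are arbitrary choices for the example):

```python
import random

def epsilon_greedy(true_rates, pulls=10000, epsilon=0.1, seed=42):
    rng = random.Random(seed)
    counts = [0] * len(true_rates)  # pulls per lever
    wins = [0] * len(true_rates)    # rewards per lever
    total = 0
    for _ in range(pulls):
        if rng.random() < epsilon or not any(counts):
            # explore: pick a lever at random (10% of the time)
            arm = rng.randrange(len(true_rates))
        else:
            # exploit: pick the lever with the best observed mean reward
            arm = max(range(len(true_rates)),
                      key=lambda i: wins[i] / counts[i] if counts[i] else 0.0)
        reward = 1 if rng.random() < true_rates[arm] else 0
        counts[arm] += 1
        wins[arm] += reward
        total += reward

    return total, counts

conversions, counts = epsilon_greedy([0.10, 0.20])
```

Run long enough, the better lever ends up with the large majority of the pulls, while the 10% exploration keeps checking the other one in case the world changes.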
For example, let's say over time the percentage of your traffic that comes from search engines increases, and this traffic converts better than your other traffic. And let's say at the same time, your yellow button gets picked more often than your green button.
This will make it look like the yellow button performs better than it actually does, because it got more views during a time when the traffic was better.
This can drive your website in the wrong direction. If the yellow button performs better at first just by chance, then it will be displayed more and more. If at the same time the quality of your traffic improves, that makes it look like the yellow button is better, while in reality it might be worse.
In the end, the results of these kinds of adaptive strategies are almost impossible to interpret.
By sending 90% of traffic towards yellow, its ratio will normalize towards the 50% once it has enough traffic. By sending 10% of traffic randomly, eventually the green option will reach 51% and start taking a majority of traffic, which will then cause it to normalize at its 65%, and be shown to a majority of users.
I think the problem might be, if you run this with a sufficiently high volume or for a long period of time, that if a behaviour change takes place it will take a long time to learn the new behaviour. Or, if two options aren't actually different, it may continually flip back and forth between the two options.
Also, to me, the concept of A/B testing certain things may have undesired consequences. For example, I order from Amazon every day, but today the buy button is blue; what does that actually mean? And I go back to the site later and it's yellow again. There are still many people who get confused by seemingly innocuous changes in the way their computer interacts with them.
Proper A/B tests are supposed to be done on a per-unique-user basis. If you access the shop from the same device or user account, a well-done A/B test should consistently show you the same interface.
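One common way to get that consistency is to derive the assignment from a hash of a persistent user id rather than a fresh random draw per pageview. The experiment name and 50/50 split here are made-up examples:

```python
import hashlib

def variant(user_id, experiment="buy_button_color", split=0.5):
    # hash the (experiment, user) pair so each experiment buckets independently
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    # map the first 32 bits of the digest to a roughly uniform value in [0, 1]
    bucket = int(digest[:8], 16) / 0xFFFFFFFF
    return "A" if bucket < split else "B"

assert variant("user-123") == variant("user-123")  # same user, same arm, every visit
```

Cookie-based assignment works too, but hashing a stable id survives cookie expiry and works the same across devices that share a login.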
Submission date is May 30, 2012.
First of all, the suggested approach isn't always practical. Imagine that you are testing an overhaul of your website. Do you want daily individual visitors to keep flipping back and forth as the probabilities change? I'm not sure if the author is really suggesting his approach would be a better way to run drug trials, but that's clearly ridiculous. You have to recruit a set of people to participate in the study, and then you obviously can't change what drug you're giving them during the course of experiment!
Second, it ignores the time value of completing an experiment earlier. In the exploration/exploitation tradeoff, sometimes short-term exploitation isn't nearly as valuable as wrapping up an experiment so that your team can move on to new experiments (e.g., shutting down the old website in the previous example). If a company expects to have a long lifetime, then, over a time frame measured in weeks, exploration will likely be relatively far more valuable.
For instance, A/B testing with a 50-50 split has been baked into "business rules" frameworks from about as long ago as the multi-armed bandit has been around, but nobody in that community has ever heard of the multi-armed bandit. Meanwhile, machine learning people are celebrating the performance of NLP systems they build that are far worse than the rule-based systems people were using in industry and government 15 years ago.
The statement is odd for two reasons. One is that plenty of NLP is rule-based; the other is that NLP isn't a form of A/B testing, which is the overall topic here.
That was not my experience. I have been researching chat bots and it would seem I can hardly find one implemented with machine learning but instead almost all are rules based. I was quite disappointed. ML for NLP is just gearing up.
If you're constantly changing the button color, size, location, whatever... that is an awful experience in and of itself, is it not? If the Amazon "buy now" button changed size / shape / position / color every time I went to buy something, I would get frustrated with it pretty quickly.
Their experience will still change once their cookie expires, but that amount of time is completely under your control.
When your site is constantly changing in other ways, dynamically changing odds can cause a skew because you could give more of A than B during a dependent change, so you have to normalize for that somehow. A/B testing doesn't have this issue because the odds are constant over time.
Often it is a mix - you might use a bandit approach with your predictive targeting, but you also should A/B test the impact of your targeting model approach vs a current default and/or a random draw. See slides 59-65: http://www.slideshare.net/mgershoff/predictive-analytics-bro...
For a quick bandit overview check out:
It sorta-kinda makes sense for the fancy stream of comments but still... why is it a requirement!?
If you are worried about it breaking in some browsers because you are modifying a tag that has not yet had its content completed so isn't accessible in the DOM yet, or because you use a DOM manipulation library that hasn't loaded yet due to lazy loading, have the script add an extra wrapper instead of modifying the existing tag [i.e. <script>document.write('<div id="iamanidiot" style="visibility:hidden">');</script> directly after <body> and <script>document.write('</div>');</script> before </body>]. Or, of course, just don't...
visibility: visible !important;
Or that they have kept with the times and understand the notion of opportunity cost.
>Progressive enhancement is not hard.
No, just redundant and for marginal benefit. The train has long left that station.
Search Engines (well, those that matter) and screen readers for a11y both work with JS.
That's not keeping up with times, that's poisoning the well because others do it too. By making such choices a site author is making the Internet worse for everyone.
I'm starting to think there should be some kind of "how not to be an ass on the Internet" course that would be mandatory before you're allowed to do business on-line.
It's a required part of modern web sites -- and it doesn't matter whether it's "really needed" for any particular site or not.
edit: Ah, 'ReadingInBed beat me to it. tl;dr: Bandit approaches might not _always_ be the best, and they tend to take longer to reach a statistically significant result.
It's short, but it covers all the major variations.
For example, if your metric of success is that someone retains their monthly membership to your site, it will take a month before you start getting any data at all. At that point, in theory almost all of your users should already be allocated to a test because hopefully they visited (and used) their monthly subscription at least once. So it would be a really bad experience to suddenly reallocate them to another test each month.
I ran a few basic A/B tests on some handscanner software used in large warehouses. The basic premise is that the user is being directed where to go and what items to collect. The customer wanted to know how changes to font size and colour of certain text would improve overall user efficiency. But the caveat was that we had to present a consistent experience to the user- you can't change the font size every 10 seconds or it will definitely degrade user experience!
My point is that it sounds as though the multi-armed bandit will probably work great provided the test is short, simple, and the choice can be re-made often without impacting users.
For statistical power... "Statistical Power Analysis for the Behavioral Sciences" by Jacob Cohen.
It seems the conformal prediction link has some relation. I will dig, thanks.
A/B testing is comparing the performance of alternatives. Epsilon-greedy is a (good) way to do that. It's better than the most common approach, but you're still testing A against B.
There is no silver bullet, no free lunch. There is no algorithm that will beat understanding your domain and carefully analyzing your data.
It should instead say "Like many techniques in machine learning, the simplest strategy is easiest to implement" as the title of the post (20 lines) makes it clear.
For many problems in machine learning, k-nearest-neighbors and a large dataset is very hard to beat in terms of error rate. Of course, the time to run a query is beyond atrocious, so other models are favored even if kNN has a lower error rate.
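For concreteness, kNN really is about this simple (a from-scratch sketch; the toy 2-D points and labels are invented for the example):

```python
def knn_predict(train, query, k=3):
    # train: list of ((x, y), label) pairs; majority vote among the k
    # points closest to the query by squared Euclidean distance
    by_dist = sorted(train, key=lambda p: (p[0][0] - query[0]) ** 2
                                        + (p[0][1] - query[1]) ** 2)
    votes = [label for _, label in by_dist[:k]]
    return max(set(votes), key=votes.count)

train = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
         ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b")]
print(knn_predict(train, (0.5, 0.5)))  # a point near the "a" cluster; prints "a"
```

The atrocious query time mentioned above is visible here: every prediction sorts the entire training set, which is why production systems reach for approximate-nearest-neighbor indexes or other models entirely.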
It's not surprising that they found random forests perform better, even though the conventional wisdom is boosting outperforms (it's much harder to overfit with forests).
Not "impossible to beat" mind you -- merely hard. Because the simplest strategies work so well in so many cases, that you need to go quite subtle and implement harder algorithms (doesn't matter if you just take them ready off of a lib for the argument) to see an improvement. 80-20, et al.
It can be used from active learning to changing webpage layouts to increase ad clicks. It has the best bounds out of all exploratory algorithms.
Structured contextual bandits, which come with LOLS (another algorithm present in Vowpal Wabbit), are extremely powerful.
All for free under BSD3.