These also don't make much sense to me, but to avoid getting downvoted to oblivion I'll say why.

Firstly, neither you nor the author is really citing any math or real-world papers. He backs his claims by saying that all the advertisers are using this over A/B, which is a pretty strong argument. But it occurs to me that for most of your points to stand you need to tackle this particular paragraph:

"Like many techniques in machine learning, the simplest strategy is hard to beat. More complicated techniques are worth considering, but they may eke out only a few hundredths of a percentage point of performance. The strategy that has been shown to win out time after time in practical problems is the epsilon-greedy method."

So to tackle your points:

1. Only stands if you show his paragraph above to be wrong. Does epsilon-greedy only work on consistent payouts, or does it work on fluctuating payouts too? It would seem to me that fluctuating payouts are a common occurrence in advertising on websites. I imagine there is some research out there to settle this!

2. He addresses this directly in the post: "This won't adapt to change. (Your visitors probably don't change. But if you really want to, in the reward function, multiply the old reward value by a forgetting factor.)"

3. There is no difference between this and A/B testing: the mock code he shows is supposed to go in your A/B testing framework, the code in the controllers is supposed to be the same (and you can remove it the same way).

4. Isn't A/B testing just as bad at testing multiple factors? Why wouldn't you "notice"? You should theoretically see the same percentages for each stage, and would be able to notice the oddity.

5. Again, this only stands if you show his paragraph above to be wrong. You are suggesting that a complicated strategy will win, which he says isn't true.
I showed that paragraph to be wrong. Not slightly, but wildly, in the common scenario where, while a test is running, you make another change to your website that lifts conversion.

The difference in this case is not a few hundredths of a point of conversion. It is a question of potentially drawing the wrong conclusion about which version is better, for 100x as long as you need to.
What you're forgetting is that it's adaptive. That 10% random factor means it's constantly adding in new information. Also, you can graph trends over time, so if you make a significant change you could reset the historical data to zero; but simply let it run and it will adapt to the change.

If you're really concerned about rapidly changing events, just add a diminishing return: multiply both the success and failure counts by, say, 0.9999 after each test, so 34/2760 becomes 34.9966/2760.724 on a success or 33.9966/2760.724 on a failure.
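For concreteness, here is a minimal sketch of an epsilon-greedy loop with the forgetting factor this comment suggests. The names (`Arm`, `choose_arm`, `record`) and the optimistic 1/1 initialization are my own illustration of the approach from the blog post under discussion, not code from it.

```python
import random

EPSILON = 0.10   # fraction of visitors given a random choice (explore)
DECAY = 0.9999   # forgetting factor applied to both counters on every trial

class Arm:
    """One variant being tested. Starts at 1/1 (optimistic) so new arms get tried."""
    def __init__(self, name):
        self.name = name
        self.successes = 1.0
        self.trials = 1.0

    def rate(self):
        return self.successes / self.trials

def choose_arm(arms):
    """Epsilon-greedy: usually exploit the current leader, sometimes explore."""
    if random.random() < EPSILON:
        return random.choice(arms)
    return max(arms, key=Arm.rate)

def record(arm, success):
    """Decay old evidence, then add the new observation.
    E.g. 34/2760 becomes 34.9966/2760.724 on a success."""
    arm.successes = arm.successes * DECAY + (1.0 if success else 0.0)
    arm.trials = arm.trials * DECAY + 1.0
```

With `DECAY = 1.0` this reduces to the plain version from the post; values below 1 make older observations fade, which is the adaptation being argued about.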
I am not forgetting that it is adaptive. I'm pointing out that the new information that is added will cause it to mis-adapt for a surprisingly long time.

Adding in a diminishing return is possible. But by what factor do we diminish? We could easily end up needing a disturbing amount of customization.
The time it takes to adapt is directly related to the magnitude of the difference: if it takes six months to go from a 1.006% efficient strategy to a 1.007% efficient strategy, it's not that important. The goal is to find significant wins quickly, and any strategy that focuses on micro-optimization will tend to find local maxima, not global ones. If the top two strategies are close enough, this greedy algorithm will tend to bounce between them, and that's OK.

As to the diminishing factor: you diminish both the numerator and the denominator for a bucket every time you test that bucket. If you want something next to perfect, try http://en.wikipedia.org/wiki/Bayesian_statistics, but that eats a lot of CPU and is harder to code, for minimal gain.
 I got a very different impression of this method than you did because I saw it as, just as with an A/B test, when you make a major change to the site, you reset your counters. This makes your point #2 mute. As for number 5, why couldn't you use the same cohort analysis with this method?
moot.

mute == "does not talk"
moot == "of little or no practical value or meaning; purely academic."

moot/mute is a tricky word combo because smart people can justify using "mute" instead of "moot", and English is such a terrible language w.r.t. spelling that there are no good clues for when to use one versus the other (and "moot" is far less commonly used).
Off-topic: Most of the time, such grammar/spelling corrections get downvoted because people think they are nitpicking and don't add anything to the conversation, or are rude somehow, but as a non-English speaker, I appreciate them a lot!
What is a major change?

I've seen minor tweaks to a form raise conversions by 20%. The person running a particular test may not even know that a particular change was significant, or even that it happened. The change could be as subtle as another running test realizing that version X is better, interfering with existing tests.

As for #5: with an A/B test, when you run into these situations you're able to break down and crunch the numbers in multiple ways, and then have a discussion about how you want to proceed. But with a multi-armed bandit approach, whatever complexities you have not thought of and baked into your approach are not going to be noticed.
 That data from the 90% on slice X is valuable because you're trying to confirm, or falsify, that X is better, quickly. The strategy is "look closely at this pointy thing to see if it's a needle or a sharp piece of straw."
 But... "better" than the other choices, which are only getting 1/10th the traffic. The amount of traffic sent to a choice should be a function of how good you currently think it is, and how much more data you need to be sufficiently certain about that choice. So choices that have insufficient data should get more traffic, and choices that already have sufficient data to be sure they're worse than some other choice should get very little traffic. How much traffic they should get depends on how certain you are of stationarity (is that a word?).
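One standard way to make traffic a function of current belief and remaining uncertainty, as this comment suggests, is Thompson sampling. This is my own illustration, not anything from the post: each choice keeps a Beta(successes + 1, failures + 1) posterior over its conversion rate, and each visitor is shown the choice whose posterior produced the highest random draw. Clearly worse choices almost never win a draw, so they get very little traffic; uncertain choices have wide posteriors and keep getting explored.

```python
import random

def thompson_choice(arms):
    """arms: list of (successes, failures) pairs; returns the chosen index.

    Draw one plausible conversion rate from each arm's Beta posterior
    and pick the arm with the highest draw.
    """
    draws = [random.betavariate(s + 1, f + 1) for s, f in arms]
    return max(range(len(arms)), key=lambda i: draws[i])
```

With arms at 900/1000 and 10/1000 conversions, essentially all traffic goes to the first; a brand-new arm at 0/0 would still be sampled often, because its posterior is wide.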
 re: #3... you can also remove items at will. If you look at the counters and find option C is consistently the lowest performer, then you can remove it from the code and remove the counter. It doesn't mean cruft needs to pile up.
 But how do you know when you can do that? Particularly with the slower convergence issue, and the potential presence of confounding factors like #2?
 Do you attempt to control for seasonal effects at all in your models? I'm wondering if allowing for a dependence on day-of-week, time-of-day and other relevant covariates, has any benefits over tracking fluctuations in a generic way?
 1 is definitely an issue, but has a solution. I posted an article about this a while back with a simple example of the problem and a possible solution. The example is a staged rollout but it illustrates the same point. If you change proportions of tests over time as there are independent changes, you can get skewed results: http://blog.avidlifemedia.com/2011/12/23/advanced-ab-testing...
 2. If you change the thing you're testing, you need to restart the test. Otherwise, you'll have a meaningless jumble of results from two separate things.
Re #1: could this be solved with an exponential decay? I.e., saying that a click from a month ago is less valuable than someone clicking today. By tweaking the decay you could change how quickly the algorithm will sway when the conversion rate changes.

EDIT: I just saw that rauljara suggested this below: https://news.ycombinator.com/item?id=4040230
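Picking the decay can be framed as choosing a half-life: how many further observations until an old click counts half as much. A tiny helper along those lines (`decay_for_halflife` is my own hypothetical name, not from the thread):

```python
def decay_for_halflife(n_events):
    """Per-observation decay factor so that after n_events further
    observations, an old click's weight has halved."""
    return 0.5 ** (1.0 / n_events)
```

For example, a half-life of about 6931 observations gives a decay factor of roughly 0.9999, the value suggested elsewhere in the thread; shorter half-lives make the algorithm sway faster when the conversion rate changes.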
 I don't think any of those points are very true.
I've been involved with A/B testing for nearly a decade. I assure you that none of these points is in the slightest bit hypothetical.

1. Every kind of lead gen that I have been involved with, and thought to measure, has large periodic fluctuations in user behavior. Measure it: people behave differently on Friday night and Monday morning.

2. If you're regularly running multiple tests at once, this should be a potential issue fairly frequently.

3. If you really fire and forget, then crud will accumulate. To get rid of it you have to do the same kind of manual evaluation that was supposed to be the downside of A/B testing.

4. Most people do not track multiple metrics on every A/B test. If you don't, you'll never see how it matters. I make that a standard practice, and regularly see it. (Most recently, last week. I am not at liberty to discuss details.)

5. I first noticed this with email tests. When you change the subject line, you give an artificial boost among existing users who are curious about this new email. New users do not see the subject line as a change. This boost can easily last long enough for an A/B test to reach significance. I've seen enough bad changes look good because of this effect that I routinely look at cohort analysis.
 What do you think of Myna, in these respects? Does it suffer from the same disadvantages as other bandit optimization approaches?
"Does it suffer from the same disadvantages as other bandit optimization approaches?"

Yes.

That said, the people there are very smart and are doing something good. But I would be very cautious about time-dependent automatic optimization on a website that is undergoing rapid improvement at the same time.
 #1 certainly is, particularly for businesses prone to seasonal fluctuations. In local lead-gen (for instance) you see big changes in conversion based on the time of year.
Wow, you convinced me...

Sarcasm aside, I've also experienced all of these issues with real-world testing and would be interested in hearing your argument as to why you think this is not the case.
Sorry, was on an iPad.

Most or all of the points suffer from:

* Is that actually true?
* Does regular A/B testing not also face that issue?
* Was it suggested that you must "set it and forget it"?
* Are there no mechanisms for mitigating these issues?
* Would using 20% or 30% mitigate the issues?
* Are you not allowed to follow the data closely with the bandit approach?

The whole list struck me as a supposed expert in the status quo pooh-poohing an easier approach.
"Most or all of the points suffer from..."

Let's address them one by one.

"Is that actually true?"

In every case, yes.

"Does regular A/B testing not also face that issue?"

For the big ones, regular A/B testing does not face that issue. For the more complicated ones, A/B testing does face the issue, and I know how to work around it. With a bandit approach I'm not sure I'd have noticed the issue.

"Was it suggested that you must 'set it and forget it'?"

Not "must", but it was highly recommended. See paragraph 4 of http://stevehanov.ca/blog/index.php?id=132 and look for the words in bold.

"Are there no mechanisms for mitigating these issues?"

There are mechanisms for mitigating some of these issues. The blog does not address them. As soon as you go into them, you get more complicated. It stops being the "20 lines that always beats A/B testing" that the blog promised. I was doing some back-of-the-envelope calculations on different methods of mitigating these problems. What I found was that in the best case you turn short-term errors into long-term ones.

"Would using 20% or 30% mitigate the issues?"

That would lessen the issue that I gave, at the cost of permanently worse performance.

The permanent performance bit can benefit from an example. Suppose that there is a real 5% improvement. The blog's suggested approach would permanently assign 5% of traffic to the worse version, for 0.25% less improvement than you found.

Now suppose you tried a dozen things: 1/3 of them were 5% better, 1/3 were 5% worse, and 1/3 did not matter. The 10% bandit approach causes you to lose 0.25% conversion for each test with a difference, for a permanent roughly 2% drop in your conversion rate relative to actually making your decisions.

(Note, this is not a problem with all bandit strategies. There are known optimal approaches where the total testing penalty decreases over time. If the assumptions of a k-armed bandit hold, the average returns of the epsilon strategy will lose to "A/B test, then go with the winner", which in turn loses to more sophisticated bandit approaches. The question of interest is whether the assumptions of the bandit strategy really hold.)

Whichever form of testing you use, you're doing better than not testing. Most of the benefit comes from actually doing testing. But the A/B testing approach here is not better by hundredths of a percent; it is ahead by a permanent 2% margin. That's not insignificant to a business.

If you move from 10% to 20%, that permanent penalty doubles. You're trading off certain types of short-term errors for long-term errors.

(Again, this is just an artifact of the fact that an epsilon strategy is far from an optimal solution to the bandit problem.)

"Are you not allowed to follow the data closely with the bandit approach?"

I am not sure what you mean here.
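The permanent-penalty arithmetic in this comment can be checked directly. This is my own back-of-the-envelope reconstruction of its numbers, not code from the thread:

```python
# Epsilon-greedy permanently spends epsilon of its traffic exploring.
# With two versions, half of that (5% of all traffic) sits on the loser
# forever whenever there is a real difference between the versions.

epsilon = 0.10   # exploration fraction from the blog post
gap = 0.05       # a real 5% difference between the two versions

# 5% of traffic, each conversion 5% worse: a 0.25% relative loss per
# test that had a real difference.
per_test_loss = (epsilon / 2) * gap

# Of a dozen tests, 2/3 (eight) had a real difference in either direction;
# summing the per-test losses gives the "permanent roughly 2% drop".
differing_tests = 8
total_loss = differing_tests * per_test_loss
```

Strictly speaking, relative losses compound multiplicatively rather than summing, but at these magnitudes the sum is an accurate approximation of the roughly 2% figure.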
