Increasing rate of experimentation (arkid.substack.com)
69 points by ArchieIndian 3 months ago | 14 comments

I'm a cofounder of a small (and growing :-)) startup called Mito [1]. We don't do A/B testing currently, for two main reasons:

1. We're a locally installable product with opt-in updates.

2. We don't get enough new users per week to make the turnaround time on experiments quick enough.

We're big proponents of local-first software, so though we might have a hosted offering at some point for users who prefer that, the locally-installable + opt-in-updates model is gonna be around for a while. These constraints make A/B testing hard for obvious reasons - it's no longer just flipping a switch to get both sets of users onto the same branch after the experiment terminates, nor is it as easy to randomize people into two different groups.

We also just don't get enough new users a week for the experiments to make sense. For the effect sizes we're hoping to measure, we'd have to wait a few weeks to draw conclusions - and given that it's even more complicated to run more than one experiment at once (because of the local install + opt-in upgrading mentioned above), it becomes really expensive to do any sort of A/B testing.
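A quick back-of-the-envelope power calculation makes the turnaround problem concrete. All the numbers here (baseline conversion, detectable lift, weekly signups) are made up for illustration, and the normal-approximation formula is a rough sketch rather than a precise power analysis:

```python
from statistics import NormalDist

def ab_sample_size(p_base, lift, alpha=0.05, power=0.8):
    """Approximate per-arm sample size to detect an absolute lift in a
    conversion rate, using the usual two-sided z-test normal approximation."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha=0.05
    z_b = NormalDist().inv_cdf(power)          # ~0.84 for power=0.8
    p_bar = p_base + lift / 2                  # rough pooled rate
    return 2 * (z_a + z_b) ** 2 * p_bar * (1 - p_bar) / lift ** 2

# Hypothetical numbers: 10% baseline conversion, hoping to detect a
# 2-point absolute lift, with ~50 new users signing up per week.
n = ab_sample_size(0.10, 0.02)   # ~3,800 users per arm
weeks = 2 * n / 50               # both arms filled from weekly signups
```

With those (invented) numbers the experiment needs on the order of years, not weeks, which is the low-traffic problem in a nutshell.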

There's also the question that this post leaves out: what changes are worth A/B testing in the first place?

I know folks at a gaming company that A/B tests every single change they make to their games for the effect it has on rev/user. They are a mature company, and the changes they make to their apps are, on balance, fairly small. This makes sense to me, especially given the tooling they have built to facilitate this - although it is clear they are just trying to extract value from their existing user base rather than dramatically improve their product and grow.

For early-stage startups, IMO the best bet is often to be a user of your own product and to just test your product changes on yourself - it's usually pretty obvious which direction you should take things. We recently overhauled our graphing capabilities to add about 5x more graph types and actual graph configuration options. Given how limited our previous graphing capabilities were, it was pretty much a no-brainer and obviously better. A/B testing would have just been a waste of time, methinks!

Feedback and thoughts on the above greatly appreciated. We're always looking for ways to improve our product/technical processes!

[1] https://trymito.io

Going beyond A/B testing in the context of websites (for customer conversion purposes), I think a bit of statistical rigour is something many marketing teams lack in their practices. My own marketing team brought me a chart showing that our sessions were down and I asked them, "is it statistically significant?" They had no answer for that. So I asked them to compile several years of data so that we could do a bit of analysis and establish some basic metrics, like the interquartile range of values over a long time period.
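As a minimal sketch of that kind of analysis - with entirely fabricated session counts - Python's stdlib can compute the interquartile range and flag whether a given week actually falls outside normal variation:

```python
import statistics

# Hypothetical weekly session counts pulled from a few years of
# analytics data (numbers invented for illustration).
weekly_sessions = [980, 1040, 875, 1110, 950, 1005, 890, 1200,
                   930, 1075, 990, 860, 1150, 1020, 945, 1085]

q1, q2, q3 = statistics.quantiles(weekly_sessions, n=4)
iqr = q3 - q1

# Common rule of thumb: a week is only "unusual" if it falls outside
# the fences Q1 - 1.5*IQR .. Q3 + 1.5*IQR.
this_week = 900
unusual = not (q1 - 1.5 * iqr <= this_week <= q3 + 1.5 * iqr)
```

With this (made-up) history, a "down" week of 900 sessions sits comfortably inside normal variation - exactly the kind of check the chart was missing.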

If you have very little data - which is the case in many B2B marketing scenarios where your customer base may be a niche - it's super critical to avoid drawing conclusions from averages without paying attention to the distribution, which might be very wide.

> You can’t improve what you can’t measure.

I can't stand this cliche. You can improve things you can't measure - we do this all the time - and the things worth improving most are often hard to measure. Setting objective goals is important to keep us honest with ourselves, but in that same vein, we should continuously acknowledge that, unless we are working in hard sciences, we are usually measuring proxies for what we really care about, and it's often hard to pick a representative measure.

As an example:

> Measures can be as simple as number of experiments per person per team or something more complicated.

I've seen this exact scenario in a large company I used to work for. The outcome was that some teams would hit their goal by running garbage experiments. It was a net negative for the product, because garbage still occasionally shows statistical significance. Acknowledge that the measure is an imperfect proxy, identify in what ways the measure could fail to represent the true desired outcome, and control for those (in this case, e.g. some oversight on experiment quality).
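That failure mode is easy to simulate: even when a change truly does nothing, a naive two-proportion z-test will flag roughly 5% of experiments as "significant". The traffic numbers below are made up; the point is the false-positive rate, not the specifics:

```python
import random

random.seed(0)

def garbage_experiment(n=1000, p=0.1):
    """Simulate an A/B test where both arms have the identical true
    conversion rate p, i.e. the change has zero real effect."""
    a = sum(random.random() < p for _ in range(n))
    b = sum(random.random() < p for _ in range(n))
    pa, pb = a / n, b / n
    pooled = (a + b) / (2 * n)
    se = (2 * pooled * (1 - pooled) / n) ** 0.5
    z = (pa - pb) / se if se else 0.0
    return abs(z) > 1.96  # "significant" at the 5% level

# Of 200 no-effect experiments, expect ~10 spurious "wins".
hits = sum(garbage_experiment() for _ in range(200))
```

A team graded on "experiments run" can rack up shippable "wins" this way without ever improving the product.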

Goodhart's law[0] should be posted above every experimentation team.

The idea that "When a measure becomes a target, it ceases to be a good measure" is something that every data scientist/statistician knows, but almost none heed in practice.

The worst-led companies that I've worked for are the ones that claim to be "data driven": countless dashboards showing progress toward various targets without even a hint of understanding of what the big picture might be.

One of the biggest insights I've had over a career in data science is that a person solving a problem based on years of experience, without any numbers backing their decisions, is almost always making choices close enough to optimal that it isn't worth the extra energy to push them the rest of the way.

An example: the person selling hot dogs at the park is probably pricing them nearly optimally. You could bring in a team of dynamic pricing experts and build a data center to mine customer data, and I'm willing to bet a few hot dogs that the difference between the model-optimal price and what the hot dog vendor is charging is not enough to justify the cost of figuring out the difference.
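A toy illustration of why the vendor's gut price is hard to beat: profit curves are flat near their peak, so a price somewhat off the optimum loses only a sliver of profit. The linear demand model and every number in it are invented for the sketch:

```python
def profit(price, cost=1.0, a=200, b=40):
    """Hypothetical vendor economics: unit cost $1, linear demand
    that falls from 200 dogs/day at price $0 by 40 per extra dollar."""
    demand = max(a - b * price, 0)
    return (price - cost) * demand

# Brute-force the model-optimal price over a 1-cent grid.
best_profit, best_price = max(
    (profit(p / 100), p / 100) for p in range(100, 500)
)

vendor_profit = profit(3.20)  # gut-feel price, a bit off the optimum
loss_pct = 100 * (best_profit - vendor_profit) / best_profit
```

In this made-up model the optimum is $3.00, and a vendor charging $3.20 gives up only about 1% of profit - plausibly less than the dynamic-pricing team costs.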

I likewise would not be surprised if the real, long run benefit of A/B testing does not justify the cost of both employee time and especially the SaaS products that help manage these processes... but let's not do that analysis because my salary depends on no one checking this.

0. https://en.wikipedia.org/wiki/Goodhart%27s_law

The main reason that I think I make good decisions in the absence of data, is because of how much I've relied on whatever data is available to inform my decisions and learn going forward. This is a huge advantage to multivariate testing as a practice/culture. As a consequence, it's often very easy for me to pick out when readouts are giving a deceptive answer (i.e. oh, the scope of this uplift is too much, we need to double check if something happened to negatively impact the control).

I'm not sure I'd agree that people are often operating "close enough to optimal", but I would definitely agree that integrating experimentation is hard enough that sometimes the effort (or the mistakes you can introduce) will cause more problems than it solves. But I think this is more a function of how poor people are at the mechanics and mindset of running experiments than of them already pricing hot dogs well enough. Experiments in many places are used either for CYA or for boasting about quarterly results, not to truly learn/grow/improve.

Hofstadter's law is the only law that's self-referential, and practically all of them should be.

    It always takes longer than you expect, even when you take into account Hofstadter's Law.
Goodhart's Law is describing a system that is dynamically stable, like a unicycle. The moment you stop moving, you fall over. And while really clever people can reduce that movement to a very small extent, possibly so small that an untrained observer no longer sees it, for most beginners and journeymen it's easier/safer if you just move a lot.

In the same vein, I never achieve goals I set. Working out is the usual example because it's so easy to see and measure results. But if I do something casually, it improves. If I set something as a goal and try to stick to a program, it always fails due to injury, overuse, or not enough load/volume.

> You can improve things you can't measure, we do this all the time

I would still consider an indirect measure or proxy measure of a thing to be "a measure of the thing".

Yes, if you interpret "measurement" strictly, then you can't measure anything at all. I don't even have a ruler I can measure the length of something with that isn't in some way a proxy for the true length.

The error bars of our measurements may be large, and there may be many confounding variables, but it is still a measurement.

Having said that, I agree with the overall sentiment above. We often pick metrics that don't measure what we think they do or do it so inaccurately as to be useless. And at the same time we invest too little in finding the right metrics before they become ingrained in the org structure.

I improve my cooking all the time. That doesn't seem to fit here.

What does the word "improve" mean if you don't measure anything? How would you know that something is better or worse? (No metric is perfect and there may be downsides, but a "we don't need no metrics" attitude makes no sense - it leads to burning witches in an attempt to improve the weather.)

I think you might be confusing "measure" with "observe". Measure in this discussion has a distinct quantitative implication.

You've never had one steak you prefer to another without doing quantitative analysis on it?

I love making cocktails and I can definitely tell better from worse, noting qualitative differences (a bit dry, too sweet, mouthfeel a bit thick) with absolutely no quantitative measurements involved.

I can certainly tell the difference between a good and bad violinist without measuring anything.

From a different perspective: many of my favorite movies have a 3.5 star rating on Amazon, in this case the measure does not correlate well with my sense of improvement.

Try a blind test for your cocktails (you could follow https://en.wikipedia.org/wiki/Lady_tasting_tea ). The results may surprise you.
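For reference, the tea-tasting experiment's p-value falls straight out of the hypergeometric distribution; this sketch assumes Fisher's classic setup of 8 cups, 4 with milk poured first:

```python
from math import comb

def p_at_least(k, n=8, m=4):
    """P(correctly identifying at least k of the m milk-first cups out
    of n total) under the null hypothesis of pure guessing."""
    total = comb(n, m)  # number of ways to pick m cups as "milk-first"
    return sum(comb(m, i) * comb(n - m, m - i)
               for i in range(k, m + 1)) / total

p_perfect = p_at_least(4)  # 1/70 ~ 0.014: significant at the 5% level
p_three   = p_at_least(3)  # ~0.24: 3 of 4 right proves nothing
```

The same logic applies to blind cocktail tasting: with only a handful of trials, anything short of a near-perfect score is indistinguishable from guessing.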

Surely burning witches to improve the weather is exactly an example of metric use gone awry? To your point though, it is exceedingly common to know we are better at something without a clear metric. It is not controversial to suggest that someone may be better now at communicating than they were as a child, even though there is no clear way to define this.

