
> in both cases someone could accidentally optimize the test

I think this is what I disagree with.

The water heater story is about a viable-for-market design which also optimized for the test. The equivalent for a car emissions test might be optimizing the transmission to reduce emissions at the specific speeds which will be tested. Those speeds could be sweet spots of the engine curve by accident, or they could be planned that way. I don't think that's necessarily right, but it's within the bounds of "natural" design for the product.

Instead of doing that, VW submitted something for testing which was fundamentally different from what went to market. Rather than being misleading, the test results were fundamentally irrelevant. Creating two completely different modes of behavior isn't something you could do by chance, and it means there's no real limit on how badly they could cheat.
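
To make the distinction concrete, here is a toy sketch (made-up detection cue and numbers, nothing like VW's actual firmware): a defeat device isn't a tuning choice, it's a branch into an entirely different control strategy.

  from dataclasses import dataclass

  @dataclass
  class DriveState:
      steering_angle: float  # degrees
      wheel_speed: float     # km/h on the driven wheels

  def looks_like_test_bench(s: DriveState) -> bool:
      # On a dyno the driven wheels spin but the car never steers.
      return abs(s.steering_angle) < 0.1 and s.wheel_speed > 0

  def nox_grams_per_km(s: DriveState) -> float:
      if looks_like_test_bench(s):
          return 0.04  # full emissions controls, engaged only on the bench
      return 0.40      # controls dialed back on real roads

  print(nox_grams_per_km(DriveState(0.0, 50.0)))  # "clean" result for the regulator
  print(nox_grams_per_km(DriveState(5.0, 50.0)))  # 10x dirtier in actual driving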


This is omnipresent even where regulators aren't involved: every graphics card benchmark out there is 'manipulated' relative to real world performance. At this point it's so universal that I don't think anyone is even fighting it - as long as everyone games benchmarks roughly the same amount, the relative scores stay usable.

Your point about fairness and passive design is the one that makes me view these cases differently also. In the anecdote, the product being tested was the same one being sold, and there's no sign the heater was worsened to improve test performance. The designers just picked the best-scoring option among some reasonable configurations. (Frankly, once they noticed that issue, what were they supposed to do? Pick the worst-scoring, or pick the spec out of a hat?)

In the VW story, the test-bench vehicle was fundamentally different from the market vehicle, and the road version was designed to behave worse on the metrics to get other gains. I happen to know someone who bought a diesel Jetta specifically because it was more eco-friendly than other options, and I think he'd draw a clear line between tuning for test metrics and VW consciously lying to their buyers.


It's interesting you mention graphics cards, because that very behavior has led the gaming community to favor min/max/avg FPS benchmarks drawn from a handful of current-gen games over so-called synthetic benchmarks. It only took a handful of instances of companies baking in "benchmark" modes, triggered when certain benchmarks are detected, for people to start discounting those benchmarks in favor of more organic measurements.


I do think that manipulating a purely instructive measure is less extreme than manipulating a compliance test; consumers can seek alternate tests and reviews, but the state emissions test has special status even if a dozen other tests give a different result. That said, I believe Energy Star ratings affect tax rebates and electric bills, and they're required to be printed on products - so that's not really an arbitrary test.

There are other differences here too, I think. The water heater trick is passive manipulation that stays in place at all times, which limits how far from "real" performance it can get. And per the story, it seems more like "teaching to the test" than "cheating". Volkswagen, by contrast, consciously moved away from the mandate outside of testing. The water heater was (potentially) as energy-efficient as they could design, with the test score manipulated on top of that.

None of that makes it harmless - if "as good as you can make" doesn't hit standards without manipulating them, that's still a problem. But I do find it less galling than "intentionally worsens emissions outside the test bench".


The flip side of the water heater test is that you could game it the other way too, making your water heater look worse than it is. Would you do that? No.

The difference between the water heater and VW is that the water heater manufacturer was providing a representative sample, and VW was not. It'd also be dubious to say the water heater company was acting in bad faith, whereas VW's bad faith rose to the level of criminal conduct. On the other hand, Volvo appears to be acting in strictly good faith.

Bad faith for a crash test would be crafting a special silver-plate model just for testing. That reminds me: my uncle said the power supply manufacturer he worked for did exactly that.


Crash test dummies have basically this problem also. They're designed for realism in certain very narrow ways, and then the very small number of approved dummies are used for testing car safety.

The industry has made a bit of progress, surprisingly unprompted by regulations - female and child dummies came into circulation before they were required in tests. But overall, testing is still run against a tiny handful of body types which move 'realistically' in only a few regulation-guided respects.


I think some of this falls into the simulation paradox: the more accurate the simulation, the closer the simulation is to the thing being modelled. But in most cases cost grows roughly quadratically with accuracy, so at some point meaningful increases in simulation accuracy cease to be economically viable.


Yeah, but in the words of RyanF9, "The US government can afford a BB gun", so there's no reason the DOT can't test helmet visors.

The main reason the DOT standard is so bad is that it's mired in bureaucracy and managed by a severely underfunded organization.


Citing Maven also feels a bit circular. It's an important Java application, but as a build tool it's only important because there's lots of Java out there to build.

Minecraft and a lot of the other apps are impressive in their own right, so it's easier to justify the ecosystem that produced them.


Wait, which countries are we referencing outside of those three?

Thailand looks straightforwardly exponential so far and has fairly heavy mask use, agreed. But Singapore, Taiwan, and arguably Malaysia seem too early to call: they're still plausibly on either a European curve or South Korea's ramp-then-flatline.

Vietnam, Cambodia, Laos, Mongolia, and Burma all seem to be below the line for meaningful data. And Hong Kong isn't broken out. So I guess my questions are: do Indonesia and the Philippines have "mask cultures" to a level comparable to South Korea and Japan, and are their testing regimes wide enough to rely on those curves?

I don't know the answer to that. And I agree that the "masks work" graph/meme circulating is questionable. But unless I'm missing something/somewhere, this data just looks like "too soon to call"?
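
For what it's worth, the "too soon to call" check is mechanical: fit exponential growth to the early counts and see whether later points fall below the extrapolation. A rough sketch with invented placeholder numbers, not real country data:

  import numpy as np

  days = np.arange(14)
  cases = np.array([5, 7, 10, 14, 20, 28, 40, 55, 70, 80, 88, 92, 95, 97])

  # Fit log(cases) ~ slope * day over the first week only.
  slope, intercept = np.polyfit(days[:7], np.log(cases[:7]), 1)
  extrapolated = np.exp(intercept + slope * days)

  # Tracking the extrapolation means still exponential; falling well below means bending.
  print(f"day 13: {cases[-1]} observed vs {extrapolated[-1]:.0f} extrapolated")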


> you had to be logged in to the web interface already with another account

Obviously I don't know specifics, but if this applies to any router which has multiple tiers of login, then it could be a pretty serious problem. I suspect that might be true for routers designed specifically to broadcast multiple networks (e.g. school or shared apartment-building routers)?


I don't think this is uncharitable at all. I'm sure Kinsa has made a good effort at controlling for testing frequency, and I'm sure it's helped. But there's no reason to think the dynamics of COVID-motivated testing are the same as for flu-season testing, or new-buyer novelty, or anything else.

And more importantly, how could we know if it is? That's not just a Kinsa problem; we see this over and over again with peer-reviewed studies that "control for" certain factors like socioeconomics or health history. They're inherently limited to controlling for what they know about, and it's never perfect. Often, the entire effect is from an undiscovered variable. Take, say, the widely-promoted study finding that visiting a museum, opera, or concert just once a year is tied to a 14% decline in early death risk. The researchers tried to control for health and economic status, then concluded "over half the association is independent of all the factors we identified that could explain the link." [1]

Now, what seems more likely: that the unexplained half is from the profound, persistent social impact of dropping by a museum or concert once a year? Or that some of the explained factors like "civic engagement" can't be defined clearly, others are undercounted (e.g. mental health issues), and some were missed entirely?

I suspect Kinsa did much better than that, because they're not trying to control for such vague terms. But I think "even after controlling for" should basically never rule out asking "what if it's a confounder"?
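
Here's a minimal simulation of that worry (invented numbers, nothing to do with Kinsa's or the study's actual models): give visits zero causal effect, let health drive both visits and mortality, and "control for" only a noisy proxy of health. A residual association survives the control in every stratum.

  import numpy as np

  rng = np.random.default_rng(0)
  n = 200_000
  health = rng.normal(size=n)                   # true confounder
  visits = (health + rng.normal(size=n)) > 0.5  # healthier people visit more
  death_risk = 0.10 - 0.02 * health             # visits have NO direct effect
  proxy = health + rng.normal(size=n)           # what the study can measure

  # "Control" by comparing visitors vs. non-visitors within proxy quartiles.
  edges = np.quantile(proxy, [0.25, 0.5, 0.75])
  for b in range(4):
      in_bin = np.digitize(proxy, edges) == b
      gap = death_risk[in_bin & visits].mean() - death_risk[in_bin & ~visits].mean()
      print(f"quartile {b}: visitor risk gap {gap:+.4f}")  # negative in every bin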

[1] https://www.cnn.com/style/article/art-longevity-wellness/ind...


There's also a fairly good argument for this in the line of trains and highways. Planes aren't physically trapped on one course, but pretty much every nation heavily regulates who can fly where, when. Airports are often state-controlled, and even private ones need state approval to add new runways or flights.

What we have now is one of the ridiculous "private non-market" arrangements. When airlines in Europe fly empty planes to stop the government from taking their flight slots away, that's not the fault of the companies, but it's also not a functional market we should expect efficiencies from.

I'm not a fan of "regulate markets into dysfunction then nationalize them", but if the fundamental restraints on travel are too severe to let the market function freely, privatization stops making much sense.


Privatization also stops making sense when private companies constantly need bailouts.


Good point.

The TARP bailout in 2008 involved buying a ton of stock from troubled companies, but the stock was sold back as soon as the companies could afford to repurchase it. And this will be the second bailout for a bunch of airlines.

So one of the most interesting ideas I've heard is that we shouldn't nationalize things by fiat, but when TARP-style bailouts happen, the government should just keep the stock, at least for a while. If it really was a one-off crisis, the shares are a good investment. And if it's a failing business, or one paying dividends and then looking for handouts, holding the stock at least means the bailout isn't just a money sink.


The third option is "because they don't want to be blamed for model error". Governments aren't necessarily competent, but you can try to get them to understand 5%/95% confidence intervals, at least in hindsight. If you publicly release a prediction, and then the real outcome is the 10% confidence line, you're probably going to be yelled at for being wrong regardless of the error bars.
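
As a quick illustration with hypothetical numbers: even a perfectly calibrated forecaster publishing a 90% interval gets the "embarrassing" outcome about one time in ten, by construction.

  import numpy as np

  rng = np.random.default_rng(1)
  outcomes = rng.normal(loc=100.0, scale=15.0, size=10_000)  # true distribution

  # A well-calibrated 90% interval: the model's 5th to 95th percentiles.
  lo, hi = 100.0 - 1.645 * 15.0, 100.0 + 1.645 * 15.0
  outside = ((outcomes < lo) | (outcomes > hi)).mean()
  print(f"{outside:.1%} of outcomes land outside the interval")  # ~10%, as designed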

Of course, if the center of the prediction is horrifying, "people don't understand confidence intervals" then becomes a case of avoiding societal breakdown.

