
At the start of my career (late 70s), I worked at IBM (Hursley Park) in Product Assurance (their version of QA). We wrote code and built hardware to test the product that was our area of responsibility (it was a word processing system). We wrote test cases that our code would drive against the system under test. Any issues we would describe in general terms to the development team -- we didn't want them to figure out our testcases -- we wanted them to fix the bugs. Of course, this meant that we would find (say) three bugs in linewrapping of hyphenated words and the use of backspace to delete characters, and then the development team would fix four bugs in that area but only two of the actual bugs that we had found. This meant that you could use fancy statistics to estimate the actual number of bugs left.

When I've worked for organizations without QA teams, I introduce the concept of "sniff tests". This is a short (typically 1 hour) test session where anybody in the company / department is encouraged to come and bash on the new feature. The feature is supposed to be complete, but it always turns out that the edge cases just don't work. I've been in these test sessions where we have generated 100 bug tickets in an hour (many are duplicates). I like putting "" into every field and pressing submit. I like trying to just use the keyboard to navigate the UI. I run my system with larger fonts by default. I sometimes run my browser at 110% zoom. It used to be surprising how often these simple tests would lead to problems. I'm not surprised any more!




> When I've worked for organizations without QA teams, I introduce the concept of "sniff tests". This is a short (typically 1 hour) test session where anybody in the company / department is encouraged to come and bash on the new feature.

We call those bug-bashes where we work, and they're also typically very productive in terms of defects discovered!

It's especially useful since during development of small features, it's usually just us programmers testing stuff out, which may not actually reflect how the end users will use our software.


A good QA person is basically a personification of all the edge cases of your actual production users. Our good QA person knew how human users used our app better than the dev or product team. It was generally a competition between QA & L2 support as to who actually understood the app best.

The problem with devs testing their own & other devs' code is that we test what we expect to work, in the way we expect the user to use it. This completely misses all sorts of implementation errors and edge cases.

Of course the dev tests the happy path they coded... that's what they thought users would do, and what they thought users wanted! That doesn't mean the devs were right, and frequently they are not.


This dude gets it.


> It was generally a competition between QA & L2 support as to who actually understood the app best.

So true!


And to clarify, specifically because those who haven't experienced it don't understand it...

The uses of your app as intended by the authoring developers never match the uses of your app out in the wild in the hands of human users.

Over time, power users develop workflows that may go unknown by dev/product/management and are only well understood by QA / L2 support.

The older the app, the greater the divergence.


Maybe we work in the same company. I'd like to add that usually the engineer responsible for the feature being bug-bashed is also responsible for refining the document where everyone writes down the bugs they find, since a lot are duplicates, existing bugs, or not bugs at all. The output is then translated into Jira to be tackled before (or after) a release, depending on the severity of the bugs found.


That's a very interesting strategy, hiding the source of the automated tests from the developers. I can see it shifting the focus away from just disabling the test or catering to the test. I'll have to think about this one; there are some rich thoughts to meditate on here.


It is an interesting approach I hadn't heard of before. For complex systems, though, reproducing the bug reliably is often a large part of the problem, so giving the developers the maximum information is necessary.

Any time a "fix" is implemented, someone needs to be asking the right questions. Can this type of problem occur in other features / programs? What truly is the root cause, and how has that been addressed?


Wow, less IS more. Hear me out.

How do we measure “more” information? Is it specificity or density?

Because here, assuming they can reproduce it, keeping the information fuzzy can make the problem space feel larger. This forces a larger mental map.


At Microsoft back in the day, we called those “bug bashes”, and my startup inherited the idea. We encouraged the whole company to take an afternoon off to participate, and gave out awards for highest impact bug, most interesting bug, etc.


This is a bit of an aside, but I have a question that I'd like to ask the wider community here. How can you do a proper bug-bash when also dealing with Scrum metrics that result in a race for new features without any regard for quality? I've tried to do this with my teams several times, but we always come down to the end of the sprint with too much feature work left to do, so anybody who "takes time off" to do bug bashing looks bad because they ultimately complete fewer story points than those who don't.

Is the secret that it only works if the entire company does it, like you suggest?

And yes, I completely realize that Scrum is terrible. I'm just trying to work within a system.


That's not a problem with Scrum, it's a problem with your team. If you're doing a bug bash every sprint, then your velocity is already including the time spent on bug bashes. If it's not in every sprint, you can reduce the forecast for sprints where you do them to account for it (similar to what you do when someone is off etc).

If you're competing within the team to complete as many story points as possible that's pretty weird. Is someone using story points as a metric of anything other than forecasting?


> Is someone using story points as a metric of anything other than forecasting?

Very nearly every company I've worked at that uses Scrum uses story points, velocity, etc., as a means of measuring how good you or your team are. Forecasting is a secondary purpose.


Don't teams assign points to their own tickets? So how could one compare the points between teams?


Yes. But many Sr. Leaders just see a number, so it must also be a metric you can use for measurement. They do not understand its real use.

I picture a construction company counting the total inches / centimeters each employee measured every day, then at the end of the year firing the bottom 20% of employees by total units measured over the last 12 months.


Sounds easy to game. Click a button to foo the bar is <pinky> one million points


Ahh the good old "Waterfall Scrum"


> That's not a problem with Scrum, it's a problem with your team.

I've seen that justification time and again, and it feels disingenuous every time it's said. (Feels like a corollary to No True Scotsman.)

I've also seen scrum used regularly, and everywhere I've seen it has been broken in some fashion. Enough anecdata tells me that indeed Scrum, as stated, is inherently broken.


Ah the classic: How do I improve quality in an org 'without any regard for quality'? :)

But assuming that everyone cares about quality (I know, a big leap), what has worked for me is: tagging stories as bugs/regressions/customer-found-this and reporting on time spent. If you're spending too much time fixing bugs, then you need to do something about it. New bugs in newly written code are faster to fix, so you should be able to show that bug bashes make that number go down quarter over quarter, which contributes to velocity going up.

Alternately (and not scrum specific) I've had success connecting a CSM/support liaison to every team. Doesn't give you a full bug bash, but even one outside person click testing for 20m here and there gets you much of the benefit (and their incentives align more closely with QA).


The team with the lowest bug bash participation this week is the victim, er, host of next week's bug bash.


Seems like another data point suggesting sprints don't make sense in real-world projects?


I'm kind of in the same boat re story points and Scrum metrics, but sometimes we can get management buy-in to create a ticket to do this sort of thing, if it's seen as high value for the business.


> because ultimately they complete fewer story points than others that don't do it?

Solution: don't measure story points


Only assign points based on an (n-1)-day sprint instead of an n-day one.


Why are you putting the blame on scrum if you don't even implement it? I did scrum in a previous company and it worked fine. Nobody looked at the story points except the devs during planning. We had an honest discussion with the product owner every time and did find the time to do tech debt.

It wasn't perfect, but it worked well.

Granted, it required very specific management, devs with the right mindset, and constraints on the kind of projects that could be done (anything customer-facing with a tight deadline was off, for instance; we used scrum for the internal infra). So I don't see how you would build a plane at Boeing with scrum, for instance, or anything that requires very tight coupling of many teams (or hardware).

But for us (60 devs in a company of 200), SaaS, it worked great.


We do a similar thing but call it a bug hunt.

Not only do we uncover bugs, it's a great way to get the whole company learning about the new things coming and for the product team to get unfiltered feedback.


> I've been in these test sessions where we have generated 100 bug tickets in an hour.

Is that like… useful to anyone? Especially if they are duplicates. It feels to me that 10 different bugs is enough to demonstrate that the feature is really bad; after that you are just kinda bouncing the rubble?


As noted in the post, one characteristic of a healthy QA environment at your work is how effectively you triage bugs. That includes detecting duplicates. One big QA smell for me is opening a team's bug backlog and realizing that there is a shitload of dupes in there, because it means that no one really looked at them in detail.


It's "free" QA. And presumably you'll do it again later until it's better.


> This meant that you could use fancy statistics to estimate the actual number of bugs left.

That's very clever. Precise test case in QA plus vague description given to dev. Haven't seen it before, thank you for sharing that insight.


If there is a precise test case, automate it. The real value of manual tests is when they explore to find variations you didn't think of. Your manual tests should be exploratory.


Yes, automated tests are better than manual. The subtle point I'd missed is to consider not giving the dev team your automated test to improve the odds they fix the general pattern as opposed to your point instance.

The statistical side to it is really interesting too, still thinking about that.


Generally, "The German Tank Problem":

<https://en.wikipedia.org/wiki/German_tank_problem>

There are similar methods used in estimating wildlife populations, usually based on catch-release (with banding or tagging of birds or terrestrial wildlife) or repeat-observation (as with whales, whose fluke patterns are distinctive).
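
As a concrete illustration, here is a minimal sketch (in Go) of the Lincoln–Petersen mark-recapture estimate applied to the numbers from the top comment; the function name is just illustrative, and a real analysis would also want a confidence interval:

    package main

    import "fmt"

    // lincolnPetersen estimates a total population from two overlapping
    // samples. Here the "captures" are bugs found by QA and bugs fixed by
    // dev, and the overlap is the bugs that appear in both sets.
    func lincolnPetersen(foundByQA, fixedByDev, overlap float64) float64 {
        return foundByQA * fixedByDev / overlap
    }

    func main() {
        // From the linewrapping story above: QA found 3 bugs, dev fixed 4
        // in that area, but only 2 of the fixes matched QA's bugs.
        fmt.Printf("estimated bugs in that area: %.1f\n", lincolnPetersen(3, 4, 2)) // ~6.0
    }

The intuition: the smaller the overlap between what QA found and what dev fixed, the more bugs are probably still out there.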


I've always been impressed by hardware QA test teams I've worked with. On Google Fiber, they had an elaborate lab with every possible piece of consumer electronics equipment in there, and would evaluate every release against a (controlled) unfriendly RF environment. ("In version 1.2.3.4, the download from this MacBook Pro while the microwave was running was 123.4Mbps, but in version 1.2.4.5, it's 96.8Mbps." We actually had a lot of complexity beyond this that they tested, like bandsteering, roaming, etc.) I was always extremely impressed because they came up with test cases I wouldn't have thought of, and the feedback to the development team was always valuable to act on. If they find an issue, we get pages of charts and graphs and an invite to the lab. If a customer finds the issue, it just eats away at our customer satisfaction while we guess what could possibly have changed. Best to find the issue in QA or development.

As for software engineers handling QA, I'm very much in favor of development teams doing as much as possible. I often see tests bolted on to the very end of projects, which isn't going to lead to good tests. I think that software engineers are missing good training on what to be suspicious of, and what best practices are. There are tons of books written on things like "how to write baby's first test", but honestly, as an industry, we're past that. We need resources on what you should look out for while reviewing designs, what you should look out for while reviewing code, what should trigger alarm bells in your head while you're writing code.

I'm always surprised how I'll write some code that's weird, say to myself "this is weird", and then immediately write a test to watch it change from failing to passing. Like times when you're iterating over something where normally the exit condition is "i < max", but this one time, it's different, it actually has to be "i <= max". I get paranoid and write a lot of tests to check my work. Building that paranoia is key.

> I like putting "" into every field and pressing submit.

Going deeper into the training aspect, something I find very useful are fuzz tests. I have written a bunch of them and they have always found a few easy-to-fix but very-annoying-to-users bugs. I would never make a policy like "every PR must include a fuzz test", but I think it would be valuable to tell new hires how to write them, and why they might help find bugs. No need to have a human come up with weird inputs when your idle CI supercomputer can do it every night! (Of course, building that infrastructure is a pain. I run them on my workstation when I remember and it interests me. Great system.)

At the end of the day, I'm somewhat disappointed in the standards that people set for software. To me, if I make something for you and it blows up in your hands... I feel really shitty. So I try to avoid that in the software world by trying to break things as I make them, and ensure that if you're going to spend time using something, you don't have a bad experience. I think it's rare, and it shouldn't be; it should be something the organization values from the top to the bottom. I suppose the market doesn't incentivize quality as much as it should, and as a result, organizations don't value it as much as they should. But wouldn't it be nice to be the one software company that just makes good stuff that always works and doesn't require you to have 2-week calls with the support team? I'd buy it. And I like making it. But I'm just a weirdo, I guess.


> Going deeper into the training aspect, something I find very useful are fuzz tests.

Could you share some details of fuzz tests that you've found useful? I tend to work with backend systems and am trying to figure out whether they will still be useful in addition to unit and integration tests.


Fuzz tests are most useful if they are run continuously/concurrently with development. Doing it that way, a change or decision that causes a fuzz test failure hasn't been built upon yet. Imagine building an earthquake resistant house on a shaker table vs. putting the 100% completed house on the same shaker table.

Doing fuzz testing at the end leads to a lot of low priority but high cost bugs being filed (and many low-cost bugs as well).

The utility of it is quite clear for security bugs. It requires low effort to find lots of crashes or errors that might be exploitable. For development in general, it tends to identify small errors or faulty architectural or synchronization decisions very early, while they are still easy to repair.


Sure!

I wrote my first fuzz test with a go1.18 prerelease just to get a feel for the new framework. It found a crash instantly: https://github.com/jrockway/json-logs/commit/77c87854914d756.... This commit contains the fuzz test and the fix. (3 cases could be nil; I only ever thought about 2 of them. The fuzzer found the 3rd almost immediately.)
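
(For anyone who hasn't used the framework yet, a go1.18 fuzz target has roughly this shape; this sketch fuzzes json.Unmarshal purely as a stand-in for whatever code you actually care about, and is not the test from that commit:)

    // fuzz_example_test.go -- file and package names are illustrative
    package fuzzdemo

    import (
        "encoding/json"
        "testing"
    )

    // FuzzUnmarshal: seed the corpus with f.Add, then let the engine mutate
    // the inputs. Run with: go test -fuzz=FuzzUnmarshal
    func FuzzUnmarshal(f *testing.F) {
        f.Add(`{"level":"info","msg":"hi"}`) // one valid seed input
        f.Fuzz(func(t *testing.T, input string) {
            var out map[string]interface{}
            // The only property asserted is "never panic": malformed input
            // should come back as an error, not a crash.
            _ = json.Unmarshal([]byte(input), &out)
        })
    }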

For my project at work, we have an (g)RPC server with a ton of methods. The protos have evolved since 2015 while trying to be backwards compatible with older clients. As a result, many request parameters are optional but not expressed as oneof, so it's hard to figure out what a valid message actually is, and different parts of the code do it differently. (As an aside, oneof is totally backwards compatible... but the protos evolved without knowing that.) I wrote a fuzz test that generates random protos and calls every single RPC with them: https://github.com/pachyderm/pachyderm/blob/master/src/serve...

This test is imperfect in many ways; it knows nothing about protos, and just generates random bytes and sends it to every method. Most of the time, our protoc-gen-validate validators reject the message as being nonsense, but that's totally fine. The test found about 50 crashes pretty immediately (almost null messages, basically a non-null message where the first tag is null), and one really interesting one after about 16 hours of runtime on my 32 core Threadripper workstation. Writing this and fixing the code took about a day of work. I just randomly did it one day after writing the call-GRPC-methods-with-introspection code for a generic CLI interface.

I did this concurrent with a professional security review as part of our company's acquisition and... they didn't find any of these crashes. So, I think I added some value.

It's worth mentioning that most people wrap gRPC calls with a defer func() { ... recover() } block so that bad RPCs don't crash the server. We didn't at the time, but I implemented the fuzz tests with such a wrapper. (So the test doesn't crash when it finds a panic, I just return a special response code when there is a panic, and the test is marked as failing. Then the next test can proceed immediately in the presence of --keep-fuzzing or whatever.) Either way, I prefer an error message like "field foo in message BarRequest is required" to "Aborted: invalid memory reference or nil pointer dereference." and now we have that for all the cases the fuzzer noticed. The user is sad either way, but at least they can maybe self-help in the first case.
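
For reference, a minimal sketch of that kind of recover() wrapper written as a gRPC unary server interceptor; the package name, status code, and error message here are illustrative choices, not what we actually shipped:

    package interceptors

    import (
        "context"

        "google.golang.org/grpc"
        "google.golang.org/grpc/codes"
        "google.golang.org/grpc/status"
    )

    // RecoverUnary converts a panic inside an RPC handler into an Internal
    // error instead of letting it take down the whole server (or the fuzz test).
    func RecoverUnary() grpc.UnaryServerInterceptor {
        return func(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (resp interface{}, err error) {
            defer func() {
                if r := recover(); r != nil {
                    err = status.Errorf(codes.Internal, "panic in %s: %v", info.FullMethod, r)
                }
            }()
            return handler(ctx, req)
        }
    }

    // Wire it up with: grpc.NewServer(grpc.ChainUnaryInterceptor(RecoverUnary()))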

All in all, I haven't spent a ton of time writing fuzz tests, but each one has caught some bugs, so I'm happy. A good tool to have in your toolkit.


At $website, we used to call that “swarm”. All features had to go through swarm before being released, and all product managers were made to participate in swarm.

Its demise was widely celebrated.


>Any issues we would describe in general terms to the development team -- we didn't want them to figure out our testcases

I’m sorry but this is just lol. Did the devs play back by creating bugs and seeing if your team could find them?



