What Test Engineers Do at Google: Building Test Infrastructure (googleblog.com)
289 points by kungfudoi on Nov 18, 2016 | 77 comments

Summary of the article: At Google scale, integration tests take too long because they trigger so many service calls. So the solution is to use static hard-coded mock data, but also to create tests that verify the mock data with real service calls. Previously, if you had N tests calling a REST service A, it would result in N calls to service A. Now those N tests go to hard-coded mock data, and there is a single test verifying the mock data by calling A. So the N tests can run fast and there will be only a single call to A to verify the mock.

However, that would be the best case, where the mock data is sufficient for all N tests. That is not likely to be true, but probably some number M of mocks can still cover all N tests. And M < N in most cases, so there are still savings in the number of service calls. In the worst case, N tests will need N mocks and it will be the same as before.
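A minimal Python sketch of the scheme summarized above, under stated assumptions: `real_service_a`, `MOCK_RESPONSES`, `mock_service_a`, and `greeting` are hypothetical names, and the "real" service is faked with a local function so the example runs on its own. It is not the tool from the article.

```python
# Hypothetical stand-in for the real service A; in practice this would be a network call.
def real_service_a(user_id):
    return {"id": user_id, "name": "user-%d" % user_id}

# Hard-coded mock data shared by the fast tests (the M mocks).
MOCK_RESPONSES = {
    1: {"id": 1, "name": "user-1"},
    2: {"id": 2, "name": "user-2"},
}

def mock_service_a(user_id):
    return MOCK_RESPONSES[user_id]

def greeting(service, user_id):
    """Code under test: it reaches service A only through the passed-in callable."""
    return "Hello, " + service(user_id)["name"]

def run_fast_tests():
    """The N tests: all run against the mock, with no calls to the real service."""
    assert greeting(mock_service_a, 1) == "Hello, user-1"
    assert greeting(mock_service_a, 2) == "Hello, user-2"

def verify_mock():
    """Verification test: replay each recorded request against the real service.

    Returns the keys of mock entries that no longer match the real answer.
    """
    return [uid for uid, expected in MOCK_RESPONSES.items()
            if real_service_a(uid) != expected]
```

Here `run_fast_tests()` makes no service calls at all, while `verify_mock()` makes one call per mock entry, so the cost of talking to A scales with M rather than N.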

In a large graph you would re-test the same part of the code over and over, because many top-level inputs are going to trigger the same code path deep down in the graph, which is wasteful. In a sense they memoize service call answers, so any tested code path only needs to run once.
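To illustrate the memoization analogy, here is a small Python sketch. The graph and node names are made up, and `functools.lru_cache` stands in for whatever caching the real system uses.

```python
from functools import lru_cache

CALLS = []  # records which nodes were actually executed

# Hypothetical dependency graph: two top-level entry points share deep dependencies.
GRAPH = {
    "frontend_a": ["auth", "storage"],
    "frontend_b": ["auth", "storage"],
    "auth": ["storage"],
    "storage": [],
}

@lru_cache(maxsize=None)
def evaluate(node):
    """Run a node once; later requests for the same node reuse the cached answer."""
    CALLS.append(node)
    return node + "(" + ",".join(evaluate(dep) for dep in GRAPH[node]) + ")"

evaluate("frontend_a")
evaluate("frontend_b")
```

After both top-level calls, `storage` and `auth` each appear in `CALLS` only once, even though every path through the graph reaches them.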

If they actually do this across their code they will never have to run "real service calls", because those services are also tested the same way. It's mocking all the way down.

Paraphrasing the famous quote: There are only two hard problems in CS: naming things and cache expiration.

An issue I see with this is that there's a potential window when tests pass with bad data. For example, tests using a mock and the mock is periodically verified with a service call. The mock could become bad data, but won't be marked as bad until the next service call verification. Until that happens, the tests using the mock will all pass. It's not clear from the article how they address this.

... and off-by-one errors.

Perhaps Google tracks the changes to the dependencies, and reruns tests against a real service when it changes.

That's how I would do it, at least.

If you have a task-oriented build system with dependencies (like Google does with Blaze), it'd fit right in.

Though depending on how you deploy, you may have old and new services at different versions. Assuming a level of backwards/forwards compatibility may be reasonable.

I imagine that if the service call to verify mock data fails, it should retroactively invalidate all tests based on that mock data, or at least this should be understood to be the case by the people reviewing test reports.
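One way the retroactive invalidation could look, as a rough Python sketch. The `Report` class and the mock names are hypothetical; the article does not say how (or whether) Google does this.

```python
from collections import defaultdict

class Report:
    """Tracks test outcomes together with the mocks each test relied on."""

    def __init__(self):
        self.status = {}                        # test name -> "pass" / "fail" / "invalidated"
        self.tests_by_mock = defaultdict(set)   # mock name -> tests that used it

    def record(self, test, passed, mocks):
        self.status[test] = "pass" if passed else "fail"
        for mock in mocks:
            self.tests_by_mock[mock].add(test)

    def mock_verification_failed(self, mock):
        """Downgrade passes that relied on a mock now known to be stale."""
        for test in self.tests_by_mock[mock]:
            if self.status[test] == "pass":
                self.status[test] = "invalidated"
```

A failed verification of `user_mock` would then flip every passing test recorded against `user_mock` to `invalidated`, while tests that used other mocks keep their status.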

The important part is that you are able to temporally decouple running the tests against the service mock, and the tests running against the real service. This sounds very similar to Consumer Driven Contracts, such as https://docs.pact.io/

You record your expectations towards the service mock, and then later verify that these expectations hold against the real service. This decoupling allows you to run your "end-to-end" test suite very often, say on every check-in, and then only verify the contract say every hour or day, depending on how stable your test environments are. The theory is that this gives quicker feedback, reduces flakiness, and avoids a combinatorial explosion in the number of tests that need to run.

I have always thought this is how mocks really should work - basically a two way stub.

Mocks are usually (inevitably?) a per-test structure, which makes them outrageously fragile and makes it tempting to over-mock to get a passing test.

I prefer building something that acts like the mocked thing (i.e. a database) but does so at both ends (it looks like a database to the web service and a web service to the database). Then the stub is my contract.

It's awkward

> In the worst case, N tests will need N mocks and it will be the same as before.

Well it's still better because you only have to run the services you are actually testing right?

Yes, though presumably a few extra services aren't much overhead. The transitive closure of services is.

> In the worst case, N tests will need N mocks and it will be the same as before.

This very much neglects the probably significant overhead of writing, maintaining, understanding the M mocks on top of the base of N tests, but since, as you said, this is "at Google scale" the trade-off seemingly becomes worthwhile

This should be at the end of every article.


I am writing a book on software development (current title "The Software Mind") and I am more and more concluding that the vast majority of modern software development is scaffolding around the actual thing you want to do.

Writing tests is not something you knock up during coding, it needs infrastructure - a test suite, some test data, a Jenkins server and then whoops we need better test data and ...

This touches that feeling - I guess I need to write it down a few more times ...

It seems a bit like that Kent Beck quote, "make the change easy (warning: this may be hard), then make the easy change."

This is absolutely true. It's an observation many of us have had. There are people that are trying to develop startups to make it easier, but the combination is hard.

I think the next successful framework will focus on the entire lifecycle and aim to solve the scaffolding problem.

Maybe the language/framework/GUI could generate or require unit/integration/system tests (with just a few clicks to confirm or correct an assumption), and visualize in broad strokes which scenarios or input ranges are "green" yet.

That would mean a lot more infrastructure (sorry if that wasn't what you meant), but automatically provided or accepted by everyone involved.

In my experience end-to-end scenarios are often difficult to write tests for. It's been difficult to maintain frameworks that model arbitrary asynchronous end-to-end tests; the test frameworks often tightly mirror the use cases. I haven't really seen any good tools that allow non-technical testers to write complex end-to-end tests. This is a shameless plug for a project that is hardly more than an idea, but to address this, I am trying to create a flexible end-to-end test framework that deftly handles asynchronous state transitions and allows non-technical people to write tests in a declarative way. It has not even reached functional proof-of-concept state yet :(


The goal is to model end-to-end tests as state machines and define conditions to trigger transitions. There should be a code-based interface allowing for arbitrary transitions using any Go library, and a higher-level interface offering a handful of operations so that non-technical testers can model end-to-end tests.
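A toy sketch of the state-machine idea, written in Python rather than Go to keep it short. The function `run_machine`, the checkout states, and the event shapes are all invented for illustration and are not taken from the project described above.

```python
def run_machine(transitions, start, goal, events):
    """Feed events through the machine and return the list of states visited.

    `transitions` maps a state to a list of (condition, next_state) pairs.
    Events that match no condition in the current state are ignored.
    """
    state = start
    visited = [state]
    for event in events:
        for condition, next_state in transitions.get(state, []):
            if condition(event):
                state = next_state
                visited.append(state)
                break
        if state == goal:
            break
    return visited

# Hypothetical checkout flow; unrelated events may arrive in between.
TRANSITIONS = {
    "start": [(lambda e: e["type"] == "order_placed", "ordered")],
    "ordered": [(lambda e: e["type"] == "payment_ok", "paid")],
    "paid": [(lambda e: e["type"] == "email_sent", "done")],
}

EVENTS = [
    {"type": "order_placed"},
    {"type": "heartbeat"},
    {"type": "payment_ok"},
    {"type": "email_sent"},
]

visited = run_machine(TRANSITIONS, "start", "done", EVENTS)
```

The test passes if the last visited state is the goal; the `heartbeat` event is skipped because no transition out of `ordered` matches it.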

I wonder if such a process could be done without a "non-technical friendly" DSL. Teaching people to speak to computers through declarative languages is programming in one shape or form no matter what (IMO).

I wonder if you could record a user action, and add variance to the recording as a test? Break a recording into ~5 actions and then perform those actions at various speeds and points of repetition.

A recording of

1) Move mouse from A to B

2) Click button

Produces a range of tests including

1) Move mouse from A to B

2) Click button 3 times
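A quick Python sketch of generating such variations from a recording. The `variations` function and its `repeats` and `speeds` parameters are hypothetical; a real tool would replay the scripts against a UI instead of just listing them.

```python
from itertools import product

def variations(recording, repeats=(1, 3), speeds=(1.0, 0.5)):
    """Yield test scripts derived from one recording.

    Each variation repeats one step `n` times and plays the script at `speed`.
    Duplicate scripts (e.g. the unmodified recording) are emitted only once per speed.
    """
    seen = set()
    for index, n, speed in product(range(len(recording)), repeats, speeds):
        script = []
        for i, action in enumerate(recording):
            script.extend([action] * (n if i == index else 1))
        key = (speed, tuple(script))
        if key not in seen:
            seen.add(key)
            yield {"speed": speed, "actions": script}

RECORDING = ["move mouse from A to B", "click button"]
ALL = list(variations(RECORDING))
```

With the defaults, the two-step recording above expands into six scripts, one of which is "move mouse from A to B" followed by three clicks.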

This is basically what I did with http://hitchtest.com/

The declarative language is a readable YAML, which translates each named step into a python function call with arguments (e.g. - click: submit button).
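Roughly how that kind of step-to-function mapping can work, as a Python sketch. This is not hitch's actual implementation: the `Engine` class and its methods are made up, and the story is written as the data structure a YAML parser would produce so the example needs no third-party library.

```python
class Engine:
    """Hypothetical engine: each step name maps to a method of the same name."""

    def __init__(self):
        self.log = []

    def click(self, target):
        self.log.append("clicked " + target)

    def fill_form(self, field, value):
        self.log.append("filled %s with %s" % (field, value))

def run_story(engine, steps):
    for step in steps:
        (name, args), = step.items()          # each step is a single-key mapping
        method = getattr(engine, name.replace(" ", "_"))
        if isinstance(args, dict):
            method(**args)                    # mapping -> keyword arguments
        else:
            method(args)                      # scalar -> single positional argument

# What a YAML parser would produce for:
#   - fill form: {field: email, value: a@example.com}
#   - click: submit button
STORY = [
    {"fill form": {"field": "email", "value": "a@example.com"}},
    {"click": "submit button"},
]

engine = Engine()
run_story(engine, STORY)
```

A step whose name has no matching method raises `AttributeError`, which is a cheap way to catch typos in a story.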

The asynchronous part is 'under the hood' where it runs services which log lines of JSON. Those are watched using epoll triggers and parsed into objects which can be verified by one of the steps (e.g. check email containing name of user is sent).

I wouldn't say that it's necessarily possible nor desirable for a non-programmer to write the stories using YAML, however. Even with a perfect declarative language you need a mindset that requires precise thinking in order to write executable stories and that usually means programming ability. Stories also need refactoring as much as code does, alongside engine code.

However, the YAML ought to be useful as a way programmers can collaboratively write and refine stories with product owners/managers and should serve as a useful way to communicate back changes to the product that a non-programmer can understand.

Sounds very interesting. Could you maybe do a blogpost and explain how this is supposed to work?

End-to-end testing is super important, but still hard for a lot of people/companies/startups.

Using a stub- or mock-based approach can only go so far, despite it being less "brittle" than end-to-end tests. That is why our team built PANIC, which is a distributed testing framework for the every-programmer ( https://github.com/gundb/panic-server ). Everybody should be able to do tests like this, not just the Googles.

end-to-end tests are only brittle if your end-to-end-system is brittle. That's not a problem with the tests.

It's the natural thing for services to be brittle. We're all human after all: a mistake in recruitment, hiring, training, scoping, managing, documenting, or etc. can cause an engineer to make the mistakes needed to make a brittle service you might rely on and that's something I think we can all agree is something you just need to accept as "going to happen."

In that case it's worth dealing with it as it is not how it should be.

Not quite, the tests themselves are another layer in the call stack.

All things being equal,

A depends on B depends on C

is more brittle than

A depends on B

So if you have

Tests depends on A depends on B

It's the same as if you added a layer of complexity :)
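To put rough numbers on this, assume (purely for illustration) that each layer works independently with the same probability. Then the chance that the whole chain works is the product of the per-layer values, which this Python sketch computes. The 0.99 figure is an arbitrary assumption.

```python
def chain_reliability(layer_reliabilities):
    """Probability that every layer in a dependency chain works,
    assuming the layers fail independently of each other."""
    result = 1.0
    for p in layer_reliabilities:
        result *= p
    return result

two_layers = chain_reliability([0.99, 0.99])          # A depends on B
three_layers = chain_reliability([0.99, 0.99, 0.99])  # Tests depend on A depends on B
```

Under that assumption the two-layer chain works about 98.0% of the time and the three-layer chain about 97.0%, so each added layer, tests included, costs a little reliability.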

This is my first exposure to GunDB.

Do you really do conflict resolution via alphabetic sorting?

Lexical sort is the last resort of the CRDT, not the first. It is necessary as a worst-case scenario though, because it is purely deterministic and guarantees convergence.
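As a generic illustration of a lexical tie-break (this is not GUN's actual algorithm, just a simple last-writer-wins register with made-up names), a Python sketch:

```python
def merge(a, b):
    """Merge two (timestamp, value) writes deterministically.

    Prefer the later timestamp; only when timestamps tie, fall back to
    comparing the values lexically so every replica picks the same winner.
    """
    if a[0] != b[0]:
        return a if a[0] > b[0] else b
    return a if str(a[1]) >= str(b[1]) else b
```

Because the result does not depend on argument order, two replicas that see the same pair of conflicting writes in opposite orders still end up with the same value.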

Also, you can have an append-only log where you can run your own conflict resolution algorithm or let manual/human resolution occur on top of it. So all in all it provides the most flexible foundation to build other things from, while still guaranteeing strong eventual consistency.

Testing all edge cases is important to us, which is why we built PANIC to verify our algorithms. Hopefully we'll be posting more results soon!

At our company developers rarely write tests. They rely on the QA for testing. Obviously it doesn't work, like, at all. I think a key aspect of this post is to highlight the fact that testing is in fine the developers' responsibility. Test engineers are here to make it easier for them to write tests.

There are levels and levels at which companies engage in Q.A.

1. There is none, and it's awful, once there are more than two developers.

2. There are dedicated QA people, who click on things in a more or less formalized fashion depending on the company, and sign off on releases. This is "pretty bad" at most places once you have a product that can't be easily verified by a couple of people in a few hours

3. You have dedicated Test Engineers, who are just software developers who write test automation. This can be good, but usually ends up with mid-low quality engineers filling this role for a variety of reasons.

4. Test Engineers are just Software Engineers who specialize in test-automation work. They write automation in high bug areas, and keep a larger overview of the product (set) to identify problematic areas and work with the product developers to solve those problems.

5. The article. Test Engineers write infrastructure for testing and reporting, (and possibly proof-of-concepts). Product developers are responsible for their own quality. Test Engineers may also act as consultants for product teams.

It's probably worth pointing out that 2-5 are all reasonable places to be for different sized companies, at different stages. We shouldn't assume that everyone should be at a company that is 'like Google', since things that make sense at Google scale may not make sense for your 20 developer startup.

edited for formatting only

#5 is all very well, but I don't think helping devs write tests is a substitute for actually having QA people. Testers should have a mentality of "I want to break this", and like it or not, the developers just won't think like good testers.

Do Test Engineers at Google do this kind of work as well?

To add another question from an SRE (SWE), tools and infrastructure is a subset of my daily responsibilities (60%). Reading the couple of comments it's hard to see where the line falls between these two roles. I'm on a smaller team so perhaps I just find myself handling both roles or.. ?

Well said, I've been a software developer in test for several years and number 5 is definitely where you want to get to if you have the resources. Very few do from my experience but when they do it's really beautiful and I'm not surprised Google is one that manages to do this. Half my job has become evangelism to get to this point now, half is development.

Also (half important point, half plug) solid test results reporting is underestimated in importance and generally not done well. Doing that right can boost engagement with testing. A reason that Tesults (https://www.tesults.com) exists and I'm involved with it.

I think scale has a lot to do with it. Google has enough money, know-how, and headcount to invest in good testing technology. Smaller companies are probably lacking in at least one of those categories. For example at my last company hiring QA interns was cheaper and faster than developing good testing environments. Of course it started to come back to bite them but at the time it was hard to construct an argument for anything else.

> At our company developers rarely write tests.


You could like, be the guy who writes a few unit tests.

Is Test Engineer different from SETI (Tools and Infrastructure)? What does SETI do?

Search for great tools and engineers on far away, exotic planets :-)

This begs the question: can we SETI at home?

I believe SETI is the new name for what was formerly Test Engineer. I think they changed the name because it was a better description of what they do. I suspect it also made it easier to hire for, although that might just be due to the aforementioned reason.

This is based off what I was told when I interviewed there a year ago.

SETIs are what SETs used to be. TEs are different. So SETI == SET but SETI != TE and SET != TE

I appreciate the correction and apologize for misinforming people. Unfortunately I can no longer edit my original comment

Yeah no problem. Just wanted to clarify. The vocabulary is confusing.

From what I understand, SEs in Test are now called SETI, but I could be wrong. It sounds like SETI does all the things the test engineers do in Google (I interviewed with SETI a half-year ago).

Yes, they are. That's why we started this little series of posts to highlight the difference. There already are some articles on SETI (aka SET). For example https://testing.googleblog.com/2016/03/from-qa-to-engineerin...

While it took a day to build a proof-of-concept for this approach, it took me and another engineer a year to implement a finished tool developers could use


I believe you are missing a link to "This!"

I gave a presentation called Testable Software Architecture [1] a week ago with very similar recommendations to this article.

Decoupling is essential in order to have fast, maintainable tests that give you confidence to deploy continuously.

My favourite technique for this is a ports & adapters architecture where we plug in fake adapters for the majority of the tests. We then use contract tests to be confident that the fakes behave the same as the real services.
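A compact Python sketch of a fake adapter plus a shared contract test. The user-store port, both adapter classes, and the file-backed "real" adapter are hypothetical stand-ins, not material from the presentation.

```python
import json
import os
import tempfile

class InMemoryUserStore:
    """Fake adapter, used by the majority of tests."""

    def __init__(self):
        self._users = {}

    def save(self, user_id, name):
        self._users[str(user_id)] = name

    def load(self, user_id):
        return self._users.get(str(user_id))

class FileUserStore:
    """Stand-in for the 'real' adapter; persists users as JSON on disk."""

    def __init__(self, path):
        self._path = path

    def _read(self):
        if not os.path.exists(self._path):
            return {}
        with open(self._path) as f:
            return json.load(f)

    def save(self, user_id, name):
        users = self._read()
        users[str(user_id)] = name
        with open(self._path, "w") as f:
            json.dump(users, f)

    def load(self, user_id):
        return self._read().get(str(user_id))

def check_user_store_contract(store):
    """Contract test: every adapter for the port must pass the same checks."""
    assert store.load(1) is None
    store.save(1, "alice")
    assert store.load(1) == "alice"
    store.save(1, "bob")          # saving again overwrites
    assert store.load(1) == "bob"

check_user_store_contract(InMemoryUserStore())
with tempfile.TemporaryDirectory() as tmp:
    check_user_store_contract(FileUserStore(os.path.join(tmp, "users.json")))
```

The fast suite only ever touches `InMemoryUserStore`; running the same contract function against the real adapter is what justifies trusting the fake.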

[1] https://skillsmatter.com/skillscasts/8567-testable-software-...

Funny this should appear because I just started at a company today where a big part of my job will be to help the team work with a big messy monolith of code that is difficult to test.

Can anyone here recommend further reading on this particular problem?

The book "Working Effectively with Legacy Code" by Michael Feathers.

Seconded. Some of it feels a bit dated, and it doesn't cover tools you can use like PowerMock if your codebase is Java and you really just need to get a piece of hairy code under coverage as quick as you can without having to refactor it all to be testable the 'normal' or 'clean' way. Nor does it cover refactoring for a functional and immutable approach (its approach is much more OO). Still a great resource to have on hand and be inspired by.

The most important thing is to follow its advice and just start doing it little by little, even small refactors of other people's code. A friend's company had a Book Club and some testing book got on the list at one point, and as they were discussing it they kept having arguments over how effective various things were, so they resolved the arguments by starting a Testing Club where every week everyone would put in at least an hour into getting some part of the system under test, or trying out a testing technique, and discussing it. Over time they got most of the product under test and fixed a lot of previously unknown bugs.

This might not exist, but could you recommend a text that _does_ cover refactoring for a functional and immutable approach? I would be interested in reading something like that.

Some feedback for the OP:

1. The tester and another team member spent a year developing something that would intercept calls and relay them. Two problems with that: (1) two person years spent, (2) and that sounds like serious NIH (not invented here) syndrome. The problem that should have been solved was everyone spending the time to write better tests and changing code as needed. Instead, they spent a year on a workaround, invented in-house. Was there not anything else out there that did this?

2. The word is "focused", not "focussed".

3. Lack of detail: how exactly does it work beyond that basic diagram?

4. Where's the code for the project? Would it be useful to others?

However, I admire that the OP posted their experience, and it is useful information.

I believe "focussed" is acceptable, although "focused" is preferred in the US.[1][2]

Can you explain why this sounds like serious NIH syndrome? It looks like they built a system to cache service requests on top of an existing test framework. It seems specific enough that there might not be an existing method that fit well enough. The article is a bit light on details though, so I suppose it's hard to tell.

[1] https://en.wiktionary.org/wiki/focussed [2] http://dictionary.cambridge.org/dictionary/english/focused

OP here ;-) 3 and 4: The article isn't at all about the tool we built, it's supposed to shine a light on the different kinds of tasks that Test Engineers do. Think of it as not very subtle job advertising.

1. The idea was definitely out there (some other commenter posted a link to pacts, which were a strong influence). Part of the process (and I hope this comes out in the article) was to try to find good ways to write better tests. We couldn't really (long story, let's just say "legacy code" to summarize it). So in the end we went for this technique. The whole process took a year to do, and as far as NIH goes, the existing implementations do not work in Google's setting. So we had to roll our own implementation. That did not take very long, though.

2 man years is nothing at Google scale.

You're asking that every engineer at Google spend more time writing better tests? What does the math end up saying there? That if each engineer spent more than X minutes per year writing better tests, there would be an ROI. Where X ends up being a comically low number.

And you'd still probably want a caching system at scale anyway, esp. given a monorepo. Think compute hours saved.

NIHS, as you're calling this, can often make sense at scale.


We've asked you already not to post unsubstantively, and you've continued to abuse the site by commenting primarily ideologically. That's not what this site is for, so we've banned this account.

This is a tech-focused link aggregator. Google's a pretty big player in the tech industry, so it seems natural that at some point there may be many Google-related articles about.

You would know this isn't "fake news" if you spent the time it took to write your post to visit the link instead.

Could this be one of the most meaningless and trite jobs whilst still maximizing prestige? How many people will work this job in despair? How many intelligent people will spend years coding the obscurity that is test infra?

Please stop posting unsubstantive comments to HN.

He/she did address an interesting and touchy subject of software development. It could have been more diplomatic, but then also, I'd say, needlessly watered down.

A challenge in software development has long been the division between test and development.

You could do as Microsoft recently did (more or less abolished "test").. the jury is still out on whether that is what has caused the recent Windows 10 quality issues.

Or you could keep a tester/developer separation. Good luck trying to recruit top (e.g. your testers should on average be as smart as your developers) people for the tester positions unless you are Google/Facebook.

Either way, I think this is a really interesting issue.

It should be noted: some pieces of software are a lot easier to test by its developer (say, a compiler) than others (say a GUI).

If a person has a good point, it's not like it's hard to abide by the rules here. All it requires is the intention to. Your comment does a fine job.

A glance at a few (edit: random!) users' comment histories is enough to see that this intention (either way) is mostly orthogonal to the topics at hand.

The person is very depressed.


I'm just more interested in calling out drudgery, propaganda, forced fanaticism of boring topics, and adding color

I've learned that on this forum you need to sugarcoat it.

Adults call it "diplomacy". I'll leave it to you to devise your own lexicon to refer to those that use the phrase (preferably with a sneer) "have to sugarcoat everything".

See also: "I'm just saying what everyone else is thinking."

There's plenty of room to be civil without 'sugarcoating'.

MS abolished manual testing (STE) positions prior to Windows Vista being released, somewhere around beta 2 or rc1. Vista probably suffered from this but Windows 7 didn't.

Just a note:

I was referring to both people doing manual testing and doing development of automated testing in the case of Microsoft.

My understanding (from the outside) is that a substantial number of people at Microsoft who worked in those areas were made redundant. From what I could gather the goal was to make the majority of testing automated and have the self-test systems be developed by the developers of the respective subsystems themselves.

All of the weird regressions I've seen with Win 10 myself (and read about many people experiencing) match that story.

My feeling is that with something as complex as a desktop OS that needs to work with a) the history of Windows releases, b) the history of Windows apps, c) the entire, insanely big spectrum of PC hardware released the past 5-15 years or so, you do need an army of relatively highly competent people willing to do lots and lots of manual testing over and over and over and over and over ... again. And of course lots of people to build automated systems.. but you can't really get away from the manual aspect very easily.

No they didn't. Source: I joined MS after Vista was released and worked with manual testers.

Did they have the STE title or SDET? When this happened it wasn't like there was 100% automation in place, so developers, architects, SDETs, had to pick up the slack. I personally got stuck verifying bug fixes for our components (kernel stuff and supporting user mode services) in various languages I don't speak (particularly fun for left to right languages) just to get the bug count down. That team had something like 20 STEs that got the axe post beta 2.

No idea what their titles were but they worked entirely running manual tests. I ran some manual tests myself as an SDET but this was different.

I don't think a comment is unsubstantial just because it's potentially inflammatory.

I think it's not because of potential but because it was intended to be inflammatory. You can discuss most controversial and inflammatory subjects by applying some diplomacy and tact.

No, SETI isn't QA or "Test Engineering" at other companies. SETI at Google is like any other infra team, except they specialize in testing infrastructure. All devs write their own tests.

You can think of it like the engineers behind rspec vs people who use it. Or the engineers behind Selenium, etc. They're engineers first and foremost, and while you're free to have the opinion that this line of engineering is "meaningless and trite", at Google SWEs really appreciate the tools SETI teams build.

If taken seriously, writing test code can be both a technical challenge in its own right, and can also fold in a bit of the fun that hacking systems can offer. You're out to break things rather than exploit them, but it can be similarly fun.

I think there's a certain amount of recursiveness in the way we look down our nose at test code; we don't like it, so we write code that isn't easy to test, which makes it even harder to write code that is testable in the next composition layer of the system. Repeat a few times and the test code gets pretty horrifying. But somebody needs to crack that nut.

"Let he who is without sin^H^H^Hbugs cast the first stone."

Developers ought to be appreciative that there are people willing to do QA and test infrastructure for them.

(also, speaking as a SETI at Google, I've never written a test for someone else's code. I build tools that make testing easier. I write tests for my own code, because I'm building applications that are complex enough to require tests.)

Tools and infrastructure SWEs at the bigger companies get paid as much as regular SWEs

Without giving away too much detail, I was switched from a "tools and infrastructure SWE" (quoting you, not the actual job title) to a regular SWE halfway through the interview process at a top company. The base salary stayed the same but the stock grant got a lot bigger and I was treated better in other ways. I also got a SWE offer from the other top company right before the change occurred, so these factors are entangled and I can't establish causality, but I feel the recruiter considered me higher priority as a standard SWE candidate and told me about the change like it was great news.
