The 2002 mandate for internal communication systems at Amazon (sametab.com)
634 points by anacleto 22 days ago | 176 comments

Yegge's post was very interesting reading, and I took similar lessons away from it. I was at Amazon at the time, however, and there were things that certainly weren't true any more:

>3) There will be no other form of interprocess communication allowed: no direct linking, no direct reads of another team’s data store, no shared-memory model, no back-doors whatsoever. The only communication allowed is via service interface calls over the network.

API first... except for a certain number of new services that somehow managed to get away with not presenting an API, even though an API would make every service team's life easier.

> 5) All service interfaces, without exception, must be designed from the ground up to be externalizable. That is to say, the team must plan and design to be able to expose the interface to developers in the outside world. No exceptions.

Except, similar to above, where teams apparently decided they didn't want to think that way at all and management just let them.

> 6) Anyone who doesn’t do this will be fired.

Unless your exception is perceived as providing value to the company. Then you'll get lauded, and everyone is told they'll need to use your JS-laden, web-only interface, and to hell with any automation.

Mostly those exceptions just reinforced in my mind how right the Bezos email Yegge paraphrased actually was.

(Former Amazonian, part of the team that drove the change to SOA at the time)

> Now I think that this internal email is what has actually stuck with me the most. Bezos realized that he had to change the internal communication infrastructure [..].

> He understood that a radical organizational change was required to arrange the internal dynamics in a way that would allow the creation of something like AWS.

This is quite a strong and opinionated statement. I'd agree that Jeff made this change to improve communication infrastructure for Amazon.com. I wouldn't agree that he made it to enable AWS - for two reasons.

First, Amazon started thinking about EC2 and S3 way way later than this. Second, Kindle and SimpleDB (now Dynamo) were similar independent bets. These projects were happening at the time of moving to SOA, but they weren't tied to it.

The one thing common across all of these products is - making informed bets on market fundamentals and enabling teams to deliver on them.

Now, I know of a few fairly senior Amazonians who read those forums. They have more context than I do. So, if they chime in and say that Jeff sent this email to enable AWS, I'll gladly take it. :)

Until then, I wouldn't extrapolate Steve Yegge's post to mean something about Jeff B.'s intentions, and build an overarching argument over it.

(Another former Amazonian from the same team)

ozgune is correct as far as I remember.

The beginnings of AWS, though, were for the retail site. We exposed search and browse and item metadata via an API first for a deal with AOL to provide them with product search and then later we opened it up to everyone. People built stuff like Simple Amazon on top of it, which I thought was pretty nice. I don't think this part of AWS exists anymore (please correct me if I'm wrong).

The first service exposed as what we'd today consider AWS was SQS and at the time it was kind of a head scratcher. Only later did we understand that it was glue for other services.

S3 and EC2 were quite a bit later I think.

I thought the first one was Mechanical Turk.

> Second, Kindle and SimpleDB (now Dynamo)

Minor nitpick: these are two separate services. SDB is still independently accessible and probably will be for a while. That being said, they definitely don't encourage any new use of SDB and push Dynamo instead.

Kind of a shame because a modern NoSQL store with SQL support would be great for rapid prototyping before a concrete data schema is established.

DynamoDB is the spiritual successor of SDB.

SDB is built on the same principles outlined in the Dynamo paper: https://www.allthingsdistributed.com/files/amazon-dynamo-sos...

I actually really liked Simple DB as a place for configuration.

whole-heartedly agree

I was under the impression that before AWS, Amazon SOA'd almost the entirety of its departments. As in the OP's mail, every dept had to expose web services to other depts, creating an environment where every dept's capabilities could be consumed as services.

Between 2000 and 2003, Amazon started to learn how to properly build something that resembled AWS (launched in 2006); it was a necessity driven by scalability. What Yegge's letter meant is that Amazon was applying SOA to all its depts, making them useful both internally and externally.

>"5) All service interfaces, without exception, must be designed from the ground up to be externalizable. That is to say, the team must plan and design to be able to expose the interface to developers in the outside world."

So you see, it wasn't to enable AWS as the digital service we know it as today; it was to build a scalable company that could use everything it had both to serve its own needs and to sell it as a product (later, all the services we know: EC2, etc.). Bezos applied SOA to the company itself, not only to its software. And this corporate move, without precedent, enabled not only AWS but the whole of Amazon as it is today.

>> making informed bets on market fundamentals and enabling teams to deliver on them.

Given the recent 20,000 startup ideas post, this is the perfect algorithm to cut through the noise

How do services talk to each other without an API? Is it something like "put a non-well-documented object into a queue?"

A queue if you're lucky!

There's also:

- Hire an intern / "Customer Service Representative" / "Technical Account Specialist" to manually copy data from one service into another

- Dump some file in a directory and hope something is treating that directory like a queue

- Read/write from the same database (/ same table)

Or the classic Unix trajectory of increasingly bad service communication:

- Read/write from the same local socket and (hopefully) same raw memory layouts (i.e. C structs) (because you've just taken your existing serialized process and begun fork()ing workers)

- that, but with some mmap'd region (because the next team of developers doesn't know how to select())

- that, but with a local file (because the next team of developers doesn't know how to mmap())

- that, but with some NFS file (for scaling!)

- that, but with some hadoop fs file (for big data!)

Obviously all of these are at some level an 'application programming interface'. But then, technically so is rowhammering the data you want into the next job.
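For anyone who hasn't seen it in the wild, the "dump a file in a directory and hope" pattern from the list above looks roughly like this. A minimal Python sketch; the helper names are made up:

```python
import os
import tempfile

def enqueue(queue_dir, name, payload):
    """Write to a temp file first, then atomically rename it into the
    queue directory, so a reader never sees a half-written message."""
    fd, tmp_path = tempfile.mkstemp(dir=queue_dir)
    with os.fdopen(fd, "w") as f:
        f.write(payload)
    os.rename(tmp_path, os.path.join(queue_dir, name))

def drain(queue_dir):
    """Consume every message currently present, lowest filename first.
    Note everything this omits: acks, retries, crash recovery, locking."""
    messages = []
    for name in sorted(os.listdir(queue_dir)):
        path = os.path.join(queue_dir, name)
        with open(path) as f:
            messages.append(f.read())
        os.remove(path)
    return messages
```

Even the shops doing this by accident usually rediscover the atomic-rename trick the hard way, after the first half-written file gets consumed.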

Don't forget the most important step.

"Think of the acronym CSV. Don't look up the definition of the format, just meditate on the idea of the format for a bit. Then write your data in the format you have just imagined is CSV, making whatever choices you feel personally best or most elegant regarding character escapes. Pass this file on to your downstream readers, assuring them it is a CSV file, without elaborating on how you have redefined that."

"Comma separated values? But my data has commas in it! Ah, I know, I'll use tabs instead, I've never seen a user put a tab so that'll work perfectly forever and definitely won't cause a huge fucking mess for the poor bastard who has to try and decipher this steaming pile."

Just use ASCII 1E and 1F.
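For the curious, that suggestion in Python. A sketch that assumes the data itself never contains these control characters, which is the whole point of using them:

```python
RS = "\x1e"  # ASCII 30, record separator: delimits rows
US = "\x1f"  # ASCII 31, unit separator: delimits fields within a row

def dump(rows):
    # Commas, tabs, and newlines in the data all pass through untouched;
    # there are no quoting or escaping rules to get wrong.
    return RS.join(US.join(row) for row in rows)

def load(blob):
    return [row.split(US) for row in blob.split(RS)]
```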

This is awesome, where does it come from? Google does not give me anything.

The quotation marks are stylistic rather than for attribution.

My personal experience comes from ingesting product feeds from online stores. Misapplication of \ from other encodings was the most common sin, but I'm pretty sure I saw about three dozen others, from double-comma to null-terminated strings to re-encoding offending characters as hex escapes. (And, of course, TSV files called CSV files, with the same suite of problems.)

(Source: I work in a software company)

On-hand work experience.

> rowhammering the data you want into the next job

This wouldn't quite fit the "obfuscated C" contest, but I feel like there should be a prize for a system that does useful work this way.

> Dump some file in a directory and hope something is treating that directory like a queue

Or it's unencrypted files on an FTP server containing literally the lifeblood of the American economy: https://engineering.gusto.com/how-ach-works-a-developer-pers... -_-

ACH is not the lifeblood of the American economy. If you removed ACH there would be unimaginable disruption, but I can't see that economic activity would completely cease. I think the lifeblood of the American economy is our population.

> Read/write from the same database

This gets abused even within one service. If I could get my coworkers to FFS stop using rows in a database as a degenerate kind of communications channel between components (with "recipients" slow-polling for rows that indicate something for them to do), I'd be a lot happier.
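For anyone lucky enough not to have seen it, the rows-as-channel pattern looks something like this. A SQLite sketch; the table and helper names are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE tasks (
    id INTEGER PRIMARY KEY,
    recipient TEXT,
    payload TEXT,
    done INTEGER DEFAULT 0)""")

def send(recipient, payload):
    # The "sender" component just inserts a row addressed to the other side.
    conn.execute("INSERT INTO tasks (recipient, payload) VALUES (?, ?)",
                 (recipient, payload))

def poll(recipient):
    """The "recipient" component wakes up on a timer and scans for rows
    addressed to it: the degenerate channel described above."""
    rows = conn.execute(
        "SELECT id, payload FROM tasks WHERE recipient = ? AND done = 0",
        (recipient,)).fetchall()
    for row_id, _ in rows:
        conn.execute("UPDATE tasks SET done = 1 WHERE id = ?", (row_id,))
    return [payload for _, payload in rows]
```

All the latency of a queue, none of the delivery guarantees, plus a table that grows forever.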

> rowhammering the data you want into the next job.

I hope the aforementioned coworkers don't read HN.

Some methods I've commonly seen in Enterprise Duct Tape:

Screen scrape the other service and do data exchange via a Selenium script.

Directly interact with the other service's database.

CSV files and nightly batch jobs.

Oh damn you just reminded me. They tried to bring in "Robotics" (e.g. Blue Prism) here. For in-house apps.

Apparently it was so hard to deal with the developers that instead of exposing an API they would automate clicking around on Internet Explorer browser windows.

Thankfully I haven't heard much about it lately.

I raise you a service which gets its configuration from a table on a Confluence page.

It's funny to think about but the reality is that it's better than a lot of other options.

- The page has automatic history & merge conflict resolution

- There's built-in role-based security to control both visibility and actions (read-only vs. read-write)

- You can work on a draft and then "deploy" by publishing your changes

- You can respond to hooks/notifications when the page is updated

Considering that the alternative is either editing a text file on disk or teaching business users to use git, a Confluence page is not so bad.
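To make the idea concrete: assuming you've already fetched the page body (e.g. via Confluence's REST API with the storage-format expansion), turning a two-column table into a config dict is only a few lines. A Python sketch with a hypothetical key/value table layout:

```python
import xml.etree.ElementTree as ET

def table_to_config(xhtml_table):
    """Parse a two-column XHTML table into a dict.
    Assumes the first row is a header and each body row is key | value."""
    root = ET.fromstring(xhtml_table)
    rows = root.iter("tr")
    next(rows)  # skip the header row
    config = {}
    for tr in rows:
        key_cell, value_cell = list(tr)[:2]
        # itertext() flattens any inline markup inside the cells
        key = "".join(key_cell.itertext()).strip()
        value = "".join(value_cell.itertext()).strip()
        config[key] = value
    return config
```

Since Confluence's storage format is well-formed XML, a standard-library parser is enough; no scraping heroics required.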

As crazy as that may seem on the face of it, it’s actually kind of genius for merging the roles of those who need to configure / don’t get git, and those who need to develop against the configuration.

Confluence is at least canonically XHTML so this is better than a lot of data lake bullshit I've seen.

DAAC - Documentation As A Configuration

and I thought the screen scraping / Selenium solution was wack! Wow!

This is one of the niches where (S)FTP and batch processing is still alive and kicking.

Yeah, SFTP + CSV file is still the standard for enterprise software.

The problem is that these kinds of things have to be built to the lowest common denominator, which is usually the customer anyway. The customer in enterprise software is usually not a tech company; they typically have outdated IT policies and less skilled developers than a pure tech company would have. Even if the developers are capable of doing something like interacting with a queue, they also need to be supported by a technology organization which can deal with that type of interaction.

Sometimes you get lucky and someone in the past has pushed for that kind of modernization. Or your project really won't work without a more advanced interaction model and you have someone in the organization willing to go to bat for a tech enhancement.

But otherwise the default is "Control-M job to consume/produce a CSV file from/onto an SFTP"

My experience is that the usual reason for RPC-over-SFTP is that it's the only thing corporate IT security does not control and thus cannot make inflexible. Adding another SOAP/JSON/whatever endpoint tends to be a multiyear project, while creating a new directory on a shared SFTP server lets you implement the functionality in a few hours.

It's also quite common in fixed data-logging applications, such as exports from a BMS.

Directly interacting with another service's DB is an "Enterprise Integration Pattern" AFAIR? It can make sense in lots of cases.

Shared database is, in fact, a classic enterprise integration pattern, and much of classic relational database engineering assumes that multiple applications will share the same database. In the classic ideal, each would use it through a set of (read/write, as needed) application-specific views, so as to avoid exposing implementation details of the base tables and to permit each application, and the base database, to change with minimum disruption to any of the others.
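The classic ideal described above can be sketched in a few lines of SQLite. Table, view, and column names here are invented for illustration:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer TEXT,
    total_cents INTEGER,
    internal_notes TEXT)""")

# Each consuming application gets a view tailored to what it may see,
# so the base table can evolve without breaking every consumer at once.
db.execute("""CREATE VIEW billing_orders AS
    SELECT id, customer, total_cents FROM orders""")
db.execute("""CREATE VIEW reporting_orders AS
    SELECT id, total_cents FROM orders""")

db.execute("INSERT INTO orders VALUES (1, 'alice', 4200, 'handle with care')")
```

Reporting never sees `customer` or `internal_notes`, and renaming a base column only requires updating the view definitions, not every consumer.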

Reading data from another application's database is pretty common (although even this can cause chaos if done without some care) but writing to application databases is often a very bad idea and often explicitly forbidden by CRM/ERP vendors.

Calling parts of your application "services" means that you're already thinking in terms of the "services/API" metaphor. If you're not using services, you might be, for example:

- Building one large application (monolith). Parts of the application communicate with each other via function calls. Everything runs in one large process. You can go quite a long way with this approach, especially for parts of the application that are stateless. (You can also build components of the application using a service/client metaphor within the process as well.)

- Multiple separate applications might communicate with the same database, file system, or some other data store. Before we had distinct distributed systems components taking on the role of queues, event buses, and things like that, it was common to represent queues using folders or database tables. These approaches are still seen today, though they're uncommon in new applications.

My company increasingly understands "platform" to mean "codebase to build your feature into" rather than "API to consume from your own codebase."

IBM has an appliance for you:


It's a middleware appliance that allows any API to talk to any other API and handles just about any data format. You can script it with javascript or XSLT. It can handle ad-hoc things like ftp polling.

It has the added benefit that you can add security for outside facing clients.

Disclaimer: I helped develop this appliance (but I no longer work for IBM)

Pricing: Contact Us.

So basically it will be expensive, require an army of IBM consultants and become yet another integration point instead of actually solving anything.

IBM's business model is totally antiquated and exhausting for modern processes. We have had a nightmare trying to get IBM's MDM solution team to finally admit they were not actually cloud-ready after saying repeatedly that they were. No TLS support for DB2 out of the box for Kubernetes, and the documentation sucks. But contact us for pricing. IBM sucks.

Cries in Webseal.

This is the Enterprise Service Bus concept, right? I remember a pretty good conference talk about how we realized in the mid 2000s that these things are problematic and you probably want services communicating directly over dumb pipes.

It showed an evolution from a monolithic spaghetti codebase, to an SOA, to realizing there are now spaghetti connections between services in the SOA, to a very clean-looking ESB architecture, to showing how all the chaos is still there, just inside the ESB, where it’s even harder to reason about or change.

Another favorite:

You integrate a "service" by creating and linking a library that implements the entire service, including data access. Now try tracking down everything accessing the "service" database, or rolling out an upgrade to the "service".

Call out to a shell? Pass structs between C-based programs using sockets? Use runtime marshalling? Make your "API" be just different arbitrary JSON objects mapping to maps? Hide the entire API behind a lazy cache with undocumented side effects when cache misses occur? I wouldn't consider any of these "API"s although they are interfaces.

> Make your "API" be just different arbitrary JSON objects mapping to maps?

But enough about the web...

Faxes, like in US health care. With staff to handle it at the endpoints.

Was Amazon the first major company to invest fully in SOA for internal use like this?

The critical decision here is not SOA for internal use, but SOA designed for external use, for internal use. This is still uncommon.

I have to ask. Isn't this just microservices ? Also isn't this terrible for latency and debugging ?

"Microservice" is just a new name for something that has had many names throughout history. For as long as we've had networks, we've had small, loosely coupled networked apps communicating with each other. We've also had standards for communicating and for discovering nearby services — at one point it was DCE/RPC, later CORBA or DCOM, or Java RMI. Then people started with REST and Thrift, and these days it's more often gRPC and GraphQL.

I think the core of the Amazon invention was to mandate that the API be a strict boundary around a service. No cheating by making assumptions about the internals of the implementation. Force teams to work off the published API documentation, and make teams the guardians of each service. Microservices probably aren't developed exactly like that in many companies.
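In miniature, that discipline looks like this even within a single process. A Python sketch; the service and all the names are hypothetical:

```python
from abc import ABC, abstractmethod

class InventoryService(ABC):
    """The published contract: consumers import only this interface,
    never the implementing team's internals."""
    @abstractmethod
    def stock_level(self, sku: str) -> int: ...

class WarehouseInventory(InventoryService):
    # Private detail, free to change: a dict today, a database
    # tomorrow, without any consumer noticing.
    def __init__(self):
        self._levels = {"book-123": 7}

    def stock_level(self, sku):
        return self._levels.get(sku, 0)

def can_ship(service: InventoryService, sku: str, qty: int) -> bool:
    # The consuming team codes against the interface alone; no
    # assumptions about how stock levels are stored.
    return service.stock_level(sku) >= qty
```

The Amazon mandate essentially enforced this at the network boundary, with the added teeth that nobody could reach around the interface even if they wanted to.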

> I have to ask. Isn't this just microservices ?

Considering 'microservices' wasn't a well-known term for another decade [1], maybe better to ask: isn't 'microservices' just this ?

[1] https://trends.google.com/trends/explore?date=all&q=microser...

> Isn't this just microservices ?

Yeah it is (though not so "micro-": it's really one service per team, which is the scale that actually makes sense IMO). This is kind of a "seinfeld is unfunny" case: this memo was in 2002, and predated (arguably helped create) most of the modern microservice approach.

> Also isn't this terrible for latency and debugging ?

When two codebases are owned by different teams (and potentially in different languages), it's far better to have them separated by a network connection and a well-defined HTTP API than living in the same address space where they can corrupt each other. Since the interfaces are well defined, you can isolate any problem down to a request that's not getting the correct answer from the service it's calling, and then hand that over to the other team to investigate. Since the request and response are plain text, it's easy to see what's happening.

If responding to a single request from the end customer requires chaining through several independent codebases maintained by different teams, you have bigger problems than request latency. Each team should own a piece of end-to-end functionality, not a horizontal layer.
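The "well-defined, plain-text interface" property is easy to demonstrate end to end with nothing but the standard library. A sketch; the price service and its response shape are made up:

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

class PriceHandler(BaseHTTPRequestHandler):
    """A toy team-owned service: GET /<sku> returns a JSON price."""
    def do_GET(self):
        body = json.dumps({"sku": self.path.strip("/"), "price_cents": 1999})
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body.encode())

    def log_message(self, *args):
        pass  # keep the demo quiet

# Bind to an ephemeral port and serve in the background.
server = HTTPServer(("127.0.0.1", 0), PriceHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# The consuming team sees only the HTTP contract. The request and
# response are plain text, so any problem can be reproduced with curl
# and handed to the owning team as-is.
url = f"http://127.0.0.1:{server.server_port}/book-123"
reply = json.loads(urlopen(url).read())
```

That reproducibility is what makes the hand-off between teams workable: the failing request is the bug report.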

Not only that, but you can also track performance/crash rate/etc of each service individually and over time, which is a really powerful tool to have. Infinitely better than a monolithic system at that scale.

Agree. The "micro" prefix is very misleading. It should be one service per (pizza) team. Otherwise you end up with too many, and it becomes a mess.

This email from 2002 describes system design decisions which led to amazon developing AWS. I wasn't a professional programmer at the time, but as I understand it, I think service-oriented design patterns were rare, let alone microservices as a concept.

Rules are meant to be broken as the saying goes.

This was not exactly a Jeff Bezos mandate but the result of an engineering brainstorm. The mandate came more out of a "how to scale Amazon for the next decade" discussion. In a large company, where distributed/independent teams are as important as distributed systems, this ended up being the only way to operate.

Initially, during the good old days of Amazon, there was what you'd call a single data warehouse. It made sense for every system that processed an order to access the data by querying that data warehouse—this meant that the processes would be distributed (different services), while the data would be centralized. It also meant that any change to the way the data was stored in the data warehouse meant deploying code to a hundred places.

The most important problem this addressed was, however, different. A centralized data warehouse meant that every customer request bubbled up into N queries to the data warehouse (where N is the number of services that needed access to the data—billing, ordering, tracking...).

The mandate summarized in one line would be this—"the data is the one that should go to the services, not the other way round." Voila, microservices.

Ok, we'll take Bezos out of the title above.

Thank you, dang! This is certainly a more accurate title.

Hi! I’m a senior engineer at Amazon. Throwaway account but I’ll try to respond to questions if anyone cares to ask.

Yeah we use services heavily, but there’s plenty of teams dumping data to S3 or using a data lake.

There’s also the “we need to do this but management doesn’t see value so let’s dump it on the intern or SDE 1, who we won’t really mentor or guide and then blame, forcing them to switch teams as soon as they can.”

If you work at another company and think we have our stuff figured out at Amazon, we really don’t. We have brilliant people, many of whom are straight-up assholes who will throw you under the bus. We have people who are kind and will help you gain all kinds of engineering skills. We also have people who are scum-of-the-Earth shit people who work at Amazon because I don’t think any other sane company or workplace would tolerate them. We have extremes on the garbage-people end of the spectrum, unfortunately.

Sorry long rant - point being - it’s good to learn how we do things. The internal email on services is pretty unique. I learned about it when I was an SDE 1 back in the day. But - don’t take it as gospel. It doesn’t mean you need to build services.

I can think of any number of examples where we follow anti-patterns because no one gives a shit about the pattern, whether it’s a service, a bucket, a queue, a file attached to the system used for scrum tasks, or shit passed over email... we care about value at the end of the day. If you don’t provide sufficient value at Amazon’s bar, they have no problem tossing you out the window.

> While the third point makes all the difference in the world, what Amazon really did get right that Google didn’t was an internal communication system designed to make all the rest possible.

> Having teams acting like individual APIs and interacting with one another through interfaces over the network was the catalyst of a series of consequent actions that eventually made possible the realization of AWS in a way that couldn’t have been possible otherwise.

Google has worked this way since time immemorial. That’s what protocol buffers are for: to create services and pass data between them using well defined interfaces.

A protocol buffer is a serialized data object, not a service API. It doesn't (and didn't) prevent anyone from using shared memory, shared database, or shared flat files to communicate.

Also, 2002 is time immemorial. Google was founded in 1998. Protocol buffers were invented in 2001.

Protobufs were invented for Stubby, the RPC layer which is apparently used for absolutely everything inside Google. It's existed since at least 2001, and uses protobufs as the RPC serialization. gRPC is based on Stubby (though not the actual implementation).

Yeah. Google is actually a much better example of this mentality than Amazon is, if I'm reading the thread right. Google Cloud isn't behind AWS because of some service architecture nonsense. It's behind because Google started later, and it started later because for the longest time (I was there) the senior management had the following attitude:

"Why would we sell our cloud platform? We can always make more money and have higher leverage by running our own services on it and monetising with ads; merely selling hardware and software services is a comparatively uninteresting and low margin business."

Selling Google's platform (and it really is a platform) is an obvious idea that occurred to everyone who was there. It didn't happen because of an explicit executive decision, not because Bezos was some kind of savant.

I think Google could have really dominated the cloud space if they'd been a bit more strategic. The problems were all cultural, not technological. For instance they are culturally averse to trusted partnerships of any kind (not just Google of course, that's a tech industry thing). There are only two levels of trust:

- Internal employee, nearly fully trusted.

- External person or firm, assumed to be a highly skilled malicious attacker

There's nothing in between. So if your infrastructure can't handle the most sophisticated attack you can think of, it can't be externalised at all. If it can't scale automatically to a million customers overnight, it can't be externalised at all.

There's really no notion in Google's culture of "maybe we should manually vet companies and give them slightly lower trust levels than our employees in return for money". It's seen as too labour intensive and not scalable enough to be interesting. But it'd have allowed them to dominate cloud technology years earlier than AWS or Azure.

I have to wonder if maybe Google's management weren't right. Why are Google in cloud, a low margin business?

Protocol buffers are both serialized data objects and service APIs.


> Google has worked this way since time immemorial.

Do internal Google services exclusively use the Google Cloud Platform APIs? The implication is that internal Amazon services exclusively use AWS APIs. I've never worked at either company though, so I don't know if it's true. Perhaps someone could clarify.

No, because internal APIs for the same backends are more powerful and easier to integrate with. It's not easy to make the external APIs as useful and powerful as the internal ones: for one, you can trust your users more to do the right thing and not try to exploit you for profit, and it's much less commitment to offer certain functionality, since it's easier to roll it back if the only users are internal. Google is slowly getting there in feature parity, but so much effort had been invested in the Google-internal ecosystem even before GCP existed that, as good as GCP is, nobody wants to bet on it when an obviously superior alternative is available.

> for one, you can trust your users more to do the right thing and not try to exploit you for profit, and it's much less commitment to offer certain functionality, since it's easier to roll it back if the only users are internal

"Organizing into services taught teams not to trust each other in most of the same ways they're not supposed to trust external developers."

- from Steve Yegge's Platforms Rant[1]

1. https://gist.github.com/chitchcock/1281611

Define "nobody." If you check earnings you'll be surprised to see GCP is quickly growing, even if it is a few years behind AWS. For large customers, their deployments are increasingly becoming platform agnostic, meaning they shop for price, not platform. Yegge's post, while relevant to almost every cloud customer at the time, is less relevant to large customers that can avoid lock-in today.

I am talking about people working inside Google, not customers, as should be apparent from context.

> The implication is that internal Amazon services exclusively use AWS APIs.

Untrue. Some even use older versions of AWS "core" services (S3), not the latest and greatest versions of AWS services.

though in some cases I can think of, the "legacy" service is now just a facade over an AWS service.

Lots of places will use APIs between teams, but the APIs aren’t primary. An integration still starts with a document and a kickoff meeting. You might need to post config or even code changes to the platform, which the platform team will review and deploy at their own pace, before the endpoint is usable. Interfaces and schemas have use case specific fields in them. It’s not at all like using an AWS service.

I worked at an organization that had a similar declaration. Here's how it played out:

1. Everyone is super excited for other teams to share their data

2. Everyone wants an exception from sharing their own data because it's too hard or too sensitive to share.

3. Eventually everything gets shared, but it takes 3-4 times longer than it really should.

4. A couple of years later you want to turn off an obsolete interface but you can't, because a handful of systems use it and they don't have budget to change.

True, this happens, but you're still better off than if you had tight coupling. In the absolute worst case you can make a shim implementation to support the obsolete use case. You cannot do this when callers are directly reading your database/memory structures.
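A shim in that worst case can be as thin as an adapter function. A sketch; both the old and new response shapes here are invented:

```python
# The new service keeps its modern API...
def get_order(order_id: int) -> dict:
    return {"id": order_id, "status": "shipped", "total_cents": 4200}

# ...while a thin shim preserves the obsolete interface for the handful
# of callers that don't have budget to migrate yet.
def legacy_get_order(order_id):
    order = get_order(order_id)
    # The old interface returned a positional tuple with dollar amounts.
    return (order["id"], order["status"], order["total_cents"] / 100)
```

The shim is ugly, but it's contained, owned by one team, and deletable the day the last legacy caller goes away—none of which is true of a shared table.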

Yep, I’ve implemented new ticketing systems that have to talk to the ancient ticket system for certain things or vice versa. The ancient system had direct DB connections that had too many down/upstream dependencies and not enough budget or political backing.

Yup and getting anyone to write coherent documentation for their new interfaces is like pulling teeth.

Yes, but you'll at least have the API definition. And you work at the same company, so you can show up at the desks of the team responsible and demand answers. And if it breaks in production you get to page them and they have to wake up and help you! The threat of pages is a great way to coerce decent documentation. It's an important principle at Amazon that if a production service has a dependency on you, then you are also a production service. Another benefit of breaking things up into SOA is monitoring individual services. If your API is returning 500s then it's your fault and your problem (at least until you can root cause to one of your own dependencies that's returning 500s, then you can pass the buck).

IIRC when Yegge accidentally posted that rant, the entirety of Amazon corp got IP banned from Hacker News from _everyone_ rushing to view and comment.

Something like Conway’s Law was also recently cited by Elon Musk (jump to 3:30) https://www.reddit.com/r/SpaceXLounge/comments/dbttaw/everyd...

Direct link to 3:30: https://youtu.be/cIQ36Kt7UVg?t=206

At a previous company, a senior manager took Yegge’s blog post and presented it internally as his own original work.

Hilarity ensued.

Found the original post from Yegge a really interesting and thought provoking read. Didn’t realise from the context that he originally accidentally posted it as a public rather than private Google+ post!

His follow up post explaining this, and with an interesting anecdote about presenting to Jeff Bezos, is archived here (seeing as G+ has, ironically (or not) given the context, shut down): https://gist.github.com/dexterous/1383377#file-the-post-retr...

At least as of 3 years ago when I left, the software systems that drove the mandate towards SOA were still massive systems that communicated almost purely through a monolithic Oracle database. It was the software system(s) that was responsible for all automation and accounting at fulfillment centers. This is one of those rare times where I actually think a full rewrite from scratch would have been a better idea.

I wish the actual body of the email was available and published. I’ve only read Yegge’s account of the note and didn’t see it in any of Bezos’ books.

I suppose it’s nice that the email, or really any amazon emails, has not been leaked.

I could only find his autobiography. What other books has he written?

Brad Stone wrote a Bezos/Amazon bio called “The Everything Store.”

> 6) Anyone who doesn’t do this will be fired.

So I've never worked at a company over 150 people. Is this... a normal thing for an email? Maybe I'm just one of those softies but an email with that line would throw me off my day and cause a serious hit to my morale and confidence of working there.

Elon Musk did this fairly recently at Tesla:

"There are two schools of thought about how information should flow within companies. By far the most common way is chain of command, which means that you always flow communication through your manager. The problem with this approach is that, while it serves to enhance the power of the manager, it fails to serve the company.

Instead of a problem getting solved quickly, where a person in one dept talks to a person in another dept and makes the right thing happen, people are forced to talk to their manager who talks to their manager who talks to the manager in the other dept who talks to someone on his team. Then the info has to flow back the other way again. This is incredibly dumb. Any manager who allows this to happen, let alone encourages it, will soon find themselves working at another company. No kidding."

Information doesn't have to flow through managers specifically, and it certainly doesn't have to flow up the management chain and down again, but there should ideally be preferred funnels for communication, and breaking out of them should be an exception, or pre-arranged, not the norm. Otherwise a particularly useful person on a team will see his time filled with requests coming from many different directions; he then has to manage the priorities and communication around those requests too, and he becomes a bottleneck for the entire organisation.

If you have smart employees respectful of others' time, it shouldn't be a problem. I would tend to agree with Musk on this one. Processes are often put in place to counteract poor hiring. If you need inefficiency-raising processes to defend your business from your own employees, maybe you don't have the right employees.

The problem is if 12 teams are all independently interested in talking to a specific person. Even if that could be condensed into a few meetings, organizing it should not fall on the unlucky employee of interest.

Handling these complexities of scale is exactly what managers are for.

True, but what you're describing is the exception and not the norm. If it is the norm then the issue is that management has failed to hire enough technical documentation writers.

Yes, this is meant to be the exception, the point I wanted to make is that the solution is not tearing down management but fixing it.

Managers can facilitate the establishment of a direct connection. Good ones are routers or switches or whatever the analogy would be!

Heh. Threatening the managers. I love it.

I strongly believe that Steve was exaggerating for effect here. In my 17 years at Amazon I have never seen or heard of a threat of this nature. The overall intent of the email was to tell teams to decouple, decentralize, and to own their own destinies.

An ex-Netflix person, who has since moved to Amazon, spoke at a client site three weeks ago. He casually mentioned things such as "we forgive the first time, and we fire the second time". From how he spoke, we felt that this may be the norm in Silicon Valley and related places.

I have much respect for what he has achieved, so I didn't interrupt to question such a fear-inducing mindset.

With such important services as streaming Seinfeld, it's easy to see why such a scorched earth policy is necessary.

There are compounding productivity boosts available when a team can all trust each other to basically never cut corners or make sloppy mistakes. Removing a tenth team member who is not up to the bar of the rest of the team can make the remaining 9 members each more than 11% more productive.

Of course, this strategy has its downsides. You can't ever hire juniors. You can't really hire people in and train them up at all, because everything has been built under the assumption that only experts will ever touch it. This makes an organization that operates like this inherently parasitic to the industry, only capable of hiring in experienced employees from other companies.

I remember reading that their head of HR was very supportive of this firing policy and she was herself fired: https://www.fastcompany.com/3056662/she-created-netflixs-cul... :)

Thanks for sharing this article.

The part about McCord firing an employee is sad and funny at the same time. To begin with, it looks like a page from Vonnegut’s Player Piano:

“McCord mentioned letting go a product testing employee who “was great,” but eventually lost her job to automation.”

It isn’t mentioned if the employee contributed to the automation that eventually replaced her, but it may as well be so.

Nevertheless, her depiction of the conversation is a rare mix of sad and funny:

“So I called her up. I’m like, what part of this is a surprise? … And she goes, yeah, but, you know, I’ve worked really hard; this is really unfair. I’m like, and you’re crying? She’s like, yeah. I’m like, will you dry your tears and hold your head up and go be from Netflix? You’re the–why do you think you’re the last one here–’cause you’re the best. You’re incredibly good at what you do. We just don’t need you to do it anymore.”

One man's "fear-inducing" is another man's "results-driven"

Yes, IIRC in Steve's original blog post he points out that that particular point was not in Bezos' email and notes he put it in there for dramatic effect.

What I remember from Yegge’s blog post was that the “have a nice day” bullet point was a joke; Bezos doesn’t spend much time worrying about how each employee’s day is going.

Thanks for the reality check. Phew.

Very few managers in large companies can throw around threats like that, and the ones who can realize it just makes them look unhinged like the Queen of Hearts in “Alice in Wonderland.”

Incidentally, a couple of years ago, I was a contractor for Amazon for a very short time. One thing that stood out was that customers would email Bezos directly, or problems would bubble up to him in some other way, and he would respond by forwarding the email to an appropriate email list. Bezos’s email would be just a question mark, but he always got a fast and complete reply. Once people knew he was paying attention, they would do anything to resolve the problem. He didn’t need to formulate an actual question, or even a general “what’s this about?”, just a literal question mark would do the trick.

That is pretty normal, both in the CEO or founder just asking or having to ask a question and the quickness and lengths that people go to to address it.

People don’t actually get credit for addressing the issue. People just get dinged for not addressing the issue quickly.

It is actually a red flag if a CEO or founder doesn’t behave this way since it means that they aren’t showing care and attention or that they are operating at the wrong level.

Not in such language, but it's absolutely the case that large companies can have emails come from the very top that include language like "subject to disciplinary action and possible termination" for failing to do some required thing. Sometimes it's phrased as "corrective action up to and including termination." It's normal in the sense that it shouldn't throw one for a loop (and in my experience the calls to action have more to do with regulation compliance than any how-to-do-your-job mandate), but it's not very frequent. Such language is usually limited to an employee handbook or conduct guide that everyone is supposed to have read and agreed to, which spells out repercussions for such behavior or other misconduct.

This is Yegge's over-the-top style of humor, none of these are actual quotes from the email.

To be fair, "do this or you're fired" is just assumed of every request at Amazon, because they follow through on it often.

In my five-year experience at Amazon that has never been true.

I think the context matters. In this case, 'not doing this' doesn't mean making a mistake, it means ignoring a company-wide imperative, being insubordinate, hubristic, etc.

I acknowledge that it's probably a joke anyway.

I've worked at ~250 person company, a multinational conglomerate you've definitely heard of, and everything in between.

No, that sort of overly forceful yet flippant language is not normal.

I know of one CIO (between 10k and 1k employees) who operates by tell/train/terminate. Tell the staff what to do, if they can’t then train them, if they’ve been trained and can’t/won’t then terminate them.

Isn't that competence?

My intention wasn’t to judge the position, just pass along the data point.

I wish emails like this were sent out in the places I've worked. Been surrounded by people with no initiative to do what's required, dragging everyone else down.

Your parent comment kind of ripped that one out of context by omitting the following statement that is supposed to contrast with the preceding statement in a humorous way (IMHO).

    6) Anyone who doesn’t do this will be fired.
    7) Thank you; have a nice day!
This makes it sound like hyperbole instead, or a quip used to essentially communicate "It's very high priority".

I've worked at several large companies via acquisition.

Yegge has punched things up a little, but I can translate #6 into what I believe was the original text: "This represents the new direction for {company}, and anyone who does not wish to align with this new direction should pursue other opportunities".

The day-to-day employee at a large company becomes numb to executive emails, as 90% of the time they will be countermanded a month later. A sentence like the above is inserted to wake people up and indicate "no, seriously, this means you".

It is quite usual in global consulting organizations. The format is always like this: "X has happened, is happening or may happen. X is bad. Any Y or Z actions identified in our workforce related to X will be subject to termination of employment."

It is corporate language for "this is our first and most important priority and should be noticed and followed by everyone."

Correct me if I'm wrong, but it seems the actual email is neither quoted directly nor linked in the post.

The article mentions this as dog fooding, but does that really apply here? Did they do this with the idea in mind that they'd turn this stuff into a product? It struck me as Bezos wanting things built for the future, reducing technical debt, and the product-ification was an excellent byproduct, but perhaps not intentional.

At the time Amazon was building out their merchant portal as a white-label-ish service for other large retailers to sell products online. The 'customers' in the memo would be other merchants, and the early AWS offering (e.g. SQS) reflect this. "Elastic" clouds weren't really on the menu yet but obviously part of the point is that you can offer it to customers regardless of where the architectural fad goes.

I remember the early days of target.com and (I think?) toysrus.com being thinly skinned versions of Amazon.

Yep. Also one of the large British retailers and a few others.

Yes, if I remember correctly even Borders Books got in on the action there.

Eat your own dogfood.

You can't sell to customers effectively if your flagship product only works because it has access to resources the customers will never have... and it is designed around that flagship's needs and not your customer's needs.

AWS is largely a side-effect of this memo, not its instigator. At the time, Amazon's dog food would be books/clothes/literal dog food.

Yegge's article never says it was an email. What should the title be?

Edit: I've taken a rather lame crack at it and am open to improvements.

The circumstances that Yegge described happened somewhat before my time, but I suppose you could call it an "internal goal" or "internal mandate".

Amazon's not really big on "mandates" in general, but the term seems to fit Yegge's characterization of what happened. "Internal goal" would be another way to phrase it. E.g., "The single most important technical goal in the history of Amazon".


love the new title

A lot of interesting thoughts here but the author doesn't really wrap them into a conclusion. A whole lot of words to say "they all work and it depends".

I love some of Amazon's executive policies. From what I've read, everyone has to write a multi-page paper before executive meetings, and everyone has to read it, so the meeting goes smoothly with everyone understanding the issues. I hate how no one reads anything in most organizations.

Not sure about execs, but this happens in engineering meetings (regarding new features being implemented or other semi-major changes). Whoever is initiating the meeting writes up a paper describing the terminology, the nature of the change and why it's needed, how it will be implemented etc. The entire dev team (+ maybe other dev teams within the group), management (the initiator's boss + 1 level above, maybe other dev team managers too) start the meeting with hard printouts of the paper, armed with red pens. The meeting "starts" with ~15 mins or so of silence for everyone to review the paper in the room from start to finish. Then the paper is reviewed end to end and torn up on the way. Often there are multiple of these meetings (i.e. first one went badly or if things change along the way of building/implementing it and questions come up)

It happens with Jeff, which is why it happens with the execs, which is why it happens with the eng teams.

It doesn't happen at many other companies, even ones stocked to the gills with Amazon refugees.

Yegge has another famous blog post about presenting to Jeff.

That sounds like a low-fi, synchronous, in-person version of reviewing a Google Doc (with comments, suggestions, etc).

I’ve generally found that forcing people to dedicate time to read and discuss a design is more fruitful than a google doc

Alternatively, the people who are scheduled for the meeting are guaranteed to spend 15 minutes reviewing the subject and having their questions answered in a timely fashion. All of the stakeholders are likely in the room, so a decision can actually be made.

It forces real-time, immediate interaction and discussion, rather than dealing with people who have varying levels of understanding and who may or may not have bothered to actually read the whole thing. And it forces all of the comments and suggestions to be hashed out in 30-60 minutes rather than over weeks: people only check the doc maybe once a day if you are lucky, so every back-and-forth takes forever, especially when much of it is spent dealing with the aforementioned gaps in understanding.

A single hour-long in-person meeting can replace a month of back-and-forth online.

Sounds better than the alternative, which I'm all too familiar with.

"Did everyone get a chance to read the doc I added to the calendar entry last night? No? Ok, I'll just review the main points. So for starters, let's go talk about why we're here...."

> what Amazon really did get right that Google didn’t was an internal communication system designed to make all the rest possible.

I'm not following what he means. What is the thing he is describing as "an internal communication system" here? That made all the rest possible? What is/was this internal communications system?

It seems the OP is interpreting the Bezos mandate as if he were talking about the interaction/communication between actual teams (people), as opposed to between each team's software service(s). While there are parallels between these, I'm pretty sure Bezos wasn't saying anything about inter-team communication. It sounds like they already had a good company organizational structure, but there were no limits on how services used each other, which would result in tight, often-hidden coupling that would be too complex to manage.

My guess is this was the thinking behind the mandate, limiting all inter-service coupling to a single, well-documented interface prevents the whole system from devolving into a big ball of mud; primarily because it forces all dependencies to be explicitly declared and documented in the same way.
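The "single, well-documented interface" point above can be made concrete with a toy sketch: consumers depend only on the service's declared contract, never on its internal data store, so the owning team can change internals freely. All names here are illustrative, not anything from Amazon's actual codebase.

```python
class InventoryService:
    """Toy sketch of interface-only coupling: the public methods are the
    whole contract; everything underscore-prefixed is private storage."""

    def __init__(self):
        # Internal data store; its schema can change without notice.
        self._table = {"B000123": 7}

    def get_stock(self, sku: str) -> int:
        # The one supported way to read stock. Internals can later migrate
        # from a dict to a database without breaking any caller.
        return self._table.get(sku, 0)

svc = InventoryService()
# Allowed by the mandate: go through the interface.
print(svc.get_stock("B000123"))  # 7
# Forbidden by the mandate: reading svc._table directly would silently
# couple the caller to this team's storage schema.
```

The explicit-dependency benefit follows for free: to find who depends on the inventory team, you only have to find callers of `get_stock`, not every piece of code that ever touched a table.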

I'm assuming Yegge was referring to the RPC framework.

"an internal communication system" does sound like something like an "RPC framework", but Yegge's paraphrase actually says "It doesn’t matter what technology they use. HTTP, Corba, Pubsub, custom protocols — doesn’t matter. Bezos doesn’t care."

I read this as saying different teams/services don't have to use the same thing either. That doesn't sound like an "RPC framework" or "an internal communications system" at all. It seems to leave the door open to everyone doing things in a diverse mishmash. Which isn't what I'd call "an internal communications system" at all.

But was/is there in fact an Amazon-specific "RPC framework" that all Amazon services use, some consistent framework used consistently across services? I haven't heard much about this before, so I'm curious to learn more: what it's called, who built it, any of it. And the OP doesn't specify either; does the rest of the audience know what's being talked about, and I'm just missing context?

If that is the thing that the OP thinks is really what Amazon got right... then the interesting thing is figuring out how it went from the paraphrased email, which doesn't actually demand such a thing, to.... such a thing. Who designed or chose this "RPC framework"? When? How? How'd they get everyone to use the same one? If that's the thing Amazon got right, there are some steps missing between the Yegge-paraphrased email and there, since the email doesn't actually even call for such a thing.

Or is that not what happened at all, and I'm still not sure what OP means by "an internal communication system" being the thing Amazon got right.

This edict was before my time at Amazon, so I can't speak to whether there was an RPC framework in existence when this was mandated.

By the time I arrived, however, there was a cross-language RPC framework that integrated with Amazon's monitoring, request tracing, and build infrastructure (for building and releasing client versions). It was very full-featured and the de-facto system for creating a service. Most of our communication in my organization was done using this framework, and systems that violated the "only communicate over a service boundary" mandate were real problem children.
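Since the framework described above is unpublished, here is only a rough, hypothetical sketch of what "integrated with monitoring and request tracing" typically means in practice: every call propagates a trace ID and emits latency metrics automatically, so service owners get both for free. All names and signatures are invented.

```python
import time
import uuid

def call_with_tracing(endpoint, payload, trace_id=None):
    """Hypothetical wrapper a full-featured RPC framework might put around
    every call: trace-ID propagation plus automatic latency reporting."""
    trace_id = trace_id or str(uuid.uuid4())  # start a trace if none exists
    start = time.monotonic()
    try:
        # The trace ID rides along with the request so downstream hops
        # can attach their own spans to the same request.
        return endpoint({"trace_id": trace_id, **payload})
    finally:
        latency_ms = (time.monotonic() - start) * 1000.0
        # A real framework would publish this to the monitoring system
        # rather than print it.
        print(f"trace={trace_id} latency={latency_ms:.1f}ms")

def handler(request):
    # A downstream service sees the same trace ID, so one user request can
    # be followed across service boundaries.
    return {"trace_id": request["trace_id"], "ok": True}

result = call_with_tracing(handler, {"sku": "B000123"}, trace_id="t-1")
```

The point of baking this into the framework is that nobody has to remember to do it: if creating a service via the framework is the path of least resistance, monitoring and tracing come along automatically.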

Interesting, people don't talk about this much, although the OP seems to be aware of it and think it was important.

Does anyone know if there's been much written on how this came to be and what it looked like? If not, it would be a useful thing to write about!

Because it does seem like a really important thing; without it, the narrative seems to be that you make a decree like Bezos', and bing bang magic, you get what AWS got. Where in fact, successfully pulling off that RPC framework seems to be really important, and it undoubtedly took a lot of work, good design, and social organizing to get everyone to use it (perhaps by making it the easy answer to Bezos' mandate). But none of that stuff just happens; others have failed where AWS succeeded, and the mandate alone isn't enough.

I think a lot of Amazon's internal tooling is sort of "unpublished" - I've not found a great reference for a lot of the really excellent dev support they had.

The AWS story is particularly interesting because a lot of the internal setup I was doing at the time was on old fashioned metal. There was an internal project called Move to AWS (MAWS) that encouraged using newly-developed integrations with the AWS systems that the public was using.

In other words, AWS lived alongside old-fashioned provisioning practices up until even the early 2010s.

In my experience with much smaller teams (5-50 programmers) the main challenges are three-fold:

1) getting developers to talk about formalizing (to any degree) cross-system responsibility at all rather than just hanging code where it's easiest at-hand;

2) getting developers to think about external inputs/outputs at all and everything that entails, e.g. namespacing, versioning, forwards and backwards compatibility, validation, access control, ...

3) teaching developers to pick the right kind of interface for their data and processing, i.e. more or less queue vs. pubsub vs. RPC vs. REST vs. query language; of the three this is the only "technical" one, the others will be harder sells.

Once you have done these, the question of _which_ queue/etc. you take is largely irrelevant; there will be some natural pressure to standardize and even "unnatural" pressure from CTOs/management will be straightforward to implement.
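Point 2 in the list above (namespacing, versioning, compatibility, validation) is the one that bites teams who have never exposed an interface before. A minimal sketch of what it entails, with illustrative field names:

```python
# Toy sketch of an externalizable request handler: explicit versioning,
# input validation, and backwards compatibility all declared up front.

SUPPORTED_VERSIONS = {1, 2}

def handle_request(request: dict) -> dict:
    # Version negotiation: reject unknown versions with a useful error
    # instead of guessing.
    version = request.get("version")
    if version not in SUPPORTED_VERSIONS:
        return {"error": "unsupported version",
                "supported": sorted(SUPPORTED_VERSIONS)}
    # Validation: never trust external input.
    if not isinstance(request.get("item_id"), str):
        return {"error": "item_id must be a string"}
    # Backwards compatibility: v1 used a legacy field name; v2 renamed it.
    # Both stay supported until v1 is formally retired.
    qty_field = "qty" if version == 1 else "quantity"
    return {"item_id": request["item_id"], qty_field: 1}

print(handle_request({"version": 1, "item_id": "A1"}))
print(handle_request({"version": 3, "item_id": "A1"}))
```

None of this is hard individually; the hard sell is getting developers to accept that every one of these decisions is now a commitment to external callers.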


The issue if you develop to such requirements is that the product will end up quite expensive. A simple messaging or authentication feature becomes a fully fledged multi-tier service, maybe with super admins, owners/admins and clients. Dev budget is not an issue for Amazon, though...

A few things I'd add today:

* Every service must provide latency and error-rate metrics.

* Every service must be capable of generating and/or responding to backpressure when things become overloaded.

* Every service must be prepared to support multitenancy.
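The backpressure bullet above can be sketched simply: a service with a bounded work queue signals "overloaded" rather than accepting work it can't finish (an HTTP service would return 429 or 503 here). This is a toy illustration with invented names, not any particular framework's API.

```python
from collections import deque

class BoundedService:
    """Toy sketch of generating backpressure: shed load with an explicit
    signal instead of queueing requests without bound."""

    def __init__(self, max_pending):
        self.pending = deque()
        self.max_pending = max_pending

    def submit(self, job):
        if len(self.pending) >= self.max_pending:
            return "overloaded"  # caller should back off and retry later
        self.pending.append(job)
        return "accepted"

svc = BoundedService(max_pending=2)
print([svc.submit(i) for i in range(3)])  # ['accepted', 'accepted', 'overloaded']
```

The other half of the contract is the responding side: a well-behaved caller treats "overloaded" as a cue to retry with backoff, not to hammer the service harder.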

The thing to point out is Bezos is a real techie, and while any business guy would have built amazon on top of msft or google cloud, the fact that he knows about infrastructure made it possible for Amazon to build AWS

Reading Bezos' mandate email puts a smile across my face, every time.

Why does the title of this keep getting flopped around? It's shifted three or four times today. I thought it was supposed to be the title, or the subtitle, and avoid paraphrasing.

> 6) Anyone who doesn’t do this will be fired.

I would have so much loved this approach in the last corporate job I had. It would have changed so many things in such a short time...

Here's an article about how the idea of AWS came about. The main takeaway is that it evolved, and the article has a lot of 'we' in it, not only 'jeff'.


It’s such a loss that Yegge doesn’t blog anymore.

> doesn't matter what technology they use. HTTP, Corba, Pubsub, custom protocols

So a JDBC interface and a published schema would count?

>> Anyone who doesn’t do this will be fired

Right, motivating everyone.. check..

Do other (Silicon Valley) companies do the same?

is this trying to be Stratechery in format?

Author should ctrl-f for the many erroneous double spaces. </ocd>

Macbook keyboard maybe?

We have robotic baristas here in SF, but no one uses them. Why? People want to have their food prepared and served by a real human being, in most cases. The food tastes better when it's served to you by a real person.

I believe you replied to the wrong story, unless robot baristas serving you coffee was meant to be a metaphor for using APIs for programmatic communication vs. SFTPing csv files, or something.

If you use an adblocker like uBlock Origin, you can add the following rule: news.ycombinator.com##.pagetop

Unfortunately it removes ALL of the top navbar but I've found it really useful to get around HN's damaging and useless gamification metric.

