
On Internal Engineering Practices at Amazon - wheresvic1
https://jatins.gitlab.io/me/amazon-internal-tools/
======
anon20190326
Ex-Amazon SDE here.

A lot of this has changed.

First, there is a movement to build a lot of services in Native AWS instead of
MAWS/Apollo.

Apollo doesn't require copying configs anymore; you can have the config exist
as part of the package you are deploying. Generally, that's a best practice.

Pipelines can be configured as code too.

There is a centralized log service which requires onboarding. It does require
some command-line tools, but it works. The logs get stored in S3, IIRC.
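
If the logs do land in S3, pulling a slice of them down is standard boto3
fare. A minimal sketch, assuming a hypothetical bucket and prefix (I don't
remember the internal layout):

    import boto3

    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    # Bucket and prefix are hypothetical stand-ins for the internal layout.
    for page in paginator.paginate(Bucket="my-team-service-logs", Prefix="2019/03/26/"):
        for obj in page.get("Contents", []):
            print(obj["Key"], obj["Size"])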

If containers suit your needs, you'd be hard-pressed to find someone telling
you not to use them. Generally, though, you would want to use bare metal for
Amazon's scale.

There is also a change to how NPM is being used at Amazon. It was a lot easier
towards the end of my tenure, and was probably as close to the standard npm
workflow as it would get given Amazon's build systems.

Amazonians are generally conservative and don't use the latest and greatest
unless it solves an actual customer need. Customer Obsession is still _the_
defining leadership principle.

~~~
m0zg
> you would want to use bare metal for Amazon's scale

Containers are "bare metal".

~~~
jhayward
> Containers are "bare metal"

Please explain this use of jargon. Are you saying that a Docker container is
"bare metal" in some culture? A VM is bare metal? What do you mean by
container, and bare metal?

~~~
m0zg
Containers are just namespaces for things within the Linux kernel. Unlike with
VMs, you're not running separate instances of the OS: it's all run by just one
kernel instance, and that kernel usually runs directly on the hardware, that
is, on "bare metal". That "ubuntu" base image you can spin up does not
actually run the Ubuntu kernel. As a result, bare-metal containers incur none
of the "virtualization penalty" that VMs do.

Cloud is, in fact, abnormal in terms of how containers are used in that it
wraps them into VMs. That's in part because VMs were already there and they
simplify provisioning of something that looks like "machines" from the
outside, and in part because containers do not offer much in the way of
security guarantees (precisely because all containers running on the same host
share the kernel), whereas VMs do.

~~~
jhayward
Ah, you're using it to mean "no virtualization". Got it, thanks.

I tend to use the term to mean single-tenant, user-supplied OS, non-
virtualized myself but I see where you're coming from.

~~~
richardwhiuk
All of that may be true as well - if the user picks the underlying OS (i.e.
kernel) to run the container engine and installs on bare metal, then literally
everything you wrote applies.

------
jcrites
Based on my experience, the article contains a lot of misinformation. Some of
the statements might have been true at one point in the past, but are now out
of date by years, while others have never been true in the time I've been
around.

Without getting into a point-by-point rebuttal, my reaction to each
section/Exhibit is "that's wrong/misleading".

~~~
throwaway504
Isn't that a bit disingenuous? Your role at Amazon is nothing like the typical
engineer at Amazon. You live in the shiny new world while the majority of
engineers are stuck on something that's not too far from the article.

~~~
ecarrlee
I was an intern at AWS this past summer (and will be returning as a new grad
in May) and this article is not accurate based on my experience. My code
(which was part of our team's actual production system, not a toy project)
ran on EC2 instances in an autoscaling group. There was no manual provisioning
of servers. I worked with a Kinesis Stream and several other resources, and
there was a way for us to programmatically retrieve the names of those
resources. My impression was that the system we used to find a resource name
works for all AWS resources, so I imagine you could get the name of a load
balancer this way as well.

Finally, I'll just say my team worked entirely in Python. Maybe that was an
exception, but it is certainly sufficient to directly falsify the statement
that "internally, at Amazon, you can't even use a language other than Java."
(How would the author even know this? Was he in on the S-team meeting where it
was declared that all non-Java users get a PIP?)
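
Back on the resource-name point: the internal lookup system isn't something I
can describe, but a rough public-AWS analogue is CloudFormation, which maps a
template's logical names to physical resource names at runtime. A minimal
sketch (the stack name is made up):

    import boto3

    cfn = boto3.client("cloudformation")
    # List each resource's runtime name for a hypothetical stack.
    for r in cfn.describe_stack_resources(StackName="my-service-stack")["StackResources"]:
        # e.g. "OrderStream -> my-service-stack-OrderStream-1A2B3C" for a Kinesis stream
        print(r["LogicalResourceId"], "->", r["PhysicalResourceId"])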

~~~
tanilama
The Java claim is just factually false.

Java/C++/Python/Ruby/Go/Perl/Scala, even Rust, are used at Amazon for various
projects. What might be true is that the more critical your service is, the
more people will lean towards conservative/mature/enterprise-ish languages
rather than exotic ones, because exotic inherently means risk. And Java is the
go-to option under that consideration, just because it is Java: the least
controversial choice, if you ask me.

But compared with the other companies I've worked at, except extremely small
startups where tech guidance is basically non-existent, Amazon is the least
restrictive when it comes to languages. It fits Amazon's self-perception:
pragmatic, non-opinionated, morally agnostic. It cares about the customer,
because the customer means potential for eventual profit. However you achieve
that is not Amazon's concern.

~~~
throwaway504
The L8 manager of our org declared that all projects must be on the JVM, and
if the language wasn't Java, then we would need to submit a six-pager
explaining why. Perhaps he was just being conservative, but to me there's
something off-putting when those types of decisions come from managers rather
than from the engineers.

That said, our org did manage to have a few projects that aren't Java, so I
guess I would have to concede that the Java claim is factually false as well.
But it sure doesn't feel like I am empowered to choose something other than
Java.

------
lewisjoe
I worked at a startup and then switched to a bigger product company: Zoho.

If there's one thing I'd take away for the rest of my career from Zoho, it
would be frugality in adopting the latest of tech.

When NoSQL was all the rage, the company stood firm that relational databases
had a rock-solid mathematical foundation and stayed away from the bandwagon.
It paid off.

When every other company wrote blogs about rewriting their software in NodeJS,
Ruby & Python, the company stood its ground with statically typed languages.
It paid off.

My own team, Zoho Writer, has a strong policy against incorporating third-party
libraries without good reason. Thanks to this, even though the product is
nearly a decade old, the JS size has remained surprisingly small all through
its evolution.

I believe staying frugal in adopting the latest hype can only be reasoned
about in hindsight.

~~~
geggam
I have shared this blog posting more than I care to admit.

[https://mcfunley.com/choose-boring-technology](https://mcfunley.com/choose-boring-technology)

It amazes me that people will put their companies and employees at risk by
using the new cool stuff just because everyone else is (looking at you, k8s).

~~~
mmartinson
Definitely, but I'm sure that adopting new technologies at the right time can
be a huge competitive edge for startups. Sure, it can fail horribly in places
where hindsight will show that the trusted stack was the right choice, but it
could also mean winning the race to market and continued relevance for a
company if it works out.

~~~
geggam
Startups are complete risk and should jump on the new cool and shiny toy ship.

When you start making money or wish to start making money you need to
stabilize the insanity.

I have done both.

~~~
blub
Something that's "complete risk" should take even more risk? Doesn't seem
logical.

~~~
geggam
So when you're swimming, do you hold one leg in the air, or do you keep
everything in the water so you can swim faster?

------
ilovecaching
One misconception people have, from my perspective, about most FAANG companies
is that you get to work with new and shiny things, especially new languages.
This is really more applicable to startups, where risk-taking is in the DNA;
at a big company you will mostly get really strong pushback, because there is
just too much effort involved in supporting more than two or three languages
at scale. There are normally niche languages, but they are essentially
statistical anomalies and are usually born out of a real business need (like
Swift for iOS). Engineering for engineering's sake is also mostly going to be
frowned upon unless it really helps the business.

I do agree that Amazon is the worst in regards to OSS. They really need to fix
that, even if just for PR, because they are consuming so much of it for AWS.

~~~
umanwizard
Very good comment, +1.

People don't really appreciate the rabbit hole you have to go down to
introduce a new technology at a large company. Just off the top of my head, a
few of the issues:

* Not everyone has followed the broader technology world outside the big company. It's totally possible to imagine a situation where nobody in your management chain or team has even heard of something relatively mainstream like Docker.

* Your company uses custom build, deployment, and dependency management systems. Someone will have to do the work to implement support for the new technology in these.

* If the new technology has its own opinionated ideas about how to do any of the above, like Rust crates, npm, pip, etc., forget about it. You need to interact with previously existing internal code which means you need to fit into the already existing build/deployment/dependency solution.

* Your company has a bunch of custom internal services. You need to create bindings for these services' APIs in the new language.

* Your company might be doing some esoteric stuff that the new technology has no support for, like for example if your network stack does HTTP in some sort of custom-tuned way for performance, and the new language's HTTP library only supports a subset of everything that's possible.

* The new technology may have correctness or performance issues that have never been discovered because it has never been used on many thousands of servers where a 1% difference in CPU makes a huge difference, or it has never been used with a codebase big enough to take >1 day to build, or with binaries larger than several GB, etc.

And the biggest one...

* Every day, many new people are joining the company and beginning to ramp up on the (internally) mainstream solutions that have institutional inertia. This will dwarf the speed at which you can convince people to switch to the new technology.

The pattern I have seen is that something starts off being used for one-off
scripts that don't have a lot of dependencies on the existing gargantuan
infrastructure, then very, very gradually gains mindshare internally and
becomes more supported.

~~~
Frost1x
"...or it has never been used with a codebase big enough to take >1 day to
build..."

I've not worked in enterprise environments at this scale, so I find this
tidbit fascinating. Why does the codebase take over a day to build? Does that
time include rebuilding and deploying the underlying infrastructure?

I'm genuinely interested. I can only imagine a few highly niche cases myself,
so I'm curious what I'm missing here.

~~~
umanwizard
Hard to find updated publicly-available information on this, but it was
claimed that in 2003, Windows took 12 hours to build on a multi-machine build
farm: [https://www.itprotoday.com/windows-server/supersite-flashback-nt-s-first-decade](https://www.itprotoday.com/windows-server/supersite-flashback-nt-s-first-decade)

------
vonseel
_As far as I know, Node.js was used inside in a limited capacity, but they had
an alternative for npm for security reasons, and you had to get an npm package
approved to use it internally._

FWIW, this is a _very good thing_. A company as large as Amazon should do this
with all of their repositories; even a small start-up should be doing this to
mitigate suspicious packages / third-party code vulnerabilities.

~~~
jrockway
I worked on third-party package approvals at Google. The reasoning behind
reviews was largely license compliance. If the license said "you have
to display this license to end-users" then we had to make sure that the
license was machine-readable and would be automatically bundled into the build
to be displayed in that "open source licenses" section of pretty much every
app ever. If the license said "by linking this into your code, you have to
open-source all code at your company", we had to deny it. That sort of thing.

We suggested that people get security reviews, but it was up to the user of
the package to figure out whether or not that was necessary. Often a security
review would be blocking the project's launch, and so it would be done at that
time.

The final thing we enforced was a "one version" policy. If everyone was using
foobar-1.0 and you wanted to use foobar-2.0, it was on you to update everyone
to foobar-2.0. This was the policy that people hated the most, but it was
basically mandatory at the time because none of the languages widely used at
Google supported versioned symbols. Having library A depend on foobar-1.0 and
library B depend on foobar-2.0 meant that application C could not depend
transitively on library A and library B at the same time, which would cause
many disasters.
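
To make the failure mode concrete, here's a tiny sketch; the vendored
directories are hypothetical, and Python stands in for any language without
versioned symbols:

    import sys

    # Library A was built against foobar 1.0 and library B against foobar 2.0,
    # but a process has a single module namespace: "import foobar" resolves to
    # whichever copy comes first on sys.path.
    sys.path[:0] = ["vendor/foobar-1.0", "vendor/foobar-2.0"]

    import foobar  # library B now silently runs against foobar 1.0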

~~~
nerdponx
This all sounds like justified inconvenience in the name of safety at scale.
As long as you have a good relationship with the package approval guys this is
a fine workflow in my experience.

~~~
jrockway
Yeah, I thought it was fine (I was both a reviewer and a user).

I mostly posted to provide some contrast to the folks that are saying things
like "there are too many open-source libraries and they need to be approved to
make sure there aren't any security problems", which is not what we did. We
did not attempt to limit the use of libraries, nor did we vouch for the
security-readiness of packages before they were allowed to be checked in to
source control. If you think there are too many npm libraries and are looking
for an example of a big tech company saying "no more!", this is not it. Use
all the libraries!

------
raiflip
Don't want to go into too much detail, but this article is like taking the
crappiest parts of the crappiest systems and declaring them representative of
the entire product. There is a lot of really good internal tooling not
mentioned here, and for the internal tooling that is mentioned (like Apollo),
absolutely none of its benefits are covered.

------
tanilama
Well, this article gets some things right, but it gets a lot of stuff wrong as
well. What can be confirmed is that the author's access to the Amazon tech
scene was limited, and he makes a sweeping generalization that this is how the
whole of Amazon works.

Disclaimer: Ex-Amazonian, left like one year ago.

~~~
coreyoconnor
I agree.

Disclaimer: Also ex-amazonian, left like one year ago.

~~~
jjoonathan
What happened like one year ago?

~~~
gouggoug
Well, like 2 amazonians left.

------
kerng
I like how Amazon has an MAWS movement internally, meaning "Move to AWS". I
think most people assume that they mostly use AWS, but they don't.

It's an interesting look behind the scenes at Amazon and how antiquated their
operations appear. Makes you wonder if Azure and Google have pretty good
chances of beating them down the road.

Edit: Interesting, further down one person commented that Amazon doesn't use
AWS broadly because it's seen as not secure enough for certain workloads.

~~~
jcrites
Based on my experience, that information and some of the comments about it in
the thread are out of date or inaccurate.

'Move to AWS' was a program focused on accelerating AWS adoption that was
primarily active something like 5-7 years ago. The program achieved its goals
and concluded: virtually all infrastructure was running on AWS. I worked on
the program for part of that time, in the last couple of years it was active.
Amazon's migration to AWS was covered in a 2012 presentation at AWS
re:Invent: "Drinking Our Own Champagne: Amazon's Migration to AWS" [1].

Some more recent efforts around AWS usage were covered in a 2016 talk: "How
Amazon.com Uses AWS Management Tools" [2] (which references the earlier talk
and discusses some of the changes since then). There are ongoing projects to
improve and optimize usage of AWS, as well as to adopt some of the newer
services.

[1]
[https://www.youtube.com/watch?v=f45Uo5rw6YY](https://www.youtube.com/watch?v=f45Uo5rw6YY)
[2]
[https://www.youtube.com/watch?v=IBvsizhKtFk&t=13m20s](https://www.youtube.com/watch?v=IBvsizhKtFk&t=13m20s)

~~~
mrep
Until they move off of Sable, their NoSQL backend for retail, calling them
mostly on AWS is laughable, especially considering their major Prime Day
outage this last year was caused by Sable not being able to dynamically scale
up [0].

[0]: [https://www.cnbc.com/2018/07/19/amazon-internal-documents-what-caused-prime-day-crash-company-scramble.html](https://www.cnbc.com/2018/07/19/amazon-internal-documents-what-caused-prime-day-crash-company-scramble.html)

~~~
amzn-throw
Depends on what you mean by mostly. Almost all code runs on EC2. All object
storage is in S3. Almost all asynchronous service interactions are decoupled
via SQS. Almost all notifications are shared via SNS.
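
For anyone unfamiliar with the decoupling pattern, a hedged boto3 sketch
(queue URL and message shape are made up):

    import boto3

    sqs = boto3.client("sqs")
    queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/orders"  # hypothetical

    # Producer: enqueue work instead of calling the downstream service directly.
    sqs.send_message(QueueUrl=queue_url, MessageBody='{"order_id": "42"}')

    # Consumer: long-poll, process, then delete. An unprocessed message simply
    # reappears after the visibility timeout, which is what decouples the two sides.
    resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=1, WaitTimeSeconds=10)
    for msg in resp.get("Messages", []):
        print(msg["Body"])  # stand-in for real processing
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])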

Yes, most legacy systems use Sable but most new development for the last 2
years uses DynamoDB.

More and more event-driven applications with highly variable throughput are on
Lambda. Even on AWS service teams.

The Sable outage had a particularly laughable interpretation by the armchair
quarterbacks at CNBC. The kind of scale involved makes Oracle-based
approaches completely infeasible. (The internal Correction of Error document
leaked and was wildly misunderstood.)

And the scaling timelines involved are so compressed that no company in their
right mind "dynamically scales up" as a result - it's always projected,
scheduled scaling. There were other cascading effects.

I know it's fun to dunk on Amazon, but their commitment to operational
excellence is unparalleled, as public AWS post-mortems after major events
should reveal.

If you read the entire Correction of Error document on the Sable outage you'd
agree. Of course, CNBC would never publish that; one gets more clicks by
getting a professor out of touch with the realities of production software
engineering to blurb some juicy quotes.

------
ex_amazon_sde
Ex-Amazon SDE checking in. The article is quite misleading.

The author confuses "shiny" with "good".

Amazon does package-based deployments because it scales well, allows engineers
from many different teams to work on packages, and provides fast security
updates.

Amazon used VMs more than a decade before container engines existed, and the
latter still lack security and stability.

Having worked in many companies, I would take Amazon's engineering practices
over the modern shiny devops tool ecosystem every day.

I agree that Apollo is slow (due to the implementation) and has an ugly UI,
and that the company has a very poor track record of contributing to OSS.

------
cimi_
Context: I spent five years as an engineer at Amazon, the last two as a tech
lead on an internal developer tool (think SaaS for performance engineering).

This article is not untrue, but it misses the fact that teams are empowered to
own their solutions and are not restricted in _how they set up their
environments and which tools they use_. It's true that fixing these problems
feels like wasted effort, but it's by design: Amazon operates as many separate
internal entities, and I think replication of effort is an acknowledged
downside of operating this way.

> 1. Deployments
> Their internal deployment tool at Amazon is Apollo, and it doesn't support auto-scaling.

I had to manually scale up my service once in two years, and we weren't over-
provisioning wastefully. Before I left, my product was supporting 40K+
internal applications with an infra + AWS cost of under $2k/month.

We had good CI with deep integration with Apollo, you could track any change
across the pipeline, we had reproducible builds and we had a comprehensive
deployment log listing all changes.

Apollo is sloooooow though and the UI is very 90s.

> 2. Logs
> Any self respecting company running software on distributed machines should have centralized, searchable logs for their services.

We were using Elasticsearch, Logstash, and Kibana, powered by AWS
ElasticSearch. I wrote a thin wrapper around Logstash that was used in over 1K
environments internally, so we weren't the only ones doing this.
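
Querying it was the usual Elasticsearch fare, roughly like this (the endpoint
and index pattern are made up, and the exact call shape depends on your
elasticsearch-py version):

    from elasticsearch import Elasticsearch

    es = Elasticsearch("https://my-team-logs.us-east-1.es.amazonaws.com")  # hypothetical domain
    resp = es.search(index="logs-*", body={
        "query": {"match": {"message": "ERROR"}},
        "sort": [{"@timestamp": {"order": "desc"}}],
        "size": 20,
    })
    for hit in resp["hits"]["hits"]:
        print(hit["_source"]["@timestamp"], hit["_source"]["message"])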

> 3. Service Discovery
> What service discovery? We used to hard wire load balancer host names in config files.

Agree with this one. I will never forget the quality time I spent configuring
those load balancers and ticketing people about DNS.

> 4. Containers

As other commenters mentioned, if you want to use containers, you're free to
bypass all of this and run your service in AWS, where you can use ECR, EKS,
etc.

> (As far as I know, Node.js was used inside in a limited capacity, but they had an alternative for npm for security reasons, and you had to get an npm package approved to use it internally.)

I built my UI from scratch using create-react-app and yarn offline builds (no
mystery meat) and I bypassed all the internal JS tooling, which I thought was
very poor. This was changing though.

Finally, my personal anecdote: you could onboard our product in less than an
hour (including reading docs), it required no further maintenance and gave you
performance stats for free. So not all was bad :)

~~~
cimi_
> They have some amount of Rails, and JavaScript has to be there, but if you want to experiment with, say, Go, Kotlin, or anything else, you are going to get nothing but push back.

I missed this - starting in 2018 we were writing all our backend logic in
Kotlin and we got no push back from anyone.

------
throwaway1280
Ex-Amazon engineer of several years here.

This is a pretty interesting article, but it's important to know that Amazon's
internal tooling changes pretty fast, even if it's mostly several years behind
state-of-the-art.

Exhibit A: Apollo

Apollo used to be _insane_. It was designed for the use case of deploying
changes to thousands of C++ CGI servers on thousands of website hosts,
worrying about compiling for different architectures, supporting special
fleets with overrides to certain shared libraries, etc etc. It had an entire
glossary of strange terms which you needed to know in order to operate it.
Deployments to our global fleet involved clicking through tens of pages, copy-
and-pasting info from page to page, duplicating actions left, right, and
centre, and hoping that you didn't forget something.

When I left, most of that had been swept away and replaced with a continuous
deployment tool. Do a bit of setup, commit your code to the internal Git repo,
watch it be picked up, automated tests run, then deployments created to each
fleet. Monitoring tools automatically rolled back deploys if certain key
metrics changed.

Auto scaling became a reality too, once the Move to AWS project completed. You
still needed budgetary approval to up your maximum number of servers (because
for our team you were talking thousands of servers per region!), but you could
keep them in reserve and only deploy them as needed.
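
In public-AWS terms, the equivalent knob is the Auto Scaling group's MaxSize.
A sketch with invented names and numbers:

    import boto3

    asg = boto3.client("autoscaling")
    # MaxSize is what needed budgetary sign-off; the gap between DesiredCapacity
    # and MaxSize is the approved "reserve" that scaling can deploy into.
    asg.update_auto_scaling_group(
        AutoScalingGroupName="my-service-fleet",  # hypothetical
        MinSize=100,
        DesiredCapacity=120,
        MaxSize=2000,
    )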

Manually copying Apollo config for environment setup was still kind of a thing
though. The ideas of CloudFormation hadn't quite filtered down yet.

Exhibit B: logs

My memory's a bit hazy on this one. There certainly was a lot of centralized
logging and monitoring infrastructure. Pretty sure that logs got pulled to a
central, searchable repository after they'd existed on the hosts for a small
amount of time. But, yes, for real-time viewing you'd definitely be looking at
using a tool to open a bunch of terminals.

The monitoring tools got a huge revamp about halfway through my tenure,
gaining interactive dashboarding and metrics drill-down features which were
invaluable when on-call. I'm currently implementing a monitoring system, so my
appreciation for just how well that system worked is pretty high!

Exhibit C: service discovery

Amusingly, a centralized service discovery tool was one of the tools that used
to exist, and had fallen into disrepair by the time this person was working
there.

This was a common pattern in Amazon. Contrary to the 'Amazon doesn't
experiment' conclusion, Amazon had a tendency to experiment too well - the
Next Big Thing was constantly being released in beta, adopted by a small
number of early adopters, and then disappearing for lack of
funding/maintenance/headcount.

I can't think of any time I hard-wired load balancer host names, though.
Usually they would be set up in DNS. We used to have some custom tooling to
discover our webserver hosts and automatically add/remove them from load
balancers, but that was made obsolete by the auto-scaling / continuous
deployment system years before I left.

As for the question of "can we shut this down? who uses it?" - ha, yes, I
seem to remember having that issue. I think that, before my time, it wasn't
really a problem: to call a service you needed to consume its client library,
so you could just look in the package manager to see which services declared
that as a dependency. With the move to HTTP services that got lost. It was
somewhat mitigated over the years by services moving to a fully authenticated
model, with client services needing to register for access tokens to call
their dependencies, but that was still a work in progress a few years ago.

Exhibit D: containers

Almost everything in Amazon ran on a one-host-per-service model, with the
packages present on the host dictated by Apollo's dependency resolution
mechanism, so containers weren't needed to isolate multiple programs'
dependencies on the same host.

Screwups caused by different system binaries and libraries on different
generations of host were a thing, though, and were particularly unpleasant to
diagnose. Again, that mostly went away once AWS was a thing and we didn't need
to hold onto our hard-won bare-metal servers.

'Amazon Does Not Experiment'

Amazon doesn't really do open source very well. The company is dominated by
_extremely_ twitchy lawyers. For instance, my original employment contract
stated that I could not talk about any of the technology I used at my job -
including which programming languages I used! Unsurprisingly, nobody paid
attention to that. That meant that for many years, the company gladly consumed
open source, but any question of contributing back was practically off the
table as it might have risked exposing which open source projects were used
internally.

A small group of very motivated engineers, backed up by a lot of open-source-
friendly employees, gradually changed that over the years. My first ever
Amazon open source contribution took over a year to be approved. The ones I
made after that were more on the order of a week.

Other companies might regard open sourcing entire projects as good PR, but
Amazon doesn't particularly seem to see it that way. Thus, it's not given much
in the way of funding or headcount. AWS is the obvious exception, but that's
because AWS's open source libraries allow people to spend more money on AWS.

Instead, engineers within Amazon are pushed to generate ideas and either
patent them, or make them into AWS services. The latter is good PR _and_
money.

As for different languages: it really depends on the team. I know a team that
happily experimented with languages, including functional programming. But
part of the reason for the pushback is that a) Amazon has incredibly high
engineer turnover, both due to expansion and due to burnout, so you need to
choose a language that new engineers can learn in a hurry, and b) you need to
be prepared for your project to be taken over by another team, so it had
better be written in something simple. So you had better have a very good
justification if you want to choose something non-standard.

Overall, Amazon is a pretty weird place to work as an engineer.

I would definitely not recommend it to anybody whose primary motivation was to
work on the newest, shiniest technologies and tooling!

On the other hand, the opportunities within Amazon to work at massive scale
are pretty great.

One of the 'fun' consequences of Amazon's massive scale is the "we have
special problems" issue. At Amazon's scale, things genuinely start breaking in
weird ways. For instance, Amazon pushed so much traffic through its internal
load balancers that it started running into LB software scaling issues, to the
point where eventually they gave up and began developing their own load
balancers! Similarly, source control systems and documentation repositories
kept being introduced, becoming overloaded, then replaced with something more
performant.

But the problem is that "we have special problems" starts to become the
default assumption, and Not Invented Here starts to creep in. Teams either
don't bother searching for external software that can do what they need, or
dismiss suggestions with "yeah, that won't work at Amazon scale". And because
Amazon is so huge, there isn't even a lot of weight given to figuring out how
other Amazon teams have solved the same problem.

So you end up with each team reinventing their own particular wheel, hundreds
of engineer-hours being logged building, debugging and maintaining that wheel,
and burned-out engineers leaving after spending several years in a software
parallel universe without any knowledge of the current industry state-of-the-
art.

I'm one of them. I'm just teaching myself Docker at the moment. It's pretty
great.

~~~
throwaway1280
Speaking of twitchy lawyers and Move to AWS... one of the weirdest things we
had to deal with inside Amazon was that, for many years after AWS launched, we
weren't allowed to use it because it "wasn't secure enough".

Given that we were actively shopping it around to major financial institutions
at the time, doesn't that strike you as particularly hypocritical? :)

~~~
Hamuko
So wait, when I need to convince customers why AWS is secure for their data, I
can't say "It's good enough for Amazon!"?

~~~
jtoberon
To clarify: under the shared responsibility model, an AWS customer is
responsible for the security of their own systems, including how they use AWS
tools, and in this respect Amazon is no different from any other AWS customer.

------
alpb
Someone should probably add (2018) to this post as it's from May 2018.

~~~
wglb
Email hn@ycombinator.com and they will take care of it.

------
throwaway772643
I will be joining Amazon in about a month.

Is there _any_ chance I'll be able to work on OSS and/or "modern" tech (e.g.
containers, Go, etc.) without a ton of push-back?

It also seems Amazon is obsessed with reinventing wheels and keeping their
stuff internal, which is worrying. Is there any chance to introduce solid OSS
tools to the development process? (whatever they might be)

~~~
discodave
AWS SDE here.

The short answer is that in order to get your team to adopt something, you
need to make the case that it's better for customers (including accounting for
things like migration costs). If the modern thing is more efficient, offers
higher availability, increases velocity, and so on, then the case can be made.

Some specific examples based on things you cite:

* For an example of something OSS or "modern" coming from AWS, checkout Firecracker (written in Rust): [https://firecracker-microvm.github.io/](https://firecracker-microvm.github.io/)

* With regards to "reinventing wheels" Apollo + EC2 solves a lot (not all) of the problems that containers solve, and existed for years before containers became the hotness.

* Docker, which brought containers to the masses, launched in 2013.

* EC2 launched in 2006 (7 years before Docker).

* Apollo (and the build system Brazil) predated EC2 by many years.

* Amazon.com was migrating to EC2/AWS before 2012 ([https://www.youtube.com/watch?v=f45Uo5rw6YY](https://www.youtube.com/watch?v=f45Uo5rw6YY))

* Another example, Lambda, which launched in 2014 runs on EC2 ([https://www.youtube.com/watch?v=QdzV04T_kec&t=1611s](https://www.youtube.com/watch?v=QdzV04T_kec&t=1611s)).

* New services get to build in AWS and use Lambda, ECS, DynamoDB etc based on their business needs.

~~~
mundu_wa_hinya
Worked for EC2 as an SE ~5 years ago. We used to handle rack-downs and page
the relevant team if a large set of that team's instances was impacted. We
once had a couple hundred Amazon.com (merchant team) instances impacted. We
paged the team and they were like "don't page us for anything less than 10
racks of our instances down". The buggers didn't even feel the impact. Their
automation was insane.

------
PaulHoule
Getting an npm or other package approved for internal use is not an unusual
practice.

~~~
wmf
Yes, but it probably makes Node.js useless in any such company since any non-
trivial app will have 1,000 npm dependencies.

~~~
throwaway1280
Yeah. I did this once a few years ago, and it was quite unpleasant. Did get it
done in the end, but it definitely put my team off looking for any other
useful NPM packages.

I wonder if it's any more streamlined now?

~~~
thramp
There are some internal build tools that I can vouch for. If you're still at
Amazon, feel free to ping me at dbarsky@ and we can chat.

------
presty
The OP needs to put a date on the article, because AFAIK things are very
different in 2019.

Also, it's interesting how they equate "experimentation" with "open source".

------
stretchwithme
Considering the constant stream of new services and features, the lack of OSS
is insignificant compared to the value they add to the world.

Like the fact that you can create an SSL/TLS certificate for free for load
balancers without the usual agony. So easy.
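
That's ACM, for reference; requesting a free certificate is one call, though
you still have to complete DNS validation before attaching it to a load
balancer (the domain below is a placeholder):

    import boto3

    acm = boto3.client("acm")
    resp = acm.request_certificate(
        DomainName="www.example.com",  # placeholder domain
        ValidationMethod="DNS",
    )
    # Once validated, the ARN can be attached to an ALB/ELB listener.
    print(resp["CertificateArn"])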

------
sumanthvepa
I worked at Amazon in the late 90s, so my experience is most likely not
relevant anymore, but I will make a few observations.

First, I see that many commenters disagree with the OP; they had a different
experience of Amazon, one where they were working with infrastructure that was
responsive, modern, easy to use, etc. It's very possible that both
observations are correct. In a large company, not all parts of the company
will be using the same infrastructure at the same time. Indeed, it would be
dangerous for the entire company to upgrade in lockstep to a new technology
infrastructure.

Second, in most companies, innovation is not measured by the novelty or
newness of the language or framework you use, but by the business impact your
product or service makes. Much of Amazon's innovation was, and is, around
business models. Indeed, when I worked at AMZN, I was writing C code (to power
a website) using beautifully efficient database access code written by Sheldon
Kaphan. There was nothing remotely advanced about the language. It took 9
months for me to get a 3-line code change into production, and I was using
technology that predated Apollo (it was called Huston). There is nothing
particularly wrong about that either (it was a potty-mouth filter that was
blocking some obscure swear words, and no one was too worried that the
component it was part of didn't ship for the best part of a year).

I now run my own company, and I manage technology and people as well as write
code. I find myself exercising the same conservatism with respect to code and
infrastructure that I found at Amazon, and for the same reasons. It is
expensive and potentially company-destroying to switch languages and core
technologies. It is best not done, or if done at all, done with a lot of care
and slowly.

------
femto113
I was once pitched a startup founded by some ex-amazonians whose big idea was
"Apollo for everyone". They were nonplussed by my spit take.

For the people saying "I worked there and it wasn't like that" I wonder if you
worked in retail. It's a very different world from the more modern bits of the
company.

------
just_passing_by
What amuses me is that most of the rebuttals come from ex-Amazonians, not from
current staff. This is the only company I know of that deals with this much
criticism from its engineering ranks.

I'd also add that the article is more or less fresh, and at Amazon's scale I
doubt any major changes have taken place in the last 10 months.

~~~
amzn-throw
There is a comment above you from an Amazon Principal Engineer:
[https://news.ycombinator.com/user?id=jcrites](https://news.ycombinator.com/user?id=jcrites)

His profile says "Architect and cofounder of Simple Email Service. Creator of
Cloud Desktop, a cloud-based development environment used by most Amazon
engineers. Technical lead for Amazon's strategy for using AWS."

Can't get more "from the horse's mouth" than that.

We are generally asked not to comment on stuff like this because of how easy
it is to reveal confidential internal details.

For the record, the article is mostly wildly out of date, but others have
already corrected the record.

~~~
just_passing_by
I'm aware current employees comment over here. Two things to mention:
principal/architect roles are always based not only on merit of skill but also
on politics. So, take it with a grain of salt.

Moreover, if you inspect the person's comments you will notice how "legally"
clean they are. Even the phrasing looks the same for both comments: "based on
my experience", "this information is wrongful", etc. Looks like those were
refined by the legal team before being posted.

I wonder about all this because, since my first engagement with AWS and Amazon
recruiting around five years ago, the engineering tone has not changed. That
concerns me in that putting in the effort to get a job there may turn out to
be a major disappointment.

The cutthroat approach is nice sometimes, as it adds the taste of competition,
but the whole noise makes it seem like you're about to get buried rather than
played or burned out.

~~~
amazon_throw
Translation of the first two paragraphs: "I've decided what I think already,
and if you say you work at Amazon now and are happy, it's prima facie evidence
that you can't be trusted to be objective about it."

~~~
just_passing_by
Sort of.

If it's a liability to say nasty stuff about your employer, and most of the
ex-employees throw some heat, then the nearest conclusion is that the truth
tends toward statements that present employees dismiss as outdated and
controversial.

It just can't be a smear campaign against that one company.

------
throwawayamz27
The main benefit of Amazon's tools is that once you've been there a while you
know how they work, and all the complexity and bugs have been stripped out of
them. And because they force engineers to go on-call, everyone has a pretty
good idea of how to fix things.

When you have SREs spending all day creating the next new thing (generally
after deprecating the previous one with no replacement), you end up in a
situation where you forget how to, say, roll back a bad deployment. Or scale a
fleet.

The problem with fancy infrastructure-as-code, containers, and logging
services is that when they break you have no idea how to get out of trouble.
SSH and grep almost always work, as does symlinking a directory.

------
jypepin
> "It's complicated, so it's gotta be good. I must be dumb to not get it."

Having worked with AWS a lot recently, this article doesn't surprise me at all,
actually. When you see the low quality of the UI and documentation for most of
their tools that users pay for, I wouldn't expect their internal tooling to be
any better.

I'm not saying their tech is bad; once things work, they work great. I'm
talking about the usability of those tools as I try to use them, and it makes
me feel the same as the OP's quote.

~~~
geggam
If you are using the UI to leverage AWS, you aren't really leveraging AWS.

AWS is designed to be used via automation and the API.

Disclaimer: I don't work for AWS, but I have spent many years building
relatively large stacks on AWS (thousands of EC2 instances, with monthly spend
being a couple of supercars in value).

~~~
jypepin
Yes, but before using the API you need to know which API to use. Say you are
looking into using a new service for analytics: which combination of services
do I use, between CloudWatch, Firehose, etc.? Or take reading documentation,
for example: they have terrible documentation.

Someone using their API/CLI is already familiar with AWS and how it works. The
point is, before getting to that point, you need good docs and a good UI to
allow people to discover and learn.

------
ctvo
The author is wrong about his description of tooling and best practices at
Amazon.

I'm also annoyed that the post isn't dated anywhere so there's no way for me
to tell if it's just old.

------
vp8989
Looking through the rest of the blog, it looks like the author worked at
Amazon between 2014 and 2016.

------
kartikrustagi
This article is very outdated compared to the current state of things, and has
been for quite a while now.

------
IloveHN84
To me, it sounds like a company that is still working with the mentality of
the 90s. It's the same thing I'm facing at work, where new technology is
treated as scary and there's fear of changing something that (somehow, only
God knows how) is working, and no one is allowed to touch such things.

~~~
HereBeBeasties
New technology _is_ scary. It should be scary. If you don't find it scary,
you're probably not looking at it right. It's usually full of risk, which, if
you are in management (or a remotely responsible employee), needs to be
evaluated and if possible contained, and it ultimately needs to pay for itself
(i.e., offer a pay-off that more than outweighs the risk).

If we are honest with ourselves, most new tech in this industry totally fails
that litmus test.

There are an awful lot of shiny-bauble chasers and snake oil salesmen both
inside and outside companies.

I don't know what your precise situation is and obviously it's very much a
spectrum rather than black and white, but a healthy aversion to promises made
by the authors and salespeople of relatively unproven new things is extremely
valuable.

------
satyajugran
I don't think there is any serious authenticity to this blog post.

------
anth_anm
This matches my experience using the internal tools pretty well.

But the team I was on was mostly using AWS, which meant the tooling was
better.

We also had a bunch of people who wanted Python instead of Ruby, so they
started using Python and eventually we were just a team that used Python.
Cool.

Apollo is probably the worst thing. Brazil, Pipelines, etc... those are mostly
fine.

Amazon has done a ton with the JVM. They don't need something new, so they
don't bother. That's fine.

They also do adopt other tech as needed. I know people who worked in Go,
because it was container stuff.

This was also one of the things that gave me pause when considering
interviewing with Snap a while back. So many former Amazon people. Seemed like
a lot of stuff was going with "just copy Amazon, make it a bit better". I
don't want to write Java. Bleh.

Anyway, this article is going to have a lot of people saying "oh no that's not
right". It is. There are exceptions but overall it's pretty much bang on.

Oh, and open source. My understanding is Jeff doesn't like contributing back.
The company doesn't like contributing back. They keep an iron grip on IP. They
are ridiculous about letting employees do side projects. They push back on
every single FOSS contribution, and even after some of our senior guys did a
whole bunch of work it still required multiple layers of approvals, a whole
bunch of hoops, an extra training course, and blah blah blah blah. It's
really, really dumb. I find it crazy that Amazon doesn't get more flak for
being probably the worst company for open source around right now.

But MS has tossed out their old CEOs, has tons of interest in making open
source work well for them, and contributes loads... and still gets shit on.

~~~
eclipxe
Working on a side project is super easy. Just file a TT (or I guess a SIM
now?). As long as it isn't competing with Amazon and isn't a game, approval is
generally quick.

~~~
granzymes
What is the issue with games?

~~~
eclipxe
Amazon owns game studios and might have many different game concepts being
worked on, such that an employee might inadvertently compete with a similar
game studio idea.

------
torqueTorrent
In my career I've encountered a lot of engineers who wanted to shy away from
command-line or command-prompt tools (shell, CMD.exe, batch, scripting, cron,
and related 'traditional' automation) in favor of GUIs, IDEs, HTML, browsers,
etc.

I've even had some young, sexy, Angular-wizard-type engineers who had the ear
of management sarcastically respond with statements like "I don't do command
line".

This article and my experience with AWS development and Amazon lead me to
believe this entire company is led and staffed by such engineers.

------
mnm1
It's hardly surprising. The quality of their public products isn't much better
than what's described here. It's fine for companies with plenty of engineers,
money, and time to base their tools on, but that's about it. Without building
a ton of extra tooling and having specialized information only available
through paid support, it's almost impossible to operate anything on AWS. The
documentation is plentiful but mostly out of date, wrong, incomplete, and
difficult to browse, search, and use. Doing devops on AWS is a nightmare that
never ends. Not to mention the speed of deploying anything is beyond slow, so
any work takes many times as long as it should. For large companies with
plenty of resources, these are minor points. For small and medium-sized ones,
it's a loss of productivity and money that simply cannot be justified over
other methods.

~~~
banku_brougham
Dang, probably true for the products you were using, but Redshift has been
great for me, including the documentation, which I have never found to be out
of date.

I wouldn't say it's the economical option though.

Oh, and it's high time 'count(distinct) over()' partitions were supported.

