
Scaling Etsy - luu
https://twitter.com/mcfunley/status/1194713711337852928
======
malisper
As someone who scaled the database at a company that dealt with over a
petabyte of data, here's a few thoughts I have:

It seems like the biggest issue Etsy had was a lack of strong engineering
leadership. They relied on an outside consultant to rewrite their application
in a fundamentally flawed way. The lack of eng leadership resulted in both
poor technical decisions being made and bad engineering processes for how the
app was developed.

Etsy's biggest technical mistake was building an application using a
framework that no one understood. This led to the predictable result that when
they deployed the application to production, it didn't work. Even if the
application had worked, Etsy still would have needed to maintain the Twisted
application indefinitely. Maintaining an application written in a framework no
one understands sounds like a recipe for a disaster. Sooner or later you are
going to run into issues that no one will know how to fix.

Process-wise, Etsy made the mistake of not properly derisking the Twisted
application. They only found out that it didn't work when they were deploying
it to production. They made the same mistake a second time when they tried to
deploy the replacement. When I'm building a new service, the first bit of code
I write is to test that the fundamental design of the new service would work.
Usually this only takes a few days instead of the months it takes to implement
the full service. It sounds like Etsy could have set up a simple Twisted app
that did a fraction of what the final version did. If so, they would have
found a number of fundamental flaws in the design before having spent a ton of
time building out the service.

To be honest, this story shows how a business can succeed even with a bad
engineering team. It would be one thing if this sort of incident killed off
Etsy. Instead, Etsy has gone on to become the $5 billion company it is today.
I'm not saying engineering doesn't matter. All I'm saying is you can build a
pretty successful business even with a bad engineering team.

~~~
yoz
_All I'm saying is you can build a pretty successful business even with a bad
engineering team._

I completely agree. The most important lesson I’ve learned in the past decade
of software development is:

Good product/UX design can save bad engineering, but it doesn’t work the other
way around.

~~~
endymi0n
100% this.

Tackling a real problem > Having a big addressable market > Good product / UX
> Great Engineering.

Strictly.

Great Engineering pays back at a much larger time scale and makes for an
enabler or a breaker at the critical scaling stage only. You can get
surprisingly far with a protoduction service and still build something huge
(see Twitter).

Great design is atomic and homogeneous. Weak technology leadership ends up
opening a vacuum that gets filled by Architecture Astronauts (Etsy), every
team doing their own thing technology-wise (SoundCloud), or at the very worst,
some dogmatic homebrew NIH stack (various places I’ve worked at).

Every unhealthy technology organization looks unhealthy in its own way, but the
great ones all look alike: clean, simple, rather homogeneous and logical.

~~~
StreamBright
This is only true within limits on engineering quality. Engineering can be
bad enough to make it impossible for the business to succeed, because the
solution costs more to run than the revenue it generates.

~~~
bpicolo
Depends on the type of software. If you're building transactional software
where your business takes a cut of a few bucks per user transaction, the
software can be tremendously bad before it cuts into revenue, as long as
you’re not totally busting the user experience.

------
neya
On an unrelated note, I hate how this has become the trend for long-form
content. What should've been a normal blog post is now chunked into smaller,
distracting paragraphs, each containing its own paths to deviate from the
original topic (comments, retweets, etc.).

I wonder what's the motivation though - is it the likes? The informal setting?
I really miss those days when we just had content sitting with default font
styles inside plain 'ol HTML tables.

~~~
sirn
Though it's not universally the case, I like to quote @foone's thread on why
he publishes on Twitter rather than in a blog post:

[https://twitter.com/foone/status/1066547670477488128](https://twitter.com/foone/status/1066547670477488128)

(Threader:
[https://threader.app/thread/1066547670477488128](https://threader.app/thread/1066547670477488128))

~~~
DandyDev
Genuine tip then: write your story on Twitter, then copy paste the compiled
story from Threader into a proper blog post and replace your tweets with a
link to the blog post. Problem solved!

~~~
jeremyjh
People expect more from blog posts. It's OK for a Twitter thread to be
rambling, to backtrack and clarify the premise halfway through, but this is
not accepted in a blog post, which needs to be structured more like a
traditional essay and it _needs to be edited_. Someone without the attention
span or patience to write a blog post in the first place is certainly not
going to go back and revise/edit a perfectly good tweet-storm just to make
someone on HN happy.

------
sethammons
I can sympathize. We have a Twisted database abstraction layer that was built
in-house. You call HTTP endpoints that perform database operations. It has
routing logic so applications only need to call someEndpoint.json and not
worry if we've sharded the backend recently, scaled read replicas, migrated a
write master, etc. It has caching, query piggy-backing, connection pooling,
and other bells and whistles.
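
For anyone unfamiliar with query piggy-backing, here is a minimal sketch of the idea, not the in-house layer described above. It assumes an asyncio service; the class name, `fetch` method, and fake backend are all hypothetical. Identical queries that arrive while one is already in flight wait on the same result instead of hitting the database again.

```python
import asyncio


class PiggybackingQueryRunner:
    """Coalesce identical in-flight queries so the backend sees each one once."""

    def __init__(self, run_query):
        self._run_query = run_query  # async callable: sql -> rows (assumed)
        self._in_flight = {}         # sql -> asyncio.Task currently running

    async def fetch(self, sql):
        task = self._in_flight.get(sql)
        if task is None:
            # First caller for this query: start it and remember the task.
            task = asyncio.ensure_future(self._run_query(sql))
            self._in_flight[sql] = task
            # Forget the task once it finishes so later calls query again.
            task.add_done_callback(
                lambda _t, key=sql: self._in_flight.pop(key, None)
            )
        # Every caller, first or not, awaits the same task.
        return await task


async def demo():
    backend_calls = []

    async def fake_backend(sql):
        backend_calls.append(sql)
        await asyncio.sleep(0.01)  # pretend the database takes a while
        return [("row-for", sql)]

    runner = PiggybackingQueryRunner(fake_backend)
    results = await asyncio.gather(
        *(runner.fetch("SELECT 1") for _ in range(5))
    )
    return results, backend_calls


results, backend_calls = asyncio.run(demo())
```

Five concurrent callers get the same rows while the fake backend is called once. A real layer would also need to think about cancellation, errors, and how long a result may be shared.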

While we have been able to scale it out for volume of requests, it did not
scale well for multiple teams and multiple services. It became a single point
of failure whereby the access patterns of one team to their own databases
could affect other teams' abilities to access their own databases.

We've since moved to a new model whereby teams own their datastores and
access methods, speeding up development and reducing negative impact teams can
have on each other's data layer. Legacy access continues to be migrated off
endpoint by endpoint, database by database. I look forward to the day of never
having to look at Twisted code ever again. The framework is aptly named.

If you are looking for a similar solution for MySQL, I currently recommend
ProxySQL. It allows for a topology whereby teams can control their own proxy
layer and still have most of the benefits outlined above.

------
excerionsforte
Ha, Spouter: a middle layer for talking to the DB instead of just talking to
the DB directly (a proxy for connection pooling, etc. could sit in between).
This kind of project came up at my previous job and while I didn't know about
Spouter at the time, I knew I didn't like the idea one bit.

Managing a DB + middle layer for a small and already stressed DB team?
Can't imagine that going well. The problem was that they spent too much time
doing reviews for simple SQL patterns for CRUD + operational issues like
debugging database performance issues or watching devs perform migrations. Of
course they have other things to do in their job.

My opinion was the root of the problem is database choice and practices. If
all I want to do is simple CRUD, then give me good scaled Redis cluster (AOF
enabled), dynamo or something that constrains the query model. The DB team can
worry about the cluster management, leaving me to worry about structure.
I could consult DB team if I needed opinions for multiple cases. Give me a way
to watch db performance as well, so DB team does not need to watch over a
migration at 12am or some off period.

Sometimes I may need higher performance or more dynamic queries. In that case,
just creating a table or Elasticsearch index that holds only the indexable
values, and querying it to get the IDs, works. Use those IDs to fetch from the
original store.
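
A rough sketch of that two-step lookup, under stated assumptions: a plain dict stands in for the primary key-value store, an in-memory SQLite table stands in for the index, and the table, column, and function names are made up for illustration.

```python
import sqlite3

# Stand-in for the primary key-value store: id -> full object.
object_store = {
    1: {"id": 1, "seller": "alice", "price_cents": 1200, "title": "Oak table"},
    2: {"id": 2, "seller": "bob", "price_cents": 800, "title": "Pine shelf"},
    3: {"id": 3, "seller": "alice", "price_cents": 300, "title": "Coaster set"},
}

# Secondary index: only the fields we want to query on, plus the id.
index_db = sqlite3.connect(":memory:")
index_db.execute(
    "CREATE TABLE listing_index ("
    "id INTEGER PRIMARY KEY, seller TEXT, price_cents INTEGER)"
)
index_db.execute(
    "CREATE INDEX idx_seller_price ON listing_index (seller, price_cents)"
)
index_db.executemany(
    "INSERT INTO listing_index (id, seller, price_cents) VALUES (?, ?, ?)",
    [(o["id"], o["seller"], o["price_cents"]) for o in object_store.values()],
)


def find_listings(seller, min_price_cents):
    """Query the index for matching ids, then fetch full objects by id."""
    rows = index_db.execute(
        "SELECT id FROM listing_index "
        "WHERE seller = ? AND price_cents >= ? ORDER BY id",
        (seller, min_price_cents),
    ).fetchall()
    return [object_store[row[0]] for row in rows]


matches = find_listings("alice", 500)
```

The index only ever returns ids; everything else comes from the original store, so the index can be rebuilt or swapped without touching the stored objects.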

~~~
AmericanChopper
> My opinion was the root of the problem is database choice and practices. If
> all I want to do is simple CRUD, then give me good scaled Redis cluster...

Throughout my entire career I have come across very few simple crud
applications that will actually work properly with denormalized data
structures. It’s not even about scaling users, it’s about scaling your schema.
The very first instance you encounter where a document has a nested array will
start to cause problems, and the only way to solve it is to push more
complexity and state management onto the client. Which is bad interface
design, and quickly gets out of control.

If you really do have a simple crud app, and don’t want to put too much effort
into running your RDBMS, just use MySQL imo.

~~~
excerionsforte
I was working at a company of around 400 employees using MSSQL, so simple
CRUD is not limited to simple CRUD apps. Simple CRUD operations are insert,
select, update, delete. We auto-generated the code and SQL (stored procedures)
and pasted it into the codebase, which made me think: why bother with this if
I could develop new features without SQL at all?

Why should my queries be a stored proc or prepared statement if all I need is
to get, insert, update, or delete an object? Yes, the auto-generated code
wrote to the database as an object, as in all fields were included in the
query. Why? There were no transactions, foreign keys, etc., since some tables
were stored in different databases. Using a relational DB as an object store
meant people were using it inefficiently as an object store.

In my own projects, if I need the speed, columns are indexed while unindexed
data is binary. I can explode out (unserialize) the binary data into cache.

The point of my post was that centralizing database access tends to be a
bottleneck for team efficiency, if not for the devs then for the DB owners.
Augment my ability to execute; don't try to completely change it in a way that
brings a whole set of unwanted problems that were not there before.

~~~
AmericanChopper
I’m not saying you made an inefficient choice. If you only need to interact
with objects that have no relations or nested lists, then denormalized data is
likely not going to be a problem. But this is a remarkably niche use case. In
practice, nearly all cases I’ve seen where people have come to this
conclusion, it’s because they didn’t properly analyse their schema to begin
with.

That’s not to say all use cases for such technology are niche. A lot of CMS
applications can fit into that paradigm very well for instance.

I’d also say that storing binary data in an RDBMS is a separate anti-pattern
altogether.

~~~
excerionsforte
Every column you write to a DB is binary at some point. When you choose a
type, all you ask the DB to do is interpret the stored binary data in that
way. By choosing a binary type to store data, you've told the DB not to
interpret the data. You don't hit issues with charset collation/encoding,
database interpretation (lack of 128-bit ints), etc. Think about it: what
is an integer? It is 4 bytes of data.
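
As a small illustration of that point (a sketch with made-up values, not tied to any particular database): a 32-bit integer round-trips through four raw bytes, and a 128-bit value fits in a 16-byte binary column even where no 128-bit integer type is available.

```python
import struct

# A 32-bit signed integer is four bytes once encoded (little-endian here).
encoded = struct.pack("<i", 2008)
decoded = struct.unpack("<i", encoded)[0]

# A 128-bit value stored as 16 raw bytes; the application, not the
# database, decides how to interpret them.
big_value = 2**127 + 12345
big_bytes = big_value.to_bytes(16, "big")
restored = int.from_bytes(big_bytes, "big")
```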

Whether something is an anti-pattern depends on the use case. I absolutely do
not agree with your generalization. Storing binary data in an RDBMS is not an
anti-pattern. Binary data can be of any size. The bigger the binary data, the
fewer rows you should expect to store. At some point (maybe the binary data is
images/watermarks, etc.), you have to choose a replicated file system and make
that part of the datastore operations.

Furthermore, I've only worked on high qps applications, so maybe I'm a bit
biased on how to use the database efficiently. :)

~~~
Aeolun
Congratulations, you’ve offloaded to your app what your database is designed
to deal with.

Storing all data as binary is an anti-pattern too, regardless of your qps.

I wouldn’t store anything but metadata in the database, the blob can be
somewhere like S3.

~~~
excerionsforte
It depends on the use case. If you cannot accept that, then let's agree to
disagree; further discussion is not warranted. I've been on both sides and
take my experiences with me.

You can assert it's an anti-pattern, but knowing how to structure your tables
matters. SQL has BLOB, BINARY, and VARBINARY; choose the proper type depending
on the trade-off. Models that include blobs can be structured in the DB to
avoid IO issues (indexed data includes the id). Go to S3 with the id. How does
what you are saying differ? My first post literally says this.

Not only can they be structured to avoid IO issues, they are also protected by
a cache where the unindexed data is exploded.

Of course with high QPS I want to offload CPU cycles away from the DB. Scaling
the app is easier than a DB.

Scratching my head here. Are you arguing for using only SQL? Why would I do
that?

~~~
orf
I’m sorry, but no matter what you think storing all unindexed columns as
binary is very, very weird. You can write all the documentation you want as to
why this is superior to what literally everyone else is doing and has been
doing since the dawn of time, but it won’t stop people joining your project
thinking that you are mad.

This kind of thing is the stuff of horror stories. Now I have no doubt that
you’ve convinced yourself that this is a great approach, but it’s not, and it
will be replaced as soon as you leave (assuming you’re not a one man band).

~~~
evanelias
> I’m sorry, but no matter what you think storing all unindexed columns as
> binary is very, very weird

You may be surprised to learn that several very large and well-known social
networks use this technique -- serializing unindexed columns into a blob
(typically compressed at some level) -- for their core product tables. It's
not really that "weird", if you consider that literally tens of thousands of
engineers work at companies doing this.

Conceptually it's exactly the same technique as putting all
unindexed columns in a single JSON or HSTORE column. Newer companies use
those; older companies tend to have something more homegrown, but ultimately
equivalent, typically wrapped by a data access service which handles the
deserialization automatically.

This technique is especially advantageous if you have a huge number of
flexible fields per row, and many of those fields are often missing / default
/ zero value, and the storage scheme permits just omitting such fields/values
entirely. Multiplied by trillions of rows, that results in serious cost
savings.
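
A minimal sketch of the idea, assuming JSON plus zlib as the serialization (real systems tend to use their own formats); the field names and defaults are hypothetical. Fields that equal their default are simply left out of the blob and filled back in on read.

```python
import json
import zlib

# Hypothetical defaults for the unindexed fields of a row.
DEFAULTS = {"bio": "", "follower_count": 0, "is_verified": False, "locale": "en"}


def pack_fields(fields):
    """Serialize unindexed fields into one blob, omitting default values."""
    sparse = {
        key: value
        for key, value in fields.items()
        if key not in DEFAULTS or DEFAULTS[key] != value
    }
    return zlib.compress(json.dumps(sparse, sort_keys=True).encode("utf-8"))


def unpack_fields(blob):
    """Inverse of pack_fields: decode the blob and restore omitted defaults."""
    sparse = json.loads(zlib.decompress(blob).decode("utf-8"))
    return {**DEFAULTS, **sparse}


row = {"bio": "", "follower_count": 0, "is_verified": True, "locale": "en"}
blob = pack_fields(row)
stored = json.loads(zlib.decompress(blob).decode("utf-8"))
roundtrip = unpack_fields(blob)
```

Here only `is_verified` ends up in the stored blob, and reading it back yields the full row. The blob would sit in a single binary column next to the indexed columns.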

~~~
excerionsforte
It's a very boring technique really in hindsight. I guess one is really at the
top 1% when you work at a high scale company and you see techniques that
question the assumptions one holds. Databases like MySQL, as you pointed out,
are embracing this technique, but make the inner data indexable.

Also [https://www.skeema.io/](https://www.skeema.io/) looks like a good
product that I'll have to check out. Looks like a better product so far
compared to solutions like flyway/liquibase. Full featured suite for DB
migration from dev -> prod is exactly what I've been raving about. Like it is
"boring" tech as in no one really wants to touch it, but it is the easiest to
screw up and products like this really take it to the next level.

The responses I've seen in this post appear to be from people who've never
used dynamic fields in a db and advise against it by saying it is an anti
pattern. If it is an anti-pattern, bring on all the anti patterns as I'd like
to not wake up at night or be pinged for slowness.

~~~
evanelias
Yeah, it's always interesting seeing the contrast between textbook computer
science approaches vs practical real-world solutions. For some reason, people
get especially hung up about academic CS concepts in the database world in
particular... I've found things are never that clean in the real world at
scale.

Thank you for the kind words re: Skeema :)

------
maxpert
This reminds me of a similar debate I had with one of my friends who was
falling for NodeJS FOMO. The actual codebase was RoR and he wanted to go for
Node for no apparent reason. I was able to talk him out of it, but what I
understood from the discussion was the peer pressure of “hey all of this could
have been async”. While I admit blocking threads might not be a permanent
solution, they can still take you pretty far. Not to mention the folks wanted
a MERN stack, and I believe leaving Postgres for Mongo without a damn good
reason is just a crazy idea.

~~~
demosito666
Isn't SQL -> mongo a meme now? I mean, are people still seriously considering
this outside of one or two very specific use cases? At my workplace the only
outcome of this would be an awkward silence and weird looks.

~~~
aniforprez
Yeah, having worked on a mongo stack, there are N problems caused by the
unstructured nature of mongo, and there are so many problems with existing ORM
solutions, chiefly the fact that they're trying to create structure where it
doesn't exist. This was a decision taken aeons ago and there's not much we can
do to change it now, but I hope the guy who made that decision is happy, cause
I sure am not.

~~~
celim307
The only thing I’ve found mongo good for is the warm data path, where you need
slightly more permanence than a straight pub/sub (like getting the last 2
minutes of events upon connection) but you really don’t care about throwing
the data away after that.
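
To make the "last 2 minutes of events upon connection" pattern concrete, here is a minimal in-memory sketch. It is a stand-in for whatever store is actually used, and the class and method names are hypothetical. Timestamps are passed in explicitly so the behaviour is easy to follow.

```python
from collections import deque


class RecentEventBuffer:
    """Keep only events newer than a fixed window; replay them on connect."""

    def __init__(self, window_seconds=120):
        self.window_seconds = window_seconds
        self._events = deque()  # (timestamp, payload), oldest first

    def publish(self, timestamp, payload):
        self._events.append((timestamp, payload))
        self._expire(timestamp)

    def replay(self, now):
        """Return payloads still inside the window, oldest first."""
        self._expire(now)
        return [payload for _, payload in self._events]

    def _expire(self, now):
        # Drop events that have fallen out of the window.
        cutoff = now - self.window_seconds
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()


buffer = RecentEventBuffer(window_seconds=120)
buffer.publish(0, "a")
buffer.publish(100, "b")
buffer.publish(130, "c")
on_connect = buffer.replay(now=200)
later = buffer.replay(now=225)
```

A client connecting at t=200 sees the events from t=100 and t=130; by t=225 the older one has aged out and is gone for good, which is the point.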

------
mcfunley
OP here—if it’s not obvious from the tweets the timeframe of this story is
2007 through 2008.

------
vegetablepotpie
>I managed to not get fired because through this whole thing I was talking
shit about it.

I know it’s unprofessional to comment negatively on other developers’ work,
but I do see it often in the workplace and it does serve to distance an
individual from the group that is failing.

It’s sad but true, and this is an example: employers say they want
professionalism but they incentivize against it.

~~~
pm90
There’s 0 downsides to criticizing everyone else’s projects and systems so a
lot of people do it. This is kinda exacerbated by the fact that we engineers
have a reputation for being incorrect about the stability of systems we build
(there are good reasons for this, and I personally don’t think it’s possible
to avoid that impression even when it’s not intended).

I wonder if this team would have done better with tools that offer better
observability (metrics, logs, tracing etc.). An example of a rewrite that went
well: [https://avc.com/2019/12/grinding/](https://avc.com/2019/12/grinding/)

------
PragmaticPulp
It’s amazing how relatable these anecdotes are. Change a few key words here
and there, and this tweetstorm could describe most engineering leadership
failures I’ve seen.

Nearly every tweet describes scenarios that can only happen when engineering
management is M.I.A. or too inexperienced to recognize when something is a bad
idea.

- Attempting to solve problems with a rewrite in a different programming
language. This can be the correct long term decision in specific scenarios,
but it’s rarely the correct answer for bailing out a sinking ship. You need to
focus on fixing the current codebase as a separate initiative rather than
going all-in on a rewrite to fix all of your problems. Rewrites take far too
long to be band-aid solutions.

- Rewriting open-source software before you can use it. If Django doesn’t fit
your startup’s needs, the solution is never to rewrite Django. The solution is
to use something else that does fit your needs. Your startup’s web services
needs are almost never unique enough to merit rewriting popular open-source
frameworks as step 1. Pick a different tool, hire different people if
necessary, and get back to focusing on the core business problem. Don’t let
the team turn into open-source contributors padding their resume while
collecting paychecks from a sinking startup. Save the open-source work for
later when you have the extra money to do it right without being a
distraction.

- Hiring consultants to confirm your decisions. Consultants can be valuable
for adding outside perspective and experience, but the team must be prepared
to cede some control to the consultant. If you get to the point where you’re
hiring a “Twisted consultant” instead of a web store scaling consultant,
you’re just further entrenching the team’s decisions.

- “Nobody was in charge”. Common among startups who pick a “technical
cofounder” based on their technical skills, rather than their engineering
leadership skills. When you assemble a team of highly motivated, very smart
engineers, it’s tempting to assume they can self manage. In my experience, the
opposite is true. The more motivated and driven the engineers, the more you
need explicit leadership to keep them moving in the same direction. Otherwise
you get the next point:

- Multiple competing initiatives to solve the problem in different ways.
Letting engineers compete against each other can feel like a good idea to
inexperienced managers because it gets people fired up and working long hours
to prove their code is the best. That energy quickly turns into a liability as
engineers go full political, hoarding information to help their solution
succeed while casually sabotaging the competing solutions. In a startup, you
need everyone moving in the same direction. It’s okay to have disagreement,
but only if everyone can still commit to moving in the one chosen direction.
If some people can’t commit, they need to be removed from the company.

- The “drop-in replacement” team. This is just a variation of having
engineers compete against each other. Doesn’t work.

- Allowing anyone to discuss “business logic” as if it’s somehow different
work. This leads to engineers architecting over-generalized “frameworks” for
other people to build upon instead of directly solving the company’s problems.
At a startup, never let people discuss “business logic” as something that
someone else deals with. Everyone is working on the business logic, period.

I have to admit that when I was younger and less experienced, I plowed right
into many of these same mistakes. These days, I’d do everything in my power to
shut down the internal competition and aimless wandering of engineering teams.

Ironically, these situations tend to benefit from strategically shrinking
headcount. It’s not a fun topic, but it’s crucial for regaining alignment. The
key is to remove the dissenters, the saboteurs, the politicians, and the
architects creating solutions in a vacuum. You need to keep a cohesive core
team that can move fast, commit to one direction even when they disagree, and
not let their egos get in the way of doing the right thing.

The real challenge is that those employees tend to fly under the radar. The
people quietly doing the important work and shipping things that Just Work can
be overshadowed by bombastic, highly opinionated “rockstar” engineers.
Founders need to be willing to let those rockstars go when they no longer
benefit the company, no matter how good their coding skills might be in
isolation. A coordinated team of mediocre but diligent engineers will run
circles around a chaotic team of rockstars competing against each other.

~~~
pm90
> The real challenge is that those employees tend to fly under the radar. The
> people quietly doing the important work and shipping things that Just Work
> can be overshadowed by bombastic, highly opinionated “rockstar” engineers.
> Founders need to be willing to let those rockstars go when they no longer
> benefit the company, no matter how good their coding skills might be in
> isolation. A coordinated team of mediocre but diligent engineers will run
> circles around a chaotic team of rockstars competing against each other.

So much this. The bombastic Rockstars not only create unnecessary ego driven
political fights that are distractions to a business delivering value, but on
the occasions they do deliver, it’s often not what was promised. But their
social skills (or privilege) allow them to cruise through failures.

Whereas a team of devs that work well together that are hungry to learn,
willing to have an open minded conversation about systems, that deliver
consistently: they are what really keep the business from falling apart.

------
etxm
> an infamous incident where one of the investors had to drive to Secaucus to
> physically remove the other engineering founder from the cage.

I really want to hear this story.

------
celim307
I’d love to hear more stories like this

------
paxys
Does anyone have a version that isn't 37 Tweets?

~~~
conickal
[https://twtext.com/article/1194713712516423681](https://twtext.com/article/1194713712516423681)

------
allard
"Interesting" "feature" of their platform that I see monthly, and more
often recently given the season: I get a message like "where's my table?"

But I'm not a seller. My username is simple and a common first name. I've
approached them a few times, and there's no interest in fixing it. Yesterday,
I found where I could open/sever the connection between those messages and my
email. It's been going on for years.

Many confused users, although I have little idea how long they stay that way.

