Scaling Etsy

malisper · on Dec 22, 2019

As someone who scaled the database at a company that dealt with over a petabyte of data, here's a few thoughts I have:

It seems like the biggest issue Etsy had was a lack of strong engineering leadership. They relied on an outside consultant to rewrite their application in a fundamentally flawed way. The lack of eng leadership resulted in both poor technical decisions being made and bad engineering processes for how the app was developed.

Etsy's biggest technical mistake building an application using a framework that no one understood. This led to the predictable result that when they deployed the application to production, it didn't work. Even if the application had worked, Etsy still would have needed to maintain the Twisted application indefinitely. Maintaining an application written in a framework no one understands sounds like a recipe for a disaster. Sooner or later you are going to run into issues that no one will know how to fix.

Process wise, Etsy made the mistake of not properly derisking the Twisted application. They only found out that it didn't work when they were deploying it to production. They made the same issue a second time when they tried to deploy the replacement. When I'm building a new service, the first bit of code I write is to test that the fundamental design of the new service would work. Usually this only takes a few days instead of the months it takes to implement the full service. It sounds like Etsy could have setup a simple Twisted app that did a fraction of what the final version did. If so, they would have found a number of fundamental flaws in the design before having spent a ton of time building out the service.

To be honest, this story shows how a business can succeed even with a bad engineering team. It would be one thing if this sort of incident killed off Etsy. Instead Etsy has gone on to become the $5 Billion company it is today. I'm not saying engineering doesn't matter. All I'm saying is you can build a pretty successful business even with a bad engineering team.

wayoutthere · on Dec 22, 2019

> It seems like the biggest issue Etsy had was a lack of strong engineering leadership.

Know many engineers at Etsy; this is absolutely the problem and it persists to this day. The engineers are fine and do their best to self-organize but there is no cohesive engineering vision. They complain about this issue over beers with some frequency.

etsyanon123 · on Dec 22, 2019

Yes 100%. Its not so much as how should we architect/build as we don't even know what problems we're solving and what is important before embarking on large projects/refactors/rewrites. I don't think unique to any org but many rewrites are done based on tech trends/popularity rather than actually solving a problem.

Even if the python rewrite didn't immediately fail what would success look like? What does failure look like? How to measure?

wayoutthere · on Dec 22, 2019

From a lot of what I've heard there is trouble hiring / retaining line-level managers -- engineering managers are usually the ones thinking about and negotiating roadmaps among multiple teams. When you have developers filling this role ad hoc, it can be difficult for them to context switch from low-level technical concerns to operational business concerns. So the low-level technical concerns are the ones that get focused on rather than being focused on some overarching business-led strategy.

Not picking on Etsy here in particular as lots of companies have this problem. But it's not something people often think about as being crucial for good engineering culture.

yoz · on Dec 22, 2019

_All I'm saying is you can build a pretty successful business even with a bad engineering team._

I completely agree. The most important lesson I’ve learned in the past decade of software development is:

Good product/UX design can save bad engineering, but it doesn’t work the other way around.

endymi0n · on Dec 22, 2019

100% this.

Tackling a real problem > Having a big addressable market > Good product / UX > Great Engineering.

Strictly.

Great Engineering pays back at a much larger time scale and makes for an enabler or a breaker at the critical scaling stage only. You can get surprisingly far with a protoduction service and still build something huge (see Twitter).

Great design is atomic and homogenous. Weak technology leadership ends up in opening up a vacuum that gets filled by Architecture Astronauts (Etsy), every team doing their own thing technologywise (SoundCloud), or at the very worst, some dogmatic homebrew NIH stack (various places I‘ve worked at).

Every unhealty Technology organization looks unhealthy in its own way, but the great ones look all alike: Clean, simple, rather homogenous and logical.

StreamBright · on Dec 22, 2019

This is only true with boundaries around engineering quality. There can be not so great engineering making it impossible for the business to be successful because their solution is more expensive than the revenue generated.

bpicolo · on Dec 22, 2019

Depends on the type of software. If you're building transactional software where your business is doing a few bucks cut per user transaction, the software can be tremendously bad before it hits into revenue, as long as you’re not totally busting the user experience

m0zg · on Dec 22, 2019

>> can save bad engineering

Depends on how "bad". I've seen _many_ counterexamples, where no amount of lipstick could fix the underlying flaws. What good is a pretty website if it's slow, or if it doesn't work at all half the time.

Aeolun · on Dec 22, 2019

I see this fixed all the time by throwing more resources at it.

Case in point, all servers were configured to accept a maximum of 5 connections, so when the load balancer got more than x concurrent requests, they started failing.

Instead of fixing the problem, we can just scale the amount of servers (horizontal scalable yay), until the problem goes away.

Now you have tons of 8GB/4C machines sitting (almost) idle, and immense amount of money wasted.

AlchemistCamp · on Dec 22, 2019

> Instead of fixing the problem, we can just scale the amount of servers (horizontal scalable yay), until the problem goes away.

This works if the bad engineering causes a linear increase in scaling costs. If the bad engineering causes an N! increase or a 2^N increase in scaling requirements, then there's a good chance all the servers on the planet aren't enough.

I'd say engineering is no easier to fix by throwing money at it than UX is.

the-rc · on Dec 22, 2019

Until you eventually run out of connections on the DB... And all the workarounds that ensue.

Aeolun · on Dec 24, 2019

Don’t worry, we have a REST service for that too!

conradfr · on Dec 22, 2019

Let's say software is "data -> code -> UI".

The two most important things are data and UI, the one thing we developers spend all of our time thinking and talking about is code :)

mtbcoder · on Dec 22, 2019

There are plenty of successful applications with bad UIs. Also, business logic is encapsulated in code - if that's borked, no amount of UI will save it.

masklinn · on Dec 22, 2019

> There are plenty of successful applications with bad UIs.

Because data >>> UI >> code.

> Also, business logic is encapsulated in code - if that's borked, no amount of UI will save it.

The assumption here is that the bare minimums are fulfilled. If the software doesn't work at all you have none of the above.

CPLX · on Dec 22, 2019

> a business can succeed even with a bad engineering team

If you’ve spent any time in and around Etsy the reality of this is a little more complicated. Etsy has had some very impressive people on its technical team throughout and an unusually large number of really talented developers. It’s also suffered from some serious disorganization in its technical path and complicated leadership situations.

It’s the mix of both that produced Etsy. I think with a bunch of “bad” developers they would have been dead long ago. It’s been more like a (often) good engineering team making (often) questionable decisions.

SQueeeeeL · on Dec 22, 2019

I think a fairer complaint would be "bad engineering leadership", which is basically "bad leadership"...

Someone who cares deeply about the company will understand their core product is flawed and work to find someone to make it better. Not ignore devs

davewritescode · on Dec 22, 2019

100% this. I try to remind younger engineers that struggle trying to make perfect code that there’s a ton of companies who wrote great code on the way to bankruptcy and even more companies running unmitigated PHP based disasters that are very very successful

_the_inflator · on Dec 22, 2019

> All I'm saying is you can build a pretty successful business even with a bad engineering team.

I liked your comment.

I would say "bad engineering". Look at all the banking stuff (disclosure: finance/banking lead) and you will immediately recognize this as a reality.

Why? Because there are no KPIs that relate to speed, sustainability, maintenance. There is a distinction between input, output and outcome.

Also a lot of architecture is done using power point. Drawing a line in PP is easy, connecting the services in reality is often times anything else than trivial.

GordonS · on Dec 22, 2019

> Also a lot of architecture is done using power point. Drawing a line in PP is easy, connecting the services in reality is often times anything else than trivial

From a lot of the "startup stories" I read on HN, I rather gather that often there simply isn't an architectural vision, PowerPoint or otherwise.

TheRealDunkirk · on Dec 22, 2019

> All I'm saying is you can build a pretty successful business even with a bad engineering team.

Many, many years ago, I was asked to take on a new role as the system administrator for a group of advanced-degreed engineers, with their Unix workstations and FEA/CFD software. I setup centralized servers to host their software, a user directory, shared home directories, backup systems, and a Citrix server for Office. It was a tremendous success, and everyone loved the new setup. Upgrades to software became a breeze, people could access their files from anyone else's machines, and most got rid of their secondary PC for just doing email.

The drafting department, with all of their CAD machines, was a different story. Software and data directories were cross-mounted around all of the machines. It was such a mess that, when we had a power outage, it would take them THREE DAYS to get everything going again. This happened a few times a year.

I moved on to another new role to setup a Sun E10K, 3 cabinets of EMC disk, a 384-tape backup machine, and all the stuff that went with this. I was trying to explain the difference in these setups to my (very senior) IT manager, to get a point across about what I was doing. (I might have been arguing about setting up a third, "backup area network," in addition to the second "administrative network," but memory fades.) I got done, and she stared at me like I had a hunchback, and said (in reference to the drafting department), "Yeah, but it WORKS, right?"

You're absolutely right. Senior execs look at the fact that business is improving, and think to themselves, hey, we must be doing something right, and move on to other things. However, it is a source of continual disappointment that there is SO MUCH lost opportunity in the decisions that get made about how IT should be done. And, what's more, those bad decisions are being made by people who, by their role and their pay, should really be expected to understand the difference in my 2 examples, and what the extra overhead is costing.

I mean, if you had two physical production lines in plant, which were in series with each other, and one of them couldn't do anything for 3 freaking days after a power outage, while one carried on as if nothing had happened, plant managers would immediately fire a line manager who told him, "Yeah, but it WORKS, right?"

I think these situations endure because the emperor has no clothes. The people who are calling the shots can't understand the technology that MAKES the difference, and don't want their ignorance exposed.

0x445442 · on Dec 22, 2019

Great insights. I'd also add; the notion of UI/UX and good business covering up engineering is highly dependent on the business space. The further down the call stack you go the less you're going to get away with bad engineering.

If the product is Digital Ocean or DynamoDB or embedded software installed in Ford F-250s UI/UX and slick business aren't going to mask bad engineering.

rhlsthrm · on Dec 22, 2019

Depends. I've heard that Heroku's backend container networking stack was a total mess (not sure if it still is). However they seem to have done well despite that because they had good docs, good marketing, a good UX in the form of DevEx.

pixelmonkey · on Dec 22, 2019

Yes. I think this Etsy story showcases an interesting split in engineering approach, between being "right" vs being "effective". Perhaps a variant on "worse is better".

Etsy was originally built by two NYU students while they were still in college. It was their first production software project. It grew super fast in a period where the whole industry was still figuring out how to scale web applications; perhaps only Amazon & eBay had actually figured it out by then.

Then VCs got ahold of Etsy and tried to put out the fires by replacing the "kids" who built the thing. This was around 2006, which predates even Twitter's fail whale and Series A funding to round.

The founding team wasn't a bad engineering team. It was a young founding team who built a thing that grew like crazy, and which, had it not been built that way, would have meant no engineer could have ever come along to preside over a rewrite. But because it was crashing and dollars were at stake, and because there was some drama that prevented the founding team from having the time to build a proper scale-up engineering organization, a bunch of people came in and immediately reached for a rewrite.

They had a working thing, on a not-that-uncommon stack, but the new engineers insisted on a rewrite. Why?

Because new engineers always insist on a rewrite. Because, as my friend James Powell once put it, "Good code is code I wrote; bad code is code you wrote."

Python and Twisted. I am a big fan of Python and watched the evolution of Twisted -- the rise in popularity, and then the decline. I can imagine how they came across a group of engineers who were absolutely convinced Twisted was the silver bullet to solve their engineering challenges, at that specific moment in time -- when Twisted was at a peak of popularity.

Just like Django/Rails was the silver bullet in 2011. (Or Scala?) Just like Angular was in 2015 (or Go?). Just like React is now (Or k8s?).

Of course, I am not using the "silver bullet" term accidentally. Tech trends change, but "there is no single development, in either technology or management technique, which by itself promises even one order of magnitude [tenfold] improvement within a decade in productivity, in reliability, in simplicity." (Brooks, 1976)

New tools are just an excuse for a rewrite, and rewrites are the ultimate shiny object to a new engineering team, brought in to "fix" an existing and working system.

As an aside: the founding software team at Etsy didn't get nearly enough credit for building a thing that worked and that grew from nothing. If they had focused on the "right" architecture, none of us would have ever have heard of Etsy, and we certainly wouldn't be debating the rewrite. That founding team didn't get enough credit, neither in equity nor in history. But that's another story.

excerionsforte · on Dec 22, 2019

It is hard as a new employee to not say "stop everything and let's rewrite this" when something is broken. New engineers are not only the ones to say it either as the craze and popularity of what's hot captivates longer tenure employees. If something vibes with the problem you are solving then it is only natural to want to use it. Proving the usefulness and viability is the part no one really wants to do as it is a long slog that is against the need of getting a promotion. Why not just use in a new app to prove it? and that will be the basis for let's use this everywhere (aka generalize it!)

It is very nice to try new things though because without that we can't move forward, but how much forward is enough seems to be the troubling part as investing time learning the "new unproven thing" every year or so is taxing.

> As an aside: the founding software team at Etsy didn't get nearly enough credit for building a thing that worked and that grew from nothing. If they had focused on the "right" architecture, none of us would have ever have heard of Etsy, and we certainly wouldn't be debating the rewrite. That founding team didn't get enough credit, neither in equity nor in history. But that's another story.

Very unfortunate reality here.

pixelmonkey · on Dec 22, 2019

Indeed. And, to be clear, I don't think rewrites are always wrong. I just think they are very, very dangerous, and need to be treated as such. I wrote about a successful rewrite I presided over in this essay, "Shipping the Second System". But anyone who was on my team when we did that rewrite will tell you: we were far from optimistic about it. We only did it after exhausting ALL other options, and it was a risk-managed project all the way through.

https://amontalenti.com/2019/04/03/shipping-the-second-syste...

excerionsforte · on Dec 22, 2019

Interesting read.

> “the second system is the most dangerous one a man ever designs”

Never thought about it that way. Feature creep and aversions to what you thought went wrong from the last time get tested here.

> So, all in all, we did several “upgrades to stable” that were actually bleeding edge upgrades in disguise.

I can relate. Being burned by certain upgrades can push one to veer on the side of caution even if the previous version had bugs. Software is never finished as they say.

lazyant · on Dec 22, 2019

I'm having a Baader-Meinhof effect; I recently read about this "Second system syndrome", I think in "The Unicorn Project".

lowercased · on Dec 22, 2019

> Because new engineers always insist on a rewrite. Because, as my friend James Powell once put it, "Good code is code I wrote; bad code is code you wrote.

Good code is code that is testable and tested, bad code is code that is code that is not tested (and may not even be testable).

As someone who's written too much non-tested (and sometimes non-testable) code, I'm pushing myself to test, and to hold that line, on projects both large and small.

Came in to a project last year that had been around 4 years. It was not only untested, but the original code in place turned out to be somewhat untestable altogether. 16 months later, we're still finding "rock bottom" moments of "holy shit this is garbage". By "garbage" I mean finding out that data that people that was being recorded was being discarded or overwritten, for months. Financial data being simply wrong, and 'correct' data not being recoverable.

First reaction looking at it was "this needs to be rebuilt". No no no, we were told - it's already working it just needs a few more features. To the extent that things were "working", they were working on accident. There were no docs or tests to demonstrate that things were working as expected.

The last year has been spent keeping it 'going' while trying to unravel and repair years of bad data is continually uncovered every few weeks. "We don't have time to rewrite!". The fact is, probably 60% of the code that was in use before has been rewritten, but it's taken over a year. It would have been faster to do it from scratch, and worry about importing/transforming data as we verified it after the fact.

So... good code is testable and tested. Absent these qualities, if I'm the 'responsible party', I will advocate strongly for a rewrite. If that 'rewrite' is 'in place', people will be told up front everything will take 2-3x longer than they expect, because we're now doing discovery, planning, testing and whatever 'new feature dev' you were asking for.

Part of the problem with this particular system was that it was started as a rewrite, but for the wrong reasons, or... executed wrong. "We can't understand the old code", so they tacked on a second framework in place and started blending the code - but still documented nothing, nor wrote any tests, or had a repeatable build process. Nothing. So instead of one undocumented system, they had 2, but one was in their head, so it was 'OK'. Based on what we inherited, one could only assume they really didn't have any understanding of the domain problems they were tackling, which just compounded every bad decision.

pixelmonkey · on Dec 22, 2019

Well said!

lowercased · on Dec 22, 2019

thank you :)

jyounker · on Dec 24, 2019

> All I'm saying is you can build a pretty successful business even with a bad engineering team.

A good business model covers a multitude of technical sins.

mlthoughts2018 · on Dec 22, 2019

Twisted is a widely used and well-understood framework that continues to work well and to be a major consideration-worthy option even in the days of modern Python 3 asyncio.

I can’t say whether the Twisted consultant’s advice was good or bad re: business needs, but your comment seems very wrong regardless. A solution built on Twisted would be nearly the opposite of something no one understands, and there would be big communities to go to for help, and you wouldn’t even have too much vendor lock-in since many frameworks that are competitors to Twisted could be used with minimal code changes.

StefanKarpinski · on Dec 22, 2019

You’re missing the point. At a Python shop, using Twisted would have been a fine choice. Adding a completely unnecessary Python layer at a PHP shop was not helpful. Yes, there exist people in the world who know Twisted, but they do not tend to work at a place where everything is written in PHP. Adding a centralized choke point and extra network hops to reduce latency was a bad technical strategy, and adding a layer in a language that the existing engineering team didn’t know was also a bad idea.

neya · on Dec 22, 2019

On an unrelated note, I hate how this has become the trend for long-form content- What should've been a normal blog post is now being chunked into smaller distracting paragraphs with each containing its own paths to deviate from the original topic (comments, retweets, etc.)

I wonder what's the motivation though - is it the likes? The informal setting? I really miss those days when we just had content sitting with default font styles inside plain 'ol HTML tables.

sirn · on Dec 22, 2019

Though not universally the case, I like to quote @foone's thread on why he publish on Twitter and not a blog post:

https://twitter.com/foone/status/1066547670477488128

(Threader: https://threader.app/thread/1066547670477488128)

gist · on Dec 22, 2019

I buy into that explanation. In short it implies a few things:

1) It's not about you the reader it's about me

2) I don't intend to have so many tweets but after writing 1 or 2 I am inspired in a way that keeps me going and I end up at 20.

3) I have not figured out a way to channel my creative energy if I am faced with something that feels like work rather than a reaction or to fill an immediate creative inspiration.

DandyDev · on Dec 22, 2019

Genuine tip then: write your story on Twitter, then copy paste the compiled story from Threader into a proper blog post and replace your tweets with a link to the blog post. Problem solved!

jeremyjh · on Dec 22, 2019

People expect more from blog posts. Its ok for a Twitter thread to be rambling, to backtrack and clarify the premise halfway through, but this is not accepted in a blog post, which needs to be structured more like a traditional essay and it needs to be edited. Someone without the attention-span or patience to write a blog post in the first place is certainly not going to go back and revise/edit a perfectly good tweet-storm just to make someone on HN happy.

__Joker · on Dec 22, 2019

Echo that. I somehow feels this reflects the op's post.

We went from blogs to twitter because you know for whatever reasons. Now we are tweeting out blogs as series of threads and using something like ThreaderApp[https://threader.app/] to recreate the blog.

dannyw · on Dec 22, 2019

It takes no effort to write out a story. It takes effort to make a Medium blog, and worse still, you have to get an audience for it.

johnpowell · on Dec 22, 2019

I made it about three tweets deep and just hit the back button hoping someone would cover the salient points in the comments here.

z3t4 · on Dec 22, 2019

It's the path of least resistance. I used to write one blog post per day, but when I switched to a static site generator I only write one blog post per month. The difference was that before I went to my web site, then wrote the content in a plain textarea. Or sometimes I wrote it in a text editor and copy/pasted it into the text area. Now I have 1) open the CMS 2) Create a new page 3) Fill in some meta-data like title, date and description 4) Write the blog post. Before all blog post was on one long page. Now it's one blog per page. So before I was OK with writing a very short message.

jays · on Dec 22, 2019

For many, it helps build followers and get lots of likes. I’m not sure that was the author’s intention.

Gasparila · on Dec 22, 2019

The author has a long-form blog going into much of a similar topic: http://boringtechnology.club/. Very nice read

pm90 · on Dec 22, 2019

I like it. Every tweet is constrained. Little opportunity to insert stupid memes everywhere. The conversational style is engaging. Asking questions is easy: post a tweet.

cellularmitosis · on Dec 22, 2019

I really don't understand all of the format hate here. Am I the only one with a working scrollwheel? This just seems like such a non-issue.

fooblat · on Dec 22, 2019

I really like to minimize distractions while reading (I love FF reader view). Reading a "long form" post on twitter is like trying to listen to someone tell you a story in a room full of people shouting at each other.

iudqnolq · on Dec 23, 2019

Also it seems as though Twitter never checks their non-logged-in mobile UI. More than 50% of the time I get a generic error message. Sometimes when I scroll down I'm randomly jumped to the top of the page. Comments load non-deterministically.

dazc · on Dec 22, 2019

'I wonder what's the motivation though - is it the likes?'

It does look like some sort of attempt to garner likes rather than an innovative means of publishing content?

Then again, that's twitter.

sethammons · on Dec 22, 2019

I can sympathize. We have a Twisted database abstraction layer that was built in house. You call http endpoints that perform database operations. It has routing logic so applications only need to call someEndpoint.json and not worry if we've sharded the backend recently, scaled read replicas, migrated a write master, etc. It has caching, query piggy-backing, connection pooling, and other bells and whistles.

While we have been able to scale it out for volume of requests, it did not scale well for multiple teams and multiple services. It became a single point of failure whereby the access patterns of one team to their own databases could affect other teams abilities to access their own databases.

We've since moved to a new model whereby team's own their datastores and access methods, speeding up development and reducing negative impact teams can have on each other's data layer. Legacy access continues to be migrated off endpoint by endpoint, database by database. I look forward to the day of never having to look at Twisted code ever again. The framework is aptly named.

If you are looking for a similar solution for MySQL, I currently recommend ProxySQL. It allows for a topography whereby teams can control their own proxy layer and still have most of the benefits outlined above.

excerionsforte · on Dec 22, 2019

Ha, Spouter a middle layer for talking to DB instead of just talking to DB (proxy to do connection pooling, etc could be between). This kind of project came up at my previous job and while I didn't know about Spouter at the time, I knew I didn't like the idea one bit.

Managing a DB + middle layer for a small and an already stressed DB team? Can't imagine that going well. The problem was that they spent too much time doing reviews for simple SQL patterns for CRUD + operational issues like debugging database performance issues or watching devs perform migrations. Of course they have other things to do in their job.

My opinion was the root of the problem is database choice and practices. If all I want to do is simple CRUD, then give me good scaled Redis cluster (AOF enabled), dynamo or something that constrains the query model. DB team can worry about about the cluster management leaving me to worry about structure. I could consult DB team if I needed opinions for multiple cases. Give me a way to watch db performance as well, so DB team does not need to watch over a migration at 12am or some off period.

Sometimes I may need higher performance or more dynamic queries, so just creating a table or elastic search with only indexable values to get the ids works. Use those IDs to fetch from original store.

AmericanChopper · on Dec 22, 2019

> My opinion was the root of the problem is database choice and practices. If all I want to do is simple CRUD, then give me good scaled Redis cluster...

Throughout my entire career I have come across very few simple crud applications that will actually work properly with denormalized data structures. It’s not even about scaling users, it’s about scaling your schema. The very first instance you encounter where a document has a nested array will start to cause problems, and the only way to solve it is to push more complexity and state management onto the client. Which is bad interface design, and quickly gets out of control.

If you really do have a simple crud app, and don’t want to put too much effort into running your RDBMS, just use MySQL imo.

excerionsforte · on Dec 22, 2019

I was working at a company of like 400 employees using MSSQL, so simple CRUD is not limited to simple crud app. Simple CRUD operations are insert, select, update, delete. Given that we auto generated the code and sql (stored procedures) and pasted it into the codebase made me think, why bother with this if I could develop new features without SQL at all.

Why should my queries be a stored prod or prepared stmt if all I need is get, insert, update, delete an object. Yes, the auto generated code wrote to the database as an object as in all fields included in the query. Why? There's no transactions, foreign keys, etc. since some tables where stored in different databases. Using a relational db as object store included people inefficiently using it as an object store.

In my own projects, if I need the speed, columns are indexed while unindexed data is binary. I can explode out (unserialize) the binary data into cache.

Point of my post was centralizing database access tends to be a bottleneck for team efficiency. If not the devs then the db owners. Augment my abilities to execute don't try to completely change it where it brings a whole set of unwanted problems that were not there before.

AmericanChopper · on Dec 22, 2019

I’m not saying you made an inefficient choice. If you only need to interact with objects that have no relations or nested lists, then denormalized data is likely not going to be a problem. But this is a remarkably niche use case. In practice, nearly all cases I’ve seen where people have come to this conclusion, it’s because they didn’t properly analyse their schema to begin with.

That’s not to say all use cases for such technology are niche. A lot of CMS applications can fit into that paradigm very well for instance.

I’d also say that storing binary data in an RDBMS is a seperate anti-pattern all together.

excerionsforte · on Dec 22, 2019

Every column you write to a db is binary at some point. When you choose a type, all you asked the DB to do is interpret the stored binary data in that way. By choosing binary type to store data, you've declared to the db not to interpret the data. You don't hit issues with charset collation/encoding, database interpretation (lack of 128bit ints), and etc. Think about it, what is an integer? It is 4 bytes of data.

Anti pattern depends on use case. Yes, I absolutely do not agree with your generalization. Storing binary data in an RDBMS is not an anti-pattern. Binary data can be of any size. The bigger the binary data the less rows you should expect to store. At some point (maybe the binary data is images/watermarks, etc), you have to choose a replicated file system to use that as the part of the datastore operations.

Furthermore, I've only worked on high qps applications, so maybe I'm a bit biased on how to use the database efficiently. :)

Aeolun · on Dec 22, 2019

Congratulations, you’ve offloaded to your app what your database is designed to deal with.

Storing all data as binary is an anti-pattern too, regardless of your qps.

I wouldn’t store anything but metadata in the database, the blob can be somewhere like S3.

excerionsforte · on Dec 22, 2019

It depends on use case. If you cannot take that then agree to disagree and discussion is no longer warranted. I've been on both sides and take my experiences with me.

You can assert anti pattern, but knowing how to structure your tables matter. SQL has BLOB, BINARY, VARBINARY choose the proper type depending on trade off. Models that include blobs can be structured in DB to avoid IO issues (indexed data includes id). Go to S3 with the id. How does what you are saying differ? My first post literally says this.

Not only can be structured to avoid IO issues, but are protected via a cache where the unindexed data is exploded.

Of course with high QPS I want to offload CPU cycles away from the DB. Scaling the app is easier than a DB.

Scratching my head here. Are you arguing for just use only SQL? Why would I do that.

orf · on Dec 22, 2019

I’m sorry, but no matter what you think storing all unindexed columns as binary is very, very weird. You can write all the documentation you want as to why this is superior to what literally everyone else is doing and has been doing since the dawn of time, but it won’t stop people joining your project thinking that you are mad.

This kind of thing is the stuff of horror stories. Now I have no doubt that you’ve convinced yourself that this is a great approach, but it’s not, and it will be replaced as soon as you leave (assuming you’re not a one man band).

evanelias · on Dec 22, 2019

> I’m sorry, but no matter what you think storing all unindexed columns as binary is very, very weird

You may be surprised to learn that several very large and well-known social networks use this technique -- serializing unindexed columns into a blob (typically compressed at some level) -- for their core product tables. It's not really that "weird", if you consider that literally tens of thousands of engineers work at companies doing this.

Conceptually it's exactly equivalent the same technique as putting all unindexed columns in a single JSON or HSTORE column. Newer companies use those; older companies tend to have something more homegrown, but ultimately equivalent, typically wrapped by a data access service which handles the deserialization automatically.

This technique is especially advantageous if you have a huge number of flexible fields per row, and many of those fields are often missing / default / zero value, and the storage scheme permits just omitting such fields/values entirely. Multiplied by trillions of rows, that results in serious cost savings.

excerionsforte · on Dec 22, 2019

It's a very boring technique really in hindsight. I guess one is really at the top 1% when you work at a high scale company and you see techniques that question the assumptions one holds. Databases like MySQL, as you pointed out, are embracing this technique, but make the inner data indexable.

Also https://www.skeema.io/ looks like a good product that I'll have to checkout. Looks like a better product so far compared to solutions like flyway/liquibase. Full featured suite for DB migration from dev -> prod is exactly what I've been raving about. Like it is "boring" tech as in no one really wants to touch it, but it is the easiest to screw up and products like this really take it to the next level.

The responses I've seen in this post appear to be from people who've never used dynamic fields in a db and advise against it by saying it is an anti pattern. If it is an anti-pattern, bring on all the anti patterns as I'd like to not wake up at night or be pinged for slowness.

evanelias · on Dec 22, 2019

Yeah, it's always interesting seeing the contrast between textbook computer science approaches vs practical real-world solutions. For some reason, people get especially hung up about academic CS concepts in the database world in particular... I've found things are never that clean in the real world at scale.

Thank you for the kind words re: Skeema :)

Aeolun · on Dec 22, 2019

Just because a very large and well known company does something, does not mean it’s a good idea.

If you put something in a JSON column you give your DB some expectation about the data.

If you store your data in a binary column you are just doing what the database already does in it’s backend with it’s normal columns, and any optimizations it might make are lost.

It might result in cost savings, but in my experience cost savings at the expense of transparency/usability are always a bad idea.

evanelias · on Dec 22, 2019

Sure, at small scale, it may be a bad idea. At no point did I say it was universally a good idea, just that it was not in fact "very very weird" as the GP claimed.

At large scale (especially social network scale that I'm describing and personally spent most of the past decade working in), the cost/benefit calculus is quite different. I'm talking 8-figure cost savings here. And there's no negative impact to "transparency/usability", since application queries are all going through a data access layer that understands the serialization scheme, and rando staff members are not allowed to arbitrarily make manual queries against users' data anyway.

As for optimizations: a custom serialization scheme can easily be more optimized than how the database stores JSON/BSON/hstore/etc. A custom scheme can be designed with full knowledge of the application's access patterns, conceptual schema, and field cardinality. Whereas whatever the db natively gives you is inherently designed to be generic, and therefore not necessarily optimized for your particular use-case.

Aeolun · on Dec 23, 2019

I don’t think the argument that it doesn’t hurt transparency holds any water, since you have the exact same limitations on your development environment (e.g. no clue what the database actually stores)

If we’re talking about the scale you’re speaking at, your data storage costs might go down, but at the same time you’re now doing a deserialization action for every row, you have no options of retrieving only half the values, so you’re always going to do that on all fields now. The costs for your data access layer rise correspondingly (but maybe that’s a different team, so it doesn’t matter).

If you find you want to filter on one of those fields now, do you run a reindexing and update operation on a few trillion? rows (guess that might happen regardless though).

Whenever something sounds extremely counterintuitive, people show up defending it because it’s what they have been doing for years and it makes a sort of twisted sense.

I do appreciate all your messages on this topic though, they have been much more informative than the GP’s original comment, and I have no doubt there’s actually people doing it now.

If I ever run into a situation where storing all my fields as one big blob makes most sense, I’ll revise my opinion.

evanelias · on Dec 23, 2019

Thank you for being open-minded, I appreciate it. There's definitely truth to what you're saying -- at a high level these decisions do have major complexity trade-offs, and that inherently harms transparency.

That said, in large companies and large complex systems, it's not unusual for any given team to have no clue about how various far-flung parts of the stack work. Dev environments tend to look quite a bit different there too; for example there's often no notion of a singular local dev database that you can freely query anyway.

In terms of transparency and debugging, this usually means some combination of in-house tooling, plus using a REPL / code to examine persisted values instead of using a db client / SQL. Also automated test suites. Lots of tests -- better to programmatically confirm your code is correct, rather than manual querying. Many SREs view having to SSH to debug a prod server being an anti-pattern, and you could apply the same notion to having to manually query a database.

Regarding deserialization overhead, sure there's some, but deserialization is heavily optimized and ultimately becomes pretty cheap. Consider that more traffic is from apps than web these days, and app traffic = APIs = deserialization for every request itself. And lots more deserialization for each internal service call. And significantly more deserialization for every cache call, since caches (redis, memcached, etc) don't speak in columns / SQL anyway. Adding deserialization to database calls is honestly a drop in the bucket :)

excerionsforte · on Dec 23, 2019

I want to point out here that the comments here in general were not conducive to giving out information. Your comment (shown below) was dismissive so there was no way I was going to elaborate. I recommend that if you are skeptical about something that you ask questions about it and we can discuss it more not simply dismiss it as "anti pattern". I would never talk/comment the way if I wanted to learn from others.

Point of my original comment got lost because of this "binary data in rdbms is anti pattern" detour. It was the least important thing that several people wanted to zero in and pile on.

> "Congratulations, you’ve offloaded to your app what your database is designed to deal with.

Storing all data as binary is an anti-pattern too, regardless of your qps.

I wouldn’t store anything but metadata in the database, the blob can be somewhere like S3."

> If you find you want to filter on one of those fields now, do you run a reindexing and update operation on a few trillion? rows (guess that might happen regardless though).

And also my comment above answered this filtering question from the get-go which underscores this distraction detour: "Sometimes I may need higher performance or more dynamic queries, so just creating a table or elastic search with only indexable values to get the ids works. Use those IDs to fetch from original store."

Let's be honest questions like "how to structure your tables", "how to make your datastore operations more efficient" (non exhaustive list, seems I need to point that out) veers on consulting, which is typically paid for so of course comments on the topic will have less information than you would like. Skeptics, who make blanket assertions, will usually be disregarded. Some may be willing to divulge and others may not be.

Aeolun · on Dec 24, 2019

I think we’re on the same wavelength about this. We would have both dismissed each others’ comments and moved on our way.

I feel the need to thank the people that even so make an effort to elaborate their point.

Don’t get me wrong, I’m still convinced it’s an anti-pattern, it was just very politely explained to me by someone so I feel obliged to respond with the same.

Like I said in the other post, I’ll let you know if I ever reach a point where this is the only reasonable solution and I can’t think of something else.

excerionsforte · on Dec 24, 2019

Learning != convince. May be some type of pessimism gained through time with people trying to convince you among others what is right. You missed the key words and point of what I said as did the others... the big picture. Goodbye.

AmericanChopper · on Dec 22, 2019

> You may be surprised to learn that several very large and well-known social networks use this technique -- serializing unindexed columns into a blob

Not in an RDBMS though. The issues with blob storage I’ve mentioned in this thread relate specifically to RDBMS, and aren’t relevant to technology like Hadoop.

evanelias · on Dec 22, 2019

I'm talking about in a RDBMS, MySQL specifically. I am describing the core sharded relational product tables of well-known social networks.

I've personally worked on the database tiers of multiple social networks and am stating this concretely via many years of first-hand experience. At no point am I talking about Hadoop in any way, not sure why you've assumed that.

AmericanChopper · on Dec 23, 2019

> At no point am I talking about Hadoop in any way, not sure why you've assumed that.

Because every large social network uses Hadoop to solve the type of problems you were describing.

> I am describing the core sharded relational product tables of well-known social networks.

Well depending on how you set your clusters up, you’re probably not actually using an RDBMS for blob storage. If you have a multi master set up with any form of weak consistency guarantees, then you’ve introduced a way to violate key and check constraints, and no longer have a system that complies with Codd’s 12 rules for defining an RDBMS. If you’re writing to a single node, and offloading blob I/O to a seperate node, then you’re essentially just using a MySQL server as an object store external to your RDBMS.

evanelias · on Dec 23, 2019

> Because every large social network uses Hadoop to solve the type of problems you were describing.

Hadoop is not used for serving real-time requests (user-facing page loads) in any social network that I am aware of.

You may be misunderstanding my comments. I'm describing use of the blob column type to store serialized data for unindexed fields. For example, metadata describing a particular user or piece of content. Let's say there are 100 possible fields/properties/columns for each user, but the data is sparse, meaning many of these fields are null/zero/default for each row. Rather than having 100 "real" columns mostly storing NULLs, social networks tend to serialize all these together (omitting the ones with empty values) into a single blob column, ideally using a strategy that maps the field names to a more compact numeric format, and storing this in a binary format (conceptually similar to BSON).

This has nothing to do with "blob storage" meaning images/audio/video.

This has nothing to do with Hadoop. Social networks / UGC sites don't use Hadoop for "blob storage" either, btw.

This has nothing to do with multi-master setups, which are generally avoided by social networks due to risks of split brain and other inconsistency problems completely tangential to this topic.

This has nothing to do with foreign key constraints or check constraints, which are implemented entirely at the application level in a sharded social network, again for reasons completely tangential to this topic.

My original comment in this subthread was in response to a claim that using blobs for unindexed fields is "very very weird". I feel that claim is incorrect because it is a common approach, used e.g. by Facebook, Pinterest, and Tumblr (among others). I think you may have misread or misunderstood this subthread entirely.

AmericanChopper · on Dec 23, 2019

> You may be misunderstanding my comments. I'm describing use of the blob column type to store serialized data for unindexed fields.

I understand perfectly. This is a very inefficient use of RDBMS. It violates every single normal form, and will lead to all of the performance bottlenecks I mentioned.

> This has nothing to do with foreign key constraints or check constraints, which are implemented entirely at the application level

As I stated before. This is not an RDBMS, it’s simply using RDBMS components as a storage service. An RDBMS must enforce key and check constraints, if this is enforced entirely at a higher level then you’ve violated the nonsubversion rule, and you’re dealing with a system that is not an RDBMS.

An RDBMS can only be called so if it conforms to Edgar Codd’s relational model, that is the exclusive definition of an RDBMS. Simply managing relational data on some level does not make an RDBMS, excel does that and it’s certainly not an RDBMS. So does MongoDB, and it’s not an RDBMS either.

evanelias · on Dec 23, 2019

> An RDBMS must enforce key and check constraints, if this is enforced entirely at a higher level then you’ve violated the nonsubversion rule, and you’re dealing with a system that is not an RDBMS.

There are well over a hundred companies using a sharded database layout. This typically entails not using the RDBMS's built-in support for foreign key constraints, since they can't cross shard boundaries.

In your mind does this mean a sharded MySQL or Postgres environment is no longer a RDBMS? That's an extreme view if so, and IMO not a terribly useful or relevant distinction in the real world.

> will lead to all of the performance bottlenecks I mentioned.

In the situation I've described (large number of sparse fields), serializing non-empty non-indexed fields into a single blob results in smaller row sizes than making all the fields into real columns. And generally speaking, smaller row sizes tends to improve performance.

Look, I've literally spent a significant chunk of my life working on some of the largest databases that exist. I post here using my real name and have my information in my profile. If you want to have pseudonymous pedantic semantic arguments about why you claim the largest database fleets are no longer "relational", I suppose that's your prerogative, but kindly refrain from trying to lecture me about vague unsubstantiated "performance bottlenecks" in widely-used real-world approaches.

AmericanChopper · on Dec 23, 2019

But you’re not talking about Postgres or MySQL. You’re talking about a larger bespoke system where one of the components is MySQL. The system you’re describing has:

* An object store

* A bespoke mechanism for defining schemas

* A bespoke mechanism for querying data

* A bespoke mechanism for mutating data

* A bespoke mechanism for enforcing relationships

* A bespoke mechanism for enforcing check constraints

The performance bottlenecks I described are not vague at all. They are well known constraints of RDBMS. To come into a discussions about such constraints and say “well if you only use the RDBMS for object storage, and then implement your own bespoke systems to replace all other RDBMS functionality with something completely different, then you don’t need to worry about that” is completely asinine, and to claim that such a system is still an RDBMS is just factually wrong.

You might be an excellent distributed system engineer, but it really seems like you don’t actually know what an RDBMS is.

evanelias · on Dec 23, 2019

In light of your earlier blatant misstatements about Hadoop, I don't think I'm the one with a knowledge problem here.

The original post here is entitled "Scaling Etsy". The ultimate solution that Etsy used, to solve the problem described in those tweets, was to move to sharded MySQL. With no database-side foreign key constraints. This architecture has served them well for over a decade now.

By your rigid definitions, Etsy isn't using an RDBMS, which would seemingly make all your RDBMS-related comments off-topic.

excerionsforte · on Dec 23, 2019

Proof of how fast FB, Twitter, Tumblr and all sorts of companies is already out there. Some people have a negative reaction (dissent and disbelief) to FB using PHP and that they should have rewrote it in the different language, which is more than absurd. "PHP is an anti pattern" or whatever.

I have seen zero examples so far of the fears and dissent raised by several people. Threats of horror stories, bottlenecks or "just because you can do it doesn't mean you should" are all kinds of dismissive, anti-social and social anti-patterns unconducive to learning.

I think we can leave it at that. No point in this endless replyfest. Some people will stay with their beliefs and never be open to what is done in practice and works in hopes to just being "right".

AmericanChopper · on Dec 23, 2019

I replied directly to you sighting the specific reasons this is non-performant in an RDBMS, and you simply ignored it.

There is also nothing to disbelieve about the system described in the parent comment. It a perfectly sound design pattern. It’s just not an RDBMS. It is a system that uses an RDBMS as an object store, and then uses external systems for enforcing ref and check constraints, schema definition, querying, mutation and consistency. If you’re just using collection of RDBMS to store objects, and performing all other operations in external systems, then RDBMS design constraints are mostly irrelevant.

barrkel · on Dec 25, 2019

It makes sense when you have enough scale that DDLing all the sharded databases for every release from every team is too much overhead.

(Storing unindexed columns (as JSON rather than binary though) is where I'm headed at my job, due to dynamic schema, the cost of coordinating DDL, and the excruciating cost of joins.)

excerionsforte · on Dec 22, 2019

I saw what you posted before: "You are mad". Before you press the send button you should make sure to follow (https://news.ycombinator.com/newsguidelines.html) or your comment can be submitted for moderation. Leave the emotions out if you want to discuss, thanks. I can personally attack you too, but not interested and not worth the time.

As I commented below. Unsubstantiated generalizations are what I distrust. From your comment here, you say "horror stories" where is the substantiation? Notice how my comment said specifically" If I need the speed", "database interpretation" of values. The use of caching for exploded unindexed values. Not sure what it is exactly you don't like other than binary data (512-64kb) in database that is protected by cache. There is literally no change over existing patterns of how to store data in a db other than telling the db not to interpret the values. Are you smearing the idea of lookaside caching in front of a DB (SQL none the less)? Plenty of companies use it.

I prefer to use SQL data types unless I need the speed. What's so unreasonable? I never said that I use it everywhere like you purport it.

Aeolun · on Dec 22, 2019

> knowing how to structure your tables matters

I think we can agree on that. What I don’t think we agree on is that storing all non-indexed columns as binary is a good idea.

There are a lot of non-obvious issues with storing blobs in your relational database.

I’m sure you can mitigate them, but ultimately I think it’s better to just not give yourself that headache.

excerionsforte · on Dec 22, 2019

Unsubstantiated generalizations are the sort I disagree with strongly. Why? because it never looks at the use case. Dismissing all ideas just on the basis of one thing is a documented Anti-Pattern.

Let's do google search 'storing blobs' to find what do people have to say with their experiences.

First link: https://dba.stackexchange.com/questions/2445/should-binary-f...

>Most RDBMS have optimizations for storing BLOBs (eg SQL Server filestream) anyway

Second to last link: https://www.quora.com/Is-it-a-bad-design-to-store-images-as-...

> Bill Karwin is correct that you can't make unequivocal statement that storing images in a database is always bad. I like a lot of his points. Most people who are against images in the DB usually are promoting the only thing they know how to do.

and I'm not talking about images. Object data that I would serialize is can very between 512b to 64kb (Always interesting to see the scale of memory nowadays, kb is so small).

AmericanChopper · on Dec 22, 2019

Those answers are both dubious and very outdated. At the time they were written, the most mature and widely available solution for blob storage was the filesystem, which has its own set of shortcomings for that use case. This is no longer true, and that alone frankly makes those answer obsolete.

Then you have the fact that those answers are ignoring a lot of facts. The answer you referenced from Bill Karwin talks about mutating blobs. This isn’t a reasonable design pattern at all, you create a new blob and update the reference. The downsides to storing blobs in an rdbms are so numerous that you’d really have to have a very strong justification for doing so, and I’m really struggling to think of any that are actually technical.

burpsnard · on Dec 22, 2019

One i saw, was a banking backend; stored blobs retrieved by accountnumber from the mainframe were passed to mid-tier servers that mounted the blobs as r/w db instances.

Sort of like piles of sqlite db's on S3

Aeolun · on Dec 22, 2019

I mean, you do you, but you probably see the pattern in the responses to your comment.

It is theoretically possible that there’s a good use case for storing all fields in a blob. It’s just that none of the ones you mentioned are ones I agree with.

The only thing that makes sense to store in a blob is something that is otherwise incomprehensible (e.g. a jpeg file, doesn’t fit in any other column type), and even that is a bad idea.

excerionsforte · on Dec 22, 2019

Dissenting/disbelief responses do not appear to be from people who've worked with RDBMS at the high scale. I'm content with being able to keep the system simple, sleep at night, not getting pings for database and feeling good about the choice. Big enough reward for me and any others who have dealt with this.

AmericanChopper · on Dec 22, 2019

> you’ve offloaded to your app what your database is designed to deal with.

A lot of NoSQL design patterns lead you into this trap tbh. If you’re trying to make NoSQL work with relational data, you end up in a lot of situations where you have to choose between complex, over-engineered interfaces, or pushing more work into the client. For example, requiring the client to manage intermediate states because you can’t make efficient use of transactions.

AmericanChopper · on Dec 22, 2019

Storing blobs in a DB is nearly always a bad idea. The only reason I can ever think to do it would be if you had some insane business constraints which for some reason made it the least bad option. Even though most RDBMS support the blob type, they are absolutely not optimised for handling it. Blob access is always going to be slower through a DB, blob I/O will take resources away from the rest of your app, due to concurrency control your DB is always going to be a bottleneck and is always going to be difficult to horizontally scale, it will also slow down your backups, and running backups eats into your DB performance again, so you want to make them as efficient as possible. Due to the sophistication of object storage solutions these days (which infinitely scale horizontally), you don’t even need to deal with the pain of the filesystem to avoid using blobs.

beachy · on Dec 22, 2019

All good points, but there are other reasons to store images in the DB.

One is making it much easier to copy graphs of data, e.g copying an account with a bunch of attached users, widgets and their images.

When everything's in the DB this is just "insert into ... Select from". When images are held in e.g s3 this gets an order of magnitude harder.

maxpert · on Dec 22, 2019

This reminds me of similar debate I had with one of my friends falling for NodeJS FOMO. The actual codebase was RoR and he wanted to go for Node for no apparent reason. I was able to talk him out of it but what I understood from discussion was peer pressure of “hey all of this could have been async”. While I admit blocking threads might not be permanent solution, they can still take you pretty far. Not to mention the folks wanted a MERN stack and I believe leaving Postgres for Mongo without a damn good reason is just crazy idea.

demosito666 · on Dec 22, 2019

Isn't SQL -> mongo a meme now? I mean, are people still seriously considering this outside of one or two very specific use cases? At my workplace the only outcome of this would be an awkward silence and weird looks.

aniforprez · on Dec 22, 2019

Yeah having worked on a mongo stack there are N problems caused due to the unstructured nature of mongo and there's so many problems with existing ORM solutions chiefly the fact that they're trying to create structure where it doesn't exist. This was a decision taken aeons ago and there's not much we can do to change it now but hope the guy who made that decision is happy cause I sure am not

celim307 · on Dec 22, 2019

Only thing I’ve found mongo good for is the warm data path where you need slightly more permanence than a straight pub sub (like getting the last 2 minutes of events upon connection) but you really don’t care about throwing the data away after that

mcfunley · on Dec 22, 2019

OP here—if it’s not obvious from the tweets the timeframe of this story is 2007 through 2008.

vegetablepotpie · on Dec 22, 2019

>I managed to not get fired because through this whole thing I was talking shit about it.

I know it’s unprofessional to comment on other developers work negatively, but I do see it often in the work place and it does serve to distance an individual from the group that is failing.

It’s sad but true, and this is an example, employers say they want professionalism but they incentivize against it.

pm90 · on Dec 22, 2019

There’s 0 downsides to criticizing everyone else’s projects and systems so a lot of people do it. This is kinda exacerbated by the fact that we engineers have a reputation for being incorrect about the stability of systems we build ( there are good reasons for this, and I personally don’t think it’s possible to avoid that impression even when it’s not intended).

I wonder if this team would have done better with tools that offer better observability (metrics, logs, tracing etc.). An example of a rewrite that went well https://avc.com/2019/12/grinding/

PragmaticPulp · on Dec 22, 2019

It’s amazing how relatable these anecdotes are. Change a few key words here and there, and this tweetstorm could describe most engineering leadership failures I’ve seen.

Nearly every tweet describes scenarios that can only happen when engineering management is M.I.A. or too inexperienced to recognize when something is a bad idea.

- Attempting to solve problems with a rewrite in a different programming language. This can be the correct long term decision in specific scenarios, but it’s rarely the correct answer for bailing out a sinking ship. You need to focus on fixing the current codebase as a separate initiative rather than going all-in on a rewrite to fix all of your problems. Rewrites take far too long to be band-aid solutions.

- Rewriting open-source software before you can use it. If Django doesn’t fit your startup’s needs, the solution is never to rewrite Django. The solution is to use something else that does fit your needs. Your startup’s web services needs are almost never unique enough to merit rewriting popular open-source frameworks as step 1. Pick a different tool, hire different people if necessary, and get back to focusing on the core business problem. Don’t let the team turn into open-source contributors padding their resume while collecting paychecks from a sinking startup. Save the open-source work for later when you have the extra money to do it right without being a distraction.

- Hiring consultants to confirm your decisions. Consultants can be valuable for adding outside perspective and experience, but the team must be prepared to cede some control to the consultant. If you get to the point where you’re hiring a “Twisted consultant” instead of a web store scaling consultant, you’re just further entrenching the team’s decisions.

- “Nobody was in charge”. Common among startups who pick a “technical cofounder” based on their technical skills, rather than their engineering leadership skills. When you assemble a team of highly motivated, very smart engineers, it’s tempting to assume they can self manage. In my experience, the opposite is true. The more motivated and driven the engineers, the more you need explicit leadership to keep them moving in the same direction. Otherwise you get the next point:

- Multiple competing initiatives to solve the problem in different ways. Letting engineers compete against each other can feel like a good idea to inexperienced managers because it gets people fired up and working long hours to prove their code is the best. That energy quickly turns into a liability as engineers go full political, hoarding information to help their solution succeed while casually sabotaging the competing solutions. In a startup, you need everyone moving in the same direction. It’s okay to have disagreement, but only if everyone can still commit to moving in the one chosen direction. If some people can’t commit, they need to be removed from the company.

- The “drop-in replacement” team. This is just a variation of having engineers compete against each other. Doesn’t work.

- Allowing anyone to discuss “business logic” as if it’s somehow different work. This leads to engineers architecting over-generalized “frameworks” for other people to build upon instead of directly solving the company’s problems. At a startup, never let people discuss “business logic” as something that someone else deals with. Everyone is working on the business logic, period.

I have to admit that when I was younger and less experienced, I plowed right into many of these same mistakes. These days, I’d do everything in my power to shut down the internal competition and aimless wandering of engineering teams.

Ironically, these situations tend to benefit from strategically shrinking headcount. It’s not a fun topic, but it’s crucial for regaining alignment. The key is to remove the dissenters, the saboteurs, the politicians, and the architects creating solutions in a vacuum. You need to keep a cohesive core team that can move fast, commit to one direction even when they disagree, and not let their egos get in the way of doing the right thing.

The real challenge is that those employees tend to fly under the radar. The people quietly doing the important work and shipping things that Just Work can be overshadowed by bombastic, highly opinionated “rockstar” engineers. Founders need to be willing to let those rockstars go when they no longer benefit the company, no matter how good their coding skills might be in isolation. A coordinated team of mediocre but diligent engineers will run circles around a chaotic team of rockstars competing against each other.

pm90 · on Dec 22, 2019

> The real challenge is that those employees tend to fly under the radar. The people quietly doing the important work and shipping things that Just Work can be overshadowed by bombastic, highly opinionated “rockstar” engineers. Founders need to be willing to let those rockstars go when they no longer benefit the company, no matter how good their coding skills might be in isolation. A coordinated team of mediocre but diligent engineers will run circles around a chaotic team of rockstars competing against each other.

So much this. The bombastic Rockstars not only create unnecessary ego driven political fights that are distractions to a business delivering value, but on the occasions they do deliver, it’s often not what was promised. But their social skills (or privilege) allow them to cruise through failures.

Whereas a team of devs that work well together that are hungry to learn, willing to have an open minded conversation about systems, that deliver consistently: they are what really keep the business from falling apart.

mcfunley · on Dec 22, 2019

It’s amazing how relatable these notes are

(I made the tweets)

mistrial9 · on Dec 22, 2019

your excellent summaries here remind me of another era -- substitute the word GUI for "business logic" .. same aphorism ! I recall distinctly meeting with science or business teams while working on GUI desktop app implementation, and inevitably someone would say "just get the model/datastore/file format right and the GUI just sits on top of that" .. not true in practice !! too much effort, too many lost efficiencies, no such thing as fully general. If you want a non-trivial GUI in the end (a few dozen windows with layouts, more than that dialog boxes with control contents and preferences, and big YES for print) .. then (most) everyone has to at least admit there are design choice influences, or actually be touched by the GUI needs -- not intuitive, not perfect, but we always shipped !

coryodaniel · on Dec 22, 2019

> an infamous incident where one of the investors had to drive to Secaucus to physically remove the other engineering founder from the cage.

I really want hear this story.

celim307 · on Dec 22, 2019

I’d love to hear more stories like this

paxys · on Dec 22, 2019

Does anyone have a version that isn't 37 Tweets?

conickal · on Dec 22, 2019

https://twtext.com/article/1194713712516423681

merlincorey · on Dec 22, 2019

The worst part is someone unrolled it on Thread Reader App, and now:

> Sorry this content is not available on Thread Reader

> It has been removed by the author.

redwood · on Dec 22, 2019

Thanks honestly the way we use twitter has dumbed down our culture so much it's infuriating.

booi · on Dec 22, 2019

I gave up.

allard · on Dec 22, 2019

"Interesting" "feature" of their platform that I see monthly, and more recently given the season. I get a message like "where's my table?"

But I'm not a seller. My username is simple and a common first name. I've approached them a few times, and there's no interest in fixing it. Yesterday, I found where I could open/sever the connection between those messages and my email. It's been going on for years.

Many confused users, although I have little idea how long they stay that way.