How Etsy Prepared for Historic Volumes of Holiday Traffic in 2020 (codeascraft.com)
24 points by kiyanwang on March 1, 2021 | hide | past | favorite | 17 comments


I run a small 3D-printed jewelry business (https://lulimjewelry.com) where I sell on my own website and also on Etsy.com. The Etsy team handled Christmas well this year (although it was my first Christmas on the platform, so I don't really have much to compare it to). In particular, they handled the communication around the mail delays very well, both with me and with my customers. It's really interesting to see the tech decisions that had to be made to make those communications possible!


Does anyone know why the first deployment of the day needs to compile the whole codebase on all hosts? I stumbled over this part.


I think it's simply a badly worded way of saying they need to push the build to 1,000+ hosts after compiling it.


Let's hope they're not trying to log in with Google on their phones, because that hasn't worked for about a year now.


I ordered quite a few items from Etsy over the holidays, all by logging in with Google on my Android phone.


Just tried it with my iPhone 12 Pro and it worked fine. Logged in with Google on the first try.


The article was rather a disappointment compared to other articles Etsy has put up on their blog.

It was lackluster overall, and the TL;DR can be entirely summed up as: they more than doubled traffic without issue (with no real technical insights).


I agree. I learnt nothing from it.


The need for code freezes around high-traffic events is usually a sign of a very poor engineering culture, where patching extreme tech debt is a much higher priority than technical excellence or product improvements that help the customers.

A code semi-freeze lasting weeks is incredibly alarming. How much unremediated tech debt must there be to cause that?

That is a massive red flag that Etsy engineering culture must be something like the SRE tail wagging the product engineering dog. I would bet that you only get rewarded at Etsy for being a visible firefighter, stopping failures on nights or weekends, jumping in front of outage bullets, and that if you take the time to design systems robust enough and safe enough that they don’t need code freezes in the first place, you are never going to get recognized or rewarded.

It's baffling that not only did they have to resort to a long code freeze, but that they even approved a public blog post bragging about it.

Boy, that tells you a lot about the culture. Not going to be seeking jobs there any time soon.

To put this in context, I work for an ecommerce company that routinely sees 10-15x traffic around Black Friday -> Cyber Monday, and around Christmas to New Year. We don't do any code freezes for the load, almost all of which is handled through auto-scaling in our on-prem data centers.
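The auto-scaling approach described above can be sketched as a simple threshold-based control loop. This is purely illustrative: the thresholds, host limits, and function names are my assumptions, not any company's actual implementation.

```python
# Hypothetical sketch of a threshold-based auto-scaling decision loop of
# the kind that can absorb seasonal 10-15x traffic spikes. All names and
# numbers here are assumptions for illustration only.

SCALE_UP_CPU = 0.70    # add capacity above 70% average CPU utilization
SCALE_DOWN_CPU = 0.30  # remove capacity below 30% average CPU utilization
MIN_HOSTS, MAX_HOSTS = 4, 200


def desired_host_count(current_hosts: int, avg_cpu: float) -> int:
    """Return the fleet size to converge toward for one evaluation cycle."""
    if avg_cpu > SCALE_UP_CPU:
        # Grow proportionally to how far utilization exceeds the target,
        # so a sudden 10x spike triggers a large scale-up in one cycle.
        target = int(current_hosts * avg_cpu / SCALE_UP_CPU) + 1
    elif avg_cpu < SCALE_DOWN_CPU:
        # Shrink conservatively, one host at a time, to avoid flapping.
        target = current_hosts - 1
    else:
        target = current_hosts
    return max(MIN_HOSTS, min(MAX_HOSTS, target))
```

An external loop would periodically feed this function fresh utilization metrics and reconcile the fleet toward the returned count; the proportional scale-up combined with slow scale-down is a common way to react fast to spikes without oscillating.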


As far as I'm aware this is a very common practice for orgs with high seasonality. Have you worked somewhere with high traffic and high seasonality that doesn't implement code freezes around known peaks?


Yes, I work at a high-traffic (Alexa rank above 400) ecommerce site specializing in stock photography, music, and digital assets. We see a 10-15x spike around Black Friday and around Christmas to New Year's, in direct consumer sales, inventory uploads, new sign-ups (customers and sellers), and ad inventory sales.

We don't do code freezes except in very special, isolated cases of known failure risk. When we do need an isolated code freeze for one system, that is an "all hands on deck, this had better get fixed" kind of moment. If the larger system of most of our core services needed an extended code freeze to be safe, that would probably (justifiably) result in firing senior engineering leaders.

If you have a CI/CD system and you can't trust normal automated deploys at any time, that is a huge problem. If this happened one year and you needed the code freeze, so be it: that's just responsible risk assessment. If you intentionally plan the system to work this way every year, that is egregiously bad engineering leadership.


An opinion no doubt informed by extensive experience in high-volume ecommerce.


I run a machine learning team at a high-traffic ecommerce company (specializing in stock photography and digital assets). We see an even bigger seasonal spike than Etsy does.


With no code chill at all? Not even for artifacts which, presumably unlike what your team works on, lie in the critical path for revenue generation aka checkout?

If so, that's very impressive, but also very atypical. Source: I work in high-volume ecommerce, have friends and colleagues at other companies who do likewise, and no one I've talked to about it works anywhere that doesn't implement at least some controls on at least checkout-path deployments during holiday prep and through the holiday proper. It isn't a tech debt thing, it's just good business sense: the time of year when people give you by far the most money is the time of year when you least want to risk causing problems for the people trying to give it to you. The problem isn't untrustworthy systems or bad engineering culture; the problem is that anything short of perfection has a measurable impact on revenue, and humans achieve perfection with less than perfect reliability.

Even where I work, that revenue impact can easily amount to a major problem if we let a significant bug slip through. People buy our product for its own sake, and will do so from Amazon or Walmart or wherever if they can't get it direct, but our margin on sales through second-party vendors is nothing like as good as on direct sales, so we still take a hit. For Etsy, it's much worse; on sales through other storefronts, they get nothing. What's more, a bad enough break during the holidays would also risk some fraction of their subscriber revenue from sellers who, fed up at losing out on what would otherwise be their highest revenue of the year, might go elsewhere.

It's easy to talk shit about tech debt and bad engineering culture, I get that. It's also poorly founded in fact, and evinces an apparently questionable grasp on some fundamentals of the business, besides. I don't know whether you have represented yourself well or poorly by creating this impression, but it is the impression you've created, and if that bothers you then I might suggest trying to do otherwise next time.


The machine learning team is responsible for all search and discovery features that users engage with when searching our inventory, some of the highest-traffic, most uptime-demanding services I've seen anywhere in my 15-year career. We also run all of the image and language processing that occurs in real time in the image editor post-upload, along with an HA queue system that aggregates requests to route them to shared GPUs on the backend. In our case, many machine learning systems are on the critical path for customers, and we have a significant platform team within machine learning specifically to ensure we can support this, meet uptime demands, design auto-scaling solutions in our on-prem data centers, execute failover and disaster recovery (and drills to simulate these), and more. And so does every other team.
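The request-aggregating queue described above is essentially dynamic batching in front of a shared GPU. Here is a minimal sketch of the idea, assuming a single worker thread per GPU; batch sizes, timeouts, and the `infer` stub are my own illustrative assumptions, not the commenter's actual system.

```python
import queue
import threading

# Illustrative sketch of an aggregating queue in front of a shared GPU:
# many callers enqueue (request, reply_queue) pairs, and one worker per
# GPU drains the queue in batches so the GPU runs on grouped inputs.
# Batch cap, timeout, and infer() are hypothetical placeholders.

MAX_BATCH = 32          # cap batch size to bound GPU memory use
BATCH_TIMEOUT_S = 0.01  # flush a partial batch after 10 ms


def infer(batch):
    """Stand-in for a real batched GPU model call; echoes inputs here."""
    return [("result", item) for item in batch]


def gpu_worker(requests: "queue.Queue", stop: threading.Event):
    """Drain the shared queue, grouping requests into batches for one GPU."""
    while not stop.is_set() or not requests.empty():
        batch, replies = [], []
        try:
            item, reply = requests.get(timeout=BATCH_TIMEOUT_S)
        except queue.Empty:
            continue  # nothing arrived within the flush window
        batch.append(item)
        replies.append(reply)
        # Opportunistically pull more queued work up to the batch cap.
        while len(batch) < MAX_BATCH:
            try:
                item, reply = requests.get_nowait()
            except queue.Empty:
                break
            batch.append(item)
            replies.append(reply)
        # Run the whole batch at once, then fan results back out.
        for reply_q, result in zip(replies, infer(batch)):
            reply_q.put(result)
```

A caller submits work by enqueuing `(payload, reply_queue)` and blocking on its private reply queue; the batching amortizes per-call GPU overhead across concurrent requests, which is the main reason to share GPUs behind a queue at all.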


Okay, that makes sense. Why do you assume, or at least give the very strong impression of assuming, that no one else does anything like this? Or even just that Etsy doesn't?


Because they just wrote a big article about how they created a culture, developed special vocab ("slush"), and intentionally plan on a multi-year basis all around a code freeze that lasts not just a day, not just a weekend, but weeks, and affects all systems (which means either that no systems are in a state of proven resiliency, or that leadership doesn't trust engineers, or both), and they are proud of this.

That is a whole mess of giant neon red flags that something is wrong here regarding engineering for resilience and trusting your deployment system to catch errors.

Additionally, for a freeze lasting weeks, unfreezing and resuming merges from a backlog of frozen changes would likely introduce far greater risk to revenue or to customers than maintaining a trustworthy deployment workflow through high-traffic events. I would even question the fundamental premise that this is safer or beneficial to customers at all. It seems much, much more likely to be a "cover your ass" method to prevent any possible blowback from outages due to unremediated tech debt that weak leadership is not capable of defending as a worthwhile priority in quarterly planning.



