
How Shopify reduced storefront response times with a rewrite - vaillancourtmax
https://engineering.shopify.com/blogs/engineering/how-shopify-reduced-storefront-response-times-rewrite
======
pqdbr
Some of the listed optimizations were:

> We carefully vet what we eager-load depending on the type of request and we
> optimize towards reducing instances of N+1 queries.

> Reducing Memory Allocations

> Implementing Efficient Caching Layers

All of those steps seem pretty standard ways of optimizing a Rails
application. I wish the article had made it clearer why they decided to pursue
such a complex route (the whole custom Lua/nginx routing and two applications
instead of a monolith).

Shopify surely has tons of Rails experts and I assume they pondered a lot
before going for this unusual rewrite, so of course they have their reasons,
but I really didn't understand (from the article) what they accomplished here
that they couldn't have done in the Rails monolith.

You don't need to ditch Rails if you just don't want to use ActiveRecord.
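
For context, the N+1 fix they mention is bread-and-butter ActiveRecord; a
minimal sketch (hypothetical Product/Variant models):

    # N+1: one query for the products, then one query per product for variants.
    Product.limit(50).each { |p| p.variants.to_a }

    # Eager loaded: two queries total, regardless of how many products there are.
    Product.includes(:variants).limit(50).each { |p| p.variants.to_a }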

~~~
pushrax
(contributor here)

The project does still use code from Rails. Some parts of ActiveSupport in
particular are really not worth rewriting; they work fine and already have a
lot of investment behind them.

The MVC part of Rails is not used for this project, because Shopify's
storefront works very differently from a CRUD app and doesn't benefit nearly
as much. The custom code is a lot smaller and easier to understand and
optimize. Outside of storefront, Shopify still benefits a lot from Rails MVC.

I’ll also add that storefront serves a majority of requests made to Shopify
but it’s a surprisingly tiny fraction of the actual code.

~~~
why_only_15
Out of curiosity, why continue to implement in Ruby? If milliseconds are as
important as you mention, an interpreted language will always be slower.

~~~
adventured
They have a very good thing going. Perhaps there is no great reason to bite
off so much at one time. They can take their time and do that later if it
makes enough sense. I would expect it would require a very substantial effort
to rebuild their platform in a different language.

If you're 75/100 of where you want to be on performance, it can be easy to
lose immense amounts of time chasing a 95/100 type ideal performance outcome
when you can maybe far more easily get to 90/100 by making e.g. straightforward
caching improvements to what you already have and not have to rewrite all of
your code.

Good enough is almost always underrated in tech. People destroy opportunity,
time, money, and entire businesses chasing what supposedly lies beyond good
enough.

John Carmack has a good example of this in his Joe Rogan interview [1], in how
id Software burned six years on Rage, making incorrect (in hindsight) choices
that involved trying to do too much. He regrets his old standard line and
approach that it'll be done when it's done. He wishes they had made
compromises instead and shipped Rage several years earlier. That's a pretty
classic storyline in all of tech: taking on far too much when 85% good enough
most likely would have worked just as well.

[1] [https://youtu.be/udlMSe5-zP8?t=8630](https://youtu.be/udlMSe5-zP8?t=8630)

~~~
switch11
very good point

very good example

------
lazyant
I didn't especially care about the technical details; what I like about this
article is that the first thing they mention is the success criteria of the
project (hopefully defined at the very beginning, before any implementation).
Then, on top of that, they created an automated tool to verify those criteria
objectively.

This is a great approach, and unfortunately I don't think many (most?)
software projects start out like that.

Not defining conditions of victory and scope creep are possibly the biggest
risks in software projects.

~~~
chiefalchemist
It's not only software.

1) What is the goal? What defines success?

2) What are the KPIs? How are we going to measure it?

These are baseline questions for any endeavor of substance. Yet they are
rarely defined.

~~~
vlovich123
It's also important to remember that not everything worth doing, or every
"success" state you set, can have KPIs defined (either it's actually
impossible, or the science may not be there yet).

~~~
chiefalchemist
To clarify, I was using KPI in the abstract. That is, how do I/we define
success? What does it look like? How will we know whether we are succeeding or
not?

------
gravypod
Shopify has traditionally been an example people have pointed to for scaling a
monolith with a large growth factor in all areas: team size, features, user
base size, general "scale" of the company.

Does anyone on here, who has worked on this project or internally at Shopify,
feel that this project was successful? Do you think this is the first step of
a long and gradual process in which Shopify will rewrite itself into a
microservice architecture? It seems like the mentality behind this project
shares a lot of commonly claimed benefits of microservices.

> Over the years, we realized that the “storefront” part of Shopify is quite
> different from the other parts of the monolith

Different goals that call for different architectural approaches.

> storefront requests progressively became slower to compute as we saw more
> storefront traffic on the platform. This performance decline led to a direct
> impact on our merchant storefronts’ performance, where time-to-first-byte
> metrics from Shopify servers slowly crept up as time went on

Noisy neighbors.

> We learned a lot during the process of rewriting this critical piece of
> software. The strong foundations of this new implementation make it possible
> to deploy it around the world, closer to buyers everywhere, to reduce
> network latency involved in cross-continental networking, and we continue to
> explore ways to make it even faster while providing the best developer
> experience possible to set us up for the future.

Smaller deployable units; you don't have to deploy all of Shopify at the
edge, only the component that benefits from running there.

~~~
MirrorNext
At Shopify, we build into the monolith unless there’s a strong reason to build
it as a new service.

It makes more sense for us to extract things than to make everything a
microservice.

Storefront makes sense as its own service, so we are making it so.

------
ww520
The performance related bits:

- Handcrafted SQL.

- Reduced memory usage, e.g. using mutable maps.

- Aggressive caching with layers of caches: a DB result cache, an app-level
object cache, and an HTTP cache. Some DB queries are partitioned, and each
partitioned result is cached in a key-value store.
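
A hedged sketch of what such layering can look like in Rails (keys and TTLs
are illustrative, not Shopify's actual setup):

    # Layered lookup: request-local memo -> shared cache (memcached/redis) -> DB.
    def product_payload(product_id)
      @memo ||= {}                              # request-level object cache
      @memo[product_id] ||= Rails.cache.fetch(  # shared cache layer
        "product_payload/#{product_id}", expires_in: 5.minutes
      ) do
        Product.find(product_id).as_json        # DB hit only on a full miss
      end
    end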

------
bdibs
I'm aware that Ruby/Rails isn't that quick, but it seems mind-boggling that an
800ms server response time is considered "tolerated" and 200ms "satisfying".
I've never used Ruby in production, so maybe my reference point is off and
this is more impressive than I'm giving it credit for.
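
For what it's worth, those numbers line up with the standard Apdex convention:
requests are "satisfied" at or below a target T and "tolerating" up to 4T, so
T = 200ms puts the tolerating boundary at exactly 800ms, and the score is

    Apdex_T = (satisfied_count + tolerating_count / 2) / total_count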

~~~
Scarbutt
For page reloads, anything below 300ms is fine.

~~~
dx034
But you should also account for up to 100-200ms network latency (especially
with mobile networks) plus some rendering time. A 200ms server response time
can already lead to a perceived 500ms loading time.

------
tehlike
This is very interesting. N+1 and lazy loading are a very common problem that
profilers can spot, but eager loading also has a cartesian product problem: if
an entity has 6 of one sub-item and 100 of another, you'll end up fetching 600
rows to construct a single object / view model.
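
In ActiveRecord terms (a sketch with hypothetical models; NHibernate's fetch
strategies are analogous), the two behaviors look like this:

    # Single JOIN query: 6 variants x 100 reviews join into 600 rows for one
    # product, with the parent columns duplicated on every row.
    Product.eager_load(:variants, :reviews).find(product_id)

    # Separate queries: 3 queries total (products, variants, reviews), so no
    # cartesian blow-up, at the cost of extra round trips.
    Product.preload(:variants, :reviews).find(product_id)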

I have recently been playing with RavenDB (from my all-time favorite engineer
turned CEO). It approaches most of these as an indexing problem in the
database, where the view models are calculated offline as part of the indexing
pipeline. It approaches the problem from a very pragmatic angle: its goal is
to be a database that is very application-centric.

It remains to be seen whether we'll end up adopting it, but it'll be
interesting to play with.

Disclaimer: I am a former NHibernate contributor, and have been very intimate
with AR features and other pitfalls.

~~~
balfirevic
Didn't NHibernate have the cartesian product problem solved in a neat way by
having various fetch strategies?

You could specify some collections to be eagerly loaded and have NHibernate
issue additional select statements to load the children, producing a maximum
of 2-3 queries (depending on the eager-loading depth) but avoiding both the
N+1 problem and the cartesian row explosion problem.

~~~
tehlike
Yes, that's the common method, but you still end up issuing multiple network
calls. The problem with issuing separate select statements to load the
children is that you have to wait for the first (root) query to finish before
you can issue the others, which adds to the network latency (usually low, but
it depends). It's still not as good as having materialized view models on the
server, where you can issue a single query to get everything you need. The
disadvantage is the storage cost, though.

~~~
balfirevic
I went and looked at the docs to refresh my memory - there was also a subquery
fetch strategy where you didn't have to wait for the root entity to load, but
that comes at the expense of searching through data twice - which might or
might not be worth it, depending on how complicated the query is.

I do wish relational databases (PostgreSQL and SQL Server specifically, since
I work with those) had better support for automatically updated real-time
materialized views.
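
For reference, what PostgreSQL does have today is refresh-on-demand rather
than real-time; a minimal sketch via a Rails migration (names illustrative):

    # Creates a PostgreSQL materialized view; it must be refreshed explicitly,
    # e.g. from a scheduled job, rather than updating in real time.
    class CreateOrderTotals < ActiveRecord::Migration[6.0]
      def up
        execute <<~SQL
          CREATE MATERIALIZED VIEW order_totals AS
            SELECT customer_id, SUM(amount) AS total
            FROM orders GROUP BY customer_id;
          CREATE UNIQUE INDEX ON order_totals (customer_id);
        SQL
      end

      def down
        execute "DROP MATERIALIZED VIEW order_totals;"
      end
    end

    # Later, from a job: REFRESH MATERIALIZED VIEW CONCURRENTLY order_totals;
    # (CONCURRENTLY requires the unique index created above.)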

Anyway, thanks for working on NHibernate - I miss some of its configurability
and advanced capabilities.

~~~
jacques_chester
> _I do wish relational databases (PostgreSQL and SQL Server specifically,
> since I work with those) had better support for automatically updated
> real-time materialized views._

I've been keeping an eye on these folks:
[https://materialize.io/](https://materialize.io/)

------
aloukissas
Naive question: the "storefront" piece seems like it's a static page. Why does
it need SSR? Even so, it could be SSR'ed to static _once_ (kind of how NextJS
does this from 9.3+), then have it served by CDN/edge. I'm probably missing
something here.

~~~
raihansaputra
Throwing opinions here, but after working a bit with Shopify themes, there
might be some reasons to stick with SSR rather than aggressive caching. First,
the storefront can be dynamic depending on visitor region and login/logout
state. Second, Shopify has most of the logic on the backend, even having
non-JS HTML nodes for ordering/add to cart. Third, I don't think the visit
distribution of the stores makes caching economically viable (the top 20% of
stores probably _don't_ account for 60%+ of server load).

------
kn8
Is the new implementation still Rails?

~~~
bsaul
That's also my question after reading this post. When trying to shave off
milliseconds by going for a full rewrite, moving away from Ruby seems like an
obvious decision... at least intuitively.

~~~
sbarre
Obvious how?

Are you going to restructure literally thousands of employees and their teams,
staffed with Rubyists and organized around your current setup?

Will you re-hire and/or re-train everyone?

That doesn't seem so obvious... At the scale of a team like Shopify,
refactoring to a different language is probably a non-starter.

~~~
nicoburns
If you have thousands of Rubyists then you surely have hundreds who also know
other languages? It seems to make sense to use a fast language for the small,
performance-sensitive part of your codebase.

~~~
jashmatthews
"Faster" languages often have big advantages in small benchmarks which get a
lot smaller or even reverse once you're looking at whole application
performance.

Mandelbrot (from the CLBG): Ruby 246s, NodeJS 8s, Java 4s.

Web (fortunes, from the TechEmpower benchmarks): Ruby + Roda + Sequel 51k rps,
NodeJS + Express 46k rps, Java + Dropwizard 62k rps.

~~~
nicoburns
You're comparing Ruby to other options that are still slow:

Java (vertx-postgres) 347k rps, Go (fasthttp) 320k rps, Rust (actix-postgres)
607k rps.

~~~
jashmatthews
Right, but I'm doing that because those are frameworks in other languages
that offer a comparable developer experience.

fasthttp isn't even a web framework. It's not surprising that using a raw HTTP
library is dramatically faster than using a full framework and ORM but it's
also not a sustainable way to build complex web applications with 1000+
developers.

~~~
nicoburns
You don't need 1000 developers working on the small, performance-sensitive
part of your application though. Split it out into its own application, and
then have a small dedicated team.

I can't speak to fasthttp as I haven't used Go much, but actix-web in Rust is
a full framework (not as full as something like Rails, but certainly more than
mature enough to be used for production projects).

~~~
jashmatthews
I built and maintained a critical production web app using Iron for 3 years.
Keeping anything like the performance advantage you see in simple benchmarks
in a real app is a big challenge.

~~~
nicoburns
Well sure, that's why it only makes sense if you actually need the
performance. But if you _do_ need the performance, then implementing it in a
language that is designed to enable those optimisations can make a lot more
sense than trying to hack around the runtime in a slower language.

------
hevelvarik
> An example of these foundations is the decision to design the new
> implementation on top of an active-active replication setup. As a result,
> the new implementation always reads from dedicated read replicas, improving
> performance and reducing load on the primary writers.

Could someone please explain how the ‘as a result’ follows from the
active-active replication setup?

~~~
throwdbaaway
Based on the comment from pushrax, it looks like this is just circular async
replication between the old writer and the new writer. For some reason, the
old implementation had to send both read and write traffic to the old writer,
while the new implementation can do a proper read/write split by reading from
dedicated read replicas hanging off the new writer (again, via async
replication).

Due to power-law traffic distributions, ecommerce generally benefits a lot
from things like caching and read/write splitting. Reading between the lines,
it feels like Shopify may not yet have sufficient experience in dealing with
async replication and all the potential issues caused by replication lag. Fun
times ahead.
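
For reference, this kind of read/write split is first-class in Rails 6; a
minimal sketch (database names are illustrative, not Shopify's):

    # config/database.yml would define `primary` and `primary_replica` entries.
    class ApplicationRecord < ActiveRecord::Base
      self.abstract_class = true
      connects_to database: { writing: :primary, reading: :primary_replica }
    end

    # Route a block of reads to the replica, accepting some replication lag:
    ActiveRecord::Base.connected_to(role: :reading) do
      Product.find(product_id)
    end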

------
thejacenxpress
Unfortunately they are still highly dependent on other APIs.

When San Diego Comic-Con went live on funko.com (Shopify), the website was
fine but the checkout was bottlenecked by the API calls to shipping providers.
Many were never able to check out, and Funko had to issue an apology.

It's unfortunate that no matter how much you improve your own product, you may
still be dependent on others.

~~~
randomdude402
I'm interested to know more about this. I've used about five different
e-commerce solutions and they all make API calls to shipping providers. What
was different here?

~~~
thejacenxpress
The amount of traffic was too high. Unsure if they were being throttled or if
they used a task queue that went bad. Who knows.

[https://comicbook.com/irl/news/funko-pop-comic-con-2020-excl...](https://comicbook.com/irl/news/funko-pop-comic-con-2020-exclusives-fans-bitter/#2)

------
momonga
I wish the article detailed the performance issues with the old
implementation, and why those issues necessitated a rewrite (other than
"strong primitives" and "difficult to retrofit").

------
spondyl
I'd be interested to know if setting Service Level Objectives was considered
as an alternative to using Apdex, given that it's nice to be able to then
calculate an error budget out of your SLO and use that to determine whether
changes were impacting the customer experience or not. Well, so the theory
goes anyway. Actually doing it in practice is a whole different story ;)

------
switch11
Can anyone add to that article data on what users saw in terms of response
time (and perceived response time), and what users are seeing after the
improvements?


We had evaluated spotify for one of our projects, and aesthetically it is
really good. However, time-wise their store takes forever to do stuff.

This was a couple of years back, so hopefully things are much better now.

Basically, the article covers how much better THE TEAM doing the coding feels.

What is the effect on the users using the stores?

~~~
bradfeehan
spotify?

------
gadders
The bit I found interesting in this is how they compare and verify that two
web pages rendered by different methods "match".

I wonder how you would do that? You can't just hash the HTML. Do you take
screenshots and compare?
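
One plausible approach (a sketch, not necessarily what Shopify does): parse
both responses, strip the parts that legitimately differ per request (CSRF
tokens, nonces, timestamps), and compare the normalized documents:

    require "nokogiri"

    # Normalize an HTML document so two renderings of the same page compare
    # equal despite per-request noise. The stripped selectors are examples.
    def normalized(html)
      doc = Nokogiri::HTML(html)
      doc.css('meta[name="csrf-token"], input[name="authenticity_token"]')
         .each(&:remove)
      doc.traverse do |node|
        node.content = node.content.strip.squeeze(" ") if node.text?
      end
      doc.to_html
    end

    def pages_match?(legacy_html, new_html)
      normalized(legacy_html) == normalized(new_html)
    end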

------
notsureaboutpg
Most commenters are focused on the optimizations made, but I actually think
the custom routing and verification mechanism is the interesting bit.

That kind of tool could be handy in lots of scenarios (comparing the same
service written in two different languages, or with different dependencies,
etc.).

But how does their verifier mechanism deal with changes in the production
database between responses? If the legacy service responds first and the new
service responds afterwards, couldn't the data in the database change between
the two responses (the request being the same), resulting in the responses
failing verification when they otherwise should have passed? How do they
maneuver around that issue?

Great write-up by the way! I really liked it :)

~~~
pushrax
Differing inputs causing verification failures is indeed an issue. In addition
to data access races, replication latency also causes this. The legacy service
always reads from the primary MySQL instances per shard, but the new service
always reads from replicas for scalability and geo distribution.

One slightly helpful mitigation we have in place relies on a data versioning
system meant for cache invalidation. The version is incremented after data
changes (with debouncing). To reduce false negatives, we throw out
verification requests where the two systems saw different data versions. It's
far from perfect, but it's been effective enough.
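
In pseudocode, that gating presumably looks something like the following
(names invented; the real system is internal):

    # Discard comparisons where the two systems read different data versions,
    # instead of counting them as verification failures.
    def verify(legacy_result, new_result)
      return :skipped if legacy_result.data_version != new_result.data_version
      legacy_result.body == new_result.body ? :match : :mismatch
    end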

------
polote
tldr: rewrote the backend focusing on speed

Which is good. At Reddit they would have tried to rewrite everything in
ReasonML and then tried to prove at the end that it is now faster.

