
How We Are Improving Performance - snackai
https://blog.imgix.com/2017/03/17/note-on-performance.html
======
gingerlime
We were using imgix for a while and were generally happy, but things started
to go downhill at some point, or so it felt anyway. Their support was always a
bit opaque, and the service itself didn't evolve as far as rendering goes --
e.g. composing more than one of the same filter wasn't really possible, such
as adding two watermarks. We also had issues with CORS headers that weren't
resolved, and our end users couldn't get images sometimes...

We switched to hosting our own thumbor (open source), and couldn't be happier.
We pay around a quarter or less of what we did before, even with failover in
place.

We really wanted to use a hosted service. We're not keen on hosting stuff out
of our core business. But in this case it just didn't work out.

EDIT: link to Thumbor
[https://github.com/thumbor/thumbor](https://github.com/thumbor/thumbor)
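
For anyone curious, thumbor's HMAC-SHA1 URL signing is simple enough to sketch in stdlib Python -- the security key and image path below are made-up examples, and this assumes thumbor's standard signing scheme:

```python
import base64
import hashlib
import hmac

def thumbor_url(path: str, security_key: str) -> str:
    """Build a signed thumbor URL using its HMAC-SHA1 signing scheme.

    `path` is everything after the signature, e.g. "300x200/smart/host/img.jpg".
    """
    digest = hmac.new(security_key.encode(), path.encode(), hashlib.sha1).digest()
    signature = base64.urlsafe_b64encode(digest).decode()
    return f"/{signature}/{path}"

# Made-up key and image; a real deployment sets SECURITY_KEY in thumbor.conf.
url = thumbor_url("300x200/smart/example.com/photo.jpg", "MY_SECURE_KEY")
print(url)
```

Signed URLs are what let you expose a self-hosted thumbor publicly without letting anyone render arbitrary transformations.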

~~~
skuhn
Sorry to hear that your experience with imgix wasn't great. Performance,
quality and correctness are obviously super important and non-negotiable
aspects of our service. While we have always heavily focused on these areas,
we're going to double down for the short term.

We do have some interesting solutions in the lab for more elaborate
compositing amongst other things. Part of the challenge there is to stand up a
new API interface, since the URL API gets pretty cumbersome for coordinates
and tons of composites. We have some interesting ideas there, but it needs to
bake a little while longer.

It's also a challenge to get that new stuff out while focusing heavily on our
core areas, but we have to do both. You'll start to see more exciting new
things ship from us over the next few months.

If you have a solution that works for you, that's great and best of luck. If
you do ever feel the urge to check in on what imgix is up to, we would welcome
it. Feel free to reach out to me directly any time (e-mail in profile) and
I'll work with you to get what you need.

~~~
gingerlime
Thanks. Not sure why you're being downvoted. Recognising those shortcomings is
important, and I understand that you might not have all the answers to all
customers.

I think your support could have been more responsive though, and communication
could have been better. You guys could probably have detected that we weren't
happy and about to leave. Reaching out at that point could have made a
difference, perhaps...

We did genuinely want to work with imgix, but those hurdles left us looking
elsewhere. These things happen all the time. Competition is a good thing
overall. I'd like to hope that our feedback helps in some way. Even if it's
hard to hear.

~~~
skuhn
Yes, your feedback is definitely appreciated, and I regret that we didn't act
fast enough to retain your business. I will use it as a teaching exercise with
the rest of the team on what we could do better next time.

The unfortunate reality this time around is that we got overwhelmed and had to
focus on doing work that had the most impact to the largest number of
customers. That meant that more personalized approaches to customer retention
kind of slipped through the cracks. This can't happen again, and I'm adamant
that it won't.

I see image processing services as an important area for innovation and growth
on the web over the next several years. It doesn't mean that imgix has to get
the business of every site on the Internet, but fundamentally I believe that
anyone who cares about how their content is presented should use a service
like imgix. Doing image manipulations by hand, or serving non-optimal formats
or sizes to browsers, is going to be viewed the way serving your site without
a CDN is: it just isn't the state of the art anymore, and the benefits are
worth it.

I welcome competition from other companies who run competing services, or
build-it-yourself solutions like thumbor, because that means we really were on
to something: the market is real and these services provide real value to
customers. I want imgix to be at the front of the pack, and that means doing
the hard work every day to make sure we're there.

------
mabbo
There are two kinds of apologies, and they look very similar on the surface.
There's "Sorry for this problem, but here's why it's not my fault" and there's
"Sorry for this problem. Here's what happened". Am I taking the blame for the
failure of others, or deflecting the blame _onto_ others?

This feels a lot more like the latter, and it's wonderful to see. I'm sure
it's a lot harder to write, but it shows a heck of a lot more integrity.

~~~
hinkley
I generally have issues when an engineering manager blames bad luck. Luck
doesn't happen as often as people like to believe.

For instance, losing three backup generators in ten minutes (as happened to
someone else a while back) might be bad luck, but I'd want to see the
maintenance logs and talk to the vendor. Because what's much more likely is
that they hadn't been maintained properly, or there was a manufacturing
defect in that run and they all came from the same batch (see also, clustered
hard drive failures).

If their upstreams have had intermittent problems before, it was only a matter
of time until they both had problems at the same time. And capacity planning
is one of those strategic decisions that gets dumped on the engineering team,
who never get what they need because they're a 'cost center' -- which is a very
different and highly damaging way of thinking about the cost of doing business
(intrinsic costs).

~~~
skuhn
[I work at imgix and have helped lead the team on the production issues we
face and gathering the details for this blog post]

> If their upstreams have had intermittent problems before

Sorry, this might not have been clear. We had never had major issues with any
of our upstream providers before -- beyond scheduled maintenances there had
never been a service interruption over a number of years.

There's almost always the risk of coordinated failure, no matter how many
levels of redundancy you put into place. The cost gap between 99.99% and 100%
is humongous for this reason.
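
To put rough numbers on that gap (an illustration, not our actual SLA math), the yearly downtime budget implied by each availability level is easy to compute:

```python
# Yearly downtime budget implied by each availability level.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

for pct in (99.9, 99.99, 99.999):
    budget = (1 - pct / 100) * MINUTES_PER_YEAR
    print(f"{pct}% uptime allows ~{budget:.1f} minutes of downtime per year")
```

Each extra nine cuts the budget tenfold, while the redundancy needed to reach it grows far faster than linearly -- hence the cost gap.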

From these failures, we learned some valuable lessons. One of our transit
providers was acquired a year or so back, and it's apparent that things have
dramatically changed in how they operate their service and respond to issues.
We won't be continuing the customer relationship there for much longer unless
we see positive steps taken. Another transit provider has been much more
responsive and the conversations have been more productive -- we can work
through the issue we observed and get back to a positive situation there (I
think).

And then finally, we need more diversity. We turned up another circuit
yesterday, and I'm still working on adding another 1-2 providers in addition
to more peering or direct connect situations. These take a bit of time, but
it's work we need to do before we actually require it, because by then it's
too late. We're working as quickly as possible to get this part of our
infrastructure on rock solid footing.

> capacity planning is one of those strategic decisions that gets dumped on
> the engineering team

Absolutely. We were doing capacity planning, but it wasn't anyone's idea of a
good time.

We are correcting this with the resources we have today, and we're going to do
a better job in the short term. In the long term, this is a non-negotiable
core aspect of someone's job -- either already on the team, or a new role that
we create. It can't be a secondary, rainy day kind of task you take on time
permitting and it also can't be a task that we're not well equipped to handle
on a regular basis.

------
ShirsenduK
They put out a blog post while their support keeps stonewalling and denying
any such issues. Nor do they reply on twitter, or even tweet the blog post.
This seems too little, too late.

Disclaimer: I am currently a customer who had financial loss because of them.

~~~
skuhn
[I work at imgix and have helped lead the team on the production issues we
face and gathering the details for this blog post]

I'm sorry that you had this experience with imgix. We have had challenges in
correctly addressing support issues during this period -- support has been
swamped. We've expanded that team, and we're continuing to do so.

You're right, we didn't tweet this blog post, but we did e-mail it to all
customers with activity since Jan 1 2017.

In this case, we didn't properly convey that the Shopify integration guide we
provided wasn't an official integration with Shopify. It was meant as a best
effort "here's how you can do it" sort of thing. I requested that we take it
down because it wasn't working due to a change on the Shopify side which
prevented us from purging images.

I'd love to talk with you directly to try to make this right for you. If you'd
like, my e-mail is in my profile.

~~~
ShirsenduK
I understand your support might be swamped, but this is what your support
wrote to me:

"Support is on-call during regular business hours, Monday - Friday, and
premium customers can request SLAs for support."

This, while other users complained to me via twitter/email that they were also
not getting any valid responses. I would not have sent those words while my
service wasn't working as expected.

I'm in the process of moving out of Imgix, esp. after your support said they
would be happy to cancel my account if that's what I wanted.

The pushback from your support made me setup a production level thumbor
instance which will go live next week.

Thanks for your reply but I wish you had replied to
[https://twitter.com/troysk704/status/841158793287278593](https://twitter.com/troysk704/status/841158793287278593)
sooner.

~~~
skuhn
OK, totally understand your concerns. I see a few tickets from you in our
ticket tracking system, and in the interest of improving next time, we'll use
this as a teaching example with our support team.

Best of luck with your new solution.

~~~
ShirsenduK
Glad to see that you have acknowledged the problem. I am sure you will fix
them :)

------
trevyn
I wish this post gave more concrete details, so we could learn from it.

 _A critical piece of our network infrastructure failed after 3 years of
correct operation in a way that proved difficult for our network engineering
team to troubleshoot._

Which piece of infrastructure? How did it fail? Why was it difficult to
troubleshoot? Can you do anything to prevent this type of issue in the future?

 _We observed new traffic patterns with significantly lower cache hit rates
than our historical median, and it took us some time to determine whether the
source was abusive in nature or a legitimate new customer use case._

What were the new traffic patterns, and why did this cause lower cache hit
rates? Why did it take longer than expected to determine the nature of the
traffic?

~~~
skuhn
[I work at imgix and have helped lead the team on the production issues we
face and gathering the details for this blog post]

> Which piece of infrastructure?

In this case it was a top-of-rack switch. That rack had a few of our external
DNS resolvers on it. Normally this wouldn't be a problem, because we deploy
services across multiple racks to prevent this kind of failure mode.

In this case it was a bit of a problem, because this was the first rack
deployed at this site and it turned out to not conform to the standard
configuration across the rest of the racks (because it was what we
bootstrapped the site with).

So two things: it was tough to get into the device to troubleshoot it (we
wound up using the serial port infrastructure deployed for this purpose), and
it had a service impact even though it shouldn't have (we have since migrated
DNS resolvers out of this rack and have scheduled a future maintenance to get
this switch's configuration corrected and upgrade its OS).

> New traffic patterns

This is a bit tougher to dig into, but essentially it was a needle-in-a-
haystack kind of situation. Customers can use any imgix URL API parameter they
like -- some of these use cases get pretty complicated for the backend to
handle. Think of the watermark parameter stacked with another URL that is also
an imgix URL with another watermark parameter, five layers deep. These sorts
of operations take a lot more rendering resources than a simple ?h=600&w=600
operation.
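
To make that concrete, here's an illustrative sketch of how one imgix URL can be embedded in another's watermark parameter, each level multiplying the render cost. The `mark`, `w`, and `h` parameter names match our public URL API, but the domain and image paths are made up:

```python
from urllib.parse import urlencode

def imgix_url(host: str, path: str, **params: str) -> str:
    """Assemble a URL; urlencode percent-escapes any nested URL in a value."""
    return f"https://{host}{path}?{urlencode(params)}"

# Innermost layer: a plain watermark image.
layer1 = imgix_url("demo.imgix.net", "/logo.png", w="100")
# Middle layer: a photo watermarked with layer1.
layer2 = imgix_url("demo.imgix.net", "/photo.jpg", w="600", mark=layer1)
# Outer layer: watermarked again with layer2 -- each level adds a render.
layer3 = imgix_url("demo.imgix.net", "/hero.jpg", w="1200", mark=layer2)
print(layer3)
```

Resolving the outer URL means recursively rendering every layer beneath it, which is why these requests cost far more than a simple resize.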

In this case, we observed an influx of these sorts of more difficult
operations. We have various logs and metrics sprinkled throughout the system
-- we use Prometheus, Kafka, Heka, BigQuery, Grafana, and a few other systems
to collect and present the data we need to run the service. We also issue
unique IDs per request to track their path through the system. What we don't
have -- and need -- is one end-to-end view of a request's path through the
system and the system's capacity and performance at each point in our stack.

It turned out that some of the increased rendering traffic we saw wasn't a
matter of suddenly getting more requests -- there were simply many new
permutations for each original object, which lowers the CDN cache hit rate for
a period of time.

The other thing that comes to mind (and I think I'm forgetting a third type
here) is that some of our request parameters require normalization so that
they can utilize the same cache object. Think of parameters like dpr, which is
a floating point number but realistically is only useful up to a few decimal
places. dpr=1.33 and dpr=1.3333333333 are actually the same image, but they
would have different cache keys and require two renders that are effectively
the same object.

We normalize the dpr parameter down to three significant digits. What we found
is that this sort of normalization was necessary for another parameter as
well.
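
That normalization is simple to sketch (illustrative Python, not our actual implementation -- only the three-significant-digit choice comes from the above):

```python
def normalize_dpr(raw: str, sig_digits: int = 3) -> str:
    """Collapse over-precise dpr values onto one canonical cache-key form."""
    value = float(raw)
    # Round to three significant digits so near-identical renders share a key.
    return f"{float(f'{value:.{sig_digits}g}'):g}"

# Both of these now map to the same cache key, so only one render happens.
print(normalize_dpr("1.33"))          # -> 1.33
print(normalize_dpr("1.3333333333"))  # -> 1.33
```

The win is entirely in the cache layer: the normalized string becomes part of the cache key, so equivalent requests stop triggering duplicate renders.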

~~~
billyhoffman
Why isn't this in your blog post? Or at least a separate "technical details"
blog post linked from the announcement?

When you have an incident, especially something that hurts your customers
beyond just a service outage, you come clean, fast, and as completely
transparent as possible.

~~~
skuhn
[I work at imgix and have helped lead the team on the production issues we
face and gathering the details for this blog post]

We posted the blog post yesterday (Friday) afternoon, after getting the copy
ready Thursday evening. Friday was spent acquiring the right customer e-mail
list (a little harder than it sounds and something to improve in the future)
and taking one last pass over the post (among other tasks we're all doing to
actually operate the service, communicate to individual customers, etc.).

It's super important to be open and honest and fast in your response, but it's
also important to communicate correctly. The post in its current form is meant
to be transparent and apologetic but also show that positive steps have been
taken and results are being seen now. Had we missed the mark on transparency,
or honesty, or the notes of positivity, it wouldn't have been the right
message for where we truly are today.

I'm completely open to writing a more in-depth technical post, but that would
have compromised our timing to get this post out on Friday. That's why we went
with the post in its current form -- anything else would have taken longer to
get to our customers.

It's something that I'll think about more this coming week and hopefully get
out soon.

------
rcchen
I wonder if their render service is still backed by racked Mac Pros
([http://photos.imgix.com/racking-mac-pros](http://photos.imgix.com/racking-mac-pros)).
If so, considering the lack of updates around that machine for the last
several years, I wonder if they are planning to remain with that solution.

~~~
ryanSrich
Wow. I just read that.

> We operate our own hardware, run our own datacenters, and manage our own
> network infrastructure.

This seems insane to me -- though I don't work with image processing beyond
"saving for web" in Photoshop, so I could be wrong. Why would they not use AWS
or any number of other cloud providers where capacity planning is handled for
you?

~~~
Exuma
I would guess because of costs. At one point I was going to do a video
processing app and after doing some calculations the costs were insane, and
the savings with actual hardware were tremendous. I imagine image processing
is a lot less than video, but maybe the same deal. Then again... they did rack
mac pros so who the heck knows what they're thinking.

~~~
skuhn
[I work at imgix and have helped lead the team on the production issues we
face and gathering the details for this blog post]

The main reasons are twofold:

A) cost. I haven't done the math yet in 2017, but my recollection of the
difference circa 2015-2016 between GPU instances on AWS and our solution is a
COGS that's about 3x higher on AWS. GPU instance prices have come down a bit
since then, but they're still very expensive and somewhat supply constrained
as well.

B) Technical frameworks. We'll reach an inflection point this year or next
where we've added or changed so much of the rendering pipeline that we've
basically written it from scratch, but until then we're still reaping the
benefit of building on top of CoreImage. It really benefitted us early on in
getting an MVP out faster, and it's continued to pay (slightly reduced) dividends
over the course of the company's life. At some point the dividends will stop,
and then we'll move on.

I understand the skepticism around Mac Pros. There are challenges there -- it
isn't my most favorite solution of all time. It is a practical solution for us
though. I have no room for computer religion or deciding to do things because
they're cool or shiny or new. Anyone that's ever spoken to me for a few
minutes in person about technology stuff can attest to that.

The Macs don't have IPMI, for example. That sucks but we do have power outlet
control and a network installer as a way to back into "out of band management"
for them. They're largely stateless, so it's a tolerable solution.

They do run a Unix-like OS (thank god), and they do represent a good
price/performance ratio for image rendering hardware. We could do a little bit
better with Linux servers and GPUs -- but the upside is only around 10-20%, and there's
still engineering work to do there. It's on our roadmap to explore this more
fully and maybe start taking more concrete steps in that direction. For now,
OSX still gives us more than it takes in terms of cost.

------
Nabi
Thanks for the explanation and the openness. We were a bit frustrated, as we
had just finished migrating to imgix and hit many issues with image delivery.

~~~
skuhn
[I work at imgix and have helped lead the team on the production issues we
face and gathering the details for this blog post]

Sorry that you experienced these issues. I appreciate your understanding.

Please feel free to reach out directly to me (e-mail in profile) if there's
anything you'd like to talk through.

I'm optimistic that you will have a better experience going forward, and I can
guarantee that we are doing everything possible to provide the service to you
and that we will do a better job of communicating in the future.

------
mnutt
_Improved our GIF encoding pathway to increase throughput._

I'd be curious to hear exactly what they did around this. I recently worked on
some gif encoding, and was surprised that there's actually quite a tradeoff to
be made between performance and good color palette choices.
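
To make that tradeoff concrete, here's a toy grayscale sketch (pure stdlib, so treat it as illustrative only -- real GIF encoders work in RGB with median-cut or octree palettes): truncating bits is essentially free, while snapping to the nearest entry of an evenly spaced palette costs a little math but noticeably lowers the mean error.

```python
# Reduce 256 gray levels to an 8-entry palette two ways and compare error.
LEVELS = 8
PALETTE = [round(i * 255 / (LEVELS - 1)) for i in range(LEVELS)]  # evenly spaced

def quantize_fast(v: int) -> int:
    """Cheap: keep the top 3 bits -- one mask, no palette search."""
    return v & 0xE0

def quantize_nearest(v: int) -> int:
    """Better: snap to the nearest palette entry -- costs a little math."""
    return PALETTE[round(v * (LEVELS - 1) / 255)]

def mean_error(quantize) -> float:
    return sum(abs(v - quantize(v)) for v in range(256)) / 256

err_fast = mean_error(quantize_fast)
err_nearest = mean_error(quantize_nearest)
print(f"fast: {err_fast:.2f}  nearest: {err_nearest:.2f}")
```

Scale that up to per-image adaptive palettes in three color channels and you can see why throughput and palette quality pull in opposite directions.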

~~~
skuhn
[I work at imgix and have helped lead the team on the production issues we
face and gathering the details for this blog post]

I don't have the specific details handy, but I can confirm that it's a bit of
a tough one. It's actually much easier to do work in pretty much any other
format -- the very reason why people use GIFs (universal support) is also what
makes them tough to work with (the spec is old and crusty and from 1989).

I'd like to do more of these sorts of posts -- talk about the challenges with
GIF encoding and how imgix has solved them. Probably not this month or
anything, but I'll try to get it on the content roadmap for April or May.

~~~
mnutt
That'd be great. I've found
[https://www.lcdf.org/gifsicle/](https://www.lcdf.org/gifsicle/) to be a good
code resource for optimization, if a bit hard to follow.

------
microcolonel
The company that literally racks Mac Pro trashcans sideways is telling us
that they have made technical decisions that allow them to offer a competitive
price for features.

~~~
rb2k_
Ha, I thought that was a joke.

Apparently not: [https://blog.imgix.com/2015/05/08/racking-mac-pros-hardware-...](https://blog.imgix.com/2015/05/08/racking-mac-pros-hardware-design-for-web-scale.html)

~~~
brianwawok
Can you really not run image rendering code on Linux servers? Vs you know,
spending thousands of dollars racking OSX servers? Because that would make
"increasing capacity" about 3 clicks in AWS....

Perhaps I am missing the competitive advantage of using OSX to resize images,
but it sure seems like the shortcomings are obvious :)

~~~
microcolonel
Well, basically they developed the thing on macbooks with Apple's image
manipulation libraries (which are basically just implementations of published,
standard algorithms). Then they didn't bother porting it to something they
could scale. Instead, they opted to commit to putting genuine Apple hardware
on racks, something that even Apple doesn't do.

It's an understandable direction if your business is at the "Well, I have a
few of these mac pros sitting around, I wonder if I could make a living at
this" stage.

~~~
brianwawok
Sure, great way to POC. Then don't you immediately get funding and go hire
some devs to code it up for real? Vs hiring server guys to rack 100s of Macs?
Seems like something your VC should suggest to you, not "here's a million
bucks, go create rack-mount Macs..."

~~~
skuhn
[I work at imgix and have helped lead the team on the production issues we
face and gathering the details for this blog post]

I've touched on this elsewhere, but here's the crux of the problem:

Racking servers, even if they're cylindrical, is actually much easier than
building an entire real-time image rendering pipeline from scratch.

The chassis that we co-designed and had made works out to a few hundred
dollars per machine, since we aren't buying in giant quantity. When you talk
about servers that run 24/7 and cost between $5k-25k per box, that's not the
dominant expense by any measure. It's also been picked up by some other
companies to use, which is always nice to see -- we aren't keeping it as a
proprietary solution, anyone can buy it from our vendor.

Either way we decided to go, we would still be solving both the image
rendering and machine racking / operation problems simultaneously to different
degrees. As it sits now, we have about 4-5x the number of infrastructure and
imaging engineers as we have datacenter managers / technicians on staff. As it
should be, I think.

This design decision is actually part of our advantage, in that it allows us
to deliver more functionality to our customers at a lower price per unit on
the backend. There's always room for improvement, and we continue to iterate
on it -- some day the Macs will probably be gone from the datacenter, but only
when it's the right move to make.

~~~
brianwawok
> Racking servers, even if they're cylindrical, is actually much easier than
> building an entire real-time image rendering pipeline from scratch.

I think this is still a penny wise, pound foolish decision.

I am sure it is less work to figure out how to buy and rack some trashcans, vs
writing an entire image pipeline for 1.0. I get that.

I also don't know your full specs. There are tons of FOSS image libraries out
there. What % of your workflow do they cover? Are you sure you're comparing
the cost of taking something like Pillow and adding the missing pieces vs
writing an entire image library from scratch? Things like basic filters and
image resizes have been a solved problem on Linux for a long, long time.
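
To illustrate how commodity the basics are, a toy nearest-neighbor resize fits in a few lines (real libraries like Pillow or libvips use proper Lanczos/area filters -- that's where the actual quality work lives):

```python
def resize_nearest(pixels, new_w, new_h):
    """Toy nearest-neighbor resize of a 2D grid of pixel values."""
    old_h, old_w = len(pixels), len(pixels[0])
    return [
        [pixels[y * old_h // new_h][x * old_w // new_w] for x in range(new_w)]
        for y in range(new_h)
    ]

src = [[1, 2], [3, 4]]  # a 2x2 "image"
print(resize_nearest(src, 4, 4))  # each source pixel becomes a 2x2 block
```

The hard part of a production pipeline isn't this core loop; it's filters, color management, format codecs, and doing it all at low latency.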

Hardware lock-in always has a price. People who build on AWS-exclusive things
like Lambda do the same thing, on a smaller scale. It seems like you are
paying this lock-in price, and will continue to pay it. Perhaps it is the
right decision, perhaps not. But I cannot imagine locking my entire company
into the whims of a trendy consumer product line when there is no fundamental
need. I would look really hard at hiring some devs to build out a proper
rendering engine...

~~~
skuhn
Without getting too into the weeds -- we do use OSS and third-party libraries
where it makes sense. Our rendering pipeline incorporates things like mozjpeg,
Intel IPP, jxrlib and other technologies. We're doing the integration and
augmentation work to make this all play nicely together with the right quality
and performance.

However, there is no off-the-shelf solution that solves the same core problem
as the imgix pipeline. Resizing or format conversion, sure. That plus the
types of graphical operations that Photoshop supports? Less common. Now do all
of that in one pass in 30-50ms, streaming to/from the GPU, with the ability to
easily add new workflows and operations? It's our core business, and we own
that workflow completely inside of our stack.

We built on the foundation of CoreImage and that gave us a leg up -- we can
take most of what we've built, replace the foundation and keep going on
another platform (such as Linux). The main challenge is that there is no
equivalent technology to CoreImage on any other platform. PIL (or Pillow) is
an adjacent technology, but I can tell you from experience operating a
PIL-backed image thumbnailing service at scale that it is not a large-scale
solution. Better than ImageMagick for sure, but it still has limitations. Even
something like RenderMan is not tuned for this kind of workflow -- it's
awesome at what it does, but latency and scale are not its central concerns.

There are risks to imgix's business, and this is one of them, but I'll be
pretty surprised if it sinks us. We have some pretty sharp cookies working on
our rendering platform, and while success on that front isn't guaranteed it's
entirely within the realm of possibility.

------
amelius
It says on their homepage that they use a 3rdparty CDN. I'm wondering how you
can improve upon that if you are just a customer from the CDN's point of view.

~~~
skuhn
[I work at imgix and have helped lead the team on the production issues we
face and gathering the details for this blog post]

imgix does utilize a partner to handle some of our image delivery components.
We're a customer of that CDN in that we pay them money for the services they
provide, but we're actually more of a partner in that we have done
considerable integration work and we work closely with them to build what we
need to deliver our product.

We talk with them frequently, and there are pieces that aren't just off-the-
shelf that we utilize to provide the integrated service we sell to our own
customers.

To be clear though, the issues we faced that are discussed in this blog post
are not related to the CDN or delivering already rendered images to end users.
It's around rendering new images, which doesn't involve the CDN.

------
mkup
Font on [https://www.imgix.com/pricing](https://www.imgix.com/pricing) page is
broken in Windows/Firefox (look how small "e" and "a" are displayed):
[https://image.ibb.co/ei7m8v/windows_font_problem.png](https://image.ibb.co/ei7m8v/windows_font_problem.png)

This is without zooming (100%), 120 DPI, Windows 7 x64.

Please use standard fonts and don't overengineer, font hacks like this will
never work as expected across all platforms.

~~~
skuhn
Thanks for pointing this out. The pricing page is on an older version of our
site design, and I believe this may already have been corrected in the design
proofs we've been working on.

I'll prioritize rolling out the fix for this particular issue next week.

------
Thaxll
Imgix is a good example of the burden of legacy code / infrastructure that a
company can't get away from.

