
How Discord Resizes 150M Images Every Day with Go and C++ - b1naryth1ef
https://blog.discordapp.com/how-discord-resizes-150-million-images-every-day-with-go-and-c-c9e98731c65d
======
Xeoncross
Why don't more companies resize images client-side first using <canvas>, and then save the server some work by only asking it to verify the result by:

- resizing to the same size

- removing metadata

This results in much faster transfers (often 10x less bandwidth used for mobile uploads) and reduces server load by "farming out" the work to the clients.

[https://developer.mozilla.org/en-US/docs/Web/API/CanvasRende...](https://developer.mozilla.org/en-US/docs/Web/API/CanvasRenderingContext2D/drawImage)

# Edit: On Keeping Full Resolution Images

Some people mention that keeping the original highest-resolution images is important. I don't think that is true for most applications.

Most apps don't need a high-resolution history as much as current, live engagement, so older photos being smaller isn't a big deal. As technology moves on, you simply start allowing higher-res uploads. YouTube, Facebook, and others have done this fine, as the older stuff is displaced by the new/current/now() content.

In fact, even our highest resolution images are still low-quality for the
future. Pick a good max size for your site (4k?) and resize everything down to
that. In a year, bump it up to 6k, then 10k, etc...

Keeping costs low has its benefits, especially for us startups. Now if you have massive collateral, then knock yourself out.

~~~
AndrewStephens
A few reasons:

1) Although the site serves up images at 1024 pixels (or whatever) today, in
the future they may want larger images. When everyone is rocking 10K monitors
and 6K phone displays, those small images are going to look pretty bad.

2) The original image has some metadata that they want to keep (geolocation,
etc).

3) They think they can do a better and more consistent job resizing than the
various browsers, which is probably true.

~~~
malux85
Agree on 3): most browsers just use linear interpolation when resizing images, which makes sense from a performance point of view but looks terrible. Better to use a bilinear or cubic resize: more computing up front, but better images. This is probably the reason they do it.

~~~
amelius
But soon you can do any type of resizing through WASM on the browser.

~~~
XCSme
You can already do it: just use a library, or implement your own scaling function and don't use the built-in image resize functions.

------
sgk284
There is already an (unofficial Google) image proxy written in Go that is
quite fast, does caching (local or backed by S3/GCS), and does other nice
things like smart cropping:
[https://github.com/willnorris/imageproxy](https://github.com/willnorris/imageproxy)

Seemed like a lot of unnecessary work for them to reimplement a service from
scratch without gaining any major perf benefits over their existing one and
without leaning on an existing well-known and well-built foundation.

~~~
brian-armstrong
Author of the blog post here - it looks like what you linked does its image resizing in pure Go. In our testing we found that these libraries are significantly slower than the C++ resize libraries. I would guess we would need at least 10x as many instances if we used that resizer, though probably a lot more.

------
gourou
Link to the resulting open-source project:

[https://github.com/discordapp/lilliput](https://github.com/discordapp/lilliput)

------
caltrops
I’d be very worried about a security issue with the unsafe C++ code.

You really have to run this kind of complex parsing in a disposable
containerized environment to do it safely. Or do everything carefully and in a
memory safe language.

~~~
bri3d
I'm not sure why this is being downvoted - image processing is one of the most
dangerous parts of a common consumer-facing web software stack. By and large
this is because image container formats are poorly documented, overly broad,
and rely on a lot of tricky binary parsing that's easy to mess up in an unsafe
programming language. It's also one of the most obvious ingress points for
untrusted binary data uploaded by an end-user, which is always going to be
dangerous.

See the persistent, years-long trend where mobile devices and game consoles
get exploited via some combination of libtiff and libpng.

~~~
Impossible
The downvotes are also because it's a somewhat cliche comment on HN now. Anytime anyone is doing anything with C or C++ that is even indirectly web-facing, "this could be unsafe!!!" is an obligatory comment, even though all major tech companies have core components written in C++, and there are big web apps that have been running for years that are mostly written in C or C++. Security is definitely a concern, but these kinds of comments can derail interesting discussion, in the same way complaining about font readability or template choice in an otherwise interesting article can.

~~~
caltrops
This isn’t one of those. Handing large amounts of unvalidated user input to
these libraries is particularly dangerous.

~~~
searealist
Unvalidated user input? What are you talking about, this is about image
resizing. Your buzzwords make no sense.

~~~
LambdaComplex
Yes, and images are user input in this case

------
devwastaken
How is the security? Any sort of image processing is a potential exploitation point. I see it says it uses the 'mature' libjpeg-turbo and libpng libraries, along with giflib for .gifs, but even with full trust of those, the C code, patches, and changes on top could be more exploitation points. You can look through ImageMagick alone to see all the fun things possible when seemingly basic processing turns into exploits.
[https://www.cvedetails.com/vulnerability-list/vendor_id-1749...](https://www.cvedetails.com/vulnerability-list/vendor_id-1749/Imagemagick.html)

~~~
Buttons840
Wow, really? Is there room for another image processing library? Is ImageMagick poorly written, or is image manipulation inherently risky?

~~~
bri3d
ImageMagick is notoriously questionable. It was originally written, I believe,
as a local command-line tool for users to work with their own images, so
security and untrusted input were not primary concerns.

Additionally, image manipulation is inherently challenging - not even due to the actual manipulation of image pixel data, but due to the proliferation of complex image container formats which require binary data manipulation and byte copying in performance-critical code. This is a minefield for secure programming practices because it puts performance and sanity checking at direct odds, as well as encouraging pointer and memory arithmetic and unsafe access.

------
linkmotif
> Today, Media Proxy operates with a median per-image resize of 25ms and a
> median total response latency of 85ms. It resizes more than 150 million
> images every day. Media Proxy runs on an autoscaled GCE group of
> n1-standard-16 host type, peaking at 12 instances on a typical day.

Awesome! <3
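Those figures pencil out neatly. The busy-core estimate below is my own back-of-envelope arithmetic, not from the post, and it treats the median resize time as if it were the mean, so it understates real peak load:

```go
package main

import "fmt"

func main() {
	// Figures from the post: 150M images/day, 25ms median resize,
	// peaking at 12 n1-standard-16 (16-core) instances.
	const (
		imagesPerDay  = 150000000.0
		resizeSeconds = 0.025
		totalCores    = 12 * 16
	)
	perSecond := imagesPerDay / (24 * 60 * 60)
	// Rough count of cores kept busy purely by resizing; real load also
	// includes decode/encode, I/O, and traffic peaks above the daily mean.
	busyCores := perSecond * resizeSeconds
	fmt.Printf("~%.0f images/sec, ~%.0f of %d cores busy resizing\n",
		perSecond, busyCores, totalCores)
}
```

That works out to roughly 1,700 images per second, keeping a few dozen of the fleet's 192 cores occupied with the resize step alone.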

------
throwthisawayt
Did it seem to anyone else that sticking to Python would have been way easier?
It didn’t seem like any of the performance gains were through Golang.

~~~
smaili
I believe this little piece answers your question:

> We likely could have addressed this behavior in Image Proxy, but we had been
> experimenting with using more Go, and it seemed like a good place to try Go
> out.

At the heart of it, they were looking for opportunities to use more Go in their stack, and they deemed this situation a fit.

~~~
fleitz
The age old solution in search of a problem.

~~~
Karrot_Kream
I think that's a bit reductionist, no? There are many reasons they may have been looking to move to Go. Off the top of my head I can think of:

1. Static typing increasing confidence and velocity

2. Better developer-facing tooling increasing velocity

3. More employees knowledgeable about Go than Python

4. More enthusiasm (and therefore faster velocity) around Go development.

The blog post was about the engineering challenges they faced and how they
solved them and I think it was a great write-up in that regard. The post
wasn't about why they switched this service from Python to Go.

~~~
fleitz
It might be, then again I see a lot of wheel reinvention in tech / NIH
syndrome.

I'm the kind of hacker who if a service runs out of memory every 2 hours,
writes a crontab to restart it every hour after X random minutes so they don't
all restart at the same time. It gets a lot of eye rolls from the other
engineers searching for perfection, but it tends to produce services quickly
that are highly reliable.

And look, now the engineers who like Chaos Monkey don't even have to set that up. It's built in.

It looks like most of the savings were in switching from Pillow to OpenCV, something that thumbor already does: [https://github.com/thumbor/opencv-engine](https://github.com/thumbor/opencv-engine)

~~~
brightball
Part of it is just Discord's operating scale. They are already leveraging Elixir clustering for an extremely high rate of concurrency, and when you start thinking about problems from that standpoint, Go becomes a much more natural fit within the stack for low-level microservices.

------
JepZ
Does anybody know how well libvips ([https://github.com/DAddYE/vips](https://github.com/DAddYE/vips)) compares to lilliput performance-wise?

~~~
b1naryth1ef
vips (the Go binding) is included in the benchmarks mentioned in the post, but at the time of running them (~10 months ago) vips pulled 51482954 ns/op on a 1024x1024 test image, whereas pillow-simd managed 3324135.3035 ns/op.

~~~
CapacitorSet
For ease of reading, that's respectively 51 ms and 3 ms.

------
manigandham
Nice, but why? [https://cloudinary.com](https://cloudinary.com),
[https://www.imgix.com](https://www.imgix.com), or
[https://www.filestack.com](https://www.filestack.com) already exist and are
well worth it for 99% of apps. Even at scale, it really doesn't cost that much
to have someone else do it. You can use a thin proxy through your existing CDN
if you want to save on their bandwidth fees.

Also [http://thumbor.org](http://thumbor.org) and
[https://imageresizing.net](https://imageresizing.net) if you want a library
to host yourself which are already very fast and well tested. Put them in a
docker container on a kubernetes cluster and it's all done in an hour.

~~~
zitterbewegung
Maybe it’s because they don’t want a dependency on an external service that could go down?

~~~
manigandham
It's images... seems like a very low risk situation, especially when they are
served from a CDN.

~~~
reificator
As a user, the order of importance for Discord services is:

* Voice

* Text

* Previews (images, gifs, and videos)

Previews going down would be a pretty big deal for my communities based on the
way we use the platform.

~~~
manigandham
It’s just resizing, you still have the source images and can use those.

------
ymse
This post reminded me of a very old article from Yahoo/Tumblr explaining how
they were (ab)using Ceph to generate thumbnails on the fly as pictures were
uploaded using the Ceph OSD plugin interface.

Unfortunately the post seems to have disappeared from the internet (it was
probably around 6 years ago), so here are some other teasers:

[https://yahooeng.tumblr.com/post/116391291701/yahoo-cloud-ob...](https://yahooeng.tumblr.com/post/116391291701/yahoo-cloud-object-store-object-storage-at)

[https://ceph.com/geen-categorie/dynamic-object-interfaces-wi...](https://ceph.com/geen-categorie/dynamic-object-interfaces-with-lua/)

Disclaimer: not affiliated with Ceph apart from being a happy sysadmin.

~~~
noahdesu
Here is a link to a talk I gave last month describing how to use Lua to
generate thumbnails remotely in the Ceph/RADOS OSD servers.

Talk is from Lua workshop 2017. Relevant content begins at 15m40s.

[https://youtu.be/bGQc-PpJAyk?t=15m40s](https://youtu.be/bGQc-PpJAyk?t=15m40s)

------
kylehotchkiss
I wish CloudFront supported resize parameters so we wouldn't have to keep building these or paying a lot for Imgix.

~~~
fleitz
How much would you pay for an image resizing service? I'd been thinking for a
while of putting a fleet of autoscaled thumbor boxes behind cloudfront and
making a billing API for it.

~~~
kylehotchkiss
Imgix's $10 minimum is so much for a personal site with maybe 500 uniques a
month. If you're going for a service like that, think of people like me who
host on s3/cloudfront for $.20/month. But let people scale up to millions of
pageviews a month.

Don't need anything fancy. Just w=? h=? would be great; developers can handle the DPI stuff with srcset.

~~~
manigandham
Cloudinary is free.
[https://cloudinary.com/pricing](https://cloudinary.com/pricing)

------
Const-me
I wonder why people implement such things on the CPU.

PCI Express is ~100 Gbit/s, much faster than any network interface. Internally, a GPU can resize these images an order of magnitude faster than that; see the fillrate columns in any GPU spec.

~~~
acdha
This isn't just resampling an image: it means decoding a variety of image (and even video) formats, decompressing the selected frame, performing the actual resize, and then compressing the result. If the resample doesn't save more than the setup overhead, it'd be an immediate loss. Even if it does, there's an engineering cost, since you now need to make sure that all of your servers have GPUs available, your chosen implementation supports all of them with acceptable quality and error handling, etc.

Since the GPU hardware has become commonplace, there's definitely a lot more
attention on using it in the server space and I think it'll become common in
the next few years but that has a migration cost for early adopters since
you're hitting less mature projects for critical functions. Internet-facing
image processing has a bunch of tedious but important work handling format
variations and errors (it'll be reported as a bug in your software if the
image opens in a browser and/or photoshop), making sure that you handle
gamma/colorspace consistently, etc.

If you're trying to get a production-ready server out the door, it's really tempting not to deal with any of that once you hit the point where it's fast enough that engineering time costs more than the server savings.

~~~
Const-me
> This isn't just resampling an image

GPUs can do that, too: [http://fastcompression.com/products/jpeg/cuda-jpeg.htm](http://fastcompression.com/products/jpeg/cuda-jpeg.htm)

> you now need to make sure that all of your servers have GPUs available

OP is running on google’s cloud: “n1-standard-16 host type, peaking at 12
instances on a typical day.” That instance costs $0.76/hour. Adding NVIDIA
Tesla K80 is $0.7 extra.

> it's really tempting not to deal with any of that

Yeah, that’s understandable. But the original article dealt with a lot of strange technologies to get the performance they wanted, and they ended up much slower, performance-wise, than what’s possible with a GPU.

~~~
acdha
> > This isn't just resampling an image

> GPUs can do that, too: [http://fastcompression.com/products/jpeg/cuda-jpeg.htm](http://fastcompression.com/products/jpeg/cuda-jpeg.htm)

Agreed - but for how many different formats, and how well do those
implementations support all of the various format options for things like bit
depth or palettes, compression variants, etc.? That's not just things like
compliance testing – itself a big problem – but also handling all of the
slightly non-compliant data in the wild which users will inevitably expect to
work.

(I'm somewhat biased having spent time dealing with JPEG 2000 imagery where
various lapses on the standards side meant that it's still common to find
images which don't display correctly in one or more implementations but are
silently reported as correct in others)

Again, I'm not arguing that doing this on a GPU isn't a good idea — the
hardware has become common enough that it's reasonable to assume availability
for anyone who cares — but just that there's significant overhead cost for
anyone who needs to handle images from unconstrained sources. It'll happen but
this kind of thing always takes longer than it seems like it should.

~~~
Const-me
> significant overhead cost for anyone who needs to handle images from
> unconstrained sources

Flickr is doing just that, and they’ve been using GPUs for more than 2 years
already:

[http://code.flickr.net/2015/06/25/real-time-resizing-of-flic...](http://code.flickr.net/2015/06/25/real-time-resizing-of-flickr-images-using-gpus/)

> It'll happen but this kind of thing always takes longer than it seems like
> it should.

I think the main reason for that is lazy software developers reluctant to
learn new stuff.

------
tuananh
Is there any open-source image proxy project that can do this?

E.g., instead of this:

[http://localhost:8080/https://octodex.github.com/images/code...](http://localhost:8080/https://octodex.github.com/images/codercat.jpg)

we could create an alias like octo, and the URL would become this:

[http://localhost:8080/octo/images/codercat.jpg](http://localhost:8080/octo/images/codercat.jpg)

------
0xbear
That’s 1700 images per second. Doable on one (beefy) box. 3 to account for the
diurnal cycle. Am I supposed to be impressed?

~~~
brian-armstrong
Can you link to which resize library you're using? We'd love to see a 90%
further reduction in instances

~~~
mbrumlow
Sorry to be confusing, I am not resizing images, just working with data sets as large as what I imagine 150M images would be. The software I am working on takes point-in-time backups of computers and uploads them to "the cloud", I mean servers in a data center. There they can be virtualized with the click of a button, en masse or one at a time, and near instantly.

This involves transferring, encrypting, compressing, and checksumming terabytes of data an hour (per node). While not exactly resizing images, I would imagine the computational power is on par with the service described. The entire system has about 4 PB or 8 PB in it right now, as backups are pruned (based on what people will pay for storage).

My software has a ton of space to grow and become better, but I think a better story would have been how Discord handles 150M images an hour. If anything, the bandwidth for acquiring the source image would be what I would consider the largest problem, not the CPU time to resize. In fact, as long as your resize code is slightly faster than the download, then streaming it in and out would put your bottleneck entirely on bandwidth.

I will also note I am not a fan of libraries :p but that is not what this is
about.

EDIT:

Also kudos to you, somebody criticized your post and you had the best response
one could have. Inquiring minds are awesome.

~~~
rockostrich
Assuming the average image size is 3 MB, which seems conservative, especially if they're handling GIFs as well, this is 450 TB per day. If you're handling that much data on one beefy machine, then kudos.

