
Flickr – A Year Without a Byte - el_duderino
https://code.flickr.net/2017/01/05/a-year-without-a-byte/
======
ideonexus
One area that may be impacted by this strategy is bulk-downloading your
photos. My wife set her phone to automatically backup all her photos to
flickr. Thinking she could get them back at anytime, she started deleting them
off her phone. After three months, I discovered she was doing this and begged
her to also backup her files to an external hard drive.

Every year we sync all our family photos to have redundant backups. When she
went to get the three months of backups from flickr she got "download error"
after error. She sent me this link and hypothesized that the bulk-download
feature is no longer working because of the need to now first decompress the
files before transmitting them.

Luckily, she was able to get the 25 gigs of family photos down using a third-
party application, but it's another reminder to never wholly trust the
"cloud."

------
Nition
Really interesting that (if I'm reading the article right) you can take an
already-compressed JPEG, recompress it losslessly using another technique to
get _better_ compression than the original compressed JPEG, and then
decompress it to the original JPEG again.

The concept makes sense but I'd never thought of that before.

~~~
Niksko
No, I think you've misunderstood.

What they're saying is that you can take a JPEG compressed image, decompress
to raw pixels, and then recompress with JPEG more efficiently (if you're
careful, no specifics on how this is done), and you save space.

That's why they mention that they're doing it very carefully, because you've
got to make sure that when you decompress the new optimized image that it
pixel for pixel matches the original decompressed image.

~~~
jharsman
That's not how lossless compression of JPEGs works.

Besides removing information from the file that doesn't affect the rendered
image (like EXIF data), lossless recompressors typically replace the huffman
coding of DCT coefficients with a more efficient arithmetic coder. So you
don't start over from raw pixels, but you replace the type of compression used
with a more modern and efficient algorithm. That means ordinary software can't
read the JPEG (since you've essentially created a new format) but you can just
decompress into standard JPEG whenever someone wants to look at the image.
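
As a toy analogy for the coder swap described above (not Flickr's actual method, and not a real JPEG transcoder): the sketch below treats a zlib level-1 stream as a stand-in for the Huffman-coded JPEG and LZMA as a stand-in for the stronger arithmetic coder. The function names `to_archive` and `from_archive` are hypothetical. The point is only that the original bytes come back exactly, provided the weak coder is deterministic and is re-run with the same settings.

```python
import lzma
import zlib

WEAK_LEVEL = 1  # assumed setting of the "original" coder; must match on restore


def to_archive(original: bytes) -> bytes:
    """Undo the weak coder, then re-encode the same payload with a stronger one."""
    payload = zlib.decompress(original)
    return lzma.compress(payload)


def from_archive(archived: bytes) -> bytes:
    """Rebuild the original file byte for byte by re-running the weak coder."""
    payload = lzma.decompress(archived)
    return zlib.compress(payload, WEAK_LEVEL)


# Stand-in for the coefficient data inside a JPEG.
payload = b"DCT coefficient block " * 5000
original = zlib.compress(payload, WEAK_LEVEL)  # what the user "uploaded"
archived = to_archive(original)                # what sits on disk
restored = from_archive(archived)              # what is served on request

print(len(original), len(archived), restored == original)
```

Nothing is decoded to pixels at any point; only the entropy-coding layer changes, which is why the round trip can be byte-exact.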

~~~
rlanday
> Besides removing information from the file that doesn't affect the rendered
> image

You can do this if the goal is pixel perfect accuracy, but Flickr can’t do
this since they have “a long-standing commitment to keeping uploaded images
byte-for-byte intact”…

~~~
londons_explore
I bet a lot of those ICC color profiles are the same across many images
though... You could strip the metadata and keep it in a separate
deduplicated database, and then reassemble when the user accesses the file.
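
A minimal sketch of that idea, assuming (unrealistically) that the metadata is a simple prefix of the file; real JPEGs carry metadata in marker segments, so an actual implementation would also need to record where each segment goes. The `profiles` store and both function names are hypothetical.

```python
import hashlib

# Hypothetical deduplicated metadata store, keyed by content hash.
profiles: dict[str, bytes] = {}


def split_and_store(metadata: bytes, image_data: bytes) -> tuple[str, bytes]:
    """Store the metadata once per distinct value; keep only its key with the image."""
    key = hashlib.sha256(metadata).hexdigest()
    profiles.setdefault(key, metadata)
    return key, image_data


def reassemble(key: str, image_data: bytes) -> bytes:
    """Rebuild the original bytes when the user requests the file."""
    return profiles[key] + image_data


icc = b"<shared ICC profile bytes>"
records = [split_and_store(icc, f"pixels-{i}".encode()) for i in range(3)]

print(len(profiles))            # one stored profile for three images
print(reassemble(*records[0]))  # original bytes of the first image
```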

------
keeganjw
Wow, this is really impressive! A whole year without any new storage for such
a large enterprise is downright miraculous.

~~~
WatchDog
Hard to really judge without knowing how much excess storage they started the
year with.

~~~
londons_explore
Some large organisations store as many as 15 copies of each bit of user
data...

When you have a database which keeps redundant master slave replicas and
mutation logs, a backup system which keeps many previous backups on-site and
offshore on tapes, a storage system that has many RAID mirrors per host, a
distributed filesystem which stores replicated chunks of data across an entire
datacenter to handle host failure, and you keep copies of the entire lot in
multiple datacenters for accessibility during planned downtime and natural
disasters.

------
natch
Another way to reach this goal is to make sure features of the service remain
unfriendly to users. Case in point, when providing a search interface, only
return as many results as you want to, not as many as you have. And don't
bother coalescing spammy results all from the same account into one expandable
item; instead let them flood the results because they used keyword tag spam,
and then cut off the search results after a few pages.

Make it hard to navigate by hiding everything behind hashes, to prevent fair
use downloads. Keep tags in beta for 15+ years.

Of course, when usage goes down, that helps with the problem quite a bit. A
poor experience, even for viewing content, lessens engagement and leads to
lower usage and fewer uploads.

Sadly, I'm afraid a much more extreme data storage reduction approach awaits
faithful users of Flickr.

When Yahoo! bought a large photo blogging site in Taiwan, it simply shut it
down with about six months notice, deleting everything as it did.

~~~
superic
> Case in point, when providing a search interface, only return as many
> results as you want to, not as many as you have.

Do you have an example of this? Seems like a bad way to run search.

> instead let them flood the results because they used keyword tag spam

Tag spam largely doesn't work but it's not impossible. Flagging these results
makes them go away and brings them to our attention.

> Make it hard to navigate by hiding everything behind hashes, to prevent fair
> use downloads.

Huh?

> Keep tags in beta for 15+ years.

Double-plus huh? Flickr is only 13ish years old. It's not ready to get a
driver's license yet.

> Sadly, I'm afraid a much more extreme data storage reduction approach awaits
> faithful users of Flickr.

Nope! Unless you know something I don't?

> When Yahoo! bought a large photo blogging site in Taiwan, it simply shut it
> down with about six months notice, deleting everything as it did.

That sucks. Which one?

~~~
natch
無名小站 (wretch.cc)

------
git-pull
I'll tell you what Flickr isn't spending money or time on: Support.

I have a flickr pro account from 6 or so years ago with hundreds of photos on
it. I've tried over 10 times over a year to contact their support and get
turned over to Tech Support in India that won't even read into your case!

Of course, the original email address I used for my flickr was deleted, so
none of the avenues on Yahoo Help (which is where they redirect you) work. Not
to mention the password may be reset after all the leaks Yahoo had.

So when I see these people on @FlickrHelp on Twitter (No replies) and Flickr
having office parties, it really makes me feel quite disappointed! Yeah sure,
real human touch! Former paying customer who just wants to log in to his account
with tons of priceless photos. And they have a thread of like thousands of
people who can't get into their accounts [1]

At least the employees are having fun with data compression. Sad I can't talk
to an actual human to get access to my account!

[1] [https://www.flickr.com/help/forum/en-us/72157668446997150/](https://www.flickr.com/help/forum/en-us/72157668446997150/)

~~~
acomjean
I have the same issue.

Fortunately I can log in if I remember just correctly what email I used for my
Yahoo! account. My browser fortunately remembers, and the credentials in
Lightroom still allow me to put stuff up there.

~~~
nemeth88
If your browser has the password saved then you can extract it now and save it
to a more secure place, like a password manager. Presumably you could then use
it to update your Yahoo! account email to a current, working address.

As noted here - [http://stackoverflow.com/questions/30013032/prevent-user-
to-...](http://stackoverflow.com/questions/30013032/prevent-user-to-find-
password-through-firebug-chrome-dev-tools) you can get it using your browser's
web inspector dev tool.

~~~
claudius
FWIW, Firefox lets you view the stored passwords directly
(about:preferences#security → Saved Logins → Show Passwords). Very useful
feature :)

------
malisper
For a less ad hoc approach to reducing storage costs, I suggest looking into
the ZFS filesystem. Compression is completely transparent in ZFS. Once you
enable compression in ZFS, all of your files will automatically be compressed
when written, and decompressed when read.

I am currently managing a Postgres cluster with a petabyte of data in it. We
found ZFS to be a great way to reduce overall storage costs. We just switched
our machines to machines running ZFS, and we were suddenly using 1/3rd the
amount of disk space. Although it took us a while to learn all of the gotchas
of ZFS, it wound up saving us a huge amount of $$$.

(As I understand it, ZFS would not have helped in Flickr's case. Since JPEGs
are already compressed, ZFS would not have provided any benefit. Flickr was
able to save storage by using an ad hoc compression algorithm.)
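
A quick way to see why a general-purpose compressor gains little on already-compressed files: the sketch below uses zlib as a stand-in for a filesystem compressor (an assumption; it is not what ZFS runs internally) and random bytes as a stand-in for the high-entropy contents of a JPEG.

```python
import os
import zlib


def compressed_fraction(blob: bytes) -> float:
    """Size after zlib compression, as a fraction of the original size."""
    return len(zlib.compress(blob)) / len(blob)


# Repetitive, database-like text compresses very well.
text_like = b"INSERT INTO photos VALUES (1, 'sunset', 'flickr');\n" * 2000

# High-entropy bytes, standing in for data that is already compressed.
jpeg_like = os.urandom(len(text_like))

for name, blob in [("text-like", text_like), ("jpeg-like", jpeg_like)]:
    print(f"{name}: {compressed_fraction(blob):.1%} of original size")
```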

~~~
devty
The article suggested that they had to carefully balance the processing cost
of compressing and decompressing data and the storage cost.

Did the use of ZFS in your system incur noticeable processing cost in your
system? Any noticeable increase in latency on your system?

~~~
malisper
When we first switched to ZFS, we had some ingestion issues. It turned out the
problems were caused by poor architecture decisions we made and were merely
exacerbated by ZFS. So yes, there was a noticeable increase in latency, but
ZFS was only a tiny piece of it.

------
amelius
> "There are several accepted resize algorithms, but to retain the Flickr
> “look”, we implemented the same Lanczos resize and kernel sharpening
> algorithms that we’ve used for years in CUDA."

How exactly is the Flickr "look" defined?

~~~
bahro
Flickr's resized images have a characteristically (over)sharpened appearance.

~~~
amelius
Does that mean that any artistic depth-of-field effect is (partially)
eliminated by the Flickr look?

~~~
roywiggins
Ordinary image sharpening tends to just make existing edges look sharper; dof
blurriness obliterates the edges. Strong sharpening on a strongly DoF-blurred
image might actually make the visual contrast between in-focus and out-of-
focus areas stronger.

------
harryf
> Peter Norby, Teja Komma, Shijo Joy and Bei Wu formed the core team for our
> zero-storage-budget project. Many others assisted the effort.

Looks like someone's hoping to get hired

------
blakesterz
Interesting read... They're just using "S3 costs" as a comparison we can all
understand? They don't use S3 do they?

~~~
twiss
The first link in the article [1] says Yahoo deploys Ceph internally for
object storage.

[1]: [https://yahooeng.tumblr.com/post/116391291701/yahoo-cloud-
ob...](https://yahooeng.tumblr.com/post/116391291701/yahoo-cloud-object-store-
object-storage-at)

------
WhitneyLand
\- h.265 would help a lot, eventually this (or equivalent patent free codec)
will be mainstream for storage and devices.

\- They say cost is 0.03/GB? Doesn't backblaze b2 charge 0.005/GB? Why isn't
B2 a real option?

~~~
adventured
Can B2 reliably handle something the size of Flickr, with its consumer access
demands (vs an enterprise/business focused service with far lower requests and
bandwidth usage), as opposed to knowing for certain that AWS can? That has to
be a rather dramatic consideration for a service the scale of Flickr. It's a
near certainty that AWS isn't going away for the next decade at least; the
same cannot be said with even remotely the same confidence for Backblaze. It's
the difference between 99% and 90% confidence that your provider is going to
still be around and offering a high quality service ten years out.

~~~
toomuchtodo
It's very likely Backblaze is going to outlive Flickr.

------
devty
Is there a name for an exploit where a malicious client requests rarely-
accessed contents that has been tucked/compressed away in order to overwork
their server?

~~~
lobster_johnson
Cache pollution.
[https://en.m.wikipedia.org/wiki/Cache_pollution](https://en.m.wikipedia.org/wiki/Cache_pollution)

------
Rebelgecko
Is anyone else confused by the baseline thumbnails and current thumbnails bar
graph? I'm not really sure what it is measuring

~~~
ascorbic
The file sizes of the different thumbnail sizes. Baseline shows all the
different sizes they used before, and current thumbnails shows the space saved
by switching to just two thumbnail sizes.

------
ComodoHacker
>On a very high-traffic day, Flickr users upload as many as twenty-five
million photos. These photos require an average of 3.25 megabytes of storage
each, totalling over 80 terabytes of data.

>increasing camera resolution, burst mode and the addition of short animations
(Live Photos) have increased bytes-per-image rapidly enough

>Users only rarely delete or change images once uploaded.

I'm very curious, how much of all this tr.. sorry, sweet memories are _never
ever viewed_ after, say one week from upload.
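
For reference, the upload figures quoted above are consistent with each other, assuming decimal units (1 TB = 1,000,000 MB):

```python
photos_per_day = 25_000_000   # peak-day uploads, from the quoted article text
mb_per_photo = 3.25           # average storage per photo, from the quoted article text

total_tb = photos_per_day * mb_per_photo / 1_000_000  # MB -> TB (decimal)
print(total_tb)  # 81.25
```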

------
rocky1138
Regarding the lossless JPG compression change: the review strategy... was that
done manually by eye or automatically using some sort of image comparison
library?

~~~
jsjohnst
Both

~~~
rocky1138
That's neat. I wonder if they'd consider expanding the details around how they
did that.

------
Insanity
This is really interesting to read! Also quite surprised by some of the
statistics, I had no clue Flickr still saw this much activity.

------
gist
> We’ve pencilled out a model where we place one copy on a slower, but
> underutilized, tape-based system while leaving the other on disk.

Store images on tape? What about degradation of the tape over time? Certainly
seems to be a factor compared to hard drive degradation.

~~~
JaggedJax
I have no personal experience, but think an LTO tape would last longer than a
HDD for long term storage. LTO tapes are rated to 30 years of storage and I
doubt anyone would be using an HDD to store data that long.

~~~
londons_explore
With drives, you are typically checking every few hours that data is readable.
If it becomes unavailable, you create a new replica right away.

With tapes, even though they might last longer, you typically wouldn't scan
and check the data as regularly. That means, if it does go unavailable, there
is a larger window of time for other replicas to fail in before re-replication
completes.

------
pmlnr
Optimization done right.

After a certain point, datacenter growth (both physically and logically) gets
so brutal that you need to consider running things more efficiently.

~~~
imaginenore
They just postponed the inevitable. You can't optimize significantly forever.
They will have to start getting more storage soon.

~~~
wpietri
Every meal you eat is postponing the inevitable. But I imagine you still think
it's worth doing.

------
Sargos
This is fascinating stuff. I would love to see what the Google Photos team is
doing in detail (but that will likely never happen).

~~~
ec109685
Google uses lossy compression on its images, so it has more flexibility than
Flickr does with regards to optimizations.

~~~
webmaven
Google uses lossy compression on its unlimited free "high quality" image
storage[0].

Storing original uncompressed images eats into the 15GB free storage budget,
and uses paid storage upgrades after that.

Unless you have a Pixel phone (free unlimited original uncompressed image
storage).

[0]
[https://support.google.com/photos/answer/6220791](https://support.google.com/photos/answer/6220791)

------
tehlike
i wonder if they considered decompressing on the client. not that it's a great
way especially on mobile, but i was curious how numbers played out :)

------
siavosh
I hope these engineers get a few million $ bonuses.

~~~
photogrammetry
I'm sort of surprised to see you're getting downvoted by all the greedy wanna-
be zillionaire founders. If they had any brains, they'd _be_ the ones
developing this compression code and getting the million-dollar bonuses.

~~~
siavosh
Hah yeah I was surprised by the downvotes. My point was that if you were in
finance, these sorts of numbers generate massive bonuses...but not in other
fields sadly.

------
acd
Switching over to erasure based storage such as Minio could bring down the
cost even further.

[https://www.minio.io/](https://www.minio.io/)

~~~
londons_explore
I want to see a distributed erasure coding system. For example, the data is
distributed across 10 datacenters, and every object is available in one
datacenter without requiring slow costly network roundtrips, but if that
datacenter becomes unavailable it can be recovered from a combination of the
other 9.

~~~
rakoo
I'm not aware of any company doing that (I know backblaze does that but at the
pod level, not at the datacenter level... because they only have one)

You may be interested in Tahoe-LAFS though ([https://tahoe-
lafs.org/trac/tahoe-lafs](https://tahoe-lafs.org/trac/tahoe-lafs)). It has
many good things in it, one of them is that all files get erasure-encoded so
that k nodes out of n are needed to restore the file. When you set a node to
be a storage provider (such as S3, GCS, ...), then you effectively have
erasure encoding over providers: If S3 is down, you can still retrieve your
data from the rest of the providers.
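
To make "k out of n" concrete, here is a minimal sketch of the simplest possible erasure code: k data shards plus one XOR parity shard (n = k + 1), which survives the loss of any single shard. This is an illustration only; it is not Tahoe-LAFS's scheme, and systems that tolerate several simultaneous losses typically use Reed-Solomon-style codes instead. The function names are hypothetical.

```python
from functools import reduce
from typing import List, Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data: bytes, k: int) -> List[bytes]:
    """Split data into k equal shards and append one XOR parity shard."""
    shard_len = -(-len(data) // k)             # ceiling division
    padded = data.ljust(shard_len * k, b"\0")  # zero-pad the last shard
    shards = [padded[i * shard_len:(i + 1) * shard_len] for i in range(k)]
    return shards + [reduce(xor_bytes, shards)]


def recover(shards: List[Optional[bytes]]) -> List[bytes]:
    """Rebuild at most one missing shard by XOR-ing all the survivors."""
    missing = [i for i, s in enumerate(shards) if s is None]
    if len(missing) > 1:
        raise ValueError("single parity can only repair one lost shard")
    if missing:
        survivors = [s for s in shards if s is not None]
        shards[missing[0]] = reduce(xor_bytes, survivors)
    return shards


data = b"family photos, 25 gigs of them"
k = 4
shards = encode(data, k)   # think: one shard per storage provider
shards[2] = None           # one provider becomes unavailable
restored = b"".join(recover(shards)[:k])[:len(data)]
print(restored == data)
```

The storage overhead here is (k + 1) / k, compared with 2x or 3x for plain replication, which is where the cost savings of erasure coding come from.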

Some people have actually tried to do it, as described in the first section in
([https://tahoe-lafs.org/trac/tahoe-lafs/wiki/TipsTricks](https://tahoe-
lafs.org/trac/tahoe-lafs/wiki/TipsTricks)).

------
tiffanyh
TL;DR; the longtail (rarely accessed images) becomes really expensive. So to
save storage (and thus cost), both highly compress and dynamically generate
rarely accessed images.

~~~
bbcbasic
I always wondered if YouTube hates people who put lots of unlisted family
videos for friends that get say 10 or so views. It's costing them in storage
but with virtually no ad revenue.

~~~
digler999
disks have hit 10tb per 3.5" disk, which means you could have _petabytes_ in
each rack. I think we're at the point now where the materials are almost
"free" compared to the amount of information you can fit on them.

Plus, they mine the shit out of that data. Even if they're not earning ad
revenue on it, they're tracking location, usage statistics, I wouldn't be
surprised if their user agreement includes the ability to have AI "watch" the
video and try to mine data about what's happening in the video the same way
they mine your email to build smarter and more effective consumer profiles
about you. So even with nobody watching your video, I wouldn't put it past the
"G-men" to find a way to eke some profit off it.

~~~
motoboi
According to Parkinson's Law, data will always expand to fill available
storage.

We may have petabytes per rack, but 4K videos are coming.

wiki.c2.com/?ParkinsonsLaw

~~~
Dylan16807
> 4K videos are coming

Yes, but screens increase in resolution very slowly, and you're physically
limited in how much resolution you can get on a small sensor.

And video compression is improving quite impressively.

So in the end, storage is growing a lot faster than the bitrate of videos.

I'm sure there will be plenty of demand, it comes through changes in the way
we use things.

~~~
brokenmachine
_> So in the end, storage is growing a lot faster than the bitrate of videos._

Is that true? I've been pricing hard drives recently, and they seem to have
stagnated for the last few years, staying at about the same price, with the
best value per GB at around 4 TB.

I thought h265 took about half the space of h264 for the same quality, but a
4k screen is actually 4x the pixels of 1080p...

So I'm guessing 4k h265 would be about double the bitrate of 1080p h264.
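
The back-of-the-envelope estimate above, written out. It rests on two simplifying assumptions taken from the comment itself: bitrate scales linearly with pixel count, and h265 needs about half the bits of h264 at equal quality.

```python
pixels_1080p = 1920 * 1080
pixels_4k = 3840 * 2160

pixel_ratio = pixels_4k / pixels_1080p   # 4.0: a 4K frame has 4x the pixels
h265_vs_h264 = 0.5                       # assumed codec efficiency gain

# 4K h265 bitrate relative to 1080p h264, under the assumptions above.
relative_bitrate = pixel_ratio * h265_vs_h264
print(relative_bitrate)  # 2.0
```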

~~~
Dylan16807
The price of hard drives is still dropping, there were just some setbacks from
the flood.
[http://www.jcmit.com/diskprice.htm](http://www.jcmit.com/diskprice.htm)

More importantly, don't just look at a year or two. 1080 availability is
significantly more than a decade old. In fact, you should probably be
comparing against 1080 mpeg-2. That gets you nearly the _same_ bitrate between
then and now. In that time hard drives have gone from ~1GB per dollar to 35GB
per dollar.

------
irfanka
So Indians are not "actual humans"?! Wow...

~~~
pyre
While the wording may not have been the best, the "Indian call centre" is just
a place where people have been turned into drones (these exist in the US too,
but the fact that it's been out-sourced to India reflects more on the "cost-
cutting" mentality around support that the company itself has). They have a
script to follow and not much agency to do anything (and possible penalties
for escalating issues rather than "resolving" them at the first tier of
support).

~~~
the_duke
To be fair, most issues that call centers deal with are rather simplistic, and
don't really justify employing highly qualified people there. Apart from those
not being interested in that job anyway.

Of course I'm mad too when I have to deal with a clueless call center rep,
especially when they don't move you up to higher tier support even if they
have no idea how to handle the issue. But as long as they do that, I'm fine
with the concept.

~~~
pyre
The problem is that every issue that is escalated costs _more_ money to
resolve, so there is pressure from managers to make sure that the number of
escalated calls is minimal. The reality is that this is a horrible metric to
go by. The issue that a caller wants resolved is what dictates the level of
support that they need, and the company really has no control over the volume
of calls that require more than tier 1 support (other than building a decent
product, I guess).

------
sean_patel
> get turned over to Tech Support in India

How do you know they are in India? Accent? Asking because India offshore
support peeps used by Dell, Walmart etc are all given white Christian names -
like Mary, John, Adam etc - and also undergo 3 months of rigorous 'American
Accent' training. I know because 2 of my Indian cousins (I am American-Indian
born and raised here) work at such call centers in Mumbai and Chennai
respectively.

So it's quite difficult to discern that they are Indian cos the Companies that
hire them spend millions of $ trying to disguise their voice and tone to make
them sound like they are local / American.

~~~
krisroadruck
be a whole lot cooler if they'd spend those millions just hiring Americans in
the first place. 3 months of voice training isn't going to trump having
english as your first/primary language your whole life, and those voip
connections are the worst.

~~~
t0mas88
Seriously? I would bet that there are more Indians speaking English and
willing to work a call center job than Americans.

The whole "Hire Americans!!1!" kind of viewpoint smells like Trump and
honestly: The world deserves better.

~~~
krisroadruck
I'm about as anti-trump as they get my friend. The notion that the world
deserves better than having native tech support however is something I
disagree with. When you are frustrated to the point of putting up with sitting
on hold to get help, the last thing you want to be confronted with is an
understanding barrier. Between accents and regional colloquialisms this is a
likely reality of foreign tech support.

Also, your assumption is wrong. 95%+ of Americans speak English while only
12.5% of Indians speak it. Even with India's much larger population that still
amounts to roughly a 3rd as many compared to the U.S. There is also a knock-on
effect. When only 12% of the population speak a language they tend to avoid
using it as a primary conversational language and thus only get practice in
professional settings which lowers their fluency in it. Add to that all of the
regional oddities and it's quite different from American English. Consider for
example phrases like "Kindly do the needful" or "out of station". Phrases they
say daily that are completely foreign to American ears.

When you next have a chance compare someone in the Philippines (92% english)
speaking English vs someone from anywhere in India speaking English and it'll
become very apparent to you that the percentage of the population that speaks
the language makes a huge difference in the intelligibility of said speaker.

~~~
hehheh
I think you missed one of the points. How many of those 95% of Americans want
to/are willing to work in a call center?

~~~
imron
Millions?

[https://www.statista.com/statistics/580586/states-with-
the-h...](https://www.statista.com/statistics/580586/states-with-the-highest-
employment-call-centers-us/)

------
revelation
Gotta get the capex down for new management moving in.

------
omarforgotpwd
Well sure, it's not so hard to go a year without adding another byte of
storage if you're Flickr ;)... let's see Instagram or Facebook do it. Are
people even still uploading things to Flickr?

~~~
laurentdc
I think it's a different demographic.

Photographers still use and make projects or series on stuff like Flickr or
Behance.

Instagram and Facebook are more "blog-like" and aimed at snapshots of daily
life, holidays etc to share with friends. It's normal imho that these are much
higher in volume.

