
Apple downloads ~45 TB of models per day from our S3 bucket - julien_c
https://twitter.com/julien_c/status/1173669642629537795
======
cs702
"Almost everyone" working on NLP uses one of huggingface's pretrained models at
one point or another:
[https://github.com/huggingface/pytorch-
transformers](https://github.com/huggingface/pytorch-transformers)

It's so damn convenient, and so nicely done.

And they keep doing neat things like this one:
[https://github.com/huggingface/swift-coreml-
transformers](https://github.com/huggingface/swift-coreml-transformers)

Kudos to Julien Chaumond et al for their work!

~~~
BlueGh0st
For anyone else initially confused, NLP in this context is "Natural Language
Processing."

~~~
ShteiLoups
As opposed to?

~~~
alanh
neuro-linguistic programming, for one

------
nurettin
This is probably apple's continuous integration tests, lazily written to
download the whole thing every time someone merges a commit.

~~~
mister_hn
that's really stupid. I mean, I would have set up a caching repository
(Sonatype Nexus maybe?), downloaded everything there, and used that repository.
In the tweets the author says they've blocked downloads from Apple IPs, so now
their pipeline is broken.

~~~
bhaak
I bet that 95% of all in-house CI would break if it didn't have access to the
internet. I also bet that 95% of those wouldn't need that access if they were
properly designed.

We rarely hear about CI servers being taken over but it has to happen
frequently enough.

~~~
mickeyp
Most enterprises use Artifactory or something similar -- or should anyway.
Once you start enforcing "no internet for CI" you start to see how poor some
ecosystems are. I'm looking at you, Javascript ecosystem packages, with your
hardcoded mystery URLs that you sneakily download artefacts from...

~~~
PopeDotNinja
We had a situation at work where some developers really got attached to
downloading npm modules at container start time to make builds fast. But then
starting the containers took 15 minutes :P

------
CobrastanJorji
If you host large, publicly available data in a cloud blob service, but you
don't have a budget for it, one option is to use the "Requester Pays" feature
that Amazon and Google provide. This makes the data available to anyone to
download, but they need to pay the download cost themselves.

This is at the tradeoff of making your data significantly more irritating to
access, as it's no longer just plugging in a URL into a program, plus everyone
who wants your dataset needs to set up a billing account with Amazon or
Google.
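For what it's worth, on the consumer side a Requester Pays download is just one extra parameter. A minimal sketch with boto3 (the bucket and key names here are made up):

```python
def requester_pays_args(bucket, key):
    """Arguments for an S3 GET where the caller's AWS account is billed."""
    return {"Bucket": bucket, "Key": key, "RequestPayer": "requester"}

# With boto3 (not executed here), the downloader pays for the egress:
#   import boto3
#   s3 = boto3.client("s3")
#   obj = s3.get_object(**requester_pays_args("some-models-bucket", "bert/model.bin"))
print(requester_pays_args("some-models-bucket", "bert/model.bin")["RequestPayer"])
```

Anonymous requests are rejected on Requester Pays buckets, which is exactly why everyone downloading needs a billing account.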

~~~
baroffoos
Or just post a magnet link.

~~~
CobrastanJorji
Sure, that's a great option for helping to reduce the cost for well-meaning
general use, but the other way makes your costs 100% predictable, which is
great if you're on an academic budget (but, again, way more annoying for the
downloaders unless they're also using AWS).

------
dharmon
I don't have high hopes for his business prospects if this is how he handles
one of the richest companies in the world clearly having a high need for
something his company offers.

Maybe spend less time on Twitter and more on your business model?

~~~
httpz
They're basically bragging that they have something Apple really wants. Now
they have a bunch of people at least interested in what they've got. I'd say
that's not bad PR.

~~~
joatmon-snoo
Also a great way to throw out massive red flags to any enterprise user that
cares about privacy and non-disclosure.

IP address data is pretty sensitive information, and throwing it out there
like this, even in aggregate, is not OK because of what it shows.

No matter how much PR this gets, this goes both ways.

~~~
mariomariomario
This isn't sensitive information. Anyone with a BGP session can have this
information. [1]

[1] = [https://bgp.he.net/AS714#_prefixes](https://bgp.he.net/AS714#_prefixes)

~~~
vonseel
Pretty amazing that a single company can own an entire block of IP space, if I
understand this correctly. Approx how many addresses is this?

~~~
brodie78382
I used to work for HP and someone explained to me a select few companies got
/8's when the internet was still young. HP got one, Compaq had one which HP
now also owns. I was basically told if you had a /8 you didn't give it up
because of how valuable and rare they now are (this was around 2010, too). GE,
Kodak, Apple, and Microsoft were a few other names that came up in that
discussion as well.
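For scale, the arithmetic on a /8 is simple: 8 network bits leave 24 host bits.

```python
def addresses(prefix_len, total_bits=32):
    """Number of IPv4 addresses covered by a /prefix_len block."""
    return 2 ** (total_bits - prefix_len)

print(addresses(8))   # 16777216, i.e. roughly 16.8 million addresses in a /8
```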

~~~
flotwig
GE actually sold their entire 3.0.0.0/8 block off to AWS a few years ago.

It's a little awkward, since a lot of internal software is still configured to
whitelist all access from that space, because it was a constant for so long.

~~~
owenmarshall
We called this threxit internally. (Get it? Three dot exit? ;))

And as far as I know we haven’t stopped threxiting — at least they hadn’t when
I left. It turns out unwinding IT systems that have had stable IP addresses
for 30+ years in a year or two is tricky business.

------
emeraldd
This looks kind of interesting:

[https://github.com/huggingface/pytorch-pretrained-
BigGAN/blo...](https://github.com/huggingface/pytorch-pretrained-
BigGAN/blob/6ae20a35a051816d66811d85597033623a8ac888/pytorch_pretrained_biggan/model.py#L29)

When you look further down you find:

[https://github.com/huggingface/pytorch-pretrained-
BigGAN/blo...](https://github.com/huggingface/pytorch-pretrained-
BigGAN/blob/6ae20a35a051816d66811d85597033623a8ac888/pytorch_pretrained_biggan/model.py#L253-L281)

And that's just a quick search for s3 in the repo. It would not surprise me in
the least to discover that `from_pretrained` pulls from one of those s3
resources. There's probably other stuff like that in the code that could be
causing equally nasty heartache .. especially if non-persistent containers are
involved....

(This is a WAG aka Wild A __Guess)

EDIT: Dug a little more and found:

[https://github.com/search?q=org%3Ahuggingface+s3&type=Code](https://github.com/search?q=org%3Ahuggingface+s3&type=Code)

Unless I'm mistaken here, there's a crap ton of code that could be downloading
models at runtime ... Which seems significantly less than ideal ...

~~~
madisonmay
It's also possible it's part of a docker build step or similar. Even if they
aren't downloading models at run time, they may be hitting s3 if their
pytorch-transformers lib's docker cache gets invalidated frequently.

~~~
kortex
Docker builds are crazy wasteful in terms of bandwidth and compute. Right now
I'm struggling with a project that builds ITK on demand every heckin time.

I'm working through how to best integrate apt-cacher-ng, sccache, and a pip
cacher. It costs me nothing to hit apt or pypi, but like, somebody is paying
that bill. A little perspective goes a long way.

I wonder if I could do something to just proxy _all_ requests and cache those
on a whitelist and stick it on my CI network.
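The whitelist part of that idea is the easy bit; a rough sketch (the host list is just an example, adjust for your stack):

```python
from urllib.parse import urlparse

# Example whitelist of package mirrors/indexes worth caching.
CACHEABLE_HOSTS = {"pypi.org", "files.pythonhosted.org", "archive.ubuntu.com"}

def should_cache(url):
    """Decide whether a proxied request is eligible for the local cache."""
    return urlparse(url).hostname in CACHEABLE_HOSTS
```

The hard part is the proxy itself; apt-cacher-ng and devpi already do this per ecosystem, so a generic whitelist proxy may be overkill.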

------
btown
A brief reminder: Whenever you publish code or documentation that might be
used/scraped by the outside world, ALWAYS use a domain you own. If you're on
Cloudflare you can instantly (and for free) create Page Rules to use
Cloudflare as a CDN, redirect to another CDN, or black-hole or reroute traffic
anywhere you want.

~~~
nathantotten
Not to mention that if Cloudflare CDN was in front of it this traffic would be
free.

~~~
qes
Not with this amount of traffic it won't be. 45 TB per day? lol. Cloudflare
will be disabling your account and contacting you for payment in a hurry.

Go ahead and try, see how far their "free" tier really goes.

~~~
vxNsr
Just waiting for the Cloudflare CEO who lurks around to pop in here and offer
it for free.

------
lacker
Well, you could contact them and make a very-likely-to-succeed case that they
should pay you some money, or you could complain about it on Twitter.

~~~
slenk
Twitter will probably be a faster response than automated email inboxes at
Apple

~~~
tinus_hn
The executive team email addresses that can be easily found are monitored
quite well.

------
paxys
That's about $4000/month in bandwidth costs, assuming retail pricing.

FYI he is bragging, not complaining. There are a dozen ways to reduce or
eliminate this problem.

~~~
qes
> That's about $4000/month in bandwidth costs

You're an order of magnitude off.

45 TB per day is 1,350 TB in a month, or 1,350,000 GB.

Show me somewhere you can get a petabyte of egress inside a calendar month for
4 figures USD...

Let's suppose you even used the cheaper egress from CloudFront rather than
serving from S3 (lol @ your wallet if you serve 1 PB doing that).

[https://aws.amazon.com/blogs/aws/aws-data-transfer-prices-
re...](https://aws.amazon.com/blogs/aws/aws-data-transfer-prices-reduced/)

The first petabyte costs an average of $0.045 per GB - $45,000.

The remaining 350 TB costs another $10,500 for a total of $55,500.

Serving off S3 directly? Yeah, that'll be more like $150,000.
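That tiered math checks out (rates taken from the comment above, not from a current price sheet):

```python
def egress_cost_usd(total_gb):
    """Tiered CDN egress estimate: first PB averages $0.045/GB, the rest $0.03/GB."""
    first_pb_gb = 1_000_000
    tier1 = min(total_gb, first_pb_gb) * 0.045
    tier2 = max(total_gb - first_pb_gb, 0) * 0.030
    return tier1 + tier2

monthly_gb = 45_000 * 30            # 45 TB/day over a 30-day month
print(egress_cost_usd(monthly_gb))  # 55500.0
```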

~~~
pzmarzly
> Show me somewhere you can get a petabyte of egress inside a calendar month
> for 4 figures USD

Correct me if I'm wrong, but most colocation/dedicated server providers offer
such prices. E.g. hetzner.com @ €1/TB, sprintdatacenter.pl @ €0.91/TB,
dedicated.com @ $2/TB (or $600/month for an unmetered 1 Gbps connection).

But if you want S3/CDNs/<insert any cloud offering here>, then yeah, they're
expensive.

BTW per Cloudflare ToS[0]:

> Use of the Service for the storage or caching of video (unless purchased
> separately as a Paid Service) or a disproportionate percentage of pictures,
> audio files, or other non-HTML content, is prohibited.

[0] [https://www.cloudflare.com/terms/](https://www.cloudflare.com/terms/)

~~~
MrStonedOne
You confused CloudFront with Cloudflare.

~~~
pzmarzly
True, I'm not that familiar with Amazon offerings (only ever used it for
Windows GPU instances) and reading too fast got me. Too late for an edit.

------
ebg13
If you don't want someone else to do something that costs you money, you're
going to have a bad time if you don't prevent them from doing it.

~~~
Topgamer7
Amazon has the ability to charge the requester. Pass the buck on.

~~~
mtnGoat
i don't think this works on public files though, does it?

~~~
idunno246
This is the main purpose of the oft-maligned AllAuthenticatedUsers permission
in the s3 console - anyone can read it, they just need an identity to pay

------
alphagrep12345
What does Hugging Face do? Do they implement models from papers and make them
available for free?

~~~
physicsyogi
Yes. I don’t know if that’s all that they do. They often port new Tensorflow
models to PyTorch as well. They provide straightforward APIs, nice
documentation, and clear tutorials. I use their stuff pretty regularly.

~~~
codesternews
Do you know their business model? It looks like they're an open source company.
How do they earn money?

~~~
rasz
Their business model is someone like Apple buying them out.

------
rhacker
I'm guessing someone at apple internally distributed a dockerfile that pulls
that down.

------
fitzroy
In a few weeks he can just point Apple's IP range to a shared iCloud folder.

~~~
dewey
That'll get them Apple's attention pretty quickly, I'd assume.

~~~
elcritch
Now that’d be a bit ironic: getting a "your account is using excessive data
bandwidth" notice and being able to respond with "incorrect, _your_ IPs are
using too much bandwidth".

------
StreamBright
Paid by requester is the feature they are looking for.

[https://docs.aws.amazon.com/AmazonS3/latest/dev/configure-
re...](https://docs.aws.amazon.com/AmazonS3/latest/dev/configure-requester-
pays-console.html)

~~~
martin-adams
That's a very neat feature I never knew existed. I suspect, though, that it
would hurt their larger mission of helping many smaller teams and individuals
use the models.

~~~
StreamBright
Not really: if you are a small user, your cost is negligible. You can calculate
how much it would be for a small team, but my guess is a couple of dollars per
month. Apple's use case is still very reasonable, and the cost for them is also
not that bad. They could also split different customers out to different
buckets and have the big guys pay while smaller companies get it for free.
There are many options.

~~~
martin-adams
Yeah, that makes sense. I was thinking in terms of ease of access. If a large
organisation makes everyone pay, that means everyone has to arrange accounts
and payment methods.

And by splitting it out you have no guarantee that the big guys won't just use
the free one.

------
hi41
I read the Twitter post but did not understand what is happening. Can someone
please explain?

~~~
ProAm
Apple employees are using their product, downloading lots of data, not paying
for any of it, and the OP doesn't like it or can't afford it.

~~~
microtherion
I don't think it's employees as such — even Apple does not have THAT many
machine learning people, and they wouldn't download models daily.

Maybe a server farm, where each instance downloads a model when spinning up?

~~~
hinkley
In the last days of my time spent in the XML salt mines, I got in on a
conversation with the web masters at w3.org.

You would not believe how many people and how many libraries pull from primary
sources directly instead of using local copies of common resources. I found
this conversation because I'd just finished fixing that in our code and taking
about 5 minutes off the build process.

Let me restate that: We were spending 5 minutes just downloading schema files.
In an automated build. Every time, sometimes on several machines at once.

At one point we were trying to convince them that intentionally slowing all
requests down by, say, 500 ms would get the attention of people who were
misbehaving. Anyone who was downloading it once and caching would hardly
notice the 500 ms. Those running it once per task would be forced to figure
their shit out.
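The fix is usually tiny. A rough sketch of a download-once cache in Python (the cache directory name is arbitrary):

```python
import hashlib
import os
import urllib.request

def cache_path(url, cache_dir="schema-cache"):
    """Deterministic on-disk location for a cached copy of `url`."""
    return os.path.join(cache_dir, hashlib.sha256(url.encode()).hexdigest())

def fetch_cached(url, cache_dir="schema-cache"):
    """Hit the network at most once per URL; reuse the local copy afterwards."""
    path = cache_path(url, cache_dir)
    if not os.path.exists(path):
        os.makedirs(cache_dir, exist_ok=True)
        with urllib.request.urlopen(url) as resp, open(path, "wb") as out:
            out.write(resp.read())
    with open(path, "rb") as f:
        return f.read()
```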

~~~
angrygoat
I sometimes wonder about all the additional HTTP load to places like
debian.org that must come from Docker builds; I don't have any kind of caching
in mine, and every single commit causes CI to go and build images and run
tests on them.

It's not an issue I really see, but it seems to me that CI infrastructure
(Travis, ...) really should have caching proxies in place.

~~~
hinkley
Several places I've worked have pushed Artifactory on us for various reasons
from security to resilience to network outages to bandwidth saving.

It's not the worst fate.

------
cpach
Isn’t this a use case where BitTorrent would shine?

~~~
michaelt
The problem here is "Someone's CI pipelines redownload the same models on
every build"

I'd say there's only a 10% chance Apple's firewall would let BitTorrent
through, and only a 3% chance the CI servers would maintain a positive seed
ratio.

Possibly it might solve the problem because users would cache the resources
themselves to avoid the hassle of getting BitTorrent into their CI pipeline...

------
peterwwillis
If your CI/CD is re-downloading and re-building everything on every single
run, you are not only being wasteful, you're actually more likely to have an
outage due to not storing dependency artifacts needed for deploy. Use a local
artifact store to be more resilient to failures of servers you don't control
(and also save everyone money and time).

------
soared
Charge, them, money?

------
jijji
Hosting terabytes of data on an S3 bucket where people would download 45TB per
month ($0.023/GB == $1000+/month) sounds like a really expensive way to
distribute your data to people...

~~~
xfitm3
You could configure buyer pays, if you wanted to.

~~~
rtkwe
That makes it harder for everyone though where companies like Apple should
proxy and/or cache those requests to their own internal version rather than
hitting that S3 bucket every time. Requiring requester payment would mean it
would only really be used by corporations where the author clearly wants a
service open to anyone without having to open an AWS account to pay.

~~~
thenickdude
Make the bucket buyer-pays, but offer Torrent links as well. Businesses doing
CI will pay for S3 usage so they don't have to deal with torrents, end-users
will get free torrent access, everybody wins.

~~~
Hitton
I just skimmed AWS requester pays and it seems that you can't set the price;
the requester always pays only AWS's download cost. So why bother setting it as
requester pays at all if it doesn't bring you anything?

~~~
corint
It removes the GET request and bandwidth charges from your AWS bill. In this
case, 45TB a day is going to add a lot to an AWS bill; you can shift that cost
to the user that's downloading the files from you.

------
mrfusion
What’s the backstory on this? (Is it something I should already know)

------
yalogin
Isn't it likely that someone wrote a script for testing some regression and it
keeps running in a loop? I can almost bet that will be the case.

------
tnolet
Is this what they call product market fit?

------
codesternews
Looks like an open source company. What's their business model? Does anyone
know how they earn money?

------
idlewords
This is what success looks like if you charge money for a good or service.

------
ChuckMcM
And now the twitter post is gone? I'm guessing the west coast woke up and
someone at Apple said "Wait, you could infer some proprietary information with
that information ..."

~~~
Adirael
I can see the Twitter post without issues.

~~~
ChuckMcM
Yeah, to follow up: after clearing some caches it seems to work for me as well.
Interesting.

------
cuillevel3
Are those full downloads or just HEAD or range requests from some CI?

------
dlasek
They're the ones that made Amazon get those Data Trucks lol

------
z3t4
Apple are probably doing "continuous integration" where all assets are re-
downloaded from the Internet in each iteration. Tip: put your stuff on Github
:P

~~~
coleca
With models that large you would be paying for GitHub's LFS credits. Those
aren't cheap as I recall. Napkin math at $5/50GB of bandwidth per month for
45TB per day it would cost them $135k/mo to use Github. That's over 2x more
than the S3 egress charges would be.
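The napkin math, spelled out:

```python
GB_PER_TB = 1_000
monthly_gb = 45 * 30 * GB_PER_TB      # 45 TB/day for a 30-day month
lfs_cost = monthly_gb * (5 / 50)      # $5 per 50 GB of LFS bandwidth = $0.10/GB
s3_cloudfront_cost = 55_500           # the CloudFront estimate from upthread
print(lfs_cost)                       # 135000.0
print(lfs_cost / s3_cloudfront_cost > 2)  # True: "over 2x more"
```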

------
ajay-d
Aren’t all the authors of that paper from Apple?

------
ecnahc515
Why can't they use CloudFront?

~~~
mirceal
CloudFront costs money too.

~~~
ecnahc515
For some reason I thought the prices for transfer were much better than S3, but
that doesn't look like it's the case. It's slightly better, at large volumes.

------
half-kh-hacker
It's surprising that nobody here's mentioned Wasabi, since they have free
egress.

~~~
koolba
There's no such thing as a free lunch.

Wasabi's own docs[1] mention a ballpark definition of "reasonable" as monthly
transfer being less than total storage. Above that you'll get a call.

I'm sure they'd still be cheaper than AWS, but it's not going to be
$5/TB/month to service the entire world.

[1]: [https://wasabi.com/pricing/pricing-
faqs/](https://wasabi.com/pricing/pricing-faqs/)

------
master_yoda_1
So these jokers at apple publish a paper by using code from huggingface.

------
dymk
Need to distribute large static content? Looks like a good job for a torrent.

~~~
yellowsir
or ipfs using ipns ... so many good solutions, but people suggest a bucket
where the downloader is paying... or cloudfear (sic)

------
bryan_w
Have you considered Cloudflare?

~~~
toomuchtodo
Backblaze B2 + Cloudflare = free outbound transfer (bandwidth alliance).

Making objects in S3 publicly available and then complaining about their
extortionate bandwidth charges is...silly.

~~~
reitzensteinm
If you try to serve a petabyte of binary data a month through Cloudflare,
they're going to shut you down.

S3 is very expensive, but averaging 4 gigabits 24x7 is not going to be cheap
anywhere.

~~~
parliament32
>averaging 4 gigabits 24x7 is not going to be cheap anywhere

A normal upstream (if you were in a colo or some sort of DC) will charge you
about $0.50 per meg. So, 4gbps would cost you... $200/mo. [1]

I think people don't realize just how much more expensive IaaS providers are.
We're literally talking orders of magnitude.

[1] [https://blog.telegeography.com/outlook-for-ip-transit-
prices...](https://blog.telegeography.com/outlook-for-ip-transit-prices-
in-2018)

~~~
reitzensteinm
$2000, it seems like you dropped a zero. And you're presumably going to need
some burst capacity, so call it $4k for a 2:1 ratio if you care about not
throttling at peak times.

The big cloud providers are very expensive, but you're getting impeccable
bandwidth. Low packet loss, burst what you want even in peak times, fast
transit worldwide, minimal variability.

What most people actually don't realize is that they're making that tradeoff
implicitly. You're not being ripped off if you need all of these things, but
if it's overkill you can get much cheaper bandwidth elsewhere.

------
kelnos
If a company the size of Apple finds this that useful, perhaps you should
consider charging for your service, rather than just complaining on Twitter
about the free usage you appear to have willingly given away?

Or perhaps you have reached out to them, but are for some reason still
complaining on Twitter to drum up PR or something?

Regardless, this posting is ridiculously context-free to the point of being
click-baity. (But hey, good job, I clicked on it anyway.)

~~~
JustFinishedBSG
"It's your own fault if you didn't foresee a trillion dollar company
exploiting the product you made free and open source to help researchers, and
now you're incurring $15K bills per month"

And then people wonder why people don't want to make their stuff free/open
source. Even when it's free people still think you're somehow entitled and
overcharging.

~~~
kelnos
I mean... yes? It doesn't have to be a trillion dollar company. You put tons
of useful data on a public S3 bucket and publicize it, and people are going to
download it. S3 data transfer isn't free, so I think it's reasonable to expect
that, over time, the cost of serving the data is going to be prohibitive
without funding it in some way.

> Even when it's free people still think you're somehow entitled and
> overcharging.

I just explicitly advocated for the opposite of that, so I'm not sure where
you're getting that.

